
Transcript of MSc. Thesis - Universidade de Lisboa

Mind the Gap!

Bridging the Reality Gap in Visual Perception and Robotic Grasping with

Domain Randomisation

João Henrique Pechirra de Carvalho Borrego

Thesis to obtain the Master of Science Degree in

Electrical Engineering

Supervisors: Dr. Plinio Moreno Lopez
Prof. José Alberto Rosado dos Santos Vitor

Examination Committee

Chairperson: Prof. João Fernando Cardoso Silva Sequeira
Supervisor: Dr. Plinio Moreno Lopez

Members of the Committee: Prof. Luís Manuel Marques Custódio

November 2018

Declaration

I declare that this document is an original work of my own authorship and that it fulfills all the require-

ments of the Code of Conduct and Good Practices of the Universidade de Lisboa.


Acknowledgments

First and foremost, I would like to thank my mom, dad, stepmom and all three of my brothers, whom

I love dearly. Your continuous support and care over the years were invaluable to the materialisation of
this project. To my grandmothers, uncles, aunts, cousins, and the rest of my family, for the unwavering trust you placed in

me.

To my colleagues and friends Atabak and Rui, without whom this work could not have been carried

out. To my dissertation supervisor Prof. José Santos-Victor, and co-supervisor Plinio for their guidance,

encouragement and precious insight into which direction I should pursue with my research. To the whole

of Vislab, which has naturally become a second home for me over the last year. Never have I felt so

welcome and surrounded by bright, interesting individuals, yet also kind, unassuming and approachable.

To all my friends, from different stages of my life, who have led me to grow into the person I am

today. For all the compelling debates, nonsensical banter, entertaining distractions and camaraderie

that helped me through countless working late nights and tough moments.

If this achievement bears any meaning, it is undoubtedly because of each and every one of you.

Thank you.


Abstract

In this work, we focus on Domain Randomisation (DR), which has rapidly gained popularity among the

research community when the task at hand involves knowledge transfer from the simulated to the real-world
domain. This is common in the field of robotics, where the cost of performing experiments with real
robots is prohibitive for acquiring the massive amount of data required by deep learning methods. We

propose to study the impact of introducing random variations in both visual and physical properties in a

virtual robotics environment. This is done by tackling two different problems, namely 1) object detection

and 2) autonomous grasping. Concerning 1), to the best of our knowledge, our research was the first

to extend the application of DR to a multi-class shape detection scenario. We introduce a novel DR

framework for generating synthetic data in a widely popular open-source robotics simulator (Gazebo).

We then train a state–of–the–art object detector with several simulated-image datasets and conclude

that DR can lead to as much as 26% improvement in mAP (mean Average Precision) over a fine-tuning

baseline, which is pre-trained on a huge, domain-generic dataset. Regarding 2), we created a pipeline for

simulating grasp trials, with support for both simple grippers and complex multi-fingered robotic hands.

We extend the aforementioned framework to perform randomisation of physical properties and ultimately

analyse the effect of applying DR to a virtual grasping scenario.

Keywords

Domain Randomisation; Reality Gap; Simulation; Object Detection; Grasping; Robotics; Synthetic Data


Resumo

In this work we focus on Domain Randomisation (DR), which has been gaining popularity within the
research community when the task at hand involves transferring knowledge from a simulated domain to
the real world. This is common in the field of robotics, where the cost of carrying out experiments with a
real robot is too high to acquire the amount of data required by deep learning methods. We propose to
study the impact of introducing random variations in both visual and physical properties of a virtual
robotic environment. This is done by addressing two distinct problems, namely 1) object detection and
2) autonomous grasping. Regarding 1), to the best of our knowledge, our work was the first to apply DR
to multi-class object detection. We introduce a new DR tool for the generation of synthetic data in an
established open-source robotics simulator. We then trained a state–of–the–art object detector with
multiple datasets of simulated images and concluded that DR can lead to an increase of up to 26% in
mAP (mean Average Precision) relative to the fine-tuning technique, which involves pre-training on a
huge, generic dataset. Regarding 2), we created a procedure for the simulation of grasp trials, with
support both for grippers and for complex multi-fingered robotic hands. We extended the aforementioned
tool to enable the variation of physical properties and finally analysed the impact of applying DR to a
virtual grasping scenario.

Palavras Chave

Domain Randomisation; Reality Gap; Simulation; Object Detection; Grasping; Robotics; Synthetic Data


Contents

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.1 Reality gap and domain randomisation . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.1.2 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.2.1 Object detection and Convolutional Neural Networks . . . . . . . . . . . . . . . . . 6

1.2.2 Grasping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.2.2.A Coulomb friction model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.2.2.B Force-closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2 Related Work 13

2.1 Convolutional neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.1.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.1.2 Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.1.3 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.2 Robotic grasping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.2.1 Analytic and data-driven grasp synthesis . . . . . . . . . . . . . . . . . . . . . . . 18

2.2.2 Vision-based grasping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.2.2.A Grasp detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.2.2.B Grasp quality estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.2.2.C End-to-end reinforcement learning . . . . . . . . . . . . . . . . . . . . . . 23

2.2.2.D Dexterous grasping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.3 Domain Randomisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.3.1 Visual domain randomisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.3.2 Physics domain randomisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28


3 Domain Randomisation for Detection 33

3.1 Parametric object detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.1.1 GAP: A domain randomisation framework for Gazebo . . . . . . . . . . . . . . . . 35

3.1.1.A Gazebo simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.1.1.B Desired domain randomisation features . . . . . . . . . . . . . . . . . . . 37

3.1.1.C Offline pattern generation library . . . . . . . . . . . . . . . . . . . . . . . 38

3.1.1.D GAP: A collection of Gazebo plugins for domain randomisation . . . . . . 39

3.2 Proof–of–concept trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.2.1 Scene generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.2.2 Preliminary results for Faster R-CNN and SSD . . . . . . . . . . . . . . . . . . . . 42

3.3 Domain randomisation vs fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.3.1 Single Shot Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.3.2 Improved scene generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.3.3 Real images dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.3.4 Experiment: domain randomisation vs fine-tuning . . . . . . . . . . . . . . . . . . . 47

3.3.5 Ablation study: individual contribution of texture types . . . . . . . . . . . . . . . . 50

4 Domain Randomisation for Grasping 55

4.1 Grasping in simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.1.1 Overview and assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.1.2 Dex-Net pipeline for grasp quality estimation . . . . . . . . . . . . . . . . . . . . . 58

4.1.3 Offline grasp sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.1.3.A Antipodal grasp sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.1.3.B Reference Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.2 GRASP: Dynamic grasping simulation within Gazebo . . . . . . . . . . . . . . . . . . . . . 64

4.2.1 DART physics engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.2.2 Disembodied robotic manipulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.2.3 Hand plugin and interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.3 Extending GAP with physics-based domain randomisation . . . . . . . . . . . . . . . . . . 67

4.3.1 Physics randomiser class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.4 DexNet: Domain randomisation vs geometry-based metrics . . . . . . . . . . . . . . . . . 68

4.4.1 Novel grasping dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4.4.2 Experiment: grasp quality estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4.4.2.A Trials without domain randomisation . . . . . . . . . . . . . . . . . . . . . 71

4.4.2.B Trials with domain randomisation . . . . . . . . . . . . . . . . . . . . . . . 72


5 Discussion and Conclusions 75

5.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

A Appendix 89

A.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

A.1.1 Precision and Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

A.1.2 Intersection over Union (IoU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

A.1.3 Average Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

A.1.3.A PASCAL VOC Challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

A.1.4 F1 score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

A.1.5 Loss functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

A.1.5.A L1-norm and L2-norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

A.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

A.2.1 GAP Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

A.2.1.A Simultaneous multi-object texture update . . . . . . . . . . . . . . . . . . 93

A.2.1.B Creating a custom synthetic dataset . . . . . . . . . . . . . . . . . . . . . 94

A.2.1.C Randomising robot visual appearance . . . . . . . . . . . . . . . . . . . . 94

A.2.1.D Known issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

A.2.2 GRASP Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

A.2.2.A Simulation of underactuated manipulators . . . . . . . . . . . . . . . . . . 95

A.2.2.B Efficient contact retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

A.2.2.C Adding a custom manipulator . . . . . . . . . . . . . . . . . . . . . . . . . 96

A.2.2.D Adjusting randomiser configuration . . . . . . . . . . . . . . . . . . . . . 97

A.2.2.E Reference frame: Pose and Homogenous Transform matrix . . . . . . . . 97

A.2.2.F Known issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

A.3 Additional Material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98


List of Figures

1.1 Real and simulated robotic grasp trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2 Domain Randomisation of visual properties . . . . . . . . . . . . . . . . . . . . . . . . 5

1.3 Computer Vision problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.4 Convolutional neural net topology comparison . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.5 Friction cone and pyramid approximation representations . . . . . . . . . . . . . . . . . . 10

2.1 Cornell grasping dataset examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.2 Synthetic scenes employing domain randomisation . . . . . . . . . . . . . . . . . . . . . . 27

2.3 Examples from car detection dataset employing domain randomisation . . . . . . . . . . . 28

2.4 Randomly generated object meshes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.5 OpenAI synthetic scenes with Shadow Dexterous Hand . . . . . . . . . . . . . . . . . . . 31

3.1 Gazebo architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.2 Pattern generation library example output . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.3 Example scene created with GAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.4 Synthetic scenes generated using GAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.5 Preliminary results on real image test set . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.6 Real image training set examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.7 Viewpoint candidates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.8 Per class AP and mAP for different training sets . . . . . . . . . . . . . . . . . . . . . . . . 49

3.9 SSD output on test set for FV + Real . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.10 PR curves for SSD trained with sets of 30k images . . . . . . . . . . . . . . . . . . . . . . 51

3.11 SSD performance on validation set, trained with sets of 6k images . . . . . . . . . . . . . 52

3.12 SSD performance on the test set, trained with sets of 6k images . . . . . . . . . . . . . . 52

3.13 PR curves for SSD trained with sets of 6k images and fine-tuned on real dataset . . . . . 53

4.1 Antipodal grasp candidates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.2 Reference frames for Baxter gripper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63


4.3 Time-lapse of successful grasp trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.4 Proposed metric comparison with baseline . . . . . . . . . . . . . . . . . . . . . . . . . . 73

A.1 Shadow Dexterous Hand with randomised individual link textures . . . . . . . . . . . . . . 94


List of Tables

2.1 Randomisation of dynamic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.1 Availability of robotic platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.2 SSD and Faster R-CNN preliminary results . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.3 Number of real images in train, validation and test partitions. . . . . . . . . . . . . . . . . 46

3.4 Number of instances and percentage of different classes in the real dataset. . . . . . . . . 46

3.5 SSD performance on the test set, trained with sets of 30k images . . . . . . . . . . . . . . 49

3.6 SSD performance on the test set (6k training sets) . . . . . . . . . . . . . . . . . . . . . . 51

4.1 Grasp trials preliminary results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.2 Randomised physical properties for grasp trials . . . . . . . . . . . . . . . . . . . . . . . . 72

A.1 Grasp trials preliminary per-object results . . . . . . . . . . . . . . . . . . . . . . . . . . . 100


List of Algorithms

4.1 Grasp generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.2 Antipodal grasp sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.3 Dynamic trial run . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

A.1 Perlin noise generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99


Listings

A.1 Randomiser configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97


Acronyms

AP Average Precision

API Application Programming Interface

COCO Common Objects In Context

CNN Convolutional Neural Network

CVAE Conditional Variational Auto-Encoder

DART Dynamic Animation and Robotics Toolkit

DOF degrees of freedom

DR Domain Randomisation

fps frames per second

FoV Field of View

GAN Generative Adversarial Network

GPU Graphical Processing Unit

GWS Grasp Wrench Space

IK Inverse Kinematics

ILSVRC Imagenet’s Large Scale Visual

Recognition Challenge

IoU Intersection over Union

mAP mean Average Precision

MDP Markov decision process

ODE Open Dynamic Engine

RL Reinforcement Learning

RoI Region of Interest

RPN Region Proposal Network

sdf signed distance function

SDF Simulation Description Format

SSD Single Shot Detector

SVM Support Vector Machine

VAE Variational Auto-Encoder

URDF Unified Robot Description Format

YOLO You Only Look Once


1 Introduction

Contents

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.1 Reality gap and domain randomisation . . . . . . . . . . . . . . . . . . . . . . . . 4

1.1.2 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.2.1 Object detection and Convolutional Neural Networks . . . . . . . . . . . . . . . . 6

1.2.2 Grasping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12


Almost in the beginning was curiosity.

Isaac Asimov, 1988


1.1 Motivation

Currently, robots thrive in rigidly constrained environments, such as factory plants, where human-

robot interaction is kept to a bare minimum. Our long-term goal as researchers on robotic systems is

to aid in the transition of this field into our daily lives, namely to our households. Ultimately, we strive to

achieve cooperative, effective and meaningful interactions. For this to occur we are met with numerous

challenges and unsolved research problems. In this work, we focus on tackling robotic visual percep-

tion and autonomous grasping, which are crucial skills for achieving any kind of significant synergistic

relationship with humans.

Regarding computer vision, the emergence of deep neural networks, and specifically Convolutional

Neural Networks has led to outstanding results in many challenging tasks. These frameworks are par-

ticularly appealing in that they automate feature extraction from raw image data, thus allowing one to

forego feature engineering, a popular approach that falls into the category of what we now designate as

classical machine learning.

An important drawback of employing deep learning is its requirement of substantial amounts of data.

This is, in turn, due to the typically enormous number of parameters that have to be estimated. In

computer vision, manual data annotation is a slow and tedious process, particularly for large image

datasets such as ImageNet [12]. In robotics, acquiring massive datasets of physical experiments is

generally impracticable, since physical trials are both resource- and time-consuming, as well as
cumbersome to carry out. The work of Levine et al. [39] from Google is an excellent example

of this, as the authors acquired a massive autonomous grasping dataset, employing between 6 and 14
robotic arms over the course of several months (see Figure 1.1(a)).

A common alternative to real-robot experiments is to resort to computer simulation of robotic environ-

ments. This substantially reduces costs and the risk of damaging a real robot or its environment,
and allows experiments to be replicated effortlessly. Furthermore, simulated robotic experiments can generally be
accelerated far past real-world execution times, thus alleviating the problem of limited data availability. Figure 1.1 depicts

both real and simulated robotic grasping dataset acquisition setups.

However, even the most sophisticated simulators fail to flawlessly capture reality. Thus, machine

learning algorithms trained solely on synthetic data tend to overfit the virtual domain and therefore gener-

alise poorly when applied to a real-world scenario. Our work studies Domain Randomisation [70], which

attempts to bridge this gap between simulation and reality and ultimately aims to produce synthetic data

which can be directly used for training machine learning algorithms for real-robot task execution.


(a) Google arm farm [39] (b) Simulated KUKA arm [56]

Figure 1.1: Real and simulated robotic grasp trials, left and right respectively.

1.1.1 Reality gap and domain randomisation

The discrepancy between the real world and the simulated environment is often referred to as the

reality gap. There are two common approaches to bridging this disparity, either (i) reducing the gap

by attempting to increase the resemblance between these two domains or (ii) exploring methods that

are trained in a more generic domain, representative of simulation and reality domains simultaneously.

To achieve the former, we can increase the accuracy of the simulators, in an attempt to obtain high-

fidelity results. This method has a major limitation: it requires great effort in the creation of systems

which model complex physical phenomena to attain realistic simulation. Our work focuses mainly on

the second approach. Instead of diminishing the reality gap in order to use traditional machine-learning

procedures, we analyse methods that are aware of this disparity.

Rather than attempting to perfectly emulate reality, we can create models that strive to achieve ro-

bustness to high variability in the environment. Domain Randomisation (DR) is a simple yet powerful

technique for generating training data for machine-learning algorithms. At its core, it consists in synthet-

ically generating or enhancing data by introducing random variance in the environment properties that

are not essential to the learning task. This idea dates back to at least 1997 [28], with Jakobi’s observa-

tion that evolved controllers exploit the unrealistic details of flawed simulators. His work on evolutionary

robotics studies the hypothesis that controllers can evolve to become more robust by introducing random

noise in all the aspects of simulation which do not have a basis in reality, and only slightly randomising the
remaining aspects which do. It is expected that, given enough variability in the simulation, the transition between

simulated and real domains is perceived by the model as a mere disturbance, to which it has become

robust. This is the main argument for this method. Recent literature has established that DR enables

achieving valuable generalisation capabilities [29, 70, 71, 78, 51, 73]. The term came into prominence

with the work of Tobin et al. [70] (Figure 1.2), and has since been applied in several virtual environments

to a range of visual and physical properties.


Figure 1.2: Example of Domain Randomisation (DR) applied to the visuals of a synthetic tabletop grasping scenario.
Three synthetic images are depicted on the left; the real robot setup is shown on the right. (Source: adapted from Tobin et al. [70])

1.1.2 Problem statement

This work focuses on exploring several DR methods employed in recent research which led to promis-

ing results with regard to overcoming the reality gap. Concretely, we study two different scenarios.

We start by addressing the task of object detection using deep neural networks trained on synthetic

data. We create tools for generating datasets employing DR to visual properties of a tabletop scene,

integrated into a well-known open-source robotics simulator. We claim that we can use such datasets to

train detectors that can overcome the reality gap by fine-tuning on a limited amount of real-world domain-

specific images. What is more, we find that this approach can outperform a simple, yet widespread, domain
adaptation method involving pre-training detectors on huge domain-generic image datasets.

Subsequently, we move to a robotic grasp setting and aim to understand the impact of applying the

principles of DR to the physical properties of the simulated environment. Specifically, we wish to modify

an existing state–of–the–art grasp quality estimation pipeline to incorporate full dynamic simulation of

grasp trials in a widespread open-source robotics simulator. We conjecture that applying randomness
to physical properties, such as friction coefficients and the grasp target's mass, will ultimately lead to more

robust systems.

The research questions we propose to examine are split into two subjects, namely (i) object detection,

and (ii) simulated grasping, and can be summarised as follows:

(i) RQ1: Can we successfully train state–of–the–art object detectors with synthetic datasets created

exploiting Domain Randomisation in a multi-class scenario?

(i) RQ2: Can small domain-specific synthetic datasets generated employing DR be preferable to

pre-training object detectors on huge generic datasets, such as COCO [40]?

(i) RQ3: What is the impact of each of the texture patterns most commonly employed for visual

Domain Randomisation, in an object detection task?


(ii) RQ4: How can we integrate dynamic simulation of robotic grasping trials into a state–of–the–art

framework for grasp quality estimation?

(ii) RQ5: Can DR improve synthetic grasping dataset generation, employing a physics-based grasp

metric in simulation?

1.2 Background

Now that we have presented the challenge at hand and its relevance, we establish the concepts required for fully understanding the context in which our work is inserted.

1.2.1 Object detection and Convolutional Neural Networks

Being capable of performing object classification, localisation, detection and segmentation is closely

related to the ability to reason about and interact with dynamic environments.

Object classification consists of assigning a single label to a given image. Localisation includes not

only classifying the subject of an image but also identifying its position, usually by means of a rectangular

bounding box. Object detection assumes the possibility that more than a single instance can exist in a

single image, namely of different classes. Thus the desired output consists of every instance’s class

label and respective bounding box. Finally, segmentation includes all of the aforementioned problems,

but each instance is defined by its contour. Its objective is to obtain the label and outline of each instance

on the input image. Figure 1.3 provides an example of each of these tasks.

Figure 1.3: Computer Vision problems. Classification, localisation, detection and segmentation, respectively from
left to right. (Source: original images from the ImageNet dataset [12])

Neural networks can be thought of as generic function approximators, in the sense that their task

consists of learning a mapping between input and output variables. A traditional neural network receives

an input feature vector and processes it along its hidden layers in order to produce the desired output.

The process of training the network is essentially learning the parameters of these hidden layers, which

can be done in a supervised manner, by optimising a carefully designed loss function.


Each layer is composed of neurons, which are the primitive building blocks of neural networks. Neu-

rons perform a dot product between their input and internal weight parameters, add a bias term, and apply
some non-linear activation function, e.g. a threshold operation. The output of a neuron is in turn connected
to the input of another neuron of the following layer. For this reason, these architectures can effectively
learn highly nonlinear functions.
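To make the neuron model above concrete, the following minimal NumPy sketch (our own illustration; the layer sizes and values are arbitrary) computes the output of one fully-connected layer as a dot product between the input and the weights, plus a bias, followed by a simple threshold activation.

import numpy as np

def dense_layer(x, W, b, activation):
    # Forward pass of one fully-connected layer: activation(W x + b).
    return activation(W @ x + b)

def threshold(z):
    # Simple non-linear activation: output 1 where the pre-activation is positive.
    return (z > 0).astype(float)

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # input feature vector (4 features)
W = rng.normal(size=(3, 4))   # one weight row per neuron (3 neurons in the layer)
b = rng.normal(size=3)        # one bias term per neuron
y = dense_layer(x, W, b, threshold)
print(y)                      # layer output, fed to the neurons of the next layer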

In a traditional neural network all layers are fully connected, i.e. each neuron is connected to every

neuron of the previous layer. Therefore, these architectures display poor scalability in visual processing

tasks where the network input is a raw image, as far as computational resources are concerned. This is

due to the huge amount of parameters that need to be estimated, as each connection in the network is

associated with a given weight.

Convolutional Neural Networks (CNNs) are similar to regular neural networks in the sense that they

consist of neurons for which we wish to estimate weight and bias parameters. However, they assume

the input is an image with a given width and height and some number of channels (depth). CNNs have

a much more efficient structure as each neuron is only connected to a specific region of the previous

layer. Each layer of a CNN can be seen as a 3D volume as represented in Figure 1.4.

Figure 1.4: Convolutional neural net topology comparison. A standard neural network is represented on the left, and a convolutional neural network on the right. Each circle represents a neuron.

In a CNN each layer transforms a 3D input into some 3D output by means of a differentiable function.

There are essentially four main types of such layers.

1. Convolutional (CONV) layer. This is the core building block of CNNs. Its parameters consist of

filters that only apply to small areas (width×height) but to the whole depth.

2. Rectified linear unit (ReLU) layer. It combines non-linearity and rectification layers, by performing

a simple max(0, x) operation.

3. Pooling (POOL) layer. Its main purpose is to compress the size of the representation, thus reducing

the number of parameters and risk of overfitting. Typically consists of applying a simple max

operation.


4. Fully-connected (FC) layer.
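As an illustration of how these four layer types are typically composed, the snippet below builds a tiny image classifier with TensorFlow's Keras API; the input resolution, filter counts and number of classes are placeholders chosen for this example and do not correspond to any network used later in this thesis.

import tensorflow as tf

# Tiny classifier stacking the four layer types: CONV -> ReLU -> POOL, twice,
# followed by a fully-connected (FC) layer producing class scores.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding="same", input_shape=(64, 64, 3)),  # CONV: 16 filters of 3x3, spanning the full depth
    tf.keras.layers.ReLU(),                                                   # ReLU: element-wise max(0, x)
    tf.keras.layers.MaxPooling2D(2),                                          # POOL: compress the spatial resolution
    tf.keras.layers.Conv2D(32, 3, padding="same"),
    tf.keras.layers.ReLU(),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(5, activation="softmax"),                           # FC: scores for 5 object classes
])
model.summary()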

When training a neural network each iteration consists of two steps. The first, known as forward

pass consists of computing the output of the network for a given input, with fixed weight parameters. The

second, named backward pass involves recursively computing the error between the network output and

the desired output value (for instance class label), and propagating the results all the way back to the

inputs of the network, adjusting its weights in the process. The latter step is much more computationally

expensive than the former.

In order to streamline CNN training, we often resort to using a batch of training examples per training

iteration, rather than providing a single image. However, this comes at the expense of memory require-

ments. Performing a full pass (both forward and backward passes) with each training example in a

dataset corresponds to an epoch.
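The training terminology just introduced can be illustrated with a small NumPy sketch that fits a single linear layer by mini-batch gradient descent; the toy data and learning rate are arbitrary, but each inner iteration performs one forward pass and one backward (gradient) pass over a batch, and each outer iteration is one epoch.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))                 # toy dataset: 1000 examples, 8 features
true_w = rng.normal(size=8)
y = X @ true_w + 0.1 * rng.normal(size=1000)   # toy regression targets

w = np.zeros(8)                                # parameters to be learned
lr, batch_size, epochs = 0.1, 64, 5

for epoch in range(epochs):                    # one epoch = a full pass over the dataset
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size): # iterate over mini-batches
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        pred = xb @ w                          # forward pass: compute the output with fixed weights
        err = pred - yb                        # error between the output and the desired value
        grad = xb.T @ err / len(idx)           # backward pass: gradient of the mean squared loss
        w -= lr * grad                         # adjust the weights
    print(f"epoch {epoch}: loss {np.mean((X @ w - y) ** 2):.4f}")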

The computer vision problems highlighted earlier involve estimating labels Y , given input image X,

and thus require discriminative models. Whereas typical discriminative models, also known as con-

ditional models, aim to estimate the conditional distribution of unobserved variables Y depending on

observed X, denoted by P (Y |X), generative models aim to estimate the joint probability distribution on

X × Y , P (X,Y ). Once trained, the latter models can be used for generating data that is congruent with

the learned joint distribution. In this fashion, we can produce new synthetic data for a target domain

when large amounts of data are available in a similar, yet distinct source domain. This is a promising

solution for the scarcity of labelled training data.

We provide several concrete examples of discriminative models in Section 2.1. Regarding generative

models, we briefly highlight two types of architectures: Generative Adversarial Networks (GANs) and

Variational Auto-Encoders (VAEs).

Generative Adversarial Networks [15] split the training process as a competition between two net-

works. The first is known as the generator, whereas the second is named the discriminator. Their

objective functions are opposite: the generator is optimised to produce synthetic images from real data

in such a way that the discriminator cannot reliably classify whether these images are real or synthetic.
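For reference, this adversarial competition is commonly written as the minimax objective of the original GAN paper [15], where G denotes the generator, D the discriminator, x a real sample and z a noise vector (this formulation is quoted from [15] rather than from the text above):

min_G max_D  E_{x ~ p_data}[ log D(x) ] + E_{z ~ p_z}[ log( 1 − D(G(z)) ) ]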

Similarly, Variational Auto-Encoders [33] also employ two separate networks. First, the input is com-

pressed into a compact and efficient representation which exists in a lower-dimensional latent-space by

an encoder network. A second network known as decoder is trained to reconstruct the original input

when given the corresponding compressed representation in the latent space.


1.2.2 Grasping

Even though grasping and in–hand manipulation are essential skills for humans to interact with their

surroundings, they remain largely unsolved problems in robotics, specifically when the target object’s

geometry is unknown, or accurate dynamic models of the robot are unavailable. In this section, we

provide a brief overview of some key theoretical concepts for understanding grasping. Friction plays a

crucial role in grasping, as it allows objects to be safely held without slipping.

1.2.2.A Coulomb friction model

Friction is defined as the force resisting motion between two contacting surfaces. Specifically, we

focus on dry friction, which occurs when two solid surfaces are in direct contact with one another.

Experiments carried out from the 15th to 18th centuries have led to establishment of three empirical

laws [54]:

1. Amontons’ 1st Law – The force of friction is independent of the contact surface area;

2. Amontons’ 2nd Law – The frictional force is proportional to the normal force, or load;

3. Coulomb’s Law of Friction – The magnitude of the kinetic friction force is independent of the mag-

nitude of the relative velocity of the sliding surfaces.

Dry friction can be fairly accurately modelled by the Coulomb model, otherwise known as Amonton-

Coulomb friction model. It roughly outlines the properties of dry friction, by considering two separate

interactions: (i) static friction, between non-moving surfaces, and (ii) kinetic friction, between sliding surfaces.

The law establishes that the frictional force FT for two contacting surfaces which are pressed together

with a normal force FN is sufficient to prevent motion between the bodies as long as

|FT| ≤ µ|FN|, (1.1)

where µ is the coefficient of static friction, an empirical property of the pair of contacting materials.

Equation (1.1) defines a friction cone, where the contact point is the vertex, and the normal force its

axis. If the resulting friction vector is within this cone, then the force should be enough to prevent the

contacting surfaces from sliding. If it is not contained in the friction cone, then it generally cannot prevent

sliding.

For reasons of computational efficiency, the continuous surface of the friction cone tends to be ap-

proximated by an m-sided pyramid.
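As a concrete illustration of Equation (1.1), the short NumPy sketch below (our own example; the normal direction, forces and friction coefficient are arbitrary) tests whether a contact force lies inside the friction cone by splitting it into its normal and tangential components.

import numpy as np

def inside_friction_cone(force, normal, mu):
    # True if the contact force lies inside the Coulomb friction cone:
    # the normal component F_N must press into the surface and the
    # tangential component must satisfy |F_T| <= mu * |F_N| (Equation (1.1)).
    normal = normal / np.linalg.norm(normal)
    f_n = np.dot(force, normal)        # signed normal component
    f_t = force - f_n * normal         # tangential component
    return f_n > 0 and np.linalg.norm(f_t) <= mu * f_n

# Illustrative contact with normal along z and friction coefficient 0.5.
n = np.array([0.0, 0.0, 1.0])
print(inside_friction_cone(np.array([0.2, 0.0, 1.0]), n, mu=0.5))   # True: inside the cone
print(inside_friction_cone(np.array([0.8, 0.0, 1.0]), n, mu=0.5))   # False: the contact would slip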

Grasp stability implies that the object is secured in the manipulator in such a way that contacts do not

slip. A simple evaluation metric involves verifying whether or not the force-closure condition is achieved.


Figure 1.5: Friction cone (a) and friction pyramid approximation with m = 8 (b). Given that the resulting contact force FR is within the friction cone, the contact does not slip.

1.2.2.B Force-closure

Force-closure is defined in Nguyen [49] as the most basic constraint in grasping. A grasp on a target

object O is force-closure if and only if we can exert arbitrary force and moment on object O by pressing

the fingertips against the object. Equivalently, this grasp can resist any arbitrary force applied on object

O by a contact force from the fingers. This effectively means that the object cannot break contact with

the fingertips without external work (W > 0), i.e. the grasp is robust to external disturbances.

Nguyen [50] states that the force-closure problem can be formulated as the solution of a system of n
linear inequalities W^T t_S > 0, where each row of the matrix W corresponds to the wrench of a single
grasp contact, t denotes a twist, and t_S its spatial transpose. The twist corresponds to an external dis-

turbance, whereas contact wrenches correspond to the wrench convex generated by the set of contact

points. A force-closure grasp is obtained if and only if there is no solution (neither homogeneous nor

particular) to the aforementioned system of inequalities, which means that there is no additional work to

be done to compensate the disturbances. For an in-depth description of the mathematical formulation,

we refer the interested reader to the original publication [50].

Intuitively, we can use force-closure as a metric for grasp stability, by maximising the magnitude of

the disturbance a given grasp candidate is able to withstand. This concept can be employed as a binary

metric that evaluates whether or not a grasp can resist any applied wrench.
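To illustrate force-closure used as a binary metric, the sketch below implements the simple antipodal condition commonly used for two-contact, parallel-jaw grasps (for instance in the antipodal grasp sampling of Chapter 4): the grasp is accepted when the line connecting the two contact points lies inside both friction cones. This is a simplification of the general wrench-space formulation above, and the contact positions, normals and friction coefficient are illustrative only.

import numpy as np

def antipodal_force_closure(p1, n1, p2, n2, mu):
    # p1, p2 are contact points; n1, n2 are inward-pointing unit contact normals.
    # The grasp is accepted if the line connecting the contacts lies inside both
    # friction cones, i.e. the angle between that line and each normal is at most arctan(mu).
    axis = p2 - p1
    axis = axis / np.linalg.norm(axis)
    half_angle = np.arctan(mu)
    angle1 = np.arccos(np.clip(np.dot(n1, axis), -1.0, 1.0))
    angle2 = np.arccos(np.clip(np.dot(n2, -axis), -1.0, 1.0))
    return angle1 <= half_angle and angle2 <= half_angle

# Illustrative pair of nearly antipodal contacts on opposite sides of an object.
p1, n1 = np.array([0.0, -0.03, 0.0]), np.array([0.0, 1.0, 0.0])
p2, n2 = np.array([0.005, 0.03, 0.0]), np.array([0.0, -1.0, 0.0])
print(antipodal_force_closure(p1, n1, p2, n2, mu=0.5))   # True for this configuration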


1.3 Contributions

In the previous sections, we motivated our work and introduced relevant concepts, which will be used throughout the thesis. In this section, we summarise our work's major contributions.

Our main goal is to show the application of the concept of Domain Randomisation (DR). This ob-

jective is achieved by (i) validating its use in an object recognition task, by changing objects’ visual

properties, and (ii) by employing it in a grasping scenario, where it is directly applied to the dynamic

properties of a physics simulation.

Our contributions regarding (i) can be summarised as follows.

1. We developed GAP1, an open-source tool for employing DR integrated with a prevalent open

robotics simulation environment (Gazebo2) for generating synthetic tabletop scenarios populated

by parametric objects with random visual properties;

2. Using GAP as the main tool for data generation, we demonstrate the efficiency of DR in a shape

detection task from visual perception. This is achieved by training a state–of–the–art deep neural

network with a fully synthetic dataset and a real one and comparing the performance on a real-

world image test set, with previously unseen objects;

3. We performed a novel ablation study regarding the impact of the degree of randomisation of ob-

ject visual properties, specifically by varying the complexity of the object textures. The code for

replicating our experiments is also available as tf-shape-detection3.

Our contributions regarding (ii) are outlined thusly.

1. We develop a framework for full dynamic simulation of grasp trials within Gazebo, which can be

used to autonomously evaluate grasp candidates for any robotic manipulator or target object. This

tool is open-source and available as grasp4.

2. We extend our proposed DR tool to include randomisation of physical properties of entities in a

grasping scenario. We provide an easy-to-use interface for automating physics DR, by sampling

desired random variables from customisable probability distributions.

3. We create a pipeline for evaluating externally provided grasp candidates [43] with a parallel-plate

gripper in a simulated environment, integrated with our physics DR tool, for generating synthetic

grasping datasets, and perform trials with and without employing DR.

1 jsbruglie/gap: Gazebo plugins for applying domain randomisation, on GitHub as of December 17, 2018
2 Gazebo simulator, http://gazebosim.org/, as of December 17, 2018
3 jsbruglie/tf-shape-detection: code for replicating our shape detection experiments, on GitHub as of December 17, 2018
4 jsbruglie/grasp: Gazebo plugins for running grasp trials, on GitHub as of December 17, 2018


1.4 Outline

The remainder of the document is structured as follows.

Chapter 2 presents an overview of relevant literature. Firstly we provide a brief introduction to Con-

volutional Neural Network in Section 2.1, focusing on increasingly difficult visual tasks, from object clas-

sification to detection and segmentation. Then, we present an analysis of methods for robotic grasping

in Section 2.2. Finally, in Section 2.3 we analyse work that specifically focuses on employing Domain

Randomisation predominantly in grasping-related scenarios, usually resorting to physics simulation en-

vironments.

In Chapter 3 we investigate how we can apply Domain Randomisation to visual properties of a scene.

In order to achieve this, we perform multi-class parametric object detection with state–of–the–art neural

networks, while resorting to synthetic images as training data. We introduce a novel framework for

generating random virtual scenarios integrated with an established robotics simulation environment in

Section 3.1. We perform preliminary trials in Section 3.2 to validate our approach and data generation

pipeline. In Section 3.3 we pursue a much more thorough performance analysis, employing different

schemes for training a state–of–the–art Convolutional Neural Network, and conclude with an ablation

study that focuses on the impact of applying object textures with varying degrees of complexity.

Chapter 4 transitions into a grasping scenario, and instead aims to determine the benefits of applying

DR to the physical properties in a robotic manipulation scenario. In Section 4.1 we address the main

assumptions in our research, with respective motivation. Additionally, we describe the work of Mahler

et al. [43], which we adapted to integrate our data generation pipeline. This research also constitutes

a baseline for offline grasp proposal sampling and grasp quality evaluation via geometric metrics. In

Section 4.2 we propose a novel open-source tool for grasping trials in simulation, with support for simple

manipulators such as parallel-plate grippers but also multi-fingered robotic hands. We describe how

we extended our previous set of DR tools to incorporate the randomisation of physical properties of a

grasping scene in Section 4.3. Finally, in Section 4.4 we acquire a grasping dataset in a simulated envi-

ronment, using Dex-Net [43] for offline grasp candidate generation, and baseline grasp quality estimation

via robust geometric metrics. We compare the latter with the outcome of our full dynamic simulation of

grasp trials within Gazebo, both with and without physics-based DR.

The main body of the thesis is closed in Chapter 5, where we analyse our contributions and conclude

the discussion of our findings. Finally, we present the outline for possible future work.


2 Related Work

Contents

2.1 Convolutional neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.1.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.1.2 Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.1.3 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.2 Robotic grasping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.2.1 Analytic and data-driven grasp synthesis . . . . . . . . . . . . . . . . . . . . . . . 18

2.2.2 Vision-based grasping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.3 Domain Randomisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.3.1 Visual domain randomisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.3.2 Physics domain randomisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28


If I have seen further it is by standing on the shoulders of Giants.

Sir Isaac Newton, 1675


This chapter provides the literature review, partitioned into three sections. First, we provide a

chronologically-driven overview of CNNs, opening with the relatively simple problem of classification,

and gradually working our way into complex challenges, namely pixel-level segmentation. Secondly,

we analyse current research on robotic grasping, with an emphasis on data-driven grasp synthesis,

specifically resorting to visual information. Finally, we present methods that employ Domain Randomi-

sation in a robotics scenario, most of which specifically tackle grasping, the remaining focusing only on

perception.

2.1 Convolutional neural networks

Having introduced the main properties of Convolutional Neural Networks (CNNs) which allow them to outperform traditional neural networks in computer vision, in this section we discuss a few concrete approaches for classification, detection and segmentation tasks. Additionally, we provide the historical context which led to the sustained improvement of the state–of–the–art witnessed in the past few years.

As of today, CNNs are state–of–the–art in visual recognition. Several architectures for CNNs

have been proposed recently in part due to Imagenet’s Large Scale Visual Recognition Challenge

(ILSVRC) [62]. The latter is a yearly competition whose goal is to evaluate the performance

of algorithms in object classification and detection tasks at a large scale. Analogously, the PASCAL

VOC challenge [13] initially focused on the same problems, later extending the set of challenges to

include object segmentation and action detection.

2.1.1 Classification

AlexNet [35] is considered to have first popularised CNNs. This network was submitted to ILSVRC

2012 and significantly outperformed its runner-up, achieving 15.3% top-5 test error in object classification

while the second-best entry attained 26.2%. It has a similar architecture to LeNet [37], which had been

developed in 1998 by Yann LeCun. AlexNet had over 60 million parameters and 650,000 neurons. As

previously stated, data availability is a major concern when training networks of such dimension, as is
the computational effort required. ILSVRC used a subset of the ImageNet dataset with roughly
1000 images for each of the 1000 categories, resulting in 1.2 million training images, 50,000 validation

images, and 150,000 testing images.

GoogLeNet [66] won ILSVRC 2014. This work’s main contribution was the drastic improvement in

the use of computing resources, managing to reduce the number of estimated parameters to 4 million,

through what the authors call an inception module. The latter applies multiple convolutional filters to

a given input in parallel and concatenates the resulting feature maps. This allows the model to take


advantage of multi-level feature extraction from each input.

This architecture has since been improved, its most recent variant being Inception-v4 [67], which

achieved 3.08% top-5 error on the test set of the ImageNet classification challenge.

The runner-up in ILSVRC 2014 became known as VGGNet [65]. This work shows that the depth of

the network is a critical component, by achieving state–of–the–art performance using very small (3× 3)

convolutional filters and 2× 2 pooling layers. However, these configurations require much more memory

due to the massive amount of parameters (approximately 140 million). Most of these parameters consist

of weights in the first fully connected layer, concretely around 100M.

ResNet [22] was victorious in ILSVRC 2015. Its name stands for Residual Network, as the authors’

approach consists of modelling the layers as learning residual functions with reference to the layers’

inputs. Their main contribution is showing that these residual networks are easier to optimise, thus

allowing for deeper networks to be trained and achieving considerably better performance (around 3.57%

test error on ImageNet), whilst maintaining a lower complexity.
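The residual idea can be sketched as a block that outputs F(x) + x, so that the stacked convolutions only have to learn a correction F(x) with respect to the block input x. The Keras snippet below is our own minimal illustration, with arbitrary sizes, and is not the actual ResNet configuration.

import tensorflow as tf

def residual_block(x, filters):
    # y = F(x) + x: the convolutional branch only learns a residual correction.
    f = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    f = tf.keras.layers.Conv2D(filters, 3, padding="same")(f)
    y = tf.keras.layers.Add()([f, x])   # identity shortcut connection
    return tf.keras.layers.ReLU()(y)

inputs = tf.keras.Input(shape=(32, 32, 16))
outputs = residual_block(inputs, filters=16)   # filters must match the input depth for the addition
model = tf.keras.Model(inputs, outputs)
model.summary()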

2.1.2 Detection

The aforementioned network architectures show the progress in object classification for image input.

However, we have not yet addressed the intuitively more challenging problems such as object detection

and segmentation. In 2013 a three-step pipeline was proposed [17] in an attempt to improve the state

of the art results in PASCAL VOC 2012, achieving a performance gain of over 30% in mean Average

Precision (mAP)1 when compared to the previous best result. Their proposed method entitled R-CNN

(Regional CNN) first extracts region proposals from the image, and then feeds each region to a CNN

with a similar architecture to that of AlexNet [35]. The output of the CNN is then evaluated by a Support

Vector Machine (SVM) classifier. Finally, the bounding boxes are tightened by resorting to a linear

regression model. This network produces the set of bounding boxes surrounding the objects of interest

and the respective classification. The region proposals are obtained through selective search [74]. This

method has a major pitfall – it is very slow. This is due to requiring the training of three different models

simultaneously, namely the CNN to generate image features, the SVM classifier and the regression

model to tighten the bounding boxes. Moreover, each region proposal requires a forward pass of the

neural network.

In 2015, the original author proposed Fast R-CNN [16] to address the above-mentioned issues. This

network has drastically faster performance and achieves higher detection quality. This is mainly due

to two improvements. The first leverages the fact that there is generally an overlap between proposed

interest regions, for a given input image. Thus, during the forward pass of the CNN it is possible to reduce

1 For completeness' sake, the definition of the mAP metric is provided in Appendix A.1.3, along with that of other required concepts, such as precision and recall, which are introduced earlier in Appendix A.1.1.


the computational effort substantially by using Region of Interest (RoI) Pooling (RoIPool). The high-level

idea is to have several regions of interest sharing a single forward pass of the network. Specifically,

for each region proposal, we keep a section of the corresponding feature map and scale it to a pre-

defined size, with a max pool operation. Essentially this allows us to obtain fixed-size feature maps for

variable-size input rectangular sections. Thus, if an image section includes several region proposals we

can execute the forward pass of the network using a single feature map, which dramatically speeds up

training times. The second major improvement consists of integrating the three previously separated

models into a single network. A Softmax layer replaces the SVM classifier altogether and the bounding

box coordinates are calculated in parallel by a dedicated linear regression layer.
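The RoI pooling operation described above can be sketched in a few lines of NumPy: the rectangular region of the shared feature map is split into a fixed grid of bins and each bin is max-pooled, so regions of different sizes always produce an output of the same shape. The feature-map size, region and grid below are illustrative placeholders.

import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    # Max-pool one region of interest of a (H, W, C) feature map to a fixed size.
    # roi = (y0, x0, y1, x1) in feature-map coordinates; each output bin is the max
    # over its sub-window, so variable-size regions share a single forward pass of
    # the backbone CNN and still yield a fixed-size feature.
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1, :]
    out_h, out_w = output_size
    ys = np.linspace(0, region.shape[0], out_h + 1, dtype=int)
    xs = np.linspace(0, region.shape[1], out_w + 1, dtype=int)
    pooled = np.zeros((out_h, out_w, feature_map.shape[2]))
    for i in range(out_h):
        for j in range(out_w):
            pooled[i, j] = region[ys[i]:ys[i+1], xs[j]:xs[j+1], :].max(axis=(0, 1))
    return pooled

# Illustrative 8x8 feature map with 4 channels and one 5x6 region proposal.
fmap = np.random.default_rng(2).normal(size=(8, 8, 4))
print(roi_max_pool(fmap, roi=(1, 2, 6, 8)).shape)   # (2, 2, 4), regardless of the region size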

The progress of Fast R-CNN exposed the region proposal procedure as the bottleneck of the object

detection pipeline. Ren et al. [59] introduce a Region Proposal Network (RPN), which is a fully convo-
lutional neural network (i.e. every layer is convolutional) that simultaneously predicts objects' bounding
boxes as well as an objectness score. The latter term refers to a metric for evaluating the likelihood of the

presence of an object of any class in a given image window. Since the calculation of region proposals

depends on features of the image calculated during the forward pass of the CNN, the authors merge

RPN with Fast R-CNN into a single network, which was named Faster R-CNN. This further optimises

runtime while achieving state of the art performance in the PASCAL VOC 2007, 2012 and Microsoft’s

COCO (Common Objects In Context) [40] datasets. However, the method is still too computationally

intensive to be used in real-time applications, running at roughly 7 frames per second (fps) in a high-end

graphics card (Nvidia® GTX Titan X).

An alternative approach is that of Liu et al. [41], which eliminates the need for interest region proposal

entirely. Their fully convolutional network known as Single Shot Detector (SSD) predicts category scores

and bounding box offsets for a fixed set of default output bounding boxes. SSD outperformed Faster R-

CNN and previous state of the art single shot detector YOLO (You Only Look Once) [58], with 74.3% mAP

on PASCAL VOC07. Furthermore, it achieved real-time performance2, processing images of 512 × 512

resolution at 59 fps on an Nvidia® GTX Titan X Graphical Processing Unit (GPU).

2.1.3 Segmentation

Recent research has made promising advances in pixel-level object segmentation. The work of He

et al. [23] extends Faster R-CNN with a parallel branch for predicting a high-quality object segmentation
mask. The authors claim to outperform all single-model state–of–the–art methods in all of the three

challenges of the COCO dataset, including instance segmentation, bounding box object detection and

person keypoint detection. However, even though this architecture provides a much richer understanding

of a scene from images, it cannot be run in real-time, unlike state–of–the–art object detection networks.

2Note that the term real-time is respective to usual image capture rates.


The authors report their implementation ran at 5 fps.

In robotics, it is generally preferable to obtain real-time input with sufficient information for the desired

interaction than slower high-quality input streams, which are rendered unusable due to high latency in

communication or processing. During our review, we found this claim to be substantiated by the promi-

nence of grasping research that employs detection rather than object segmentation as an intermediate

step.

2.2 Robotic grasping

The previous section introduced several deep architectures for solving computer vision problems with an increasing level of difficulty, opening with the image classification task and culminating in pixel-level segmentation and synthetic data creation via generative models. In this section, we will briefly discuss recent work done on robotic grasping by employing visual perception. We start by motivating the choice of data-driven methods and then present several lines of research that were invaluable to our own work.

2.2.1 Analytic and data-driven grasp synthesis

Grasp synthesis refers to the problem of finding a grasp configuration for a given target object that

fulfils a set of requirements. Depending on the degrees of freedom in the manipulator, a grasp configu-

ration can simply define its position and orientation with respect to the target object. When working with

complex multi-fingered hands, a grasp configuration may be given by the position of each joint, or by the

contact points and respective forces exerted on the target object.

Autonomous grasping is a challenging task, especially when trying to generalise to novel objects.

The main reasons for this include sensor noise and occlusions, which prevent an accurate estimation of

object physical properties such as shape, pose, surface friction and mass; and simplifying assumptions

of the physical models.

The classical approach to grasp planning consists of using analytic methods that assume some

prior knowledge about the object, the robot hand model and the relative pose between them. Analytic

methods, also referred to as geometric, usually involve a formal description of the grasping action,

and solving the task as a constrained optimisation problem over analytic grasp metrics, such as force-

closure [2], Ferrari-Canny [14] or caging [61]. Even though this formal approach ensures desirable grasp

properties, the latter often come at the expense of restrictive assumptions which tend not to hold in a

real-world practical application.

Shimoga [64] states that analytic grasp synthesis aims at achieving four essential properties, namely

dexterity, equilibrium, stability and some dynamic behaviour. Each of them is mathematically formulated


and provided with a set of relevant metrics to optimise. Finally, their work focuses on dexterous multi-

fingered grasping, as it claims that parallel-plate or two-fingered grippers are inefficient at grasping

objects of arbitrary shapes and uneven surfaces, and cannot perform small reorientations independently.

Bicchi and Kumar [2] present a comprehensive review on grasping focused on two separate tasks: restraining objects, also known as fixturing, and object manipulation using the fingers, known as dexterous

manipulation. The authors study existing analytic grasp formulations, distinguishing between fingertip

grasps for fine manipulation, and enveloping grasps, which are more suitable for the restraining task,

and sometimes called power grasps. The review identifies the need for finding an accurate yet tractable

contact compliance model, which is particularly relevant when not all internal forces can be controlled.

This is the case for underactuated hands, which have fewer actuated degrees of freedom (DOF) than contact forces.

Contrastingly, empirical or data-driven methods try to overcome the challenges of incomplete and

noisy information concerning the object and surrounding environment by directly estimating the quality of

a grasp from a convenient representation obtained from perceptual data [3]. The latter include monocular

3-colour channel (RGB) frames [39, 29, 77, 63], 2.5D RGB-Depth images [43, 10, 47], 3D point cloud

scene reconstructions [68, 69] and even contact sensor input [20, 8].

Data-driven approaches differ from analytic formulations in that they rely on sampling grasp candi-

dates for target objects and ranking them according to a grasp quality metric. Typically, this requires some previous grasping experience, which can originate from heuristics, simulated data or real robot grasp

trials. Although some data-driven methods measure grasp quality based on analytic formulations [3],

most encode multi-modal data, namely perceptual information, semantics or demonstrations.

Bohg et al. [3] reviewed existing work on data-driven grasp synthesis, as well as methods for ranking

grasp proposals. Their study divides each approach based on whether the target objects are fully known,

familiar or unknown. In the case of known objects, the problem becomes that of object recognition and

pose estimation. For familiar objects, typically the targets are matched with known examples via some

similarity metric. Lastly, for unknown objects, the main challenge is the extraction of object properties

that are indicative of good grasp candidate regions. Methods for novel objects typically rely on local

low-level 2-D or 3-D visual features, or try to estimate the global object shape, either by assuming some

type of symmetry or simply fitting what is observed to approximate volumes.

In this work, the authors observed that most analytic approaches towards grasp synthesis are only

applicable to simulated environments, under the assumption of full knowledge regarding the hand’s

kinematics model, the object’s properties and their exact poses. In a real-world scenario, analytic ap-

proaches are limited, as robotic systems have inherent systematic and random error, which can originate

from inaccurate kinematic or dynamic models, as well as from noisy sensor data. This, in turn, implies

that the relative pose between the robot and the target object can only be approximated, which further


complicates accurate fingertip placement on the object surface. In 2000, Bicchi and Kumar [2] pointed

out that few grasp synthesis methods were robust to positioning errors.

Data-driven grasping became popular in 2004, with the release of open-source tool GraspIt!3 [45].

The latter consists of an interactive simulator capable of incorporating new robotic hands and objects.

It uses the Coulomb friction model4 to compute the magnitude of the forces in the plane tangent to the contact point. The friction cone is approximated by an m-sided pyramid, as shown in Figure 1.5(b). The analysis of

grasp stability is performed in Grasp Wrench Space (GWS), which corresponds to the space of wrenches

that can be applied to an object by a grasp, for provided limits on the contact normal forces.
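To make the pyramid approximation concrete, the short Python sketch below builds the m edge directions of an approximated friction cone from a contact normal and a friction coefficient. This is generic contact geometry under our own naming (friction_pyramid, the choice of tangent basis and m = 8 are illustrative), not GraspIt! code.

import numpy as np

# Illustrative m-sided pyramid approximation of the Coulomb friction cone.
# Given a contact normal n and friction coefficient mu, the cone of admissible
# contact forces is approximated by m edge vectors n + mu*(cos(a_i) t1 + sin(a_i) t2),
# where (t1, t2) is an orthonormal basis of the tangent plane at the contact.

def friction_pyramid(normal, mu, m=8):
    n = np.asarray(normal, dtype=float)
    n /= np.linalg.norm(n)
    # Build an orthonormal basis (t1, t2) of the tangent plane
    helper = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    t1 = np.cross(n, helper)
    t1 /= np.linalg.norm(t1)
    t2 = np.cross(n, t1)
    angles = 2.0 * np.pi * np.arange(m) / m
    edges = [n + mu * (np.cos(a) * t1 + np.sin(a) * t2) for a in angles]
    return [e / np.linalg.norm(e) for e in edges]   # unit edge directions

# Example: 8-sided pyramid for a contact normal along +z and mu = 0.5
for edge in friction_pyramid([0.0, 0.0, 1.0], mu=0.5):
    print(np.round(edge, 3))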

2.2.2 Vision-based grasping

In our work, we focus mostly on data-driven grasp learning using information collected from visual

and depth sensors, as opposed to tactile feedback for instance. This decision was mainly propelled by

the richness of data provided by RGB and RGB-D cameras and their low cost. These factors have led

to an ever-growing application of such sensors in robotics.

2.2.2.A Grasp detection

Lenz et al. [38] formulates grasping as a detection task for real RGB-D images. Specifically, grasps

are represented as oriented rectangles in the image plane, defined by rectangle centre (x, y), angle

θ and gripper aperture w (rectangle width), as well as the height of the gripper z with respect to the

image plane (see Figure 2.1(a)). Upright gripper grasps are usually referred to as 4-DOF, as only

(x, y, z, θ) can be changed (the width of the gripper is usually not considered in this convention). The

grasp detection is achieved by a two-step cascaded structure with two deep neural networks. The first

network eliminates unlikely grasp candidates quickly and provides the second with its top detections. The

second network is slower, but evaluates fewer grasp candidates. This research improved and greatly

helped establish the Cornell grasping dataset5 – initially proposed in Jiang et al. [30] – as a benchmark

for state–of–the–art grasp detection methods. The latter consists of 1035 real-life RGB-D images of 280

graspable objects, each manually annotated with several ground truth grasp candidates and respective

binary success labels. Examples from this dataset are represented in Figure 2.1(b).
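To illustrate the oriented-rectangle convention described above, the sketch below encodes a grasp as an image-plane rectangle; it is our own toy example (not code from [38] or [30]), and in the formulation of Lenz et al. the gripper height z above the image plane is kept as an additional, separate parameter.

import math
from dataclasses import dataclass

# Toy sketch of the Cornell-style oriented grasp rectangle: centre, orientation
# and rectangle dimensions. corners() recovers the four image-plane vertices,
# which is essentially what the dataset annotations store.

@dataclass
class GraspRectangle:
    x: float       # centre column (pixels)
    y: float       # centre row (pixels)
    theta: float   # orientation w.r.t. the horizontal axis (radians)
    w: float       # gripper aperture (rectangle width)
    h: float       # gripper plate size (rectangle height)

    def corners(self):
        c, s = math.cos(self.theta), math.sin(self.theta)
        half = [(+self.w / 2, +self.h / 2), (-self.w / 2, +self.h / 2),
                (-self.w / 2, -self.h / 2), (+self.w / 2, -self.h / 2)]
        return [(self.x + dx * c - dy * s, self.y + dx * s + dy * c)
                for dx, dy in half]

# Example: a 60-pixel-wide grasp rotated by 30 degrees
print(GraspRectangle(x=120, y=80, theta=math.radians(30), w=60, h=20).corners())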

Additional modalities have been fused with RGB-D image data for grasping tasks. For instance,

Guo et al. [20] proposes a hybrid deep architecture that fuses visual and tactile perception. The visual

component of their method is also tested on Cornell grasping dataset [30]. To train and evaluate the

tactile module, the authors collect a novel dataset containing visual, tactile and grasp configuration

information, using a real UR5 robotic arm, a Barrett hand and a Kinect 2 sensor.

3 GraspIt! webpage http://graspit-simulator.github.io/, as of December 17, 2018
4 We describe the Coulomb friction model in detail in Section 1.2.2.A.
5 Cornell grasping dataset http://pr.cs.cornell.edu/deepgrasping/, as of December 17, 2018


Figure 2.1: (a) Cornell grasp rectangle convention (source: Redmon and Angelova [57]); (b) examples from the Cornell dataset [30] with positive grasp annotations. Green and red edges represent grasp aperture and width of gripper plates, respectively.

Chu et al. [10] introduces a deep learning architecture that outputs several 4-DOF grasp candi-

dates in a multi-object scenario, given a single RGB-D image of the scene. The proposed architecture

includes a grasp region proposal module and holds the current state–of–the–art results for Cornell

grasping dataset. The grasp estimation problem is then partitioned into a regression over the param-

eters of the grasp bounding box, namely its centre (x, y) and dimensions (w × h), and a classification of the

(discrete) orientation angle θ. The gripper orientation classifier accounts for the situation in which no

valid grasps exist in a given region proposal by including an additional competing class – a no grasp

class. The network input consists of RG-D images, as proposed earlier by Kumra and Kanan [36].

This work achieved great improvements over existing research and currently sets the state–of–the–art

results on Cornell grasping dataset [30] with 96.0% and 96.1% accuracy in image-wise and object-

wise splits, respectively. The authors acquired a multi-object multi-grasp dataset similar to Cornell,

and made it available at grasp_multiObject6. The proposed deep architecture is also open-source as

grasp_multiObject_multiGrasp7. Finally, the authors tested their method in a real robot, achieving

real-time performance and 96.0% grasp localisation and 89.0% grasping success rates on a test set of

household objects.

2.2.2.B Grasp quality estimation

Alternative approaches attempt to directly estimate the probability of grasp success for a given can-

didate pose from visual data, instead of performing grasp detection in a static image of the tabletop

scene.

In Levine et al. [39] the authors train a large convolutional neural network to predict whether a joint

motor command will result in a successful grasp using only monocular RGB camera images, while being

given no information regarding camera calibration or current robot pose. In order to achieve this, they

6 grasp_multiObject Multi-object multi-grasp grasping dataset, on GitHub as of December 17, 2018
7 grasp_multiObject_multiGrasp Multi-object multi-grasp grasping deep architecture, on GitHub as of December 17, 2018


gathered a dataset of upwards of 800.000 grasp trials, using between 6 and 14 robotic manipulators over the course of several months. Even though the results were favourable, this clearly illustrates how time- and resource-consuming it is to acquire such a dataset. The authors provide a supplementary

video8 showcasing their work.

Mahler et al. [42] introduces Dex-Net 1.0, a novel grasp proposal model which relies on a large

database with over 10.000 objects autonomously labelled with grasp candidates for parallel-plate grip-

pers and respective analytic quality metrics. The emphasis is centred around a scalable retrieval system

that can be deployed in a cloud computing scenario. Mahler et al. [43] builds upon previous work and

trains a Grasp Quality CNN (GQ-CNN) to predict grasp success given a corresponding depth image.

This work was named Dex-Net 2.0 and its contributions include (i) a new dataset of over 6.7 million point

clouds and analytic grasp metrics corresponding to sampled grasp candidates, which are planned for

a subset of 1500 Dex-Net 1.0 objects; (ii) the novel GQ-CNN model that classifies a candidate grasp

as success or failure given an overhead depth image overlooking an uncluttered single-object table-

top scenario; (iii) a grasp planning method that samples 4-DOF grasp candidates and ranks them with

GQ-CNN. Later, the authors went on to extend this work in order to propose suction gripper grasps [44].

Section 4.1.2 provides additional details on the Dex-Net 2.0 pipeline.

Rather than treating grasp perception as a detection of grasp rectangles in depth images, ten Pas

et al. [69] tackles the harder problem of 6-DOF grasping in 3D space using parallel-plate grippers. The

work incorporates and extends two prior conference publications [68, 19]. The authors’ algorithmic

contributions include (i) a novel method for generating grasp proposals without precise target object

segmentation; (ii) a grasp descriptor that incorporates surface normals and multiple RGB-D camera

viewpoints; (iii) a method for incorporating prior knowledge regarding the category of the target object.

Furthermore, the authors introduce a method for detecting grasps for a specific target object by combin-

ing grasp and object detection. Their framework is tested in a dense clutter pick and place task with a real

Baxter robot. This work introduces a novel method for evaluating grasp detection performance in terms

of recall at an elevated level of precision, i.e. imposing a constraint on the number of false positives.

Their novel grasp representations are generated from a point cloud in the gripper-closing region, with

information regarding the visible and occluded geometry. A simulated dataset is created to pre-train their

proposed algorithm, Grasp Pose Detector (gpd9), using object meshes from ShapeNet [9]. Grasps are

evaluated as positive or negative candidates by testing if they are frictionless antipodal, which not only

requires them to be force-closure but to also verify the latter condition for any positive friction coefficient.
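The frictionless antipodal test admits a compact geometric check, sketched below. This is our simplified reading of the condition (not the gpd implementation); the function name, the tolerance parameter tol_deg and the box example are our own choices.

import numpy as np

# Two contacts form a frictionless antipodal grasp for a parallel-plate gripper
# when the line joining them is aligned with both inward surface normals, i.e.
# the contacts can oppose each other using purely normal forces.

def is_frictionless_antipodal(p1, n1, p2, n2, tol_deg=5.0):
    """p1, p2: contact points; n1, n2: inward-pointing unit surface normals."""
    axis = np.asarray(p2, dtype=float) - np.asarray(p1, dtype=float)
    axis /= np.linalg.norm(axis)
    cos_tol = np.cos(np.radians(tol_deg))
    # n1 must point (approximately) along p1 -> p2, and n2 along p2 -> p1
    return float(np.dot(n1, axis)) >= cos_tol and float(np.dot(n2, -axis)) >= cos_tol

# Opposite faces of a box grasped across its width
print(is_frictionless_antipodal([0, -0.05, 0], [0, 1, 0], [0, 0.05, 0], [0, -1, 0]))     # True
print(is_frictionless_antipodal([0, -0.05, 0], [0, 1, 0], [0.05, 0.05, 0], [0, -1, 0]))  # False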

Yan et al. [77] tackles the problem of evaluating grasp configurations in unconstrained 3D space

from RGB-D data, generalising to 6-DOF unlike most existing approaches, which impose upright grasp-

ing configurations. The authors introduce a deep geometry-aware grasping network (DGGN) which is

8 Video: Learning Hand-Eye Coordination for Robotic Grasping, on YouTube as of December 17, 2018
9 atenpas/gpd Grasp Pose Detector, on GitHub as of December 17, 2018


trained in a two-step fashion. Firstly, the network learns a compact object 3D shape representation in

the form of occupancy grid from RGB-D input. This is achieved in a weakly supervised manner in the

sense that, instead of directly evaluating the voxelised representation, they render a depth image

which is then compared to the ground truth image. Secondly, this internal representation is used to

learn the outcome of a given grasp proposal. This work's contributions include (i) a novel method for 6-DOF grasping from RGB-D input; (ii) a grasping dataset acquired in simulation, resorting to virtual reality

to obtain over 150.000 human demonstrations; (iii) the model outperformed state–of–the–art baseline

CNNs in outcome prediction task; (iv) the model generalises to novel viewpoints and unseen objects.

Morrison et al. [47] proposes an object-indiscriminate grasp synthesis method which can be used for

real-time closed-loop 4-DOF upright grasping with 2-fingered grippers. The authors introduce the Gener-

ative Grasping CNN (GG-CNN) which maps each pixel of an input depth image to a grasp configuration

and associated predicted quality. This 1-to-1 mapping effectively bypasses the need to sample numer-

ous grasp candidates and estimate their quality individually, which leads to long computation times. The

network is trained on a pre-processed and augmented Cornell dataset. It produces four different out-

puts, mapping a single depth frame to essentially four single-channel images, predicting respectively

pixel-wise grasp quality Q, gripper width W and cosine and sine encoding of the gripper angle Φ. The

system is evaluated on a real Kinova Mico arm10 in single object and cluttered environments in both

static and dynamic situations. Their implementation is available on GitHub as GG-CNN11.
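To make the pixel-wise output concrete, the sketch below shows one way the four GG-CNN-style maps could be decoded into a single 4-DOF grasp. This is not the authors' implementation: the maps, the depth image and the camera intrinsics (fx, fy, cx, cy) are assumed inputs, and the back-projection step is our addition for illustration.

import numpy as np

# Decode the best grasp from pixel-wise maps: quality q, cos(2*phi), sin(2*phi)
# and gripper width w, plus a depth image to lift the chosen pixel into 3D.

def decode_best_grasp(q, cos2, sin2, width, depth, fx, fy, cx, cy):
    """Pick the highest-quality pixel and recover (x, y, z, angle, width)."""
    v, u = np.unravel_index(np.argmax(q), q.shape)     # best pixel (row, col)
    angle = 0.5 * np.arctan2(sin2[v, u], cos2[v, u])   # undo the 2*phi encoding
    z = depth[v, u]
    x = (u - cx) * z / fx                              # pinhole back-projection
    y = (v - cy) * z / fy
    return x, y, z, angle, width[v, u]

# Toy example with random maps, just to show the call signature
h, w = 48, 48
rng = np.random.default_rng(1)
maps = [rng.random((h, w)) for _ in range(4)]
depth = np.full((h, w), 0.5)
print(decode_best_grasp(*maps, depth, fx=250.0, fy=250.0, cx=w / 2, cy=h / 2))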

2.2.2.C End-to-end reinforcement learning

Reinforcement Learning (RL) allows us to formulate grasp planning as an end-to-end motor control

problem, as an agent continually retrieves sensorial data from the scene and outputs motor commands.

Typically, in doing so we can couple the reaching and grasping into a single learning task, unlike the

previous methods which focused mainly on the latter.

Supervised learning methods typically do not take into account the sequential aspect of the grasp-

ing task, instead choosing a course of action at the outset or continuously attempting the next most

promising grasp candidate, in a greedy fashion.

Quillen et al. [56] explores RL using deep neural networks for vision-based robotic grasping. Their

work introduces a simulated benchmark for robotic grasping algorithms which favour off-policy learning

and aim to generalise to novel objects. The authors claim that off-policy learning is suitable for grasp-

ing, as it should allow for better generalisation capabilities, by allowing the agent to explore diverse

actions. It features a comparative analysis of several methods and concludes that even simple methods

achieved performance comparable to that of popular algorithms such as double Q-learning. This work’s

10 Kinova Mico arm webpage https://www.kinovarobotics.com/en/products/robotic-arm-series/mico-robotic-arm, as of December 17, 2018

11 dougsm/ggcnn Generative Grasping CNN, on GitHub as of December 17, 2018


contributions include an open-source simulated Gym environment12, using PyBullet13, which features a

6-DOF KUKA iiwa14 robotic arm with a 2-fingered gripper for a bin-picking task. Additionally, the authors

benchmark six different methods, namely (i) grasp success prediction proposed by Levine et al. [39],

(ii) double Q-learning [21], (iii) Path Consistency Learning (PCL) [48], (iv) Deep Deterministic Policy

Gradient (DDPG), (v) Monte Carlo Policy Evaluation, and (vi) Corrected Monte Carlo. Each algorithm

is evaluated by overall performance, data-efficiency, robustness to off-policy data and hyper-parameter

sensitivity.

Gualtieri and Platt [18] addresses 6-DOF manipulation for parallel-plate grippers, by formulating

the problem as a Markov decision process (MDP) with abstract states and actions (Sense-Move-Effect

MDP). The authors propose a set of constraints which causes the robot to learn a sequence of gazes

to focus attention on the aspects of the scene relevant to the task at hand. Additionally, they introduce

a novel system which leverages the aforementioned contributions and can learn either picking, placing or pick-and-place tasks. The learning model is trained in the OpenRAVE15 simulated environment, but also tested on a real UR5 robotic arm equipped with a Robotiq 85 parallel-jaw gripper and a Structure depth

camera. The authors compare their approach’s performance with GPD, proposed by ten Pas et al. [69],

and found it to beat then-state–of–the–art results for grasping success rates. This was verified for the

case where the algorithm is allowed to perform 10 separate independent runs and to choose the action with the highest state-action value. Without sampling, their method proved inferior to the baseline in some specific

tasks.

2.2.2.D Dexterous grasping

Few data-driven approaches employing deep neural networks and synthetic data focus on the harder

problem of dexterous grasp planning, i.e. finding appropriate grasp poses for multi-fingered robotic

manipulators.

Veres et al. [75] proposes the novel concept of grasp motor image (GMI), which combines object per-

ception and a learned prior over grasp configurations to produce new grasp configurations to be applied

to different and unseen objects. This work focuses on the use of motor imagery for creating internal

representations of an action, requiring some prior knowledge of the object properties. Its main contribu-

tions include showing that a unified space of generic object properties and prior experience of grasping

similar objects can be constructed to allow grasp synthesis for novel objects. Secondly, it provides a

probabilistic framework, leveraging deep generative architectures, namely Conditional Variational Auto-

Encoders (CVAEs) to learn multi-modal grasping distributions for multi-fingered robotic hands. The

12 KUKA Gym environment https://goo.gl/jAESt9, as of December 17, 2018
13 PyBullet webpage https://pybullet.org/wordpress/, as of December 17, 2018
14 KUKA iiwa arm webpage https://www.kuka.com/en-de/products/robot-systems/industrial-robots/lbr-iiwa, as of December 17, 2018
15 OpenRAVE robotics simulator http://openrave.org/, as of December 17, 2018


authors collect a dataset of cylindrical precision robotic grasps and corresponding RGB-D images, em-

ploying the setup designed in Veres et al. [76]. The latter is built within V-REP16 simulator and generates

grasp success labels and corresponding contact points and force normals with full dynamic simulation.

This dataset generation pipeline is open-source and available as multi-contact-grasping17. The auto-

encoder network is given RGB-D images and is trained to output contact points and respective surface

normals.

2.3 Domain Randomisation

The previous sections have described how Convolutional Neural Networks (CNNs) became the state–of–the–art method for several visual perception tasks, and described several data-driven robotic grasp pipelines. Among the latter, we highlighted many works that employ synthetic data generated in virtual robotic environments, which may underperform when directly transferred to real robots, due to the so-called reality gap. In this section, we present relevant research on Domain Randomisation (DR), which is becoming an increasingly well-known technique for allowing knowledge transfer from fully simulated domains to real-world robots.

In our work, we decided to differentiate between two different types of DR depending on which

properties are randomised. Specifically, we start by verifying the effectiveness of DR applied to visual

perception tasks, and then move on to a robotics scenario, in which the technique is applied to the

physical properties of the system. The literature review follows a similar two-part approach.

2.3.1 Visual domain randomisation

Domain randomisation was employed in the creation of simulated tabletop scenarios to train real-

world object detectors with position estimation. The work of Tobin et al. [70] aims to predict the 3D

position of a single object from single low-fidelity 2D RGB images with a network trained in a simulated

environment. This prediction must be sufficiently accurate to perform a grasp action on the object in a

real-world setting. This work’s main contribution is proving that indeed a model trained solely on sim-

ulated data generated employing DR was able to generalise to the real-world setting without additional

training. Moreover, it shows that this technique can be useful for robotic tasks that require precision.

Their method estimates a specific object’s 3D position with an accuracy of 1.5 cm, succeeding in 38 out

of 40 trials, and was the first successful transfer from a fully simulated domain to reality in the context of

robotic control. Furthermore, it provides an example of properties that can be easily randomised in sim-

ulation, namely object position, colour and texture as well as lighting conditions. Moreover, the authors

introduce distractor objects with the purpose of improving robustness to clutter in the scene. Their model uses CNNs trained using hundreds of thousands of 2D low resolution renders of randomly generated scenes. Specifically, the authors employ a modified VGG-16 architecture [65] and use the MuJoCo18 [72] physics simulation environment.

16 V-REP simulator http://www.coppeliarobotics.com/, as of December 17, 2018
17 mveres01/multi-contact-grasping Dexterous grasping dataset generation pipeline, on GitHub as of December 17, 2018

Zhang et al. [78] proposes a method to transfer visuomotor policies from simulation to a real-world

robot by combining DR and domain adaptation in an object reaching task. The authors split the system

into perceptual and control modules, which are connected by a bottleneck layer. During training, this

layer is assigned the position x∗ of the target object, so the network in the perceptual module effectively

learns how to accurately predict its value, given an RGB image. The control module learns to compute

appropriate motor commands in the joint velocity space v, given the object position x∗ and current joint

angles q. The networks are trained using data acquired with V-REP in a table-top reaching in clutter

scenario for a Baxter robot equipped with a suction gripper. The authors employ DR by (i) cluttering

the scene with parametric distractor objects, with a random flat colour texture; (ii) randomising the initial

robotic arm joint configuration; (iii) altering the colours of the table, floor, robot and target cuboid (±10%

w.r.t. original colours); (iv) marginally changing camera pose and camera Field of View (FoV); (v) slightly

disturbing initial table pose. Domain adaptation is introduced by also using real-world images to fine-tune

the perception module. The control module is trained to output the velocity commands given by V-REP

simulator in order to achieve its Inverse Kinematics (IK) solution obtained via an internally implemented

method, in a supervised manner as well. By employing a suction gripper in place of a conventional

fingered robotic manipulator, the grasping physics simulation can be bypassed.

James et al. [29] explores a tabletop scenario with the objective of performing a simple pick and

place routine with an end-to-end control approach, using DR during the data generation. The authors

introduced an end-to-end (image to joint velocities) controller that is trained on purely synthetic data from

a simulator. The task at hand is a multi-stage action that involves the detection of a cube, picking it up and

dropping it in a basket. To fulfil this task, a reactive neural network controller is trained to continuously

output motor velocities given a live image feed. Figure 2.2 depicts several randomly generated synthetic

scenes.

The authors report that without using DR it was possible to fulfil the task 100% of the time in sim-

ulation. However, the performance degrades in the real-world setting if the model is directly applied.

Specifically, the gripper was only able to reach the vicinity of the cube 72% of the time, and did not

succeed even once in lifting it off the table. This work further analyses relevant research questions,

namely (1) what is the impact on the performance when varying the training data size?; (2) which prop-

erties should be randomised during simulation?; and (3) which are the additional features that may

complement the image information (e.g. joint angles) and improve the performance? To answer these

questions, the authors assessed the importance of each randomised property in the simulated training data by removing them one at a time, namely the presence of distractor objects in clutter, the rendering of shadows, texture randomisation, camera position changes and initial joint angle variation.

18 MuJoCo simulation environment webpage http://www.mujoco.org/, as of December 17, 2018

Figure 2.2: Synthetic scenes employing domain randomisation. Source: James et al. [29]

The main observations from the experimental results are as follows: (1) it was possible to obtain over 80% success rate in simulation with a relatively small dataset of 100.000 images, but to obtain similar results in a real-world scenario roughly four times more data was needed; (2) training with realistically

textured scenes as opposed to randomising textures also results in a drastic performance decrease;

(3) not moving the camera during simulation yields the worst performance in the real-world scene as the

cube is only grasped in 9% of the trials and the task finished in 3%. The authors carried out a set of

tests on a real Kinova 6-DOF robotic arm with a 2-finger gripper and found that their method not only thrived

in the real-world without ever being fed a real image but was also robust to changes in the cube’s,

basket’s and camera’s initial positions, as well as lighting conditions and the presence of distractor

objects. These tests are briefly shown in a supplementary video19. To the best of our knowledge, this

work first introduced Perlin noise [52] in the context of DR to model complex background settings.

Sadeghi et al. [63] proposes a novel method for learning viewpoint invariant visual servoing skills, i.e.

the ability to control limbs or tools using mainly visual (RGB) feedback. Specifically, the task at hand is

to move a robotic endpoint to a given desired position, in the vicinity of an object in sight. Unlike most

existing visual servoing methods, this model does not require explicit knowledge of the dynamic model

of the system nor an initial calibration phase. Their approach is to train a deep recurrent controller in

a simulated environment employing DR to visual properties of the scene. The authors describe that

by decoupling visual perception and control and adapting the former using real-world data they were successful in real-world transfer. The adapted model was able to perform in the presence of previously unseen objects from novel viewpoints on a real-world KUKA iiwa robotic arm. This research uses the PyBullet simulated environment and a similar setting to that of Quillen et al. [56].

19 Video: Transferring End-to-End Visuomotor Control from Simulation to Real World for a Multi-Stage Task, on YouTube as of December 17, 2018

With regard to DR, the visual appearance of the target objects, table and ground plane, as well as

scene lighting, are randomised. Objects are textured with high-resolution realistic images of materials such as wooden floors and stone-slab walls, and exhibit a higher complexity than those employed in previous

works [70, 78].

Unrelated to the grasping domain, Tremblay et al. [73] proposes a novel synthetic image dataset

generation pipeline using DR for training deep object detectors. The authors introduce a novel dataset

for detecting car instances in RGB images, and present a comparative study of the performance of three

state–of–the–art CNN architectures, namely SSD [41], Faster R-CNN [59] and R-FCN [11]. Moreover,

they perform a detailed ablation study to evaluate the contribution of each specific DR parameter,

namely light variation, amount of textures, and the presence of proposed flying distractors. Concretely,

their dataset generation framework is built as a plugin to source-available Unreal Engine 420, and allows

one to vary (i) amount and type of target cars, selected from a set of 36 highly detailed 3D meshes;

(ii) number, type, colour and scale of distractors, which include simple parametric shapes but also models

of pedestrians and trees; (iii) texture of all objects in scene, taken from Flickr 8K dataset [24]; (iv) camera

pose; (v) number and location of point lights; (vi) visibility of ground plane. Their work shows that

not only is it possible to produce a network with impressive performance using only non-artistically

generated synthetic data, but also that, with little additional fine-tuning on a real dataset, state–of–the–art

CNNs can achieve better performance than when trained exclusively on real data. A set of synthetic

training examples produced with the proposed dataset generation pipeline is depicted in Figure 2.3.

Figure 2.3: Examples from the car detection dataset employing domain randomisation. Source: Tremblay et al. [73]

2.3.2 Physics domain randomisation

Whereas randomisation of visual properties seems to have become somewhat common in recent

work employing simulation for the generation of training data (at least, to the extent of applying random flat colour textures to objects), few authors explicitly mention applying it to the physical properties of the virtual scenario. Moreover, one can apply DR to the procedure of generating grasp target objects, thus foregoing the need for accurate 3D meshes of real objects, or time-consuming artistic 3D models of household items.

20 Unreal Engine webpage https://www.unrealengine.com, as of December 17, 2018

In Tobin et al. [71] the authors created a pipeline for data generation which includes creating random

3D object meshes, and a deep learning model that both proposes possible grasp configurations and

scores each of them with regard to their predicted efficiency. Their work proposes an autoregressive

architecture that maps a set of observations of an object to a distribution over grasps. Its contributions

are twofold: (i) an object generation module and (ii) a grasp sampling method with a corresponding grasp

quality prediction network. Regarding the former, millions of unique objects are created by the concate-

nation of small convex blocks segmented from 3D objects present in the popular ShapeNet dataset [9].

Specifically, over 40,000 3D object meshes were decomposed into more than 400,000 approximately

convex parts using V-HACD21. Then, grasps are sampled uniformly in 4 and 6-dimensional grasp space.

Each dimension is discretised into 20 buckets, with each bucket corresponding to a grasp point, within

the object’s bounding box. Grasps which do not touch the object can be discarded at the outset. For

each trial, a depth image is collected from the robot’s hand camera.

The model itself consists of two neural networks, one for grasp planning and the other for grasp

evaluation. The former is used to sample grasps which are likely to be successful whereas the latter

leverages more detailed information from the close-up depth image of the gripper’s camera. The network

is trained in simulation, using MuJoCo physics simulator. Their method was evaluated in simulation with

a set of previously unseen realistic objects achieving a high (> 90%) success rate. It demonstrates the

use of DR tools in order to generate a large dataset of random unrealistic objects and generalise to

unseen realistic objects. However, it should be mentioned that their experiments are all performed in a

simulated environment. In this case, the knowledge transfer between domains refers to the transition

from procedurally generated unrealistic object meshes to 3D models of real-life objects.

Related work has also fused visual and physics-based Domain Randomisation (DR), in order to

achieve real-world transfer in the context of visual servoing [6] and in-hand manipulation [51].

Bousmalis et al. [6] proposes an alternative to DR in order to leverage synthetic data in a real-world

4-DOF upright grasping task using images from a monocular RGB camera. The authors study a range

of domain adaptation methods and introduce GraspGAN, a generative adversarial method for pixel-level

domain adaptation. This work further tests the proposed methods in a real-world scenario, achieving

the first successful simulation-to-real-world transfer for purely monocular vision-based grasping, with a

better performance than that of Levine et al. [39]. Their deep vision-based grasping pipeline is split into

two modules. The first consists of a grasp prediction CNN that outputs the probability of a given motor

command resulting in a successful grasp, much like [39]. The second is a simple manually designed servoing function that is used to continuously control the robot.

21 kmammou/v-hacd Library for near convex part decomposition, on GitHub as of December 17, 2018

The authors perform a comparative analysis between domain-adversarial neural networks

(DANNs) [15] and pixel-level domain adaptation. Additionally, the authors use a novel set of proce-

durally generated random geometric shapes22 during training, and found this approach to achieve better

performance than using realistic 3D models from ShapeNet [9]. Examples of these objects are shown

in Figure 2.4. This approach is similar to that of Tobin et al. [71] and results are congruent. Even though

the term DR is not explicitly mentioned throughout this work, the authors apply randomisation to both the

visual and dynamic properties of the simulated scene. Specifically, they highlight varying the textures

of the robotic arm and objects, as well as the lighting conditions and randomising the object mass and

friction coefficients. A demonstration of the working system is presented in a supplementary video23.

Figure 2.4: Randomly generated object meshes. Source: Bousmalis et al. [6]

A recent work by OpenAI [51] managed to successfully transfer a policy for dexterous in-hand object

manipulation from a fully simulated environment to a real robot. Specifically, by using RL the authors

were able to fulfil a task of object reorientation with a Shadow Dexterous Hand robotic manipulator. The

agent is entirely trained in simulation, in which the authors randomise not only the visual properties of

the scenes but also the physical properties of the system, namely friction coefficients and magnitude of

the gravity acceleration vector. In this fashion, the policies are able to transfer directly to a real-world

environment. The authors provided an impressive video24 showcasing the real robot performance in

consecutive trials. Figure 2.5 shows a set of synthetic scenes with random flat colour textures applied

to the simulated Shadow Dexterous Hand and random lighting conditions.

Their deep architecture uses three monocular RGB camera views to estimate the object 6D pose, i.e.

its position and orientation. The location of each fingertip is obtained with a 3D motion capture system,

which requires 16 PhaseSpace tracking sensors. Finally, a deep control module is trained to output the motor commands given the current and target object orientation, as well as the fingertip locations.

22 Procedurally generated 3D object models https://sites.google.com/site/brainrobotdata/home/models, as of December 17, 2018
23 Video: GraspGAN converting simulated images to realistic ones and a semantic map it infers, on YouTube as of December 17, 2018
24 Video: Real-time video of 50 successful consecutive object rotations, on YouTube as of December 17, 2018

The authors specify several physical properties for which applying DR improved performance. The

latter are summarised in Table 2.1. This research uses open-source OpenAI gym25 [7] and specifically

its robotics environments [53] and MuJoCo proprietary physics simulator [72].

Figure 2.5: Examples of synthetic scenes with random textures applied to the dexterous robotic manipulator. Three camera views are concurrently fed to the neural network; each row corresponds to a single camera viewpoint across scenes. Source: OpenAI [51]

Property                         | Scaling factor          | Additive term
Target object size               | uniform([0.95, 1.05])   | –
Object and robot link masses     | uniform([0.5, 1.5])     | –
Surface friction coefficients    | uniform([0.7, 1.3])     | –
Robot joint damping coefficients | loguniform([0.3, 3.0])  | –
Actuator P gain                  | loguniform([0.75, 1.5]) | –
Joint limits                     | –                       | N(0, 0.15) rad
Gravity vector                   | –                       | N(0, 0.4) m s^-2

Table 2.1: Parameters for randomising physical properties. The latter were either scaled by a random scaling factor or disturbed by a random Gaussian additive term.

25 openai/gym Toolkit for developing and comparing RL algorithms, on GitHub as of December 17, 2018
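As an illustration of how the ranges in Table 2.1 could be drawn at the start of each simulated episode, the following Python sketch multiplies scaled properties by a uniform or log-uniform factor and adds zero-mean Gaussian noise to the additive ones. This is our reading of the table, not code from [51]; the dictionary keys and nominal values are placeholders.

import numpy as np

# Sample one set of randomised physical parameters per episode.
rng = np.random.default_rng()

def loguniform(low, high):
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

def sample_episode_physics(nominal):
    """nominal: dict with the unperturbed physical parameters."""
    return {
        "object_size": nominal["object_size"] * rng.uniform(0.95, 1.05),
        "link_masses": nominal["link_masses"] * rng.uniform(0.5, 1.5),
        "friction": nominal["friction"] * rng.uniform(0.7, 1.3),
        "joint_damping": nominal["joint_damping"] * loguniform(0.3, 3.0),
        "actuator_p_gain": nominal["actuator_p_gain"] * loguniform(0.75, 1.5),
        "joint_limits": nominal["joint_limits"] + rng.normal(0.0, 0.15),   # rad
        # additive noise applied to the vertical gravity component only, for simplicity
        "gravity": nominal["gravity"] + rng.normal(0.0, 0.4),              # m/s^2
    }

nominal = {"object_size": 1.0, "link_masses": 1.0, "friction": 1.0,
           "joint_damping": 1.0, "actuator_p_gain": 1.0,
           "joint_limits": 2.0, "gravity": -9.81}
print(sample_episode_physics(nominal))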


3 Domain Randomisation for Detection

Contents

3.1 Parametric object detection
    3.1.1 GAP: A domain randomisation framework for Gazebo
3.2 Proof–of–concept trials
    3.2.1 Scene generation
    3.2.2 Preliminary results for Faster R-CNN and SSD
3.3 Domain randomisation vs fine-tuning
    3.3.1 Single Shot Detector
    3.3.2 Improved scene generation
    3.3.3 Real images dataset
    3.3.4 Experiment: domain randomisation vs fine-tuning
    3.3.5 Ablation study: individual contribution of texture types

Goals transform a random walk into a chase.

Mihaly Csikszentmihalyi, 1996


3.1 Parametric object detection

In the last chapter, we provided the reader with a brief review of the state–of–the–art methods, starting with the fundamental principles of Convolutional Neural Networks (CNNs), examining how recent proposals employ deep architectures in robotic grasping tasks and, finally, the potential of a recently popular technique for moving from the simulated to the real-world domain. In this chapter, we wish to start exploring the potential of Domain Randomisation (DR) applied to the visual properties of a scene for object detection tasks.

The main objective of this section is to explore the application of DR to visual properties of a scene.

The goal is to train a state–of–the–art neural network with synthetic data generated automatically in a

simulator and transfer performance to a real-world application.

We chose the task of vision-based object detection, which involves not only the classification of

several object instances in an image but also determining their corresponding bounding boxes. Even

though the latter is not as informative as pixel-level segmentation masks, recent methods for object

detection have been shown to achieve real-time performance [58, 41]. In robotics, it is generally preferable to produce rough estimates at a high rate rather than very accurate predictions with a substantial delay, depending

on the problem at hand.

3.1.1 GAP: A domain randomisation framework for Gazebo

Our research is focused on the applications of DR in the field of robotics. As established in Sec-

tion 1.1, real-world robotic trials are both time and resource-consuming. It is usually preferable to run

simulated trials in a robotics environment, due to its many advantages, namely the reproducibility of

experiments and the fact that these can typically be executed faster than real-time.

3.1.1.A Gazebo simulator

We chose Gazebo [34] as the simulation environment for generating synthetic data for our experi-

ments. It is known for its stable integration with ROS1 [55], an increasingly popular robot middleware

platform. The research conducted at our laboratory Vislab2 is heavily reliant on middlewares such as

ROS and YARP3. Ivaldi et al. [26] evaluates the then-existing robotic platforms for full dynamic simula-

tion, with the purpose of electing a new environment into which to port the iCub4 robot. The authors carried out an online survey in 2014 and found that researchers primarily valued realism, but also the use of a uniform Application Programming Interface (API) for both real and simulated robot interaction. This

study concluded that Gazebo was the optimal choice among open-source alternatives for simulating humanoid robots, and V-REP the best among commercial options. Gazebo also excels in its out-of-the-box official support for numerous robotic platforms. Table 3.1 provides a comparative analysis which further illustrates our point.

1 ROS middleware webpage http://www.ros.org/, as of December 17, 2018
2 Vislab webpage http://vislab.isr.ist.utl.pt/, as of December 17, 2018
3 YARP webpage http://www.yarp.it, as of December 17, 2018
4 iCub webpage http://www.icub.org/, as of December 17, 2018

Robot                          | Gazebo | V-REP | MuJoCo
iCub                           | ✓      | ✗     | ✗
ABB YuMi                       | ✓      | ✗     | ✗
Vizzy [46] (in-house platform) | ✓      | ✗     | ✗
Shadow Dexterous Hand          | ✓      | ✗     | Community
Fetch Mobile Manipulator       | ✓      | ✗     | Community
UR5/10 arms                    | ✓      | ✓     | Community
Baxter                         | ✓      | ✓     | ✓
KUKA iiwa                      | ✓      | ✓     | ✓
Kinova Jaco/Mico arms          | ✓      | ✓     | ✓
Barrett hand                   | ✓      | ✓     | ✓

Table 3.1: Availability of robotic platforms in popular simulation environments. The Community label specifies that official support was not provided, but a community-sourced package is available.

Gazebo leverages a distributed architecture with separate modules for physics simulation, graphical

rendering, communication and sensor emulation. Gazebo is split into a server application gzserver

and a client with graphical interface gzclient. Inter-process communication is handled internally by

the ignition Transport library, which in turn relies on Google Protobuf for message serialisation and boost ASIO for the transport mechanism. This framework employs named topic-based communication channels, to which nodes can publish and subscribe, similarly to ROS. In fact, gazebo_ros_pkgs5

provides tools for directly remapping ROS topics to Gazebo internal communication channels. However,

other paradigms present in the ROS framework, such as Services, are not provided in Gazebo, yet it should be possible to implement them.

As of the time of writing, Gazebo supports four open-source physics engines, namely (1) a modi-

fied version of the Open Dynamics Engine (ODE)6, (2) Bullet7, (3) Simbody8, (4) Dynamic Animation and

Robotics Toolkit (DART)9. However, only the first can be used with the default binary installation. Sup-

port for the remaining engines requires Gazebo to be compiled from source. Models and robots use

a common XML representation in Simulation Description Format (SDF) which can be translated to ad-

equate representations for each of the supported physics libraries. Regarding the rendering engine,

Gazebo relies on OGRE10 for 3D graphics, GUI and sensor libraries. Figure 3.1 contains a diagram of

Gazebo’s architecture and module dependencies.

5 ros-simulation/gazebo_ros_pkgs Wrappers, tools and additional APIs for using ROS with Gazebo, on GitHub as of December 17, 2018
6 ODE webpage https://www.ode.org/, as of December 17, 2018
7 Bullet webpage bulletphysics.org, as of December 17, 2018
8 Simbody webpage https://simtk.org/projects/simbody, as of December 17, 2018
9 DART webpage http://dartsim.github.io/, as of December 17, 2018
10 OGRE rendering engine https://www.ogre3d.org/, as of December 17, 2018


Figure 3.1: Gazebo architecture and dependencies overview. Arrows indicate module dependencies.

Perhaps one of the major advantages of Gazebo is its support for plugins, which provide access to

most features in a self-contained fashion, and can be dynamically loaded and unloaded during simula-

tion. Plugins can be of six different types, namely (i) World (ii) Model (iii) Sensor (iv) System (v) Visual

(vi) GUI. Each plugin should solely control the entity is attached to, although it is possible to interact

with other unrelated modules by binding callback functions to specific events which are broadcast inter-

nally. In any case, this is not encouraged as it often results in incorrect usage and possible simulation

instability.

3.1.1.B Desired domain randomisation features

After a thorough review of the related work mentioned in Section 2.3 we identified a few common

techniques for applying DR in the generation of synthetic images. With regard to visual features, we

emphasise the following relevant capabilities:

– Spawn, move and remove objects either with parametric shapes or complex geometry;

– Randomly generate textures and apply them to any object;

– Manipulate the lighting conditions of the scene;

– Render image frames with a virtual camera, with configurable parameters.

Our goal was to incorporate all of the above-mentioned capabilities in a novel tool for simulation within

the Gazebo framework. At the time of development, no other tool targeted DR specifically. Moreover,

most tools for programmatic simulator interaction relied on ROS. This seems redundant, as Gazebo

already provides its own named-topic inter-process communication framework.


3.1.1.C Offline pattern generation library

Gazebo provides a way to apply textures to models, but not an easy way to create them, apart

from simple single-coloured materials. A valid approach is to generate textures in real-time using the

rendering engine’s own material scripting language. However, this approach brings a few immedi-

ate disadvantages, namely being engine-specific, computationally expensive (depending on the com-

plexity of the material), and making it harder to reproduce the obtained textures, if need be.

We developed a texture generation script, which we have made available as a standalone C++ library

pattern-generation-lib11. This tool allows us to generate four types of textures, namely single flat

colour, gradient between two colours, chequerboard and Perlin noise [52]. The tool also creates the

required folder structure for these to be usable in Gazebo. We have exploited trivial parallelism in the

generation of images using OpenMP12, which has greatly sped up the texture creation process. Textures

are therefore created prior to simulation. Figure 3.2 shows a set of examples for each of the four pattern

types supported by our pattern generation library.

Figure 3.2: Example outputs for each of the four pattern types supported by our pattern generation library: (a) flat, (b) chess, (c) gradient, (d) Perlin.
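For illustration, the following NumPy sketch reproduces two of the simpler pattern types (chequerboard and two-colour gradient). The actual pattern-generation-lib is a C++ library; the function names, sizes and colours here are arbitrary choices of ours, and only the image-level operations are meant to match.

import numpy as np

# Minimal sketches of two offline texture patterns as HxWx3 uint8 arrays.

def chequerboard(size=256, cells=8, c1=(255, 0, 0), c2=(255, 255, 255)):
    step = size // cells
    ix, iy = np.meshgrid(np.arange(size) // step, np.arange(size) // step)
    mask = (ix + iy) % 2 == 0
    return np.where(mask[..., None], np.array(c1, np.uint8), np.array(c2, np.uint8))

def gradient(size=256, c1=(0, 0, 128), c2=(0, 200, 255)):
    t = np.linspace(0.0, 1.0, size)[:, None, None]        # vertical blend factor
    img = (1 - t) * np.array(c1, float) + t * np.array(c2, float)
    return np.broadcast_to(img, (size, size, 3)).astype(np.uint8)

# The resulting arrays can then be written to disk with any image library and
# packaged into the folder structure Gazebo expects for material scripts.
print(chequerboard().shape, gradient().shape)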

Perlin Noise

Our experiments (further detailed in Section 3.3.5) have found Perlin noise to be rather effective for

applying DR to object detection tasks. For this reason, we provide a brief description of the algorithm for

Perlin noise generation. A much deeper analysis and thorough explanation of this method can be found

online13.

11 ruipimentelfigueiredo/pattern-generation-lib Texture generation library, on GitHub as of December 17, 2018
12 OpenMP (Open Multi-Processing) API https://www.openmp.org/, as of December 17, 2018
13 Understanding Perlin Noise http://flafla2.github.io/2014/08/09/perlinnoise.html, as of December 17, 2018

The Perlin noise function outputs a floating-point value n ∈ [0, 1] given the point t with coordinates

{x, y, z}. First, we compute the corresponding coordinates of the target point in a 3-dimensional unit

cube. For each of the 8 vertices of the cube, we compute a pseudorandom gradient vector. The

implementation in [52] uses a discrete set of 12 gradient vectors that point from the centre of the cube towards the midpoints of its edges. Then, we need to calculate the 8 vectors from the given point t to each of

the 8 surrounding points in the cube. The dot product between the gradient vector and each distance

vector provides the final influence values. When the distance vector points in the direction of the gradient the dot product is positive, and it is negative when the direction is opposite. Next, we perform a trilinear interpolation between the resulting 8 values to obtain something analogous to a weighted average of the influences. A smoothing fade function is applied to the interpolation weights, to soften the somewhat rough appearance that plain linear interpolation would produce.

The pseudocode of the Perlin noise generation algorithm is shown in Algorithm A.1. Since we com-

pute the value of each pixel independently, we can exploit full parallelism even when generating a single

image.
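In addition to the pseudocode in Algorithm A.1, a minimal 2D Python sketch of the same procedure is given below. Our library follows Perlin's 3D reference implementation [52]; this simplified variant drops the third dimension and uses a small fixed gradient set for brevity, so it is illustrative rather than a faithful port.

import numpy as np

# Minimal 2D Perlin-style gradient noise: pseudorandom gradients at lattice
# corners, dot products with the distance vectors, a fade curve and bilinear
# interpolation with the smoothed weights.

rng = np.random.default_rng(0)
perm = rng.permutation(256)                      # pseudorandom permutation table
grads = np.array([[1, 1], [-1, 1], [1, -1], [-1, -1],
                  [1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)

def fade(t):
    # Perlin's quintic fade curve 6t^5 - 15t^4 + 10t^3
    return t * t * t * (t * (t * 6 - 15) + 10)

def gradient(ix, iy):
    # Hash the lattice corner into one of the fixed gradient directions
    h = perm[(perm[ix % 256] + iy) % 256] % len(grads)
    return grads[h]

def perlin(x, y):
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    xf, yf = x - x0, y - y0
    # Influence of each of the four surrounding lattice corners
    n00 = gradient(x0,     y0    ) @ np.array([xf,     yf    ])
    n10 = gradient(x0 + 1, y0    ) @ np.array([xf - 1, yf    ])
    n01 = gradient(x0,     y0 + 1) @ np.array([xf,     yf - 1])
    n11 = gradient(x0 + 1, y0 + 1) @ np.array([xf - 1, yf - 1])
    u, v = fade(xf), fade(yf)
    # Bilinear interpolation with smoothed weights
    nx0 = n00 * (1 - u) + n10 * u
    nx1 = n01 * (1 - u) + n11 * u
    return 0.5 * (nx0 * (1 - v) + nx1 * v + 1.0)  # map roughly to [0, 1]

# Example: fill a small texture by sampling the noise on a grid
size, freq = 64, 8.0
img = np.array([[perlin(freq * i / size, freq * j / size)
                 for j in range(size)] for i in range(size)])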

3.1.1.D GAP: A collection of Gazebo plugins for domain randomisation

We introduce GAP, a collection of tools for applying DR to Gazebo’s simulated environment. This

software is open source and available as gap14. Our framework has undergone several relevant changes

during development. Originally we conceived our novel tool as a set of two independent server-side

plugins, which were used in our earlier experiments. The first plugin handled object spawning and

removal, whereas the second was responsible for synchronous camera renders of a desired simulated

scene, as well as bounding box computations. Eventually, we added a third plugin for handling visual

objects, which sped up changing object textures, as it allowed us to avoid respawning objects for each

scene, and instead modify existing objects during simulation.

Our software was initially built with the purpose of generating synthetic tabletop scenes, with a ground

plane and randomly placed parametric objects. Our goal was to produce a DR-powered dataset for a

simple shape detection task. In order to achieve it we designed a tool that can:

– Spawn, move and remove objects, either with parametric primitive shapes such as cubes and

spheres or a provided 3D mesh;

– Randomly generate textures and apply them to any object, resorting to our offline pattern genera-

tion library;

– Spawn light sources with configurable properties and direction;

– Capture images with a virtual camera, possibly with added Gaussian noise;

– Obtain 2D bounding boxes from the 3D scene in order to label data for an object detection task. This is achieved by sampling points on the object surface and projecting them onto the camera plane (see the sketch after this list). Box-class objects solely require the projection of their 8 vertices, whereas spheres require as many as (360°/30°)² = 144 surface points, for a fixed angle step of 30 degrees.

14 jsbruglie/gap Gazebo plugins for applying domain randomisation, on GitHub as of December 17, 2018
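The sketch below illustrates the surface-sampling idea with a standard pinhole camera model: the sampled 3D points are projected into the image and the axis-aligned box is taken from the extrema of the projections. The intrinsics K and the world-to-camera transform used here are illustrative placeholders, not the values or interfaces used by our plugins.

import numpy as np

# Project sampled surface points and take the min/max to obtain a 2D box.

def project_points(points_w, K, T_world_to_cam):
    """Project Nx3 world points into pixel coordinates (Nx2)."""
    pts_h = np.c_[points_w, np.ones(len(points_w))]          # homogeneous coords
    pts_c = (T_world_to_cam @ pts_h.T).T                     # camera frame (Nx3)
    uv = (K @ pts_c.T).T
    return uv[:, :2] / uv[:, 2:3]                            # perspective divide

def bounding_box(points_w, K, T_world_to_cam, width, height):
    """Axis-aligned 2D box (x_min, y_min, x_max, y_max), clipped to the image."""
    uv = project_points(points_w, K, T_world_to_cam)
    x_min, y_min = np.clip(uv.min(axis=0), 0, [width - 1, height - 1])
    x_max, y_max = np.clip(uv.max(axis=0), 0, [width - 1, height - 1])
    return x_min, y_min, x_max, y_max

# Example: the 8 vertices of a unit cube centred at the world origin
cube = np.array([[x, y, z] for x in (-0.5, 0.5)
                            for y in (-0.5, 0.5)
                            for z in (-0.5, 0.5)])
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
T = np.eye(4)[:3]                # [R|t] with identity rotation
T[2, 3] = 3.0                    # shift points 3 m along the camera's z axis
print(bounding_box(cube, K, T, 640, 480))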

Due to Gazebo’s distributed structure, we divided our plugin into three modules, tasked respectively

with changing objects’ visual properties, interacting with the main world instance and finally querying

camera sensors. We aimed for a modular design, so that each tool could be used independently. Thus,

each module provides an interface for receiving requests and replying with feedback messages. The

latter can be leveraged by a client application to create a synchronous scene generation pipeline, despite

the program’s distributed nature.

Figure 3.3 showcases a synthetic scene generated using our novel framework, with three parametric

objects and a ground plane, each using one of the four supported types of textures. The bounding box anno-

tations are computed using the aforementioned surface sampling method, which requires full knowledge

of the object geometry.

Figure 3.3: Example synthetic scene employing all four texture patterns. Bounding box annotations computed by our tool. The ground has a flat colour, the box has a gradient, the cylinder has chess and the sphere has Perlin noise.


3.2 Proof–of–concept trials

In the previous section we introduced GAP, a novel framework for applying Domain Randomisation (DR) to visual properties within the Gazebo robotics simulation environment. We described its architecture and main features toward the generation of a synthetic dataset for an object detection task. In this section, we specify how this dataset was created and elaborate on preliminary trials using state–of–the–art CNNs, namely SSD and Faster R-CNN.

3.2.1 Scene generation

With our novel DR tool integrated with the Gazebo simulated environment, we constructed a tabletop

scene generation pipeline. We started by producing 5.000 textures for each of the four supported pattern

types (flat colour, gradient, chess and Perlin noise) using our offline pattern generation tool. Then, we

wrote a client script to spawn simple parametric shapes, namely spheres, cubes and cylinders placed in

a grid, with random textures, sampled uniformly from the aforementioned set of 20k textures. For each

scene, we acquire a RGB frame and automatic annotations including object bounding boxes and class

labels.

Specifically, each synthetic tabletop-like scene incorporates a ground plane, a single light source and

the set of parametric objects. Each object is constrained to occupy a cell in a grid, although they can

rotate freely provided they are supported by the ground plane. The objects can be partially occluded in

the resulting image frame, but do not collide. Even though we deactivate physics simulation, the objects

in the scene should be in a plausible and stable pose.

The scene generation procedure consists of four steps (a simplified client-side sketch of this loop is given after the list):

1. Spawn camera and a random number N ∈ [5, 10] of parametric shapes. Each object is put in one

of 25 cells in a 5 by 5 grid.

2. Request to move camera and light source to random positions, sampled from a dome surface, with

its centre directly above the world origin, so as to ensure all objects are in the camera’s FoV.

3. Obtain object bounding boxes, by sampling 3D object surface points and projecting them onto the 2D

image plane.

4. Save image of the scene, as well as object bounding box and class annotations. The latter are

compatible with PASCAL VOC 2012 [13] annotation format.
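The sketch below mirrors the four steps above from the client's point of view. The request calls (spawn_shapes, move_camera, move_light, get_boxes, render_and_save) are hypothetical stand-ins for the actual GAP plugin messages, and the dome radii and minimum elevation are arbitrary illustrative values.

import math
import random

def random_dome_pose(radius, min_elevation=0.35):
    """Sample a position on a dome centred above the world origin."""
    azimuth = random.uniform(0.0, 2.0 * math.pi)
    elevation = random.uniform(min_elevation, 0.5 * math.pi)
    x = radius * math.cos(elevation) * math.cos(azimuth)
    y = radius * math.cos(elevation) * math.sin(azimuth)
    z = radius * math.sin(elevation)
    return (x, y, z)

def generate_scene(scene_id, textures, client):
    # 1. Spawn a random number of parametric shapes in a 5x5 grid
    num_objects = random.randint(5, 10)
    cells = random.sample(range(25), num_objects)
    objects = [{"shape": random.choice(["box", "sphere", "cylinder"]),
                "cell": c,
                "texture": random.choice(textures)} for c in cells]
    client.spawn_shapes(objects)

    # 2. Move camera and light source to random positions on a dome surface
    client.move_camera(random_dome_pose(radius=2.0))
    client.move_light(random_dome_pose(radius=3.0))

    # 3. Query 2D bounding boxes (surface-point projection on the server side)
    annotations = client.get_boxes()

    # 4. Render the frame and save image plus PASCAL VOC-style annotations
    client.render_and_save(f"scene_{scene_id:05d}", annotations)

class DummyClient:
    """Stand-in that just logs requests; replace with the real plugin interface."""
    def spawn_shapes(self, objs): print("spawn", len(objs), "objects")
    def move_camera(self, pose): print("camera ->", pose)
    def move_light(self, pose): print("light ->", pose)
    def get_boxes(self): return []
    def render_and_save(self, name, ann): print("render", name, len(ann), "boxes")

generate_scene(0, textures=["flat_000", "perlin_017"], client=DummyClient())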

We provide examples of synthetic scenes created with our novel framework in Figure 3.4.

With the first version of our pipeline setup, we managed to create a dataset of 9.000 synthetic scenes.

The texture generation process took little over 20 minutes, but is heterogeneous with regard to required


Figure 3.4: Synthetic scenes generated using GAP. Bounding box annotations provided automatically by our tool.

time, as Perlin noise is more computationally intensive to generate. Each simulated scene took a little
over a second to be produced on a standard laptop, with an Intel® i7 4712HQ processor and

an Nvidia® 940m GPU. At the time, we estimated that the full dataset generation process should take

around 3 hours. To validate our claim that indeed one could use this synthetic data for training object

detectors, we carried out an introductory experiment.

3.2.2 Preliminary results for Faster R-CNN and SSD

We trained two state–of–the–art object detection networks: Faster-RCNN and SSD. Note that

whereas SSD is considered to have real-time performance, Faster-RCNN is much slower to train.

We resorted to the open-source TensorFlow15 GPU-accelerated implementation which is available on

tensorflow/models16.

The output of the networks includes both a detection bounding box and its class label per detection.

The networks were pre-trained on the COCO dataset [40] and fine-tuned – i.e. only the last fully-
connected layer of the networks was trained – for over 5.000 epochs. We did this using our datasets, with a varying

number of frames from our small in-house real-world dataset and our novel preliminary synthetic dataset.

The former consisted of 242 images acquired in our lab with a Kinect v2 RGB-D sensor, half of which

was always used as the test set. The test set contains object instances not seen in the real-world

training set. The acquired synthetic RGB frames had a Full-HD resolution (1920 × 1080) and were

encoded in JPEG lossy format, to match the real dataset. An improved version of this dataset was used

on later experiments, as reported on Section 3.3. We also provide samples of this improved dataset with

respective annotations in the latter section (cf. Figure 3.6).

Specifically, we fine-tune each network with either only synthetic data, only real data or using both

training sets. All images are reshaped to 1080 × 1080 and then downscaled to 300 × 300 pixels. We applied

15 TensorFlow machine learning framework https://www.tensorflow.org/, as of December 17, 2018
16 tensorflow/models: Models and examples built with TensorFlow, on GitHub as of December 17, 2018


standard data augmentation techniques, namely random horizontal flips, to the preprocessed images.
Finally, we evaluate the performance on the real-world test set with the standard per-class Average
Precision (AP) and mAP metrics17, as defined in PASCAL VOC [13].
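For reference, detections are matched to ground truth using the Intersection over Union (IoU) criterion at the 0.5 threshold used throughout this work; the per-class AP is then the area under the precision-recall curve built from those matches. The small helper below is a generic sketch of the IoU computation, not our evaluation code.

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts as a true positive if its IoU with an unmatched ground-truth
# box of the same class is at least 0.5; otherwise it is a false positive.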

The experimental results of these preliminary trials on the real-world test set are shown in Table 3.2.

Figure 3.5 shows example output of our best performing detector.

Network        # Simulated  # Real  AP Box  AP Cylinder  AP Sphere  mAP
Faster R-CNN   0            121     0.89    0.83         0.91       0.88
Faster R-CNN   9000         0       0.80    0.75         0.90       0.82
Faster R-CNN   9000         121     0.82    0.78         0.90       0.83
SSD            0            121     0.74    0.60         0.76       0.70
SSD            9000         0       0.70    0.47         0.77       0.64
SSD            9000         121     0.67    0.40         0.80       0.62

Table 3.2: Preliminary experimental results. The second and third columns state the number of training examples from each dataset. The last column corresponds to the mAP over all three classes. The remaining columns are per-class AP

scores. The results were computed for an IoU of 0.5.

Figure 3.5: Preliminary results on real image test set

From Table 3.2 we observe the following: (i) Faster R-CNN performs around 20% better than SSD
on average, (ii) the real-world-only training set performs better than the other training sets in all but one
case and achieves the best mAP score, (iii) the most difficult object to detect and localise is the cylinder

class. At the time we suggested the number of simulated images was not large enough to consider

different lighting conditions, and thus training with the randomised dataset alone was not able to reach the

desired performance on real-world data. Later we performed tests that would suggest otherwise, in the

sense that we could achieve better performance using an even smaller amount of synthetic images. We

observe that the networks fine-tuned only with synthetic data achieved lower performance, but they were

still able to perform decently in a real-world dataset. These preliminary results inspired us to pursue a

more rigorous study of DR applied to object detection. We published the description of our tool alongside

these results in Borrego et al. [5], in April 2018.

17 For completeness sake, the definition of the mAP metric is provided in Appendix A.1.3, along with that of other required concepts, such as precision and recall, which are introduced earlier in Appendix A.1.1.


3.3 Domain randomisation vs fine-tuning

In the last section, we described a synthetic scene generation pipeline for detecting objects with parametric shapes using state–of–the–art Convolutional Neural Networks (CNNs). In this section, we will provide an in-depth study on how to improve the preliminary results. We start by presenting an overview of Single Shot Detector (SSD), which we chose as the object detector used from this point on. Then we perform a comparative study on training SSD on synthetic data employing the principles of Domain Randomisation (DR), as opposed to fine-tuning an existing model pre-trained on large domain-generic datasets.

In order to apply an object detector in a new domain, it is often required to collect some training

samples from the domain at hand. Labelling data for object detection is harder than labelling it for

object classification, as bounding box coordinates are needed in addition to the target object's identity, which
increases the importance of optimally benefiting from the available data.

After data collection, a detector is selected, commonly based on a trade-off between speed and

accuracy, and is fine-tuned using the available target domain data. Our proposal is to use a synthetic

dataset, with algorithmic variations in irrelevant aspects of simulation, instead of relying on networks

pre-trained with datasets which share little resemblance to the task at hand. This approach is further

detailed in this section.

3.3.1 Single Shot Detector

Even though our preliminary results established that Faster R-CNN obtained better performance in

the object detection task, we elected SSD as the base detector. The main reason for this is that SSD

is one of the few detectors that can be applied in real-time while showing a decent accuracy. Once

more, we emphasise that in the field of robotics real-time performance often comes as a necessity for

closed-loop control systems. Furthermore, since SSD is considerably faster to train, we were able

to perform a thorough ablation study with respect to the individual contribution of texture types for DR,

which is further detailed in Section 3.3.5.

In any case, we believe that SSD has no property which makes it more or less likely to benefit

from DR. Thus, we expect our results to directly generalise to other deep learning based detectors.

We provide a brief description of the inner workings of SSD in this report. However, readers should

refer to the original publication [41] for a comprehensive study of the detector.

At the root of all deep learning based object detectors, there exists a base CNN which is used as a

feature extractor for further down-stream tasks, i.e. bounding box generation and foreground/background

recognition. Similar to YOLO architecture [58], SSD takes advantage of the concept of priors or default

boxes18 where each cell identifies itself as including an object or not, and where this object exists in a

18 Called anchor box in YOLO.


position relative to a default location. However, unlike YOLO, SSD does this at different layers of the

base CNN. Since neurons in different layers of CNNs have different receptive fields in terms of size and

aspect ratios, effectively, objects of various shapes can be detected.

During training, if a ground truth bounding box matches a default box, i.e. they have an IoU of more

than 0.5, the parameters of how to move this box to perfectly match the ground truth are learned by

minimising a smooth L1 metric19. Hard negative mining is used to create a more balanced dataset

between foreground and background boxes. This technique essentially consists of explicitly creating

negative training examples from false positive detections. Finally, Non-Maximum Suppression (NMS) is

used to determine the final location of the objects.
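For reference, the smooth L1 loss mentioned above, as defined in the Fast R-CNN/SSD literature and applied element-wise to the difference x between predicted and target box offsets, is

smoothL1(x) = 0.5 x²,      if |x| < 1,
              |x| − 0.5,   otherwise,

which behaves quadratically near zero and linearly for large errors, making the localisation term less sensitive to outliers.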

Unlike the original SSD architecture, we used MobileNet [25] as the base CNN for feature extraction

in all experiments. MobileNet replaces the standard convolutions of a conventional CNN with depthwise
separable convolutions, drastically reducing the number of parameters without a significant toll on
performance relative to a comparable architecture.

3.3.2 Improved scene generation

In order to lower the execution time of synthetic data generation, we modified the scene creation

script. Originally, each captured image required parametric objects to be generated from an SDF (Simulation
Description Format)20 formatted string, which was altered during run-time in order to allow for

different object dimensions and visuals. Furthermore, objects were created and destroyed in between

scenes. We altered our pipeline so we could instead modify their visual properties directly through the

Gazebo API, reusing existing objects in simulation. We accomplished this by creating a novel plugin that

interacts exclusively with the rendering engine and the objects' visual representation21. This approach is

much more efficient than spawning and removing objects each iteration.

We start by spawning the maximum number of objects of each type in the scene, below the ground

plane, where they cannot be seen by the virtual camera sensor. Then, for each scene, the client ap-

plication performs requests to change the pose, scale and material of the objects' visuals. This method

has been widely used to reduce load times in the video game industry, where performance constraints

are ever present. Specifically, objects are created out of bounds and placed in view of the player once

they become relevant. We found this optimisation to be quite effective, nearly doubling throughput and
generating 9.000 synthetic images in little over 1h30min, despite resorting to a larger set of
available random textures (a total of 60.000 textures, compared to the previously reported 20.000), which on its
own should have increased run-time. We believe this contribution to be significant, particularly considering we did

not have to write rendering-specific code to optimise the pipeline.
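The following sketch illustrates the object-pooling pattern described above; the dictionary fields and the way requests are issued are hypothetical stand-ins for the actual GAP calls.

HIDDEN_POSE = (0.0, 0.0, -10.0)   # below the ground plane, out of the camera's view

def next_scene(pool, active_indices, sample_pose, sample_scale, sample_texture):
    """Reuse pre-spawned models instead of deleting and re-spawning them:
    hide the whole pool, then reconfigure only the objects needed for this scene."""
    for obj in pool:
        obj["pose"] = HIDDEN_POSE            # park unused objects under the ground plane
    for idx in active_indices:
        obj = pool[idx]
        obj["pose"] = sample_pose()          # move into a free grid cell
        obj["scale"] = sample_scale()        # change the object dimensions
        obj["material"] = sample_texture()   # assign a new random texture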

19 For completeness sake, the definition of the L1-norm loss function is provided in Appendix A.1.5.A.
20 SDF Specification http://sdformat.org/, as of December 17, 2018
21 Check Appendix A.2.1.A for additional implementation details.


Additionally, we made several improvements regarding the modularity of each component in the

framework. Other enhancements concern making our tool straightforward to use and to integrate with

other projects, namely the addition of automatic Doxygen22 documentation, which can be found online23

and the ability to use our software kit as a proper CMake24 package, with support for installation and

versioning. This essentially translates to providing a shared library to interact with our plugins.

Finally, we provided example clients that showcase how to use each of the developed tools. Among

these is the script used to create the dataset of images of synthetic tabletop scenes.

3.3.3 Real images dataset

It is our goal to apply our object detector to real-world images. We created an improved dataset of

images acquired in our lab with a Kinect v2 RGB-D sensor, some of which were used in our preliminary

trials with both SSD and Faster R-CNN. These scenes include household objects (such as cups, mugs,

balls and cardboard boxes) placed arbitrarily on the ground, or stacked on top of each other, with varying

degrees of occlusion. Each image was manually annotated with object class labels and bounding boxes.

The refined dataset consists of 250 real images, 49 of which contain objects unseen in training, for the

sole purpose of reporting final performance. The train, validation and test partitions of our real image

dataset are specified in Table 3.3.

Training  Validation  Test  Total
175       26          49    250

Table 3.3: Number of real images in train, validation and test partitions.

In this dataset, no explicit effort was made to keep the percentage of different classes

balanced (Table 3.4). Thus, we have also reported precision-recall curves for each class.

Partition   # Box       # Cylinder   # Sphere   Total
Train set   502 (63%)   209 (26%)    86 (11%)   797
Test set    106 (40%)   104 (40%)    53 (20%)   263

Table 3.4: Number of instances and percentage of different classes in the real dataset.

A set of training images from our small in-house dataset is shown in Figure 3.6. Finally, the dataset

has been made available online25.

22 Doxygen: Generate documentation from source code http://www.doxygen.nl/, as of December 17, 2018
23 GAP Documentation http://web.tecnico.ulisboa.pt/~joao.borrego/gap/, as of December 17, 2018
24 CMake: tools designed to build, test and package software https://cmake.org/, as of December 17, 2018
25 Real dataset http://vislab.isr.ist.utl.pt/datasets/#shapes2018, as of December 17, 2018


Figure 3.6: Real image training set examples

3.3.4 Experiment: domain randomisation vs fine-tuning

We have conducted various experiments and tests to quantify the results of different scenarios.

Initially, two sets of 30k synthetic images were generated. The software modifications we mentioned

in Section 3.3.2 have greatly facilitated this process. These two sets differed from one another by the

degree to which the virtual camera changes its location between scenes. In one set, the viewpoint
was fixed, whereas in the other its location varied widely across scenes. All synthetic images

have Full-HD (1920 × 1080) resolution and are encoded in JPEG lossy format, to match the training

images from the real images dataset. For training and testing, images are down-scaled to half these

dimensions (960 × 540) which is the resolution employed for all test scenarios in our pipeline.

For baseline calculations, we have used SSD, trained on COCO, and fine-tuned it on the train set

until the performance on the real-image validation set failed to improve. In other experiments, we have

used MobileNet, pre-trained on ImageNet, as the CNN classifier of SSD; we first fine-tuned it
on synthetic datasets with larger learning rates and later, in some experiments, fine-tuned it again with
smaller learning rates on the real dataset.

Finally, smaller synthetic datasets of 6k images were generated, each with a type of texture missing,

and an additional baseline for comparison which includes every pattern type. These datasets allowed

us to study the contribution of each individual texture in the final performance, as well as evaluate the

effectiveness of smaller synthetic datasets.

Networks were trained with mini-batches of size 8, on a machine with two Nvidia Titan Xp GPUs, for

a duration depending on the performance in a real image validation set. We have only used horizontal


flips and random crops, with parameters reported in the original SSD paper, as the pre-processing step,

since we are interested in studying the effects of synthetic data rather than of different pre-processing schemes. Finally,
in compliance with the findings in Tremblay et al. [73], all the weights of the network are updated

in our experiments.

We wish to quantify how much an object detector performance would improve due to the usage of

synthetic data. To this purpose, initially, we fine-tuned an SSD, pre-trained on the COCO dataset, with our

real image dataset for 16.000 epochs, which was deemed sufficient by evaluating the performance on

our validation set. We used a decaying learning rate α0 = 0.004, with a decay factor k = 0.95 every

t = 100k steps. In the subsequent sections, we refer to this network as the baseline.
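Assuming the standard exponential-decay schedule (e.g. as implemented in TensorFlow), the learning rate after s training steps is

α(s) = α0 · k^(s/t) = 0.004 · 0.95^(s/100k)

(or the staircase variant with ⌊s/t⌋ in the exponent), so that the rate shrinks by a factor of 0.95 for every t steps of training.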

Afterwards, we trained SSD with only its classifier pre-trained on ImageNet, using our two synthetic

datasets of 30k images each.

These datasets are similar to the one generated during our preliminary trials, yet contain simulated

tabletop scenarios with a random number of objects N ∈ [2, 7], placed randomly on the ground plane in

a smaller 3× 3 grid, to avoid overlap.

In the first dataset, the camera pose is randomly generated for each scene, such that it points to

the centre of the object grid. This generally results in high variability in the output, which may improve

generalisation capabilities of the network at the expense of added difficulty to the learning task, as, for

instance, it exhibits higher levels of occlusion. In the second dataset, the camera is fixed overlooking the

scene at a downward angle, which is closer to the scenario we considered in the real dataset. Example

scenes with viewpoint candidates for each dataset are shown in Figure 3.7.

(a) Moving viewpoint (b) Fixed viewpoint

Figure 3.7: Viewpoint candidates in synthetic scene generation. Left: Viewpoint changes both position and rotation in between scenes. The subfigure represents four possible camera poses. Right: Viewpoint is static.


The scene light source is always allowed to move in a manner akin to the camera in the first dataset.

Similar to the baseline, the networks were trained on these datasets for over 90 epochs, a duration chosen based on
their performance on the validation set, employing an exponentially decaying learning rate starting at

α0 = 8× 10−3, and a decay of k = 0.95 every t = 50k steps. These networks were then directly applied

to the test set (which has real images) without any fine-tuning on our dataset of real object data, in order

to quantify how much knowledge can be directly transferred from synthetic to real domain.

Finally, these two detectors were fine-tuned on the real dataset for over 2200 epochs and with a

fixed learning rate of α = 10−3. The result of this analysis is depicted in Figure 3.8 and summarised

in Table 3.5. Additionally, Figure 3.9 shows example output of our best performing network for our real

image test set.

Figure 3.8: Per-class AP and mAP of different detectors. MV: Moving Viewpoint; FV: Fixed Viewpoint; Real: fine-tuned on the real dataset.

Run          AP Box   AP Cylinder  AP Sphere  mAP
COCO + Real  0.7640   0.5491       0.6664     0.6598
FV           0.4190   0.4632       0.8581     0.5801
MV           0.5578   0.3230       0.8603     0.5804
FV + Real    0.8988   0.7573       0.8395     0.8319
MV + Real    0.8896   0.5954       0.7591     0.7480

Table 3.5: SSD performance on the test set. For abbreviations refer to Figure 3.8.

The network trained on the dataset with no camera pose variation and fine-tuned on real data exhibits

the best performance at 0.83 mAP, which corresponds to an improvement of 26% over baseline.

Perhaps unintuitively, variations in camera position have hurt the mAP score. We believe this is

due to the absence of noticeable camera variations in our real test set. This is further discussed in

Section 5.1. In any case, it is expected that the network trained on the corresponding dataset is more


Figure 3.9: SSD output for real image test set when trained on FV and fine-tuned on Real datasets.

robust and would perform better if it was tested against a dataset with varying camera/light positions.

Figure 3.10 shows the precision-recall curves of different networks for each class. Consistently, the

networks trained on the fixed viewpoint dataset and fine-tuned on the real dataset outperform other

variations. This trend is only less prominent in the case of the sphere class, where, seemingly, due to

the smaller number of examples of this class in the real dataset (Table 3.4), the training benefits less from
fine-tuning on the real dataset for some iso-F1 levels26. This observation is also visible in

Table 3.5. We hypothesise that more real sphere examples could help the detector in improving the AP

score for spheres.

3.3.5 Ablation study: individual contribution of texture types

A valid research question in DR is what the individual contribution of each texture type is,
as well as the importance of sample size. To study this question, we have created smaller synthetic
datasets with only 6k images, in each of which a single specific texture pattern is missing. Once

more, we selected MobileNet pre-trained on ImageNet as the classifier CNN. However, the detectors

were instead trained on these smaller synthetic datasets and then fine-tuned on the real dataset.

The training of all networks on synthetic datasets lasted for 130 epochs, which was found to be the

point where the mAP did not improve over the validation set, with an exponentially decaying learning

rate starting at α0 = 0.004, k = 0.95 and t = 50k steps. Finally, these networks were fine-tuned with

the real-image dataset for 1100 epochs, with a constant learning rate α = 0.001. The performance of

26For completeness sake, the definition of the F1-score is provided in Appendix A.1.4.


(a) Box   (b) Cylinder   (c) Sphere

Figure 3.10: Precision-recall curves of different variants of the detectors. For abbreviations refer to Figure 3.8.

SSD on the validation set during training is shown in Figure 3.11. The results of these experiments are

reported in Figure 3.12 and Table 3.6.

Training dataset  AP Box   AP Cylinder  AP Sphere  mAP
All               0.8344   0.6616       0.8694     0.7885
No Flat           0.8775   0.7546       0.8910     0.8410
No Chess          0.8332   0.6958       0.8485     0.7925
No Gradient       0.8296   0.6172       0.8536     0.7668
No Perlin         0.7764   0.5058       0.7880     0.6901

Table 3.6: SSD performance on the test set after training on each of the 6k sub–datasets and fine-tuning on real images.

By comparing Figures 3.12(a) and 3.12(b), it is clear that all variations have benefited from

fine-tuning with the real dataset.

Regarding individual texture patterns, generally speaking, removing progressively more complex


(a) Before fine-tuning   (b) During fine-tuning

Figure 3.11: Performance of SSD on the validation set during training on smaller datasets of 6k images, each missing a type of texture (with the exception of the baseline), before (3.11(a)) and during (3.11(b)) fine-tuning on the real image dataset.

(a) Before fine-tuning (mAP @ 0.5 IoU): All 0.61, No Flat 0.55, No Chess 0.46, No Gradient 0.47, No Perlin 0.23
(b) After fine-tuning (mAP @ 0.5 IoU): All 0.79, No Flat 0.84, No Chess 0.79, No Gradient 0.77, No Perlin 0.69

Figure 3.12: Performance of SSD on the test set when trained on smaller datasets of 6k images, each missing a type of texture (with the exception of the baseline), before (3.12(a)) and after (3.12(b)) fine-tuning on the real image dataset.

textures (flat being the least complex and Perlin noise the most complex) hurts performance more, and we

found Perlin noise to be a vital texture for object detection, while the flat texture has the least significance.

Consistent with this observation, according to Figure 3.12(b), the small dataset with all the textures

cannot always compete with some of the datasets where a texture is missing.

According to Figures 3.11(a) and 3.11(b), the detector trained on the small dataset with all texture
types outperformed the other variations on the validation set during training. However, presumably due to
the smaller number of samples combined with the larger variety of texture types, it overfitted to the objects in the


train set and failed to generalise to the test set as well as the remaining detectors.

The network trained on our smaller dataset of only 6k images without the “flat” texture has even

slightly out-performed the network that was trained on 30k synthetic images. This result seems to be
consistent for detectors whose classifiers were pre-trained on real images, then trained on synthetic data and
fine-tuned again with real samples. Beyond a certain number of images, the mAP performance oscillates by one or
two percentage points.

Once again, due to the imbalanced real test set, we provide the precision-recall curves for each

of the trained networks in Figure 3.13.

(a) Box   (b) Cylinder   (c) Sphere

Figure 3.13: Precision-recall curves of different variants of the detectors after fine-tuning on the real dataset.


The code for replicating our experiments is also available as tf-shape-detection27. This repository

includes an interactive Jupyter notebook28 with a step-by-step walkthrough which explains how to

1. Download the real image dataset publicly hosted in our server;

2. Preprocess an image dataset and create binary TF records for usage in TensorFlow;

3. Train SSD with configurable starting checkpoints, learning parameters, etc.;

4. Compute mAP curves on the validation set during training and visualise them with the TensorBoard

web application;

5. Export the resulting inference graphs and infer detections on new images;

6. Obtain precision metrics and precision-recall curves on a test set.

Furthermore, it provides a list that summarises all the tests we performed, in tests.md29, and the associ-

ated configuration files for each run, which specify learning parameters and number of training iterations.

Our pipeline is based on TF_ObjectDetection_API30, and uses the open-source implementation of SSD

available in tensorflow/models31.

The results presented in the current section were made available as an early preprint article, in

Borrego et al. [4], in July 2018.

27 jsbruglie/tf-shape-detection: Object detection pipeline, on GitHub as of December 17, 2018
28 Jupyter project webpage http://jupyter.org/, as of December 17, 2018
29 Trial description https://github.com/jsbruglie/tf-shape-detection/blob/master/tests.md, as of December 17, 2018
30 wagonhelm/TF_ObjectDetection_API: TensorFlow Object Detection API Tutorial, on GitHub as of December 17, 2018
31 tensorflow/models: TensorFlow Models, on GitHub as of December 17, 2018


4 Domain Randomisation for Grasping

Contents

4.1 Grasping in simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.1.1 Overview and assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.1.2 Dex-Net pipeline for grasp quality estimation . . . . . . . . . . . . . . . . . . . . . 58

4.1.3 Offline grasp sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.2 GRASP: Dynamic grasping simulation within Gazebo . . . . . . . . . . . . . . . . . . 64

4.2.1 DART physics engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.2.2 Disembodied robotic manipulator . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.2.3 Hand plugin and interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.3 Extending GAP with physics-based domain randomisation . . . . . . . . . . . . . . . 67

4.3.1 Physics randomiser class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.4 DexNet: Domain randomisation vs geometry-based metrics . . . . . . . . . . . . . . 68

4.4.1 Novel grasping dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4.4.2 Experiment: grasp quality estimation . . . . . . . . . . . . . . . . . . . . . . . . . 70


It is far better to grasp the universe as it really is than to persist in delusion, however

satisfying and reassuring.

Carl Sagan, 1995


4.1 Grasping in simulation

In the previous chapter, we studied the application of Domain Randomisation (DR) to visual features in an object detection task. We proposed a novel framework integrated with a widely used robotics simulator for synthetic scene generation. This simulated data was then used for training state–of–the–art Convolutional Neural Networks (CNNs), which we then tested on a small real image in-house dataset. In this chapter, we will focus on the problem of robotic grasping, and how to extend DR to physical properties in the context of simulated dynamic grasp trials. We start by stating our problem and assumptions in order to simplify it.

In this section, we aim to explore how to apply DR in a simulated grasping environment. Particularly,

we wish to apply randomisation to the physical properties of the entities at play, such as mass, friction

coefficients, robotic joint dynamic behaviour and controller parameters. Ultimately, we hope to create

synthetic datasets which can be used for real robotic grasping. For this, we require a tool that incorpo-

rates a grasp proposal module and pairs each of the grasp candidates with a quality metric. Additionally,

in order to achieve a robust grasp quality estimation, we wish to obtain some metric through full dynamic

simulation of grasp trials, preferably in an open and established robotics simulator. This approach differs

from that of Mahler et al. [43], which computes metrics from the geometry of the object, such as force-

closure and Ferrari-Canny [14]; or GraspIt! [45], which relies on building the GWS from each individual

contact between the robotic manipulator and the target object. Instead, we propose to obtain a grasp

quality estimation by replicating real robotic trials in a simulated environment, and evaluate whether

they are successful or not. In this fashion, by applying DR to the physical properties between different

attempts of a given grasp configuration we aim to achieve a robust metric.

4.1.1 Overview and assumptions

We are aware that robotic grasping is a hard task, even more so when the target object geometry is

unknown. For this reason, we must first simplify the problem at hand, imposing reasonable constraints,

which may eventually be relaxed, so as to streamline our research.

Firstly, in our work we chose not to include the reaching motion as part of the grasping problem.

Essentially, this task corresponds to positioning the robotic manipulator in a pre-grasp pose and should

be made possible by a 6-DOF arm to which the hand is attached. For an adequately calibrated robot, with

known kinematic model, this effectively corresponds to solving an IK problem, i.e. computing the angles

of each arm joint required for the endpoint to perform the desired trajectory, while avoiding collisions with

other objects in the world. Currently existing frameworks such as MoveIt!1 already provide a satisfactory

solution to the reaching problem. The latter software can produce collision-free trajectories in simple

scenarios and supports a great number of robot platforms.

1 Robotic motion planning software https://moveit.ros.org/, as of December 17, 2018


By eliminating the reaching action from our concerns we are able to focus solely on the grasp quality

estimation, and avoid computationally expensive IK calculations or learning policies for closed-loop arm

control. Thus, we assume a disembodied manipulator with full 3D-space mobility. Evidently, this or

similar approaches have been proposed in relevant recent literature [31, 60, 75, 77, 18].

In order to perform full dynamic simulation of grasp trials we require an adequate physics engine,

preferably integrated with Gazebo, so our experiments on grasping come as a natural extension of our

previous work on DR, reported in Chapter 3. This, in turn, restricts us to rigid body simulation, as none

of the supported engines can effectively simulate soft or compliant objects2. To our knowledge, only

the proprietary physics engine MuJoCo fully supports this feature while integrated with a robotic simulation

framework. The contact forces present in a grasping action play a crucial role in whether it will be

successful or not. For rigid, dry contacting surfaces, most physics engines use the Coulomb friction

model, which we elaborate on further in Section 1.2.2.A.

Finally, for now we chose to focus on parallel-plate grippers, as opposed to dexterous manipula-

tors. We intend to extend our work to multi-fingered autonomous grasping. For this reason, our novel

framework provides out-of-the-box support for some desired features, namely coupled joints, which are

typically found in underactuated hands.

Summarising, our principal assumptions involve using:

1. A disembodied robotic manipulator with unconstrained movement in 3-D space;

2. Simplified rigid body contact and Coulomb friction models;

3. A parallel-plate gripper as opposed to dexterous multi-fingered robot hands.

In the next section, we provide a brief description of Dex-Net 2.0 [43], which was later used in our

work.

4.1.2 Dex-Net pipeline for grasp quality estimation

Rather than designing our own end-to-end pipeline for grasp candidate generation and evaluation, we

decided to employ that of Mahler et al. [43]. This allows us to focus on studying the impact of employing

DR and prevents additional biases stemming from novel self-proposed methods, which would ultimately

confuse a comparative analysis.

Recalling Section 2.2.2.B, the authors tackle grasping unknown objects with parallel-plate grippers

given visual input from an RGB-D camera mounted on the gripper. The problem of grasping is split into

two steps, namely (i) grasp candidate proposal and (ii) grasp quality evaluation.

2 In fact, the DART physics engine can simulate soft contacts using the model of Jain and Liu [27]. However, currently, Gazebo only supports this feature for box-shaped objects.


Specifically, [43] states the second step as approximating a grasp robustness function, given by

Q(u,y) = E[S(u,x)|u,y], where

• S(u,x) ∈ {0, 1} is a binary success metric;

• u = (p, φ) ∈ R3 × S1 is a 4-DOF upright grasp candidate, with grasp centre p = (x, y, z) and gripper angle φ;

• x = (O, TO, TC , γ) is the state, comprising, respectively, the object geometry, the object and camera poses and the friction coefficient;

• y ∈ RW×H+ is a 2.5D point cloud (depth image with width W and height H) acquired using a camera

with known parameters.

The depth camera is aligned with each grasp proposal before capture, generating a 32 × 32 pixel

frame. Each grasp is sampled provided no collision is detected, and then a relaxed force-closure metric

is computed analytically with the geometric information of each object. Uncertainty is introduced in the

problem to model sensor errors using Monte-Carlo sampling. Specifically, object and gripper poses, as

well as the static friction coefficient are disturbed by Gaussian noise, during metric computation for grasp

candidates. Regarding data augmentation, each frame can be flipped vertically and horizontally due to

the gripper symmetry.
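Concretely, the robustness Q(u, y) defined above can be approximated by a Monte-Carlo average of the binary success metric over perturbed states. The sketch below only illustrates this idea; the noise magnitudes, the state fields and the success-evaluation callback are hypothetical and not taken from the Dex-Net code base.

import random

def robustness(grasp, state, evaluate_success, num_samples=100,
               pose_sigma=0.005, friction_sigma=0.1):
    """Estimate Q(u, y) = E[S(u, x) | u, y] by averaging the binary success
    metric S(u, x) over states x perturbed with Gaussian noise."""
    successes = 0
    for _ in range(num_samples):
        noisy_state = {
            "object_pose": [v + random.gauss(0.0, pose_sigma) for v in state["object_pose"]],
            "gripper_pose": [v + random.gauss(0.0, pose_sigma) for v in state["gripper_pose"]],
            "friction": max(0.0, state["friction"] + random.gauss(0.0, friction_sigma)),
        }
        successes += evaluate_success(grasp, noisy_state)   # S(u, x) in {0, 1}
    return successes / num_samples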

In our work, we wish to use Dex-Net as the main building block. Our goal is to generate a novel

grasping dataset, using 3D models scanned from real-world household objects found in KIT object

dataset [32]. Grasp candidates for these objects should be computed using the method of [43] so

we can establish an adequate baseline. Analogously, we should use this framework to provide baseline

grasp quality estimates, computed resorting to geometric metrics. At this stage, we introduce Domain

Randomisation in the metric used to evaluate a grasp candidate. Instead of relying on geometric-driven

metrics, we perform full dynamic simulation of grasping trials, while randomising physical properties

of the environment. We expect our method to provide a more robust estimate of grasp success, and

ultimately improve the grasps selected in a real-world robotic scenario.

4.1.3 Offline grasp sampling

Sampling grasp candidates in continuous 3D space (6-DOF) is no trivial undertaking, even for simple

robotic grippers with a single actuated joint. In this section, we describe the method of Mahler et al. [43],

which we employed in our own work, with minor modifications so as to fit our scenario.

Each object’s shape is defined by a signed distance function (sdf) f : R3 → R, which is 0 on the object

surface, positive outside the object, and negative within. The object surface for a given sdf f is denoted

by S = {y ∈ R3 | f(y) = 0}. All points are specified with respect to an object-centric reference frame

centred at the object centre of mass z and oriented along the principal axes of S. This representation is


chosen essentially for computational efficiency purposes, as it allows us to quickly sample points from

the object surface in a discrete grid.

Mahler et al. [43] uses a three-step grasp generation algorithm, respectively:

1. Sampling step: use a generic algorithm for sampling grasps, given the object polygon mesh (col-

lection of vertices, edges and faces that define the object shape) and gripper properties.

2. Coverage rejection step: compute the distance between grasps and prune those with a distance

smaller than Dmin. This step aims to achieve maximum object surface coverage.

3. Angle expansion step (optional): expand candidate grasps by sampling approach angles from a

discrete set of nθ possible values.

This iterative process is repeated until enough grasps have been generated, or a maximum number

of iterations is reached. The pseudocode for this method is presented in Algorithm 4.1. A superficial

complexity analysis shows that this algorithm is bound by Imax, yet each iteration executes a sampling

algorithm, which in turn may be an iterative process. Furthermore, computing distances between grasps

is quadratic with the number of grasps |Gi|, i.e. O(|Gi|2). Due to the non-trivial complexity of this

algorithm, one must carefully choose its parameters, to avoid unnecessary long computation times.

Dex-Net implements three distinct grasp sampling methods, with increasing computation power re-

quirements and expected output grasp quality. The simplest grasp sampling algorithm is uniform grasp

sampling. It consists of selecting pairs of points on the object surface at random and verifying whether

they could correspond to the contact points of a valid grasp. Gaussian sampling is slightly more sophis-

ticated. The grasp centre is sampled from a Gaussian distribution with mean coincident with the object’s

centre of mass and the grasp axis is chosen by sampling spherical angles uniformly at random. In our

work, we focus mainly on the third method, which supplies robust antipodal grasp candidates, albeit at

the expense of added complexity and longer run times.

4.1.3.A Antipodal grasp sampling

Antipodal points are a pair of points on an object surface whose surface normal vectors are collinear
but point in opposite directions. Under specific finger contact conditions, antipodal grasps ensure force-

closure. We provide a brief walkthrough of the sampling algorithm. We start off by uniformly sampling a

fixed number ns of surface points. For each sample psurf, we randomly perturb the surface point to obtain

p. At this point, we place a candidate contact and compute the properties of the approximate friction cone.

This step consists in approximating the friction cone by a pyramid with a predefined number of faces and

friction coefficient value. If the magnitude of the friction force is less than that of the tangential force, the

friction cone is invalid, as the object would slip, as per Equation (1.1). Should this be the case, we move

on to obtain a new point p and repeat the procedure, until a maximum number of samples N is reached.


Algorithm 4.1: Grasp generator. Consists of a 3-step process: (i) use a generic algorithm for sampling grasps; (ii) prune grasps for maximum surface coverage; (iii) expand the grasp candidate list by applying rotations.

Input:
    O     Object 3D mesh
    n     Target number of grasps
    k     Target multiplier
    s     Grasp sampling algorithm
    Imax  Maximum number of iterations
    nθ    Number of grasp rotations
    H     Gripper properties
Output:
    G     Grasp proposals

Function generateGrasps(O, n, k, Imax, nθ)
    G ← ∅;
    i ← 0;
    nr ← n ;                              // Number of grasps remaining
    while nr > 0 and i < Imax do
        ni ← nr × k;
        Gi ← s.sampleGrasps(O, ni, H);
        Gpi ← ∅ ;                         // Pruned grasps
        for grasp gi in Gi do
            dmin ← minimumDistanceToGrasps(gi, Gi \ {gi});
            if dmin > Dmin then
                Gpi ← Gpi ∪ gi;
        Gci ← ∅ ;                         // Candidate grasps
        for grasp gi in Gpi do
            for angle θj in nθ do
                gij ← gi;
                gij.setApproachAngle(θj);
                Gci ← Gci ∪ gij;
                nr ← nr − 1;
        G ← G ∪ Gci;
        k ← k × 2;
        i ← i + 1;
    G.shuffle();
    return G[:n] ;                        // Truncate to size n

If a valid friction pyramid is obtained we use it to uniformly sample grasp axes v. For each sampled

grasp axis v in v, we attempt to compute the second contact point. Additionally, there is a probability

of 50% that we use −v instead, so as to reduce the impact of incorrectly oriented surface normals on

the original object mesh. Some additional restrictions are then imposed. Firstly, we make sure that the

distance between the two estimated contact points fits in the open gripper. Secondly, we check if the

second contact point corresponds to a valid friction pyramid. Finally, we impose the force-closure condi-

tion. If all these restrictions are met, the grasp is added to the output set of grasps. The pseudocode for

this algorithm is shown in Algorithm 4.2. A set of antipodal grasp candidates is shown in Figure 4.1.


Algorithm 4.2: Antipodal grasp sampler. Samples antipodal grasps that fulfil the force-closure condition.

Input:
    O    Object 3D mesh
    n    Target number of grasps
    N    Maximum number of samples
    wM   Gripper maximum width
    wm   Gripper minimum width
Output:
    G    Antipodal grasp proposals

Function AntipodalSampler.sampleGrasps(O, n, N, wM, wm)
    G ← ∅;
    S ← surfacePoints(O);
    ns ← min(n, |S|);
    S ← ns random samples from S;
    for surface point psurf in S do
        for i ← 0 ; i < N ; i ← i + 1 do
            p ← perturb(psurf);
            c1 ← contact(O, p);
            f ← frictionCone(c1);
            if f is a valid friction cone (object does not slip) then
                V ← sampleFromFrictionCone(f);
                for direction v in V do
                    if random[0,1] > 0.5 then
                        v ← −v;
                    gi ← graspFromContactAndAxis(O, p, v, wM, wm);
                    if grasp gi is wide enough and is force closure then
                        G ← G ∪ gi;
    return G;
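To illustrate the antipodality/force-closure test used in the last step: under the Coulomb model, a two-finger grasp is force closure when the line between the contacts lies inside both friction cones, whose half-angle is arctan(µ). The check below is a minimal sketch under that condition (it assumes inward-pointing unit normals) and is not the Dex-Net implementation.

import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _norm(a):
    return math.sqrt(_dot(a, a))

def is_antipodal(c1, c2, n1, n2, mu):
    """Two-finger force-closure check for contacts c1, c2 with inward-pointing
    unit surface normals n1, n2 and friction coefficient mu."""
    axis = [b - a for a, b in zip(c1, c2)]               # grasp axis from c1 to c2
    length = _norm(axis)
    axis = [x / length for x in axis]
    half_angle = math.atan(mu)                           # friction cone half-angle
    angle_1 = math.acos(max(-1.0, min(1.0, _dot(n1, axis))))
    angle_2 = math.acos(max(-1.0, min(1.0, _dot(n2, [-x for x in axis]))))
    return angle_1 <= half_angle and angle_2 <= half_angle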

Figure 4.1: Antipodal grasp candidates for the Baxter gripper, aligned with the table normal, for 4 objects from the KIT dataset. Gripper colour encodes the robust force-closure metric as a gradient: green is stable, orange unstable. For additional

details refer to Section 4.4.


4.1.3.B Reference Frames

To fully understand the grasp generation process it is crucial to understand the reference frames in

use, as well as how to transform from one to another. The grasp candidates are computed assuming an

object-centric reference frame. However, during the dynamic trials, the robotic manipulator pose has to

be defined with reference to the simulation world reference frame. We adopt the following relevant ref-

erence frames: (1) World frame W; (2) Target object frame O; (3) Grasp canonical frame G; (4) Robotic
manipulator frame M; (5) Base link frame B.

The world frame W corresponds to the main reference frame. In the simulated environment, this

reference frame defines absolute coordinates. The target object frame O coincides with the object pose.

Grasp candidates are expressed w.r.t this reference frame. The grasp has its own reference frame G,

with y corresponding to the grasp axis (parallel to the vector between fingers) and x to the palm axis

(parallel to fingers). Each grasp candidate can be expressed by TG→O, the transformation from the

grasp canonical frame to the object frame. The relationship betweenM and G is constant over time and

defined by a rigid transformation. The base link frame B should be used for robot control and accounts

for a possible mismatch between the reference coordinates of the gripper with regard to the base link in

separate representations.
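To make the chain of frames explicit, and assuming the common convention in which TA→B maps coordinates expressed in frame A to frame B (so transformations compose right to left), the world pose of the manipulator for a given grasp candidate can be written as

TM→W = TO→W · TG→O · TM→G,

where TO→W is the object pose in the world, TG→O the grasp candidate defined above and TM→G the constant rigid transformation between the manipulator and the grasp canonical frame.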

Figure 4.2 shows a visual representation of the reference frames for Baxter’s robotic gripper, which

was used in our grasping trials. The main reason for choosing this gripper is that we had already

performed preliminary trials on Gazebo and coincidentally it was also one of the three available choices

in Dex-Net. We specifically portray the default convention for this gripper in the Dex-Net framework,

where the base reference frame is named the mesh frame and is used for visual representation of

the gripper and collision checking. Our simulated Baxter gripper base link employs a slightly different

reference frame, with a different offset in the z coordinate.

Figure 4.2: Reference frames for the Baxter gripper. From left to right we show the gripper base link frame B, the robotic manipulator or gripper frame M and the grasp canonical frame G.

We provide additional details about the representations for transformation between reference frames

used throughout our implementation in Appendix A.2.2.E.


4.2 GRASP: Dynamic grasping simulation within Gazebo

In the former section, we have stated our objective and main assumptions. Moreover, we motivated resorting to an existing pipeline for grasp proposal and quality estimation [43], and provided a brief analysis of the proposed offline grasp sampling algorithm. In this section, we introduce a novel framework for dynamic simulation of grasp trials in Gazebo.

We propose GRASP, a novel framework for performing dynamic simulation of grasp trial outcomes in Gazebo,
which is open source and available as grasp3. One of our greatest concerns was flexibility

and accessibility, to allow for straightforward integration with any existing robotic manipulator and object

dataset. So far, the development of this platform has produced

– A model plugin for programmatic control of a floating robotic manipulator, either a single DOF

gripper or a multi-fingered hand, with support for coupled (underactuated) joints4;

– An interface class for simplifying interaction with the former from a client application, which sup-

ports a uniform configuration across distinct manipulators;

– A server-side service that allows for efficiently checking whether two objects are colliding and, if

so, obtain the contacts at play at a given instant;

– A tool for creating Gazebo model datasets from object mesh files;

– A script for obtaining object stable poses via dynamic simulation, which leverages a novel model

plugin, attached to grasp target objects;

– A pipeline to compute grasp trial outcomes using full dynamic simulation, given a dataset of objects

and a set of grasp candidates per object computed externally for the desired robotic manipulator.

In the coming subsection, we motivate our choice of an adequate physics engine.

4.2.1 DART physics engine

By default, Gazebo uses a modified version of the ODE physics engine, which can be quite effective for

simulating mobile robots navigating in environments, possibly with uneven terrain. It can also be used

for more complex operations and testing inverse kinematic solvers, or even simple manipulation tasks,

as shown by in an online pick-and-place routine for Baxter5. However, attempting finer-grain manip-

ulation such as grasping complex objects typically results in physically unrealistic behaviour, such as

3 jsbruglie/grasp: Gazebo plugins for running grasp trials, on GitHub as of December 17, 2018
4 For more details on coupled joint emulation, check Appendix A.2.2.A.
5 Baxter MoveIt! Pick and Place Demo https://www.youtube.com/watch?v=BPOLWBsOnOQ, as of December 17, 2018


manipulator instability or links clipping through each other. One available solution is to compute grasp

metrics analytically, in order to determine whether or not a grasp would be successful, and then bypass

the grasp trial altogether and just attach the object to the manipulator, ceasing to compute further inter-

actions between the manipulator and the target. An implementation of this is available in gazebo-pkgs6.

In our preliminary trials, we were able to confirm that indeed the default setup would not allow us to

perform full dynamic simulation of grasping on complex objects. Furthermore, simple scenarios employ-

ing a gripper or complex hand would be highly unstable and typically result in a server crash, especially

if self-collision is partially enabled i.e. some links of the manipulator can collide with others. This be-

haviour is desired for instance when using Shadow Dexterous Hand, where the high number of degrees

of freedom would otherwise allow for a great number of impossible configurations, with links clipping

through each other.

Our initial assumption was that merely altering the underlying physics engine would allow for a faith-

ful simulation of complex grasping trials. Motivated by Johns et al. [31], we decided to perform our

experiments using DART. The authors perform dynamic simulation of parallel-plate grasping trials and

report that even though DART was slower than physics engines more suitable for computer graphics

applications, such as Bullet, it showed greater accuracy in modelling precise, real-world behaviour. One

of the few downsides of this decision includes resulting in a more involved setup process, as Gazebo

has to be compiled from source. In fact, due to a dependency issue with official software repositories,

we found ourselves forced to install DART and fcl7 from source as well.

4.2.2 Disembodied robotic manipulator

We do not aim to solve the problem of servoing, i.e. planning a valid trajectory so the hand can reach

the grasp candidate pose. Thus, as previously mentioned we employ a disembodied manipulator with full

mobility in 3D-space. To achieve this we propose a set of six virtual joints, each controlling respectively

x,y,z position and roll, pitch and yaw rotation. The former three are prismatic joints, whereas the latter

are rotational, with no lower or upper limits (also known as continuous). This particular solution was

suggested by one of Gazebo’s developers on the official community forums8, even though we found this
approach to cause great instability when using the ODE physics engine. Notice, however, that we did not

write any part of the framework to be physics-engine specific, and it is possible that in the future our

tools can be successfully employed in other engines. By default, we disable gravity for the manipulator

alone, which is equivalent to robust robot arm controllers maintaining the manipulator’s pose.

6 JenniferBuehler/gazebo-pkgs: A collection of tools and plugins for Gazebo, on GitHub as of December 17, 2018
7 flexible-collision-library/fcl: Flexible collision library, on GitHub as of December 17, 2018
8 Check http://answers.gazebosim.org/question/18109/movement-of-smaller-links-affecting-the-whole-body/?answer=18111#post-id-18111, as of December 17, 2018.


4.2.3 Hand plugin and interface

The main component of our novel grasp pipeline consists of a Gazebo Model plugin for robotic

hands. The latter can be interacted with directly or via a custom client interface, that provides high-level

commands such as moving fingers to a set of configurable positions, with coupled dynamic movement

(somewhat analogous to synergies [1]) and support for underactuated manipulators9. One of our main

concerns when developing this tool was to provide a uniform, yet highly customisable procedure for

sampling grasp trials for grippers and multi-fingered manipulators alike.

In our specific case, we desire to obtain the outcome of pre-computed grasp candidates, as ac-

curately and efficiently as possible. For this, we designed a client script which requires actions to be

performed sequentially, such as moving to a pre-grasp pose, then closing fingers, and so on. To enforce

the order in which these smaller tasks are carried out, we implemented a server-side timer, that can be

set up to trigger and notify a client with a configurable delay. The key concept here is that it actually

uses the simulation time, as real-world time cannot be used reliably. The reason for this is that each

simulation step corresponds to a constant simulated time step ∆t, which, in reality, may take a variable

amount of time to compute. Concretely, during grasp trials, when a higher number of contacts is present,

simulation performance can drop to 20%.
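The sketch below illustrates this sequencing for a single trial. All names (hand, world, wait_sim_time and their methods) are hypothetical placeholders for our plugin interfaces, and the lift-and-check step at the end is just one possible way of labelling the trial outcome, shown for illustration only.

def run_trial(hand, world, grasp_pose, wait_sim_time):
    """One grasp trial, ordered through the simulation-time timer.
    hand/world stand in for the hand plugin interface and world services."""
    hand.set_pose(grasp_pose)            # place the disembodied gripper at the candidate pose
    if world.in_collision(hand):         # discard candidates that start in collision
        return False
    hand.close_fingers()                 # request the grasp
    wait_sim_time(1.0)                   # wait one *simulated* second for contacts to settle
    hand.lift(0.20)                      # raise the gripper (illustrative success test)
    wait_sim_time(1.0)
    return world.object_still_grasped()  # label the trial outcome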

Since grasp candidates are provided externally, and, in our case, do not take into account collisions of

the gripper with either the object or the ground plane, some are bound to be invalid, for certain object rest

poses. In simulation, if objects clip through one another, the physics engine tries to solve the interaction,

and ultimately ends up creating a large number of contacts between the colliding surfaces. This will have

a tremendous performance impact. For this reason, we developed a fast plugin for checking whether

two objects are colliding and if so, obtaining the contact points10.

9 Check Appendix A.2.2.A for implementation details on how this is achieved despite not being available out-of-the-box in Gazebo.

10Check Appendix A.2.2.B for implementation details of our efficient strategy for querying the simulator for contacts.


4.3 Extending GAP with physics-based domain randomisation

In the preceding section, we described the main features of our novel framework for dynamic simulation of grasp trials in Gazebo. In the present section, we explain how we extended our set of tools for Domain Randomisation (DR) to incorporate randomisation of physical properties of objects in a scene.

In Section 3.1 we introduced a set of Gazebo plugins for applying DR to the visual appearance

of synthetic scenes. Since our present goal is to randomise physical properties, we decided to en-

hance our proposed framework GAP and develop a brand-new module for physics-based DR. The main

implemented features allow us to:

– Change gravity vector components;

– Alter the scale of existing models;

– Modify individual link properties, namely mass and inertia tensor;

– Change individual joint properties, such as angle lower and upper limits, as well as maximum effort

and velocity constraints;

– Adjust individual PID controller parameters and target values;

– Vary individual collision surface properties, including µ Coulomb friction coefficients. Some param-

eters are physics-engine specific, but our tool uses Gazebo’s own message types, and as such

should be fully compatible with the supported libraries.

Our tool includes yet another Gazebo World plugin, that has full access to the physics engine. We

also provide a client interface that can easily be imported into third-party applications and used as an

API. It provides a set of methods to easily access each of the aforementioned functions and automati-

cally packages the command in messages that are sent and parsed by the DR plugin instance, running

on the Gazebo server application. We designed the interface to be as flexible as possible, providing a

layer of abstraction to developers for simpler interaction with our tool. Since synchronisation tends to

be of utmost importance for automating robotic trials, and coordinating a sequence of events between

client and service execution, we provide a blocking request service that halts the client until an explicit

acknowledgement response is sent back.

4.3.1 Physics randomiser class

To integrate our novel physics DR tools in the dynamic grasp trial simulation pipeline we envisioned

a method for associating generic physical properties with specific random variations. For instance, we desired the


ability to tether the target object mass to a random scaling factor, itself behaving according to a uniform

distribution defined in a configurable interval [a, b]. Moreover, we wanted to be able to specify all of

the physical properties, affected models, links and joints, and the respective random distributions in a

simple, user-readable format.

For this, we designed the physics randomiser class, as part of our grasping pipeline toolkit. It maps

random distributions to physical properties specified in a YAML configuration file¹¹, samples values from

these distributions, and communicates with the Gazebo server via our custom API in GAP to update the

physical properties of the scene. This class allows the automation of the whole scene randomisation

process, through a clean interface, and swift testing, as all of the parameters are read from an external

file, upon launching the client application.
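The sketch below illustrates this mapping for the configuration format shown in Listing A.1 (Appendix A.2.2.D): it samples a scaling factor per property and forwards the updated values through a DR client; names are illustrative, and only the uniform distribution and the two properties used in our trials are handled.

import random
import yaml


def sample(dist_cfg):
    # Only the uniform case used in our trials is shown; log-uniform and
    # Gaussian distributions would be handled analogously.
    a, b = dist_cfg["uniform"]["a"], dist_cfg["uniform"]["b"]
    return random.uniform(a, b)


def randomise_scene(config_path, dr_client, target_object):
    # Read the YAML configuration and apply a fresh sample per property.
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    for prop, spec in cfg["properties"].items():
        factor = sample(spec["dist"])
        for tgt in spec["target"].values():
            # The TARGET_OBJECT keyword is replaced by the actual model name
            model = target_object if tgt["model"] == "TARGET_OBJECT" else tgt["model"]
            if prop == "link_mass":
                base = tgt["mass"]
                value = base + factor if spec["additive"] else base * factor
                dr_client.set_link_mass(model, tgt["link"], value)
            elif prop == "friction_coefficient":
                # additive: False in Listing A.1, so the sample acts as a coefficient
                dr_client.set_friction(model, tgt["link"],
                                       mu1=tgt["mu1"] * factor,
                                       mu2=tgt["mu2"] * factor)
    dr_client.send()  # blocking request, as described in Section 4.3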

This tool showcases the potential of our DR interface. Moreover, for improved maintainability we use

GAP as an external dependency and enforce version checks, to ensure third-party software is compatible

with installed binaries of our framework.

4.4 DexNet: Domain randomisation vs geometry-based metrics

In the former sections, we introduced a set of tools we designed for employing Domain Randomisation (DR) to the physical properties of a simulated grasp scenario. In this section, we describe in detail how we integrated parts of Dex-Net [43] in our grasp framework for Gazebo. Specifically, we outline the proposed experiment for validating the claim that indeed DR can lead to more robust grasp quality estimation.

4.4.1 Novel grasping dataset

Our current process for generating a grasping dataset can be summarised as follows:

1. Create and pre-process object dataset in Dex-Net;

2. Import objects into Gazebo compatible models;

3. Compute antipodal grasp candidates in Dex-Net and respective baseline quality metrics, from

object geometry;

4. Compute realistic object rest poses in Gazebo via physics simulation;

5. Import object rest poses into Dex-Net in order to obtain aligned grasp candidates;

6. Import resulting grasp candidates into Gazebo and perform grasp trials in order to ascertain

whether the grasp is robust or not, introducing DR by randomising properties between trial repetitions.¹¹

¹¹ An example configuration is provided in Appendix A.2.2.D.


We start by creating a novel object dataset, using the models from the KIT dataset [32]. It contains 3D models of 129 different household items, including not only boxes and cups but also objects with more complex shapes, such as small sculptures of dogs or horses. The latter evidently present a harder

challenge to parallel-plate grippers. The reasons for picking this dataset are twofold: (i) KIT had previously been used for training the Dex-Net grasp quality estimation network, which reduces the risk of

dataset-specific issues that could arise; (ii) the triangular meshes for each 3D object were simplified to

have roughly 800 faces, which is desirable, as this conveys a high level of detail while not being so many that dynamic simulation in Gazebo becomes painfully slow. Preliminary experiments with our pipeline showed that

very high detail 3D meshes, such as those provided in ShapeNet [9] (which in turn was also used in

Mahler et al. [43]), resulted in very poor performance, due to the high number of contacts in simulation.

We could, however, have generated lower-fidelity models using a 3D mesh editing tool such as the open-source MeshLab¹².

Our dataset is pre-processed in Dex-Net, which applies a scaling factor that ensures that the mini-

mum dimension of the object fits in the open gripper. In our simulated experiments, we used a Baxter

electric gripper with narrow aperture and basic solid fingertips. The main reason for this was that it was

already integrated with Dex-Net, and required minimal changes for functioning correctly. Moreover, we

had already worked previously with Baxter in Gazebo.

Then, we need to convert the processed meshes into Gazebo models, so they can be effortlessly

spawned in simulation. We provide a script that performs this task, which involves converting meshes to

a Gazebo-compatible format¹³.
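As an illustration of what such a conversion entails, the sketch below uses the open-source trimesh library to export a mesh to one of the formats Gazebo accepts and wraps it in a minimal SDF model; it is a simplified example and not our actual conversion script, and a complete Gazebo model additionally requires a model.config file and tuned inertial and collision properties.

import os
import trimesh  # open-source mesh library, used here purely for illustration

SDF_TEMPLATE = """<?xml version="1.0"?>
<sdf version="1.6">
  <model name="{name}">
    <link name="base_link">
      <visual name="visual">
        <geometry><mesh><uri>model://{name}/meshes/{name}.stl</uri></mesh></geometry>
      </visual>
      <collision name="collision">
        <geometry><mesh><uri>model://{name}/meshes/{name}.stl</uri></mesh></geometry>
      </collision>
    </link>
  </model>
</sdf>
"""


def obj_to_gazebo_model(obj_path, models_dir):
    # Convert a Dex-Net .obj mesh into an STL file plus a minimal SDF wrapper
    name = os.path.splitext(os.path.basename(obj_path))[0]
    mesh_dir = os.path.join(models_dir, name, "meshes")
    os.makedirs(mesh_dir, exist_ok=True)
    trimesh.load(obj_path).export(os.path.join(mesh_dir, name + ".stl"))
    with open(os.path.join(models_dir, name, "model.sdf"), "w") as f:
        f.write(SDF_TEMPLATE.format(name=name))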

We compute antipodal grasp candidates for each of these objects, as described in Section 4.1.3.A.

These grasp candidates essentially define the gripper contact points with the object that should result

in a successful grasp. However, the approach angle must be computed in such a way that the gripper

does not collide with the planar surface the object is resting on, henceforth referred to generically as the

table. Dex-Net uses an analytic method for obtaining stable poses with the associated probability, yet

early evaluation showed that these poses are not very reliable. This can be particularly undesirable when

running the dynamic simulation in Gazebo, as the object may start moving and intersect with the gripper.

Instead, we compute object rest poses in Gazebo, using our set of tools. A rest pose is obtained by

dropping the object from an initial height and ensuring that both positional and angular velocities drop

below a threshold ε.
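A minimal sketch of this criterion is shown below, assuming generic accessors for stepping the simulation and reading the object velocities; the sustained-stillness check is an illustrative refinement, not necessarily the exact condition used in our plugin.

import numpy as np


def wait_for_rest(step, get_velocities, eps=1e-3, settle_steps=100, max_steps=20000):
    # step(): advances the simulation by one physics step (assumed interface)
    # get_velocities(): returns (linear, angular) velocity 3-vectors of the object
    calm = 0
    for _ in range(max_steps):
        step()
        lin, ang = get_velocities()
        if np.linalg.norm(lin) < eps and np.linalg.norm(ang) < eps:
            calm += 1
            if calm >= settle_steps:  # require the object to stay still for a while
                return True           # the current pose can be recorded as a rest pose
        else:
            calm = 0
    return False                      # the object did not settle within the step budget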

We can then import these reliable rest poses into Dex-Net in order to align the grasp with the normal

of the table. This minimises the number of grasp poses that immediately result in a collision with the planar surface. These aligned grasp candidates are then exported and read by our pipeline. Finally, we

can proceed to perform grasp trials, within Gazebo’s simulated environment.

¹² MeshLab website http://www.meshlab.net/, as of December 17, 2018
¹³ Dex-Net generates .obj files by default, and Gazebo can only process Collada (.dae) or .stl files


The process of running simulated grasp trials is succinctly described in Algorithm 4.3.

Algorithm 4.3: Obtains success labels via dynamic simulation, given a dataset of objects and grasp candidates.

Input:  D, object dataset; G, grasp candidates dataset; M, manipulator interface; R, randomiser interface; N, maximum number of trials
Output: O, grasp outcomes

Function runTrials(D, G):
    setupCommunicationInterfaces(M, R)
    for each target object t in dataset D:
        G_t ← G.getGrasps(t)
        s_t ← D.getStablePoses(t)
        M.spawnObject(t, s_t.first())
        for each candidate g_ti in grasps G_t:
            for each of the N trial repetitions:
                R.randomizePhysics()
                r_tij ← tryGrasp(M, t, g_ti)        // trial success
                M.resetScene()
            r_ti ← (1/N) Σ_j r_tij
            O ← O ∪ {t, g_ti, r_ti}
    return O

Function tryGrasp(M, t, g):
    M.setPose(G_S)                                  // safe hand pose
    M.openFingers()                                 // grasp pre-shape
    M.setPose(g)                                    // grasp candidate pose
    if M in collision:
        return Error
    M.closeFingers()                                // grasp post-shape
    M.lift()
    if t in collision with ground plane:
        return 0.0
    return 1.0

4.4.2 Experiment: grasp quality estimation

For a comparative approach, we resort to Dex-Net’s robust force-closure, which performs Monte

Carlo sampling when computing the force-closure metric for a target object, randomising object and

gripper relative poses, as well as the friction coefficient between the two. Since our simulation can

only provide binary grasp success labels per trial, there is no evident way to directly compare them.

Similarly to what was originally performed in Mahler et al. [43], we propose a derived binary success


label, computed as follows:

S(u, x) = 1 if E_Q > δ, and 0 otherwise,    (4.1)

where E_Q denotes the expected grasp quality (robustness), in our case the robust force-closure metric.

We are not interested in invalid grasp poses, i.e. those which result in an immediate collision between the

gripper and the target object or planar surface. Since this can be easily verified in our simulation pipeline

we can automatically discard these grasps. Note however that this constraint depends on the object

resting pose, and future work can improve upon this by sampling grasps for several rest poses.

4.4.2.A Trials without domain randomisation

We started by validating our grasp candidate proposal integration by acquiring 200 antipodal grasps for each of the 129 objects of the KIT dataset¹⁴. We align these grasps by providing a realistic object rest

pose, computed in Gazebo using our framework. We then proceed to simulate grasp trials for each

of the aligned grasp candidates, without performing any randomisation whatsoever. Thus, there is no

point in repeating each trial, contrary to what is shown in Algorithm 4.3. Each grasp candidate is automatically

annotated with a binary success label s ∈ {0, 1}, or a flag indicating that the grasp candidate is invalid

due to collisions at the outset. Table 4.1 presents the averaged results of our grasp trials for the KIT dataset.

The object-wise results are shown in Table A.1.

Dataset   Successful trials   Failed trials   Invalid trials
KIT       13.4591 %           5.7578 %        80.7836 %
KIT–      14.8514 %           6.3529 %        78.7957 %

Table 4.1: Object-wise averaged percentage of successful, failed and invalid grasp trials for the KIT dataset. The KIT– dataset denotes KIT without the objects for which only invalid grasps were available. No randomisation or trial repetitions are employed.

A few objects have no valid grasp attempts for the considered stable pose. These include, for instance, very small objects or flat, thin boxes lying on the table, which typically led to the gripper colliding

with the planar surface. This high amount of invalid grasps is expected, as our grasp proposal module

only enforces constraints on the contact points, and aligns the grasp angle according to the planar

surface normal, for a given object stable pose. It does not, however, take into consideration the shape

of the gripper.

Notably, the number of successful trials is roughly 2.3 times that of failed grasp attempts. For instance, the can-shaped PineappleSlices object was successfully grasped 114 times and only dropped

once. We should mention that we did not oversee the full process of data generation, as it spans multiple

hours of execution, and have yet to thoroughly evaluate the realism of the successful grasp attempts.

¹⁴ Dex-Net antipodal grasp sampler was unable to provide the target 200 grasps for a few objects. Particularly, for the GlassBowl object no valid grasp candidate was retrieved, due to its large diameter and concave shape.


Figure 4.3 shows a time-lapse of a successful grasp attempt.

Figure 4.3: Time-lapse of successful grasp trial on CokePlasticLarge object

4.4.2.B Trials with domain randomisation

We then proceeded to apply DR to the physical properties in our grasp trials. Concretely, we vary

the friction coefficient of the gripper fingertips and the grasp target object, as well as the mass of the

object. We perform a total of 10 repetitions per valid grasp trial, subject to randomisation of properties

in between trials. The latter are specified in Table 4.2, and match those employed in [51].

Property                         Scaling factor
Target object mass               uniform([0.5, 1.5])
Surface friction coefficients    uniform([0.7, 1.3])

Table 4.2: Randomised physical properties and corresponding probability distributions for the sampler.

For each valid grasp candidate, we count the number of successful attempts. This value is then

averaged over the number of trials per grasp candidate (in this case 10). We wish to compare this

result with the binary success label for a single grasp trial, without employing DR. In order to do so, we

convert the robust force-closure metric to a binary value, by choosing a threshold δ (cf. Equation (4.1)).

The outcome average success rate for trials employing DR is thresholded at 0.5, which corresponds to

classifying a grasp as a success when at least 50% of the trials succeeded. We then use the thresholded

robust force-closure as our ground truth, and formulate the metric correlation as a binary classification

problem. Thus, we can effectively compare the metrics from trials with and without DR.
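The comparison can be summarised by the following sketch, which thresholds the two metrics and measures their agreement as binary classification accuracy (variable names are illustrative, not code from our pipeline).

import numpy as np


def accuracy_vs_delta(force_closure, dr_success_rate, deltas):
    # force_closure: robust force-closure value per valid grasp candidate
    # dr_success_rate: success rate per candidate, averaged over the 10 DR trials
    predictions = np.asarray(dr_success_rate) >= 0.5      # our derived label
    accuracies = []
    for delta in deltas:
        ground_truth = np.asarray(force_closure) > delta  # cf. Equation (4.1)
        accuracies.append(float(np.mean(predictions == ground_truth)))
    return np.array(accuracies)

# Example sweep over thresholds, as plotted in Figure 4.4:
# acc = accuracy_vs_delta(fc_values, success_rates, np.linspace(0.0, 1.0, 21))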

We evaluate the accuracy at several values of δ, which corresponds to varying the level at which our

ground truth classifies a grasp to be successful. Figure 4.4 presents the accuracy for six objects with

and without physics-based DR (orange and blue, respectively). These objects were selected such that each had at least 15 % valid grasp trials.

Figure 4.4: Accuracy of our binary success metric, for the grasp trials with and without physics-based DR (orange and blue, respectively). Accuracy is computed for different values of the threshold δ. Panels: (a) MelforBottle, (b) ChoppedTomatoes, (c) FruitDrink, (d) TomatoHerbSauce, (e) CondensedMilk, (f) DanishHam. The figure shows results for 6 of the 129 objects used for simulations.

It seems that for low values of the threshold δ, i.e. a weaker constraint for classifying a grasp as successful, DR causes our grasp quality estimate to differ from the ground-truth binary metric. Evaluating the concrete detections (number of true positives, true negatives and so on), we identify that DR causes a higher number of false negative detections, which means that it provides a more conservative grasp quality estimate, as it rejects low-confidence grasps when the baseline classifies them as a success.

This generally results in the lower accuracy observed in each subfigure, for smaller δ. On the other

hand, for higher values of δ, we notice that DR results in fewer false positives, once more

suggesting that the metric is more conservative.

Overall, for objects with sufficient valid grasp trials (> 15 %), DR results in a more accurate metric,

using the robust force-closure metric from Dex-Net as ground truth, for high values of δ.

However, there are exceptions. Notably, we present the case of the CondensedMilk object (Figure 4.4(e)), which seems to suggest that the metric with DR is inferior. In simulation, the gripper was

able to pick up the object with several grasp configurations consistently, even though the ground truth

metric predicted a very small success rate for each of them (always below 30 %). Finally, in Figure 4.4(f)

we show that for a few objects there is no noticeable difference between the two proposed metrics.

Even though we are looking at a fairly limited sample, these are positive results, as they demonstrate

that DR can diminish the risk of wrongly selecting a grasp pose as a viable candidate, when it does not

exhibit robustness. However, to adequately support this claim, additional and more thorough testing is

needed.


5 Discussion and Conclusions

Contents

5.1 Discussion
5.2 Conclusions
5.3 Future Work

Life is one long struggle between conclusions based on abstract ways of conceiving cases,

and opposite conclusions prompted by our instinctive perception of them.

William James, 1890


5.1 Discussion

The previous sections described our approach, experimental trials and their outcome. In this section, we reiterate our major findings and attempt to unveil the causes for unexpected results.

Regarding our object detection trials, our work was the first to demonstrate the application of DR in a

multi-class scenario. Our preliminary trials suggested that even with small synthetic datasets and even

smaller number of domain-specific real images it was possible to train state–of–the–art object detectors

and achieve acceptable performance. However, at that point it was unclear what the impact of having pre-trained the networks on the massive COCO [40] dataset was. Furthermore, we believed that

performance was bottlenecked by the size of our synthetic dataset.

Our subsequent experiments with SSD showed that we could successfully train our object detec-

tion network from scratch employing only synthetic images. Initially, one could argue that knowledge

from the generic classes of the COCO dataset may fail to generalise to our parametric shape classes.

Yet, pre-training networks on COCO was shown to be useful, as it outperformed both networks trained

solely on synthetic data. However, the synthetic datasets employing DR allowed SSD to substantially

outperform the COCO baseline when fine-tuned on a limited dataset of domain-specific real images, for

a much smaller number of iterations. This development demonstrates that domain-specific synthetic datasets

employing DR can lead to substantial improvements over pre-training on huge generic datasets, which

is currently a common approach.

We concluded that it is possible to achieve great results with small synthetic datasets. In an unre-

ported exploratory trial, we generated a dataset of 200,000 simulated scenes, only to attain a negligible

improvement in detection performance. This result is in line with the findings in Tremblay et al. [73],

suggesting that more is not always better. Additionally, we verified that it was possible to achieve better performance with the smaller dataset of 6,000 images with no flat texture than by employing 30,000 with

all possible texture variations.

We found that even though only the work of James et al. [29] employs Perlin noise for DR, this texture

proved to be the most useful among the patterns mentioned in the literature.

Contrary to our expectations, we found that adding camera variations to our synthetic datasets low-

ered performance on the real test set. We presume this is a direct consequence of our test set showing

no noticeable differences in camera pose between scenes. Thus, unnecessary variations merely in-

crease the difficulty of the learning task. We re-emphasise that the goal of this work was to generalise

detections across unseen instances of the categories, given only a small real dataset; inevitably, data artefacts always leak into such limited datasets. However, for DR to work optimally, one should only

randomise aspects of the environment that are expected to vary in the real scenario. For instance, con-

sidering a moving viewpoint may be useful for a robot with moving eyes but detrimental for fixed, statically installed cameras. This result holds even though we did not put any effort into aligning the simulated camera with

the real one.

In our experiments, we did not evaluate the impact of using distractor shapes, although this has become a common DR technique. In our particular case, these could correspond to shapes such as pyramids or cones, or even shapes with arbitrarily complex geometry. Nevertheless, our test set contains spurious irrelevant

objects, and our detectors were able to correctly classify them as background.

With respect to our novel pipeline for the generation of synthetic grasping datasets, we faced two

major challenges. The first is that a very high fraction of proposed grasps had to be discarded at the

outset, as these led to the robotic manipulator colliding with either the object or planar surface. The

employed grasp proposal from Dex-Net [43] has to be adapted to incorporate knowledge of the gripper

geometry, so as to maximise the number of collision-free grasp candidates. The second concerns the

tuning of both simulation parameters and the randomisation of physics properties, in order to strike an

adequate balance between realistic simulation and a dataset for robust grasp estimation.

We have performed trials with and without the application of DR and argue that applying DR leads to a better grasp quality estimation w.r.t. robustness, as evidenced by the results described in Sections 4.4.2.A

and 4.4.2.B.

5.2 Conclusions

In the previous section, we presented the discussion of our findings, relating them to our own initial expectations and the results of relevant related work. In this section, we wrap up the main body of the document, stating our major conclusions and summarising our contributions.

Recalling the research questions presented in Section 1.1.2, we hereby compile this work’s major

findings.

(i) RQ1: Can we successfully train state–of–the–art object detectors with synthetic datasets

created exploiting Domain Randomisation in a multi-class scenario? The experiments presented

in Section 3.2 have shown that indeed one can transfer the benefits of applying DR to synthetic dataset

generation into a multi-class object detection scenario. Moreover, we propose an open-source dataset

generation framework integrated with the established robotic simulator Gazebo. We successfully trained two state–of–the–art deep CNNs with non-photo-realistic simulated image datasets and obtained decent, yet

not overwhelming preliminary results. However, this was unsurprising, as some of the main aspects of

the objects' appearance, namely their scale and the possibility of having distinct textures per object face, were not thoroughly considered in the synthetic data generation. Because of this, networks fully trained

on synthetic data were unable to fully transfer performance to real images.


(i) RQ2: Can small domain-specific synthetic datasets generated employing DR be preferable to

pre-training object detectors on huge generic datasets, such as COCO? The analysis described

in Section 3.3.4 allows us to conclude that DR can be used to train networks that outperform the widespread domain adaptation method of fine-tuning on large, publicly available datasets, by as much as 26% in the mAP metric. This effectively demonstrates that DR is a viable option when dealing with small amounts

of annotated real images for a multi-class object detection task.

(i) RQ3: Which is the impact of each of the texture patterns most commonly employed for visual

Domain Randomisation, in an object detection task? The ablation study described in Section 3.3.5

established the importance of each of the four texture patterns found in the recent relevant literature. We

found that by removing the more complex textures, the performance of the state–of–the–art object detector SSD on the real image test set would drop considerably, despite having comparable results on the validation set

during the fine-tuning stages. This suggests that patterns of higher complexity are more relevant for the

generalisation capability of object detectors, which although intuitive, had not been clearly demonstrated

prior to our work.

(ii) RQ4: How can we integrate dynamic simulation of robotic grasping trials into a state–of–

the–art framework for grasp quality estimation? In Section 4.3 we describe how we extend GAP in

order to allow for the randomisation of physical properties such as object mass and friction coefficients,

in Gazebo. We present a novel framework for dynamic grasp trials in Gazebo (Section 4.2), which can

be used for generating grasp success labels, under physical property randomisation. We use Dex-

Net [43], an existing state–of–the–art grasping framework, for the generation of grasp candidates and the provision of the respective geometric grasp metrics. We propose replacing the latter with our own grasp quality

estimation metric obtained from dynamic trials in a simulated environment. Furthermore, this study

establishes that, contrary to numerous reports, the Gazebo simulator can indeed be used for simulating complex grasp trials and producing realistic outcomes.

(ii) RQ5: Can DR improve synthetic grasping dataset generation, employing a physics-based

grasp metric in simulation? In Section 4.4 we describe how we use our novel framework for dynamic

simulation of grasp trials and acquire a dataset pairing grasp candidates with respective expected quality

metrics. We then perform a comparative study, reported in Sections 4.4.2.A and 4.4.2.B, analysing the

impact of applying DR to physical properties of objects, and conclude that DR leads to a more robust

grasp quality estimation. Our claim is substantiated by using Dex-Net [43] robust grasp quality metric to

provide ground truth success labels, and compare them with the outcome of our trials with and without

DR.


5.3 Future Work

Regarding visual perception, in this work we have studied the simplest form of domain adaptation,

namely fine-tuning, as it is by far the most popular approach due to its simplicity. We could also employ

more sophisticated forms of domain adaptation, as GANs become an increasingly viable option, even

when only limited amounts of labelled training data is available. In real scenarios where final perfor-

mance metric is usually the most pertinent consideration, various data augmentation techniques such

as colour or brightness distortions, random crops, should be added to the training pipeline of DR to

improve the generalisation capabilities of the detector at test time.

Moreover, in order to further validate our findings, we should deploy our detector in an adversarial test

set, which exhibits heavy lighting condition variations, different camera poses and other disturbances.

This should demonstrate the robustness of our detectors in comparison with the baseline.

Regarding the grasp quality estimation pipeline, the roadmap is clear. First of all, to verify that we can indeed train deep neural networks for evaluating grasp candidates with our physics-DR-enriched dataset, we should train the CNN of Dex-Net 2.0 [43]. We should then compare the performance of the baseline network, trained with geometric grasp quality measures, against that of a network trained with the labels produced by our pipeline.

Furthermore, we barely scratched the surface regarding the properties we can randomise in simu-

lation, and future work could leverage the implemented features, employing a large object dataset with

several trial repetitions, in order to obtain a sizeable grasping dataset exploiting DR for robustness

and generalisation capabilities.

We provide a framework for multi-fingered robotic manipulators but have yet to perform any mean-

ingful simulated experiments employing such robots. Future work could extend our work to dexterous

grasping, even employing discrete grasp synergies, which are already supported in our framework. Ul-

timately, by providing a wrapper for a grasp quality estimation pipeline employing such a dataset, one

could move on to perform validation trials in a real dexterous robot. Currently, we already support our

in-house Vizzy platform for performing full dynamic grasp trial simulation.


Bibliography

[1] A. Bernardino, M. Henriques, N. Hendrich, and J. Zhang, “Precision grasp synergies for dexterous

robotic hands,” in 2013 IEEE International Conference on Robotics and Biomimetics (ROBIO), Dec.

2013, pp. 62–67.

[2] A. Bicchi and V. Kumar, “Robotic grasping and contact: A review,” in Proceedings 2000 ICRA.

Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia

Proceedings (Cat. No.00CH37065), vol. 1, Apr. 2000, pp. 348–353 vol.1.

[3] J. Bohg, A. Morales, T. Asfour, and D. Kragic, “Data-Driven Grasp Synthesis-A Survey,” IEEE Trans-

actions on Robotics, vol. 30, no. 2, pp. 289–309, Apr. 2014.

[4] J. Borrego, A. Dehban, R. Figueiredo, P. Moreno, A. Bernardino, and J. Santos-Victor, “Applying

Domain Randomization to Synthetic Data for Object Category Detection,” ArXiv e-prints, Jul. 2018.

[5] J. Borrego, R. Figueiredo, A. Dehban, P. Moreno, A. Bernardino, and J. Santos-Victor, “A Generic

Visual Perception Domain Randomisation Framework for Gazebo,” in 2018 IEEE International Con-

ference on Autonomous Robot Systems and Competitions (ICARSC), Apr. 2018, pp. 237–242.

[6] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor,

K. Konolige, S. Levine, and V. Vanhoucke, “Using Simulation and Domain Adaptation to Improve

Efficiency of Deep Robotic Grasping,” CoRR, vol. abs/1709.07857, 2017.

[7] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba,

“OpenAI Gym,” CoRR, vol. abs/1606.01540, 2016.

[8] R. Calandra, A. Owens, M. Upadhyaya, W. Yuan, J. Lin, E. H. Adelson, and S. Levine, “The Feeling

of Success: Does Touch Sensing Help Predict Grasp Outcomes?” CoRR, vol. abs/1710.05512,

2017.

[9] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q.-X. Huang, Z. Li, S. Savarese,

M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “ShapeNet: An Information-Rich 3D Model

Repository,” CoRR, vol. abs/1512.03012, 2015.


[10] F.-J. Chu, R. Xu, and P. A. Vela, “Deep Grasp: Detection and Localization of Grasps with Deep

Neural Networks,” CoRR, vol. abs/1802.00520, 2018.

[11] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object Detection via Region-based Fully Convolutional

Networks,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama,

U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 379–387.

[12] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical

image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),

Jun. 2009, pp. 248–255.

[13] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal Visual

Object Classes (VOC) Challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp.

303–338, Jun. 2010.

[14] C. Ferrari and J. Canny, “Planning optimal grasps,” in Proceedings 1992 IEEE International Confer-

ence on Robotics and Automation, May 1992, pp. 2290–2295 vol.3.

[15] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lem-

pitsky, “Domain-adversarial Training of Neural Networks,” J. Mach. Learn. Res., vol. 17, no. 1, pp.

2096–2030, Jan. 2016.

[16] R. Girshick, “Fast R-CNN,” in 2015 IEEE International Conference on Computer Vision (ICCV), Dec.

2015, pp. 1440–1448.

[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature Hierarchies for Accurate Object

Detection and Semantic Segmentation,” in 2014 IEEE Conference on Computer Vision and Pattern

Recognition (CVPR), Jun. 2014, pp. 580–587.

[18] M. Gualtieri and R. Platt, “Learning 6-DoF Grasping and Pick-Place Using Attention Focus,” ArXiv

e-prints, Jun. 2018.

[19] M. Gualtieri, A. ten Pas, K. Saenko, and R. P. Jr., “High precision grasp pose detection in dense

clutter,” CoRR, vol. abs/1603.01564, 2016.

[20] D. Guo, F. Sun, H. Liu, T. Kong, B. Fang, and N. Xi, “A hybrid deep architecture for robotic grasp

detection,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), May 2017,

pp. 1609–1614.

[21] H. V. Hasselt, “Double Q-learning,” in Advances in Neural Information Processing Systems 23, J. D.

Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, Eds. Curran Associates,

Inc., 2010, pp. 2613–2621.


[22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in 2016

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 770–778.

[23] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Computer Vision (ICCV), 2017

IEEE International Conference On, 2017, pp. 2980–2988.

[24] M. Hodosh, P. Young, and J. Hockenmaier, “Framing Image Description As a Ranking Task: Data,

Models and Evaluation Metrics,” J. Artif. Int. Res., vol. 47, no. 1, pp. 853–899, May 2013.

[25] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam,

“MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” CoRR, vol.

abs/1704.04861, 2017.

[26] S. Ivaldi, J. Peters, V. Padois, and F. Nori, “Tools for simulating humanoid robot dynamics: A survey

based on user feedback,” in 2014 IEEE-RAS International Conference on Humanoid Robots, Nov.

2014, pp. 842–849.

[27] S. Jain and C. K. Liu, “Controlling Physics-based Characters Using Soft Contacts,” in Proceedings

of the 2011 SIGGRAPH Asia Conference, ser. SA ’11. Hong Kong, China: ACM, 2011, pp. 163:1–

163:10.

[28] N. Jakobi, “Evolutionary Robotics and the Radical Envelope-of-Noise Hypothesis,” Adaptive Behav-

ior, vol. 6, no. 2, pp. 325–368, 1997.

[29] S. James, A. J. Davison, and E. Johns, “Transferring End-to-End Visuomotor Control from Simula-

tion to Real World for a Multi-Stage Task,” in Proceedings of the 1st Annual Conference on Robot

Learning, ser. Proceedings of Machine Learning Research, S. Levine, V. Vanhoucke, and K. Gold-

berg, Eds., vol. 78. PMLR, Nov. 2017, pp. 334–343.

[30] Y. Jiang, S. Moseson, and A. Saxena, “Efficient grasping from RGBD images: Learning using a new

rectangle representation,” in 2011 IEEE International Conference on Robotics and Automation, May

2011, pp. 3304–3311.

[31] E. Johns, S. Leutenegger, and A. J. Davison, “Deep learning a grasp function for grasping under

gripper pose uncertainty,” in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ. IEEE, Oct.

2016, pp. 4461–4468.

[32] A. Kasper, Z. Xue, and R. Dillmann, “The KIT object models database: An object model database

for object recognition, localization and manipulation in service robotics,” The International Journal

of Robotics Research, vol. 31, no. 8, pp. 927–934, 2012.

[33] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” ArXiv e-prints, Dec. 2013.


[34] N. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot

simulator,” in 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

(IEEE Cat. No.04CH37566), vol. 3, Sep. 2004, pp. 2149–2154 vol.3.

[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional

Neural Networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C.

Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.

[36] S. Kumra and C. Kanan, “Robotic grasp detection using deep convolutional neural networks,” in

2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sep. 2017,

pp. 769–776.

[37] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recog-

nition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[38] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The International

Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015.

[39] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning Hand-Eye Coordination for Robotic

Grasping with Deep Learning and Large-Scale Data Collection,” CoRR, vol. abs/1603.02199, 2016.

[40] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Mi-

crosoft COCO: Common Objects in Context,” in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla,

B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 740–755.

[41] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single Shot

MultiBox Detector,” in Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling,

Eds. Cham: Springer International Publishing, 2016, pp. 21–37.

[42] J. Mahler, F. T. Pokorny, B. Hou, M. Roderick, M. Laskey, M. Aubry, K. Kohlhoff, T. Kröger, J. Kuffner,

and K. Goldberg, “Dex-net 1.0: A cloud-based network of 3d objects for robust grasp planning using

a multi-armed bandit model with correlated rewards,” in IEEE International Conference on Robotics

and Automation (ICRA), 2016, pp. 1957–1964.

[43] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-Net 2.0:

Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics,”

CoRR, vol. abs/1703.09312, 2017.

[44] J. Mahler, M. Matl, X. Liu, A. Li, D. V. Gealy, and K. Goldberg, “Dex-Net 3.0: Computing Robust

Robot Suction Grasp Targets in Point Clouds using a New Analytic Model and Deep Learning,”

CoRR, vol. abs/1709.06670, 2017.


[45] A. T. Miller and P. K. Allen, “Graspit! A versatile simulator for robotic grasping,” IEEE Robotics

Automation Magazine, vol. 11, no. 4, pp. 110–122, Dec. 2004.

[46] P. Moreno, R. Nunes, R. Figueiredo, R. Ferreira, A. Bernardino, J. Santos-Victor, R. Beira, L. Var-

gas, D. Aragão, and M. Aragão, “Vizzy: A Humanoid on Wheels for Assistive Robotics,” in Robot

2015: Second Iberian Robotics Conference, L. P. Reis, A. P. Moreira, P. U. Lima, L. Montano, and

V. Muñoz-Martinez, Eds. Cham: Springer International Publishing, 2016, pp. 17–28.

[47] D. Morrison, P. Corke, and J. Leitner, “Closing the Loop for Robotic Grasping: A Real-time, Gener-

ative Grasp Synthesis Approach,” CoRR, vol. abs/1804.05172, 2018.

[48] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Bridging the Gap Between Value and Policy

Based Reinforcement Learning,” CoRR, vol. abs/1702.08892, 2017.

[49] V. Nguyen, “Constructing force-closure grasps,” in Proceedings. 1986 IEEE International Confer-

ence on Robotics and Automation, vol. 3, Apr. 1986, pp. 1368–1373.

[50] ——, “Constructing force-closure grasps in 3D,” in Proceedings. 1987 IEEE International Confer-

ence on Robotics and Automation, vol. 4, Mar. 1987, pp. 240–245.

[51] OpenAI, “Learning Dexterous In-Hand Manipulation,” ArXiv e-prints, Aug. 2018.

[52] K. Perlin, “Improving Noise,” in Proceedings of the 29th Annual Conference on Computer Graphics

and Interactive Techniques, ser. SIGGRAPH ’02. San Antonio, Texas: ACM, 2002, pp. 681–682.

[53] M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin,

M. Chociej, P. Welinder, V. Kumar, and W. Zaremba, “Multi-Goal Reinforcement Learning: Chal-

lenging Robotics Environments and Request for Research,” CoRR, vol. abs/1802.09464, 2018.

[54] V. L. Popov, “Coulomb’s Law of Friction,” in Contact Mechanics and Friction: Physical Principles

and Applications. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 133–154.

[55] M. Quigley, K. Conley, B. P. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “ROS:

An open-source Robot Operating System,” in ICRA Workshop on Open Source Software, 2009.

[56] D. Quillen, E. Jang, O. Nachum, C. Finn, J. Ibarz, and S. Levine, “Deep Reinforcement Learning

for Vision-Based Robotic Grasping: A Simulated Comparative Evaluation of Off-Policy Methods,”

CoRR, vol. abs/1802.10264, 2018.

[57] J. Redmon and A. Angelova, “Real-time grasp detection using convolutional neural networks,” in

2015 IEEE International Conference on Robotics and Automation (ICRA), May 2015, pp. 1316–

1322.


[58] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time

Object Detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),

Jun. 2016, pp. 779–788.

[59] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection

with Region Proposal Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence,

vol. 39, no. 6, pp. 1137–1149, Jun. 2017.

[60] A. Rocchi, B. Ames, Z. Li, and K. Hauser, “Stable simulation of underactuated compliant hands,”

in 2016 IEEE International Conference on Robotics and Automation (ICRA), May 2016, pp. 4938–

4944.

[61] A. Rodriguez, M. T. Mason, and S. Ferry, “From Caging to Grasping,” Int. J. Rob. Res., vol. 31,

no. 7, pp. 886–900, Jun. 2012.

[62] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,

M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,”

International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, Dec. 2015.

[63] F. Sadeghi, A. Toshev, E. Jang, and S. Levine, “Sim2Real Viewpoint Invariant Visual Servoing by

Recurrent Control,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR),

Jun. 2018.

[64] K. Shimoga, “Robot Grasp Synthesis Algorithms: A Survey,” The International Journal of Robotics

Research, vol. 15, no. 3, pp. 230–266, 1996.

[65] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recog-

nition,” CoRR, vol. abs/1409.1556, 2014.

[66] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and

A. Rabinovich, “Going deeper with convolutions,” in 2015 IEEE Conference on Computer Vision

and Pattern Recognition (CVPR), Jun. 2015, pp. 1–9.

[67] C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, Inception-ResNet and the Impact of Residual

Connections on Learning,” AAAI Conference on Artificial Intelligence, Feb. 2016.

[68] A. ten Pas and R. Platt, “Using Geometry to Detect Grasps in 3D Point Clouds,” CoRR, vol.

abs/1501.03100, 2015.

[69] A. ten Pas, M. Gualtieri, K. Saenko, and R. Platt, “Grasp Pose Detection in Point Clouds,” The

International Journal of Robotics Research, vol. 36, no. 13-14, pp. 1455–1473, 2017.


[70] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for

transferring deep neural networks from simulation to the real world,” in IROS. IEEE, Sep. 2017,

pp. 23–30.

[71] J. Tobin, W. Zaremba, and P. Abbeel, “Domain Randomization and Generative Models for Robotic

Grasping,” CoRR, vol. abs/1710.06425, 2017.

[72] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in 2012

IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct. 2012, pp. 5026–5033.

[73] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boo-

choon, and S. Birchfield, “Training Deep Networks with Synthetic Data: Bridging the Reality Gap by

Domain Randomization,” CoRR, vol. abs/1804.06516, 2018.

[74] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, “Selective Search for

Object Recognition,” International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, Sep.

2013.

[75] M. Veres, M. Moussa, and G. W. Taylor, “Modeling Grasp Motor Imagery Through Deep Conditional

Generative Models,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 757–764, Apr. 2017.

[76] M. Veres, M. A. Moussa, and G. W. Taylor, “An Integrated Simulator and Dataset that Combines

Grasping and Vision for Deep Learning,” CoRR, vol. abs/1702.02103, 2017.

[77] X. Yan, M. Khansari, Y. Bai, J. Hsu, A. Pathak, A. Gupta, J. Davidson, and H. Lee, “Learning Grasp-

ing Interaction with Geometry-aware 3D Representations,” CoRR, vol. abs/1708.07303, 2017.

[78] F. Zhang, J. Leitner, M. Milford, and P. Corke, “Sim-to-real Transfer of Visuo-motor Policies for

Reaching in Clutter: Domain Randomization and Adaptation with Modular Networks,” CoRR, vol.

abs/1709.05746, 2017.


A Appendix

Contents

A.1 Metrics
    A.1.1 Precision and Recall
    A.1.2 Intersection over Union (IoU)
    A.1.3 Average Precision
    A.1.4 F1 score
    A.1.5 Loss functions
A.2 Implementation
    A.2.1 GAP Implementation details
    A.2.2 GRASP Implementation details
A.3 Additional Material

A.1 Metrics

In this section, we elaborate on a few common metrics that are referenced throughout the thesis and

well-known in the research fields of machine learning and pattern recognition.

A.1.1 Precision and Recall

Precision essentially measures how accurate a positive prediction is. It corresponds to the fraction

of relevant instances among all of the retrieved instances. On the other hand, recall measures how many of the truly relevant instances were actually retrieved. Mathematical definitions

are provided in Equations (A.1) and (A.2) respectively, where T and F prefixes stand for true and false,

and P and N suffixes for positive and negative detections.

precision = TP / (TP + FP) = TP / (all detections)    (A.1)

recall = TP / (TP + FN) = TP / (all ground truths)    (A.2)

Precision and recall metrics cannot be both optimal, as their objectives clash. Usually, we attempt

to find a balance between the two, in an attempt to not miss positive detections, but not wrongly identify

false positives. For this reason, it is invaluable to plot precision-recall curves. Since we had an imbal-

anced dataset, precision-recall curves were helpful in our research, and can be seen in Figures 3.10

and 3.13.

A.1.2 Intersection over Union (IoU)

In the context of pattern recognition, IoU serves as an evaluation metric for assessing the perfor-

mance of an object detection algorithm. Given a predicted bounding box for an object in an image, and

a ground truth bounding box annotation, it weighs the area of the overlapping region between the two

with respect to the area of their union. It is computed as seen in Equation (A.3).

IoU = (area of overlap) / (area of union)    (A.3)

A.1.3 Average Precision

Given a precision-recall curve p(r), defined for r ∈ [0, 1], AP corresponds to the average value of p(r)

which effectively corresponds to the area under the curve, as seen in Equation (A.4).

AP = ∫₀¹ p(r) dr    (A.4)


When working on a discrete set of points from the p(r) curve, the integral becomes a finite sum over its

computed values, as shown in Equation (A.5).

AP = Σ_{k=1}^{K} P(k) Δr(k)    (A.5)

The mAP for a set of Q queries is the mean of the AP scores for each query q, as presented in Equation (A.6).

mAP = (1/Q) Σ_{q=1}^{Q} AP(q)    (A.6)
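For illustration, the finite sum of Equation (A.5) can be computed from a discrete precision-recall curve as follows (a small NumPy example, assuming recall values sorted in increasing order; not code from our evaluation scripts).

import numpy as np


def average_precision(precision, recall):
    # precision[k], recall[k]: precision and recall at the k-th operating point,
    # with recall sorted in increasing order
    recall = np.concatenate(([0.0], np.asarray(recall, dtype=float)))
    return float(np.sum(np.asarray(precision, dtype=float) * np.diff(recall)))

# e.g. average_precision([1.0, 0.8, 0.6], [0.2, 0.5, 0.9]) ≈ 0.68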

A.1.3.A PASCAL VOC Challenge

In the PASCAL Visual Object Classes 2007 challenge, the interpolated AP is computed for each object class [13], and defined as the mean precision at a set of eleven evenly spaced recall levels r ∈ {0, 0.1, . . . , 1}, as follows:

AP_PASCAL = (1/11) Σ_{r ∈ {0, 0.1, ..., 1}} p_int(r).    (A.7)

The precision at each recall level r is interpolated by selecting the maximum measured precision for which the corresponding recall r̃ satisfies r̃ ≥ r:

p_int(r) = max_{r̃ : r̃ ≥ r} p(r̃).    (A.8)

In this case, the mAP is given by computing the mean of the per-class AP:

mAP_PASCAL = (1/C) Σ_{c=1}^{C} AP_c.    (A.9)

In order to plot the precision-recall curves, one must establish the criteria for classifying a detection

as correct. In this case, the class must match the ground truth, and the IoU of the predicted and ground

truth bounding boxes must reach a threshold (typically IoUmin = 0.5).

A.1.4 F1 score

The F1 score, also known as the F-score or F-measure, is a metric that conveys the balance between precision and recall. Specifically, it is the harmonic mean of precision and recall, and is thus given by

F1 = ((precision⁻¹ + recall⁻¹) / 2)⁻¹ = 2 · (precision · recall) / (precision + recall).    (A.10)


A.1.5 Loss functions

In machine-learning and optimisation, a loss function, otherwise known as a cost function, is a func-

tion that maps a set of inputs to a real number. An optimisation problem aims at minimising a cost

function. An objective function can either be a loss function or its negative, in which case the problem

becomes how to maximise it. When the latter case applies, it is commonly designated by reward, profit,

utility or fitness function.

A.1.5.A L1-norm and L2-norm

The L1-norm loss function minimises the sum of the absolute differences between the target values y_i and the estimated values ŷ_i. It is also commonly named Least Absolute Deviations and is given by

L1 = Σ_i |y_i − ŷ_i| = ‖y − ŷ‖_1.    (A.11)

Minimising this loss function corresponds to predicting the conditional median of y. Smooth L1-norm

addresses the problem that the standard L1 norm is not differentiable at 0, and is given by

L1_smooth(x) = 0.5 · x²  if |x| ≤ 1,  and  |x| − 0.5  otherwise.    (A.12)

The L2-norm loss function is also known as the Least Squares error, and is computed as follows:

L2 = √( Σ_i (y_i − ŷ_i)² ) = ‖y − ŷ‖_2.    (A.13)

The aforementioned error functions thrive in different scenarios. Robustness is defined as the ability

to reject outliers in a dataset. L1-norm is more robust than L2, as the latter greatly increases the cost of

outliers as it uses the squared error. However, L2-norm provides a more stable solution, in the sense that

it exhibits less variation when the dataset is slightly disturbed. Given that L2 corresponds to a Euclidean

distance, there is only a single solution. On the other hand, L1 is a Manhattan distance, which means it

can lead to multiple possible solutions.
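For reference, the three loss functions above can be written compactly with NumPy as follows (illustrative implementations, not code from our pipeline).

import numpy as np


def l1_loss(y_true, y_pred):
    # Equation (A.11): sum of absolute differences
    return float(np.sum(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def smooth_l1(x):
    # Equation (A.12): quadratic near zero, linear elsewhere
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1.0, 0.5 * x ** 2, np.abs(x) - 0.5)


def l2_loss(y_true, y_pred):
    # Equation (A.13): Euclidean norm of the residual
    return float(np.sqrt(np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))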


A.2 Implementation

In this section, we provide relevant implementation details for the two developed frameworks. These

provide additional information to the official automatically generated documentation, that is hosted on-

line. Throughout development, we encountered several issues derived from the usual constraints of

building on top of existing code, and some of the solutions we found to be effective are neither intuitive

nor very straightforward. Whereas understanding these design choices might not be crucial for using

our tools from an end-user standpoint, these additional reports might be helpful for developers that may

one day decide to extend our platform.

A.2.1 GAP Implementation details

The official documentation of gap¹ is hosted online² and was generated automatically using Doxygen³. It provides full coverage of plugin classes and interfaces, as well as additional utilities and ex-

amples. This section describes relevant implementation details of some features of our framework,

motivating design choices and clarifying seemingly convoluted procedures.

A.2.1.A Simultaneous multi-object texture update

Our first version of the dataset generation pipeline spawned and removed objects for each scene. We then proposed an enhanced approach that reuses object instances already in simulation. This led us to the creation of a Gazebo plugin that handles an object's visual appearance during simulation.

Gazebo’s modular design results in a barrier between physics and rendering classes, from the API

standpoint. Furthermore, the provided conventional Visual plugins are tied to a single model’s visual

appearance. These conditions persuaded us to create a Visual plugin, which is instanced once per

object in the scene, to which it is linked. Each object is able to specify which textures it should load and

supports loading a group of materials that match a given prefix. The latter feature was particularly useful

in our texture ablation study (Section 3.3.5).

These plugin instances share communication request and response topics. Because of this, we

are able to broadcast a single request to every connected peer simultaneously. Each object visual is

either modified or reset depending on whether its name is mentioned in the target field of the request

message. Thus, by specifying which objects should be updated, we implicitly state that the remaining

should perform some default behaviour (in this case resetting to its initial state). Each object has its

own instance of the plugin and is responsible for providing feedback to the client once texture has been

¹ jsbruglie/gap: Gazebo plugins for applying domain randomisation, on GitHub as of December 17, 2018
² Official GAP documentation http://web.tecnico.ulisboa.pt/joao.borrego/gap/, as of December 17, 2018
³ Doxygen: Generate documentation from source code http://www.doxygen.nl/, as of December 17, 2018


updated. The client application (in our case the scene generation script) is tasked with keeping track of

which plugin instances have already replied.

A.2.1.B Creating a custom synthetic dataset

We provide a brief tutorial on how to create a custom synthetic dataset using our framework, which

is available online⁴. In order to do so, one must first launch the Gazebo server and open a world file that in

turn loads our WorldUtils plugin. Then, in a separate terminal, we launch our scene_example client,

specifying the number of scenes to generate and dataset output directories for images and annotations.

This script assumes that the custom models for the ground plane, box, cylinder, sphere and camera

are located in a valid Gazebo resource directory and can be loaded in simulation. Parameters such as

which textures should be loaded and camera properties should be specified in these models' respective

SDF files. Finally, we developed a Python script to debug resulting datasets, which displays images and

annotation overlays, specifying both class and bounding box, per object.
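A minimal sketch of such an overlay, assuming annotations stored as a class label plus a pixel-coordinate bounding box, is given below (illustrative, not the exact script shipped with GAP).

import matplotlib.pyplot as plt
import matplotlib.patches as patches


def show_annotations(image, annotations):
    # annotations: iterable of dicts {"class": str, "bbox": (xmin, ymin, xmax, ymax)}
    fig, ax = plt.subplots()
    ax.imshow(image)
    for ann in annotations:
        xmin, ymin, xmax, ymax = ann["bbox"]
        ax.add_patch(patches.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin,
                                       fill=False, linewidth=2))
        ax.text(xmin, ymin, ann["class"], color="white",
                bbox={"facecolor": "black", "alpha": 0.6})
    plt.show()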

A.2.1.C Randomising robot visual appearance

Gazebo provides built-in features for programmatically altering objects during simulation, by listening

to named topics for particular requests. The latter include for instance messages for Visuals which

define the visual appearance of some model. Although in most of our proposed tools we handle parsing

requests ourselves, our DR interface uses these features directly. This is mostly because the physics

and rendering engines are decoupled and, as such, the plugin that handles altering physical properties has no control whatsoever over object visuals. We chose to use the default /visual topic, rather

than implementing custom visual plugins, as we had previously done for our object detection synthetic

dataset generation.

The randomisation of the visual appearance of each of the links in a robotic hand was performed in

OpenAI [51]. Figure A.1 shows Shadow Dexterous Hand robotic manipulator undergoing randomisation

of individual link colours.

Figure A.1: Shadow Dexterous Hand with randomised individual link textures

⁴ Custom synthetic dataset tutorial https://github.com/jsbruglie/gap/tree/dev/examples/scene_example, as of December 17, 2018


A.2.1.D Known issues

During our synthetic dataset generation, we noticed that Gazebo resources are automatically loaded

into memory. However, when textures are replaced by others, memory usage does not decrease. This suggests there is a memory leak concerning the texture images. We believe this to be a bug in Gazebo itself, specifically in Rendering::Visual.setMaterial(), but have made no further

efforts to identify the cause.

A.2.2 GRASP Implementation details

The official documentation of grasp⁵ is hosted online⁶ and was generated automatically using Doxygen.

A.2.2.A Simulation of underactuated manipulators

Underactuated manipulators correspond to robotic hands which have fewer actuated DOF than the overall number of moving joints. This allows for increased dexterity and compliant designs, while reducing

the number of required actuators and the overall complexity of the manipulator.

By default, Gazebo does not provide support for such joints, although a community-made Mimic Joint plugin⁷ exists. Previous experience with an older version of this plugin while working with Vizzy [46] had led

to underwhelming results, as the plugin commands collided with already existing PID instances. This

would typically result in unrealistic and buggy behaviour, sometimes even allowing fingers to clip through

objects.

For this reason, we provide our own implementation of this feature as part of our manipulator plugin

(HandPlugin), which reuses existing PID controllers and can control several joints at once. We have

found our solution to produce realistic behaviour and improved stability.

We need only specify a joint group, which consists of an actuated joint, and pairs of mimic joints

and respective multipliers. We model the coupling between actuated and mimic joints with a linear

relationship. Given the target value t_a for an actuated joint, the target for the i-th mimic joint, associated with multiplier m_i, is given by t_i = m_i · t_a. The joint groups, initial target values and type of PID control

should be specified in the hand model SDF description.
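The coupling can be illustrated with the following minimal snippet (illustrative names; the actual computation is performed inside the HandPlugin itself).

def mimic_targets(actuated_target, mimic_joints):
    # mimic_joints: iterable of (joint_name, multiplier) pairs
    return {name: multiplier * actuated_target for name, multiplier in mimic_joints}

# e.g. mimic_targets(0.8, [("finger_proximal", 1.0), ("finger_distal", 0.5)])
# yields {"finger_proximal": 0.8, "finger_distal": 0.4}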

⁵ jsbruglie/grasp: Gazebo plugins for grasping in Gazebo, on GitHub as of December 17, 2018
⁶ Official GRASP documentation http://web.tecnico.ulisboa.pt/joao.borrego/grasp/, as of December 17, 2018
⁷ Set of Gazebo plugins by the Robotics Group, University of Patras. https://github.com/roboticsgroup/roboticsgroup_gazebo_plugins, as of December 17, 2018


A.2.2.B Efficient contact retrieval

In Gazebo, contact information is broadcast using a named topic, to which a contact manager pub-

lishes. The latter creates filters for deciding which contacts should be published, for minimal performance

impact. However, we found that using the provided sensor plugin class and monitoring the communi-

cation channel caused a severe simulation slowdown (by as much as 30%). Since our specific

case does not require permanent monitoring of contacts, but rather evaluation at certain key instants,

we decided on an alternate approach.

Essentially, we developed a plugin that accepts requests with the names of the links for which to test for contacts. When a request is received, the plugin briefly subscribes to the contact topic and unsubscribes as soon as the first message arrives. If no other subscribers exist, regular simulation performance is then resumed. Even though this may seem convoluted, it was actually inspired by an official solution to a similar problem, namely the feature of Gazebo's graphical client that toggles the visualisation of contacts in simulation.
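A minimal sketch of this on-demand subscription pattern is shown below, assuming Gazebo classic's transport API; the plugin class, request handling and contact filtering are simplified stand-ins for our actual implementation.

#include <string>
#include <gazebo/gazebo.hh>
#include <gazebo/physics/physics.hh>
#include <gazebo/transport/transport.hh>
#include <gazebo/msgs/msgs.hh>

namespace gazebo
{
  class ContactRequestSketch : public WorldPlugin
  {
    public: void Load(physics::WorldPtr /*_world*/, sdf::ElementPtr /*_sdf*/)
    {
      this->node = transport::NodePtr(new transport::Node());
      this->node->Init();
    }

    // Called whenever a client requests a contact check for a set of links
    public: void OnContactRequest()
    {
      // Subscribing briefly to the contacts topic triggers publication
      this->sub = this->node->Subscribe("~/physics/contacts",
          &ContactRequestSketch::OnContacts, this);
    }

    private: void OnContacts(ConstContactsPtr &_msg)
    {
      // Inspect only the first message, e.g. check whether the requested
      // links appear in any collision pair; replying to the client is omitted
      for (int i = 0; i < _msg->contact_size(); ++i)
      {
        const std::string &c1 = _msg->contact(i).collision1();
        const std::string &c2 = _msg->contact(i).collision2();
        (void) c1; (void) c2;
      }
      // Unsubscribe immediately so regular performance is resumed
      this->sub.reset();
    }

    private: transport::NodePtr node;
    private: transport::SubscriberPtr sub;
  };
  GZ_REGISTER_WORLD_PLUGIN(ContactRequestSketch)
}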

A.2.2.C Adding a custom manipulator

Currently, our framework has out-of-the-box support for two variants of the Baxter gripper (short and narrow, and long and wide fingers), the Shadow Dexterous Hand and Vizzy's [46] right hand. Vizzy is our in-house robotic platform, mostly used for research related to human-robot interaction. Creating a custom manipulator involves the following steps: (i) obtain the robot manipulator's Unified Robot Description Format (URDF) file (typically generated using xacro) and simplify its collision model; (ii) add virtual joints and a HandPlugin instance to the URDF file; (iii) create an entry in the robot configuration file for the interface class.

In order to achieve efficient and reliable simulation, the manipulator's collision model should be simple and consist of convex components, or even parametric volumes such as cylinders and boxes. We performed this step manually for Vizzy's right hand, but could instead have resorted to the open-source 3D model editing software MeshLab8 to programmatically generate convex, low-polygon counterparts of each link mesh.

The remaining steps of the process are further detailed on the repository's official webpage. Finally, we started creating a tool with a graphical interface for streamlining this process, which worked reasonably well for our pipeline at the time. Currently, the tool is not operational and requires fixing some simple parsing issues.

8MeshLab website http://www.meshlab.net/, as of December 17, 2018


A.2.2.D Adjusting randomiser configuration

Our randomiser class can be fully configured with a single YAML file, which essentially allows one to specify:

– Which properties to randomise, e.g. link masses, model scale vector components, joint damping or friction coefficients;

– The probability distribution tied to each property, currently supporting uniform, log-uniform and Gaussian distributions with customisable parameters;

– The target joints, links or models of the property randomisation, supporting generic grasp target objects by employing a pre-defined keyword that is later replaced by the appropriate model name;

– Whether the random sample is additive or a multiplicative coefficient applied to each property's initial value.

The randomiser configuration used in the trials described in Section 4.3 is presented in Listing A.1.

properties:
  link_mass:
    dist:
      uniform: {a: 0.5, b: 1.5}
    additive: False
    target:
      0: {model: TARGET_OBJECT, link: base_link, mass: 0.250}
  friction_coefficient:
    dist:
      uniform: {a: 0.7, b: 1.3}
    additive: False
    target:
      0: {model: TARGET_OBJECT, link: base_link, mu1: 10000, mu2: 10000}
      1: {model: baxter, link: gripper_l_finger_tip, mu1: 1000, mu2: 1000}
      2: {model: baxter, link: gripper_r_finger_tip, mu1: 1000, mu2: 1000}

Listing A.1: Example randomiser configuration
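To make the semantics of these fields concrete, the sketch below (plain C++ with hypothetical structures, not the actual randomiser implementation) draws a sample from a uniform distribution and applies it either additively or as a coefficient of the property's initial value, mirroring the link_mass entry of Listing A.1.

#include <iostream>
#include <random>

// Hypothetical description of a randomised property, mirroring the YAML fields
struct RandomisedProperty
{
  double a{0.0}, b{1.0};   // uniform distribution bounds
  bool additive{false};    // additive sample vs. multiplicative coefficient
  double initial{0.0};     // property's initial value (e.g. link mass)
};

// Draw one sample and return the new property value
double Randomise(const RandomisedProperty &prop, std::mt19937 &gen)
{
  std::uniform_real_distribution<double> dist(prop.a, prop.b);
  double sample = dist(gen);
  return prop.additive ? prop.initial + sample : prop.initial * sample;
}

int main()
{
  std::mt19937 gen(std::random_device{}());
  // Example: target object mass of 0.250 kg scaled by a coefficient drawn
  // uniformly from [0.5, 1.5], as in Listing A.1
  RandomisedProperty linkMass{0.5, 1.5, false, 0.250};
  std::cout << "Randomised mass: " << Randomise(linkMass, gen) << " kg\n";
  return 0;
}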

A.2.2.E Reference frame: Pose and Homogeneous Transform matrix

In our work, we use 6D poses (x, y, z, roll, pitch, yaw) and 4 × 4 homogeneous transformation matrices interchangeably when working with transforms between reference frames. Ignition's math library provides operators for chaining transform operations, which correspond to multiplying the respective transformation matrices.
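The sketch below illustrates this equivalence using Ignition's math types. The frame names are made up for the example, and the pose-composition line reflects our understanding of Ignition's operator convention.

#include <iostream>
#include <ignition/math/Pose3.hh>
#include <ignition/math/Matrix4.hh>

int main()
{
  using namespace ignition::math;

  // Hypothetical frames: an object pose expressed in the camera frame, and the
  // camera pose expressed in the world frame (x, y, z, roll, pitch, yaw)
  Pose3d objInCamera(0.1, 0.0, 0.5, 0.0, 0.0, 0.0);
  Pose3d cameraInWorld(1.0, 2.0, 0.8, 0.0, 0.3, 1.57);

  // Chain the transforms explicitly as homogeneous matrices
  Matrix4d T_wc(cameraInWorld);   // world <- camera
  Matrix4d T_co(objInCamera);     // camera <- object
  Matrix4d T_wo = T_wc * T_co;    // world <- object
  Pose3d objInWorldFromMatrices = T_wo.Pose();

  // Equivalent chaining with the pose operator (note the reversed argument
  // order with respect to matrix multiplication, per Ignition's convention)
  Pose3d objInWorldFromPoses = objInCamera + cameraInWorld;

  std::cout << objInWorldFromMatrices << "\n" << objInWorldFromPoses << "\n";
  return 0;
}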


A.2.2.F Known issues

Currently, our grasp simulation pipeline still exhibits some software stability issues, likely stemming from our still limited understanding of Gazebo's internal physics API and its lower-level interactions with the DART engine. This seems only natural, as the open-source simulator has grown into a complex tool, with its own share of issues and subtle, intricate procedures which are hard to fully understand in such a short time period. During earlier stages of testing our framework, we noticed that the Gazebo server would sometimes crash when running for several hours at a time. When attempting to diagnose the issue, we arrived at the conclusion that it was related to some low-level computation in FCL (Flexible Collision Library), yet made no further effort to fix it. Instead, we merely added protection mechanisms to ensure minimal stress on the physics engine.

At present, our pipeline can sample hundreds of grasps for multiple objects without any significant problems, although some physics-related bugs may still occur in extreme cases where the forces exerted on the object are very large.

A.3 Additional Material

In this section, we present supplementary material which was not included in the main body of the document: while potentially useful, it is not crucial to understanding our work.

Algorithm A.1 presents the pseudocode description of the algorithm for Perlin noise generation, outlined in Section 3.1.1.C.

Table A.1 presents the complete results of the dynamic simulation trials described in Section 4.4.2.A, listing the per-object counts of each outcome label across the up to 200 grasp trials performed per object in the KIT object dataset [32].


Algorithm A.1: Perlin noise generator. In our 2D image case, only the x and y coordinates are provided. For z, a uniformly random z_i coordinate is provided per image channel.

Input : x, y, z coordinates of target point t; p permutation table
Output: n Perlin noise value

Function generatePerlinNoise(x, y, z)
    /* Compute coordinates in unit cube */
    {x_f, y_f, z_f} ← computeCoordinatesUnitCube(x, y, z)
    /* Compute fade curves */
    {u, v, w} ← {fade(x_f), fade(y_f), fade(z_f)}
    /* Hash coordinates of the 8 cube corners from permutation table p */
    {aaa, aba, aab, abb, baa, bba, bab, bbb} ← getHash(p)
    x1 ← lerp(gradient(aaa, x_f, y_f, z_f),         gradient(baa, x_f − 1, y_f, z_f),         u)
    x2 ← lerp(gradient(aba, x_f, y_f − 1, z_f),     gradient(bba, x_f − 1, y_f − 1, z_f),     u)
    x3 ← lerp(gradient(aab, x_f, y_f, z_f − 1),     gradient(bab, x_f − 1, y_f, z_f − 1),     u)
    x4 ← lerp(gradient(abb, x_f, y_f − 1, z_f − 1), gradient(bbb, x_f − 1, y_f − 1, z_f − 1), u)
    y1 ← lerp(x1, x2, v)
    y2 ← lerp(x3, x4, v)
    n ← (lerp(y1, y2, w) + 1) / 2
    return n

Function fade(t)
    /* For smoother transition between gradients */
    return 6t^5 − 15t^4 + 10t^3

Function lerp(a, b, t)
    /* Linear interpolation given inputs a, b for parameter t */
    return a + t(b − a)

Function gradient(hash, x, y, z)
    /* Convert the lower 4 bits of hash into one of 12 gradient directions */
    h ← hash & 0b1111                    // obtain the lower 4 bits of hash
    u ← (h < 0b1000) ? x : y             // if the MSB of h is 0 then u = x, else u = y
    if h < 0b0100 then
        v ← y                            // if the two MSBs of h are 0, v = y
    else if h = 0b1100 or h = 0b1110 then
        v ← x                            // if the two MSBs of h are 1, v = x
    else
        v ← z
    /* Use the last 2 bits of h to decide whether u and v are positive or negative */
    u ← (h & 0b0001 = 0) ? u : −u
    v ← (h & 0b0010 = 0) ? v : −v
    return u + v


Object  Success  Failure  Invalid
Amicelli  16  61  123
BakingSoda  62  0  138
BakingVanilla  58  5  137
BathDetergent  19  20  161
BlueSaltCube  23  0  177
BroccoliSoup  31  0  169
CatLying  8  0  171
CatSitting  7  0  193
CeylonTea  2  0  198
ChickenSoup  22  2  176
ChocMarshmallows  38  1  161
ChocSticks2  0  0  200
ChocSticks  0  0  200
ChocoIcing  26  12  162
ChocolateBars  43  0  157
ChoppedTomatoes  71  2  127
Clown  1  1  109
CoffeeBox  17  22  161
CoffeeCookies  8  34  158
CoffeeFilters2  18  19  163
CoffeeFilters  21  7  172
CokePlasticLarge  15  1  184
CokePlasticSmallGrasp  22  13  165
CokePlasticSmall  32  0  168
CondensedMilk  44  36  120
CoughDropsBerries  10  25  165
CoughDropsHoney  8  34  158
CoughDropsLemon  5  32  163
Curry  12  11  177
DanishHam  66  3  131
Deodorant  11  0  189
Dog  0  0  192
DropsCherry  70  1  129
DropsOrange  65  1  134
Dwarf  12  1  187
FennelTea  20  31  149
Fish  2  0  198
FizzyTabletsCalcium  16  2  182
FizzyTablets  13  3  184
FlowerCup  34  23  143
FruitBars  0  0  200
FruitDrink  42  1  157
FruitTea  70  0  130
GreenCup  0  71  125
GreenSaltCylinder  58  2  140
HamburgerSauce  11  5  184
Heart1  0  0  200
HerbSalt  30  9  161
HeringTin  0  0  200
HotPot2  52  0  148
HotPot  30  2  168
HygieneSpray  6  1  193
InstantDumplings  26  41  133
InstantIceCoffee  42  40  118
InstantMousse  4  19  177
InstantSauce2  36  0  164
InstantSauce  11  36  153
InstantSoup  0  0  195
InstantTomatoSoup  27  0  173
JamSugar  55  4  141
KnaeckebrotRye  32  45  123
Knaeckebrot  29  42  129
KoalaCandy  28  28  144
LetterP  16  3  181
LivioClassicOil  57  1  142
LivioSunflowerOil  45  3  152
Margarine  0  0  156
Marjoram  13  0  187
MashedPotatoes  6  37  157
MelforBottle  47  0  153
MilkDrinkVanilla  72  1  127
MilkRice  6  17  177
Moon  0  1  199
MuesliBars  27  0  173
NutCandy  0  0  200
NutellaGo  0  0  183
OrangeMarmelade  63  4  133
OrgFruitTea  53  46  101
OrgHerbTea  46  59  95
Paprika  36  0  164
PatchesSensitive  17  32  151
Patches  45  1  154
Peanuts2  27  21  152
Peanuts  23  32  145
Peas  30  37  133
PineappleSlices  114  1  85
Pitcher  10  0  190
Pony  1  0  163
PotatoeDumplings  0  0  200
PotatoeStarch  10  45  145
PotatoeSticks  54  1  145
PowderedSugarMill  18  10  172
PowderedSugar  18  41  141
RavioliLarge  56  1  143
RedCup  97  0  103
Rice  75  2  123
RuskWholemeal  64  2  134
Rusk  101  7  92
SardinesCan  30  0  170
SauceThickener  18  23  159
SauerkrautSmall  27  36  137
Sauerkraut  29  53  118
Seal  0  1  97
Shampoo  17  1  182
ShowerGel  22  1  177
SmacksCereals  2  22  176
SmallGlass  66  10  124
SoftCakeOrange  3  25  172
SoftCheese  56  2  142
Sprayflask  1  0  199
Sprudelflasche  18  22  160
StrawberryPorridge  3  23  174
Sweetener  19  1  180
TomatoHerbSauce  59  0  141
TomatoSauce  43  3  154
Toothpaste  0  0  200
Tortoise  1  1  54
ToyCarYelloq  16  0  184
VitalisCereals  11  28  161
Wafflerolls  31  16  153
Waterglass  62  3  135
WhippedCream2  5  11  184
WhippedCream  15  16  169
Wineglass  0  1  199
YellowSaltCube2  7  7  186
YellowSaltCube  39  0  161
YellowSaltCylinderSmall  20  8  172
YellowSaltCylinder  37  0  163

Table A.1: Object-wise number of successful, failed and invalid grasp trials for the KIT dataset. No randomisation or trial repetitions were employed.
