Feature Learning and Structured
Prediction for Scene Understanding
Salman H. Khan
This thesis is presented for the degree of
Doctor of Philosophy
of The University of Western Australia
School of Computer Science and Software Engineering.
28 Feb 2016
Abstract
The visual comprehension ability of humans is remarkable: even a young child can
easily describe events happening in a scene, differentiate between different scene
types, identify objects present in a scene and effortlessly reason about their location
and geometry. The ultimate goal of computer vision is to mimic these astounding
capabilities of human vision. However, after ∼50 years of progress in this area,
computer vision is still far from the scene understanding capabilities of a toddler.
In this dissertation, we aim to further extend the frontiers of computer vision
by investigating robust feature learning and structured prediction frameworks for
visual scene understanding. This dissertation is organized as a collection of research
manuscripts which have either been published in or submitted to internationally
refereed conferences and journals.
The dissertation explores two distinct aspects of scene understanding and analy-
sis. First, we explore improved feature representations for scene understanding tasks.
We investigate both hand-crafted as well as automatically learned feature represen-
tations using deep neural networks. Second, we propose new structured prediction
models to incorporate rich relationships between both low-level and high-level scene
elements. More specifically, we study some of the most important sub-tasks under
the umbrella of scene understanding such as semantic labelling, geometric and vol-
umetric reasoning, object shadow detection and removal, scene categorization and
change detection and analysis. The proposed algorithms in this dissertation pertain
to different data modalities including RGB images, RGB+Depth data, underwater
imagery, dermoscopy images, synthetic images and spectral data from satellites.
A major hurdle towards the goal of scene understanding is the limited availability
of data and annotations. This dissertation also contributes towards this aspect by
gathering two new datasets along with their annotations. Moreover, we present
methods to directly deal with specific data related issues e.g., recovery of missing
data, learning with only weak supervision and handling highly unbalanced datasets
during model learning. Our proposed approaches show very promising results on a
diverse set of scene understanding tasks. We hope that this dissertation will inspire
more such efforts to realise the ultimate objective of visual scene understanding in
machine vision.
Acknowledgements
I am deeply thankful to my supervisors, Mohammed Bennamoun, Roberto Togneri,
Ferdous Sohel and Imran Naseem. They provided me with their full support and
encouragement during my stay at UWA. I especially want to thank my Principal
Supervisor, Mohammed, for inspiring me to work hard, making himself available to
answer my questions at all times and providing his continuous feedback on my
work. Had it not been for his sheer academic and professional brilliance, this journey
would have been very difficult. Thank you for your advice, guidance and contributions
to my research.
I want to express my gratitude towards Yvette Harrap and Kelli Pierce for their
administrative assistance; Ryan McConigly, Samuel Thomas and Daniel Ross for
their technical and IT support; Brian Skjerven and Ashley Chew for help with the
iVEC supercomputer. I am also thankful to Mark Reynolds (Head of School) and
other staff members at the School of Computer Science and Software Engineering
(CSSE) for their help and support during my candidature.
I am greatly indebted to my colleagues and fellow postgraduate students at the
UWA for making this journey comfortable and sharing some pleasant moments to-
gether. I am especially thankful to my friends Ammar Mahmood, Umar Asif, Naveed
Akhter and Zohaib Khan. But this list is not complete without a special person,
Munawar Hayat, whose companionship was crucial to this thesis. We had many
fruitful discussions about science, religion, politics and life in general, which helped
me a lot in getting through tough times.
I am thankful to my mentors, peers, collaborators and organisations which sup-
ported me during this period. I would like to especially thank Xuming He and Fatih
Porikli (NICTA, ANU) for providing valuable support and supervising me during my
internship at NICTA. I am thankful to Faisal Shafait and Arif Mahmood for their
beneficial support and encouraging comments during our interactions. I appreciate
the financial and logistic support offered by the UWA (IPRS Scholarship), ARC
(DP150104251, DP110102166, DP150100294 and DE120102960), NICTA (hosting
my internship), NVIDIA (for donating GPUs) and Geoscience Australia (GA) for
providing the data and the expert annotations. I am grateful to numerous people,
including Prof. Dani Lischinski from Hebrew University, Jian Zhang from Stanford
University, Prof. Graham D. Finlayson from University of East Anglia, Prof. Mark
Drew from Simon Fraser University, who replied to my repeated queries regarding
their research. I am also thankful to my peers, whose quality research inspired me,
and anonymous reviewers, who provided valuable feedback and comments which
greatly helped me improve my publications.
I owe a great deal to my family. I want to thank my mother, Rukhsana, my
father, Abdul Hameed and all of my elder brothers and sisters, who brought me
up with their love and affection, and taught me the virtues of honesty, hard-work,
commitment and perseverance. I especially want to express my gratitude towards
my mother, for her devotion to our upbringing and countless prayers all through
these years. I am also indebted to my wonderful wife, who provided me with her
continuous support and care. To my little son, Qasim, you are the one whose smile
makes me forget all the worries after a long tiring day! Thank you for being with
us.
Finally, and above all, I am profoundly grateful to my Lord for holding me stead-
fast in the face of confusion, doubt and disappointment. He has been a continuous
driving force during this long journey. I wish I could thank him enough for his
blessings and favors. ‘Our Lord! Accept (this service) from us: For Thou art the
All-Hearing, the All-knowing. Our Lord! bestow on us Mercy from Thyself, and
dispose of our affair for us in the right way!’. (Al-Quran)
Contents
List of Tables vii
List of Figures ix
Publications Included in this Thesis xiv
Contribution of Candidate to Published Papers xvii
1 Introduction 1
1.1 Background and Definitions . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Geometry Driven Semantic Understanding of Scenes . . . . . . 7
1.3.2 Automatic Shadow Detection and Removal . . . . . . . . . . . 8
1.3.3 Joint Estimation of Clutter and Objects’ Spatial Layout . . . 8
1.3.4 A Discriminative Representation of Convolutional Features . . 9
1.3.5 Cost-Sensitive Learning of Deep Feature Representations . . . 10
1.3.6 Weakly Supervised Change Detection in a Pair of Images . . . 10
1.3.7 Forest Change Detection in Incomplete Satellite Images with
Deep Convolutional Networks . . . . . . . . . . . . . . . . . . 11
2 Geometry Driven Semantic Labeling of Indoor Scenes 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Proposed Conditional Random Field Model . . . . . . . . . . . . . . 17
2.3.1 Unary Energies . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 Pairwise Energies . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.3 Proposed Higher-Order Energies . . . . . . . . . . . . . . . . . 27
2.4 Structured Learning and Inference . . . . . . . . . . . . . . . . . . . . 29
2.4.1 Learning Parameters . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.2 Inference in CRF . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5 Planar Surface Detection . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.6 Experiments and Analysis . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3 Automatic Shadow Detection and Removal from a Single Photo-
graph 51
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2 Related Work and Contributions . . . . . . . . . . . . . . . . . . . . 54
3.3 Proposed Shadow Detection Framework . . . . . . . . . . . . . . . . . 58
3.3.1 Feature Learning for Unary Predictions . . . . . . . . . . . . . 58
3.3.2 Contrast Sensitive Pairwise Potential . . . . . . . . . . . . . . 60
3.3.3 Shadow Contour Generation using CRF Model . . . . . . . . . 62
3.4 Proposed Shadow Removal and Matting Framework . . . . . . . . . . 62
3.4.1 Rough Estimation of Shadow-less Image by Color-transfer . . 65
3.4.2 Generalised Shadow Generation Model . . . . . . . . . . . . . 68
3.4.3 Bayesian Shadow Removal and Matting . . . . . . . . . . . . . 71
3.4.4 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . 73
3.4.5 Boundary Enhancement in a Shadow-less Image . . . . . . . . 74
3.5 Experiments and Analysis . . . . . . . . . . . . . . . . . . . . . . . . 74
3.5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.5.2 Evaluation of Shadow Detection . . . . . . . . . . . . . . . . . 76
3.5.3 Evaluation of Shadow Removal . . . . . . . . . . . . . . . . . 83
3.5.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4 Separating Objects and Clutter in Indoor Scenes via Joint Reason-
ing 93
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.3 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.3.1 CRF Formulation . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.3.2 Potentials on Cuboids . . . . . . . . . . . . . . . . . . . . . . 98
4.3.3 Potentials on Superpixels . . . . . . . . . . . . . . . . . . . . . 102
4.3.4 Superpixel-Cuboid Compatibility . . . . . . . . . . . . . . . . 102
4.4 Cuboid Hypothesis Generation . . . . . . . . . . . . . . . . . . . . . . 103
4.5 Model Inference and Learning . . . . . . . . . . . . . . . . . . . . . . 104
4.5.1 Inference as MILP . . . . . . . . . . . . . . . . . . . . . . . . 104
4.5.2 Parameter Learning . . . . . . . . . . . . . . . . . . . . . . . . 105
4.6 Experiments and Analysis . . . . . . . . . . . . . . . . . . . . . . . . 105
4.6.1 Dataset and Setup . . . . . . . . . . . . . . . . . . . . . . . . 105
4.6.2 Cuboid Detection Task . . . . . . . . . . . . . . . . . . . . . . 106
4.6.3 Clutter/Non-Clutter Segmentation Task . . . . . . . . . . . . 109
4.6.4 Foreground Segmentation Task . . . . . . . . . . . . . . . . . 112
4.6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.8 Supplementary Material:
“Separating Objects and Clutter in Indoor Scenes” . . . . . . . . . . 114
4.8.1 Inference as MILP . . . . . . . . . . . . . . . . . . . . . . . . 114
4.8.2 Parameter Learning . . . . . . . . . . . . . . . . . . . . . . . . 115
5 A Discriminative Representation of Convolutional Features 117
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.3 The Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.3.1 Dense Patch Extraction . . . . . . . . . . . . . . . . . . . . . 121
5.3.2 Convolutional Feature Representations . . . . . . . . . . . . . 123
5.3.3 Scene Representative Patches (SRPs) . . . . . . . . . . . . . . 124
5.3.4 Feature Encoding from SRPs . . . . . . . . . . . . . . . . . . 126
5.3.5 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.4 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . 127
5.4.1 A Dataset of Object Categories in Indoor Scenes . . . . . . . . 128
5.4.2 Evaluated Datasets . . . . . . . . . . . . . . . . . . . . . . . . 132
5.4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 133
5.4.4 Ablative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.4.5 Effectiveness of Mid-level Information . . . . . . . . . . . . . . 140
5.4.6 Dimensionality Analysis . . . . . . . . . . . . . . . . . . . . . 140
5.4.7 Timing Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.4.8 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . 143
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6 Cost-Sensitive Learning of Deep Feature Representations 145
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.3 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.3.1 Problem Formulation for Cost Sensitive Classification . . . . . 150
6.3.2 Our Proposed Cost Matrix . . . . . . . . . . . . . . . . . . . . 152
6.3.3 Cost-Sensitive Surrogate Losses . . . . . . . . . . . . . . . . . 153
6.3.4 Optimal Parameters Learning . . . . . . . . . . . . . . . . . . 158
6.3.5 Effect on Error Back-propagation . . . . . . . . . . . . . . . . 160
6.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.4.1 Datasets and Experimental Settings . . . . . . . . . . . . . . . 164
6.4.2 Convolutional Neural Network . . . . . . . . . . . . . . . . . . 166
6.4.3 Results and Comparisons . . . . . . . . . . . . . . . . . . . . . 168
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
7 Weakly Supervised Change Detection in a Pair of Images 179
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
7.3 Two-stream CNNs for Change Localization . . . . . . . . . . . . . . . 183
7.3.1 Model overview . . . . . . . . . . . . . . . . . . . . . . . . . . 183
7.3.2 Deep network architecture . . . . . . . . . . . . . . . . . . . . 184
7.3.3 Model inference for change localization . . . . . . . . . . . . . 188
7.4 EM Learning with Weak Supervision . . . . . . . . . . . . . . . . . . 190
7.4.1 Mean-field E step . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.4.2 M step for CNN training . . . . . . . . . . . . . . . . . . . . . 191
7.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
7.5.1 CNN implementation . . . . . . . . . . . . . . . . . . . . . . . 191
7.5.2 Datasets and Protocols . . . . . . . . . . . . . . . . . . . . . . 192
7.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
7.6 Change Detection in Multiple Images . . . . . . . . . . . . . . . . . . 202
7.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
8 Forest Change Detection in Incomplete Satellite Images 205
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
8.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
8.3 Case Study: Data Description . . . . . . . . . . . . . . . . . . . . . . 213
8.4 Data Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
8.4.1 Data Filling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
8.4.2 Sparse Reconstruction based Image Enhancement . . . . . . . 216
8.4.3 Thin Cloud Removal . . . . . . . . . . . . . . . . . . . . . . . 217
8.5 Change Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
8.5.1 Multiscale Region Proposal Generation . . . . . . . . . . . . . 220
8.5.2 Candidate Suppression . . . . . . . . . . . . . . . . . . . . . . 221
8.5.3 Deep Convolutional Neural Network . . . . . . . . . . . . . . 221
8.6 Experimental Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 223
8.6.1 Evaluation Tasks . . . . . . . . . . . . . . . . . . . . . . . . . 223
8.6.2 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . 224
8.6.3 Baseline Approaches . . . . . . . . . . . . . . . . . . . . . . . 225
8.6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
8.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
9 Conclusion 237
9.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
9.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
9.3 Future Directions and Open Problems . . . . . . . . . . . . . . . . . 238
A Disintegration of Higher-Order Energies 241
A.0.1 Disintegration of Higher-Order Energies to Second-Order Sub-
Modular Energies for Swap Moves . . . . . . . . . . . . . . . . 241
A.0.2 Disintegration of Higher-Order Energies to Second-Order Sub-
Modular Energies for Expansion Moves . . . . . . . . . . . . . 242
B Proofs Regarding Cost Matrix ξ′ 245
List of Tables
2.1 Comparison of plane detection results on the NYU-Depth v2 dataset 32
2.2 Results on the NYU-Depth v1, v2 and the SUN3D Datasets . . . . . 38
2.3 Class-wise Accuracies on NYU-Depth v1 . . . . . . . . . . . . . . . . 39
2.4 Class-wise Accuracies on NYU-Depth v2 (22 classes) . . . . . . . . . 39
2.5 Class-wise Accuracies on the NYU-Depth v2 (40 classes) . . . . . . . 40
2.6 Comparison of the results on the NYU-Depth v1 Dataset . . . . . . . 45
2.7 Comparison of results on the NYU-Depth v2 Dataset . . . . . . . . . 45
2.8 Comparison of results on the NYU-Depth v2 Dataset (4-class labeling
task) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.9 Comparison of results on the NYU-Depth v2 Dataset (4-class labeling
task) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.10 Comparison of results on the NYU-Depth v2 Dataset (40-class label-
ing task) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.1 Evaluation of the Proposed Shadow Detection Scheme . . . . . . . . . 75
3.3 Results when ConvNets were trained and tested across different datasets. 78
3.2 Class-wise Accuracies of Our Proposed Framework in Comparison
with the State-of-the-art Techniques . . . . . . . . . . . . . . . . . . . 79
3.4 Quantitative Evaluation for Shadow Removal . . . . . . . . . . . . . 84
4.1 Inference Running time Comparisons for Variants of MILP Formulation . . 105
4.2 An Ablation Study on the Model Potentials/Features . . . . . . . . . 109
4.3 Evaluation on Clutter/Non-Clutter Segmentation Task . . . . . . . . 110
4.4 Evaluation on Foreground/Background Segmentation Task . . . . . . 110
4.5 Statistics for Cuboids Fitted on Cluttered Regions . . . . . . . . . . . 114
5.1 Mean Accuracy on the MIT-67 Indoor Scene Dataset . . . . . . . . . 131
5.2 Mean Accuracy on the 15-Category Scene Dataset . . . . . . . . . . . 134
5.3 Mean Accuracy on the UIUC 8-Sports Dataset. . . . . . . . . . . . . 136
5.4 Mean Accuracy for the NYU v1 Dataset. . . . . . . . . . . . . . . . 137
5.5 Equal Error Rates (EER) for the Graz-02 dataset. . . . . . . . . . . 137
5.6 Ablative Analysis on MIT-67 Scene Dataset. . . . . . . . . . . . . . 141
5.7 Analysis of Feature Dimensions and their Corresponding Accuracies . 141
6.1 Evaluation on DIL Database. . . . . . . . . . . . . . . . . . . . . . . 168
6.2 Evaluation on MLC Database. . . . . . . . . . . . . . . . . . . . . . . 169
6.3 Evaluation on MNIST Database. . . . . . . . . . . . . . . . . . . . . 169
6.4 Evaluation on CIFAR-100 Database. . . . . . . . . . . . . . . . . . . 170
6.5 Evaluation on Caltech-101 Database . . . . . . . . . . . . . . . . . . 171
6.6 Evaluation on MIT-67 Database. . . . . . . . . . . . . . . . . . . . . 172
6.7 Comparisons of Our Approach with the State-of-the-art Class-imbalance
Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.8 Comparisons of our Approach (Adaptive Costs) with the Fixed Class-
specific Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
7.1 Detection results in terms of average precision and overall accuracy . 196
7.2 Segmentation Results and Comparisons with Different Baseline Meth-
ods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
7.3 Ablative Analysis on the CDnet-2014 Dataset . . . . . . . . . . . . . 196
7.4 More Comparisons for the Segmentation Performance of our model
on the CDnet-2014 Dataset . . . . . . . . . . . . . . . . . . . . . . . 198
7.5 Segmentation Performance for Different Fixed τ on the CDnet-2014
Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8.1 The flags included in the pixel quality map available with the Landsat
NBAR images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
8.2 Patch-wise classification and detection results for the temporal se-
quence are summarized above. . . . . . . . . . . . . . . . . . . . . . 227
8.3 Our results for onset/offset detection and comparisons with several
baseline techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
List of Figures
1.1 Computer vision algorithms perform well on individual tasks, but lack
a full visual understanding to be able to answer intelligent questions
about the scene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contextual information is important for scene understanding tasks . . 2
2.1 The figure summarizes our proposed approach to combine global ge-
ometric information with low-level cues. . . . . . . . . . . . . . . . . 18
2.2 A factor graph representation for our CRF model . . . . . . . . . . . 21
2.3 Effect of the Ensemble Learning Scheme . . . . . . . . . . . . . . . . 23
2.4 Learning Location Prior using Geometrical Context . . . . . . . . . . 26
2.5 Robust Higher-Order Energy . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 An illustrative example showing the results of the planar surface de-
tection algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7 Comparison of our algorithm with [242] . . . . . . . . . . . . . . . . . 34
2.8 Examples of the semantic labeling results on the NYU-Depth v1 dataset 37
2.9 Examples of semantic labeling results on the NYU-Depth v2 dataset . 41
2.10 Examples of the semantic labeling results on the SUN3D dataset . . . 44
2.11 The introduction of HOE improves the segmentation accuracy around
the boundaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.12 Confusion Matrices for NYU-Depth and SUN3D Datasets . . . . . . . 48
3.1 Overview of Our Shadow Detection and Removal Scheme . . . . . . . 53
3.2 The Proposed Shadow Detection Framework . . . . . . . . . . . . . . 57
3.3 ConvNet Architecture used for Automatic Feature Learning to Detect
Shadows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4 The Proposed Shadow Removal Framework . . . . . . . . . . . . . . . 61
3.5 Detection of Object and Shadow Boundary . . . . . . . . . . . . . . . 63
3.6 Detection of Umbra and Penumbra Regions . . . . . . . . . . . . . . 64
3.7 Multi-level Color Transfer . . . . . . . . . . . . . . . . . . . . . . . . 69
3.8 Shadow Removal Steps . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.9 ROC curve comparisons of proposed framework with previous works. 78
3.10 Qualitative examples of our results . . . . . . . . . . . . . . . . . . . 80
3.11 Examples of Ambiguous Cases . . . . . . . . . . . . . . . . . . . . . . 81
3.12 Shadow Recovery Results on Sample Images . . . . . . . . . . . . . . 82
3.13 Comparison with Automatic/Semi-Automatic Methods . . . . . . . . 85
3.14 Comparison with Methods Requiring User Interaction . . . . . . . . . 87
3.15 Examples of Failure Cases . . . . . . . . . . . . . . . . . . . . . . . . 89
3.16 Different Applications of Shadow Detection, Removal and Matting . . 90
4.1 An Overview of Our Clutter Detection and Object Geometry Esti-
mation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.2 Graph Structure Representation for the Potentials . . . . . . . . . . . 97
4.3 The Distribution of Variation in Color for Cluttered and Non-cluttered
Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.4 Jaccard Index Comparisons for all Annotated Cuboids . . . . . . . . 107
4.5 Comparison of Our Results with the State-of-the-art Technique [110] 109
4.6 Qualitative Results for Cuboid Detection . . . . . . . . . . . . . . . . 112
4.7 Ambiguous Cases in Cuboid Detection . . . . . . . . . . . . . . . . . 113
5.1 An Overview of the Scene Classification Framework . . . . . . . . . . 118
5.2 Deep Un-structured Convolutional Activations . . . . . . . . . . . . . 122
5.3 Multi-level Patches Contain Different Levels of Scene Details . . . . . 124
5.4 CMC Curve for the Benchmark Evaluation on the OCIS Dataset . . . 128
5.5 A Word Cloud Representation of Object Categories in Indoor Scenes
(OCIS) database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.6 Example Images from the ‘Object Categories in Indoor Scenes’ Dataset . . 130
5.7 Confusion matrices for Three Scene Classification Datasets . . . . . . 138
5.8 The contributions of Distinctive Patches for the Correct Class Pre-
diction of a Scene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.9 Confusion Matrix for the MIT-67 Dataset . . . . . . . . . . . . . . . 142
5.10 Example Mistakes and the Limitations of Our Method . . . . . . . . 143
5.11 Time consumed to Associate Extracted Patches with the Codebook
Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.1 Examples of Class Imbalance in the Popular Classification Datasets . 147
6.2 The CNN Parameters (θ) and Class Dependent Costs (ξ) used during
the Training Process of our Deep Network . . . . . . . . . . . . . . . 153
6.3 The 0-1 Loss along-with several other Common Surrogate Loss Func-
tions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.4 The CE loss Function for the Case of Binary Classification . . . . . . 161
6.5 Confusion Matrices for the Baseline and CoSen CNNs on the DIL and
MLC datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
6.6 The CNN Architecture used in This Work . . . . . . . . . . . . . . . 167
6.7 The Imbalanced Training Set Distributions used for the Comparisons
Reported in Table 6.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.8 Training and Validation Error on the DIL Dataset . . . . . . . . . . . 177
7.1 Overview of Change Detection in a Pair of Images . . . . . . . . . . . 181
7.2 Factor Graph Representation of the Weakly Supervised Change De-
tection Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
7.3 CNN Architecture used in This Work . . . . . . . . . . . . . . . . . . 187
7.4 Qualitative Results on the CDnet-2014 Dataset . . . . . . . . . . . . 194
7.5 Qualitative Results on the GASI-2015 and PCD-2015 Datasets . . . . 197
7.6 Ambiguous Cases for Change Detection . . . . . . . . . . . . . . . . . 199
7.7 Sensitivity analysis on the Number of Nearest Neighbours used to
Estimate Foreground Probability Mass Parameter (τ) . . . . . . . . 200
7.8 More Qualitative Results of the Proposed Approach . . . . . . . . . . 200
8.1 Region of interest for change detection (Victoria, Australia) . . . . . 209
8.2 Gantt Chart of the Fire and Harvest Incidents . . . . . . . . . . . . . 211
8.3 Examples of artifacts in the data. . . . . . . . . . . . . . . . . . . . . 212
8.4 Examples of SLC-off artifacts. . . . . . . . . . . . . . . . . . . . . . . 213
8.5 Data Recovery Results on Single Frames . . . . . . . . . . . . . . . . 215
8.6 Our approach to detect and remove thin translucent clouds . . . . . . 217
8.7 Box proposals are generated at multiple scales to capture all sizes of
change events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
8.8 The CNN architecture used for forest change detection. . . . . . . . . 222
8.9 The trend of missed events and mean onset/offset difference when the
temporal threshold for valid detection is changed. . . . . . . . . . . . 226
8.10 Labeled change region coverage by the different number of bounding-
box change proposals. . . . . . . . . . . . . . . . . . . . . . . . . . . 226
8.11 On/Offset Detection Results for Individual Fire and Harvest Events. . 228
8.12 Example of ground-truth change patterns (left) and the change se-
quences predicted by our approach (right). . . . . . . . . . . . . . . 231
8.12 The figure shows detection results on the complete image plane en-
compassing the forest area under investigation . . . . . . . . . . . . . 233
8.13 Three small portions of patch sequences are shown in the above figure. . . 234
List of Algorithms
1 Region Growing Algorithm for Depth-Based Segmentation . . . . . . 33
2 Rough Estimation of Shadow-less Image by Color-transfer . . . . . . 66
3 Bayesian Shadow Removal . . . . . . . . . . . . . . . . . . . . . . . . 74
4 Parameter Learning using the Structured SVM Formulation . . . . . 115
5 Iterative optimization for parameters (θ, ξ) . . . . . . . . . . . . . . 159
Publications During the Candidature
Journal Publications (Refereed)
1. Salman H. Khan, Mohammed Bennamoun, Ferdous Sohel, and Roberto Togneri.
“Automatic Shadow Detection and Removal from a Single Image.” IEEE
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), IEEE,
vol.38, no. 3, pp. 431-446, March 2016, doi:10.1109/TPAMI.2015.2462355.
[IF: 5.8]
IEEE TPAMI is the most cited journal in computer vision according to SJR
(SCImago Journal and Country Rank) [1]. It is the second highest ranked journal
in computer science (among ∼1500 journals). The review process in this journal
is very rigorous, with an acceptance rate of ∼15%. In 2014 (the year in which
this paper was submitted), TPAMI received 1018 submissions, out of which 160
were accepted by November 2015 [2].
2. Salman H. Khan, Mohammed Bennamoun, Ferdous Sohel, Roberto Togneri,
and Imran Naseem. “Integrating Geometrical Context for Semantic Labeling
of Indoor Scenes using RGBD Images.” International Journal of Computer
Vision (IJCV), 1-20, Springer, 2015. [IF: 3.8]
3. Salman H. Khan, Munawar Hayat, Mohammed Bennamoun, Roberto Togneri,
and Ferdous Sohel. “A Discriminative Representation of Convolutional Fea-
tures for Indoor Scene Recognition.” IEEE Transactions on Image Processing
(TIP), IEEE, 2016. [IF: 3.6]
4. Salman H. Khan, Mohammed Bennamoun, Ferdous Sohel, and Roberto Togneri.
“Cost Sensitive Learning of Deep Feature Representations from Imbalanced
Data.” IEEE Transactions on Pattern Analysis and Machine Intelligence
(TPAMI), IEEE, 2015. (Submitted) [IF: 5.8]
5. Salman H. Khan, Xuming He, Mohammed Bennamoun and Fatih Porikli.
“Forest Change Detection in Incomplete Satellite Images with Deep Convo-
lutional Networks.” Remote Sensing of Environment (RSE), Elsevier, 2016.
(Submitted) [IF: 6.4]
[1] http://www.scimagojr.com/journalrank.php?area=1700&category=1707&country=all&year=2014&order=sjr&min=0&min_type=cd
[2] https://www.computer.org/csdl/trans/tp/2016/02/07374795.pdf
Conference Publications (Refereed)
6. Salman H. Khan, Mohammed Bennamoun, Ferdous Sohel, and Roberto Togneri.
“Automatic feature learning for robust shadow detection.” In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 1939-1946. IEEE, 2014.
7. Salman H. Khan, Mohammed Bennamoun, Ferdous Sohel, and Roberto Togneri.
“Geometry driven semantic labeling of indoor scenes.” In Proceedings of the
European Conference on Computer Vision (ECCV), pp. 679-694. Springer
International Publishing, 2014.
Based on this paper, we were invited by Aditya Khosla (MIT), Silvio Savarese
(Stanford University), James Hays (Brown University), and Jianxiong Xiao
(Princeton) to submit a paper at a CVPR 2015 workshop entitled SUNw: Scene
Understanding Workshop, which provides a yearly summary and compiles a
yearbook to summarize new progress in the field.
8. Salman H. Khan, Mohammed Bennamoun, Ferdous Sohel, and Roberto Togneri.
“Geometry driven semantic labeling of indoor scenes (II).” In Proceedings of
the Scene Understanding Workshop (SUNw) in conjunction with the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), IEEE,
2015. (Invited Paper)
9. Salman H. Khan, Xuming He, Mohammed Bennamoun, Ferdous Sohel, and
Roberto Togneri. “Separating Objects and Clutter in Indoor Scenes.” In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 4603-4611. IEEE, 2015.
10. Salman H. Khan, Xuming He, Mohammed Bennamoun, Fatih Porikli, Ferdous
Sohel, and Roberto Togneri. “Weakly Supervised Change Detection in a Pair
of Images.” In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), IEEE, 2016. (Submitted)
Non-Lead Author Publications (Refereed)
Non-lead author publications are not presented in this thesis.
11. Munawar Hayat, Salman H. Khan, Mohammed Bennamoun, Senjian An, “A
Spatial Layout and Scale Invariant Feature Representation for Indoor Scene
Classification.” IEEE Transactions on Image Processing (TIP), IEEE, 2016.
(In Revision RQ) [IF: 3.6]
12. Senjian An, Munawar Hayat, Salman H. Khan, Mohammed Bennamoun, Farid
Boussaid, Ferdous Sohel, “Contractive Rectifier Networks for Nonlinear Max-
imum Margin Classification”, In Proceedings of the IEEE International Con-
ference on Computer Vision (ICCV), 2015.
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), In-
ternational Journal of Computer Vision (IJCV) and IEEE Transactions on Image
Processing (TIP) are respectively the 1st, 2nd and 3rd most cited journals in Com-
puter Vision and Pattern Recognition. Elsevier Remote Sensing of Environment
(RSE) is the most cited journal in Remote Sensing.
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
is the best conference in Computer Vision, followed by the European Conference on
Computer Vision (ECCV) and the IEEE International Conference on Computer Vision (ICCV).
During my PhD, I have had the privilege to present my research in all of these three
top-ranked conferences.
The above-mentioned rankings are according to Google Scholar Metrics [3].
[3] https://scholar.google.com.au/citations?view_op=top_venues&hl=en&vq=eng_computervisionpatternrecognition and
https://scholar.google.com.au/citations?view_op=top_venues&hl=en&vq=eng_remotesensing.
Contribution of Candidate to Published Work
My contribution to all the first-authored papers was 85%. I conceived the ideas, developed
them into mature techniques, validated them through experiments and wrote a
significant part of all the papers. My co-authors helped me through continuous
discussions, providing me with useful feedback during the course of my work.
They also reviewed my papers and improved the writing by providing useful
comments and suggestions.
CHAPTER 1
Introduction
It’s not what you look at that matters, it’s what you see.
H. D. Thoreau (1817-1862)
Current computer vision algorithms lack the ability to develop a higher level of
understanding of the visual content, which appears in the images and videos. As
an example, highly sophisticated and well-suited approaches have been developed to
segment an image into smaller parts, detect and track objects in a scene, recognize
human faces in images, read text in natural scenes and to classify an image into one
of the many categories. However, these algorithms do not fulfil the ultimate goal
of visual scene understanding, which aims to design algorithms which can perform
high-level reasoning about the scene type, object categories, the semantic classes
that are present in the image, their interactions, their spatial and geometric layout
and the illumination conditions in the scene. For example, given an indoor scene
(Fig. 1.1), a computer algorithm should be able to answer intelligent questions, e.g.,
“which objects are occluded by the sofa?”, “how can we exit from the room?”,
“where are we located in the house?”, “in which direction is a light source located?”,
and so on.
This dissertation contributes towards the bigger goal of holistic (or total) scene
understanding by proposing methods to effectively incorporate contextual information.

Figure 1.1: Computer vision algorithms perform well on individual tasks, but lack a
full visual understanding to be able to answer intelligent questions about the scene.

We cannot overstate the fact that contextual cues are an integral part of
human visual reasoning and understanding. By looking at the contextual infor-
mation, humans develop a perception of an object’s size, its geometric orientation,
physical location and even its category. For example, it is extremely challenging to
predict an object’s class, scale, location and orientation by just looking at that spe-
cific object in Fig. 1.2 (top row). However, if we consider its context as well, we can
very easily reason about the object and its properties (bottom row). We can even
determine the prevalent situation in a scene by combining contextual information
(e.g., a road is blocked or there is an emergency situation).
Although contextual information makes a lot of sense to humans, and is in
fact an integral component of our day-to-day reasoning, modern computer vision
and machine learning techniques are currently inept at efficiently and optimally
incorporating all the relevant contextual information in order to perform highly
intelligent reasoning about the real world. This is mainly due to the complex and
ambiguous nature of this problem where the contextual relationships are not always
easy to model.

Figure 1.2: Contextual information is important for scene understanding tasks. If we
look at the individual objects in the above figure (top row), we cannot identify their
semantic class and their physical attributes. However, by considering their context,
we can easily understand scene information and can reason about the object’s class,
location, geometry, support surfaces, material affordance and other properties. The
above images are taken from the NYU and MIT-67 Indoor datasets.

Moreover, only a limited amount of data is available during the
learning process and contextual information appears in a huge number of different
configurations and varieties, making it extremely challenging to learn and take into
consideration all the useful relationships between the scene components.
In this dissertation, we present solutions to three crucial problems under the um-
brella of visual scene understanding. First, we propose novel methods to enhance
feature and classifier learning from the raw data. We investigate well-engineered
systems based on hand-crafted feature representations for scene understanding. We
also propose new feature representations based on deep neural networks, which are
automatically learned in a supervised manner. Second, we propose new methods
and models for structured prediction, where we incorporate a variety of contextual
cues while reasoning about the semantic class, location, geometry or spatial extent
of an object. These models are built upon hand-crafted or automatically learned
feature representations to perform high level reasoning, and they are useful for devel-
oping a better understanding of scenes. Third, we contribute towards the solution
of the limited data problem by proposing new frameworks to learn features from only
weak labels and to automatically deal with the class imbalance problem. We also
address this issue by presenting two new annotated datasets which were collected
during the course of our work.
1.1 Background and Definitions
To simplify the material presented in this dissertation, we provide a brief de-
scription of the keywords used in this document.
Scene Understanding: The scene understanding problem aims to interpret the
visual data in semantic terms by studying the constituent scene elements and their
relationships. The visual content interpretation provided by the scene understanding
problem is closer to what humans perceive and understand from images and videos.
Semantic Labeling: Relates to the problem of partitioning an image into a set
of regions and the assignment of a semantically meaningful category to each region.
Scene Categorization: Given an input image, a scene categorization frame-
work decides on the group (e.g., indoor, bedroom or office scene) to which it belongs.
Geometric Reasoning: The problem of reasoning about objects whose geome-
try is estimated using basic geometric primitives (e.g., rectangle, square or cuboid).
This can help in applications such as robotic manipulation, object grasping and path
finding.
Volumetric Reasoning: The problem of geometric reasoning by treating 3D
objects as cuboids with definite area and volume. Volumetric reasoning provides a
physically plausible understanding of scenes.
Class Imbalance: Deals with the problem which arises when some of the classes
are heavily under-represented compared to some other frequently occurring classes
in a dataset. Such a dataset is termed an ‘imbalanced’ or ‘skewed’ dataset.
Supervised Learning: Supervised learning is a process in which a learner
is shown examples of input-output pairs. In other words, a learner is directly taught
the relationship between the input and output variables.
Weakly Supervised Learning: This type of learning involves weak supervi-
sory information which does not fully specify the required output from the learner
during the learning process. As an example, we will categorize an object localisation
problem as a weakly supervised task if only image-level object presence/absence in-
formation is available during the training process. Note that the precise location of
the object is unknown but the learner will be required to predict the object location
after training.
Change Detection: Deals with the analysis of two or more images to find any
interesting changes and their locations. The changes in the set of images may be
due to several reasons including object motion, growth, decay and actions.
High-level Reasoning: A term pertaining to image analysis and interpretation
for scene understanding. This problem reasons about the scene in a form which is
closer to human understanding of scenes, as opposed to low-level vision, which
only performs image processing or reasons about local pixels.
Clutter Identification: The problem of localization and segmentation of jum-
bled or cluttered regions in a scene. In indoor scenes, clutter usually refers to useless
image regions where no object of interest is present.
Deep Learning: The process of learning representations using deep neural
networks. Normally, deep neural networks refer to multi-layer networks with more
than 2 hidden layers. We refer the interested reader to [? ] for a comprehensive
introduction on this topic.
Graphical Models: A model which defines a joint probability distribution over
a set of random variables. The graphical model can be either directed (Bayesian
models) or undirected (Markov Random Fields). Missing edges in the graph imply
conditional independence between the random variables. For a thorough introduc-
tion on this topic, we refer the reader to [? ].
Structured Learning: The process of learning weights associated with the
nodes and connections of a probabilistic graphical model (structured prediction
model).
Directed Acyclic Graph (DAG): A type of graph which contains only di-
rected edges and there are no cycles (or close paths) between the random variables.
An example of DAG is a graph defined by a Bayesian or belief network.
Conditional Random Field (CRF) Model: A CRF model defines a joint
probabilistic distribution over a set of random variables which are connected through
a graph structure with undirected edges. The joint distribution, in the case of CRFs,
is conditioned on a set of observed variables.
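For concreteness, and using generic notation rather than the exact formulation of any
particular chapter, a CRF over a label field $\mathbf{y}$ conditioned on observations
$\mathbf{x}$ can be written as
\[
P(\mathbf{y}\mid\mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}\exp\Big(-\sum_{c\in\mathcal{C}}\psi_c(\mathbf{y}_c,\mathbf{x})\Big),
\qquad
Z(\mathbf{x}) \;=\; \sum_{\mathbf{y}}\exp\Big(-\sum_{c\in\mathcal{C}}\psi_c(\mathbf{y}_c,\mathbf{x})\Big),
\]
where $\mathcal{C}$ is the set of cliques of the undirected graph, $\psi_c$ are clique
energies (potentials) and $Z(\mathbf{x})$ is the partition function.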
Shadow Matting: The process of separating the shadow from the original
image using a matte indicating the location of shadows.
Convolutional Neural Network: A special type of neural network where the
weights in each layer are defined as filters which are convolved with the layer inputs.
Sparse Coding: An approach that represents a signal or feature vector in terms of a
small number of descriptors selected from a very large set of descriptors.
Dictionary Learning: The problem of choosing a limited set of descriptors to
form a dictionary which can be used to describe a large number of representations
in terms of associations with the elements of the dictionary.
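For illustration (generic notation, not the specific models used later in this thesis),
given a signal $\mathbf{x}$ and a dictionary $\mathbf{D}$ whose columns are the descriptors,
sparse coding solves for a code with few non-zero entries, while dictionary learning
additionally optimises the dictionary over a training set:
\[
\min_{\boldsymbol{\alpha}}\;\|\mathbf{x}-\mathbf{D}\boldsymbol{\alpha}\|_2^2+\lambda\|\boldsymbol{\alpha}\|_1,
\qquad
\min_{\mathbf{D},\,\{\boldsymbol{\alpha}_i\}}\;\sum_i\Big(\|\mathbf{x}_i-\mathbf{D}\boldsymbol{\alpha}_i\|_2^2+\lambda\|\boldsymbol{\alpha}_i\|_1\Big).
\]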
Cost-sensitive Learning: The problem of learning the class-specific costs
which are used to deal with class-imbalanced datasets. Cost-sensitive learning gives
importance to the less frequent classes by learning appropriate weights.
Data Augmentation: The process of generating synthetic data from the al-
ready available examples and including it in the training set to enhance the learning
process. This technique is commonly used in deep neural networks to avoid over-
fitting.
Expectation-Maximization Framework: Iteratively maximises the data like-
lihood by estimating the hidden states in the model at each step. This algorithm is
guaranteed to converge to a local maximum of the likelihood, providing a (locally
optimal) Maximum Likelihood Estimate (MLE).
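In generic form, given observed data $\mathbf{x}$, hidden states $\mathbf{z}$ and
parameters $\boldsymbol{\theta}$, iteration $t$ alternates
\[
\text{E-step:}\;\;
Q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{(t)})=\mathbb{E}_{\mathbf{z}\sim P(\mathbf{z}\mid\mathbf{x},\boldsymbol{\theta}^{(t)})}\big[\log P(\mathbf{x},\mathbf{z}\mid\boldsymbol{\theta})\big],
\qquad
\text{M-step:}\;\;
\boldsymbol{\theta}^{(t+1)}=\arg\max_{\boldsymbol{\theta}}\;Q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{(t)}).
\]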
Spectral Data: The surface reflectance data acquired from remote sensing
satellites that arrange the information into several spectral bands.
1.2 Contributions
The major contributions of this thesis are as follows:
• We propose a novel probabilistic model to perform semantic labeling of indoor
scenes by incorporating the depth information in the local, pairwise and
higher order energies defined on pixels. (Chapter 2, Published in ECCV’14,
CVPRW’15 and IJCV’15)
• An automatic method has been proposed to accurately detect shadows in
unconstrained images using a deep neural network model. We also present
an automatic Bayesian approach to effectively remove the detected shadows.
(Chapter 3, Published in CVPR’14 and TPAMI’16)
• A new CRF model to incorporate rich interactions between objects and super-
pixels has been proposed. The proposed model allows us to jointly estimate the
objects’ spatial layout and clutter in indoor scenes. (Chapter 4, Published in
CVPR’15)
• We develop a novel feature representation based on convolutional features
from deep neural networks to accurately predict the scene type of an input
image. Our approach takes into account the semantic and spatial contextual
information. (Chapter 5, Accepted in TIP’16)
• To address the class imbalance problem in some of the widely used datasets,
we propose an automatic framework to learn improved feature representations
and classifier weights using a proposed deep neural network training algorithm.
(Chapter 6, Submitted to TPAMI)
• We propose a novel method to detect interesting changes in a pair of images
without full pixel-level supervision. Our technique is based on a structured
prediction framework which jointly detects and localises change events. (Chap-
ter 7, Submitted to CVPR’16)
• This dissertation also presents a new method for land-cover change detec-
tion in the spectral data using spatial and temporal contextual information.
The proposed approach recovers the missing information in satellite imagery
and accurately detects changes in a time-lapse sequence using a deep network
model. (Chapter 8, Submitted to RSE)
In the next section, we provide a brief overview of the above mentioned contri-
butions, which are arranged in the form of separate chapters in this dissertation.
1.3 Thesis Overview
This thesis presents a number of novel solutions relating to feature learning and
structured prediction to develop a better understanding of scenes. This disserta-
tion is arranged as a set of publications, each of which addresses a different but
closely linked sub-problem in scene understanding. Although we explore a number
of different computer vision tasks e.g., classification, segmentation, detection, and
geometry estimation, the underlying tools are consistent throughout the thesis, and
therefore the central theme remains almost the same all through this document. In
short, this thesis presents new methods for both:
• The development of better hand-crafted and learned feature representations
(Chapters 2, 4 and 5), and
• The design of improved models for structured prediction (Chapters 2, 3, 4, 5,
6, 7 and 8).
Since the explored tasks and application domains are different, we provide relevant
problem descriptions and a detailed literature review in each chapter of this thesis.
In the description below, we provide a brief overview of each of the chapters that
will follow after this introduction.
1.3.1 Geometry Driven Semantic Understanding of Scenes (Chapter 2)
This chapter deals with scene labeling, which is a fundamental task in scene
understanding. In this task, each of the smallest discrete elements in an image
(pixels or voxels) is assigned a semantically-meaningful class label.
We note that inexpensive structured light sensors can capture rich information
from indoor scenes, and scene labeling problems provide a compelling opportunity to
make use of this information. In this chapter we present a novel Conditional Random
Field (CRF) model to effectively utilize depth information for semantic labeling of
indoor scenes. At the core of the model, we propose a novel and efficient plane
detection algorithm which is robust to erroneous depth maps. The CRF formulation
defines local, pairwise and higher order interactions between image pixels. These
are briefly described below:
a) At the local level, we propose a novel scheme to combine energies derived from
appearance, depth and geometry-based cues. The proposed local energy also
encodes the location of each object class by considering the approximate geometry
of a scene.
b) For the pairwise interactions, we learn a boundary measure which defines the
spatial discontinuity of object classes across an image.
c) To model higher-order interactions, the proposed energy treats smooth surfaces
as cliques and encourages all the pixels on a surface to take the same label.
We show that the proposed higher-order energies can be decomposed into pairwise
sub-modular energies and efficient inference can be made using the graph-cuts algo-
rithm. We follow a systematic approach which uses structured learning to fine-tune
the model parameters. We rigorously test our approach on SUN3D and both ver-
sions of the NYU-Depth database. Experimental results show that our work achieves
superior performance to state-of-the-art scene labeling techniques.
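To make the structure of this model concrete, the energy can be sketched (in schematic
notation only; the exact potentials are defined in Chapter 2) as
\[
E(\mathbf{y}\mid\mathbf{x}) \;=\; \sum_{i}\psi_i(y_i\mid\mathbf{x})
\;+\; \sum_{(i,j)\in\mathcal{E}}\psi_{ij}(y_i,y_j\mid\mathbf{x})
\;+\; \sum_{c\in\mathcal{S}}\psi_c(\mathbf{y}_c\mid\mathbf{x}),
\]
where the unary terms combine appearance, depth and geometry cues, the pairwise
terms encode the learned boundary measure over neighbouring pixels $\mathcal{E}$, and
the higher-order terms are defined on cliques $\mathcal{S}$ corresponding to the detected
planar surfaces. Labeling then amounts to minimising $E$, which becomes tractable with
graph cuts once the higher-order terms are decomposed into pairwise sub-modular energies.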
1.3.2 Automatic Shadow Detection and Removal (Chapter 3)
This chapter addresses the shadow detection and removal problem. Shadows are
a frequently occurring natural phenomenon, whose detection and manipulation are
important in many computer vision (e.g., visual scene understanding) and computer
graphics (e.g., augmented reality) applications. Shadows can help in high-level scene
understanding tasks because they provide several useful clues about the scene and
object characteristics (e.g., the number of light sources, their location, object shape
and size).
We present a framework to automatically detect and remove shadows in real
world scenes from a single image. Previous works on shadow detection put a lot of
effort into designing shadow-variant and shadow-invariant hand-crafted features. In contrast,
the proposed framework automatically learns the most relevant features in a super-
vised manner using multiple convolutional deep neural networks (ConvNets). The
features are learned at the super-pixel level and along the dominant boundaries in
the image. The predicted posteriors based on the learned features are fed to a con-
ditional random field model to generate smooth shadow masks. Using the detected
shadow masks, we propose a Bayesian formulation to accurately extract shadow
matte and subsequently remove shadows. The Bayesian formulation is based on a
novel model which accurately models the shadow generation process in the umbra
and penumbra regions. The model parameters are efficiently estimated using an
iterative optimization procedure. The proposed framework consistently performed
better than the state-of-the-art on all major shadow databases collected under a
variety of conditions.
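Viewed abstractly, the removal stage is maximum a posteriori (MAP) estimation; a
schematic form (generic notation only, not the exact model derived in Chapter 3) is
\[
(\hat{\boldsymbol{\alpha}},\,\hat{\mathbf{I}}_{sf}) \;=\;
\arg\max_{\boldsymbol{\alpha},\,\mathbf{I}_{sf}}\;
P(\mathbf{I}_{s}\mid\boldsymbol{\alpha},\mathbf{I}_{sf})\,P(\boldsymbol{\alpha})\,P(\mathbf{I}_{sf}),
\]
where $\mathbf{I}_s$ is the observed shadowed image, $\mathbf{I}_{sf}$ the shadow-free
image and $\boldsymbol{\alpha}$ the shadow matte. The likelihood term plays the role of
the shadow generation model (with different behaviour in the umbra and penumbra
regions), the priors regularise the matte and the recovered image, and the model
parameters are estimated with an iterative optimization procedure.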
1.3.3 Joint Estimation of Clutter and Objects’ Spatial Layout (Chap-
ter 4)
This chapter focuses on volumetric reasoning for indoor scenes. We live in a
three-dimensional world where objects interact with each other according to a rich
set of physical, geometrical and spatial constraints. Therefore, merely recognizing
objects or segmenting an image into a set of semantic classes does not always provide
a meaningful interpretation of the scene and its properties. A better understanding
of real-world scenes requires a holistic perspective, exploring both semantic and 3D
structures of objects as well as the rich relationship among them [79, 275, 129, 309].
To this end, one fundamental task is that of the volumetric reasoning about generic
3D objects and their 3D spatial layout.
Estimating objects’ spatial layout and identifying clutter are two important
tasks for understanding indoor scenes. We propose to solve both of these problems in
a joint framework using RGBD images of indoor scenes. In contrast to recent ap-
proaches which focus on either one of these two problems, we perform ‘fine grained
structure categorization’ by predicting all the major objects and simultaneously
labeling the cluttered regions. A conditional random field model is proposed to in-
corporate a rich set of local appearance, geometric features and interactions between
the scene elements. We take a structured learning approach with a 3D localisation
loss to estimate the model parameters from a large annotated RGBD dataset,
and a mixed integer linear programming formulation for inference. We demonstrate
that the proposed approach is able to detect cuboids and estimate cluttered re-
gions across many different object and scene categories in the presence of occlusion,
illumination and appearance variations.
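Schematically (with notation chosen here for illustration; Chapter 4 gives the exact
potentials), the joint model scores a set of cuboid hypotheses $\mathbf{z}$ and superpixel
clutter labels $\mathbf{s}$ together:
\[
E(\mathbf{z},\mathbf{s}\mid\mathbf{x}) \;=\; \sum_{k}\phi^{cub}_k(z_k\mid\mathbf{x})
\;+\; \sum_{i}\phi^{sp}_i(s_i\mid\mathbf{x})
\;+\; \sum_{i,k}\phi^{comp}_{ik}(s_i,z_k\mid\mathbf{x}),
\]
where the first two sums capture appearance and geometric evidence for individual cuboids
and superpixels, and the compatibility terms couple the two sets of labels. Inference over
this energy is cast as a mixed integer linear program, and the weights are learned with a
structured max-margin objective using the 3D localisation loss.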
1.3.4 A Discriminative Representation of Convolutional Features (Chap-
ter 5)
This chapter proposes a novel method that captures the discriminative aspects of
an indoor scene to correctly predict its semantic category (e.g., bedroom or kitchen).
This categorization can greatly assist in context-aware object and action recognition,
object localization, and robotic navigation and manipulation [292, 284]. However,
due to the large variabilities between images of the same class and the confusing sim-
ilarities between images of different classes, the automatic categorization of indoor
scenes represents a very challenging problem [219, 292].
This chapter presents a novel approach that exploits rich mid-level convolutional
features to categorize indoor scenes. Traditional convolutional features retain the
global spatial structure, which is a desirable property for general object recognition.
We, however, argue that the structure-preserving property of the CNN activations
is not of substantial help in the presence of large variations in scene layouts, e.g., in
indoor scenes. We propose to transform the structured convolutional activations to
another highly discriminative feature space. The representation in the transformed
space not only incorporates the discriminative aspects of the target dataset but also
encodes the features in terms of the general object categories that are present in
indoor scenes. To this end, we introduce a new large-scale dataset of 1300 object
categories that are commonly present in indoor scenes. The proposed approach
achieves a significant performance boost over previous state-of-the-art approaches
on five major scene classification datasets.
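The overall pipeline can be sketched in a few lines of Python. The sketch below is only
illustrative: the feature extractor is a stand-in for real ConvNet activations, and the
codebook construction (plain k-means here) and encoding (similarities to codebook entries)
only approximate the scene-representative-patch machinery detailed in Chapter 5.

```python
import numpy as np

def extract_patches(image, patch=32, stride=16):
    """Densely extract square patches from a (H, W, 3) image at a single scale.
    (Chapter 5 uses multiple scales; one scale keeps the sketch short.)"""
    H, W, _ = image.shape
    return np.array([image[y:y + patch, x:x + patch].ravel()
                     for y in range(0, H - patch + 1, stride)
                     for x in range(0, W - patch + 1, stride)])

def featurize(patches, dim=64, seed=0):
    """Stand-in for convolutional activations: a fixed random projection.
    In the real system these would be mid-level CNN features per patch."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((patches.shape[1], dim)) / np.sqrt(patches.shape[1])
    return patches @ proj

def kmeans(X, k=16, iters=20, seed=0):
    """Tiny k-means to build a codebook of 'representative patch' centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers

def encode_image(image, codebook):
    """Image-level descriptor: max similarity of its patch features to each codebook entry."""
    feats = featurize(extract_patches(image))
    sims = -np.linalg.norm(feats[:, None] - codebook[None], axis=-1)
    return sims.max(axis=0)

# Toy usage: build a codebook from random 'training' images, then encode one image.
train = [np.random.rand(128, 128, 3) for _ in range(4)]
codebook = kmeans(np.vstack([featurize(extract_patches(im)) for im in train]))
descriptor = encode_image(np.random.rand(128, 128, 3), codebook)  # fed to a linear classifier
```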
1.3.5 Cost-Sensitive Learning of Deep Feature Representations from Imbalanced Data (Chapter 6)
This chapter tackles the class imbalance problem in classifier learning. Class
imbalance is a common problem in the case of real-world object detection, classifi-
cation and segmentation tasks. The data of some classes is abundant making them
an over-represented majority, while data of other classes is scarce, making them an
under-represented minority. This skewed distribution of class instances forces the
classification algorithms to be biased towards the majority classes. As a result, the
characteristics of the minority classes are not adequately learned.
In this work, we propose a cost sensitive deep neural network which can auto-
matically learn robust feature representations for both the majority and minority
classes. During training, the learning procedure jointly optimizes the class depen-
dent costs and the neural network parameters. The proposed approach is applicable
to both binary and multi-class problems without any modification. Moreover, as
opposed to data level approaches for class imbalance, we do not alter the original
data distribution which results in a lower computational cost during the training
process. We report the results of our experiments on six major image classification
datasets and show that the proposed approach significantly outperforms the baseline
algorithms. Comparisons with popular data sampling techniques and cost sensitive
classifiers demonstrate the superior performance of the proposed method.
1.3.6 Weakly Supervised Change Detection in a Pair of Images (Chapter 7)
This chapter handles the weakly supervised learning to simultaneously detect
and localise changes. Identifying changes of interest in a given set of images is a
fundamental task in computer vision with numerous applications in fault detection,
disaster management, crop monitoring, visual surveillance, and scene understanding
(or analysis) in general.
Conventional change detection methods use strong supervision and therefore
require a large number of images to learn background models. The few recent ap-
proaches that attempt change detection between two images either use handcrafted
features or depend strongly on tedious pixel-level labeling by humans.
In this chapter, we present a weakly supervised approach that needs only image-
level labels to simultaneously detect and localize changes in a pair of images. To
this end, we employ a deep neural network with DAG topology to learn patterns of
change from image-level labeled training data. On top of the initial CNN activations,
we define a CRF model to incorporate the local differences and the dense connections
between individual pixels. We apply a constrained mean-field algorithm to estimate
the pixel-level labels, and use the estimated labels to update the parameters of the
CNN in an iterative EM framework. This enables imposing global constraints on
the observed foreground probability mass function. The evaluations on four large
benchmark datasets demonstrate superior detection and localization performance.
1.3.7 Forest Change Detection in Incomplete Satellite Images with Deep Convolutional Networks (Chapter 8)
The last chapter of this dissertation deals with the data recovery and change
detection problem in multi-temporal satellite imagery. Land cover change detection
and analysis is highly important for ecosystem management and socio-economic
studies at regional, national and international scale. In particular, forest change de-
tection is crucial for continuous environmental monitoring required to closely inves-
tigate pressing environmental issues such as natural resource depletion, biodiversity
loss and deforestation. It can also provide critical information to help in disaster
management, policy making, area planning and efficient land management.
In this study, we have analysed data from remote sensing satellites to detect
forest changes over a period of 17 years (1999-2015). Since the original data suf-
fers from severe artifacts, we first devise a pre-processing mechanism to recover
the missing surface reflectance information. The data filling process makes use of
accurate data available in nearby time instances followed by sparse reconstruction
based de-noising. To detect interesting changes, we build a multi-resolution profile
of an area and generate a refined set of bounding boxes enclosing potential change
regions. In contrast to competing methods which use hand-crafted feature represen-
tations, we use feature representations automatically learned by a deep
neural network. Based on these highly discriminative features, our method auto-
matically detects forest changes and predicts their onset and offset timings. The proposed
approach achieves state-of-the-art results compared to several competitive baseline
procedures. We also qualitatively analyzed the changes detected in the unlabeled
regions, and found the predictions from our approach to be accurate in most cases.
CHAPTER 2
Integrating Geometrical Context for Semantic
Labeling of Indoor Scenes using RGBD Images1
Things are not always as they seem; the first appearance deceives many.
Plato (Phaedrus, 370 BC)
Abstract
Inexpensive structured light sensors can capture rich information from indoor
scenes, and scene labeling problems provide a compelling opportunity to make use
of this information. In this chapter we present a novel Conditional Random Field
(CRF) model to effectively utilize depth information for semantic labeling of indoor
scenes. At the core of the model, we propose a novel and efficient plane detection
algorithm which is robust to erroneous depth maps. Our CRF formulation defines
local, pairwise and higher order interactions between image pixels. At the local level,
we propose a novel scheme to combine energies derived from appearance, depth and
geometry-based cues. The proposed local energy also encodes the location of each
object class by considering the approximate geometry of a scene. For the pairwise
interactions, we learn a boundary measure which defines the spatial discontinuity
of object classes across an image. To model higher-order interactions, the proposed
energy treats smooth surfaces as cliques and encourages all the pixels on a surface
to take the same label. We show that the proposed higher-order energies can be
decomposed into pairwise sub-modular energies and efficient inference can be made
using the graph-cuts algorithm. We follow a systematic approach which uses struc-
tured learning to fine-tune the model parameters. We rigorously test our approach
on SUN3D and both versions of the NYU-Depth database. Experimental results
show that our work achieves superior performance to state-of-the-art scene labeling
techniques.
Keywords : scene parsing, graphical models, geometric reasoning, structured learn-
ing.
2.1 Introduction
1Published in International Journal of Computer Vision (IJCV), pp 1-20, Springer, 2015. A
preliminary version of this research was published in Proceedings of the European Conference on
Computer Vision (ECCV), pp. 679-694. Springer, 2014.
The main goal of scene understanding is to equip machines with human-like
visual interpretation and comprehension capabilities. A fundamental task in this
process is that of scene labeling, which is also well-known as scene parsing. In this
task, each of the smallest discrete elements in an image (pixels or voxels) is assigned
a semantically-meaningful class label. In this manner, the scene labeling problem
unifies the conventional tasks of object recognition, image segmentation, and multi-
label classification [53]. A high-performance scene labeling framework is useful for
the design and development of context-aware personal assistant systems, content-
based image search engines and domestic robots, among several other applications.
From a scene-labeling viewpoint, scenes can broadly be classified into two groups:
indoor and outdoor. The task of indoor scene labeling is relatively difficult in com-
parison to its outdoor counterpart [218]. There are many different types of indoor
scenes (e.g. consider a corridor, a bookstore or a kitchen), and it is non-trivial to
handle them all in a unified way. Moreover, in contrast to common outdoor scenes,
indoor scenes more often contain illumination variations, clutter and a variety of
objects with imbalanced representations. In many outdoor scenes, common classes
(e.g. ground, sky and vegetation) do not exhibit much variability, whereas objects
in indoor scenes can change their appearance significantly between different images
(e.g. a bed may change appearance due to different bedsheets). Such difficulties can
prove challenging when performing scene labeling purely from color (RGB) images.
However, with the advent of consumer-grade sensors such as the Microsoft Kinect
that capture co-registered color (RGB) and depth (D) images of indoor scenes, a
much richer source of information has become available [85]. A number of popu-
lar and relevant databases e.g., NYU-Depth [241], RGBD Kinect [143] and SUN3D
[291] have been acquired using the Kinect sensor. These notable efforts have opened
the door to the development of improved schemes for labeling indoor scenes from
RGBD images.
Various recent works have focused on the use of RGBD images for labeling in-
door scenes. [132] used KinectFusion [106] to create a 3D point cloud and then
densely labeled it using a Markov Random Field (MRF) model. [241] provided a
Kinect-based dataset for indoor scene labeling and achieved decent semantic labeling
performance using a Conditional Random Field (CRF) model with SIFT features
and 3D location priors. Although they showed that depth information has signifi-
cant potential to improve scene labeling performance, their own work was limited to
depth-based features and priors, and did not explore the possibilities of effectively
utilising the scene geometry or exploiting long-range interactions between pixels.
In this work, we develop a novel depth-based geometrical CRF model to efficiently
and effectively incorporate depth information in the context of scene labeling. We
propose that depth information can be used to explore the geometric structure of
the scene, which in turn will help with the scene labeling task. We propose to in-
corporate depth information in all the components of our hierarchical probabilistic
model (unary, pairwise and higher-order). Our model uses both intensity and depth
information for efficient segmentation.
For the purpose of integrating depth information, we begin with the modifica-
tion of unary potentials. First, we incorporate geometric information in the most
important energy of our CRF model, namely the appearance energy. In this local
energy, we encode both appearance and depth-based characteristics in the feature
space. These features are used to predict the local energies in a discriminative fash-
ion. Note that in general, man-made environments contain a lot of flat structures,
because they are easier to manufacture than curved ones. Therefore we extract
planes, which are the fundamental geometric units of indoor scenes, using a new
smoothness constraint based ‘region growing algorithm’ (see Sec. 2.5). Compared
to other plane detection methods (e.g., [221, 242]), our method is robust to large
holes which can potentially appear in the Kinect’s depth maps (Sec. 2.5). The ge-
ometric as well as the appearance based characteristics of these planar patches are
used to provide unary estimates. We propose a novel ‘decision fusion scheme’ to
combine the pixel and planar based unary energies. This scheme first uses a number
of contrasting opinion pools and finally combines them using a Bayesian framework
(see Sec. 2.3.1). Next, we consider the location based local energy that encodes
the possible spatial locations of all classes. Along with the conventional 2D location
prior, we propose to use the planar regions in each image to channelize the location
energy (see Sec. 2.3.1).
Our approach also incorporates depth information in the pairwise and higher-
order clique potentials. We propose a novel ‘spatial discontinuation energy’ in the
pairwise smoothness model. This energy combines evidence from several edge de-
tectors (such as depth edges, contrast based edges and different super-pixel edges)
and learns a balanced combination of these, using a quadratic cost function min-
imization procedure based on the manually segmented images of the training set
(see Sec. 2.4.1). Finally, we propose a higher-order term in our CRF model which
is defined on cliques that encompass planar surfaces. The proposed Higher-Order
Energy (HOE) increases the expressivity of the random field model by assimilating
the geometrical context. This encourages all pixels inside a planar surface to take a
consistent labeling. We also propose a logarithmic penalty function (see Sec. 2.3.3)
and prove that the HOE can be decomposed into sub-modular energy functions (see
Appendix A).
To efficiently learn the parameters of our proposed CRF model, we use a max-
margin learning algorithm which is based on a one-slack formulation (Sec. 2.4.1).
The rest of the chapter is organized as follows. We discuss related work in the
next section and propose a random field model in Sec. 2.3. We then outline our
parameter learning procedure in Sec. 2.4. In Sec. 2.5, the details of our proposed
geometric modeling approach are presented. We evaluate and compare our proposed
approach with related methods in Sec. 2.6, and the final section concludes the
chapter.
2.2 Related Work
The use of range or depth sensors for scene analysis and understanding is increas-
ing. Recent works employ depth information for various purposes e.g., semantic seg-
mentation [132], object grasping [223, 125], door-opening [220] and object placement
[112]. For the case of semantic labeling, works such as [241, 242] demonstrate the
potential depth information has to help with vision-related tasks. However, they do
not go beyond the depth-based features or priors. In this chapter, we show how to
incorporate depth information into the various components of a random field model
and then evaluate the contribution made by each component in enhancing semantic
labeling performance [129]. Our framework is particularly inspired by the works
on semantic labeling of RGBD data [241, 242], considering long-range interactions
[131], parametric learning [253, 262] and geometric reconstruction [221].
The scene parsing problem has been studied extensively in recent years. Prob-
abilistic graphical models, e.g. MRFs and CRFs, have been successfully applied to
model context and provide a consistent labeling [91, 75, 154, 98]. Some of these
methods, e.g. [75], work on a pixel grid, whilst others perform inference at the
super-pixel level [98]. [91] combined local, regional and global cues to formulate
multi-scale CRFs to address the image labeling problem. Hierarchical MRFs are
employed in [141] to perform joint inference on pixels and super-pixels. [98] trained
their CRF on separate clusters of similar scenes and used the clusters with standard
CRF to label street images. [241] showed that when segmenting RGBD data, it is
possible to achieve better results by making use of all the available channels (includ-
ing depth) than by relying on RGB alone. They used features extracted from the
depth channel and a 3D location prior to incorporate depth information. However,
the question of how to incorporate depth information in an optimal manner remains
unanswered and warrants further investigation. Moreover, although works such as
[241, 294] use depth-based features to enhance segmentation performance, they do
not incorporate depth information into the higher-order components of the CRF.
Another important challenge in scene labeling is to take account of long-range
context in the scene when making local labeling decisions. [53] extracted dense
features at a number of scales and thereby encoded multiple regions of increasing
size and decreasing resolution at each pixel location. Other works have incorporated
long-range context by generating a number of segmentations at various scales (of-
ten arranged as trees) to propose many possible labelings (e.g., [141, 27]). HOEs
have been employed to model long-range smoothness [131], shape-based information
[164, 77], cardinality-based potential [280] and label co-occurrences [142]. While
densely-connected pairwise models such as [133] are suitable for fine-grained seg-
mentation, indoor scenes rarely require such full connectivity because most of the
candidate classes exhibit definite boundaries unlike e.g. trees or cat fur. In contrast
to previously-proposed HOEs, we propose using the geometrical structure of the
scenes to model high-level interactions.
Currently popular parameter estimation methods include partition function
approximations [239], cross validation [239] or simply hand picked parameters [241].
We used a one-slack formulation [117] of the parameter learning technique of [253],
which gives a more efficient optimization of the cost function compared to the n-
slack formulation employed in [262, 253]. Further, we extend the parameter estima-
tion problem to consider multiple edge-based energies and learn parameters using a
quadratic program.
Our geometric reconstruction scheme is close to the one used by [294] to
create semantic 3D models of indoor scenes and the smoothness constraint-based
segmentation technique of [221]. Whilst both these schemes use data from accurate
laser scanners, we improved their algorithm to make it suitable to tackle the less
accurate depth data acquired by a low-cost Microsoft Kinect sensor that operates
in real time. Our proposed algorithm relaxes the smoothness constraint in missing
depth regions and considers more reliable appearance cues to define planar surfaces.
2.3 Proposed Conditional Random Field Model
As a prelude to the development of a hierarchical appearance model and a HOE
defined over planes (Fig. 2.1), we first outline briefly the conditional random field
model and its components. We use a CRF to capture the conditional distribution

Figure 2.1: The figure summarizes our proposed approach to combine global geometric
information with low-level cues. Only limited graph nodes are shown for the purpose
of a clear illustration. (The original diagram depicts the Conditional Random Field
model with its unary potential, composed of a pixel-based and geometry-based
appearance potential and a proximity-based and geometry-based location potential;
its pairwise potential, composed of class transition and spatial transition potentials;
and its higher-order potential defined over detected planar surfaces, together with
the automatic learning of the CRF model's potentials and parameters.)

of output classes given an input image. The CRF model takes into consideration
the color, location, texture, boundaries and layout of pixels to reason about a set of
semantically-meaningful classes. The CRF model is defined on a graph composed of
a set of vertices V and a set of edges E . We want the model to capture not only the
interactions between direct neighbours in the graph, but also long-range interactions
between nodes that form part of the same planar regions (Fig. 2.2). To achieve this,
we treat our problem as a graphical probabilistic segmentation process in which a
graph G(I) = 〈V , E〉 is defined over an image I [14].
The set of vertices V represents individual pixels in a graph defined on I. If
the set cardinality (#V) is T then the vertex set represents all the pixels: V =
{pi : i ∈ [1,T]}. Similarly, E represents a set of edges which connect adjacent
vertices in G(I). These edges are undirected based on the assumption of conditional
independence between the nodes. The goal of multi-class image labeling is to
segment an image I by labeling each pixel pi with its correct class label `i ∈ L. The
set of all possible classes is given by L = {1, ..., L} and the total number of classes
is #L = L.
If the estimated labeling of an image I is represented by a vector y, where
y = (yi : i ∈ [1,T]) ∈ LT is composed of discrete random variables associated with
each vertex in G(I), we have the likelihood of labeling y decomposed into node and
maximal clique potentials as follows:
P(y|x; w) = \frac{1}{Z(w)} \prod_{i \in V} \theta_u^{w_u}(y_i, x) \prod_{\{i,j\} \in E} \theta_p^{w_p}(y_{ij}, x) \prod_{c \in C} \theta_c^{w_c}(y_c, x)   (2.1)
where, x denotes the observations made from an image I, Z(w) is a normalizing
constant known as the partition function, w represents a vector which parametrizes
the model and wu, wp and wc are the components of w which parametrize the
unary, pairwise and higher-order potential functions. The variables yi, yij and yc
represent the labeling over node i, pairwise clique {i, j} and the higher-order clique
c respectively. The potential functions associated with yi, yij and yc are denoted by
θu, θp and θc, respectively. The conditional distribution in Eq. 2.1 for each possible
labeling y ∈ LT can be represented by an exponential formulation in terms of Gibbs
energy: P(y|x; w) = \frac{1}{Z(w)} \exp(-E(y, x; w)). This energy can be defined in terms of
log-likelihoods:
E(y, x; w) = -\log(P(y|x; w)\, Z(w))   (2.2)
= \sum_{i \in V} \psi_u(y_i, x; w_u) + \sum_{\{i,j\} \in E} \psi_p(y_{ij}, x; w_p) + \sum_{c \in C} \psi_c(y_c, x; w_c).   (2.3)
These three terms in Eq. 2.3, into which the Gibbs energy has been decomposed
(using Eq. 2.1), are called the unary, pairwise and higher order energies respectively
(Fig. 2.2). These energies are related to the potential functions defined in Eq.
2.1 by: \theta_k^{w_k}(y_k, x) = \exp(-\psi_k(y_k, x; w_k)) with k ∈ {u, p, c}. We will describe the
unary, pairwise and higher order energies in Sec. 2.3.1, Sec. 2.3.2 and Sec. 2.3.3,
respectively.
In the inference stage, the most likely labeling is chosen using Maximum a Pos-
teriori (MAP) estimation over possible labelings y ∈ L^T, and denoted y*:
y^* = \operatorname{argmax}_{y \in L^T} P(y|x; w)   (2.4)
Since the partition function Z(w) does not depend on y, Eq. 2.4 can be reformulated
as an energy minimization problem, as follows:
y^* = \operatorname{argmin}_{y \in L^T} E(y, x; w)   (2.5)
The parameter vector w, introduced in Eq. 2.1, is learnt using a max-margin
criterion (see Sec. 2.4.1 for details).
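To make the decomposition in Eqs. 2.3-2.5 concrete, the following Python sketch (an illustrative toy only; the function names, the 4-node chain graph and the simple penalty functions are our own simplifications, not the implementation used in this chapter) evaluates the Gibbs energy of a candidate labeling as a sum of unary, pairwise and higher-order terms:

import numpy as np

def gibbs_energy(y, unary, edges, pairwise_fn, cliques, clique_fn):
    """Evaluate E(y, x; w) as the sum of unary, pairwise and clique terms (Eq. 2.3).

    y           : (T,) integer labeling, one label per pixel/node
    unary       : (T, L) array, unary[i, l] = unary energy of assigning label l to node i
    edges       : list of (i, j) index pairs in the graph
    pairwise_fn : function (y_i, y_j, i, j) -> pairwise energy value
    cliques     : list of index arrays, one per higher-order clique (planar region)
    clique_fn   : function (labels in the clique) -> higher-order energy value
    """
    e_unary = unary[np.arange(len(y)), y].sum()
    e_pair = sum(pairwise_fn(y[i], y[j], i, j) for i, j in edges)
    e_hoe = sum(clique_fn(y[c]) for c in cliques)
    return e_unary + e_pair + e_hoe

# Toy example: 4 nodes, 3 labels, a chain graph and one clique over all nodes.
rng = np.random.default_rng(0)
unary = rng.random((4, 3))
edges = [(0, 1), (1, 2), (2, 3)]
potts = lambda yi, yj, i, j: 0.0 if yi == yj else 1.0                      # class transition term
robust = lambda labels: 0.5 * (len(labels) - np.bincount(labels).max())   # soft label consistency
y_hat = unary.argmin(axis=1)                                              # unary-only MAP guess
print(gibbs_energy(y_hat, unary, edges, potts, [np.arange(4)], robust))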
2.3.1 Unary Energies
The unary energy in Eq. 2.3 is further decomposed into two components, an ap-
pearance energy and a location energy (Fig. 2.1):
\sum_{i \in V} \psi_u(y_i, x; w_u) = \sum_{i \in V} \overbrace{\phi_i(y_i, x; w_u^{app})}^{appearance} + \sum_{i \in V} \overbrace{\phi_i(y_i, i; w_u^{loc})}^{location}   (2.6)
We describe both terms in the following sections.
Proposed Appearance Energy
The proposed appearance energy (first term) in Eq. 2.6 is defined over the pixels
and the planar regions (Fig. 2.1). We use the class predictions defined over the
planar regions to improve the posterior defined over the pixels. In other words,
planar features are used to reinforce beliefs for some dominant planar classes (e.g.,
walls, blinds, floor and ceiling). To combine the local appearance and the geometric
information, we use a hierarchical ensemble learning method (Fig. 2.3). Our tech-
nique combines two axiomatic ensemble learning approaches; linear opinion pooling
(LOP) and the Bayesian approach. Note that we have outputs from a pixel based
classifier which operates on pixels, and a planar regions based classifier which works
on planar regions.
Figure 2.2: A factor graph representation for our CRF model. The bottom layer
represents pixels and the top layer represents planar regions. Each circle represents
a latent class variable while black boxes represent terms in the CRF model (Eq. 2.3).
With these outputs, we first fuse them using a simple LOP which
produces a weighted combination of both classifier outputs,
P(y_i | x_1, \ldots, x_m) = \sum_{j=1}^{m} \kappa_j P_j(y_i | x_j),   (2.7)
where x_j denotes the representation of an image in different feature spaces, P_j denotes
the probability of a class y_i given a feature vector x_j, \kappa_j : j ∈ [1, m] denotes
the weights and m = 2. Note that instead of using a single set of weights, we use
multiple configurations of weights, each with a small component of random noise,
to obtain several contrasting opinions. After unifying beliefs based on contrasting
opinions, the Bayesian rule is used to combine them in the subsequent stage. To try
a number of weighting options (r configurations of weights κ) to generate contrasting
opinions o = [P(yi|x)κT]r, we can represent our ensemble of probabilities as2,
P(y_i | o_1, \ldots, o_r) = \frac{P(o_1, \ldots, o_r | y_i)\, P(y_i)}{P(o_1, \ldots, o_r)}.
Since o_1, \ldots, o_r are independent measurements given y_i, we have,
P(y_i | o_1, \ldots, o_r) = \frac{P(o_1 | y_i) \cdots P(o_r | y_i)\, P(y_i)}{P(o_1, \ldots, o_r)}.
Again applying the Bayes rule and after simplification we get,
P(y_i | o_1, \ldots, o_r) = \rho\, \frac{P(y_i | o_1) \cdots P(y_i | o_r)}{P(y_i)^{r-1}}.   (2.8)
2In this work we set r = 3 and κ is set to [0.25, 0.75], [0.5, 0.5] and [0.75, 0.25] respectively in
each case. This choice is based on the validation set (see Sec. 7.5.3).
Here, P(yi) is the prior and ρ is a constant which depends on the data [50] and is
given by
\rho = \frac{P(o_1) \cdots P(o_r)}{P(o_1, \ldots, o_r)}.
The appearance energy is therefore defined by:
\phi_i(y_i, x; w_u^{app}) = w_u^{app} \log P(y_i | o_1, \ldots, o_r),   (2.9)
where w_u^{app} is the parameter of the appearance energy. This energy is dependent
on the output of two Randomized Decision Forest (RDF) classifiers which give the
posterior probabilities P(yi|xi). These classifiers capture the important characteris-
tics of an image using a set of features, which encode information about the shape,
the texture, the context and the geometry. The appearance energy proves to be the
most important one for the scene labeling problem as shown in the results section
(Sec. 2.6).
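As a rough illustration of Eqs. 2.7 and 2.8, the sketch below (our own simplification; the toy posteriors are invented, and normalising the fused distribution stands in for computing the constant ρ explicitly) pools a pixel-based and a plane-based posterior under several weight configurations and then combines the resulting opinions with the Bayesian rule:

import numpy as np

def fuse_posteriors(pixel_post, plane_post, prior, weight_configs):
    """Linear opinion pooling (Eq. 2.7) followed by Bayesian fusion (Eq. 2.8).

    pixel_post, plane_post : (L,) class posteriors from the two classifiers
    prior                  : (L,) class prior P(y_i)
    weight_configs         : list of (kappa_pixel, kappa_plane) pairs (r of them)
    """
    # Contrasting opinions o_1, ..., o_r obtained from different pooling weights.
    opinions = [k1 * pixel_post + k2 * plane_post for k1, k2 in weight_configs]
    r = len(opinions)
    # P(y | o_1..o_r) is proportional to prod_k P(y | o_k) / P(y)^(r-1); normalise at the end.
    fused = np.prod(opinions, axis=0) / np.maximum(prior, 1e-12) ** (r - 1)
    return fused / fused.sum()

prior = np.full(3, 1.0 / 3)
pixel_post = np.array([0.6, 0.3, 0.1])   # local appearance favours class 0 (e.g. 'sink')
plane_post = np.array([0.2, 0.7, 0.1])   # planar geometry favours class 1 (e.g. 'floor')
weights = [(0.25, 0.75), (0.5, 0.5), (0.75, 0.25)]   # r = 3, as in the footnote above
print(fuse_posteriors(pixel_post, plane_post, prior, weights))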
Features for Local Appearance Energy:
The local appearance energy is modeled in a discriminative fashion using a trained
classifier (RDF in our case). We extract features densely at each point and then
aggregate them at the super-pixel level using a simple averaging operation. It
must be noted that the feature aggregation is done on the super-pixels in order to
reduce the computational load and to ensure that similar pixels are modeled by a
unified representation in the feature space. The super-pixels are obtained using the
Felzenszwalb graph-based segmentation method [57]. We use a scale of 10 with a
minimum region size of 200 pixels. This parameter selection is based on prior tests
which were performed on a validation set (Sec. 7.5.3).
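A minimal sketch of this aggregation step, assuming scikit-image is available (the random feature maps below merely stand in for the dense descriptors described next), could look as follows:

import numpy as np
from skimage.segmentation import felzenszwalb

def aggregate_over_superpixels(image, dense_features, scale=10, min_size=200):
    """Average dense per-pixel features over Felzenszwalb super-pixels.

    image          : (H, W, 3) float RGB image in [0, 1]
    dense_features : (H, W, F) per-pixel feature maps
    Returns (segment map, per-segment mean feature matrix of shape (S, F)).
    """
    segments = felzenszwalb(image, scale=scale, min_size=min_size)
    n_seg = segments.max() + 1
    flat_seg = segments.ravel()
    flat_feat = dense_features.reshape(-1, dense_features.shape[-1])
    sums = np.zeros((n_seg, flat_feat.shape[1]))
    np.add.at(sums, flat_seg, flat_feat)                      # accumulate features per segment
    counts = np.bincount(flat_seg, minlength=n_seg)[:, None]
    return segments, sums / np.maximum(counts, 1)

image = np.random.rand(120, 160, 3)
features = np.random.rand(120, 160, 64)                       # stand-in dense descriptors
segments, sp_features = aggregate_over_superpixels(image, features)
print(segments.max() + 1, sp_features.shape)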
A rich feature set is extracted which includes local binary patterns (LBP) [197],
texton features [239], SPIN images [118], scale invariant feature transform (SIFT)
[176], color SIFT, depth SIFT and histogram of gradients (HOG) [43]. These low-
level features help in differentiating between the distinct classes commonly found
in indoor scenes. LBP is a strong texture classification feature which captures the
relation between a pixel and its neighbors in the form of an encoded binary word.
LBP is extracted from a 10x10 region around a pixel and the normalized histogram
is converted to a 59 dimensional vector. For the calculation of texton features, we
first convolve the image with a filter bank of even and odd symmetric oriented energy
kernels at four different scales (0.5, 0.6, 0.72, 0.86) with four different orientations
( 0, 0.79, 1.57 and 2.35 radians). The Gaussian second derivative and the Hilbert
transform of the Gaussian second derivative are used as the even and odd symmetric
filters respectively. This creates a filter-bank consisting of a total of 32 filters of

Figure 2.3: Effect of the Ensemble Learning Scheme: At the pixel location shown in
the figure, the posterior predicted by the local appearance model favors the class Sink.
On the other hand, the planar regions based appearance model takes care of the
geometrical properties of the region and favors the class Floor. The rightmost bar
plot shows how our proposed ensemble learning scheme picks the correct class decision.
(Panels: (a) data cost predicted by the pixel based classifier, (b) data cost predicted
by the plane based classifier, (c) class distribution after fusion of posteriors using the
proposed ensemble learning scheme.) (Best viewed in color)

varying sizes (11x11, 13x13, 15x15 and 17x17). Next, image pixels are grouped into
k = 32 textons by clustering the filter-bank responses into 32 groups. This gives a
96 dimensional vector which is composed of filter responses.
SPIN images are extracted by considering a radius of r = 8 around a pixel with
8 bins. This gives us a 64 dimensional vector. SIFT descriptors of length 128 are
extracted on a 40x40 patch both for the case of simple SIFT and depth SIFT. We
followed the same procedure as detailed in [241] to calculate the depth SIFT. To
incorporate the color information into the local SIFT, we use the opponent angle,
hue and spherical angle method of [264]. The parameters are set in a way similar
to [264] and this gives a 111 dimensional vector. We extract a 36 dimension
HOG feature vector on a 4x4 region quantized into 9 orientation bins. The HOG
is computed by finding gradients separately for each color channel and including
only the maximum magnitude gradient among all channel gradients. In the final
histogram, all gradients are quantized by their orientation and weighted by their
magnitude. Trilinear interpolation is used to place each gradient in the appropriate
spatial and orientation bin.
These features form a high dimensional space (~640 dimensions) and it becomes
computationally intensive to train the classifier with all these features. Moreover,
some of these features are redundant while some others have a lower accuracy. We
therefore employ the genetic search algorithm from the Weka attribute selector tool
[82] to find the most useful set of 256 features on the validation set (Sec. 7.5.3). This
feature subset selection effectively reduces the classifier training time to one third
of what it was originally. Also, the performance of the lower-dimensional feature
vector is comparable to that of the original feature set, e.g., on the validation set
from NYU v1, we noted only 0.03% decrease in accuracy.
Features for Appearance Model on Planes:
One of the most important features is the plane orientation which is characterized by
the direction of its normal. We include the area and height (maximum z-axis value)
of the planar region in the feature set to characterise its extent and position. Since
these measures may vary significantly and a relative measure is needed, we normalize
each value with respect to the largest instance in the scene. Color histograms in the
HSV and CIE LAB color spaces are also included. The responses to various filters
are calculated and aggregated at the planar level (in the same manner as textons).
The RDF classifier is trained using these features and used to predict the posterior
on planar regions.
Unary Classifiers:
Separate RDF classifiers are trained, one for the extracted local features on super-
pixels and the other for the planar regions. The RDF classifier creates an ensemble
of trees during the training phase and combines their outputs for predictions [24].
For our purpose, we directly obtain the class probabilities P(yi|x) by averaging
the decisions over all trees. We use the RDF classifiers to predict the unary cost
(Eq. 2.9) in the CRF model (Fig. 2.2) because of their efficiency and inherent
multi-class classification ability. We trained both RDFs with 100 trees and 500
randomly-sampled variables as candidates at each split. This configuration was set
empirically taking into account the trade-off between reasonable performance and
efficient training of the RDFs.
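For reference, a comparable setup with an off-the-shelf random forest could be sketched with scikit-learn as below (this is not the exact implementation used in this chapter; the training arrays are synthetic placeholders, and the number of candidate split features is capped at the dimensionality of the synthetic features):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for the selected super-pixel features and their class labels.
rng = np.random.default_rng(0)
X_train = rng.random((2000, 256))
y_train = rng.integers(0, 13, size=2000)          # 13 semantic classes (NYU v1 setting)

# 100 trees; a fixed number of candidate features is examined at each split
# (capped here at the true dimensionality of the synthetic features).
rdf = RandomForestClassifier(n_estimators=100, max_features=min(500, X_train.shape[1]))
rdf.fit(X_train, y_train)

# Class posteriors P(y_i | x) come from averaging the per-tree votes; the
# appearance term in Eq. 2.9 is w_u^app times the log of these probabilities.
X_test = rng.random((5, 256))
posteriors = rdf.predict_proba(X_test)
log_posteriors = np.log(np.maximum(posteriors, 1e-12))
print(log_posteriors.shape)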
Proposed Location Energy
The unary location prior (second term) in Eq. 2.6 models the class label distribu-
tion based on the location of the pixels in an image. This energy is useful during
the segmentation process since it encodes the probability of the spatial presence of
a class. The location energy is defined for each class and every pixel location in the
image plane:
\phi(y_i, i; w_u^{loc}) = w_u^{loc} \log F_{loc}(y_i, i),   (2.10)
where w_u^{loc} parameterises the location energy and the function F_{loc}(y_i, i) is depen-
dent on both the location and the geometry of a pixel (Fig. 2.1).
Our formulation of Floc(yi, i) is based on the idea that the location of a class
(which has a characteristic geometric orientation) can further be made specific if any
geometric information about the scene is available. For example, it is highly unlikely
to have a bed or floor at some locations in an image, where we know a vertical plane
exists. Therefore, we seek to minimize the location prior on the regions where the
geometric properties of an object class do not match with observations made from
the scene. First, we average the class occurrences over the ground truth of the
training set for each class (yi) [241, 239]. This can be represented by the ratio of
the class occurrences at the ith location to the total number of occurrences:
F_{loc}(y_i, i) = \frac{N_{\{y_i, i\}} + \alpha}{N_i + \alpha},   (2.11)
where α is a constant which corresponds to the weak Dirichlet prior on the location
energy [239]. Next, we incorporate the geometric information into the location prior.
For this, we extract the planar regions, which occur in an indoor scene, and divide
them into two distinct geometrical classes: horizontal and vertical regions. Since
the Kinect sensor gives the pitch and roll for each image, the intensity and depth
Figure 2.4: Learning Location Prior using Geometrical Context: (a) Original image.
(b) The normal location prior for wall is shown. (c) It shows how the prior (b) is
combined with the planar information to channelize the general location information
of a class by considering the scene geometry. Note that white color in (b) and (c)
shows high probability.
images in the NYU-Depth dataset are rotated appropriately to remove any affine
transformations. This positions the horizon (estimated using the accelerometer)
horizontally at the center of each image. We use this horizon to split the horizontal
geometric class into two subclasses, the ‘above-horizon’ and ‘below-horizon’ regions.
For each planar object class, we retain the 2D location prior in the regions where
the geometric properties of the class match with those of the planar region, and
decrease its value by a constant factor in the regions where that class cannot be
located. For example, the roof cannot lie on a horizontal plane in the below-horizon
region or a vertical region. This effectively reduces the class location prior to only
those regions which are consistent with the geometrical context. It must be noted
that this elimination procedure is only carried out for planar classes, e.g., roof, floor,
bed and blinds. After that, the location prior is smoothed using a Gaussian filter and
the actual prior distribution is normalized in such a way that a uniform distribution
across different classes is obtained. The prior distribution is normalized to give
\sum_i F_{loc}(y_i, i) = 1/L, where L is the total number of classes. Examples of the
resulting location priors are shown in Fig. 2.4.
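A simplified rendering of this construction (Eq. 2.11 followed by the geometric masking, smoothing and normalization) is sketched below; the binary plausibility masks and the constant down-weighting factor are illustrative assumptions, not the exact procedure of this chapter:

import numpy as np
from scipy.ndimage import gaussian_filter

def location_prior(gt_maps, n_classes, geometry_mask, alpha=1.0, penalty=0.1):
    """Class/location prior F_loc (Eq. 2.11), channelized by scene geometry.

    gt_maps       : (N, H, W) integer ground-truth label maps from the training set
    geometry_mask : (L, H, W) binary masks, 1 where class l is geometrically plausible
                    (e.g. derived from the detected horizontal/vertical planar regions)
    """
    n, h, w = gt_maps.shape
    prior = np.zeros((n_classes, h, w))
    for l in range(n_classes):
        counts = (gt_maps == l).sum(axis=0)                  # class occurrences at each location
        p = (counts + alpha) / (n + alpha)                   # Eq. 2.11 with weak Dirichlet prior
        p = np.where(geometry_mask[l] > 0, p, penalty * p)   # suppress implausible regions
        prior[l] = gaussian_filter(p, sigma=3)               # spatial smoothing
    # Normalize so that each class distribution sums to 1/L over the image plane.
    prior /= prior.sum(axis=(1, 2), keepdims=True) * n_classes
    return prior

gt = np.random.randint(0, 4, size=(20, 60, 80))
mask = np.ones((4, 60, 80))
print(location_prior(gt, 4, mask).shape)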
2.3.2 Pairwise Energies
The pairwise energy in Eq. 2.3 is defined on the edges E (Fig. 2.2). This energy
is defined in terms of an edge-sensitive Potts model [23],
\psi_p(y_{ij}, x; w_p) = w_p^T \phi_{p_1}(y_i, y_j)\, \phi_{p_2}(x).   (2.12)
The first function (φp1) is a class transition energy and the second one (φp2) is
the spatial discontinuation energy. These functions are defined in the following
two subsections.
Class Transition Energy
The class transition energy in Eq. 2.12 is a simple zero-one indicator function which
enforces a consistent labeling. The function is defined as:
\phi_{p_1}(y_i, y_j) = a \cdot 1_{y_i \neq y_j} = \begin{cases} 0 & \text{if } y_i = y_j \\ a & \text{otherwise} \end{cases}
For this work we used a = 10. This parameter selection was based on the validation
set (Sec. 7.5.3).
Proposed Spatial Discontinuation Energy
The spatial discontinuation energy in Eq. 2.12 encourages label transitions at nat-
ural boundaries in the image [239, 227]. It is defined as a combination of edges
from the intensity image, depth image and the super-pixel edges extracted using
Mean-shift [64] and Felzenszwalb [57] segmentation: \phi_{p_2}(x) = w_{p_2}^T \phi_{edges}(x). Weights
assigned to each edge-based energy are learned using a quadratic program (see Sec.
2.4.1). In simple terms, edges which match with the manual annotations to a large
extent contribute more to the energy \phi_{p_2}. The edge-based energy is given by:
\phi_{edges}(x) = \Big[\, \beta_x \exp\big(-\tfrac{\sigma_{ij}}{\langle\sigma_{ij}\rangle}\big),\ \beta_d \exp\big(-\tfrac{\sigma^d_{ij}}{\langle\sigma^d_{ij}\rangle}\big),\ \beta_{sp\text{-}fw} F_{sp\text{-}fw}(x),\ \beta_{sp\text{-}ms} F_{sp\text{-}ms}(x),\ \alpha \,\Big]^T,   (2.13)
where \sigma_{ij} = \|x_i - x_j\|_2, \sigma^d_{ij} = \|x^d_i - x^d_j\|_2 and \langle\cdot\rangle denotes the average contrast in
an image. x_i and x^d_i denote the color and depth image pixels respectively. F_{sp\text{-}ms}
and F_{sp\text{-}fw} are indicator functions which give all zeros except at the boundaries of
the Mean-shift [64] or Felzenszwalb [57] super-pixels respectively. The output is a
constant α = 1 allows a bias to be learned to remove small isolated parts during the
segmentation process. For our case, we set βx = βd = 150 and βsp-ms = βsp-fw = 5
based on the validation set (see Sec. 7.5.3).
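A minimal sketch of the resulting pairwise term (Eq. 2.12 with a shortened version of the edge vector in Eq. 2.13, keeping only the colour contrast, depth contrast and bias components) is given below; the numeric weights are illustrative stand-ins, not learned values:

import numpy as np

def pairwise_energy(yi, yj, xi, xj, xdi, xdj, w_p2, a=10.0,
                    beta_x=150.0, beta_d=150.0, mean_c=1.0, mean_d=1.0):
    """Edge-sensitive Potts energy: phi_p1(y_i, y_j) times a weighted edge vector (Eq. 2.12).

    xi, xj   : colour values of the two neighbouring pixels
    xdi, xdj : their depth values
    w_p2     : weights for the edge-based energies (components of Eq. 2.13)
    """
    if yi == yj:                       # class transition term phi_p1 is zero-one
        return 0.0
    sigma = np.linalg.norm(np.asarray(xi, float) - np.asarray(xj, float))
    sigma_d = abs(float(xdi) - float(xdj))
    # Shortened version of Eq. 2.13: colour contrast, depth contrast, constant bias.
    phi_edges = np.array([beta_x * np.exp(-sigma / mean_c),
                          beta_d * np.exp(-sigma_d / mean_d),
                          1.0])
    return a * float(np.dot(w_p2, phi_edges))

w_p2 = np.array([0.02, 0.05, 0.1])     # illustrative edge weights
print(pairwise_energy(0, 1, [0.2, 0.2, 0.2], [0.8, 0.7, 0.6], 1.2, 1.5, w_p2))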
2.3.3 Proposed Higher-Order Energies
A useful strategy to enhance the representational power of a CRF model is to
introduce high-order energies (Eq. 2.1). These energies are dependent on a relatively
large number of dimensions of the output labeling vector y and therefore incorporate
long-range interactions (Fig. 2.2). HOEs try to eliminate inconsistent variables in a
Figure 2.5: Robust Higher-Order Energy: When the number of inconsistent nodes in
a clique increases, the penalty term defined over the clique increases in a logarithmic
fashion.
clique. On the other hand, these energies try to encourage all the variables in a clique
to take the dominant label. The robust P^n model [131] poses this encouragement
in a soft manner while the P^n Potts model [130] presents this requirement in a
hard fashion. In the robust P^n model some pixels in a clique may retain different
labelings. Hence, it is a linear truncated function of the number of inconsistent
variables in a clique. We define our proposed HOE which works in a similar manner
as the robust HOE [131]:
\psi_c(y_c, x; w_c) = w_c \min_{\ell \in L} F_c(\tau_c),   (2.14)
where F_c(\cdot) is a function which takes the number of inconsistent pixels \tau_c = \#c - n_\ell(y_c) as its argument. Here, n_\ell is a function which computes the number of pixels
in clique c taking the label \ell. The non-decreasing concave function F_c is defined
as: F_c(\tau_c) = \lambda_{max} - (\lambda_{max} - \lambda_\ell)\exp(-\eta\tau_c), where \eta = \eta_0/Q_\ell and \eta_0 = 5 (Fig. 2.5).
Here η0 is the slope parameter which decides the rate of increase of the penalty,
with the increase in the number of pixels disagreeing with the dominant label. The
parameters λmax and λ` define the penalty range which is typically set to 1.5 and
0.15 respectively. Q` is the truncation parameter which provides the bound for
the maximum number of disagreements in a clique. The higher-order cliques are
formed using the depth-based segmentation method (Sec. 2.5). Details about the
disintegration of the HOE (Eq. 2.14) are given in Appendix A to describe how the
graph cuts algorithm can be applied.
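The behaviour of the higher-order penalty (Eq. 2.14 and Fig. 2.5) can be checked with a few lines of code; the sketch below uses the parameter values quoted above and assumes, purely for illustration, a truncation parameter proportional to the clique size:

import numpy as np

def clique_penalty(labels, lam_max=1.5, lam_l=0.15, eta0=5.0, trunc_ratio=0.3):
    """Robust higher-order penalty of Eq. 2.14 for one planar clique.

    labels : integer labels of all pixels in the clique (one planar region)
    """
    labels = np.asarray(labels)
    counts = np.bincount(labels)
    best = np.inf
    for l, n_l in enumerate(counts):            # min over candidate dominant labels
        tau = labels.size - n_l                 # number of pixels disagreeing with label l
        q_l = max(1.0, trunc_ratio * labels.size)   # truncation parameter Q_l (assumed form)
        eta = eta0 / q_l
        f_c = lam_max - (lam_max - lam_l) * np.exp(-eta * tau)
        best = min(best, f_c)
    return best

# The penalty grows and saturates as the number of inconsistent pixels increases.
clique = np.zeros(100, dtype=int)
for n_bad in (0, 5, 20, 60):
    noisy = clique.copy()
    noisy[:n_bad] = 1
    print(n_bad, round(clique_penalty(noisy), 3))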
2.4 Structured Learning and Inference
The task of indoor scene labeling involves making joint predictions over many
complex yet correlated and structured outputs. The CRF model defined in the
previous section (Sec. 2.3) explicitly models the correlations over the output space
and performs approximate inference at test time. However, the CRF model con-
tains a number of energies, parametrized by weights which we learn using an S-SVM
formulation. The learning procedure is outlined as follows.
2.4.1 Learning Parameters
Unary, pairwise and higher order terms (Eq. 2.3 and Fig. 2.1, 2.2) in the CRF
model introduce many parameters which need a more principled tuning procedure
rather than simple hand-picked values, cross validation learning or a piecewise train-
ing mechanism. In this work, we use a structured large-margin learning method
(S-SVM) to efficiently adjust the probabilistic model parameters. Instead of using
an n-slack formulation of the cost function, we use a single slack formulation, which
results in more efficient learning [117]. Given N training images, the training set
can be represented in the form of ordered pairs of image data x and labelings y:
T = {(xn,yn), n ∈ [1, . . . , N ]}. If ξ ∈ R+ is a single slack variable, the following
margin re-scaled cost function is solved to compute the parameter vector w∗:
(w^*, \xi^*) = \operatorname{argmin}_{w, \xi}\ \frac{1}{2}\|w\|^2 + C\xi   (2.15)
subject to:
\frac{1}{N} \sum_{n=1}^{N} \big[ E(y, x^n; w) - E(y^n, x^n; w) \big] \geq \frac{1}{N} \sum_{n=1}^{N} \Delta(y, y^n) - \xi   (2.16)
\forall n \in [1..N],\ \forall y \in L : y \neq y^n,\quad C > 0,
w_i \geq 0 : \forall w_i \in \{w\} \setminus w_u,
where, C is the regularization constant, ∆(y,yn) is the Hamming loss function
and the parameter vector w consists of the appearance energy weight (w_u^{app}), the
location energy weight (w_u^{loc}), the pairwise energy weight (w_p) and the weight for
HOE (wc). Due to the large number of constraints in Eq. 2.16, a cutting plane
algorithm ([117], Algorithm 4) is used for training which only considers the most
violated constraints to solve our optimization problem. It can be proved that the
algorithm converges after O(1/ε) steps with the guarantee that the objective value
(once the final solution is reached) differs by at most ε from the global minimum
[262]. The two major steps in this algorithm are the quadratic optimization step,
which is solvable by off-the-shelf convex optimization problem solvers and the loss-
augmented prediction step, which can be solved by graph cuts.
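The flavour of this one-slack cutting-plane procedure can be conveyed on a toy structured problem (a heavily simplified sketch: the energy is linear in w, the label space is tiny so loss-augmented inference is brute force, and the inner quadratic program is handed to a generic solver rather than a dedicated one; none of the variable names come from the actual implementation):

import itertools
import numpy as np
from scipy.optimize import minimize

# Toy data: each sample has 3 nodes and 2 labels; the energy is w dot joint_features(y, x).
rng = np.random.default_rng(1)
X = [rng.random((3, 2)) for _ in range(4)]          # per-node, per-label unary features
Y = [x.argmin(axis=1) for x in X]                   # "ground-truth" labelings

def joint_features(x, y):
    # Summed unary feature of the chosen labels, plus a label-disagreement count.
    unary = x[np.arange(len(y)), y].sum()
    pairwise = np.sum(y[:-1] != y[1:])
    return np.array([unary, pairwise])

def hamming(y, yn):
    return np.sum(y != yn)

def most_violated(w):
    """Average loss-augmented constraint over the training set (one-slack formulation)."""
    feats, loss = np.zeros(2), 0.0
    for x, yn in zip(X, Y):
        best, best_val = None, -np.inf
        for y in itertools.product(range(2), repeat=3):       # brute-force inference
            y = np.array(y)
            val = hamming(y, yn) - w @ (joint_features(x, y) - joint_features(x, yn))
            if val > best_val:
                best, best_val = y, val
        feats += joint_features(x, best) - joint_features(x, yn)
        loss += hamming(best, yn)
    n = len(X)
    return feats / n, loss / n

C, working_set, w = 10.0, [], np.zeros(2)
for it in range(20):
    dphi, dloss = most_violated(w)
    working_set.append((dphi, dloss))
    # Solve min 0.5*||w||^2 + C*xi  s.t.  w @ dphi >= dloss - xi  for all cached constraints.
    def objective(z):
        return 0.5 * np.dot(z[:2], z[:2]) + C * z[2]
    cons = [{'type': 'ineq', 'fun': (lambda z, d=d, l=l: z[:2] @ d - l + z[2])}
            for d, l in working_set]
    cons.append({'type': 'ineq', 'fun': lambda z: z[2]})      # slack xi must stay non-negative
    res = minimize(objective, np.zeros(3), constraints=cons)
    w = res.x[:2]
print("learned weights:", w)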
Once suitable parameters for the CRF are learned, the parameters for the edge-
based energies are learned which results in a balanced representation of each edge
in the pairwise energy. In our approach, instead of a simple contrast-based energy,
we define a weighted combination of various possible edge-based energies (such as
based on depth edges, contrast-based edges, super-pixels edges) to accommodate
information from all these sources (see Sec. 2.3.2 and Eq. 2.13). We start with a
heuristic-based initialization and iterate over the training samples to learn a more
balanced representation between the different edge-based energies. The weights for
edges are restrained to be non-negative so that the energy remains sub-modular.
This condition is necessary because the graph cuts based exact inference methods
can be applied only to sub-modular energy minimization problems.
We use structured learning to learn weights for the spatial discontinuation energy
(Sec. 2.3.2). The corresponding quadratic program is given as follows:
\operatorname{argmax}_{\|w_{p_2}\| = 1}\ \gamma   (2.17)
s.t.\ \{E_{con}, E_{dep}, E_{fel\text{-}sp}, E_{ms\text{-}sp}\} - E_{grd} \geq \gamma,\quad \{w_{p_2}\} \geq 0,
where, Egrd is the energy when the spatial discontinuation energy is based on the
manually identified edges from the training images. Energies for the case when the
spatial discontinuation energy is based on image contrast, image depth, Felzenszwalb
or mean-shift super-pixels are represented as E_{con}, E_{dep}, E_{fel\text{-}sp} or E_{ms\text{-}sp} respectively.
The cost function given in Eq. 2.17 is optimized in a similar way to that described in
([117], Algorithm 4). After learning, it turns out that the contrast and depth-based
edge energies are more reliable and therefore play a dominant role in the spatial
discontinuation energy.
2.4.2 Inference in CRF
Once the CRF energies have been learned along with their parameters, the next
step is to find the most probable labeling. As discussed earlier in Sec. 2.3, this turns
out to be an energy minimization problem (Eq. 2.5). Since our energy function is
sub-modular, this energy minimization problem can be solved via the expansion
move algorithms (alpha-expansion or alpha-beta swap graph cuts algorithm) of [22].
The main idea is to decompose the energy minimization problem into a series of
binary minimization problems which can themselves be solved efficiently. The al-
gorithm starts with an arbitrary initial labeling and at each step the move is only
made if it results in an overall minimization of the cost function [23, 22].
2.5 Planar Surface Detection
Indoor environments are predominantly composed of structures which can be
decomposed into planar regions, such as walls, ceilings, cupboards and blinds. These
flat surfaces are easier to manufacture and thus appear frequently in man-made
environments (Sec. 2.6.2). We extract the dominant planes which best fit the
sparse point clouds of indoor images (obtained from RGBD data) and use them in
our model-based representation (Fig. 2.1). It must be noted that the depth images
produced by a Kinect contain many missing values e.g., along the outer boundaries
of an image or when the scene contains a black or a specular surface. Traditional
plane detection algorithms (e.g. [242, 221]) either make use of dense 3D point clouds
or simply ignore the missing depth regions. In contrast, we propose an efficient
plane detection algorithm which is robust to missing depth values (often termed as
holes) in the Kinect depth map. We expect that the inference made on the improved
planar regions will help us achieve a better semantic labeling performance (see Sec.
2.6.2).
Our method3 first aligns the 3D points with the principal directions of the room.
Next, surface normals are computed at each point. Contiguous points in space are
then clustered by a region growing algorithm (Algorithm 1) which groups the 3D
points in a way that maintains their continuity and smoothness. It is robust to erro-
neous normal orientations caused by big holes mostly present along the borders
of the depth image acquired via the Kinect sensor (Fig. 2.7). The basic idea is to make
use of appearance-based cues when the depth information is not reliable. The algo-
rithm begins with a seed point and at each step, a region is grown by including the
points in the current region with normals pointing in the same direction. Iteratively,
the region is extended and the newly included points are treated as seeds in the sub-
sequent iteration. To deal with erroneous sensor measurements along the border
and any other regions with missing depth measurements, we relax the smoothness
constraint and use major line segments present in the image to decide about the
region continuity.
3Plane detection code is available at author's webpage: http://www.csse.uwa.edu.au/~salman
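A rough sketch of the first two steps (normal estimation from the depth map and clustering of normal orientations, cf. Algorithm 1) is given below; the finite-difference normals and the fixed cluster count are our own simplifications and not the exact procedure of this chapter:

import numpy as np
from sklearn.cluster import KMeans

def depth_to_normals(depth):
    """Approximate per-pixel surface normals from a depth map via finite differences."""
    dzdx = np.gradient(depth, axis=1)
    dzdy = np.gradient(depth, axis=0)
    normals = np.dstack([-dzdx, -dzdy, np.ones_like(depth)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return normals

def cluster_normals(normals, valid_mask, n_clusters=8):
    """Group pixels with similar normal orientations (step 3 of Algorithm 1, simplified)."""
    labels = np.full(normals.shape[:2], -1, dtype=int)        # -1 marks missing-depth pixels
    pts = normals[valid_mask].reshape(-1, 3)
    labels[valid_mask] = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(pts)
    return labels

depth = np.random.rand(60, 80) + 1.0
valid = np.ones_like(depth, dtype=bool)
valid[:, :5] = False                                          # pretend a hole near the border
labels = cluster_normals(depth_to_normals(depth), valid)
print(np.unique(labels))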
Figure 2.6: An illustrative example showing the results of the planar surface detec-
tion algorithm. An original image (a) and its depth map (b) are used as inputs to
the algorithm which uses appearance (c) and depth-based cues (d) to provide an
initial (e) and a final segmentation map (f).
Performance Evaluation
Method         | EPC Acc.     | E+NPC Acc.
[242]          | 0.69 ± 0.09  | 0.67 ± 0.10
[221]          | 0.60 ± 0.12  | 0.57 ± 0.14
This chapter   | 0.76 ± 0.09  | 0.81 ± 0.07

Timing Comparison (averaged for NYU v2)
(for Matlab prog. running on single core, thread)
[242]: 41 sec   | [221]: 73 sec   | This chapter: 3.1 sec

Table 2.1: Comparison of plane detection results on the NYU-Depth v2 dataset. We
report detection accuracies for 'exactly planar classes' (EPC) and 'exact and nearly
planar classes' (E+NPC). Efficiency of the proposed method is also compared with
related approaches.
Algorithm 1 Region Growing Algorithm for Depth-Based Segmentation
Input: Point cloud = {P}, Depth map = {D}, RGB image = {I}, Edge matching
threshold eth, Normalized boundary matching threshold bth
Output: Labeled planar regions = {R}
1: Calculate point normals: {N} ← Fnormal(D)
2: Remove inconsistencies by low-pass filtering: {Nsm} ← N ∗ ksm // ksm is the smoothing kernel
3: Cluster 3D points with similar normal orientations: {Nclu} ← Fk-means(Nsm)
4: Initialize: R ← Nclu
5: Line segment detector: {L} ← FLSD(I)
6: Diffused line map: {Lsm} ← L ∗ k′sm
7: Identify planar regions with missing depth values: {M} ← Fholes(Nclu, D)
8: Find adjacency links for each cluster in Nclu: Aclu
9: Identify all unique neighbors of clusters in M: Unb
10: From Unb, separate correct and faulty clusters into Ncor and Ninc respectively
11: Initialize available cluster list: Lavl ← Ncor
12: Initialize label propagation list: Lprp ← ∅
13: while list Lavl is not empty do
14:   Randomly draw a cluster from available Ncor: ridx
15:   Identify ridx neighbors (Nr-idx) with faulty depth values using Aclu and M
16:   for each neighbor nr-idx in Nr-idx do
17:     Find mutual boundary (bm) of ridx and nr-idx
18:     Calculate edge strength at bm using Lsm: estr
19:     Calculate normalized boundary matching cost: bstr = bm / Area of nr-idx
20:     if estr < eth ∧ bstr > bth then
21:       Add nr-idx to Ncor, add nr-idx to Lavl
22:       Remove ridx from Lavl, remove nr-idx from Ninc
23:       Update Lprp with ridx and nr-idx. If nr-idx was previously replaced, use the updated value.
24:   Remove ridx from Lavl
25: for any leftover clusters in Ninc do
26:   Randomly draw a cluster from available Ninc: r′idx
27:   Execute similar steps (from line 15 to 24) for r′idx
28: Update R according to Lprp
29: return {R}

Figure 2.7: Comparison of our algorithm (last row) with [242] (middle row) is shown.
Note that the white color in the middle row shows non-planar regions. The last row
shows detected planes averaged over super-pixels. Results show that our algorithm is
more accurate especially near the outer boundaries of the scene. (Best viewed in color)

The line segment detector (LSD) [272] is used to extract the major line segments.
These line segments are grouped according to their vanishing points. Line segments
in the direction of the major vanishing points contribute more in separating re-
gions during the smoothness constraint-based plane detection process. However, we
found empirically that the use of any simple edge detection method (e.g., Canny edge
detector) in our algorithm gives nearly identical performance with much better effi-
ciency. We further increased the efficiency by replacing iterative region growing with
k-means clustering for regions having valid depth values. The planar patches are
grown from regions with valid depth values towards regions having missing depths.
In this process, segmentation boundaries are predominantly defined by the appear-
ance based edges in an image. Since the majority of the pixels have correct orienta-
tion, fitting a plane decreases the orientation errors and the approximate orientation
of major surfaces is retained. An added benefit of our algorithm is that curved sur-
faces are approximated by planes rather than missed out during the region-growing
process.
Once the regions have been grown to their full extent, small regions are dropped,
and only regions with a significant number of pixels are retained. After that, planes
are fitted onto the set of points belonging to each region using TLS (Total Least
Square) fitting. Least-square plane fitting is a non-linear problem, but it reduces to
an eigenvalue problem in the case of planar patches. This makes the plane fitting
process highly efficient. It is important to note that although indoor surfaces are not
strictly limited to planes, we assume that we are dealing with planar regions during
the plane fitting process. It turns out that this assumption is not a hard constraint
since the majority of the surfaces in an indoor environment are either strictly planar
(e.g., walls, ceilings) or nearly planar (e.g., beds, doors).
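The plane fitting step itself reduces to a small eigenvalue problem, as in the generic total-least-squares sketch below (not the exact code of this chapter):

import numpy as np

def fit_plane_tls(points):
    """Total least squares plane fit to an (N, 3) array of 3D points.

    Returns (unit normal n, centroid c) such that the plane is n . (p - c) = 0.
    The normal is the eigenvector of the scatter matrix with the smallest
    eigenvalue, i.e. the direction of least variance.
    """
    centroid = points.mean(axis=0)
    centered = points - centroid
    cov = centered.T @ centered
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    normal = eigvecs[:, 0]
    return normal, centroid

# Noisy samples from the plane z = 0.2x - 0.1y + 3.
rng = np.random.default_rng(0)
xy = rng.uniform(-1, 1, size=(500, 2))
z = 0.2 * xy[:, 0] - 0.1 * xy[:, 1] + 3 + 0.01 * rng.standard_normal(500)
normal, centroid = fit_plane_tls(np.column_stack([xy, z]))
print(normal / normal[2])       # approximately proportional to (-0.2, 0.1, 1)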
We show a qualitative comparison of our approach with other plane detection
techniques in Fig. 2.7. Note that our approach provides a depth-based segmentation
and then fits planes to the approximate geometry of the region (3rd row, Fig. 2.7).
This makes it possible to identify better planar region candidates compared to [242]
(2nd row, Fig. 2.7). We show a quantitative performance and efficiency comparison
in Table 2.1. For the performance evaluation, we report the achieved accuracy when
a valid planar region was identified for a strictly planar semantic class (EPC, Table
2.1). To quantify the validity of a detected planar region, we check its alignment with
the three dominant and perpendicular room directions. We also report the accuracy
with which a valid planar region was identified for the exactly (e.g., walls, ceilings)
and nearly planar (e.g., blinds, beds) semantic classes (E+NPC, Table 2.1). The
results demonstrate that our algorithm is superior to other region growing algorithms
(e.g., [221]) which are suitable for the segmentation of dense point clouds and fail
to deal with erroneous depth measurements from the Kinect sensor (Table 2.1).
2.6 Experiments and Analysis
2.6.1 Datasets
We evaluated our framework on the NYU-Depth datasets (v1 and v2) and the
SUN3D dataset. All these are recent RGBD datasets for indoor scenes acquired
using the Microsoft Kinect structured light sensor. The NYU-Depth dataset is
the only one of its kind and comes with manual annotations acquired via Amazon
Mechanical Turk. The dataset comes in two releases. The first version (v1) of
NYU-Depth [241] consists of 64 different indoor scenes categorized into 7 major
scene types and contains 2284 labeled frames. The second version (v2) of NYU-
Depth [242] consists of 464 different indoor scenes classified into 26 major scene
types and contains 1449 labeled frames. SUN3D is a large-scale indoor RGBD video
dataset [291]; however, it is still under development and only a small portion has
been labeled. We extracted labeled key-frames from the SUN3D database which
amounted to 83 images. We evaluated our method on the labeled portions of the
NYU v1, v2 and SUN3D datasets.
2.6.2 Results
In the NYU-Depth v1 dataset, around 1400 different object classes are present
in all indoor scenes. Since not all object classes have a sufficient representation, we
follow the procedure in [241] to cluster the existing annotations into the 13 most
frequently occurring classes. This clustering is performed using the Wordnet Natural
Language Toolkit (NLTK). In the NYU-Depth v2 dataset, around 900 different
object classes are present overall. We used a similar procedure to cluster existing
annotations into the 22 most frequently occurring classes. Moreover, we report
results on 40 classes to show how our performance compares when the number of
semantic classes is increased. For the SUN3D dataset, 32 classes are present
in the labeled images we acquired. We clustered them into 13 major classes using
Wordnet. In all three datasets, a supplementary class labeled ‘other ’ is also included
to model rarely-occurring objects. In our evaluations, we exclude all unlabeled
regions. For all the three datasets, roughly a train/test split of 60%/40% was used.
A relatively small validation set consisting of 50 random images was extracted from
each dataset (except for SUN3D where we used the parameters of NYU-Depth v1).

Figure 2.8: Examples of the semantic labeling results on the NYU-Depth v1 dataset.
The top row shows the intensity images, the bottom row are the ground truths and
the middle row are our labeling results. The representative colors are shown in the
figure legend at the bottom. Our framework performs well including the case of some
unlabeled regions. (Best viewed in color)

Table 2.2: Results on the NYU-Depth v1, v2 and the SUN3D Datasets: We report the results of our proposed framework when only the unary energy was used (top 3 rows) and report the improvements observed when more sophisticated priors and HOEs (last row) were added. Accuracies are reported for 13, 22 and 13 class semantic labelings for the NYU v1, v2 and SUN3D datasets, respectively. The best performance is achieved by combining unary, pairwise and HOEs in the CRF framework.

Variants of Our Method                     NYU-Depth v1                  NYU-Depth v2                  SUN3D
                                           Global Acc.     Class Acc.    Global Acc.     Class Acc.    Global Acc.     Class Acc.
Feature Ensemble (FE)                      52.8 ± 13.3%    53.4%         44.4 ± 15.8%    39.2%         41.9 ± 11.1%    40.0%
FE + PAM (single opinion)                  60.9 ± 13.3%    60.2%         51.1 ± 15.6%    41.5%         47.6 ± 11.3%    41.8%
FE + Planar Appearance Model (PAM)         63.3 ± 13.1%    62.7%         52.5 ± 15.5%    42.4%         48.3 ± 11.5%    42.6%
FE + PAM + Location Prior (2D)             65.2 ± 13.4%    63.5%         53.6 ± 15.6%    42.8%         48.9 ± 11.7%    42.8%
FE + PAM + Planar Location Prior (PLP)     68.6 ± 13.8%    65.0%         55.3 ± 15.8%    43.1%         51.5 ± 11.9%    43.3%
FE + PAM + PLP + CRF                       70.5 ± 13.8%    66.5%         58.0 ± 16.0%    44.9%         53.7 ± 12.1%    44.4%
FE + PAM + PLP + CRF (HOE)                 70.6 ± 13.8%    66.5%         58.3 ± 15.9%    45.1%         54.2 ± 12.2%    44.7%
Table 2.3: Class-wise Accuracies on NYU-Depth v1: Mean class and global accuracies are also reported. Our proposed framework performs very well on the planar classes (e.g., ‘wall’, ‘television’, ‘ceiling’).

Class         Class Freq. (%)   This chapter (%)
Bed           1.3               66.8
Blind         3.7               67.7
Bookshelf     13.4              47.5
Cabinet       7.7               72.6
Ceiling       3.7               79.2
Floor         11.3              67.8
Picture       4.7               53.4
Sofa          2.5               75.1
Table         4.6               69.3
Television    0.6               78.6
Wall          26.2              86.2
Window        1.0               62.0
Other         2.4               38.1
Unlabeled     18.1              -
Mean Class Accuracy: 66.5%    Mean Pixel Accuracy: 70.6%
Table 2.4: Class-wise Accuracies on NYU-Depth v2 (22 classes): Mean class and global accuracies are also reported. Our proposed framework performs very well on the planar classes (e.g., ‘wall’, ‘door’, ‘floor’).

Class         Class Freq. (%)   This chapter (%)
Bed           4.7               32.3
Blind         2.0               56.9
Bookshelf     4.2               38.3
Cabinet       10.7              45.6
Ceiling       1.4               64.7
Floor         10.8              75.8
Picture       2.2               43.6
Sofa          6.2               58.6
Table         2.6               47.9
Television    0.5               45.7
Wall          22.8              77.5
Window        2.3               54.0
Counter       2.7               43.8
Person        1.7               38.8
Books         0.9               34.0
Door          2.3               58.3
Clothes       1.7               37.2
Sink          0.3               23.1
Bag           1.7               28.4
Box           0.8               35.7
Utensils      0.2               22.6
Other         0.1               29.9
Unlabeled     17.4              -
Mean Class Accuracy: 45.1%    Mean Pixel Accuracy: 58.3%
Table 2.5: Class-wise Accuracies on the NYU-Depth v2 (40 classes): Mean class and global accuracies are also reported. Our proposed framework performs very well on the planar classes (e.g., ‘wall’, ‘ceiling’, ‘whiteboard’).

Class            Class Freq. (%)   This chapter (%)
Wall             21.4              65.7
Floor            9.1               62.5
Cabinet          6.2               40.1
Bed              3.8               32.1
Chair            3.3               44.5
Sofa             2.7               50.8
Table            2.1               43.5
Door             2.2               51.6
Window           2.1               49.2
Bookshelf        1.9               36.3
Picture          2.1               41.4
Counter          1.4               39.2
Blinds           1.7               55.8
Desk             1.1               48.0
Shelves          1.0               45.2
Curtain          1.1               53.1
Dresser          0.9               55.3
Pillow           0.8               50.5
Mirror           1.0               46.1
Floormat         0.7               54.1
Clothes          0.7               35.4
Ceiling          1.4               50.6
Books            0.6               39.1
Refrigerator     0.6               53.6
Television       0.5               50.1
Paper            0.4               35.4
Towel            0.4               39.9
Shower curtain   0.4               41.8
Box              0.3               36.3
Whiteboard       0.3               60.6
Person           0.3               35.6
Nightstand       0.3               32.5
Toilet           0.3               31.8
Sink             0.3               22.5
Lamp             0.3               26.3
Bathtub          0.3               38.5
Bag              0.2               37.3
Other structure  3.8               45.7
Other furniture  2.5               24.9
Other props      2.2               29.1
Unlabeled        17.4              -
Mean Class Accuracy: 43.9%    Mean Pixel Accuracy: 50.7%
Figure 2.9: Examples of semantic labeling results on the NYU-Depth v2 dataset. The top row shows the intensity images, the bottom row are the ground truths and the middle row are our labeling results. The representative colors are shown in the figure legend at the bottom. Our framework performs well, including the case of some unlabeled regions. (Best viewed in color)
This validation set was used with the genetic search algorithm (Sec. 2.3.1) for
the selection of useful features and for the choice of the initial estimates of the
parameters which give the best performance. Afterwards, these parameters were
optimized during the learning process as described in Sec. 2.4.1.
We use two popular evaluation metrics to assess our results, ‘global accuracy’ and ‘class accuracy’ (see Table 2.2). Global accuracy measures the proportion of super-pixels in the test set which are correctly classified. Class accuracy is the mean of the per-class correct prediction rates, which is equal to the mean of the values occurring along the diagonal of the row-normalized confusion matrix. We extensively
evaluated our approach on both versions of the NYU-Depth dataset and on the
SUN3D dataset. Our experimental results are reported in Tables 2.2, 2.3, 2.4 and
2.5. Comparisons with state-of-the-art techniques are reported in Tables 2.6, 2.7,
2.8 , 2.9 and 2.10. Sample labelings for NYU-Depth v1 and v2 and SUN3D are
presented in Figs. 2.8, 2.9 and 2.10 respectively. Although the unlabeled portions
in the annotated images are not considered during our evaluations, we observed that
the labeling scheme mostly predicts accurate class labels (see Figs. 2.8 and 2.9).
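For concreteness, the following small numpy sketch (our own reference implementation, not the thesis code) computes both metrics from a confusion matrix whose rows are ground-truth classes and whose columns are predictions.

```python
import numpy as np

def global_and_class_accuracy(confusion):
    """confusion[i, j] = number of super-pixels of true class i predicted as class j."""
    confusion = np.asarray(confusion, dtype=float)
    # Global accuracy: fraction of all super-pixels that are correctly classified.
    global_acc = np.trace(confusion) / confusion.sum()
    # Class accuracy: mean of the per-class recalls, i.e. the mean of the diagonal
    # of the row-normalised confusion matrix.
    per_class = np.diag(confusion) / confusion.sum(axis=1)
    class_acc = per_class.mean()
    return global_acc, class_acc

conf = np.array([[80, 10, 10],
                 [5, 90, 5],
                 [20, 20, 60]])
print(global_and_class_accuracy(conf))  # (0.766..., 0.766...)
```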
Ablation Study
We report our results in terms of average pixel and class accuracies in Table 2.2.
The first row shows the performance when a simple unary energy defined on pixels
using an ensemble of features is used. We achieve pixel and class accuracies of
52.8% and 53.4% respectively on NYU-Depth v1. The corresponding accuracies
for NYU-Depth v2 and SUN3D are 44.4%, 39.2% and 41.9%, 40.0% respectively.
Starting from this baseline, we were able to obtain significant improvements. Upon
the introduction of the planar appearance model, the pixel and class accuracies
increased by 10.5% and 9.3% from their previous values for NYU-Depth v1 (row
3, Table 2.2). Similarly for NYU-Depth v2, an increase of 8.1% and 3.2% is noted
for pixel and class accuracies respectively. Finally for the SUN3D database, we
achieve an increase of 6.4% and 2.6% in pixel and class accuracies respectively.
Note that a simple averaging operation on the pixel and planar appearance energies (equivalently an LOP with weights [1/2, 1/2]) gives less accurate results (row 2, Table
2.2). The addition of the CRF and the proposed location energy enforces a better label consistency, which results in an improvement of 7.2% and 3.8% for NYU-Depth v1, 5.5% and 2.5% for NYU-Depth v2, and 5.4% and 2.1% for the SUN3D dataset. The
introduction of HOEs gives a slight boost in accuracy. This is expected, since cardinality-based HOEs mainly improve segmentation accuracy for porous and fine structures, such as trees and cat fur respectively. The classes which are
considered in this work usually have solid structures with definite and well-defined
boundaries. However, when we consider the segmentation performance around the
boundary regions, the HOEs give a significant increase in accuracy (Fig. 2.11).
Comparisons
For NYU-Depth v1, we compare our framework with [241] (Table 2.6). With the
same set of classes used in [241], we achieved a 13.2% improvement in terms of
average class accuracy. We also report the average global accuracy which gives a
better absolute measurement of performance. The class-wise accuracies for NYU-
Depth v1 are shown in Table 2.3 and the complete confusion matrix is presented in
Fig. 2.12. It can be seen that we perform particularly well on planar classes such as wall,
ceiling, blinds and table.
For the case of NYU-Depth v2, we compare our framework with recent multi-
scale convolutional network based techniques [53, 39]. Whereas in [53, 39] evalu-
ations were performed on just 13 classes, we use a broader range of 22 classes to
report our results (see Table 2.4). To compare with the sofa class, we report the mean accuracy of our sofa and chair classes for a fair comparison (summing the chair and sofa class occurrences reported in [39] gives a combined class frequency that supports such a comparison). We compare the furniture class in [39]
with our cabinet class based on the details given in [39]. Overall, we get superior
performance compared to [53, 39] and also achieve best class accuracies for 19/22
classes.
On the NYU-Depth v2 dataset, [242] defined just four semantic classes: furniture,
ground, structure and props. The choice of these classes was based on the need to
infer the support relationships between objects. We evaluate our method on the
4-class segmentation task as well. As shown in Table 2.8, we achieved the best
performance overall. In particular, we performed well on planar classes such as floor
and structures. In terms of pixel and class accuracies, we noted an improvement of
2.2% and 1.3% respectively. We also compare our results with [80] in terms of the
weighted average Jaccard index (WAJI). Our system’s performance is lower than
that of [80], which is based on a very strong but computationally-expensive contour
detection technique called gPb [6] (Table 2.9). Finally, we compare our results on
a 40-class semantic labelling task (Table 2.10). We note that the RGBD version of
the R-CNN model proposed in [81] performs best. Their approach however, uses
external data (Imagenet) for pre-training and uses synthetic 3D CAD models from
the Internet to generate training data.
One may wonder why the incorporation of geometrical context in the CRF model
Figure 2.10: Examples of the semantic labeling results on the SUN3D dataset. The
top row shows the intensity images, the bottom row are the ground truths and the
middle row are our labeling results. The representative colors are shown in the figure
legend at the bottom. (Best viewed in color)
Table 2.6: Comparison of the results on the NYU-Depth v1 Dataset: With the same
set of classes used in [241], we achieve a ∼ 13% improvement in terms of average
class accuracy.
Method         Global Accuracy    Class Accuracy    Classes
[241]          59.8 ± 11.5%       53.7 ± 2.9%       13
This chapter   70.6 ± 13.8%       66.5%             13
Table 2.7: Comparison of results on the NYU-Depth v2 Dataset: With nearly two
times the number of classes used in [53, 39], we get 9% and 6% improvements in terms of average class and global accuracies, respectively.
Method         Global Accuracy    Class Accuracy    Classes
[53]           51.0 ± 15.2%       35.8%             13
[39]           52.4 ± 15.2%       36.2%             13
This chapter   58.3 ± 15.9%       45.1%             22
works and gives such high accuracies? In v1 of the NYU-Depth dataset, there are
eight out of 13 classes (cabinet, ceiling, floor, picture, table, wall, bed, blind) which
are planar and out of the remaining classes, four (tv, sofa, bookshelf, window) are
loosely planar. The planar classes correspond to 77.21% while the loosely planar
classes correspond to 22.79% of the total labeled data. Moreover, the floor, wall and other classes may have varying textures across different images; however, with depth information in place, we can determine the correct class of the object. Similarly for
v2 of the NYU-Depth dataset, there are nearly ten out of 22 classes (bed, blind,
cabinet, ceiling, floor, picture, table, wall, counter, door) which are planar and out
of the remaining classes 6 are loosely planar (tv, sofa, bookshelf, window, box, sink).
The planar classes correspond to 62.2% while the loosely planar classes correspond
to 14.3% of the total labeled data. There is a similar trend on the SUN3D database.
Timing Analysis
Our approach is efficient at test time, since the proposed graph energies are sub-
modular and approximate inference can be made using graph-cuts. Empirically, we
Table 2.8: Comparison of results on the NYU-Depth v2 Dataset (4-class labeling
task): Our method achieved best performance in terms of average pixel and class
accuracies for the 4-class segmentation task. We also get the best classification
performance on structure class.
Method         Floor   Struct.   Furn.   Prop.   Pixel Acc.   Class Acc.
[242]          68      59        70      42      58.6         59.6
[53]           68.1    87.8      51.1    29.9    63.0         59.2
[39]           87.3    86.1      45.3    35.5    64.5         63.5
[26]           87.9    79.7      63.8    27.1    67.0         64.3
This chapter   87.1    88.2      54.7    32.6    69.2         65.6
[Figure 2.11 plot: labeling error (%) versus the width of the area surrounding the boundaries (pixels), for FE, FE+PAM, FE+PAM+PLP, FE+PAM+PLP+Grid CRF and FE+PAM+PLP+CRF (HOE).]
Figure 2.11: The error rate decreases as more area surrounding the class boundaries
is considered. The introduction of HOE improves the segmentation accuracy around
the boundaries.
Table 2.9: Comparison of results on the NYU-Depth v2 Dataset (4-class labeling
task): Our method achieved the second best performance in terms of weighted
average Jaccard index (WAJI).
Perf. SC-[242] LP-[242] [226] SVM-[80] This chapter
WAJI 56.31 53.4 59.19 64.81 62.66
Table 2.10: Comparison of results on the NYU-Depth v2 Dataset (40-class labeling
task): Our method achieved second best performance in terms of weighted average
Jaccard index (WAJI).
Perf. SC-[242] [226] SVM-[80] CNN-[81] This chapter
WAJI 38.2 37.6 43.9 47.0 42.1
(a) NYU-Depth v1   (b) NYU-Depth v2   (c) SUN3D
Figure 2.12: Confusion Matrices for the NYU-Depth and SUN3D Datasets: The accuracies in each confusion matrix sum up to 100% along each row. All the class accuracies shown on the diagonal are rounded to the closest integer for clarity. (Best viewed in color)
found average testing time per image to be ∼ 1.6 sec for NYU-Depth v1, ∼ 1.7 sec
for NYU-Depth v2 and ∼ 1.4 sec for the SUN3D database. For parameter learning
on the training set, it took ∼ 17 hrs for NYU-Depth v1, ∼ 12 hrs for NYU-Depth
v2 and ∼ 45 min for the SUN3D database. The RDF training took ∼ 4 hrs, ∼ 2
hrs and ∼ 7 mins on the NYU-Depth v1, v2 and SUN3D databases respectively.
2.6.3 Discussion
It may be of interest to know why we used a hierarchical ensemble learning
scheme to combine posteriors defined on pixels and planar regions. We prefer to
use the proposed scheme because it combines the posteriors on the fly and thus saves
a reasonable amount of training time. Alternative ensemble learning methods, such as boosting and bagging, require considerable training data and training time. It must be noted that we used graph-cuts to perform approximate inference during the S-SVM training; this inference is not always exact. Moreover, only
a limited set of constraints (the working set) from the original infinite number of
constraints are used during training. These approximations can sometimes lead to
unsatisfactory performance. However, we minimized this behavior by initializing
the parameters with values that gave the best performance on the validation set.
This heuristic worked well for our case and enhanced the labeling accuracy.
It can be seen that indoor scene labeling is a challenging problem due to the
diverse nature of the scenes. The major reason for the low reported scene labeling
accuracies (see Table 2.2) is the presence of a large number of objects with varying
textures and layouts across different images. These varied appearances of objects
cause many ambiguities. Also, there are many bland regions in the scenes, which introduce an additional challenge for correct segmentation. Class errors are often due to confusion between two similar classes; e.g., as evident in the confusion matrices (Fig. 2.12), door is usually confused with wall, blind with window, sink with
counter and sofa with bed. Despite the incorporation of the geometrical context,
an unusual confusion occurs between ceiling and wall. The reason is that the depth
estimates in the regions close to the upper boundary of the scenes were not accurate
and this is the typical location where the ceiling normally occurs in the majority of
the scenes. The planes extracted in this region give a horizontal orientation (instead
of vertical) which contributes to this misclassification, aided by the fact that the
walls and ceilings usually have similar appearances.
The NYU corpus captures natural indoor scene conditions which are common
in everyday life scenarios. As an example, the dataset contains large illumination
variations (e.g., for scenes of offices, stores) which correctly capture the indoor con-
ditions. Some misclassifications are possibly due to these illumination variations
and specular surfaces e.g., the window or the reflecting mirror was confused with
the light source. Another major challenge relates to the long-tail distribution of
object categories, where a small number of categories appear frequently in indoor
scenes while others are rare. For example, the top ten most frequent classes out of a total of 894 classes in the NYU v2 dataset constitute over 65% of the total labelled data. This translates into a somewhat unbalanced dataset with an insuffi-
cient representation of many semantic classes in the training set [226]. The labeled
portion of the SUN3D database was insufficient for training (because the database
is under development). This explains why the achieved accuracies for this database
are on the low side (see Table 2.2, Fig. 2.12). The availability of more and higher
quality training data for each class will certainly improve the performance of scene
labeling frameworks. The removal of unwanted artifacts such as illumination varia-
tions and shadows can also help in improving the segmentation accuracy [124]. In
short, the challenging indoor scene classification task is far from being solved and
requires further investigation both in terms of new techniques and data for testing
and bench-marking.
2.7 Conclusion
This chapter presented a novel CRF model for semantic labeling of indoor scenes.
The proposed model uses both appearance and geometry information. The geometry
of indoor planar surfaces was approximated using a proposed robust region grow-
ing algorithm for segmentation. The approximate geometry was combined with
appearance-based information and a location prior in the unary term. A learned
combination of boundaries was used to define the spatial discontinuity across an im-
age. The proposed model also captured long-range interactions by defining cliques
on the dominant planar surfaces. The parameters of our model were learned using
a single slack formulation of the rescaled margin cutting plane algorithm. We ex-
tensively evaluated our scheme on both versions of the NYU-Depth and the recent
SUN3D database and reported comparisons and improvements over existing works.
As a future work, we will extend the proposed model to holistically reason about
indoor scenes and to understand the rich interactions between scene elements.
CHAPTER 3
Automatic Shadow Detection and Removal from
a Single Photograph1
Everything that we see is a shadow cast by that which we do not see.
Martin Luther King, Jr. (1929–1968)
Abstract
We present a framework to automatically detect and remove shadows in real
world scenes from a single image. Previous works on shadow detection put a lot
of effort in designing shadow variant and invariant hand-crafted features. In con-
trast, our framework automatically learns the most relevant features in a supervised
manner using multiple convolutional deep neural networks (ConvNets). The fea-
tures are learned at the super-pixel level and along the dominant boundaries in the
image. The predicted posteriors based on the learned features are fed to a condi-
tional random field model to generate smooth shadow masks. Using the detected
shadow masks, we propose a Bayesian formulation to accurately extract shadow
matte and subsequently remove shadows. The Bayesian formulation is based on a
novel model which accurately models the shadow generation process in the umbra
and penumbra regions. The model parameters are efficiently estimated using an
iterative optimization procedure. Our proposed framework consistently performed
better than the state-of-the-art on all major shadow databases collected under a
variety of conditions.
Keywords : Feature Learning; Bayesian shadow removal; Conditional Random
Field; ConvNets; Shadow detection; Shadow matting
3.1 Introduction
Shadows are a frequently occurring natural phenomenon, whose detection and
manipulation are important in many computer vision (e.g., visual scene understand-
ing) and computer graphics applications. As early as the time of Da Vinci, the prop-
erties of shadows were well studied [42]. Recently, shadows have been used for tasks
1Published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI),
IEEE, vol.38, no. 3, pp. 431-446, March 2016, doi:10.1109/TPAMI.2015.2462355. A preliminary
version of this research was published in the Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 1939-1946, IEEE, 2014.
related to object shape [189, 198], size, movement [123], number of light sources and
illumination conditions [234]. Shadows have a particular practical importance in
augmented reality applications, where the illumination conditions in a scene can be
used to seamlessly render virtual objects and their cast shadows. Contrary to the
above mentioned assistive roles, shadows can also cause complications in many fun-
damental computer vision tasks. For instance, they can degrade the performance of
object recognition, stereo, shape reconstruction, image segmentation and scene anal-
ysis. In digital photography, information about shadows and their removal can help
to improve the visual quality of photographs. Shadows are also a serious concern
for aerial imaging and object tracking in video sequences [216].
Despite the ambiguities generated by shadows, the Human Visual System (HVS)
does not face any real difficulty in filtering out the degradations caused by shadows.
We need to equip machines with such visual comprehension abilities. Inspired by
the hierarchical architecture of the human visual cortex, many deep representation
learning architectures have been proposed in the last decade. We draw our moti-
vation from the recent successes of these deep learning methods in many computer
vision tasks where learned features out-performed hand-crafted features [86]. On
that basis, we propose to use multiple convolutional neural networks (ConvNets) to
learn useful feature representations for the task of shadow detection. ConvNets are
biologically inspired deep network architectures based on Hubel and Wiesel’s [99]
work on the cat’s primary visual cortex. Once shadows are detected, an automatic
shadow removal algorithm is proposed which encodes the detected information in
the likelihood and prior terms of the proposed Bayesian formulation. Our formu-
lation is based on a generalized shadow generation model which models both the
umbra and penumbra regions. To the best of our knowledge, we are the first to use
‘learned features’ in the context of shadow detection, as opposed to the common
carefully designed and hand-crafted features. Moreover, the proposed approach
detects and removes shadows automatically without any human input (Fig. 3.1).
Our proposed shadow detection approach combines local information at image
patches with the local information across boundaries (Fig. 3.1). Since the regions
and the boundaries exhibit different types of features, we split the detection proce-
dure into two respective portions. Separate ConvNets are consequently trained for
patches extracted around the scene boundaries and the super-pixels. Predictions
made by the ConvNets are local and we therefore need to exploit the higher level
interactions between the neighboring pixels. For this purpose, we incorporate local
beliefs in a Conditional Random Field (CRF) model which enforces the labeling
Figure 3.1: From left to right: Original image (a). Our framework first detects shadows (c) using the learned features along the boundaries (top image in (b)) and the regions (bottom image in (b)). It then extracts the shadow matte (e) and removes it to produce a shadow-free image (d).
consistency over the nodes of a grid graph defined on an image (Sec. 3.3). This
removes isolated and spurious labeling outcomes and encourages neighboring pixels
to adopt the same label.
Using the detected shadow mask, we identify the umbra (Latin meaning shadow),
penumbra (Latin meaning almost-shadow) and shadow-less regions and propose a
Bayesian formulation to automatically remove shadows. We introduce a generalized
shadow generation model which separately defines the umbra and penumbra gener-
ation process. The resulting optimization problem has a relatively large number
of unknown parameters, whose MAP estimates are efficiently computed by alternately solving for the parameters (Eq. 3.26). The shadow removal process also
extracts smooth shadow matte that can be used in applications such as shadow
compositing and editing (Sec. 3.4).
A preliminary version of this research (which solely focuses on shadow detection)
appeared in [127]. In addition, the current study includes: (1) a new approach to
estimate shadow statistics, (2) automatic shadow removal and shadow matte extrac-
tion, (3) a substantial number of additional experiments, analysis and limitations,
(4) possible applications in many computer vision and graphics tasks.
3.2 Related Work and Contributions
Shadow Detection: One of the most popular methods to detect shadows is
to use a variety of shadow variant and invariant cues to capture the statistical
and deterministic characteristics of shadows [312, 144, 111, 78, 233]. The extracted
features model the chromatic, textural [312, 144, 78, 233] and illumination [111, 204]
properties of shadows to determine the illumination conditions in the scene. Some
works give more importance to features computed across image boundaries, such as
intensity and color ratios across boundaries and the computation of texton features
on both sides of the edges [265, 144]. Although these feature representations are
useful, they are based on assumptions that may not hold true in all cases. As an
example, chromatic cues assume that the texture of the image regions remains the
same across shadow boundaries and only the illumination is different. This approach
fails when the image regions under shadows are barely visible. Moreover, all of
these methods involve a considerable effort in the design of hand-crafted features for
shadow detection and feature selection (e.g., the use of ensemble learning methods
to rank the best features [312, 144]). Our data-driven framework is different and
unique: we propose to use deep feature learning methods to ‘learn the most relevant
features’ for shadow detection.
Owing to the challenging nature of the shadow detection problem, many simplis-
tic assumptions are commonly adopted. Previous works made assumptions related
to the illumination sources [234], the geometry of the objects casting shadows and
the material properties of the surfaces on which shadows are cast. For example,
Salvador et al. [233] consider object cast shadows while Lalonde et al. [144] only
detect shadows that lie on the ground. Some methods use synthetically generated
training data to detect shadows [203]. Techniques targeted for video surveillance ap-
plications take advantage of multiple images [58] or time-lapse sequences [119, 101]
to detect shadows. User assistance is also required by many proposed techniques
to achieve their attained performances [238, 21]. In contrast, our shadow detection
method makes absolutely ‘no prior assumptions’ about the scene, the shadow prop-
erties, the shape of objects, the image capturing conditions and the surrounding
environments. Based on this premise, we tested our proposed framework on all of
the publicly available databases for shadow detection from single images. These
databases contain common real world scenes with artifacts such as noise, compres-
sion and color balancing effects.
Shadow Removal and Matting: Almost all approaches that are employed to
either edit or remove shadows are based on models that are derived from the image
formation process. A popular choice is to physically model the image into a de-
composition of its intrinsic images along with some parameters that are responsible
for the generation of shadows. As a result, the shadow removal process is reduced
to the estimation of the model parameters. Finlayson et al. [61, 60] addressed this
problem by nullifying the shadow edges and reintegrating the image, which results
in the estimation of the additive scaling factor. Since such global integration (which
requires the solution of a 2D Poisson equation [61, 59]) causes artifacts, the integra-
tion along a 1D Hamiltonian path [63] is proposed for shadow removal. However,
these and other gradient based methods (such as [172, 191]) do not account for the
shadow variations inside the umbra region. To address this shortcoming, Arbel and
Hel-Or [5] treat the illumination recovery problem as a 3D surface reconstruction
and use a thin plate model to successfully remove shadows lying on curved surfaces.
Alternatively, information theory based techniques are proposed in [139, 59] and a
bilateral filtering based approach is recently proposed in [297] to recover intrinsic
(illumination and reflectance) images. However, these approaches either require user
assistance, calibrated imaging sensors, careful parameter selection or considerable
processing times. To overcome these shortcomings, some reasonably fast and accu-
rate approaches have been proposed which aim to transfer the color statistics from
the non-shadow regions to the shadow regions (‘color transfer based approaches’ e.g.,
[225, 285, 238, 286, 290]). Our proposed shadow removal algorithm also belongs to
the category of color transfer based approaches. However, in contrast to previous
related works, we propose a generalized image formation model which enables us
to deal with non-uniform umbra regions as well as soft shadows. Color transfer is
also made at multiple spatial levels, which helps in the reduction of noise and color
artifacts. An added advantage of our approach is our ability to separate smooth
shadow matte from the actual image.
Several assumptions are made in the shadow removal literature due to the ill-
posed nature of recovering the model parameters for each pixel. The camera sensor
parameters are needed in [297, 61]. Multiple narrow-band sensor outputs for each
scene are required in [297], while [189] employs a sequence of images to recover the
intrinsic components. Lambertian surface and Planckian lightening assumptions are
made in [297]. Though several approaches work just on a single image, they require
considerable user interaction to identify either tri-maps [35], quad-maps [285, 286],
gradients [156] or exact shadow boundaries [172, 191]. Su and Chen [251] tried
to minimize the user effort by specifying the complete shadow boundary from the
user provided strokes. In contrast, our framework does not require any form of
user interaction and makes no assumption regarding the camera or scene properties
(except that the object surfaces are assumed to be Lambertian).
The key contributions of our work are outlined below:
• We propose a new approach for robust shadow detection combining both regional and across-boundary learned features in a probabilistic framework involving CRFs (Sec. 3.3).
• Our proposed method automatically learns the most relevant feature representations from raw pixel values using multiple ConvNets (Sec. 3.3).
• We propose a generalized shadow formation model along with automatic color statistics modeling using only detected shadow masks (Sec. 3.4.1 and 3.4.2).
• Our proposed Bayesian formulation for the shadow removal problem integrates multi-level color transfer and the resulting cost function is efficiently optimized to give superior results (Sec. 3.4.3 and 3.4.4).
• We performed extensive quantitative evaluation to prove that the proposed framework is robust, less-constrained and generalisable across different types of scenes (Sec. 3.5).
Figure 3.2: The proposed shadow detection framework. (Best viewed in color) [Block diagram: the input image is preprocessed (superpixel extraction with SLIC, bilateral filtering, boundary extraction with gPb, and window extraction at boundary points and at superpixel centroids); class imbalance is removed with SMOTE; features are learned by ConvNet-1 and ConvNet-2; shadows are localized via posteriors on UCMs; and the resulting unary term, pairwise term and edge map feed the CRF model, which outputs the shadow map.]
3.3 Proposed Shadow Detection Framework
Given a single color image, we aim to detect and localize shadows precisely at
the pixel level (see block diagram in Fig. 3.2). If y denotes the desired binary
mask encoding class relationships, we can model the shadow detection problem as
a conditional distribution:
\[ P(\mathbf{y}\,|\,\mathbf{x};\mathbf{w}) = \frac{1}{Z(\mathbf{w})}\,\exp\big(-E(\mathbf{y},\mathbf{x};\mathbf{w})\big) \tag{3.1} \]
where, the parameter vector w includes the weights of the model, the manifest variables are represented by x where $x_i$ denotes the intensity of pixel $i \in \{p_i\}_{1\times N}$
and Z(w) denotes the partition function. The energy function is composed of two
potentials; the unary potential ψi and the pairwise potential ψij:
\[ E(\mathbf{y},\mathbf{x};\mathbf{w}) = \sum_{i\in\mathcal{V}} \psi_i(y_i,\mathbf{x};\mathbf{w}_i) + \sum_{(i,j)\in\mathcal{E}} \psi_{ij}(y_{ij},\mathbf{x};\mathbf{w}_{ij}) \tag{3.2} \]
In the following discussion, we will explain how we model these potentials in a CRF
framework.
3.3.1 Feature Learning for Unary Predictions
The unary potential in Eq. 3.2 considers the shadow properties both at the
regions and at the boundaries inside an image.
\[ \psi_i(y_i,\mathbf{x};\mathbf{w}_i) = \overbrace{\phi^r_i(y_i,\mathbf{x};\mathbf{w}^r_i)}^{\text{region}} + \overbrace{\phi^b_i(y_i,\mathbf{x};\mathbf{w}^b_i)}^{\text{boundary}} \tag{3.3} \]
We define the regional and boundary potentials, $\phi^r$ and $\phi^b$ respectively, in terms of probability estimates from the two separate ConvNets,
\[ \phi^r_i(y_i,\mathbf{x};\mathbf{w}^r_i) = -\mathbf{w}^r_i \log P_{cnn1}(y_i\,|\,\mathbf{x}^r) \]
\[ \phi^b_i(y_i,\mathbf{x};\mathbf{w}^b_i) = -\mathbf{w}^b_i \log P_{cnn2}(y_i\,|\,\mathbf{x}^b) \tag{3.4} \]
This is logical because the features to be estimated at the boundaries are likely to
be different from the ones estimated inside the shadowed regions. Therefore, we
train two separate ConvNets, one for the regional potentials and the other for the
boundary potentials.
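A minimal sketch of how the two posteriors could be combined into the unary term of Eqs. 3.3-3.4 is given below; it is our own illustration, assuming the per-pixel region and boundary posteriors have already been produced by the two ConvNets, and the array names and weights are hypothetical.

```python
import numpy as np

def unary_potential(p_region, p_boundary, w_region=1.0, w_boundary=1.0, eps=1e-8):
    """Negative log-posterior unary energy per pixel and label (cf. Eqs. 3.3-3.4).

    p_region, p_boundary: arrays of shape (H, W, 2) with P(y | x) for the two
    labels {non-shadow, shadow} coming from ConvNet-1 and ConvNet-2.
    Returns an (H, W, 2) array of unary energies (lower = more likely).
    """
    phi_r = -w_region * np.log(p_region + eps)
    phi_b = -w_boundary * np.log(p_boundary + eps)
    return phi_r + phi_b

# Toy example on a 2x2 image.
p_r = np.random.dirichlet([1, 1], size=(2, 2))   # region posteriors
p_b = np.random.dirichlet([1, 1], size=(2, 2))   # boundary posteriors
print(unary_potential(p_r, p_b).shape)           # (2, 2, 2)
```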
The ConvNet architecture used for feature learning consists of alternating con-
volution and sub-sampling layers (Fig. 3.3). Each convolutional layer in a ConvNet
consists of filter banks which are convolved with the input feature maps. The sub-
sampling layers pool the incoming features to derive invariant representations. This
Figure 3.3: ConvNet Architecture used for Automatic Feature Learning to Detect
Shadows.
layered structure enables ConvNets to learn multilevel hierarchies of features. The
final layer of the network is fully connected and comes just before the output layer.
This layer works as a traditional MLP with one hidden layer followed by a logistic
regression output layer which provides a distribution over the classes. Overall, after
the network has been trained, it takes an RGB patch as an input and processes it
to give a posterior distribution over binary classes.
ConvNets operate on equi-sized windows, so it is required to extract patches
around desired points of interest. For the case of regional potentials, we extract
super-pixels by clustering the homogeneous pixels2. Afterwards, a patch (Ir) is
extracted by centering a τs×τs window at the centroid of each superpixel. Similarly
for boundary potentials, we first apply a Bilateral filter and then extract boundaries
using the gPb technique [6]. We traverse each boundary with a stride λb and extract
a τs × τs patch at each step to incorporate local context3. Therefore, ConvNets
operate on sets of super-pixel and boundary patches, $\mathbf{x}^r = \{I^r(i,j)\}_{1\times|\mathcal{F}_{slic}(\mathbf{x})|}$ and $\mathbf{x}^b = \{I^b(i,j)\}_{1\times\frac{|\mathcal{F}_{gPb}(\mathbf{x})|}{\lambda_b}}$ respectively, where $|\cdot|$ is the cardinality operator. Note
that we include synthetic data (generated by artificial linear transformations [32])
during the training process. This data augmentation is important not only because
it removes the skewed class distribution of the shadowed regions but it also results
2 In our implementation we used SLIC [2], due to its efficiency.
3 The step size is λb = τs/4 to obtain partially overlapping windows.
in an enhanced performance. Moreover, data augmentation helps to reduce the
overfitting problem in ConvNets (e.g., in [36]) which results in the learning of more
robust feature representations.
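One possible way to realise the region patch-extraction step with off-the-shelf tools is sketched below, using scikit-image SLIC and region centroids; the patch size and superpixel count are illustrative assumptions, and gPb boundary patches would be extracted analogously along the detected boundaries.

```python
import numpy as np
from skimage.segmentation import slic
from skimage.measure import regionprops

def extract_region_patches(image, patch_size=32, n_segments=400):
    """Extract a square RGB patch centred on every superpixel centroid."""
    half = patch_size // 2
    # Reflect-pad so that patches centred near the border stay fully inside the image.
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    segments = slic(image, n_segments=n_segments, compactness=10, start_label=1)
    patches = []
    for prop in regionprops(segments):
        r, c = (int(round(v)) for v in prop.centroid)
        patches.append(padded[r:r + patch_size, c:c + patch_size])
    return np.stack(patches)

image = np.random.rand(240, 320, 3)
print(extract_region_patches(image).shape)  # (~400, 32, 32, 3)
```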
During the training process, we use stochastic gradient descent to automatically
learn feature representations in a supervised manner. The gradients are computed
using back-propagation to minimize the cross entropy loss function [147]. We set
the training parameters (e.g., momentum and weight decay) using a cross valida-
tion process. The training samples are shuffled randomly before training since the
network can learn faster from unexpected samples. The weights of the ConvNet
were initialized with randomly drawn samples from a Gaussian distribution of zero
mean and a variance that is inversely proportional to the fan-in measure of neurons.
The number of epochs during the training of ConvNets is set by an early stopping
criterion based on a small validation set. The initial learning rate is heuristically
chosen by selecting the largest rate which resulted in the convergence of the training
error. This rate is decremented by a factor of υ = 0.5 after every 20 epochs.
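For illustration only, the following compact PyTorch sketch follows the spirit of the description above (alternating convolution and sub-sampling layers, one fully connected hidden layer, a two-way output, SGD with momentum, weight decay and a step learning-rate schedule). The layer sizes, patch size and hyper-parameter values are assumptions, and the original work predates this toolkit.

```python
import torch
import torch.nn as nn

class ShadowPatchNet(nn.Module):
    """Illustrative shadow / non-shadow classifier for small RGB patches."""
    def __init__(self, patch_size=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # convolution + sub-sampling
            nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # convolution + sub-sampling
        )
        with torch.no_grad():  # infer the flattened feature size for the chosen patch size
            feat_dim = self.features(torch.zeros(1, 3, patch_size, patch_size)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feat_dim, 128), nn.ReLU(),  # fully connected hidden layer (MLP)
            nn.Linear(128, 2),                    # two-way output (softmax applied inside the loss)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = ShadowPatchNet()
criterion = nn.CrossEntropyLoss()                 # cross entropy loss, minimized by SGD
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)  # halve LR every 20 epochs

patches, labels = torch.randn(8, 3, 32, 32), torch.randint(0, 2, (8,))  # one random mini-batch
loss = criterion(model(patches), labels)
optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
```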
The ConvNet trained on boundary patches learns to separate shadow and reflectance edges, while the ConvNet trained on regions can differentiate between shadow and non-shadow patches. For the case of the regions, the posteriors predicted by the ConvNet are assigned to each super-pixel in an image. However, for the
boundaries, we first localize the probable shadow location using the local contrast
and then average the predicted probabilities over each contour generated by the
Ultra-metric Contour Maps (UCM) [6].
3.3.2 Contrast Sensitive Pairwise Potential
The pairwise potential in Eq. 3.2 is defined as a combination of the class tran-
sition potential φp1 and the spatial transition potential φp2 :
\[ \psi_{ij}(y_{ij},\mathbf{x};\mathbf{w}_{ij}) = \mathbf{w}_{ij}\,\phi_{p_1}(y_i,y_j)\,\phi_{p_2}(\mathbf{x}). \tag{3.5} \]
The class transition potential takes the form of an Ising prior:
\[ \phi_{p_1}(y_i,y_j) = \alpha\,\mathbb{1}_{y_i \neq y_j} = \begin{cases} 0 & \text{if } y_i = y_j \\ \alpha & \text{otherwise} \end{cases} \tag{3.6} \]
The spatial transition potential captures the differences in the adjacent pixel intensities:
\[ \phi_{p_2}(\mathbf{x}) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{\beta_x\,\langle\|x_i - x_j\|^2\rangle}\right) \tag{3.7} \]
where, 〈·〉 denotes the average contrast in an image. The parameters α and βx were
derived using cross validation on each database.
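The contrast-sensitive pairwise term of Eqs. 3.5-3.7 can be sketched as follows for horizontally adjacent pixel pairs (a hedged illustration with assumed values of α and βx; vertical neighbours are handled analogously).

```python
import numpy as np

def pairwise_energy(image, labels, alpha=1.0, beta_x=2.0):
    """Contrast-sensitive Ising energy over horizontally adjacent pixel pairs.

    image: (H, W, 3) float array; labels: (H, W) binary array.
    A penalty alpha is paid only when neighbouring labels differ (Eq. 3.6),
    attenuated by exp(-||xi - xj||^2 / (beta_x * <||xi - xj||^2>)) (Eq. 3.7).
    """
    diff = image[:, 1:] - image[:, :-1]
    sq_dist = np.sum(diff ** 2, axis=-1)
    spatial = np.exp(-sq_dist / (beta_x * sq_dist.mean() + 1e-12))   # Eq. 3.7
    label_diff = (labels[:, 1:] != labels[:, :-1]).astype(float)     # Ising prior, Eq. 3.6
    return np.sum(alpha * label_diff * spatial)                      # Eq. 3.5

img = np.random.rand(4, 5, 3)
lab = np.random.randint(0, 2, (4, 5))
print(pairwise_energy(img, lab))
```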
Figure 3.4: The Proposed Shadow Removal Framework: After the detection of the shadows in the image, we estimate the umbra, penumbra and object-shadow boundary. Given this information, a multi-level color transfer is applied to obtain a crude estimate of the shadow-less image. This rough estimate is further improved using the proposed Bayesian formulation which estimates the optimal shadow-less image along with the shadow model parameters.
3.3.3 Shadow Contour Generation using CRF Model
We model the shadow contour generation in the form of a two-class scene parsing
problem where each pixel is labeled either as a shadow or a non-shadow. This
binary classification problem takes probability estimates from the supervised feature
learning algorithm and incorporates them in a CRF model. The CRF model is
defined on a grid structured graph topology, where graph nodes correspond to image
pixels (Eq. 3.2). When making an inference, the most likely labeling is found using
the Maximum a Posteriori (MAP) estimate ($\mathbf{y}^*$) over a set of random variables $\mathbf{y} \in \mathcal{L}^N$. This estimation turns out to be an energy minimization problem since the partition function Z(w) does not depend on y:
\[ \mathbf{y}^* = \underset{\mathbf{y}\in\mathcal{L}^N}{\operatorname{argmax}}\; P(\mathbf{y}|\mathbf{x};\mathbf{w}) = \underset{\mathbf{y}\in\mathcal{L}^N}{\operatorname{argmin}}\; E(\mathbf{y},\mathbf{x};\mathbf{w}) \tag{3.8} \]
The CRF model proved to be an elegant way to enforce label consistency and local smoothness over the pixels. However, the size of the training space (labeled
images) makes it intractable to compute the gradient of the likelihood. Therefore
the parameters of the CRF cannot be found by simply maximizing the likelihood
of the hand labeled shadows. Hence, we use the ‘margin rescaled algorithm’ to
learn the parameters (w in Eq. 3.8) of our proposed CRF model (see Fig 3 in [253]
for details). Because our proposed energies are sub-modular, we use graph-cuts for
making efficient inferences [22]. In the next section, we describe the details of our
shadow removal and matting framework.
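Since the binary energy is sub-modular, MAP inference reduces to a single s-t min-cut. A minimal sketch using the PyMaxflow library is shown below; the chapter does not name a specific graph-cut implementation, so the library choice and the smoothness weight are our own assumptions.

```python
import numpy as np
import maxflow

def map_shadow_mask(unary, pairwise_weight):
    """MAP labeling of a binary grid CRF by a single s-t min-cut (cf. Eq. 3.8).

    unary: (H, W, 2) energies for labels {0: non-shadow, 1: shadow}.
    pairwise_weight: scalar strength of a (sub-modular) Potts smoothness term.
    """
    h, w, _ = unary.shape
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes((h, w))
    # 4-connected Potts smoothness between neighbouring pixels.
    g.add_grid_edges(nodes, pairwise_weight)
    # Terminal edges carry the unary energies of the two labels.
    g.add_grid_tedges(nodes, unary[..., 1], unary[..., 0])
    g.maxflow()
    return g.get_grid_segments(nodes).astype(np.uint8)  # 1 = shadow

unary = np.random.rand(60, 80, 2)
mask = map_shadow_mask(unary, pairwise_weight=0.3)
print(mask.shape, mask.dtype)
```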
3.4 Proposed Shadow Removal and Matting Framework
Based on the detected shadows in the image, we propose a novel automatic
shadow removal approach. A block diagram of the proposed approach is presented
in Fig. 3.4. The first step is to identify the umbra, penumbra and the corresponding
non-shadowed regions in an image. We also need to identify the boundary where
the actual object and its shadow meet. This identification helps to avoid any errors
during the estimation of shadow/non-shadow statistics (e.g., color distribution). In
previous works (such as [286, 5, 238]), this process has been carried out manually
through human interaction. We, however, propose a simple procedure to automati-
cally estimate the umbra, penumbra regions and the object-shadow boundary.
Heuristically, the object-shadow boundary is relatively darker compared to other
shadow boundaries where differences in light intensity are significant. Therefore,
given a shadow mask, we calculate the boundary normals at each point. We
Figure 3.5: Detection of Object and Shadow Boundary: We use the gradient profile
along the direction perpendicular to a boundary point (four sample profiles are
plotted on the anti-diagonal of above figure) to separate the object-shadow boundary
(shown in red in lower right image).
Figure 3.6: Detection of Umbra and Penumbra Regions: With the detected shadow map (2nd image from left), we estimate the umbra and penumbra regions (rightmost image) by analyzing the gradient profile (4th image from left) at the boundary points.
cluster the boundary points according to the direction of their normals. This results
in separate boundary segments which join to form the boundary contour around
the shadow. Then, the boundary segments in the shadow contour with a minimum
relative change in intensity are classified to represent the object-shadow boundary.
If $\varrho^c_b$ denotes the mean intensity change along the normal direction at a boundary segment $b$ of the shadow contour $c$, all boundary segments satisfying $\varrho^c_b/\varrho^c_{max} \leq 0.5$ are considered to correspond to the segments which separate the object and its cast shadow. This simple procedure performs reasonably well for most of our test
examples (Fig. 3.5). In the case where the object shadow boundary is not visible,
no boundary portion is classified as an object shadow boundary and the shadow-less
statistics are taken from all around the shadow region. In most cases, this does not
affect the removal performance as long as the object-shadow boundary is not very
large compared to the total shadow boundary.
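The boundary-classification heuristic described above can be sketched as follows, assuming the boundary points, their unit normals and a grayscale image are given; the grouping into segments and the 0.5 ratio follow the text, while the sampling reach and other helper details are our own assumptions.

```python
import numpy as np

def mean_change_along_normal(gray, point, normal, reach=10):
    """Mean absolute intensity change sampled along the normal of a boundary point."""
    samples = []
    for t in range(1, reach + 1):
        r_in = np.clip(np.round(point - t * normal).astype(int), 0, np.array(gray.shape) - 1)
        r_out = np.clip(np.round(point + t * normal).astype(int), 0, np.array(gray.shape) - 1)
        samples.append(abs(gray[tuple(r_out)] - gray[tuple(r_in)]))
    return float(np.mean(samples))

def object_shadow_segments(gray, segments, ratio=0.5):
    """segments: list of (points, normals) arrays, one entry per boundary segment of a contour.
    Returns a boolean flag per segment: True where the segment likely separates the object
    and its shadow (mean intensity change at most `ratio` of the largest segment change)."""
    changes = [np.mean([mean_change_along_normal(gray, p, n) for p, n in zip(pts, nrm)])
               for pts, nrm in segments]
    max_change = max(changes)
    return [c / max_change <= ratio for c in changes]

gray = np.linspace(0, 1, 50 * 50).reshape(50, 50)
pts = np.array([[25.0, c] for c in range(10, 40)])
nrm = np.tile(np.array([1.0, 0.0]), (30, 1))
print(object_shadow_segments(gray, [(pts, nrm), (pts, -nrm)]))
```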
To estimate the umbra and penumbra regions, the boundary is estimated at each
point of the shadow contour by fitting a curve and finding the corresponding normal
direction. This procedure is adopted to extract accurate boundary estimates instead
of local normals which can result in erroneous outputs at times. We propagate the
boundaries along the estimated normal directions until the intensity change becomes
insignificant (Fig. 3.6). This results in an approximation of the penumbra region.
We then exclude this region from the shadow mask and the remaining region is
considered as the umbra region. The region immediately adjacent to the shadow
region, with twice the width of the penumbra region is treated as the non-shadow
region. Note that our approach is based on the assumption that the texture remains
approximately the same across the shadow boundary.
3.4.1 Rough Estimation of Shadow-less Image by Color-transfer
The rough shadow-less image estimation process is based on the one adopted by
the color transfer techniques in [225] and [286]. As opposed to [225, 286], we perform
a multilevel color transfer and our method does not require any user input. The
color statistics of the shadowed as well as the non-shadowed regions are modeled
using a Gaussian mixture model (GMM). For this purpose, a continuous probability
distribution function is estimated from the histograms of both regions using the
Expectation-Maximization (EM) algorithm. The EM algorithm is initialized using
an unsupervised clustering algorithm (k-means in our implementation) and the EM
iterations are carried out until convergence. We treat each of the R, G and B
channels separately and fit mixture models to each of the respective histograms. It
Algorithm 2: RoughEstimation(S, N)
1: h_S, h_N ← get the histograms of the color distributions in S and N
2: g_S, g_N ← fit GMMs on h_S, h_N using the EM algorithm
3: for each j ∈ [0, J] do
       perform a channel-wise color transfer between corresponding Gaussians using Eqs. 3.9, 3.10
       get the probability of a pixel/super-pixel to belong to a Gaussian component using Eq. 3.11
       calculate the overall transfer for each color channel using Eq. 3.12
4: combine the multiple transfers: C*(x, y) = (1/(J+1)) Σ_j C_j(x, y)
5: calculate the probability of a pixel to be shadow or non-shadow:
       p_S(x, y) = Σ_{k=1}^{K} ω_S^k |D_N^k(x, y)| / (|D_S^k(x, y)| + |D_N^k(x, y)|)
6: modify the color transfer using Eq. 3.13
7: improve the result of the above step using Eq. 3.14
return I(x, y)
is considered that the estimated Gaussians, in the shadow and non-shadow regions,
correspond to each other when arranged according to their means. Therefore, the
color transfer is computed among the corresponding Gaussians using the following
pair of equations:
\[ D^k_S(x,y) = \frac{I(x,y) - \mu^k_S}{\sigma^k_S} \tag{3.9} \]
\[ C^k(x,y) = \mu^k_N + \sigma^k_N\, D^k_S(x,y) \tag{3.10} \]
where D(·) measures the normalized deviation for each pixel, S and N denote the
shadow and non-shadow regions respectively. The index k is in range [1, K], where
K denotes the total number of Gaussians used to approximate the histogram of S.
The probability that a pixel (with coordinates x, y) belongs to a certain Gaussian
component can be represented in terms of its normalized deviation:
\[ p^k_G(x,y) = \left( |D^k_S(x,y)| \sum_{k=1}^{K} \frac{1}{|D^k_S(x,y)| + \varepsilon} \right)^{-1} \tag{3.11} \]
The overall transfer is calculated by taking the weighted sum of transfers for all
Gaussian components:
\[ C_{j=0}(x,y) = \sum_{k=1}^{K} p^k_G(x,y)\, C^k(x,y). \tag{3.12} \]
The color transfer performed at each pixel location (i.e. at level j = 0) using
Eq. 3.12 is local, and it thus, does not accurately restore the image contrast in
the shadowed regions. Moreover, this local color transfer is prone to noise and
discontinuities in illumination. We therefore resort to a hierarchical strategy which
restores color at multiple levels and combines all transfers which results in a better
estimation of the shadow-less image. A graph based segmentation procedure [57]
is used to group the pixels. This clustering is performed at J levels, which we
set to 4 in the current work based on the performance on a small validation set,
where we noted an over-smoothing and a low computational efficiency when J ≥ 5.
Since, the segment size is kept quite small, it is highly unlikely that the differently
colored pixels will be grouped together. At each level j ∈ [1, J ], the mean of each
cluster is used in the color transfer process (using Eqs. 3.9, 3.10) and the resulting
estimate (Eq. 3.12) is distributed to all pixels in the cluster. This gives multiple
color transfers Cj(x, y) at J different resolutions plus the local color transfer i.e.
Cj=0(x, y). At each level, a pixel or a super-pixel is treated as a discrete unit during
the color transfer process. The resulting transfers are integrated to produce the final
outcome: $C^*(x,y) = \frac{1}{J+1}\sum_{j=0}^{J} C_j(x,y)$. This process helps in reducing the noise. It
also restores a better texture and improves the quality of the restored image. It
should be noted that our hierarchical strategy helps in successfully retaining the self
shading patterns in the recovered image compared to previous works (Sec. 3.5.3).
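A condensed sketch of the per-channel Gaussian-to-Gaussian transfer of Eqs. 3.9-3.12 is given below for a single level only, using scikit-learn's GaussianMixture; the number of components and the shown intensities are assumptions, and the multi-level combination simply averages such transfers computed on pixel groupings of increasing size.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def channel_color_transfer(shadow_vals, nonshadow_vals, K=3, eps=1e-6):
    """Transfer colour statistics from non-shadow to shadow pixels for one channel.

    Corresponding Gaussians (sorted by mean) define per-component affine maps
    (Eqs. 3.9-3.10), blended by normalised inverse deviations (Eqs. 3.11-3.12).
    """
    gmm_s = GaussianMixture(K, random_state=0).fit(shadow_vals.reshape(-1, 1))
    gmm_n = GaussianMixture(K, random_state=0).fit(nonshadow_vals.reshape(-1, 1))
    # Sort components by mean so the k-th shadow Gaussian maps to the k-th non-shadow one.
    s_order = np.argsort(gmm_s.means_.ravel())
    n_order = np.argsort(gmm_n.means_.ravel())
    mu_s, sd_s = gmm_s.means_.ravel()[s_order], np.sqrt(gmm_s.covariances_.ravel()[s_order])
    mu_n, sd_n = gmm_n.means_.ravel()[n_order], np.sqrt(gmm_n.covariances_.ravel()[n_order])

    D = (shadow_vals[:, None] - mu_s[None, :]) / sd_s[None, :]   # Eq. 3.9
    C = mu_n[None, :] + sd_n[None, :] * D                        # Eq. 3.10
    inv_dev = 1.0 / (np.abs(D) + eps)
    p = inv_dev / inv_dev.sum(axis=1, keepdims=True)             # Eq. 3.11 (normalised)
    return (p * C).sum(axis=1)                                   # Eq. 3.12

shadow = np.random.rand(500) * 0.3           # darker pixels
nonshadow = 0.5 + np.random.rand(500) * 0.5  # brighter pixels
print(channel_color_transfer(shadow, nonshadow)[:5])
```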
To avoid possible errors due to the small non-shadow regions that may be present
in the selected shadow region S, we calculate the probability of a pixel to be shadowed using $p_S(x,y) = \sum_{k=1}^{K} \omega^k_S\, p^k_S(x,y)$, where $\omega^k_S$ are the weights of the Gaussians (learned by the EM algorithm) and $p^k_S(x,y) = |D^k_N|/(|D^k_S| + |D^k_N|)$. The color transfer is modified as:
\[ C'(x,y) = (1 - p_S(x,y))\,I_S(x,y) + p_S(x,y)\,C^*(x,y) \tag{3.13} \]
However, the penumbra region pixels will not get accurate intensity values. To
correct this anomaly, we define a relation which measures the probability (in a
naive sense) of a pixel to belong to the penumbra region. Since the penumbra
region occurs around the shadow boundary, we define it as: bS(x, y) = d(x, y)/dmax.
The penumbra region is recovered using the exemplar based inpainting approach
of Criminisi et al. [40]. The resulting improved approximation of the shadow-less
image is,
I(x, y) = (1− bS(x, y))E(x, y) + bS(x, y)C ′(x, y) (3.14)
where, E is the inpainted image.
In our approach, the crude estimate of a shadow-less image (Eq. 3.14) is further
improved using Bayesian estimation (Sec. 3.4.3). But first we need to introduce the
proposed shadow generation model used in our Bayesian formulation (Sec. 3.4.2).
3.4.2 Generalised Shadow Generation Model
Unlike previous works (such as [238, 290, 78, 286, 172]), which do not differentiate
between the umbra and the penumbra regions during the shadow formation process,
we propose a model which treats both types of shadow regions separately. It is
important to make such distinction because the umbra and penumbra regions exhibit
distinct illumination characteristics and have a different influence from the direct
and indirect light (Fig. 3.6).
Let us suppose that we have a scene with illuminated and shadowed regions.
A normal illuminated image can be represented in terms of two intrinsic images
according to the image formation model of Barrow et al. [10]:
I(x, y) = L(x, y)R(x, y) (3.15)
where L and R are the illumination and reflectance respectively and x, y denote the
pixel coordinates. The illumination intrinsic image takes into account the illumi-
nation differences such as shadows and shading. We assume that a single source
of light is casting the shadows. The ambient light is assumed to be uniformly dis-
tributed in the environment due to the indirect illumination caused by reflections.
Therefore,
I(x, y) = (Ld(x, y) + Li(x, y))R(x, y) (3.16)
A cast shadow is formed when the direct illumination is blocked by some obstructing
object resulting in an occlusion. A cast shadow can be described as the combination
of two regions created by two distinct phenomena, umbra (U) and penumbra (P).
Umbra is surrounded by the penumbra region where the light intensity changes
sharply from dark to illuminated. The occluder which casts the shadow blocks all of the direct illumination and part of the indirect illumination to create the umbra region. We can represent this as:
Iu(x, y) = β′(x, y)Li(x, y)R(x, y) ∀x, y ∈ U (3.17)
Figure 3.7: Multi-level Color Transfer: (from left to right) (i) Two example images
(a and b), with selected shadow regions. (ii) The recovered shadow-less patch using
the technique of Wu et al. [33]. To highlight the difference with the original patch,
we also show the difference image in color. (iii) The result of the local color transfer
and its difference with the original patch. (iv) The result of the multi-level color
transfer. Note that the multi-level transfer removes noise and preserves the local
texture.
Figure 3.8: Shadow Removal Steps: (from left to right) (i) An original image with shadow. (ii) An initial estimate of the shadow-less image using a multi-level color transfer strategy. (iii) Improved estimate along the boundaries using in-painting. (iv, v and vi) The Bayesian formulation is optimized to solve for α (iv) and the β matte (vi) and the final shadow-less image (v).
∵ Ld(x, y) ≈ 0 ∀x, y ∈ U
where, β′(x, y) is the scaling factor for the U region. Using Eq. 3.16 and 3.17, we
have:
\[ I(x,y) = \frac{I_u(x,y)}{\beta'(x,y)} + \alpha(x,y) \tag{3.18} \]
\[ I_u(x,y) = I(x,y)\,\beta'(x,y) - \alpha(x,y)\,\beta'(x,y) \tag{3.19} \]
where, α(x, y) = Ld(x, y)R(x, y).
For the case of the penumbra region, not all of the direct light is blocked; rather, its intensity decreases from the fully lit region towards the umbra region. Since the
major source of change is the direct light, we can neglect the variation caused by
the indirect illumination in the penumbra region. Therefore,
Ip(x, y) = (β′′(x, y)Ld(x, y) + Li(x, y))R(x, y) (3.20)
∵ ∆Li(x, y) ≈ 0 ∀x, y ∈ P
where, β′′(x, y) is the scaling factor for the P region. Using Eq. 3.16 and 3.20, we
have:
Ip(x, y) = I(x, y)− α(x, y)(1− β′′(x, y)). (3.21)
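To make the model concrete, the sketch below applies Eqs. 3.17-3.21 in both directions on a toy example: synthesising umbra and penumbra observations from a shadow-free image given α and β masks, and recovering the shadow-free values when the parameters are known. The masks and parameter values are illustrative only.

```python
import numpy as np

def apply_shadow(I, alpha, beta_u, beta_p, umbra, penumbra):
    """Forward model: darken a shadow-free image I according to Eqs. 3.19 and 3.21."""
    out = I.copy()
    out[umbra] = I[umbra] * beta_u[umbra] - alpha[umbra] * beta_u[umbra]        # Eq. 3.19
    out[penumbra] = I[penumbra] - alpha[penumbra] * (1 - beta_p[penumbra])      # Eq. 3.21
    return out

def remove_shadow(I_obs, alpha, beta_u, beta_p, umbra, penumbra):
    """Inverse model: recover the shadow-free values when alpha and beta are known."""
    out = I_obs.copy()
    out[umbra] = I_obs[umbra] / beta_u[umbra] + alpha[umbra]                    # from Eq. 3.18
    out[penumbra] = I_obs[penumbra] + alpha[penumbra] * (1 - beta_p[penumbra])  # from Eq. 3.21
    return out

I = np.full((4, 4), 0.8)
umbra = np.zeros((4, 4), bool); umbra[1:3, 1:3] = True
penumbra = np.zeros((4, 4), bool); penumbra[0, :] = True
alpha = np.full((4, 4), 0.5); beta_u = np.full((4, 4), 0.6); beta_p = np.full((4, 4), 0.4)
shadowed = apply_shadow(I, alpha, beta_u, beta_p, umbra, penumbra)
print(np.allclose(remove_shadow(shadowed, alpha, beta_u, beta_p, umbra, penumbra), I))  # True
```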
3.4.3 Bayesian Shadow Removal and Matting
Having formulated the shadow generation model, we can now describe the esti-
mation procedure of the model parameters in probabilistic terms. We represent our
problem in a well-defined Bayesian formulation and estimate the required parame-
ters using maximum a posteriori estimate (MAP):
\[ \{\alpha^*, \beta^*\} = \underset{\alpha,\beta}{\operatorname{argmax}}\; P(\alpha,\beta\,|\,U,P,N) \tag{3.22} \]
\[ = \underset{\alpha,\beta}{\operatorname{argmax}}\; \frac{P(U,P,N|\alpha,\beta)\,P(\alpha)\,P(\beta)}{P(U,P,N)} \tag{3.23} \]
\[ = \underset{\alpha,\beta}{\operatorname{argmax}}\; P_\ell(U,P,N|\alpha,\beta) + P_\ell(\alpha) + P_\ell(\beta) - P_\ell(U,P,N) \tag{3.24} \]
where $P_\ell = \log P(\cdot)$ is the log likelihood and U, P and N represent the umbra,
penumbra and non-shadow regions respectively. The last term in the above equa-
tion can be neglected during optimization because it is independent of the model
parameters. Therefore:
\[ = \underset{\alpha,\beta}{\operatorname{argmax}}\; P_\ell(U,P,N|\alpha,\beta) + P_\ell(\alpha) + P_\ell(\beta) \tag{3.25} \]
Let Is(x, y) ∀x, y ∈ {U ∪ P} represent the complete shadow region. Then, the first
term in Eq. 3.25 can be written as a function of Is since the parameters α and β
do not affect the region N, therefore:
\[ = \underset{\alpha,\beta}{\operatorname{argmax}}\; P_\ell(I_s|\alpha,\beta) + P_\ell(\alpha) + P_\ell(\beta) \tag{3.26} \]
The first term in Eq. 3.26 can be modeled by the difference between the current
pixel values and the estimated pixel values, as follows:
\[ P_\ell(I_s|\alpha,\beta) = -\sum_{\{x,y\}\in S} \frac{|I_s(x,y) - \hat{I}_s(x,y)|^2}{2\sigma^2_{I_s}} \;-\; \sum_{\{x,y\}\in S} \frac{\pi(x,y)\,\eta(x,y)\,|I(x,y) - \hat{I}(x,y)|^2}{2\sigma^2_{I}} \tag{3.27} \]
where $\eta(x,y) = 1 - \frac{\lambda(x,y)}{\lambda_{max}}$ and $\pi$ is an indicator function which switches on for the penumbra region pixels. $\lambda(\cdot)$ is the distance metric which quantifies the shortest distance of a pixel from a valid shadow boundary (i.e., excluding the object-shadow boundary). The estimated shadowed image ($\hat{I}_s$) can be decomposed as follows using Eqs. 3.19 and 3.21:
Îs(x, y) = (I(x, y) − α(x, y))β′(x, y)   ∀{x, y} ∈ U ⊂ S
Îs(x, y) = I(x, y) − α(x, y)(1 − β′′(x, y))   ∀{x, y} ∈ P ⊂ S
It can be noted that P`(Is|α, β) models the error caused by the estimated parameters and encourages the recovered pixel values (Îs(x, y)) to lie close to the observed values (Is(x, y)), following a Gaussian distribution with variance σ²_I. However, in the above formulation, there are nine unknowns for each pixel located inside the shadowed region. If we had a smaller-scale problem (e.g., finding the precise shadow matte in the penumbra region, as in Chuang et al. [35]), we could have directly solved for the unknowns. But in our case, the large number of variables makes the likelihood calculation rather difficult and time consuming, especially when the number of shadowed pixels is large. We therefore resort to optimizing the crude shadow-less image (Ĩ(x, y)) calculated in Sec. 4.1, Eq. 14.
The prior P`(β) can be modeled as a Gaussian probability distribution centered at the mean (β̄) of the neighboring pixels. This helps in estimating a smoothly varying beta mask. So,
P`(β) = − Σ_{x,y} |β(x, y) − β(x′, y′)|² / (2σ²_β),   (x′, y′) ∈ N(x, y)    (3.28)
The prior P`(α) can also be modeled in a similar fashion. However, we require α to
model the variations in the penumbra region as well. Therefore, an additional term
(called the ‘image consistency term’) is introduced in the prior P`(α) to smooth the
estimated shadow-less image along the boundaries and to incorporate feedback from
the previously estimated crude shadowless image. Therefore,
P`(α) = − Σ_{x,y} |α(x, y) − α(x′, y′)|² / (2σ²_α) − (1/(2σ²_I)) Σ_{{x,y}∈S} (1 − λ(x, y)/λmax)|I(x, y) − Ĩ(x, y)|²,   (x′, y′) ∈ N(x, y)    (3.29)
In the image consistency term (second term in Eq. 3.29), I(x, y) will take different
values according to Eqs. 3.19 and 3.21:
I(x, y) = Iu(x, y)/β′(x, y) + α(x, y)   ∀{x, y} ∈ U
I(x, y) = Ip(x, y) + α(x, y)(1 − β′′(x, y))   ∀{x, y} ∈ P
3.4.4 Parameter Estimation
In spite of the crude shadow image estimation, it can be seen from Eq. 3.27 that the objective function is neither linear nor quadratic in terms of the unknowns. To apply a gradient-based energy optimization procedure, we simplify our problem by breaking it into two sub-optimization problems and apply an iterative joint optimization as follows. To optimize β, the parameter α is held constant and the first-order partial derivative with respect to β is set to zero.
For the umbra region,
β′(x, y) = [γ²_β β(x′, y′) − γ²_I (α(x, y)Is(x, y) − I(x, y)Is(x, y))] / [γ²_β − γ²_I (2I(x, y)α(x, y) − α²(x, y) − I²(x, y))]    (3.30)
For the penumbra:
β′′(x, y) = [αγ²_Is (∆(x, y) + α) + γ²_β β′′ + αγ²_I η(x, y)(∆(x, y) + α)] / [α²γ²_Is + γ²_β + α²γ²_I η(x, y)]    (3.31)
where, γ = σ⁻¹. To optimize α, the parameter β is held constant and the first-order partial derivative is taken with respect to α and set to zero. We get the following set of equations:
For the umbra region:
α(x, y) = [γ²_α α(x′, y′) − γ²_I (β′(x, y)Is(x, y) − I(x, y)β′²(x, y))] / [γ²_α + γ²_I β′²(x, y)]    (3.32)
Algorithm 3 BayesianRemoval(U, P, N, I)
  β ← 1, α ← 0, ε0 ← 10⁻³
  while δ > ε0 do
    for each {x, y} ∈ S do
      if {x, y} ∈ U then
        approximate β∗ using Eq. 3.30 and α∗ using Eq. 3.32
      else if {x, y} ∈ P then
        approximate β∗ using Eq. 3.31 and α∗ using Eq. 3.33
    δ ← α∗ − α + β∗ − β
  return (α, β)
For the penumbra:
α(x, y) = [−γ²_Is(1 − β′′)∆(x, y) + γ²_α α − γ²_I(1 − β′′)η(x, y)∆(x, y)] / [γ²_Is(1 − β′′)² + γ²_α + γ²_I(1 − β′′)²η(x, y)]    (3.33)
where, ∆(x, y) = Is(x, y) − I(x, y). We iteratively perform this procedure on each
pixel in the shadow region until convergence.
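Schematically, the alternating optimization of Algorithm 3 can be written as the following Python sketch; the per-pixel update callables stand in for the closed-form expressions of Eqs. 3.30–3.33 and are assumptions for illustration, not the exact code used here.

def bayesian_removal(alpha, beta, shadow_pixels, umbra_updates, penumbra_updates,
                     eps0=1e-3, max_iters=100):
    # alpha, beta: per-pixel parameter maps indexed by (x, y);
    # shadow_pixels: iterable of (x, y, region) with region in {'U', 'P'};
    # umbra_updates / penumbra_updates: pairs of callables implementing
    # (Eq. 3.30, Eq. 3.32) and (Eq. 3.31, Eq. 3.33) respectively.
    for _ in range(max_iters):
        delta = 0.0
        for (x, y, region) in shadow_pixels:
            update_beta, update_alpha = umbra_updates if region == 'U' else penumbra_updates
            beta_new = update_beta(x, y, alpha, beta)
            alpha_new = update_alpha(x, y, alpha, beta)
            delta += abs(alpha_new - alpha[x, y]) + abs(beta_new - beta[x, y])
            alpha[x, y], beta[x, y] = alpha_new, beta_new
        if delta <= eps0:  # convergence test, as in Algorithm 3
            break
    return alpha, beta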
3.4.5 Boundary Enhancement in a Shadow-less Image
The resulting shadow-less image exhibits traces of shadow boundaries in some
cases. To remove these artifacts, we divide the shadow boundary into a group of
segments, where each segment contains nearly similar colored pixels. The boundary
segments which belong to the object shadow boundary are excluded from further
processing. For each non-object shadow boundary segment, we perform Poisson
smoothing [210] to conceal the shadow boundary artifacts.
3.5 Experiments and Analysis
We evaluated our technique on three widely used and publicly available datasets.
For the qualitative comparison of shadow removal, we also evaluate our technique
on a set of commonly used images in the literature.
Methods                                              UCF Dataset   CMU Dataset   UIUC Dataset
BDT-BCRF (Zhu et al. [312])                             88.70%          −             −
BDT-CRF-Scene Layout (Lalonde et al. [144])               −           84.80%          −
Unary SVM-Pairwise (Guo et al. [78])                    90.20%          −           89.10%
Bright Channel-MRF (Panagopoulos et al. [204])          85.90%          −             −
Illumination Maps-BDT-CRF (Jiang et al. [111])          83.50%        84.98%          −
This chapter: ConvNet (Boundary+Region)                 89.31%        87.02%        92.31%
This chapter: ConvNet (Boundary+Region)-CRF             90.65%        88.79%        93.16%
Table 3.1: Evaluation of the proposed shadow detection scheme; all performances are reported in terms of pixel-wise accuracies.
3.5.1 Datasets
UCF Shadow Dataset is a collection of 355 images together with their man-
ually labeled ground truths. Zhu et al. have used a subset of 255/355 images for
shadow detection [312].
CMU Shadow Dataset consists of 135 consumer grade images with labels for only
those shadow edges which lie on the ground plane [144]. Since our algorithm is not
restricted to ground shadows, we tested our approach on the more challenging cri-
terion of full shadow detection which required the generation of new ground truths.
UIUC Shadow Dataset contains 108 images each of which is paired with its cor-
responding shadow-free image to generate a ground truth shadow mask [78].
Test/Train Split: For the UCF and UIUC databases, we used the splits mentioned in [312, 78]. Since the CMU database [144] did not report a split, we used even/odd images for training/testing (following the procedure in Jiang et al. [111]).
3.5.2 Evaluation of Shadow Detection
Results
We assessed our approach both quantitatively and qualitatively on all the major
datasets for single image shadow detection. We demonstrate the success of our
shadow detection framework on different types of scenes including beaches, forests,
street views, aerial images, road scenes and buildings. The databases also contain
shadows under a variety of illumination conditions such as sunny, cloudy and dark
environments. For quantitative evaluation, we report the performance of our frame-
work when only the unary term (Eq. 3.3) was used for shadow detection. Further,
we also report the per-pixel accuracy achieved using the CRF model on all the
datasets. This means that labels are predicted for every pixel in each test image
and are compared with the ground-truth shadow masks. For the UCF and CMU datasets, an initial learning rate of η0 = 0.1 was used, while for the UIUC dataset we set η0 = 0.01 based on the performance on a small validation set. After every 20 epochs, the learning rate was decreased by a factor of β = 0.5, which resulted in the best performance.
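For reference, this step-decay schedule can be expressed in a couple of lines; the function below simply mirrors the setting described above and is not the training code itself.

def learning_rate(epoch, eta0=0.1, decay=0.5, step=20):
    # Multiply the initial rate eta0 by `decay` after every `step` epochs.
    return eta0 * (decay ** (epoch // step))

# e.g., eta0 = 0.1 for UCF/CMU and eta0 = 0.01 for UIUC
# learning_rate(0) -> 0.1, learning_rate(20) -> 0.05, learning_rate(45) -> 0.025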
Table 3.1 summarizes the overall results of our framework and shows a compar-
ison with several state-of-the-art methods in shadow detection. It must be noted
that the accuracy of Jiang’s method [111] (on the CMU database) is given by the
Equal Error Rate (EER). All other accuracies represent the highest detection rate
achieved, which may not necessarily be an EER. Using the ConvNets and the CRF,
we were able to get the best performance on the UCF, CMU and UIUC databases
with a respective increase of 0.50%, 4.48% and 4.55% compared to the previous
best results4. For the case of the UCF dataset, a gain of 0.5% accuracy may
look modest. But it should be noted that the previous best methods of Zhu et al.
[312] and Guo et al. [78] were only evaluated on a subset (255/355 images). In
contrast, we report results on the complete dataset because the exact subset used
in [312, 78] is not known. Compared to Jiang et al. [111], which is evaluated on
the complete dataset, we achieved a relative accuracy gain of 8.56%. On five sets
of 255 randomly selected images from the UCF dataset, our method resulted in an
accuracy of 91.4± 4.2% which is a relative gain of 1.3% over Guo et al. [78].
Table 3.2 shows the comparison of class-wise accuracies. The true positives
(correctly classified shadows) are reported as the number of predicted shadow pixels
which match with the ground-truth shadow mask. True negative (correctly classified
non-shadows) are reported as the number of predicted non-shadow pixels which
match with the ground-truth non-shadow mask. It is interesting to see that our
framework has the highest shadow detection performance on the UCF, CMU and
UIUC datasets. For the case of the CMU dataset, our approach obtained a relatively lower non-shadow region detection accuracy of 90.9% compared to 96.4% for Lalonde et al. [144]. This is because [144] only considers ground shadows and thus ignores many false negatives. In contrast, our method is evaluated on the more challenging case of general shadow detection, i.e., all types of shadows. The ROC
curve comparisons are shown in Fig. 3.9. The plotted ROC curves represent the
performance of the unary detector since we cannot generate ROC curves from the
outcome of the CRF model. Our approach achieves the highest AUC measures for
all datasets (Fig. 3.9).
Some representative qualitative results are shown in Fig. 3.10 and Fig. 3.11.
The proposed framework successfully detects shadows in dark environments (Fig.
3.10: 1st row, middle image) and distinguishes between dark non-shadow regions and
shadow regions (Fig. 3.10: 2nd row, 2nd and 5th image from left). It performs equally
well on satellite images (Fig. 3.10: last column) and outdoor scenes with street views
(Fig. 3.10: 1st row, 3rd and 5th images; 2nd row, middle image), buildings (Fig. 3.10:
1st column) and shadows of animals and humans (Fig. 3.10: 2nd column).
Discussion
The previously proposed methods (e.g., Zhu et al. [312], Lalonde et al. [144]) that use a large number of hand-crafted features not only require a lot of effort in their
⁴ Relative increase in performance is calculated by: 100 × (our accuracy − previous best)/previous best.
(a) UCF Shadow Dataset (b) CMU Shadow Dataset
(c) UIUC Shadow Dataset
Figure 3.9: ROC curve comparisons of proposed framework with previous works.
Tested on \ Trained on      UCF       CMU       UIUC
UCF                          −        80.3%     80.5%
CMU                        77.7%       −        76.8%
UIUC                       82.8%     81.5%       −
Table 3.3: Results when ConvNets were trained and tested across different datasets.
Methods/Datasets Shadows Non-Shadows
UCF Dataset
− BDT-BCRF (Zhu et al. [312]) 63.9% 93.4%
− Unary-Pairwise (Guo et al. [78]) 73.3% 93.7%
− Bright Channel-MRF 68.3% 89.4%
(Panagopoulos et al. [204])
− ConvNet(Boundary+Region) 72.5% 92.1%
− ConvNet(Boundary+Region)-CRF 78.0% 92.6%
CMU Dataset
− BDT-CRF-Scene Layout 73.1% 96.4%
(Lalonde et al. [144])
− ConvNet(Boundary+Region) 81.5% 90.5%
− ConvNet(Boundary+Region)-CRF 83.3% 90.9%
UIUC Dataset
− Unary-Pairwise (Guo et al. [78]) 71.6% 95.2%
− ConvNet(Boundary+Region) 83.6% 94.7%
− ConvNet(Boundary+Region)-CRF 84.7% 95.5%
Table 3.2: Class-wise accuracies of our proposed framework in comparison with the
state-of-the-art techniques. Our approach gives the highest accuracy for the class
‘shadows’.
design but also require long training times when ensemble learning methods are used
for feature selection. As an example, Zhu et al. [312] extracted different shadow
variant and invariant features alongside an additional 40 classification results from
the Boosted Decision Tree (BDT) for each pixel as their features. Their approach
required a huge amount of memory (∼9GB for 125 training images of average size
of approximately 480 × 320). Even after parallelization and training on multiple
processors, they reported 10 hours of training with 125 images. Lalonde et al.
[144] used 48 dimensional feature vectors extracted at each pixel and fed these to a
boosted decision tree in a similar manner as Zhu et al. [312]. Jiang et al. included
illumination features on top of the features that are used by Lalonde et al. [144].
Although enriching the feature set in this manner improves the performance, it not only takes much more effort to design such features but also slows down the
detection procedure. In contrast, our feature learning procedure is fully automatic
and requires only ∼1GB memory and approximately one hour training for each of
Figure 3.10: Examples of our results; Images (1st, 3rd row) and shadow masks (2nd, 4th row); Shadows are in white.
the UCF, CMU and UIUC databases. The proposed approach is also efficient at
test time because the ConvNet feature extraction and unary potential computation
take an average of 1.3±0.35 sec per image on the UCF, CMU and UIUC databases.
The graph-cut inference step used for the CRF energy minimization is also fast and
takes 0.21± 0.03 sec per image on average. Overall, our technique takes 2.8± 0.81
sec per image for shadow detection. In comparison, the method by Guo et al. [78]
takes 40.05± 10 sec per image for shadow detection.
Figure 3.11: Examples of Ambiguous Cases: (From left to right) Our framework
misclassified a dark non-shadow region, texture-less black window glass, very thin
shadow region and trees due to complex self shading patterns. (Best viewed in color)
We extensively evaluated our approach on all available databases and our pro-
posed framework turned out to be fairly generic and robust to variations. It achieved
the best results on all the single image shadow databases known to us. In contrast, previous techniques were only tested on a portion of a database [144], a single database [312] or at most two databases [78]. Another interesting observation was that the pro-
posed framework performed reasonably well when our ConvNets were trained on one
dataset and tested on another dataset. Table 3.3 summarizes the results of cross-
dataset evaluation experiments. These performance levels show that the feature
representations learned by the ConvNets across the different datasets were com-
mon to a large extent. This observation further supports our claim regarding the
generalization ability of the proposed framework.
In our experiments, objects with dark albedo turned out to be a difficult case
for shadow detection. Moreover, some ambiguities were caused by the complex self
shading patterns created by tree leaves. There were some inconsistencies in the
manually labeled ground-truths, in which a shadow mask was sometimes missing
for an attached shadow. Narrow shadowy regions caused by structures like poles
Figure 3.12: Qualitative Evaluation: Shadow recovery on sample images from the UIUC and UCF databases and other images used in the literature. Given an original image with shadow mask (first row), our method is able to extract exact shadows (second row) and to automatically recover the shadow-less images (third row). (Best viewed in color)
and pipes also proved to be a challenging case for shadow detection. Examples of
the above mentioned failure cases are shown in Fig. 3.11.
3.5.3 Evaluation of Shadow Removal
For a quantitative evaluation of our shadow removal framework, we used all
images from the UIUC Shadow dataset which come with their corresponding shadow-
free ground truths [78]. The qualitative results of our method are evaluated against
the common evaluation images used in the literature for a fair comparison. To
further illustrate the performance of our algorithm, we also included qualitative
results on some example images from UIUC, UCF and CMU shadow datasets.
Quantitative Evaluation
Table 3.4 presents the per pixel root mean square error (RMSE) for the UIUC
dataset, calculated in LAB color space [78]. The first row gives the actual error
between the same image, with and without shadow. The difference between the two
versions of the same image is calculated for both the shadow and the lit regions.
Note that the error is large for the shadowed region (as expected), but it is not zero
for the lit regions for two reasons: the shadow masks are not perfect and there is a
little difference in the light intensity due to the change in the ambient light for the lit
regions when the object casting shadow is present. We achieved an average RMSE
error of 6.8 compared to 7.4 and 12.6 achieved by the methods of Guo et al. [78] and
Wu et al. [286], respectively. Following Guo et al. [78], we also include the removal
performance when the ground truth (GT) shadow masks are used for removal. This
gives a more precise estimate of the performance of the recovery algorithm. When
we evaluated our method using GT masks, our method achieved an error of 6.1
compared to 6.4 and 9.7 reported by [78] and [286] respectively. We also tested
the removal results without the Bayesian optimization, which resulted in an RMSE
error of 7.9. This is high compared to the results achieved after optimization. In
summary, our method achieved a reduction in error of 8.1% (removal using the
detected masks) and 4.6% (removal using ground truths) compared to the approach
of Guo et al. in [78].
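One plausible reading of this per-pixel LAB-space RMSE metric is sketched below using scikit-image's rgb2lab; the exact definition used by [78] may differ slightly, and the mask convention (non-zero = shadow) is an assumption made for this illustration.

import numpy as np
from skimage.color import rgb2lab

def lab_rmse(recovered_rgb, ground_truth_rgb, shadow_mask):
    # Inputs are float RGB images in [0, 1]; shadow_mask is non-zero inside shadows.
    rec, gt = rgb2lab(recovered_rgb), rgb2lab(ground_truth_rgb)
    per_pixel = np.sqrt(np.mean((rec - gt) ** 2, axis=2))  # RMSE over L, a, b per pixel
    shadow = shadow_mask.astype(bool)
    return {'shadow': per_pixel[shadow].mean(),
            'lit': per_pixel[~shadow].mean(),
            'all': per_pixel.mean()}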
Qualitative Evaluation
For the qualitative evaluation, we show some example images and their correspond-
ing recovered images along with the shadow masks in Fig. 3.12. It can be seen that
our method works well under different settings e.g., outdoor images (first five images
from the left) and indoor images (first two images from the right). The complex
texture in the shadow regions is preserved and the arbitrary shadow mattes are pre-
Methods                                                        Shadow Reg.   Lit Reg.   All Reg.
− Actual Error                                                     42.0         4.6       13.7
1a. Removal (Wu et al. [286]) with Automatic Shadow Detection      28.2         7.6       12.6
1b. Removal (Wu et al. [286]) using GT                             21.3         5.9        9.7
2a. Removal (Guo et al. [78])                                      13.9         5.4        7.4
2b. Removal using GT (Guo et al. [78])                             11.8         4.7        6.4
This chapter:
3a. Removal without Bayesian Refinement                            15.2         5.5        7.9
3b. Removal with Bayesian Refinement                               12.1         5.1        6.8
3c. Removal using GT                                               10.5         4.7        6.1
Table 3.4: Quantitative Evaluation: RMSE per pixel for the UIUC Subset of Images. (The smaller the RMSE, the better)
Figure 3.13: Comparison with Automatic/Semi-Automatic Methods: Recovered shadow-less images are compared with the state-of-the-art shadow removal methods which are either automatic [78, 61] or require minimal user input [238, 290]. We compare our work with: (from left to right) Finlayson et al. [61], Shor and Lischinski [238], Xiao et al. [290] and Guo et al. [78] respectively. The results achieved using our method (second column from right) are comparable or better than the previous best results (columns 1-5 from left). Additionally, our method works without any user input and provides shadow matte (last column) which can be used to generate composite images. (Best viewed in color and enlarged)
cisely recovered. Note that while our method can remove hard and smooth shadows
(e.g., 1st, 5th and 6th image from left), it also works well for the soft and variable
shadows (e.g., 2nd, 3rd and 4th image from left). Overall, the results are visually pleasing and the extracted shadow mattes are smooth and accurate.
Comparisons
We provide a qualitative comparison with two distinct categories of shadow removal
methods. First, we show comparisons (see Fig. 3.13) with the state-of-the-art
shadow removal methods which are either fully automatic (e.g., [78, 61]) or require
minimal user input (e.g., [238, 290]). From left to right we show the original image
along with the results from Finlayson et al. [61], Shor and Lischinski [238], Xiao et
al. [290], Guo et al. [78] and our technique. In comparison to the previous automatic
and semi-automatic (requiring minimal user input) methods, our approach produces
cleaner recovered images (second column from the right) along with an accurate
shadow matte (right most column).
Since there are only very few automatic shadow removal methods in the literature, we also compare our approach with the most popular approaches which require user input (see Fig. 3.14). From left to right, we show our recovered images
(bottom row) along with the results from Wu et al. [286], Liu and Gleicher [172],
Arbel and Hel-Or [5], Vicente and Samaras [270], Fredembach and Finlayson [63]
and Kwatra et al. [139]. For the 'puzzled child' image, it can be seen that the contrast of the recovered region is much better than the one recovered by Wu et al. [286]. The shadow-less image has no trace of strong shadow boundaries and the recovery in the penumbra region is smooth due to the introduction of α in the model
and the exclusion of the spatial affinity term [286] or boundary nullification [285]
during the rough shadow-less image estimation process. Similar effects can be seen
with the other images; e.g., in 3rd image from the left, the result of Arbel and Hel-Or
[5] has a high contrast while our result is smooth and successfully retains texture.
Similarly, for the case of the 4th, 5th and 6th images from the left, our shadow removal
result is visually pleasing and considerably better than the recent state-of-the-art
methods. Note however that the recovery result of the 2nd image from the left has
an over-smoothing effect, probably because the color distributions of differently col-
ored shadowed regions could not be separated during the Gaussian fitting process.
Overall, the results are quite reasonable considering that the algorithm does not
require any user assistance and it does not make any prior assumptions such as a
Planckian light source or a narrow-band camera.
Figure 3.14: Comparison with Methods Requiring User Interaction: Recovered shadow-less images are compared with the state-of-the-art shadow removal methods (which require a considerable amount of user input). We compare our work with: (from left to right in the second row) Wu et al. [286], Liu and Gleicher [172], Arbel and Hel-Or [5], Vicente and Samaras [270], Fredembach and Finlayson [63] and Kwatra et al. [139] respectively. The results achieved by our method (last row) are comparable or better than the previous best results (second row). Additionally, our method works without any user input and provides shadow matte (third row) which can be used to generate composite images. (Best viewed in color and enlarged)
Failure Cases and Limitations
Our shadow removal technique does not perform well on curved surfaces and in the
case of highly non-uniform shadows (e.g., Fig. 3.15: 1st and 3rd image from left).
Since we apply a multi-level color transfer scheme, very fine texture details of image
regions with similar appearance can be removed during this transfer process (e.g.,
Fig. 3.15: 2nd image from left). For the cases of shadows in dark environments, our
method appears to increase the contrast of the recovered region. These limitations
are due to the constraints imposed on the shadow generation model, where the
higher order statistics are ignored during the shadow generation process (Eqs. 3.19
and 3.21).
Discussion
Our method does not require any user input and automatically removes a shadow after its detection. The proposed shadow removal approach makes comparatively fewer assumptions about the scene type, the type of light source or the camera. The only assumptions are those of Lambertian surfaces and the correspondence between the shadow and the non-shadow region color distributions. The shadow removal
method of [285, 286] cannot separate the shadow from shading. With the inclusion
of the image consistency term in P`(Is|α, β), we are able to deal with the shading
by introducing a penalty on the distribution of the shadow effect through the pa-
rameters β and α. The proposed shadow removal approach takes 82.2 ± 25 sec for
each image on the UIUC database. The main overhead during the shadow removal
process is the Bayesian refinement step (which is required mainly for shadow mat-
ting). It takes 73.6±20 sec out of 82.2±25 sec per image on the UIUC database. In
comparison, the method by Guo et al. [78] takes 104.7± 18 sec for shadow removal.
The main overhead in their removal process is also due to Levin et al.’s matting
algorithm [155] which takes around 91.4± 11 sec per image.
3.5.4 Applications
Shadow detection, removal and matting have a number of applications. A direct
application is the generation of visually appealing photographs and the removal of
unwanted shadows. Some other applications include:
Shadow Compositing: Fig. 3.16a shows examples of shadow compositing.
The extracted shadow matte can be used to depict a realistic image compositing.
For example, the first image from the left did not originally contain the flying bird
and its shadow. If we had added just the bird, it would have looked unrealistic.
With the addition of a texture-free shadow matte, the photograph looks natural
Figure 3.15: Examples of Failure Cases: Our technique does not perfectly remove
shadows on curved surfaces, highly non-uniform shadows and shadows in dark en-
vironments. (Best viewed in color and enlarged)
and realistic. In the remaining three images, we combine extracted shadows with
the original images to create fake effects.
Image Editing: Fig. 3.16b shows how a detected shadow can be edited to
create fake effects. For example, shadow direction/length can be modified to give a
fake impression of illumination source or time of day.
Image Parsing: Fig. 3.16c shows how shadow removal can increase the accu-
racy of segmentation methods (e.g., [129, 125]). The segmentations are computed
using the graph based technique of [57] (we used a minimum region size of 600).
It can be seen that shadows change the appearance of a class (e.g., ground in this
case) and thus can introduce errors in the segmentation process.
Boundary Detection: We tested a recently proposed boundary detector [46]
on the original and recovered image (Fig. 3.16d). The boundaries identified in the
recovered image are more accurate. Since shadows do not constitute an object class,
the recovered image can help in achieving more accurate object detection proposals
and consequently a higher recognition performance.
(a) Shadow Compositing
(b) Image Editing
(c) Image Parsing
(d) Boundary Detection
Figure 3.16: Different Applications of Shadow Detection, Removal and Matting.
(Best viewed in color and enlarged)
3.6 Conclusion
We presented a data-driven approach to learn the most relevant features for the
detection of shadows from a single image. We demonstrated that our framework
performs the best on a number of databases regardless of the shape of objects casting
shadows, the environment and the type of scene. We also proposed a shadow re-
moval framework which extracts the shadow matte along with the recovered image.
A Bayesian formulation constitutes the basis of our shadow removal procedure and
thereby makes use of an improved shadow generation model. Our shadow detection
results show that a combination of boundary and region ConvNets incorporated in
the CRF model provides the best performance. For shadow removal, the multi-level
color transfer followed by the Bayesian refinement performs well on unconstrained
images. The proposed framework has a number of applications including image edit-
ing and enhancement tasks. In our future work, we will use the proposed shadow
detection framework together with the scene geometry (as in [144]) and object prop-
erties to reason about high-level scene understanding tasks (as in [203]). The use
of our proposed framework for shadow detection in video sequences will also be
explored to take advantage of the spatio-temporal properties of moving shadows.
CHAPTER 4
Separating Objects and Clutter in Indoor Scenes
via Joint Reasoning1
Out of clutter, find simplicity.
Albert Einstein (1879-1955)
Abstract
Objects’ spatial layout estimation and clutter identification are two important tasks for understanding indoor scenes. We propose to solve both of these problems in
a joint framework using RGBD images of indoor scenes. In contrast to recent ap-
proaches which focus on either one of these two problems, we perform ‘fine grained
structure categorization’ by predicting all the major objects and simultaneously
labeling the cluttered regions. A conditional random field model is proposed to in-
corporate a rich set of local appearance, geometric features and interactions between
the scene elements. We take a structural learning approach with a 3D localisation loss to estimate the model parameters from a large annotated RGBD dataset,
and a mixed integer linear programming formulation for inference. We demonstrate
that our approach is able to detect cuboids and estimate cluttered regions across
many different object and scene categories in the presence of occlusion, illumination
and appearance variations.
4.1 Introduction
We live in a three dimensional world where objects interact with each other
according to a rich set of physical and geometrical constraints. Therefore, merely
recognizing objects or segmenting an image into a set of semantic classes does not
always provide a meaningful interpretation of the scene and its properties. A better
understanding of real-world scenes requires a holistic perspective, exploring both
semantic and 3D structures of objects as well as the rich relationship among them
[79, 275, 129, 309]. To this end, one fundamental task is that of the volumetric
reasoning about generic 3D objects and their 3D spatial layout.
1Published in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 4603-4611. IEEE, 2015.
Figure 4.1: With a given RGBD image (left column), our method explores the
3D structures in an indoor scene and estimates their geometry using cuboids (right
image). It also identifies cluttered/unorganized regions in a scene (shown in orange)
which can be of interest for tasks such as robot grasping.
Among different approaches to tackle the generic 3D object reasoning problem,
much progress has been made based on representing objects as 3D geometric prim-
itives, such as cuboids. Some of the first efforts focus on the 3D spatial layout and
cuboid-like objects in indoor scenes from monocular imagery [150, 92, 293]. Owing
to the complex structure of the scenes, additional depth information has recently
been introduced to obtain more robust estimation [167, 110, 87, 236]. However,
real-world scenes are composed of not only large regular-shaped structures and ob-
jects (such as walls, floor, furniture), but also irregular shaped objects and cluttered
regions which cannot be represented well by object-level primitives. The overlay of
different types of scene elements makes the procedure of localizing 3D objects fragile
and prone to misalignment.
Most previous work has focused on clutter reasoning in the scene layout estima-
tion problem [92, 275, 306]. Such object clutter is usually defined at a coarse-level,
including everything other than the global layout, which is insufficient for object-
level parsing. To tackle the problem of 3D object cuboid estimation, we attempt
to use clutter in a more fine-grained sense, referring to any unordered region other
than the main structures and major cuboid-like objects in the scene, as shown in
Fig. 4.1.
We aim to address the problem of 3D object cuboid detection in a cluttered
scene. In this work, we propose to jointly localize generic 3D objects (represented
by cuboids) and label cluttered regions from an RGBD image. Unlike the recent
cuboid detection techniques, which consider such regions as background, our method
explicitly models the appearance and geometric property of the fine-grained clut-
tered regions. We incorporate scene context (in the form of object and clutter) to
better model the regular-shaped objects and their interaction with other types of
regions in a scene.
We adopt the approach in [110] for representing an indoor scene, which models a
room as a set of hypothesized cuboids and local surfaces defined by superpixels. To
cope with clutters, we formulate the joint detection task using a higher-order Condi-
tional Random Field model (CRF) on superpixels and cuboid hypotheses generated
by a bottom-up grouping process. Our CRF approach extends the linear model
of [110] in several aspects. First, we introduce a random field of local surfaces (su-
perpixels) that captures the local appearance and spatial smoothness of cluttered
and noncluttered regions. In addition, we improve the cuboid representation by
generating two types of cuboid hypotheses, one of which corresponds to regular ob-
jects inside a scene and the other is for the main structures of a scene, such as floor
and walls. Furthermore, we incorporate both the consistency between superpixel
labels and cuboid hypotheses and the occlusion relation between cluttered regions
and cuboid objects.
More importantly, we take a structural learning approach to estimate the CRF
parameters from an annotated indoor dataset, which enables us to systematically
incorporate more features into our model and to avoid tedious manual tuning. We
use a max-margin based objective function that minimizes a loss defined on cuboid
detection. Similar to [110], the (loss-augmented) MAP inference of our CRF model
can be formulated as a mixed integer linear programming (MILP) formulation. We
empirically show that the MILP can be globally optimized with the Branch-and-
Bound method within a time of seconds to find a solution in most cases. During
testing, the MAP estimate of our CRF not only detects cuboid objects but also
identifies the cluttered regions. We evaluate our method on the NYU Kinect v2
dataset with augmented cuboid and clutter annotations, and demonstrate that the
proposed approach achieves superior performance to the state of the art.
4.2 Related Work
Localizing and predicting the geometry of generic objects using cuboids is a
challenging problem in highly cluttered indoor scenes. A number of approaches
extend 2D appearance-based methods to the task of predicting the 3D cuboids.
Variants of the Deformable Parts based Model (DPM) [56] have been used for 3D
cuboid prediction [209, 236, 289]. However, they do not consider clutter and heavy
occlusion in the scene. In [167], the Constrained Parametric Min-cut (CPMC) [27]
was extended from 2D to RGBD to generate a cuboid hypotheses set. In contrast,
we directly generate two types of cuboid proposals in a bottom-up fashion [110],
thus providing a simpler and efficient procedure which is better suited for indoor
RGBD data.
Based on the physical and geometrical constraints, a number of approaches have
been proposed for 3D object and scene parsing, e.g., [309, 128, 15]. The basic idea
is to incorporate contextual relationships at a higher level to avoid false detection.
Silberman et al. [242] predict the support surfaces and semantic object classes in an
indoor scene. Geometric and semantic relationships between different object classes
are modeled in works such as [132, 242, 68]. Gupta et al. [79] use a parse graph to
consider mechanical and geometric relationships amongst objects represented by 3D
boxes. For indoor scenes, volumetric reasoning is performed for 2D [150] and RGBD
images [110] to detect cuboids. However, none of these works estimate cuboids and
clutter jointly using relevant constraints.
The joint estimation of clutter along with the room layouts has previously been
shown to enhance performance. Wang et al. [275] predict clutter and layouts in
a discriminative setting where clutter is modeled using hidden variables. Recently,
Zhang et al. [306] employed RGBD data for joint layout and clutter estimation and
efficiently perform inference by potential decomposition. However, these works are
limited to only scene layout estimation and label everything else as clutter. Recently,
Schwing et al. [236] used monocular imagery to jointly estimate room layout along
with one major object present in a bedroom scene. In this work, we estimate the
scene bounding structures as well as ‘all’ of the major objects using 3D cuboids.
4.3 Our Approach
Indoor scenes contain material structures (e.g., ceiling, walls) and regular-shaped objects, which we term non-cluttered regions. In contrast, cluttered regions
Figure 4.2: Graph structure representation for the potentials defined on the object
cuboids and the cluttered/non-cluttered regions. (Best viewed in color)
consist of small, indistinguishable objects (e.g., stationery on an office table) or
jumbled regions in a scene (e.g., clothes piled on a bed). We represent an indoor
scene as an overlay of the cluttered regions (modeled as local surfaces) and the non-
cluttered regions (modeled using 3D cuboids). Our goal is to describe an RGBD
image with an optimal set of cuboids and pixel-level labeling of cluttered regions.
Our approach first generates a set of cuboid hypotheses based on image and
depth cues, which aims to cover the majority of true object locations. Taking them
as the potential object candidates, we can significantly reduce the search space of 3D
cuboids and construct a CRF on the image/depth superpixels and these candidates.
We will first introduce our CRF formulation assuming the cuboid hypotheses are
given, and refer the reader to Sec. 4.4 for details on the cuboid extraction procedure.
4.3.1 CRF Formulation
Given an RGBD image, denoted by I, we decompose it into a number of contigu-
ous partitions, i.e., superpixels: S = {s1, · · · , sJ}, where J is the total number of
superpixels. We associate a binary membership variable mj with each superpixel sj
to indicate whether it belongs to the cluttered or non-cluttered regions, and denote
m = {m1, · · · , mJ}. The set of cuboid hypotheses is denoted by O = {o1, · · · , oK}, where K is the total number of cuboid hypotheses. For each cuboid, we introduce a binary variable ck to indicate whether the kth cuboid hypothesis is active or not, and denote c = {c1, · · · , cK}.
Note that for indoor scenes, the room structures such as walls and floor bound
the scene and therefore appear as planar regions, which have different geometric
properties from the ordinary object cuboids. To encode such different constraints,
we define two types of cuboids in the hypotheses set, namely the scene bounding
cuboids (Osbc) and the object cuboids (Ooc). The cuboid extraction procedure for
both types of cuboids is described in Sec. 4.4.
We build a CRF model on the superpixel clutter variables m and the object
variables c to describe the properties of clutter, objects and their relationship in the
scene. Formally, we define the Gibbs energy of the CRF as follows,
E(m, c|I) = Eobj(c) + Esp(m) + Ecom(m, c), (4.1)
where Eobj(c) and Esp(m) capture the object-level and the superpixel-level properties respectively, and Ecom(m, c) models the interactions between them.
More specifically, the first term, Eobj(c), is defined as a combination of three
potential functions:
Eobj(c) = Σ_{k=1}^{K} [ψuobj(ck) + ψhobj(ck)] + Σ_{i<j} ψpobj(ci, cj),    (4.2)
where the unary potential ψuobj(ck) expresses the data likelihood of kth object hy-
pothesis, ψhobj(ck) encodes a MDL prior on the number of active cuboids, and the
pairwise potential ψpobj(ci, cj) models the physical and geometrical relationships be-
tween cuboids.
Similarly at the superpixel level, the second term, Esp, consists of two potential
functions:
Esp(m) = Σ_{j=1}^{J} ψusp(mj) + Σ_{(i,j)∈Ns} ψpsp(mi, mj),    (4.3)
where the unary potential ψusp(mj) is the data likelihood of a superpixel’s label, and
the pairwise potential ψpsp(mi,mj) encodes the spatial smoothness between neigh-
boring superpixels, denoted by Ns.
The third term in Eq. (4.1), is the compatibility constraint which enforces the
consistency of the cuboid activations and the superpixel labeling:
Ecom(m, c) = Σ_{j=1}^{J} ψcom(mj, c).    (4.4)
In the following discussion, we will explain the different costs which constitute the
energies defined in Eqs. (4.2), (4.3) and (4.4).
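As a schematic illustration of this decomposition, the energy of Eq. (4.1) can be evaluated as the following sum; the potential callables are placeholders for the terms defined in Eqs. (4.2)–(4.4), not the actual implementation.

def gibbs_energy(m, c, sp_neighbors, unary_sp, pair_sp, unary_obj, mdl_obj, pair_obj, compat):
    # m: superpixel clutter labels in {0, 1}; c: cuboid activations in {0, 1};
    # sp_neighbors: list of neighboring superpixel index pairs (the set Ns).
    K, J = len(c), len(m)
    e_obj = sum(unary_obj(k, c[k]) + mdl_obj(c[k]) for k in range(K)) + \
            sum(pair_obj(i, j, c[i], c[j]) for i in range(K) for j in range(i + 1, K))
    e_sp = sum(unary_sp(j, m[j]) for j in range(J)) + \
           sum(pair_sp(i, j, m[i], m[j]) for (i, j) in sp_neighbors)
    e_com = sum(compat(j, m[j], c) for j in range(J))
    return e_obj + e_sp + e_com  # E(m, c | I) = E_obj + E_sp + E_com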
4.3.2 Potentials on Cuboids
Unary Potential on Cuboids
The unary potential of a cuboid hypothesis ψuobj measures the likelihood of a cuboid
hypothesis being active based on its appearance, physical and geometrical properties.
Instead of specifying local matching costs manually, we extract a set of informative
multi-modal features from image/depth and each cuboid, and take a learning ap-
proach to predict the local matching quality. Specifically, we generate seven different
types of cuboid features (fobjk ) as follows.
Volumetric occupancy feature f occk measures the portion of the kth cuboid occupied by the 3D point data. We define f occk as the ratio of the empty volume inside a cuboid (vke) to the total volume of the cuboid (vkb): f occk = vke/vkb. The
volumes are estimated by discretizing the 3D space into voxels and counting the
number of voxels that are occupied by 3D points or not. All invisible voxels behind
occupied voxels are also treated as occupied.
Color consistency feature f colk encodes the color variation of the kth cuboid.
Object instances normally have consistent appearance while cluttered regions tend to
have a skewed color distribution (Fig. 4.3). We fit a GMM with three components
on the color distribution of pixels enclosed in a cuboid and measure the average
deviation. Specifically, the feature is defined as: f colk =∑∀p∈ok ωu‖vp − σu‖, where
vp denotes the color of a pixel p, σu is the mean of the closest component (u) and
ωu is the mixture proportion.
Normal consistency feature fnork measures the normal variation of the kth
cuboid. The distribution of 3D point normals inside the cluttered regions has a
larger variance (Fig. 4.3). In contrast, the normal directions of regular objects are
usually aligned with the three perpendicular faces of the cuboid. Similar to the color
feature, we calculate the variation of 3D point normals with respect to the closest
dominant direction.
Tightness feature f tigk describes how loosely the 3D points fit the cuboid proposals. For each visible face of a cuboid, we calculate the ratio of the area of the minimum bounding rectangle tightly enclosing all points (Afrec) to the area of the face (Af). We take the weighted average of the tightness ratios over the visible cuboid faces to define f tigk = (1/Σf ⟦Afrec ≠ 0⟧) Σ_{∀f∈Faces} Afrec/Af.
Support feature f supk measures how likely each cuboid is to be supported either by another cuboid or by clutter. We estimate the support by calculating the number of 3D points that fall in the space surrounding the cuboid (τ%² additional space along each dimension). The feature is defined as: f supk = (eo′k − eok)/eok, where eo′k and eok denote the number of points enclosed by the extended cuboid and the original cuboid respectively.
Geometric plausibility feature f geok measures the likelihood that a cuboid
has a plausible 3D object shape. Using 3D geometrical features (sizes and aspect
² Based on empirical tests, τ is set to 2.5% in this work.
Figure 4.3: The distribution of variation in color for cluttered and non-cluttered
regions in the RMRC training set is compared in the top row. Comparison for
variation in normals is shown in the bottom row. The plots in the right column
show the cumulative distributions.
ratios), we train a Random Forest (RF) classifier to score the geometric plausibility.
The score is used to define f geok , which filters out the less likely cuboid candidates
e.g., very thin cuboids or those with irregular aspect ratio.
Cuboid size feature f ochk measures the relative size of a cuboid w.r.t. the average object size in the dataset. Let ℓdl denote the maximum diagonal length of a cuboid and ℓ̄dl the mean length of objects. We define f ochk = ℓdl/ℓ̄dl, which helps control the number of valid cuboids by removing small ones.
Given the feature descriptor fobjk , we train a RF classifier on fobj and define the
unary potential based on the output of the RF, P (ck = 1|fobjk ):
ψuobj(ck) = λbbuµbbuk ck, (4.5)
where λbbu is the weighting coefficient and
µbbuk = − log [P(ck = 1|fobjk ) / (1 − P(ck = 1|fobjk ))].
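Two of these quantities are simple enough to sketch directly: the volumetric occupancy feature and the negative log-odds cost of Eq. (4.5). The voxel grid and the Random Forest probability are assumed inputs for illustration rather than parts of the original implementation.

import numpy as np

def occupancy_feature(voxel_occupied):
    # voxel_occupied: boolean grid of voxels inside the cuboid, True where 3D points
    # fall (invisible voxels behind occupied ones already marked True).
    total = voxel_occupied.size
    empty = total - int(voxel_occupied.sum())
    return empty / float(total)  # f_occ = empty volume / total volume

def unary_cuboid_cost(p_active, lambda_bbu=1.0, eps=1e-6):
    # mu_bbu = -log(p / (1 - p)) for the RF probability P(c_k = 1 | f_k), Eq. (4.5).
    p = float(np.clip(p_active, eps, 1.0 - eps))
    return lambda_bbu * (-np.log(p / (1.0 - p)))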
Note that those features are automatically weighted and combined by the RF for
predicting the local matching cost.
Cuboid MDL Potential
The MDL principle prefers to explain a given image compactly in terms of a small
number of cuboids, instead of a complex representation consisting of an unnecessarily
large number of cuboids [15, 110]. We define the MDL potential ψhobj in Eq. (4.2)
as: ψhobj(ck) = λmdlck, where λmdl > 0 is the weighting parameter.
Pairwise Potentials on Cuboids
Following [110], the pairwise energy in Eq. (4.2) decomposes into view obstruction and box intersection potentials:
ψpobj(ci, cj) = ψpobs(ci, cj) + ψpint(ci, cj). (4.6)
As we have two types of cuboids, our pairwise potentials on cuboids are parametrized
according to the configuration of each cuboid pair.
View obstruction potential (ψpobs) encodes the visibility constraint between
a pair of cuboids, and is expressed as follows:
ψpobs(ci, cj) = λobs µobsi,j ci cj = λobs µobsi,j yi,j    (4.7)
where, µobsi,j is the view obstruction cost, λobs is a weighting parameter and yi,j is an
auxiliary boolean variable introduced to linearize the pairwise term [110]. The view
obstruction cost µobsi,j computes the intersection of 2D projections of two cuboids and
induces a penalty when a larger cuboid lies in front of a smaller but farther cuboid.
Let µ̂obsi,j = (Aci ∩ Acj)/Aci, where Aci and Acj are the areas of the 2D projections of the cuboid hypotheses ci and cj on the image plane respectively, and ci is the farther cuboid w.r.t. the viewer. The cost µobsi,j = µ̂obsi,j if µ̂obsi,j < αobs and infinity otherwise. This allows partial occlusion with a penalty but avoids heavy occlusion. We use αobs = 60% for object cuboids (Ooc). For the case of scene bounding cuboids (Osbc), we relax the obstruction cost by a factor of 0.1 in Eq. (4.7) and set α′obs = 80%.
Cuboid intersection potential (ψpint) penalizes volumetric overlaps between
cuboid pairs as two objects cannot penetrate each other, and is defined as:
ψpint(ci, cj) = λint µinti,j ci cj = λint µinti,j xi,j    (4.8)
where, µinti,j is the cuboid intersection cost, λint is a weighting parameter and xi,j is
an auxiliary boolean variable introduced to linearize the pairwise cost. The cuboid
intersection cost induces a soft penalty as long as the intersection is smaller than
a threshold. Let µ̂inti,j be the normalized intersection, and we define µinti,j = µ̂inti,j if 0 ≤ µ̂inti,j < αint and infinity otherwise. We set αint = 10% for the case of object cuboids and α′int = 50% for all scene bounding cuboids.
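The thresholded overlap costs of Eqs. (4.7) and (4.8) can be sketched as follows for axis-aligned image-plane rectangles (x1, y1, x2, y2); treating the projected cuboid footprints as rectangles is a simplification made only for this illustration.

def rect_area(r):
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

def rect_intersection(a, b):
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def view_obstruction_cost(proj_far, proj_near, alpha_obs=0.60):
    # Overlap of the two 2D projections normalised by the farther cuboid's area;
    # partial occlusion is allowed with a penalty, heavy occlusion gets an infinite cost.
    overlap = rect_area(rect_intersection(proj_far, proj_near))
    ratio = overlap / max(rect_area(proj_far), 1e-9)
    return ratio if ratio < alpha_obs else float('inf')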
4.3.3 Potentials on Superpixels
We decompose an input image into superpixels based on the hierarchical image
segmentation [6]. The unary potential on each superpixel captures the appearance
and texture properties of cluttered and non-cluttered regions. We employ the kernel
descriptor framework of [16, 17] to convert pixel attributes to rich patch level feature
representations. Kernel descriptors provide a continuous pixel attribute represen-
tation by employing a kernel view of patch similarity. These higher dimensional
representations are then transformed to a low dimensional representation which are
then aggregated on a superpixel level using Efficient Match Kernel (EMK) to im-
prove efficiency. We extract several cues including image and depth gradient, color,
surface normal, LBP and self similarity. A RF classifier is trained on these dense
features, which predicts the probability of a region being a clutter or non-clutter.
We use the negative log odds ratio as a cost µappj , weighted by the parameter λapp
and define the unary in Eq. (4.3) as ψusp(mj) = λappµappj mj.
For the superpixel pairwise term, we define a contrast-sensitive Potts model on
spatially neighboring superpixels, which encourages the smoothness of the clutter
and non-clutter regions:
ψpsp(mi, mj) = λsmo µsmoi,j (mi + mj − mi·mj),    (4.9)
where, µsmoi,j = exp(−‖vi − vj‖²/σ²_c), and vi, vj are the mean colors of superpixels si and sj.
We use wi,j as an auxiliary boolean variable to linearize the quadratic term mi ·mj
(see Sec. 4.5).
4.3.4 Superpixel-Cuboid Compatibility
The compatibility term links the superpixels labeling to the cuboid selection
task, which enforces consistency between the lower level and the higher level of the
scene representation. Our compatibility potential consists of two terms, one for
superpixel membership ψmem and the other for occlusion relation ψocc:
ψcom(mj, c) = ψmem(mj, c) + Σk ψocc(mj, ck),    (4.10)
Superpixel membership potential (ψmem) defines a constraint that a superpixel is associated with at least one active cuboid if it is not a cluttered region: mj ≤ Σ_{k: sj∈ok} ck. Equivalently, the corresponding potential function is a higher-order term (Fig. 4.2):
ψmem(mj, c) = λ∞ ⟦mj ≠ max_{k: sj∈ok} ck⟧,    (4.11)
where λ∞ is an infinite (very large) penalty cost.
Superpixel-cuboid occlusion potential (ψocc) encodes that a cuboid should
not appear in front of a superpixel which is classified as clutter, i.e., a detected cuboid
cannot completely occlude a superpixel on the 2D plane which takes a clutter label.
ψocc(mj, ck) = λocc µoccjk m̄j ck = λocc µoccjk zjk    (4.12)
where, m̄j = 1 − mj, and zjk is the auxiliary variable for linearization. The cost µoccjk = (Amj ∩ Ack)/A, where A is the area of the farther element (either the cuboid or the superpixel). The cost µoccjk and the parameter αocc are defined similarly to the view obstruction potential in Sec. 4.3.2.
4.4 Cuboid Hypothesis Generation
Our method for initial cuboid hypothesis generation is based on a bottom-up
clustering-and-fitting procedure, which generates both object cuboids and scene
bounding cuboids. Specifically, we first extract homogeneous regions from a nor-
mal image using SLIC [2]. Gaussian smoothing is performed to remove isolated
regions and similar regions are merged using the DBSCAN clustering algorithm
[51]. The neighborhood of each resulting region is found and the inlier points in
each region are estimated using the RANSAC algorithm. We then estimate three
major perpendicular directions of a room as in [242], denoted as x, z (horizontal)
and y (vertical).
For object cuboids, we adopt a fitting method similar to [110]. The cuboids
identified using this procedure usually capture objects whose two or more sides are
visible, but cannot capture the room structure. To propose scene bounding cuboids,
we also generate cuboids which cover only one planar region. Among all the planar
regions, we first remove the smaller ones (< 5% of the image size) and those not
aligned with the three dominant directions. We then select the planar regions which
are farthest from the camera view point. The cuboids enclosing these planar regions
are included in the hypotheses set as the scene bounding cuboids. The detected
cuboid proposals are ranked using the cuboid unary potential (Eq. (4.5)) and the
top 60 cuboids are selected for our CRF inference.
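A rough outline of the bottom-up grouping stage described above is given below, using scikit-image's SLIC and scikit-learn's DBSCAN as stand-ins for the segmentation and merging steps; the parameter values are illustrative, and the RANSAC plane fitting and cuboid fitting steps are omitted.

import numpy as np
from skimage.segmentation import slic
from sklearn.cluster import DBSCAN

def planar_region_proposals(normal_image, n_segments=400, eps=0.1, min_samples=2):
    # normal_image: H x W x 3 array of unit surface normals in [-1, 1].
    # 1) Over-segment the (rescaled) normal image into homogeneous superpixels.
    segments = slic((normal_image + 1.0) / 2.0, n_segments=n_segments, compactness=10)
    # 2) Describe each segment by its mean normal and merge similar segments.
    seg_ids = np.unique(segments)
    mean_normals = np.array([normal_image[segments == i].mean(axis=0) for i in seg_ids])
    merged = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(mean_normals)
    # 3) Map every pixel to its merged region id (-1 marks unmerged segments).
    lookup = dict(zip(seg_ids.tolist(), merged.tolist()))
    return np.vectorize(lookup.get)(segments)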
4.5 Model Inference and Learning
4.5.1 Inference as MILP
Given an RGBD image I, we parse the input into a set of cuboids and clut-
tered/noncluttered regions by inferring the most likely configuration of clutter label
variables m and the cuboid hypotheses labels c. Equivalently, we minimize the CRF
energy:
{m∗, c∗} = argmin_{m,c} E(m, c|I).    (4.13)
We adopt the relaxation method in [110, 72] and transform the minimization in
Eq. (4.13) into a Mixed Integer Linear Program (MILP) with linear constraints.
The MILP formulation can be solved much faster compared to the original ILP,
using the branch and bound method.
Specifically, for the pairwise view obstruction cost in Eq. (4.7), we introduce yi,j
for ci ·cj with constraints: ci ≥ yi,j, cj ≥ yi,j, yi,j ≥ ci+cj−1. Similarly, we introduce
xi,j for the pairwise cuboid intersection cost. Also, we use an inequality ci + cj ≤ 1
for the infinity cost constraint of µobsi,j and µinti,j . These equivalent transforms can
also be applied to wi,j for mi · mj in the superpixel pairwise potential, and zj,k
for mjck in the superpixel-cuboid potential. For clarity, we denote the complete
set of linear inequality constraints for c and m as LC and include the details in
the supplementary material. The complete MILP formulation with linear objective
function and constraints is given by:
min_{m,c,x,y,w,z} E(m, c, x, y, w, z | I)    (4.14)
s.t. linear inequality constraints in LC,
mj, ck ∈ {0, 1}, ∀ j, k    (4.15)
wi,j, xi,j, yi,j, zj,k ≥ 0, ∀ i, j, k    (4.16)
We solve the MILP problem in Eqs. (4.14) - (4.16) by the Branch and Bound method
in the GLPK solver [145].
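The linearization described above can be reproduced with any off-the-shelf MILP library. The toy example below uses PuLP (with its default solver) purely to illustrate the auxiliary-variable construction for a product of two binary cuboid indicators; the unary and pairwise costs are made-up numbers, not learned potentials.

import pulp

unary = {1: -2.0, 2: -1.5}   # illustrative unary costs for two cuboid hypotheses
pairwise = 3.0               # illustrative overlap penalty

prob = pulp.LpProblem("toy_cuboid_selection", pulp.LpMinimize)
c1 = pulp.LpVariable("c1", cat="Binary")
c2 = pulp.LpVariable("c2", cat="Binary")
y12 = pulp.LpVariable("y12", lowBound=0)   # auxiliary variable standing in for c1*c2

prob += unary[1] * c1 + unary[2] * c2 + pairwise * y12
prob += c1 >= y12            # linearization constraints:
prob += c2 >= y12            #   c_i >= y, c_j >= y,
prob += y12 >= c1 + c2 - 1   #   y >= c_i + c_j - 1

prob.solve()
print(pulp.value(c1), pulp.value(c2), pulp.value(y12))   # expected: 1, 0, 0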
Algorithmic Efficiency: We empirically evaluate the efficiency of the Branch
and Bound algorithm on the scene parsing problem introduced in Sec. 4.6. Tab. 4.1
lists the average time it takes to reach the optimal solution on a 3.4GHz machine.
On average, 819± 48% variables are involved in each inference and the final MILP
gap is zero for 98.5% of the cases on the whole dataset. In this work, we use a
MILP gap tolerance of 0.001, however, it turns out that increasing the MILP gap
Small gap Large gap Cuts LP relax.
Time (sec) 1.84± 31% 1.31± 24% 0.45± 13% 0.001± 0.4%
Det. Rate 26.8% 26.1% 24.4% 19.9%
Table 4.1: Inference running time comparisons for variants of MILP formulation.
by a factor of 100 causes only a minute performance drop and leads to a more efficient inference.
Including cuts (cover cuts, Gomory mixed cuts, mixed integer rounding cuts, clique
cuts) results in a much faster convergence at the expense of an average of 8%
performance degradation and a 5% increase in memory requirements. When c and
m are relaxed to get the corresponding LP which has a polynomial time convergence
guarantee, the performance on the detection task decreases by 26% compared to
the MILP formulation. These performance comparisons are computed at the 40%
Jaccard Index (JI) threshold for cuboid detection.
4.5.2 Parameter Learning
We take a structural learning approach to estimate the model parameters from
a fully annotated training dataset. We denote the model outputs (m, c) as t, and
the model parameters (λbbu, λmdl, λobs, λint, λapp, λsmo, λocc) as λ. The training set
consists of a set of annotated images T = {(tn, In}1×N .
We apply the structured SVM framework with margin re-scaling [257], which
uses the cutting plane algorithm [117] to search the optimal parameter setting (see
the supplementary materials for details of the learning algorithm). We use an IoU-based loss on cuboid matching as our loss function in learning, which is defined as
∆(t(n), t) = Σ_i (1 − |o(n)i ∩ oi| / |o(n)i ∪ oi|),
where oi is the 3D cuboid associated with ci. The algorithm efficiently adds low
energy labelings to the active constraints set and updates the parameters such that
the ground-truth has the lowest energy.
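For axis-aligned boxes the per-cuboid term of this loss reduces to a ratio of volumes; the sketch below assumes boxes given as (xmin, ymin, zmin, xmax, ymax, zmax) and a one-to-one matching between predictions and ground truth, both of which are simplifications of the oriented, matched cuboids actually used.

def box_volume(b):
    return max(0.0, b[3] - b[0]) * max(0.0, b[4] - b[1]) * max(0.0, b[5] - b[2])

def iou_3d(a, b):
    # Intersection-over-union (Jaccard Index) of two axis-aligned 3D boxes.
    lo = [max(a[i], b[i]) for i in range(3)]
    hi = [min(a[i + 3], b[i + 3]) for i in range(3)]
    inter = box_volume(lo + hi)
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0

def cuboid_loss(gt_boxes, pred_boxes):
    # Structured loss of the form sum_i (1 - IoU(o_i^(n), o_i)).
    return sum(1.0 - iou_3d(g, p) for g, p in zip(gt_boxes, pred_boxes))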
4.6 Experiments and Analysis
4.6.1 Dataset and Setup
We evaluate our method on the 3D detection dataset released as part of the Re-
construction Meets Recognition Challenge (RMRC), 2013. It contains 1074 RGBD
images taken from the NYU Depth v2 dataset. Each image comes with 3D bounding box annotations. There are 7701 annotated 3D bounding boxes in total, which corresponds to roughly 7 labeled cuboids per image. We performed experiments on the
complete dataset using 3-fold cross validation. Specifically, for each fold, training is
done on 716 images and the testing is performed on the remaining 358 images.
We evaluate the performance on three tasks: cuboid detection,
clutter/non-clutter estimation and foreground/background segmentation. The
weighting parameters involved in the energy function (Eq. (4.1)) are learned (details
in Sec. 4.5.2). Other parameters which are involved in shaping the constraints (e.g.,
αobs, αint) are set to achieve the best performance on a small validation set. This
validation set consists of 10 randomly sampled training images in each iteration of
3-fold cross validation.
4.6.2 Cuboid Detection Task
We first evaluate the cuboid detection task, in which we compute the intersection
over union of volumes (Jaccard Index-JI) for the quantitative evaluation. Fig. 4.4
shows the cuboid detection rate as the threshold for JI is increased from 0 to 1. The
overall low detection rate is partially due to the fact that many cuboids for scene
structures and major objects (e.g., cupboard) are quite thin and the volumetric
overlap measure can be sensitive in such cases. We compare our method with a
baseline approach and the state of the art techniques by Jiang et al. [110], Huebner
et al. [100] and Truax et al. [260]. The baseline method uses only the unary cuboid
costs for detection. Random initializations are chosen for the parameters involved
in [100, 260]. We use the projected area of a cuboid as its saliency measure to rank
the ground-truth objects. The results (Fig. 4.4, Tab. 4.2) show that the global
optimization performs better than the unary scores and the local search techniques
[100, 260]. At the 40% JI threshold mark in Fig. 4.4, we achieve 31.1%, 26.8%, 38.0%
and 89.4% better performance than [110] on the top-one, top-two, top-three
and all-cuboid detection tasks, respectively. The ablative analysis in Tab. 4.2
indicates that both the newly introduced features and the joint modeling contribute
to the overall improvement in detection accuracy.
Qualitative comparisons are shown in Fig. 4.5. Our method gives good results
on many difficult indoor scenes involving clutter, partial occlusions, and appearance and
illumination variations. In some cases, ground-truth cuboids are not available for
some major objects/structures in the scene, yet our technique is able to detect
them correctly. We also compare qualitatively with Jiang et al.'s method [110],
for which the results are generated using the code provided by the authors; our
approach performs better in most of the cases.
[Figure 4.4: four panels, each plotting Detection Rate against the Jaccard Index Threshold (0 to 1) for This Paper, Jiang et al. [110], Truax et al. [260], Huebner et al. [100] and the Baseline.]
Figure 4.4: Jaccard Index comparisons for all annotated cuboids (top left), for the
most salient cuboid (top right), for top two salient cuboids (bottom left) and top
three salient cuboids (bottom right).
Figure 4.5: Comparison of our results (right-most column) with the state of the art
technique [110] (middle column) and the ground truth (left column). (Best viewed in
color and enlarged)
    Method                                      Accuracy
    Unary cuboid cost of Jiang [110]            6.5%
    Our unary cuboid cost only                  8.8%
    Our unary + pairwise cuboid cost only       19.4%
    Our full model                              26.1%

Table 4.2: An ablation study on the model potentials/features for the cuboid detec-
tion task at the 40% JI threshold.
4.6.3 Clutter/Non-Clutter Segmentation Task
To evaluate the clutter segmentation task, we generate the ground-truth clut-
ter labeling based on the cuboid annotation. Specifically, we project the 3D points
inside the ground-truth cuboids onto the image plane, and label them as the non-
clutter regions while the rest of the regions are clutter. As a baseline, we report the
performance when only superpixel unary cost was used for segmentation. The addi-
tion of the pairwise cost and the joint modeling results in significant improvement
(Tab. 4.3). We also consider only the object cuboids and compare the performance
when scene structure cuboids are excluded from the evaluations.
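To make this labeling procedure concrete, the following sketch (illustrative only; the intrinsics, cuboid and point cloud are hypothetical stand-ins) projects 3D points lying inside an axis-aligned ground-truth cuboid onto the image plane with a pinhole model and marks the corresponding pixels as non-clutter.

    # Illustrative sketch: mark pixels as non-clutter by projecting 3D points
    # that lie inside a ground-truth cuboid onto the image plane (pinhole model).
    # Camera intrinsics, image size, cuboid and points are hypothetical values.
    import numpy as np

    H, W = 480, 640
    fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5    # assumed NYU-like intrinsics
    cuboid_min = np.array([-0.5, -0.5, 1.0])        # axis-aligned box in camera coords
    cuboid_max = np.array([0.5, 0.5, 2.0])

    points = np.random.uniform(-1.0, 3.0, size=(10000, 3))  # stand-in point cloud
    inside = np.all((points >= cuboid_min) & (points <= cuboid_max), axis=1)
    p = points[inside & (points[:, 2] > 0)]          # keep points in front of the camera

    u = np.round(fx * p[:, 0] / p[:, 2] + cx).astype(int)
    v = np.round(fy * p[:, 1] / p[:, 2] + cy).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)

    clutter_mask = np.ones((H, W), dtype=bool)       # True = clutter by default
    clutter_mask[v[valid], u[valid]] = False         # projected cuboid points -> non-clutter
    print("non-clutter pixels:", (~clutter_mask).sum())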
    Method                                Precision      Recall         F-Score
    Superpixel unary only                 0.43 ± 13%     0.45 ± 11%     0.44 ± 16%
    Unary + pairwise                      0.46 ± 12%     0.48 ± 10%     0.47 ± 16%
    Full model (all classes)              0.65 ± 9%      0.68 ± 8%      0.66 ± 12%
    Full model (only object classes)      0.75 ± 6%      0.71 ± 8%      0.73 ± 10%

Table 4.3: Evaluation on the Clutter/Non-Clutter Segmentation Task. Precision signi-
fies the accuracy of clutter classification.
    Eval. Criterion         CPMC [27]                       This chapter
                            Pre.            Rec.            Pre.            Rec.
    Most salient obj.       0.83 ± 11%      0.79 ± 12%      0.85 ± 15%      0.82 ± 15%
    Top 2 salient obj.      0.77 ± 12%      0.73 ± 14%      0.81 ± 16%      0.79 ± 16%
    Top 3 salient obj.      0.69 ± 15%      0.66 ± 17%      0.79 ± 21%      0.76 ± 19%
    All objects             0.54 ± 17%      0.51 ± 20%      0.73 ± 23%      0.69 ± 21%

Table 4.4: Evaluation on the Foreground/Background Segmentation Task. Precision
signifies the accuracy of foreground detection.
Figure 4.6: Qualitative Results: Our method is able to accurately detect cuboids
in the case of cluttered indoor scenes (left column). The right-most two columns show
our clutter labelling and the ground-truth labelling on superpixels, respectively. In
these two columns, red color represents non-clutter while blue color represents
clutter. (Figure best viewed in color and enlarged)
4.6.4 Foreground Segmentation Task
We compare our results with the CPMC framework [27] on the foreground/background
segmentation task. The objects which are labeled in the dataset are treated as fore-
ground, while the cuboids which model the structures and the unlabeled regions
are treated as background. Tab. 4.4 shows the comparisons for the cases when top
most, top two, top three and all object cuboids are detected as foreground. For the
case of all detected object cuboids, the top ten foreground masks from the CPMC
framework are considered.
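For reference, the pixel-wise precision, recall and F-score reported in Tabs. 4.3 and 4.4 can be computed from binary masks as in the following sketch (not the exact evaluation code used here; the masks are toy data).

    # Illustrative sketch: pixel-wise precision, recall and F-score
    # between a predicted binary mask and a ground-truth binary mask.
    import numpy as np

    def prf(pred_mask, gt_mask):
        tp = np.logical_and(pred_mask, gt_mask).sum()
        fp = np.logical_and(pred_mask, ~gt_mask).sum()
        fn = np.logical_and(~pred_mask, gt_mask).sum()
        precision = tp / max(tp + fp, 1)
        recall = tp / max(tp + fn, 1)
        f_score = 2 * precision * recall / max(precision + recall, 1e-12)
        return precision, recall, f_score

    gt = np.zeros((4, 4), dtype=bool); gt[:2, :] = True       # toy ground truth
    pred = np.zeros((4, 4), dtype=bool); pred[:3, :] = True   # toy prediction
    print(prf(pred, gt))  # approximately (0.667, 1.0, 0.8)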
4.6.5 Discussion
The proposed approach can find wide applications in personal robotics, especially
for tasks such as indoor navigation and manipulation. A limitation of our approach is
its reliance on the initial cuboid generation. Some of the imperfect cuboid detection
examples are shown in Fig. 4.7. For example, our method is not able to propose
cuboids for objects when only one side is visible. For the clutter estimation task,
our method confuses specular surfaces with cluttered regions due to missing depth
values. Moreover, we did not explicitly use constraints such as the Manhattan world
assumption [62], which may improve the quality of the cuboids aligned with the room.
Figure 4.7: Ambiguous Cases: Examples of detection errors. (Figure best viewed in
color and enlarged)
In order to confirm that the detected cluttered regions satisfy our definition
(Sec. 4.3), we report some statistics on the RMRC dataset (Tab. 4.5). On each
detected cluttered region, we fit a cuboid whose base is aligned with the room
coordinates. It turns out that the mean volume occupancy and face coverage of all
such cuboids are quite low (36% and 44%, respectively).
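As a rough illustration of the volume occupancy measure (the exact measure used for Tab. 4.5 is not reproduced here), one can voxelize the fitted axis-aligned cuboid and count the fraction of voxels that contain at least one 3D point; all data in the sketch below are hypothetical.

    # Rough sketch of a volume-occupancy measure: fraction of voxels inside a
    # fitted axis-aligned cuboid that contain at least one 3D point.
    import numpy as np

    def volume_occupancy(points, box_min, box_max, n_vox=10):
        box_min, box_max = np.asarray(box_min, float), np.asarray(box_max, float)
        inside = np.all((points >= box_min) & (points <= box_max), axis=1)
        if not inside.any():
            return 0.0
        # Map points inside the box to voxel indices in [0, n_vox)^3.
        rel = (points[inside] - box_min) / (box_max - box_min)
        idx = np.clip((rel * n_vox).astype(int), 0, n_vox - 1)
        occupied = len({tuple(i) for i in idx})
        return occupied / float(n_vox ** 3)

    pts = np.random.uniform(0, 1, size=(500, 3)) * [1.0, 1.0, 0.3]   # thin slab of points
    print(volume_occupancy(pts, [0, 0, 0], [1, 1, 1]))  # low occupancy, roughly 0.2-0.3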
We summarize the run-time statistics of each step involved in our approach. The
cuboid hypothesis generation takes 21 ± 18% sec/img. The feature extraction on
cuboids and superpixels takes 8 ± 25% and 97 ± 33% sec/img, respectively. Training the RF
classifiers for the terms $f^{geo}_k$, $f^{obj}_k$ and $\psi^{u}_{sp}$ takes 6.5 sec, 11.2 sec and 2.8 min,
respectively. The parameter learning algorithm takes ∼ 7 hours. The proposed
approach is also efficient at test time, i.e., ∼ 1 sec/image (Tab. 4.1).
4.7 Conclusion
We have studied the problem of cuboid detection and clutter estimation for
developing a better holistic understanding of indoor scenes from RGBD images. Our
approach jointly models 3D generic objects as cuboids and cluttered regions as local
surfaces defined by superpixels. We build a CRF model for all the relevant scene
elements, and learn the model parameters based on a structural learning framework.
This enables us to incorporate a rich set of appearance and geometric features, as
well as meaningful physical and spatial relationships between generic objects. We
also derive an efficient inference procedure based on the MILP formulation, and show superior
results on cuboid detection and foreground segmentation. In the future, we will extend
the current work to incorporate useful relationships between semantic classes.
    Evaluation Criterion                    Statistics on RMRC Database
    Mean Volume Occupied                    0.36 ± 19%
    Mean Coverage along Cuboid Faces        0.44 ± 20%

Table 4.5: Statistics for cuboids fitted on cluttered regions.
4.8 Supplementary Material:
“Separating Objects and Clutter in Indoor Scenes”
4.8.1 Inference as MILP
The complete set of linear inequality constraints for c and m is as follows:
\begin{align}
& c_i \ge y_{i,j}, \quad c_j \ge y_{i,j}, \quad y_{i,j} \ge c_i + c_j - 1, \tag{4.17}\\
& \forall i,j : o_i \text{ and } o_j \in \mathcal{O}_{oc},\; 0 \le \mu^{obs}_{i,j} < \alpha_{obs}; \quad
  \forall i,j : o_i \text{ or } o_j \in \mathcal{O}_{sbc},\; 0 \le \mu^{obs}_{i,j} < \alpha'_{obs}. \tag{4.18}\\
& c_i \ge x_{i,j}, \quad c_j \ge x_{i,j}, \quad x_{i,j} \ge c_i + c_j - 1, \tag{4.19}\\
& \forall i,j : o_i \text{ and } o_j \in \mathcal{O}_{oc},\; 0 \le \mu^{int}_{i,j} < \alpha_{int}; \quad
  \forall i,j : o_i \text{ or } o_j \in \mathcal{O}_{sbc},\; 0 \le \mu^{int}_{i,j} < \alpha'_{int}. \tag{4.20}\\
& c_i + c_j \le 1, \tag{4.21}\\
& \forall i,j : o_i \text{ and } o_j \in \mathcal{O}_{oc},\; \mu^{int}_{i,j} \ge \alpha_{int} \lor \mu^{obs}_{i,j} \ge \alpha_{obs}; \quad
  \forall i,j : o_i \text{ or } o_j \in \mathcal{O}_{sbc},\; \mu^{int}_{i,j} \ge \alpha'_{int} \lor \mu^{obs}_{i,j} \ge \alpha'_{obs}. \tag{4.22}\\
& m_i \ge w_{i,j}, \quad m_j \ge w_{i,j}, \quad w_{i,j} \ge m_i + m_j - 1, \quad \forall i,j \tag{4.23}\\
& m_j \le \sum_{k : s_j \in o_k} c_k, \quad \forall j \tag{4.24}\\
& c_k \ge z_{j,k}, \quad m_j \le 1 - z_{j,k}, \quad z_{j,k} \ge c_k - m_j, \tag{4.26}\\
& \forall k : o_k \in \mathcal{O}_{oc},\; 0 \le \mu^{occ}_{j,k} < \alpha_{int}; \quad
  \forall k : o_k \in \mathcal{O}_{sbc},\; 0 \le \mu^{occ}_{j,k} < \alpha'_{int}. \nonumber
\end{align}
Algorithm 4 Parameter Learning using the Structured SVM Formulation
Input: Training set $\mathcal{T} = \{(\mathbf{y}^{n}, \mathbf{x}^{n})\}_{1\times N}$; convergence threshold $\epsilon$; initial parameters $\boldsymbol{\lambda}_0$
Output: Learned parameters $\boldsymbol{\lambda}^*$
 1: $S \leftarrow \emptyset$   // initialize the working set of low energy labelings which will be used as active constraints
 2: $\boldsymbol{\lambda} \leftarrow \boldsymbol{\lambda}_0$   // initialize the parameter vector
 3: while $\Delta\boldsymbol{\lambda} \ge \epsilon$ do
 4:   for $n = 1 \dots N$ do
 5:     $\mathbf{y}^* \leftarrow \arg\min_{\mathbf{y} \in \mathcal{Y}} \; E(\mathbf{y}, \mathbf{x}^{(n)}; \boldsymbol{\lambda}) - \Delta(\mathbf{y}^{(n)}, \mathbf{y})$
 6:     if $\mathbf{y}^* \ne \mathbf{y}^{(n)}$ then
 7:       $S^{(n)} \leftarrow S^{(n)} \cup \{\mathbf{y}^*\}$
 8:       $\boldsymbol{\lambda}^* \leftarrow \arg\min_{\boldsymbol{\lambda}} \; \frac{1}{2}\|\boldsymbol{\lambda}\|^2 + \frac{C}{N} \sum_n \xi_n$
 9:       s.t. $\boldsymbol{\lambda} \ge 0$, $\xi_n \ge 0$,   // update the parameters such that
10:       $E(\mathbf{y}, \mathbf{x}^{n}; \boldsymbol{\lambda}) - E(\mathbf{y}^{n}, \mathbf{x}^{n}; \boldsymbol{\lambda}) \ge \Delta(\mathbf{y}^{(n)}, \mathbf{y}) - \xi_n, \quad \forall \mathbf{y} \in S^{(n)}, \forall n$   // ground truth has the lowest energy
4.8.2 Parameter Learning
The training set consists of input image (x) and annotation (y) pairs. The
annotations y have labeled cluttered/non-cluttered regions as well as the ground
truth cuboids. The energy minimization step in Algorithm 4 (line 5) is solved using
the branch and bound method. The weight update step in Algorithm 4 (lines 8 -
10) can be solved using any standard quadratic program solver.
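For illustration only (this is not the solver used to produce the reported parameters), the weight update of lines 8 - 10 could be posed as a small quadratic program, for example with CVXPY, where each active constraint is summarized by a feature difference and a loss value; all numbers below are hypothetical placeholders.

    # Illustrative QP for the weight-update step (lines 8-10 of Algorithm 4),
    # posed with CVXPY. Each active constraint is summarized by the feature
    # difference psi(y) - psi(y_gt), so that E(y) - E(y_gt) = lambda . diff
    # for an energy linear in lambda; all numbers here are hypothetical.
    import cvxpy as cp
    import numpy as np

    n_params, C, N = 7, 10.0, 2
    feat_diffs = [np.array([0.3, -0.1, 0.2, 0.0, 0.5, 0.1, -0.2]),   # constraint from sample 1
                  np.array([0.1, 0.4, -0.3, 0.2, 0.0, 0.3, 0.1])]    # constraint from sample 2
    losses = [0.6, 0.4]
    sample_of = [0, 1]          # which training sample each active constraint belongs to

    lam = cp.Variable(n_params)
    xi = cp.Variable(N)
    constraints = [lam >= 0, xi >= 0]
    for diff, loss, n in zip(feat_diffs, losses, sample_of):
        # Margin re-scaling constraint: E(y) - E(y_gt) >= Delta(y_gt, y) - xi_n.
        constraints.append(diff @ lam >= loss - xi[n])

    objective = cp.Minimize(0.5 * cp.sum_squares(lam) + (C / N) * cp.sum(xi))
    prob = cp.Problem(objective, constraints)
    prob.solve()
    print(lam.value, xi.value)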
We use the re-scaled margin energy function formulation of Taskar et al. [257] in
the above algorithm. The re-scaled margin cutting plane algorithm efficiently adds
low energy labelings to the active constraint set and updates the parameters such
that the ground-truth has the lowest energy. $\Delta(\cdot)$ is the IOU loss function for cuboid
matching, defined as:
$$\Delta(\mathbf{y}^{(n)}, \mathbf{y}) = \sum_i \left(1 - \frac{|y_i^{(n)} \cap y_i|}{|y_i^{(n)} \cup y_i|}\right).$$
In our case, the initial parameters ($\boldsymbol{\lambda}_0$) are estimated using the piece-wise training
method described in [239]. Reasonable initial estimates make the parameter learning
process more efficient and less prone to getting stuck in local minima.