Understanding Real World Geometry


Transcript of Understanding Real World Geometry

Understanding Real World Geometry

Ankur Handa

University of Cambridge and Imperial College London

July 25, 2016

Ankur Handa Understanding Real World Geometry

Outline

Background
SceneNet: Repository of Labelled Synthetic Indoor Scenes
Results on Semantic Segmentation on Real World Data
gvnn: Neural Network Library for Geometric Vision
Future Ideas

Background

What is the goal?
Our initial goal was to segment 3D scenes in real time.
Overlay the per-pixel segmentations onto the 3D map.
We are only interested in depth-based semantic segmentation here, mainly to understand the role of geometry.

Background

How do we get enough training data?
Real world datasets are limited in size, e.g. NYUv2 and SUN RGB-D have 795 and 5K training images respectively.
We can leverage computer graphics to generate the desired data.
There are lots of CAD repositories of objects available, but none contains scenes.
We put together a repository of labelled 3D scenes called SceneNet and train per-pixel segmentation on synthetic data.

SceneNet

Repository of labelled 3D synthetic scenes.

Hosted at http://robotvault.bitbucket.org


SceneNet

SceneNet Basis Scenes

Room Type | Number of Scenes
Living Room | 10
Bedroom | 11
Office | 15
Kitchens | 11
Bathrooms | 10

In total we have 57 very detailed scenes with about 3,700 objects. We build upon these scenes later to create an unlimited number of scenes for large-scale training data.

SceneNet

A very detailed living room. Each object in this scene is labelled directly in 3D.

SceneNet

Dimensions: 27 × 25 × 2.7 m³.

Generating Unlimited Number of Synthetic Scenes

Generating indoor scenes as an energy minimisation problem:

Bounding box constraint.
Each object must maintain a safe distance from the others.

Pairwise constraint.
Objects that co-occur should not be more than a fixed distance from each other, e.g. beds and nightstands.

Visibility constraint.
Objects with a visibility constraint must not have anything obstructing the line of sight joining their centres, e.g. the sofa, table and TV must not have anything in between them.

Distance-to-wall constraint.
Some objects are more likely to occur next to walls, e.g. cupboards, sofas etc.

We use simulated annealing to solve this energy minimisation problem.
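The annealing loop for object placement can be sketched as below. This is a minimal illustration, not the paper's actual optimiser: the energy covers only the pairwise term, and the function names, perturbation scale, and cooling schedule are all assumptions.

```python
import math
import random

def pairwise_energy(positions, pairs, target=1.0):
    """Sum of squared deviations from a target distance for co-occurring
    pairs (only one of the energy terms; the others are omitted here)."""
    e = 0.0
    for a, b in pairs:
        (xa, ya), (xb, yb) = positions[a], positions[b]
        d = math.hypot(xa - xb, ya - yb)
        e += (d - target) ** 2
    return e

def anneal(positions, pairs, room=(5.0, 5.0), steps=5000, t0=1.0, seed=0):
    """Minimise the placement energy with simulated annealing: perturb one
    object at a time, always accept improvements, and accept worse moves
    with the Metropolis probability exp(-dE / T) under a cooling schedule."""
    rng = random.Random(seed)
    cur = dict(positions)
    e_cur = pairwise_energy(cur, pairs)
    best, e_best = dict(cur), e_cur
    for k in range(steps):
        t = t0 * (1.0 - k / steps) + 1e-9          # linear cooling schedule
        cand = dict(cur)
        name = rng.choice(sorted(cand))            # perturb one object
        x, y = cand[name]
        cand[name] = (min(max(x + rng.gauss(0, 0.2), 0.0), room[0]),
                      min(max(y + rng.gauss(0, 0.2), 0.0), room[1]))
        e_new = pairwise_energy(cand, pairs)
        if e_new < e_cur or rng.random() < math.exp((e_cur - e_new) / t):
            cur, e_cur = cand, e_new
            if e_cur < e_best:
                best, e_best = dict(cur), e_cur
    return best, e_best
```

Starting a bed and a nightstand in opposite corners, the annealed layout pulls the pair towards the target separation.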

Generating Unlimited Number of Synthetic Scenes

[Figure: heat-map of object co-occurrence. Rows and columns are the NYUv2 40-class labels: wall, otherprop, floor, bed, picture, pillow, otherfurniture, lamp, night stand, blinds, chair, box, books, clothes, dresser, door, paper, window, curtain, desk, bag, ceiling, bookshelf, cabinet, table, shelves, otherstructure, television, sofa, floor mat, mirror, towel, person, counter, refridgerator, shower curtain, whiteboard, toilet, sink, bathtub.]

Co-occurrence statistics of objects from NYUv2 bedrooms.

Interpretations
Heat-map of object co-occurrence.
The training set in NYUv2 is used to obtain these statistics.
Shows that bed, picture, pillow and nightstand co-occur together.
We sample new scenes from these object co-occurrence statistics.
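Sampling an object set from co-occurrence statistics could look roughly like the sketch below. The class subset, the counts, and the greedy growth scheme are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

# Hypothetical co-occurrence counts over a small subset of the NYUv2 classes;
# the real statistics are accumulated from the NYUv2 training annotations.
classes = ["bed", "pillow", "night stand", "lamp", "picture"]
C = np.array([[50, 45, 40, 30, 25],
              [45, 50, 35, 25, 20],
              [40, 35, 50, 30, 15],
              [30, 25, 30, 50, 10],
              [25, 20, 15, 10, 50]], dtype=float)

def sample_scene(seed_class, n_objects, rng):
    """Greedily grow an object set for a new scene: each added object is
    drawn with probability proportional to its co-occurrence with the
    objects already chosen."""
    chosen = [classes.index(seed_class)]
    while len(chosen) < n_objects:
        scores = C[chosen].sum(axis=0)
        scores[chosen] = 0.0                       # no duplicate classes here
        p = scores / scores.sum()
        chosen.append(int(rng.choice(len(classes), p=p)))
    return [classes[i] for i in chosen]
```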

Generating Unlimited Number of Synthetic Scenes

Random placement | With pairwise only | All constraints

Constraints
Objects that co-occur in the real world should be placed together, e.g. beds and nightstands.
Visibility constraint: nothing in the line of sight between the sofa-table and the TV.

Generating Unlimited Number of Synthetic Scenes

Very simple scenes with curtains.


Generating Unlimited Number of Synthetic Scenes

Common room obtained by hierarchically optimising the chair-and-table combination.

Generating Unlimited Number of Synthetic Scenes

Living Room. Note that the inset objects are from Stanford Scenes and have been grouped together already.

Generating Unlimited Number of Synthetic Scenes

Living Room. Grouped objects from Stanford Scenes.


Generating Unlimited Number of Synthetic Scenes

Data Collection
Collecting trajectories with a joystick (though we have randomised them now).
Using OpenGL glReadPixels to obtain the depth as well as ground-truth labels.

Generating Unlimited Number of Synthetic Scenes

Depth | Height | Angle with gravity

Assumptions
Here we study segmentation from depth data alone, in the form of DHA images, i.e. depth, height from the ground plane, and angle with the gravity vector.
This allows us to study the effects of geometry in isolation.
Texturing takes time and needs careful photorealistic synthesis, but we have since added a custom raytracer to study RGB-D based segmentation.
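Deriving the three DHA channels from a depth map might look like the numpy sketch below. It makes two simplifying assumptions that the real pipeline does not: the camera y-axis is taken as the gravity direction, and height is measured from the lowest back-projected point rather than an estimated ground plane.

```python
import numpy as np

def dha_from_depth(depth, fx, fy, cx, cy):
    """Build the three DHA channels (depth, height, angle-with-gravity)
    from a metric depth map, under the simplifications noted above."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # back-project each pixel to a camera-frame 3D point
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts = np.stack([x, y, depth], axis=-1)
    # surface normals from finite differences of the point cloud
    du = np.gradient(pts, axis=1)
    dv = np.gradient(pts, axis=0)
    n = np.cross(du, dv)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-9
    gravity = np.array([0.0, 1.0, 0.0])        # assumed gravity direction
    height = y.max() - y                       # y grows downwards in image coords
    angle = np.degrees(np.arccos(np.clip(n @ gravity, -1.0, 1.0)))
    return depth, height, angle
```

On a fronto-parallel plane the normals are perpendicular to gravity, so the angle channel is uniformly 90 degrees.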

Network Architecture

SegNet: [Badrinarayanan, Handa, and Cipolla, arXiv 2015]
Saves pooling indices explicitly in the encoder and passes them on to the decoder for upsampling.
Later, similar ideas emerged in DeconvNet by Hyeonwoo Noh et al., ICCV 2015, and Zhao et al., Stacked What-Where Auto-encoders, arXiv 2015.
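The pooling-index mechanism can be sketched in numpy as below. The real SegNet interleaves these operations with learned convolutions and batch normalisation; this sketch shows only the index bookkeeping, and assumes the spatial size is divisible by the pooling factor.

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """k x k max pooling that also records the argmax location of every
    window, as SegNet's encoder does."""
    h, w = x.shape
    out = np.zeros((h // k, w // k), dtype=x.dtype)
    idx = np.zeros((h // k, w // k), dtype=int)    # flat index into the input
    for i in range(0, h, k):
        for j in range(0, w, k):
            win = x[i:i + k, j:j + k]
            a = int(np.argmax(win))
            out[i // k, j // k] = win.flat[a]
            idx[i // k, j // k] = (i + a // k) * w + (j + a % k)
    return out, idx

def max_unpool(y, idx, shape):
    """SegNet-style upsampling: scatter each value back to its remembered
    argmax location; every other position stays zero. No parameters are
    learned for the upsampling itself."""
    out = np.zeros(shape, dtype=y.dtype)
    out.flat[idx.ravel()] = y.ravel()
    return out
```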

Results on Semantic Segmentation on Real World Data

Adding noise to depth-map | Denoising depth-map | Angle with gravity maps

Adapting synthetic data to match real world data
Synthetic depth-maps are injected with appropriate noise models.
We then denoise these depth-maps so that the input images look similar to the NYUv2 dataset.
Side-by-side comparison of a synthetic bedroom with one of the NYUv2 bedrooms.
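The inject-then-denoise step can be sketched as below. The noise model (axial Gaussian noise whose sigma grows quadratically with depth, plus disparity-style quantisation) follows common Kinect noise models, but the constants and the median-filter denoiser are illustrative stand-ins, not the paper's exact pipeline.

```python
import numpy as np

def add_kinect_like_noise(depth, rng):
    """Inject depth-dependent noise into a clean synthetic depth map.
    Constants here are illustrative, not the paper's exact values."""
    sigma = 0.0012 + 0.0019 * (depth - 0.4) ** 2
    z = np.maximum(depth + rng.normal(0.0, 1.0, depth.shape) * sigma, 1e-3)
    q = 1.0 / 581.0                  # hypothetical disparity quantisation step
    return 1.0 / (np.round((1.0 / z) / q) * q)

def median_denoise(depth):
    """3x3 median filter: a minimal stand-in for the denoising step that
    makes the inputs resemble preprocessed NYUv2 depth maps."""
    p = np.pad(depth, 1, mode="edge")
    h, w = depth.shape
    stack = [p[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.median(np.stack(stack), axis=0)
```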

Results on Semantic Segmentation on Real World Data

NYUv2 test images


Results on Semantic Segmentation on Real World Data

Ground truth segmentation


Results on Semantic Segmentation on Real World Data

Training on NYUv2 training set only


Results on Semantic Segmentation on Real World Data

Training on SceneNet and fine-tuning on NYUv2


Results on Semantic Segmentation on Real World Data

Training on NYUv2 only | Training on SceneNet and fine-tuning on NYUv2

What we learnt? [ICRA 2016, CVPR 2016]
Computer graphics is a great source for collecting data.
We need to work on designing or learning new loss functions.
A loss function that takes local and global context together would avoid objects being mislabelled as others.

Results on Semantic Segmentation on Real World Data

What we learnt? [ICRA 2016, CVPR 2016]
We perform better than the state of the art on functional categories of objects, suggesting that shape is a strong cue for segmentation.
We fall behind on objects that explicitly need RGB data.


SceneNet and Physics

Adding Physics
Allows us to create scenes with arbitrary clutter.
Placing pens and other small objects on tables is relatively easy with physics.
We don't want chairs always in upright positions; physics allows us to explore the space of object poses relatively easily.

SceneNet and Raytracing

Low-res on laptop | High quality

Adding Raytracing
Sufficient photorealism is desirable.
Depth sensors have limited range.
RGB is universal and allows us to model various different cameras.
We need good approximations of shadows, lighting, and various other global illumination effects.

gvnn: Geometric Vision with Neural Networks

What is gvnn?
A new library inspired by Spatial Transformer Networks (STN).
Includes various transformations often used in geometric computer vision.
These transformations are implemented as new layers that allow backpropagation as in STN.
Brings the domain knowledge of geometry into the neural network.

Different new layers in gvnn

What is gvnn?
The original Spatial Transformer (NIPS 2015) had mainly 2D transformations.
We added SO3, SE3 and Sim3 layers for global transformations of the image.
Optical flow, over-parameterised optical flow, slanted-plane disparity.
Pin-hole camera projection layer.
Per-pixel SO3, SE3 and Sim3 for non-rigid registration.
Various robust M-estimators.
Very useful for 3D alignment, unsupervised warping with optic flow or disparity, and geometric invariance for place recognition.

Global Transformations in gvnn

SO3 Layer (SE3 and Sim3 follow easily from here)

$$\frac{\partial C}{\partial \mathbf{v}} = \frac{\partial C}{\partial R(\mathbf{v})} \cdot \frac{\partial R(\mathbf{v})}{\partial \mathbf{v}} \qquad (1)$$

$$\frac{\partial R(\mathbf{v})}{\partial v_i} = \frac{v_i\,[\mathbf{v}]_\times + \big[\mathbf{v} \times (I - R)\,\mathbf{e}_i\big]_\times}{\|\mathbf{v}\|^2}\, R \qquad (2)$$

Notes
C is the cost function being minimised.
v = (v1, v2, v3) is the so(3) vector and e_i is the i-th column of the identity matrix.
Note that this closed form is not valid at v ≈ 0, where a limit expression is used instead.
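The SO3 exponential map and the derivative of eq. (2) can be checked numerically. This numpy sketch is not gvnn's Torch implementation, just a verification of the closed form (away from v = 0).

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix [v]_x, so that hat(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def exp_so3(v):
    """Rodrigues' formula: R(v) = I + (sin t / t)[v]_x + ((1 - cos t)/t^2)[v]_x^2,
    with t = ||v|| (assumed non-zero here)."""
    t = np.linalg.norm(v)
    K = hat(v)
    return np.eye(3) + (np.sin(t) / t) * K + ((1.0 - np.cos(t)) / t**2) * (K @ K)

def dR_dvi(v, i):
    """Closed form of eq. (2): (v_i [v]_x + [v x (I - R) e_i]_x) / ||v||^2 * R."""
    R = exp_so3(v)
    e_i = np.eye(3)[:, i]
    return (v[i] * hat(v) + hat(np.cross(v, (np.eye(3) - R) @ e_i))) / (v @ v) @ R
```

Central finite differences on `exp_so3` agree with `dR_dvi` to high precision.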

Per-pixel 2D Transformations in gvnn

Optical flow and Over-parameterised Optical Flow

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} x + t_x \\ y + t_y \end{pmatrix} \qquad (3)$$

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} a_0 & a_1 & a_2 \\ a_3 & a_4 & a_5 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \qquad (4)$$

Notes
Over-parameterised optical flow also needs extra regularisation.
2D and 6D transformations per pixel.
Useful for warping images (and unsupervised learning); Patraucean et al., ICLRW 2016.
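The warping that eq. (3) drives can be sketched as a bilinear backward warp: the output at (x, y) samples the input at (x + tx, y + ty). This is a minimal numpy sketch, not gvnn's layer.

```python
import numpy as np

def warp_bilinear(img, flow):
    """Backward-warp img with a per-pixel 2D flow field, sampling the input
    at (x + flow_x, y + flow_y) with bilinear interpolation and clamped
    borders."""
    h, w = img.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * img[y0, x0] + wx * (1 - wy) * img[y0, x1]
            + (1 - wx) * wy * img[y1, x0] + wx * wy * img[y1, x1])
```

A constant flow of one pixel in x simply shifts the image content by one column.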

Per-pixel 2D Transformations in gvnn

Disparity and Slanted-Plane Disparity

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} x + d \\ y \end{pmatrix} \qquad (5)$$

$$d = ax + by + c \qquad (6)$$

Notes
Fitting slanted planes at each pixel.
Again, very useful for warping images (and unsupervised learning); Garg et al., ECCV 2016.

Camera Projection Layer in gvnn

Pin Hole Camera Projection Layer

$$\pi\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} f_x \frac{u}{w} + p_x \\[2pt] f_y \frac{v}{w} + p_y \end{pmatrix} \qquad (7)$$

$$\frac{\partial C}{\partial \mathbf{p}} = \frac{\partial C}{\partial \pi(\mathbf{p})} \cdot \frac{\partial \pi(\mathbf{p})}{\partial \mathbf{p}} \qquad (8)$$

$$\frac{\partial \pi}{\partial (u, v, w)} = \begin{pmatrix} \frac{f_x}{w} & 0 & -\frac{f_x u}{w^2} \\[2pt] 0 & \frac{f_y}{w} & -\frac{f_y v}{w^2} \end{pmatrix} \qquad (9)$$
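Eq. (7) and the Jacobian of eq. (9) can be written down and cross-checked numerically. A numpy sketch, not gvnn's layer; the intrinsics below are illustrative values.

```python
import numpy as np

def project(p, fx, fy, px, py):
    """Pin-hole projection of eq. (7): (u, v, w) -> (fx*u/w + px, fy*v/w + py)."""
    u, v, w = p
    return np.array([fx * u / w + px, fy * v / w + py])

def project_jacobian(p, fx, fy, px, py):
    """Analytic 2x3 Jacobian of eq. (9), used to backpropagate through the
    projection layer."""
    u, v, w = p
    return np.array([[fx / w, 0.0, -fx * u / w**2],
                     [0.0, fy / w, -fy * v / w**2]])
```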

Non-rigid per-pixel transformation

Non-rigid registration for point clouds: 6DoF and 10DoF.
Per-pixel 6DoF and 10DoF transformations.

$$\begin{pmatrix} x'_i \\ y'_i \\ z'_i \end{pmatrix} = T_i \begin{pmatrix} x_i \\ y_i \\ z_i \\ 1 \end{pmatrix} \qquad (10)$$

$$\mathbf{x}'_i = s_i\big(R_i(\mathbf{x}_i - \mathbf{p}_i) + \mathbf{p}_i\big) + \mathbf{t}_i \qquad (11)$$

Also useful for volumetric spatial transformers when using a voxel-grid representation.
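Eq. (11) applied at a single point looks like the sketch below; the per-pixel layer applies an independent (s_i, R_i, p_i, t_i) at every pixel. With s = 1 this reduces to the 6DoF case.

```python
import numpy as np

def apply_sim3(x, s, R, p, t):
    """Eq. (11) for one point: x' = s * (R @ (x - p) + p) + t,
    i.e. rotate about the pivot p, scale by s, then translate by t."""
    return s * (R @ (x - p) + p) + t
```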

M-estimators

Notes
Different M-estimators for robust outlier rejection.
Very useful when doing regression.
Implemented as layers in gvnn.
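Two standard M-estimator weight functions are sketched below (the source does not say which estimators gvnn implements, so these are generic examples with their conventional tuning constants).

```python
import numpy as np

def huber_weight(r, k=1.345):
    """Huber M-estimator weight: 1 for small residuals, k/|r| beyond the
    kink, so large residuals contribute linearly rather than quadratically."""
    a = np.abs(r)
    return np.where(a <= k, 1.0, k / np.maximum(a, 1e-12))

def tukey_weight(r, c=4.685):
    """Tukey biweight: smoothly downweights residuals and fully rejects
    anything with |r| > c."""
    a = np.abs(r)
    w = (1.0 - (a / c) ** 2) ** 2
    return np.where(a <= c, w, 0.0)
```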

Sanity Checks on End-to-End Visual Odometry (SO3)

Notes
General dense image registration with a global transformation.
Can use optical flow/disparity for dense per-pixel image registration, either with a CNN or an RNN.
Initial experiments with supervised learning.


Sanity Checks on End-to-End Visual Odometry (SO3)

Notes
We aim to train on large-scale data from SceneNet and on RGB videos for unsupervised learning.

Sanity Checks on End-to-End Visual Odometry (SO3)

$$\delta_{\mathrm{update}} = f_{\mathrm{siamese}}(I_t^k, I_{t+1}) \qquad (12)$$
$$\delta_k = \delta_{k-1} + \delta_{\mathrm{update}} \qquad (13)$$
$$I_t^k = f_{\mathrm{warping}}(I_t, \delta_k) \qquad (14)$$
$$I_t^0 = I_t \qquad (15)$$
$$\delta_0 = 0 \qquad (16)$$
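The update equations (12)-(16) describe an iterative refinement loop, which can be exercised with a toy 1-D stand-in where the "pose" is a scalar offset and `f_siamese` is a damped estimator of the remaining residual. Both stand-ins are illustrative; the real networks operate on images.

```python
def iterative_registration(I_t, I_t1, f_siamese, f_warp, n_iters=8):
    """Eqs. (12)-(16): start from delta_0 = 0 and I_t^0 = I_t, predict an
    update from the current warped frame and the target, accumulate it,
    re-warp, and repeat."""
    delta = 0.0            # eq. (16)
    I_k = I_t              # eq. (15)
    for _ in range(n_iters):
        delta = delta + f_siamese(I_k, I_t1)   # eqs. (12) and (13)
        I_k = f_warp(I_t, delta)               # eq. (14)
    return delta
```

With a damping factor of 0.5 the residual halves each iteration, so the accumulated delta converges geometrically to the true offset.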

Sanity Checks on End-to-End Visual Odometry (SE3)

Notes
Explicitly needs depth to do the warping.
The depth can come either from a sensor or from an extra CNN/RNN module that learns depth.

Sanity Checks on End-to-End Visual Odometry (SE3)

(a) Prediction (b) Ground Truth (c) Residual (difference)


Where can deep learning help geometry?

Deep learning with geometry: an example
CNNs provide relatively stable features for images of the same scene taken across different lighting conditions.
Such images cannot be aligned with geometry-based per-pixel dense image alignment methods.
We can put this together with change-detection segmentation to also reason about dynamic scenes.
It is easy to collect data with SceneNet.

Future Ideas

Physical scene understanding with SceneNet, to reason about how the dynamics of the scene will evolve over time.

Understanding dynamic scenes: change detection for persistent 4D mapping.
Imagine two images of the scene taken at different times of day from relatively similar positions.
STN-SE3 could be used to align the images.
The aligned images can be fed into a segmentation network that gives a per-pixel change mask.
Joint end-to-end training.

Attend-Infer-Repeat style 3D scene understanding, in the form of object instances and their 6DoF poses even for cluttered scenes, with recurrent neural networks.

References

SceneNet: an Annotated Model Generator for Indoor Scene Understanding. Ankur Handa, Viorica Patraucean, Simon Stent, Roberto Cipolla. ICRA 2016.
Understanding Real World Indoor Scenes with Synthetic Data. Ankur Handa, Viorica Patraucean, Vijay Badrinarayanan, Simon Stent, Roberto Cipolla. CVPR 2016.
gvnn: Neural Network Library for Geometric Vision. Submitted to ECCV Workshop.

Acknowledgements

Thank you
Vijay Badrinarayanan, Viorica Patraucean, Simon Stent, and Roberto Cipolla, University of Cambridge
Zoubin Ghahramani, University of Cambridge
John McCormac and Andrew Davison, Imperial College London
Daniel Cremers, TUM Munich

Thank you for listening.