Understanding Real World Geometry


Transcript of Understanding Real World Geometry

Understanding Real World Geometry

Ankur Handa

University of Cambridge and Imperial College London

July 25, 2016

Ankur Handa Understanding Real World Geometry

Outline

Background
SceneNet: Repository of Labelled Synthetic Indoor Scenes
Results on Semantic Segmentation on Real World Data
gvnn: Neural Network Library for Geometric Vision
Future Ideas

Background

What is the goal?
Our initial goal was to segment 3D scenes in real time.
Overlay the per-pixel segmentations onto the 3D map.
We are only interested in depth-based semantic segmentation here, mainly to understand the role of geometry.

Background

How do we get enough training data?
Real world datasets are limited in size, e.g. NYUv2 and SUN RGB-D have 795 and 5K training images respectively.
We can leverage computer graphics to generate the desired data.
There are lots of CAD repositories of objects available, but none contains scenes.
We put together a repository of labelled 3D scenes called SceneNet and train per-pixel segmentation on synthetic data.

SceneNet

Repository of labelled 3D synthetic scenes.

Hosted at http://robotvault.bitbucket.org


SceneNet

SceneNet Basis Scenes

Room Type | Number of Scenes
Living Room | 10
Bedroom | 11
Office | 15
Kitchens | 11
Bathrooms | 10

In total we have 57 very detailed scenes with about 3,700 objects. We build upon these scenes later to create an unlimited number of scenes for large-scale training data.

SceneNet

A very detailed living room. Each object in this scene is labelled directly in 3D.

SceneNet

Dimensions: 27 × 25 × 2.7 m³.

Generating Unlimited Number of Synthetic Scenes

Generating indoor scenes as an energy minimisation problem:

Bounding box constraint.
Each object must maintain a safe distance from the others.

Pairwise constraint.
Objects that co-occur should not be more than a fixed distance from each other, e.g. beds and nightstands.

Visibility constraint.
Objects with a visibility constraint must not have anything obstructing the line of sight joining their centres, e.g. the sofa, table and TV must not have anything in between them.

Distance-to-wall constraint.
Some objects are more likely to occur next to walls, e.g. cupboards, sofas etc.

We use simulated annealing to solve this energy minimisation problem.
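The annealing loop for object placement can be sketched as below. This is a minimal illustration, not the paper's actual optimiser: the energy covers only the pairwise term, and the function names, perturbation scale, and cooling schedule are all assumptions.

```python
import math
import random

def pairwise_energy(positions, pairs, target=1.0):
    """Sum of squared deviations from a target distance for co-occurring
    pairs (only one of the energy terms; the others are omitted here)."""
    e = 0.0
    for a, b in pairs:
        (xa, ya), (xb, yb) = positions[a], positions[b]
        d = math.hypot(xa - xb, ya - yb)
        e += (d - target) ** 2
    return e

def anneal(positions, pairs, room=(5.0, 5.0), steps=5000, t0=1.0, seed=0):
    """Minimise the placement energy with simulated annealing: perturb one
    object at a time, always accept improvements, and accept worse moves
    with the Metropolis probability exp(-dE / T) under a cooling schedule."""
    rng = random.Random(seed)
    cur = dict(positions)
    e_cur = pairwise_energy(cur, pairs)
    best, e_best = dict(cur), e_cur
    for k in range(steps):
        t = t0 * (1.0 - k / steps) + 1e-9          # linear cooling schedule
        cand = dict(cur)
        name = rng.choice(sorted(cand))            # perturb one object
        x, y = cand[name]
        cand[name] = (min(max(x + rng.gauss(0, 0.2), 0.0), room[0]),
                      min(max(y + rng.gauss(0, 0.2), 0.0), room[1]))
        e_new = pairwise_energy(cand, pairs)
        if e_new < e_cur or rng.random() < math.exp((e_cur - e_new) / t):
            cur, e_cur = cand, e_new
            if e_cur < e_best:
                best, e_best = dict(cur), e_cur
    return best, e_best
```

Starting a bed and a nightstand in opposite corners, the annealed layout pulls the pair towards the target separation.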

Generating Unlimited Number of Synthetic Scenes

[Figure: heat-map of object co-occurrence. Rows and columns are the NYUv2 40-class labels: wall, otherprop, floor, bed, picture, pillow, otherfurniture, lamp, night stand, blinds, chair, box, books, clothes, dresser, door, paper, window, curtain, desk, bag, ceiling, bookshelf, cabinet, table, shelves, otherstructure, television, sofa, floor mat, mirror, towel, person, counter, refridgerator, shower curtain, whiteboard, toilet, sink, bathtub.]

Co-occurrence statistics of objects from NYUv2 bedrooms.

Interpretations
Heat-map of object co-occurrence.
The training set in NYUv2 is used to obtain these statistics.
Shows that bed, picture, pillow and nightstand co-occur together.
We sample new scenes from these object co-occurrence statistics.
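Sampling an object set from co-occurrence statistics could look roughly like the sketch below. The class subset, the counts, and the greedy growth scheme are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

# Hypothetical co-occurrence counts over a small subset of the NYUv2 classes;
# the real statistics are accumulated from the NYUv2 training annotations.
classes = ["bed", "pillow", "night stand", "lamp", "picture"]
C = np.array([[50, 45, 40, 30, 25],
              [45, 50, 35, 25, 20],
              [40, 35, 50, 30, 15],
              [30, 25, 30, 50, 10],
              [25, 20, 15, 10, 50]], dtype=float)

def sample_scene(seed_class, n_objects, rng):
    """Greedily grow an object set for a new scene: each added object is
    drawn with probability proportional to its co-occurrence with the
    objects already chosen."""
    chosen = [classes.index(seed_class)]
    while len(chosen) < n_objects:
        scores = C[chosen].sum(axis=0)
        scores[chosen] = 0.0                       # no duplicate classes here
        p = scores / scores.sum()
        chosen.append(int(rng.choice(len(classes), p=p)))
    return [classes[i] for i in chosen]
```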

Generating Unlimited Number of Synthetic Scenes

Random placement | With pairwise only | All constraints

Constraints
Objects that co-occur in the real world should be placed together, e.g. beds and nightstands.
Visibility constraint: nothing in the line of sight between the sofa-table and the TV.

Generating Unlimited Number of Synthetic Scenes

Very simple scenes with curtains.


Generating Unlimited Number of Synthetic Scenes

Common room obtained by hierarchically optimising the chair-and-table combination.

Generating Unlimited Number of Synthetic Scenes

Living Room. Note that the inset objects are from Stanford Scenes and have been grouped together already.

Generating Unlimited Number of Synthetic Scenes

Living Room. Grouped objects from Stanford Scenes.


Generating Unlimited Number of Synthetic Scenes

Data Collection
Collecting trajectories with a joystick (though we have randomised them now).
Using OpenGL glReadPixels to obtain the depth as well as ground-truth labels.

Generating Unlimited Number of Synthetic Scenes

Depth | Height | Angle with gravity

Assumptions
Here we study segmentation from depth data alone, in the form of DHA images, i.e. depth, height from the ground plane, and angle with the gravity vector.
This allows us to study the effects of geometry in isolation.
Texturing takes time and needs careful photorealistic synthesis, but we have since added a custom raytracer to study RGB-D based segmentation.
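Deriving the three DHA channels from a depth map might look like the numpy sketch below. It makes two simplifying assumptions that the real pipeline does not: the camera y-axis is taken as the gravity direction, and height is measured from the lowest back-projected point rather than an estimated ground plane.

```python
import numpy as np

def dha_from_depth(depth, fx, fy, cx, cy):
    """Build the three DHA channels (depth, height, angle-with-gravity)
    from a metric depth map, under the simplifications noted above."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # back-project each pixel to a camera-frame 3D point
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts = np.stack([x, y, depth], axis=-1)
    # surface normals from finite differences of the point cloud
    du = np.gradient(pts, axis=1)
    dv = np.gradient(pts, axis=0)
    n = np.cross(du, dv)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-9
    gravity = np.array([0.0, 1.0, 0.0])        # assumed gravity direction
    height = y.max() - y                       # y grows downwards in image coords
    angle = np.degrees(np.arccos(np.clip(n @ gravity, -1.0, 1.0)))
    return depth, height, angle
```

On a fronto-parallel plane the normals are perpendicular to gravity, so the angle channel is uniformly 90 degrees.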

Network Architecture

SegNet: [Badrinarayanan, Handa, and Cipolla, arXiv 2015]
Saves pooling indices explicitly in the encoder and passes them on to the decoder for upsampling.
Later, similar ideas emerged in DeconvNet by Hyeonwoo Noh et al., ICCV 2015, and Zhao et al., Stacked What-Where Auto-encoders, arXiv 2015.
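The pooling-index mechanism can be sketched in numpy as below. The real SegNet interleaves these operations with learned convolutions and batch normalisation; this sketch shows only the index bookkeeping, and assumes the spatial size is divisible by the pooling factor.

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """k x k max pooling that also records the argmax location of every
    window, as SegNet's encoder does."""
    h, w = x.shape
    out = np.zeros((h // k, w // k), dtype=x.dtype)
    idx = np.zeros((h // k, w // k), dtype=int)    # flat index into the input
    for i in range(0, h, k):
        for j in range(0, w, k):
            win = x[i:i + k, j:j + k]
            a = int(np.argmax(win))
            out[i // k, j // k] = win.flat[a]
            idx[i // k, j // k] = (i + a // k) * w + (j + a % k)
    return out, idx

def max_unpool(y, idx, shape):
    """SegNet-style upsampling: scatter each value back to its remembered
    argmax location; every other position stays zero. No parameters are
    learned for the upsampling itself."""
    out = np.zeros(shape, dtype=y.dtype)
    out.flat[idx.ravel()] = y.ravel()
    return out
```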

Results on Semantic Segmentation on Real World Data

Adding noise to depth-map | Denoising depth-map | Angle with gravity maps

Adapting synthetic data to match real world data
Synthetic depth-maps are injected with appropriate noise models.
We then denoise these depth-maps so that the input images look similar to the NYUv2 dataset.
Side-by-side comparison of a synthetic bedroom with one of the NYUv2 bedrooms.
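The inject-then-denoise step can be sketched as below. The noise model (axial Gaussian noise whose sigma grows quadratically with depth, plus disparity-style quantisation) follows common Kinect noise models, but the constants and the median-filter denoiser are illustrative stand-ins, not the paper's exact pipeline.

```python
import numpy as np

def add_kinect_like_noise(depth, rng):
    """Inject depth-dependent noise into a clean synthetic depth map.
    Constants here are illustrative, not the paper's exact values."""
    sigma = 0.0012 + 0.0019 * (depth - 0.4) ** 2
    z = np.maximum(depth + rng.normal(0.0, 1.0, depth.shape) * sigma, 1e-3)
    q = 1.0 / 581.0                  # hypothetical disparity quantisation step
    return 1.0 / (np.round((1.0 / z) / q) * q)

def median_denoise(depth):
    """3x3 median filter: a minimal stand-in for the denoising step that
    makes the inputs resemble preprocessed NYUv2 depth maps."""
    p = np.pad(depth, 1, mode="edge")
    h, w = depth.shape
    stack = [p[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.median(np.stack(stack), axis=0)
```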

Results on Semantic Segmentation on Real World Data

NYUv2 test images


Results on Semantic Segmentation on Real World Data

Ground truth segmentation


Results on Semantic Segmentation on Real World Data

Training on NYUv2 training set only


Results on Semantic Segmentation on Real World Data

Training on SceneNet and fine-tuning on NYUv2


Results on Semantic Segmentation on Real World Data

Training on NYUv2 only | Training on SceneNet and fine-tuning on NYUv2

What we learnt? [ICRA 2016, CVPR 2016]
Computer graphics is a great source for collecting data.
We need to work on designing or learning new loss functions.
A loss function that takes local and global context together would avoid objects being mislabelled as others.

Results on Semantic Segmentation on Real World Data

What we learnt? [ICRA 2016, CVPR 2016]
We perform better than the state of the art on functional categories of objects, suggesting that shape is a strong cue for segmentation.
We fall behind on objects that explicitly need RGB data.


SceneNet and Physics

Adding Physics
Allows us to create scenes with arbitrary clutter.
Placing pens and other small objects on tables is relatively easy with physics.
We don't want chairs always in upright positions; physics allows us to explore the space of object poses relatively easily.

SceneNet and Raytracing

Low-res on laptop | High quality

Adding Raytracing
Sufficient photorealism is desirable.
Depth sensors have limited range.
RGB is universal and allows us to model various different cameras.
We need good approximations of shadows, lighting, and various other global illumination effects.

gvnn: Geometric Vision with Neural Networks

What is gvnn?
A new library inspired by Spatial Transformer Networks (STN).
Includes various transformations often used in geometric computer vision.
These transformations are implemented as new layers that allow backpropagation as in STN.
Brings the domain knowledge of geometry into the neural network.

Different new layers in gvnn

What is gvnn?
The original Spatial Transformer (NIPS 2015) had mainly 2D transformations.
We added SO3, SE3 and Sim3 layers for global transformations of the image.
Optical flow, over-parameterised optical flow, slanted-plane disparity.
Pin-hole camera projection layer.
Per-pixel SO3, SE3 and Sim3 for non-rigid registration.
Various robust M-estimators.
Very useful for 3D alignment, unsupervised warping with optic flow or disparity, and geometric invariance for place recognition.

Global Transformations in gvnn

SO3 Layer (SE3 and Sim3 follow easily from here)

$$\frac{\partial C}{\partial \mathbf{v}} = \frac{\partial C}{\partial R(\mathbf{v})} \cdot \frac{\partial R(\mathbf{v})}{\partial \mathbf{v}} \qquad (1)$$

$$\frac{\partial R(\mathbf{v})}{\partial v_i} = \frac{v_i\,[\mathbf{v}]_\times + \big[\mathbf{v} \times (I - R)\,\mathbf{e}_i\big]_\times}{\|\mathbf{v}\|^2}\, R \qquad (2)$$

Notes
C is the cost function being minimised.
v = (v1, v2, v3) is the so(3) vector and e_i is the i-th column of the identity matrix.
Note that this closed form is not valid at v ≈ 0, where a limit expression is used instead.
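The SO3 exponential map and the derivative of eq. (2) can be checked numerically. This numpy sketch is not gvnn's Torch implementation, just a verification of the closed form (away from v = 0).

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix [v]_x, so that hat(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def exp_so3(v):
    """Rodrigues' formula: R(v) = I + (sin t / t)[v]_x + ((1 - cos t)/t^2)[v]_x^2,
    with t = ||v|| (assumed non-zero here)."""
    t = np.linalg.norm(v)
    K = hat(v)
    return np.eye(3) + (np.sin(t) / t) * K + ((1.0 - np.cos(t)) / t**2) * (K @ K)

def dR_dvi(v, i):
    """Closed form of eq. (2): (v_i [v]_x + [v x (I - R) e_i]_x) / ||v||^2 * R."""
    R = exp_so3(v)
    e_i = np.eye(3)[:, i]
    return (v[i] * hat(v) + hat(np.cross(v, (np.eye(3) - R) @ e_i))) / (v @ v) @ R
```

Central finite differences on `exp_so3` agree with `dR_dvi` to high precision.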

Per-pixel 2D Transformations in gvnn

Optical flow and Over-parameterised Optical Flow

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} x + t_x \\ y + t_y \end{pmatrix} \qquad (3)$$

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} a_0 & a_1 & a_2 \\ a_3 & a_4 & a_5 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \qquad (4)$$

Notes
Over-parameterised optical flow also needs extra regularisation.
2D and 6D transformations per pixel.
Useful for warping images (and unsupervised learning); Patraucean et al., ICLRW 2016.
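The warping that eq. (3) drives can be sketched as a bilinear backward warp: the output at (x, y) samples the input at (x + tx, y + ty). This is a minimal numpy sketch, not gvnn's layer.

```python
import numpy as np

def warp_bilinear(img, flow):
    """Backward-warp img with a per-pixel 2D flow field, sampling the input
    at (x + flow_x, y + flow_y) with bilinear interpolation and clamped
    borders."""
    h, w = img.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * img[y0, x0] + wx * (1 - wy) * img[y0, x1]
            + (1 - wx) * wy * img[y1, x0] + wx * wy * img[y1, x1])
```

A constant flow of one pixel in x simply shifts the image content by one column.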

Per-pixel 2D Transformations in gvnn

Disparity and Slanted-Plane Disparity

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} x + d \\ y \end{pmatrix} \qquad (5)$$

$$d = ax + by + c \qquad (6)$$

Notes
Fitting slanted planes at each pixel.
Again, very useful for warping images (and unsupervised learning); Garg et al., ECCV 2016.

Camera Projection Layer in gvnn

Pin Hole Camera Projection Layer

$$\pi\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} f_x \frac{u}{w} + p_x \\[2pt] f_y \frac{v}{w} + p_y \end{pmatrix} \qquad (7)$$

$$\frac{\partial C}{\partial \mathbf{p}} = \frac{\partial C}{\partial \pi(\mathbf{p})} \cdot \frac{\partial \pi(\mathbf{p})}{\partial \mathbf{p}} \qquad (8)$$

$$\frac{\partial \pi}{\partial (u, v, w)} = \begin{pmatrix} \frac{f_x}{w} & 0 & -\frac{f_x u}{w^2} \\[2pt] 0 & \frac{f_y}{w} & -\frac{f_y v}{w^2} \end{pmatrix} \qquad (9)$$
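Eq. (7) and the Jacobian of eq. (9) can be written down and cross-checked numerically. A numpy sketch, not gvnn's layer; the intrinsics below are illustrative values.

```python
import numpy as np

def project(p, fx, fy, px, py):
    """Pin-hole projection of eq. (7): (u, v, w) -> (fx*u/w + px, fy*v/w + py)."""
    u, v, w = p
    return np.array([fx * u / w + px, fy * v / w + py])

def project_jacobian(p, fx, fy, px, py):
    """Analytic 2x3 Jacobian of eq. (9), used to backpropagate through the
    projection layer."""
    u, v, w = p
    return np.array([[fx / w, 0.0, -fx * u / w**2],
                     [0.0, fy / w, -fy * v / w**2]])
```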

Non-rigid per-pixel transformation

Non-rigid registration for point clouds: 6DoF and 10DoF.
Per-pixel 6DoF and 10DoF transformations.

$$\begin{pmatrix} x'_i \\ y'_i \\ z'_i \end{pmatrix} = T_i \begin{pmatrix} x_i \\ y_i \\ z_i \\ 1 \end{pmatrix} \qquad (10)$$

$$\mathbf{x}'_i = s_i\big(R_i(\mathbf{x}_i - \mathbf{p}_i) + \mathbf{p}_i\big) + \mathbf{t}_i \qquad (11)$$

Also useful for volumetric spatial transformers when using a voxel-grid representation.
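Eq. (11) applied at a single point looks like the sketch below; the per-pixel layer applies an independent (s_i, R_i, p_i, t_i) at every pixel. With s = 1 this reduces to the 6DoF case.

```python
import numpy as np

def apply_sim3(x, s, R, p, t):
    """Eq. (11) for one point: x' = s * (R @ (x - p) + p) + t,
    i.e. rotate about the pivot p, scale by s, then translate by t."""
    return s * (R @ (x - p) + p) + t
```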

M-estimators

Notes
Different M-estimators for robust outlier rejection.
Very useful when doing regression.
Implemented as layers in gvnn.
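Two standard M-estimator weight functions are sketched below (the source does not say which estimators gvnn implements, so these are generic examples with their conventional tuning constants).

```python
import numpy as np

def huber_weight(r, k=1.345):
    """Huber M-estimator weight: 1 for small residuals, k/|r| beyond the
    kink, so large residuals contribute linearly rather than quadratically."""
    a = np.abs(r)
    return np.where(a <= k, 1.0, k / np.maximum(a, 1e-12))

def tukey_weight(r, c=4.685):
    """Tukey biweight: smoothly downweights residuals and fully rejects
    anything with |r| > c."""
    a = np.abs(r)
    w = (1.0 - (a / c) ** 2) ** 2
    return np.where(a <= c, w, 0.0)
```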

Sanity Checks on End-to-End Visual Odometry (SO3)

Notes
General dense image registration with a global transformation.
Can use optical flow/disparity for dense per-pixel image registration, either with a CNN or an RNN.
Initial experiments with supervised learning.


Sanity Checks on End-to-End Visual Odometry (SO3)

Notes
We aim to train on large-scale data from SceneNet and on RGB videos for unsupervised learning.

Sanity Checks on End-to-End Visual Odometry (SO3)

$$\delta_{\mathrm{update}} = f_{\mathrm{siamese}}(I_t^k, I_{t+1}) \qquad (12)$$
$$\delta_k = \delta_{k-1} + \delta_{\mathrm{update}} \qquad (13)$$
$$I_t^k = f_{\mathrm{warping}}(I_t, \delta_k) \qquad (14)$$
$$I_t^0 = I_t \qquad (15)$$
$$\delta_0 = 0 \qquad (16)$$
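The update equations (12)-(16) describe an iterative refinement loop, which can be exercised with a toy 1-D stand-in where the "pose" is a scalar offset and `f_siamese` is a damped estimator of the remaining residual. Both stand-ins are illustrative; the real networks operate on images.

```python
def iterative_registration(I_t, I_t1, f_siamese, f_warp, n_iters=8):
    """Eqs. (12)-(16): start from delta_0 = 0 and I_t^0 = I_t, predict an
    update from the current warped frame and the target, accumulate it,
    re-warp, and repeat."""
    delta = 0.0            # eq. (16)
    I_k = I_t              # eq. (15)
    for _ in range(n_iters):
        delta = delta + f_siamese(I_k, I_t1)   # eqs. (12) and (13)
        I_k = f_warp(I_t, delta)               # eq. (14)
    return delta
```

With a damping factor of 0.5 the residual halves each iteration, so the accumulated delta converges geometrically to the true offset.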

Sanity Checks on End-to-End Visual Odometry (SE3)

Notes
Explicitly needs depth to do the warping.
The depth can come either from a sensor or from an extra CNN/RNN module that learns depth.

Sanity Checks on End-to-End Visual Odometry (SE3)

(a) Prediction (b) Ground Truth (c) Residual (difference)


Where can deep learning help geometry?

Deep learning with geometry: an example
CNNs provide relatively stable features for images of the same scene taken across different lighting conditions.
Such images cannot be aligned with geometry-based per-pixel dense image alignment methods.
We can put this together with change-detection segmentation to also reason about dynamic scenes.
It is easy to collect data with SceneNet.

Future Ideas

Physical scene understanding with SceneNet, to reason about how the dynamics of the scene will evolve over time.

Understanding dynamic scenes: change detection for persistent 4D mapping.
Imagine two images of the scene taken at different times of day from relatively similar positions.
STN-SE3 could be used to align the images.
The aligned images can be fed into a segmentation network that gives a per-pixel change mask.
Joint end-to-end training.

Attend-Infer-Repeat style 3D scene understanding, in the form of object instances and their 6DoF poses even for cluttered scenes, with recurrent neural networks.

References

SceneNet: an Annotated Model Generator for Indoor Scene Understanding. Ankur Handa, Viorica Patraucean, Simon Stent, Roberto Cipolla. ICRA 2016.
Understanding Real World Indoor Scenes with Synthetic Data. Ankur Handa, Viorica Patraucean, Vijay Badrinarayanan, Simon Stent, Roberto Cipolla. CVPR 2016.
gvnn: Neural Network Library for Geometric Vision. Submitted to ECCV Workshop.

Acknowledgements

Thank you
Vijay Badrinarayanan, Viorica Patraucean, Simon Stent, and Roberto Cipolla, University of Cambridge
Zoubin Ghahramani, University of Cambridge
John McCormac and Andrew Davison, Imperial College London
Daniel Cremers, TUM Munich

Thank you for listening.