Understanding Real World Geometry
Ankur Handa
University of Cambridge and Imperial College London
July 25, 2016
Ankur Handa Understanding Real World Geometry
Outline
Background
SceneNet: Repository of Labelled Synthetic Indoor Scenes
Results on Semantic Segmentation on Real World Data
gvnn: Neural Network Library for Geometric Vision
Future Ideas
Background
What is the goal?
Our initial goal was to segment 3D scenes in real time.
Overlay the per-pixel segmentations onto the 3D map.
We are interested only in depth-based semantic segmentation here, mainly to understand the role of geometry.
Background
How do we get enough training data?
Real-world datasets are limited in size: NYUv2 and SUN RGB-D have 795 and ~5K training images respectively.
We can leverage computer graphics to generate the desired data.
There are many CAD repositories of objects available, but none contains full scenes.
We put together a repository of labelled 3D scenes, called SceneNet, and train per-pixel segmentation on synthetic data.
SceneNet
Repository of labelled 3D synthetic scenes.
Hosted at http://robotvault.bitbucket.org
SceneNet
SceneNet Basis Scenes

Room Type    | Number of Scenes
Living Room  | 10
Bedroom      | 11
Office       | 15
Kitchen      | 11
Bathroom     | 10
In total we have 57 very detailed scenes with about 3700 objects.
We later build upon these scenes to create an unlimited number of scenes for large-scale training data.
SceneNet
A very detailed living room. Each object in this scene is labelled directly in 3D.
Generating Unlimited Number of Synthetic Scenes
Generating indoor scenes as an energy minimisation problem
Bounding box constraint: each object must maintain a safe distance from the others.
Pairwise constraint: objects that co-occur should not be more than a fixed distance from each other, e.g. beds and nightstands.
Visibility constraint: objects with a visibility constraint must not have anything on the line of sight joining their centres, e.g. the sofa, table and TV must not have anything in between them.
Distance-to-wall constraint: some objects are more likely to occur next to walls, e.g. cupboards, sofas etc.
Use simulated annealing to solve this energy minimisation problem.
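As a concrete illustration, the constraint solve can be sketched as simulated annealing over object positions. Everything below (1D positions, the two energy terms, the gap thresholds, the cooling schedule) is an assumed toy setup for exposition, not the actual energy used in SceneNet.

```python
import math
import random

# Toy energy with two of the terms described above: a bounding-box term
# (keep objects at least min_gap apart) and a pairwise term (co-occurring
# objects stay within max_gap). Positions are 1D for brevity.
def energy(positions, pairs, min_gap=1.0, max_gap=4.0):
    e = 0.0
    xs = list(positions.values())
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            d = abs(xs[i] - xs[j])
            if d < min_gap:                      # safe-distance violation
                e += (min_gap - d) ** 2
    for a, b in pairs:
        d = abs(positions[a] - positions[b])
        if d > max_gap:                          # co-occurrence violation
            e += (d - max_gap) ** 2
    return e

def simulated_annealing(positions, pairs, steps=5000, t0=1.0, seed=0):
    rng = random.Random(seed)
    state, e = dict(positions), energy(positions, pairs)
    for k in range(steps):
        t = t0 * (1.0 - k / steps) + 1e-6        # linear cooling schedule
        cand = dict(state)
        cand[rng.choice(sorted(cand))] += rng.gauss(0.0, 0.5)
        e_cand = energy(cand, pairs)
        # Metropolis rule: always accept downhill, sometimes accept uphill.
        if e_cand < e or rng.random() < math.exp((e - e_cand) / t):
            state, e = cand, e_cand
    return state, e

start = {"bed": 0.0, "nightstand": 9.0, "cupboard": 3.0}
layout, e_final = simulated_annealing(start, pairs=[("bed", "nightstand")])
```

The annealer starts from a layout that violates the bed–nightstand pairwise constraint and cools towards a low-energy arrangement.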
Generating Unlimited Number of Synthetic Scenes
[Figure: heat-map of object-class co-occurrence. Classes: wall, floor, ceiling, bed, picture, pillow, night stand, lamp, blinds, chair, box, books, clothes, dresser, door, paper, window, curtain, desk, bag, bookshelf, cabinet, table, shelves, television, sofa, floor mat, mirror, towel, person, counter, refridgerator, shower curtain, whiteboard, toilet, sink, bathtub, otherprop, otherfurniture, otherstructure.]
Co-occurrence statistics of objects from NYUv2 bedrooms.
Interpretations
Heat-map of object co-occurrence.
The NYUv2 training set is used to obtain these statistics.
Shows that bed, picture, pillow and nightstand co-occur together.
We sample new scenes from these object co-occurrence statistics.
Generating Unlimited Number of Synthetic Scenes
Random placement | With pairwise constraint only | All constraints
Constraints
Objects that co-occur in the real world should be placed together, e.g. beds and nightstands.
Visibility constraint: nothing in the line of sight between the sofa/table and the TV.
Generating Unlimited Number of Synthetic Scenes
Very simple scenes with curtains.
Generating Unlimited Number of Synthetic Scenes
Common room obtained by hierarchically optimising the chairs-and-table combination.
Generating Unlimited Number of Synthetic Scenes
Living Room. Note that the inset objects are from Stanford Scenes and have been grouped together already.
Generating Unlimited Number of Synthetic Scenes
Living Room. Grouped objects from Stanford Scenes.
Generating Unlimited Number of Synthetic Scenes
Data Collection
Collecting trajectories with a joystick (though we have randomised them now).
Using OpenGL glReadPixels to obtain the depth as well as ground-truth labels.
Generating Unlimited Number of Synthetic Scenes
Depth | Height | Angle with gravity
Assumptions
Here we study segmentation from depth data only, in the form of DHA images, i.e. depth, height from the ground plane, and angle with the gravity vector.
This allows us to study the effects of geometry in isolation.
Texturing takes time and needs careful photorealistic synthesis, but we have now added a custom raytracer to study RGB-D based segmentation.
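A minimal sketch of turning a depth map into the three DHA channels. The intrinsics, the choice of gravity along the camera's -y axis, and taking the ground plane at the lowest back-projected point are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

# Depth -> (Depth, Height, Angle-with-gravity) channels.
def dha_from_depth(depth, fx, fy, cx, cy, gravity=np.array([0.0, -1.0, 0.0])):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project each pixel into the camera frame.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1)

    # Crude surface normals from finite differences of the point map.
    du = np.gradient(pts, axis=1)
    dv = np.gradient(pts, axis=0)
    n = np.cross(du, dv)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-9

    # Angle between each normal and the gravity vector.
    angle = np.arccos(np.clip(n @ gravity, -1.0, 1.0))
    # Height: coordinate along "up", shifted so the lowest point is 0.
    up_coord = pts @ -gravity
    height = up_coord - up_coord.min()
    return depth, height, angle

d, hgt, ang = dha_from_depth(np.full((4, 4), 2.0), fx=500, fy=500, cx=2, cy=2)
```

For a fronto-parallel plane at constant depth, every normal points along the optical axis, so the angle with gravity comes out as 90 degrees everywhere.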
Network Architecture
SegNet [Badrinarayanan, Handa, and Cipolla, arXiv 2015]
Saves the pooling indices explicitly in the encoder and passes them on to the decoder for upsampling.
Later, similar ideas emerged in DeconvNet by Hyeonwoo Noh et al., ICCV 2015, and in Zhao et al., Stacked What-Where Auto-encoders, arXiv 2015.
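The pooling-index trick can be sketched as follows: a toy numpy version of 2x2 max-pooling that records the argmax locations, and an unpooling step that scatters values back to those locations (everything else stays zero). SegNet does this inside the network with learned layers; this is only the index bookkeeping.

```python
import numpy as np

def max_pool_2x2_with_indices(x):
    """2x2 max-pooling that also returns the flat argmax index per window."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2))
    indices = np.zeros((h // 2, w // 2), dtype=int)   # flat index into x
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            window = x[i:i + 2, j:j + 2]
            k = int(np.argmax(window))
            pooled[i // 2, j // 2] = window.flat[k]
            indices[i // 2, j // 2] = (i + k // 2) * w + (j + k % 2)
    return pooled, indices

def max_unpool_2x2(pooled, indices, out_shape):
    """Scatter pooled values back to their saved argmax positions."""
    out = np.zeros(out_shape)
    out.flat[indices.ravel()] = pooled.ravel()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 9., 2., 1.],
              [6., 5., 3., 4.]])
p, idx = max_pool_2x2_with_indices(x)
y = max_unpool_2x2(p, idx, x.shape)
```

Because the decoder reuses the encoder's indices, the upsampled map puts each activation back at the exact pixel it came from, which preserves boundary detail.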
Results on Semantic Segmentation on Real World Data
Adding noise to depth-map | Denoising depth-map | Angle-with-gravity maps
Adapting synthetic data to match real-world data
Synthetic depth-maps are injected with appropriate noise models.
We then denoise these depth-maps so that the input images look similar to the NYUv2 dataset.
Side-by-side comparison of a synthetic bedroom with one of the NYUv2 bedrooms.
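The simulate-then-denoise step can be sketched as below. The depth-dependent Gaussian noise model and the median-filter denoiser are illustrative stand-ins; the paper's actual sensor noise model is more involved.

```python
import numpy as np

def add_depth_noise(depth, rng, base_sigma=0.002):
    """Inject depth-dependent Gaussian noise (sigma grows with depth^2)."""
    sigma = base_sigma * depth ** 2
    return depth + rng.normal(0.0, 1.0, depth.shape) * sigma

def denoise(depth, k=1):
    """Median filter over a (2k+1)^2 window: a simple edge-keeping denoiser."""
    h, w = depth.shape
    out = np.empty_like(depth)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(depth[max(0, i - k):i + k + 1,
                                        max(0, j - k):j + k + 1])
    return out

rng = np.random.default_rng(0)
clean = np.full((32, 32), 2.5)     # flat synthetic depth-map, in metres
noisy = add_depth_noise(clean, rng)
restored = denoise(noisy)
```

The point of the round-trip is that the network never sees implausibly clean synthetic depth: it trains on depth-maps whose residual noise statistics resemble the real sensor after the same denoising.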
Results on Semantic Segmentation on Real World Data
NYUv2 test images
Results on Semantic Segmentation on Real World Data
Ground truth segmentation
Results on Semantic Segmentation on Real World Data
Training on NYUv2 training set only
Results on Semantic Segmentation on Real World Data
Training on SceneNet and fine-tuning on NYUv2
Results on Semantic Segmentation on Real World Data
Training on NYUv2 only
Training on SceneNet and fine-tuning on NYUv2
What did we learn? [ICRA 2016, CVPR 2016]
Computer graphics is a great source for collecting data.
We need to work on designing or learning new loss functions: a loss that takes local and global context together, to avoid objects being mislabelled as others.
Results on Semantic Segmentation on Real World Data
What did we learn? [ICRA 2016, CVPR 2016]
We perform better than the state of the art on functional categories of objects, suggesting shape is a strong cue for segmentation.
We fall behind on objects that explicitly need RGB data.
SceneNet and Physics
Adding Physics
Allows us to create scenes with arbitrary clutter.
Placing pens and other small objects on tables is relatively easy with physics.
We don't want chairs always in upright positions; physics allows us to explore the space of object poses relatively easily.
SceneNet and Raytracing
Low-res on laptop | High quality
Adding Raytracing
Sufficient photorealism is desirable.
Depth sensors have limited range.
RGB is universal and allows us to model various different cameras.
Need good approximations of shadows, lighting, and various other global lighting artefacts.
gvnn: Geometric Vision with Neural Networks
What is gvnn?
A new library inspired by Spatial Transformer Networks (STN).
Includes various transformations often used in geometric computer vision.
These transformations are implemented as new layers that allow backpropagation, as in STN.
Brings the domain knowledge of geometry into the neural network.
Different new layers in gvnn
What is in gvnn?
The original Spatial Transformer (NIPS 2015) had mainly 2D transformations.
Added SO3, SE3 and Sim3 layers for global transformations of the image.
Optical flow, over-parameterised optical flow, slanted-plane disparity.
Pin-hole camera projection layer.
Per-pixel SO3, SE3 and Sim3 for non-rigid registration.
Different robust M-estimators.
Very useful for 3D alignment, unsupervised warping with optic flow or disparity, and geometric invariance for place recognition.
Global Transformations in gvnn
SO3 Layer (SE3 and Sim3 follow easily from here)
\frac{\partial C}{\partial v} = \frac{\partial C}{\partial R(v)} \cdot \frac{\partial R(v)}{\partial v}   (1)

\frac{\partial R(v)}{\partial v_i} = \frac{v_i [v]_\times + [v \times (I - R(v)) e_i]_\times}{\|v\|^2} R(v)   (2)

Notes
C is the cost function being minimised.
v = (v_1, v_2, v_3) is the SO3 vector. Note that this derivative is not defined at v ≈ 0, and e_i is the i-th column of the identity matrix.
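A numeric sketch of the SO3 layer's forward map and the gradient of Eq. 2, with the Rodrigues formula for R(v) and a finite-difference check. This is a plain numpy illustration, not gvnn's Torch implementation.

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def rodrigues(v):
    """R(v) = exp([v]_x) via the Rodrigues formula (assumes v != 0)."""
    theta = np.linalg.norm(v)
    K = hat(v / theta)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def dR_dvi(v, i):
    """Analytic derivative of Eq. 2 (valid away from v = 0)."""
    R = rodrigues(v)
    e = np.eye(3)[:, i]
    return (v[i] * hat(v) + hat(np.cross(v, (np.eye(3) - R) @ e))) @ R / (v @ v)

v = np.array([0.3, -0.2, 0.5])
# Central finite difference in the first coordinate for comparison.
h = 1e-6
num = (rodrigues(v + h * np.eye(3)[0]) - rodrigues(v - h * np.eye(3)[0])) / (2 * h)
```

Checking the analytic Jacobian against finite differences like this is exactly the kind of sanity test one runs before trusting backpropagation through such a layer.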
Per-pixel 2D Transformations in gvnn
Optical flow and Over-parameterised Optical Flow
\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} x + t_x \\ y + t_y \end{pmatrix}   (3)

\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} a_0 & a_1 & a_2 \\ a_3 & a_4 & a_5 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}   (4)
Notes
Over-parameterised optical flow also needs extra regularisation.
2D and 6D transformations per pixel. Useful for warping images (and unsupervised learning), Patraucean et al., ICLRW 2016.
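The differentiable warp that these flow layers feed into can be sketched as bilinear sampling of an image at the flow-displaced coordinates of Eq. 3; this numpy version is an illustrative stand-in for the STN-style sampler.

```python
import numpy as np

def warp_bilinear(img, flow):
    """Sample img at (x + flow_x, y + flow_y) with bilinear interpolation."""
    h, w = img.shape
    gy, gx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(gx + flow[..., 0], 0, w - 1)   # x' = x + t_x
    y = np.clip(gy + flow[..., 1], 0, h - 1)   # y' = y + t_y
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    # Blend the four neighbouring pixels.
    return ((1 - wx) * (1 - wy) * img[y0, x0] + wx * (1 - wy) * img[y0, x1]
            + (1 - wx) * wy * img[y1, x0] + wx * wy * img[y1, x1])

img = np.arange(16, dtype=float).reshape(4, 4)
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0                 # constant flow: sample one pixel to the right
out = warp_bilinear(img, flow)
```

Because the interpolation weights are smooth in the flow, a network producing the flow field can be trained end-to-end through this warp with a photometric loss.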
Per-pixel 2D Transformations in gvnn
Disparity and Slanted-Plane Disparity

\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} x + d \\ y \end{pmatrix}   (5)

d = ax + by + c   (6)
Notes
Fitting a slanted plane at each pixel.
Again, very useful for warping images (and unsupervised learning), Garg et al., ECCV 2016.
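Eq. 6 simply says the per-pixel disparity is an affine function of pixel position; a minimal sketch (with illustrative plane parameters) of expanding (a, b, c) into a dense disparity map:

```python
import numpy as np

def plane_disparity(a, b, c, shape):
    """d(x, y) = a*x + b*y + c over an image grid (Eq. 6)."""
    h, w = shape
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return a * x + b * y + c

# A plane tilted only in x: disparity grows 0.1 px per column.
d = plane_disparity(0.1, 0.0, 2.0, (2, 4))
```

The resulting map d then drives the horizontal-only warp of Eq. 5, the same way the flow field drives the 2D warp.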
Camera Projection Layer in gvnn
Pin-Hole Camera Projection Layer

\pi \begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} f_x \frac{u}{w} + p_x \\ f_y \frac{v}{w} + p_y \end{pmatrix}   (7)

\frac{\partial C}{\partial p} = \frac{\partial C}{\partial \pi(p)} \cdot \frac{\partial \pi(p)}{\partial p}   (8)

\frac{\partial \pi(u, v, w)}{\partial (u, v, w)} = \begin{pmatrix} \frac{f_x}{w} & 0 & -\frac{f_x u}{w^2} \\ 0 & \frac{f_y}{w} & -\frac{f_y v}{w^2} \end{pmatrix}   (9)
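A numeric sketch of Eqs. 7 and 9, with the analytic Jacobian checked against central finite differences; the intrinsics values are arbitrary illustrative numbers.

```python
import numpy as np

def project(p, fx, fy, px, py):
    """Pin-hole projection of a 3D point p = (u, v, w), Eq. 7."""
    u, v, w = p
    return np.array([fx * u / w + px, fy * v / w + py])

def jacobian(p, fx, fy):
    """2x3 Jacobian of the projection w.r.t. (u, v, w), Eq. 9."""
    u, v, w = p
    return np.array([[fx / w, 0.0, -fx * u / w ** 2],
                     [0.0, fy / w, -fy * v / w ** 2]])

p = np.array([0.4, -0.3, 2.0])
J = jacobian(p, fx=500.0, fy=500.0)
# Central finite differences, column by column, for comparison.
J_num = np.column_stack([
    (project(p + 1e-6 * e, 500.0, 500.0, 320.0, 240.0)
     - project(p - 1e-6 * e, 500.0, 500.0, 320.0, 240.0)) / 2e-6
    for e in np.eye(3)])
```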
Non-rigid per-pixel transformation
Non-rigid registration for point clouds: 6DoF and 10DoF per-pixel transformations.

\begin{pmatrix} x'_i \\ y'_i \\ z'_i \end{pmatrix} = T_i \begin{pmatrix} x_i \\ y_i \\ z_i \\ 1 \end{pmatrix}   (10)

x'_i = s_i ( R_i (x_i - p_i) + p_i ) + t_i   (11)
Also useful for volumetric spatial transformers when using a voxel-grid representation.
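A minimal sketch of applying the similarity transform of Eq. 11. For brevity the same (s, R, p, t) is applied to every point here, whereas the layer above learns a separate transform per pixel; the 90-degree rotation and the other parameter values are illustrative.

```python
import numpy as np

def sim3_about_pivot(x, s, R, p, t):
    """x' = s * (R (x - p) + p) + t, applied row-wise to points x (Eq. 11)."""
    return s * ((x - p) @ R.T + p) + t

theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])          # 90-degree rotation about z
pts = np.array([[1.0, 0.0, 0.0],
                [0.0, 0.0, 0.0]])
out = sim3_about_pivot(pts, s=2.0, R=R, p=np.zeros(3), t=np.array([0.0, 0.0, 1.0]))
```

The pivot p lets the rotation and scale act about a chosen point rather than the origin, which matters when each pixel carries its own local transform.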
M-estimators
Notes
Different M-estimators for robust outlier rejection.
Very useful when doing regression.
Implemented as layers in gvnn.
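The slide does not name which M-estimators gvnn implements; the Huber function below is one standard example of such a robust penalty, shown here as a plain numpy sketch.

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber penalty: quadratic near zero, linear in the tails.

    Large residuals (outliers) grow only linearly, so they pull on the
    regression far less than under a squared loss.
    """
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

residuals = np.array([0.1, -0.5, 4.0])
losses = huber(residuals)
```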
Sanity Checks on End-to-End Visual Odometry (SO3)
Notes
General dense image registration with a global transformation.
Can use optical flow/disparity for dense per-pixel image registration, with either a CNN or an RNN.
Initial experiments are with supervised learning.
Sanity Checks on End-to-End Visual Odometry (SO3)
Aim: to train on large-scale data from SceneNet and on RGB videos for unsupervised learning.
Sanity Checks on End-to-End Visual Odometry (SO3)
\delta_{update} = f_{siamese}(I_t^k, I_{t+1})   (12)
\delta^k = \delta^{k-1} + \delta_{update}   (13)
I_t^k = f_{warping}(I_t, \delta^k)   (14)
I_t^0 = I_t   (15)
\delta^0 = 0   (16)
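The update loop of Eqs. 12–16 can be sketched structurally as below. `predict_update` stands in for the siamese network f_siamese and `warp` for f_warping; both are illustrative placeholders, shown here for a 1D translation so the loop provably converges.

```python
import numpy as np

def warp(signal, delta):
    """f_warping stand-in: shift a 1D signal by an integer amount."""
    return np.roll(signal, int(round(delta)))

def predict_update(warped, target):
    """f_siamese stand-in: exhaustively pick the shift that aligns the pair."""
    shifts = list(range(-3, 4))
    errs = [np.sum((np.roll(warped, s) - target) ** 2) for s in shifts]
    return shifts[int(np.argmin(errs))]

I_t = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
I_t1 = np.roll(I_t, 2)                      # ground-truth motion: shift by +2
delta = 0.0                                 # Eq. 16: delta^0 = 0
warped = I_t                                # Eq. 15: I_t^0 = I_t
for _ in range(3):
    delta += predict_update(warped, I_t1)   # Eqs. 12-13: accumulate updates
    warped = warp(I_t, delta)               # Eq. 14: re-warp the source frame
```

The key property is that the network only ever predicts a residual update from the current warped frame, so repeated iterations refine the estimate rather than re-solving from scratch.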
Sanity Checks on End-to-End Visual Odometry (SE3)
Notes
Explicitly needs the depth to do the warping.
This can come either from a sensor or from an extra CNN/RNN module that learns depth.
Sanity Checks on End-to-End Visual Odometry (SE3)
(a) Prediction (b) Ground Truth (c) Residual (difference)
Where can deep learning help geometry?
Deep learning with geometry: a sample example
CNNs provide relatively stable features for images of the same scene taken across different lighting conditions.
Such images cannot be aligned with geometry-based per-pixel dense image alignment methods.
We can put this together with change-detection segmentation to also reason about dynamic scenes.
Easy to collect data with SceneNet.
Future Ideas
Physical scene understanding with SceneNet, to reason about how the dynamics of a scene will evolve over time.
Understanding dynamic scenes: change detection for persistent 4D mapping.
Imagine two images of a scene taken at different times of the day from relatively similar positions.
The STN-SE3 layer could be used to align the images.
The aligned images can be fed into a segmentation network that gives a per-pixel change mask.
Joint end-to-end training.
Attend-Infer-Repeat style 3D scene understanding, in the form of instances of objects and their 6DoF poses, even for cluttered scenes, with recurrent neural networks.
References
References
SceneNet: an Annotated Model Generator for Indoor Scene Understanding. Ankur Handa, Viorica Patraucean, Simon Stent, Roberto Cipolla. ICRA 2016.
Understanding Real World Indoor Scenes with Synthetic Data. Ankur Handa, Viorica Patraucean, Vijay Badrinarayanan, Simon Stent, Roberto Cipolla. CVPR 2016.
gvnn: Neural Network Library for Geometric Vision. Submitted to ECCV Workshop.
Acknowledgements
Thank you
Vijay Badrinarayanan, Viorica Patraucean, Simon Stent, and Roberto Cipolla, University of Cambridge
Zoubin Ghahramani, University of Cambridge
John McCormac and Andrew Davison, Imperial College London
Daniel Cremers, TUM Munich
Thank you for listening.