
The Informed Sampler: A Discriminative Approach to Bayesian Inference in Generative Computer Vision Models

Varun Jampani varun.jampani@tuebingen.mpg.de
Max Planck Institute for Intelligent Systems, Spemannstraße 41, 72076, Tübingen, Germany

Sebastian Nowozin sebastian.nowozin@microsoft.com
Microsoft Research Cambridge, 21 Station Road, Cambridge, CB1 2FB, United Kingdom

Matthew Loper mloper@tuebingen.mpg.de
Max Planck Institute for Intelligent Systems, Spemannstraße 41, 72076, Tübingen, Germany

Peter V. Gehler pgehler@tuebingen.mpg.de
Max Planck Institute for Intelligent Systems, Spemannstraße 41, 72076, Tübingen, Germany

Abstract

Computer vision is hard because of a large variability in lighting, shape, and texture; in addition the image signal is non-additive due to occlusion. Generative models promised to account for this variability by accurately modelling the image formation process as a function of latent variables with prior beliefs. Bayesian posterior inference could then, in principle, explain the observation. While intuitively appealing, generative models for computer vision have largely failed to deliver on that promise due to the difficulty of posterior inference. As a result the community has favored efficient discriminative approaches. We still believe in the usefulness of generative models in computer vision, but argue that we need to leverage existing discriminative or even heuristic computer vision methods. We implement this idea in a principled way in our informed sampler and in careful experiments demonstrate it on challenging models which contain renderer programs as their components. The informed sampler, using simple discriminative proposals based on existing computer vision technology, achieves dramatic improvements in inference. Our approach enables a new richness in generative models that was out of reach with existing inference technology.

Copyright 2014 by the author(s).

1. Introduction

A conceptually elegant view on computer vision is to consider a generative model of the physical image formation process. The observed image becomes a function of unobserved variables of interest (for example presence and positions of objects) and nuisance variables (for example light sources, shadows). When building such a generative model, we can think of a scene description θ that produces an image I = G(θ) using a deterministic rendering engine G, or more generally, results in a distribution over images, p(I|θ). Given an image observation I and a prior over scenes p(θ) we can then perform Bayesian inference to obtain updated beliefs p(θ|I). This view has been advocated since the late 1970s (Horn, 1977; Grenander, 1976; Mumford & Desolneux, 2010; Mansinghka et al., 2013; Yuille & Kersten, 2006).

Now, 30 years later, we would argue that the generative approach has largely failed to deliver on its promise. The few successes of the idea have been in limited settings. In the successful examples, either the generative model was restricted to few high-level latent variables, e.g. (Oliver et al., 2000), or restricted to a set of image transformations in a fixed reference frame, e.g. (Black et al., 2000), or it modelled only a limited aspect such as object shape masks (Eslami et al., 2012), or, in the worst case, the generative model was merely used to generate training data for a discriminative model (Shotton et al., 2011). With all its intuitive appeal, its beauty and simplicity, it is fair to say that the track record of generative models in computer vision is poor. As a result, the field of computer vision is now dominated by efficient but data-hungry discriminative models, the use of empirical risk minimization for learning, and energy minimization on heuristic objective functions for inference.


Why did generative models not succeed? The key problem in the generative world view is the difficulty of posterior inference at test time. This difficulty stems from a number of reasons: first, the parameter θ is typically high-dimensional and so is the posterior. Second, given θ, the image formation process realizes complex and dynamic dependency structures, for example when objects occlude each other or self-occlude. These intrinsic ambiguities result in multi-modal posterior distributions. Third, while most renderers are real-time, each simulation of the forward process is expensive and prevents exhaustive enumeration.

We believe in the usefulness of generative models for computer vision tasks, but argue that in order to overcome the substantial inference challenges we have to leverage existing computer vision features and discriminative models to aid inference in the generative model. We propose the informed sampler, a Markov chain Monte Carlo (MCMC) method with discriminative proposal distributions. During sampling, the informed sampler leverages computer vision features and algorithms to make informed proposals for the state of latent variables, and these proposals are accepted or rejected based on the generative model.

The informed sampler is simple and easy to implement, but it enables inference in generative models that were out of reach for current uninformed samplers. We demonstrate this claim on challenging models that incorporate rendering engines, object occlusion, ill-posedness, and multi-modality. In our experiments we use existing computer vision technology: our informed sampler uses standard histogram-of-gradients features (HoG) (Dalal & Triggs, 2005) and the OpenCV library (Bradski & Kaehler, 2008) to produce informed proposals. Likewise, one of our models is an existing computer vision model, the BlendSCAPE model, a parametric model of human bodies (Hirshberg et al., 2012). The key contribution of our work is to enable inference in rich generative models through existing computer vision technology.

2. Related Work

This work stands at the intersection of computer vision, computer graphics, and machine learning, and builds on previous approaches that we discuss below.

There is a vast literature on approaches that solve computer vision applications by means of generative models. We mention some works that also use an accurate graphics process as the generative model. This includes applications such as indoor scene understanding (Del Pero et al., 2012), human pose estimation (Lee & Cohen, 2004), hand pose estimation (de La Gorce et al., 2008) and many more. Most of these works are however interested in inferring MAP solutions, rather than the full posterior distribution.

Our method is similar in spirit to Data-Driven Markov Chain Monte Carlo (DDMCMC) methods that use a bottom-up approach to help convergence of MCMC sampling. DDMCMC methods have been used in image segmentation (Tu & Zhu, 2002), object recognition (Zhu et al., 2000), and human pose estimation (Lee & Cohen, 2004). The idea of making Markov samplers data dependent is rather general, but in the works mentioned above it led to highly problem-specific implementations, mostly using approximate likelihood functions. It is due to this specialization on a problem domain that the proposed samplers are not easily transferable to new problems. This is what we focus on in this work: problems with an accurate forward process, and a simple yet efficient inference technique that can easily be adapted to new tasks.

The formulation of inverting graphics for scene understanding also has roots in the computer graphics community under the term "inverse rendering". The goal of inverse rendering, however, is to derive a direct mathematical model of the forward light transport process and then to analytically invert it. The work of Ramamoorthi & Hanrahan (2001) falls into this category. The authors formulate the light reflection problem as a convolution, in order to then understand the inverse light transport problem as a deconvolution. While this is a very elegant way to pose the problem, it does require a specification of the inverse process, a requirement generative modeling approaches try to circumvent.

Our approach also bears similarities to probabilistic programming approaches. In the recent work of Mansinghka et al. (2013), the authors combine graphics modules in a probabilistic programming language to formulate an approximate Bayesian computation. Inference is then implemented using Metropolis-Hastings (MH) sampling. This approach is appealing in its generality and elegance; however, we show that for our graphics problems a plain MH sampling approach is not sufficient, whereas convergence is achieved when using an informed sampler. Another related work is that of Stuhlmuller et al. (2013), which shows how knowledge about the forward process can be learned as "stochastic inverses" and applied to MCMC sampling in Bayesian networks. In the present work, we devise an MCMC sampler that we show works both on a multi-modal problem and on inverting an existing piece of image rendering code. In summary, our method can be understood in a similar context as the above-mentioned papers, including (Mansinghka et al., 2013).

3. The Informed Sampler

In general, inference about the posterior distribution is challenging because for a complex model p(I|θ) no closed-form simplifications can be made. This is especially true in the case that we consider, where p(I|θ) corresponds to a graphics engine rendering images. Despite this apparent complexity we observe the following: for many computer vision applications there exist well-performing discriminative approaches that, given the image, predict some parameters θ or distributions thereof. These do not correspond to the posterior distribution that we are interested in, but intuitively the availability of discriminative inference methods should make the task of inferring p(θ|I) easier. Furthermore, a physically accurate model can be exploited to generate as many samples prior to inference as we can afford computationally. Concretely, in our case the discriminative method will take the form of a global proposal density TG(·|I) to be used in a valid MCMC inference method. In the remainder of the section we first review Metropolis-Hastings Markov chain Monte Carlo (MCMC) and then discuss our proposed informed samplers.

3.1. Metropolis-Hastings MCMC

The goal of any sampler is to realize independent and identically distributed samples from a given probability distribution. MCMC sampling, due to Metropolis et al. (1953), is a particular instance that generates a sequence of random variables by simulating a Markov chain. Sampling from a target distribution π(·) consists of repeating the following two steps (Liu, 2001):

1. Propose a transition using a proposal distribution T and the current state θt:

θ′ ∼ T(·|θt)

2. Accept or reject the transition based on the Metropolis-Hastings (MH) acceptance rule:

θt+1 = θ′, if rand(0, 1) < min(1, [π(θ′) T(θ′ → θt)] / [π(θt) T(θt → θ′)]); θt, otherwise.
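For concreteness, a minimal sketch of this two-step loop in Python; the log target density, proposal sampler, and proposal density passed in here are placeholders, not part of the paper:

import numpy as np

def metropolis_hastings(log_pi, propose, log_T, theta0, n_steps, seed=0):
    # log_pi(theta): log target density log pi(theta)
    # propose(theta, rng): draw theta' ~ T(.|theta)
    # log_T(a, b): log T(a -> b), the density of proposing b from a
    rng = np.random.default_rng(seed)
    theta = theta0
    samples = []
    for _ in range(n_steps):
        theta_new = propose(theta, rng)
        # log of pi(theta') T(theta' -> theta) / (pi(theta) T(theta -> theta'))
        log_ratio = (log_pi(theta_new) + log_T(theta_new, theta)
                     - log_pi(theta) - log_T(theta, theta_new))
        if np.log(rng.uniform()) < min(0.0, log_ratio):
            theta = theta_new  # accept the proposed transition
        samples.append(theta)
    return samples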

Different MCMC techniques mainly differ in the implementation of the proposal distribution T.

3.2. Informed Proposal Distribution

We use a common mixture kernel for Metropolis-Hastings sampling:

Tα(·|I, θt) = α TL(·|θt) + (1 − α) TG(·|I). (1)

Here TL is an ordinary local proposal distribution, for example a multivariate normal distribution centered on the current sample θt, and TG is a global proposal distribution independent of the current state. We inject knowledge by conditioning the global proposal distribution TG on the image observation. We learn the informed proposal TG(·|I) discriminatively in an offline training stage, using a non-parametric density estimator described below.
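A sketch of how Eq. (1) translates into code, assuming 0 < α < 1; local_propose, local_logpdf, and global_logpdf are hypothetical callables standing in for TL and TG:

import numpy as np

def mixture_propose(theta_t, alpha, local_propose, global_propose, rng):
    # With probability alpha take a local step from T_L(.|theta_t);
    # otherwise draw an image-conditioned global sample from T_G(.|I).
    if rng.uniform() < alpha:
        return local_propose(theta_t, rng)
    return global_propose(rng)

def mixture_log_density(theta_new, theta_t, alpha, local_logpdf, global_logpdf):
    # log T_alpha(theta_new | I, theta_t), needed in the MH acceptance ratio
    return np.logaddexp(np.log(alpha) + local_logpdf(theta_new, theta_t),
                        np.log(1.0 - alpha) + global_logpdf(theta_new))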

Algorithm 1 Learning a global proposal TG(θ|I)

1. Simulate {(θ(i), I(i))}i=1,...,n from p(I|θ) p(θ)
2. Compute a feature representation v(I(i))
3. Perform k-means clustering of {v(I(i))}i
4. For each cluster Cj ⊂ {1, . . . , n}, fit a kernel density estimate KDE(Cj) to the vectors θ{Cj}
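A sketch of Alg. 1 using scikit-learn primitives; here simulate and feature are hypothetical stand-ins for the forward model p(I|θ)p(θ) and the feature map v(·), and the bandwidth value is illustrative:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity

def learn_global_proposal(simulate, feature, n, k, bandwidth=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Simulate (theta_i, I_i) pairs from p(I|theta) p(theta)
    pairs = [simulate(rng) for _ in range(n)]
    thetas = np.stack([t for t, _ in pairs])
    # 2. Compute the feature representation v(I_i)
    V = np.stack([feature(img) for _, img in pairs])
    # 3. k-means clustering of the feature vectors
    km = KMeans(n_clusters=k, random_state=seed).fit(V)
    # 4. Fit one KDE over theta per cluster
    kdes = [KernelDensity(bandwidth=bandwidth).fit(thetas[km.labels_ == j])
            for j in range(k)]
    return km, kdes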

Algorithm 2 INF-MH

Input: observed image I
TL ← local proposal distribution (Gaussian)
c ← cluster for v(I)
TG ← KDE(c) (as obtained by Alg. 1)
T = α TL + (1 − α) TG
Run standard MH sampler with proposal T

The mixture parameter α ∈ [0, 1] controls the contribution of each proposal; for α = 1 we recover MH. For α = 0 the proposal Tα would be identical to TG(·|I) and the resulting Metropolis sampler would be a valid metropolized independence sampler (Liu, 2001). With α = 0 we call this baseline method Informed Independent MH (INF-INDMH). For intermediate values, α ∈ (0, 1), we combine local with global moves in a valid Markov chain. We call this method Informed Metropolis-Hastings (INF-MH).

3.3. Discriminatively Learning TG

The key step in the construction of TG is to include discriminative information about the sample I. Ideally we would hope to have TG propose global moves which improve mixing and even allow mixing between multiple modes, whereas the local TL is responsible for exploring the density locally. To see that this is in principle possible, consider the case of a perfect global proposal, that is, TG(·|I) = pθ(·|I). In that case we would get independent samples with α = 0 because every proposal is accepted. In practice TG is only an approximation to pθ(·|I). If the approximation is good enough then the mixture of local and global proposals will have a high acceptance rate and explore the density rapidly.

In principle we can use any conditional density estimation technique for learning a proposal TG from samples. Typically high-dimensional density estimation is difficult, and even more so in the conditional case; however, in our case we do have the true joint distribution available to provide training samples. Therefore we use a simple but scalable non-parametric density estimation method based on clustering a feature representation of the observed image, v(I) ∈ Rd. For each cluster we then estimate an unconditional density over θ using kernel density estimation (KDE). We chose this simple setup since it can easily be reused in many different scenarios; in the experiments we solve diverse problems using the same method.

For the feature representation we leverage successful discriminative features and heuristics developed in the computer vision community. Different task-specific feature representations can be used in order to provide invariance to small changes in θ and to nuisance parameters.

We construct the KDE for each cluster using a small relative bandwidth in order to accurately represent the high probability regions in the posterior. This is similar in spirit to using only high probability regions as "darts" in the Darting Monte Carlo sampling technique of Sminchisescu & Welling (2011). We summarize the offline training in Alg. 1.

At test time, this method has the advantage that given an image I we only need to identify the corresponding cluster once, using v(I), in order to sample efficiently from the kernel density TG. We show the full procedure in Alg. 2.
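Continuing the sketch above, the per-image setup of Alg. 2 then amounts to a single cluster lookup followed by cheap KDE sampling (again using the hypothetical feature map and the km/kdes objects from the training sketch):

def make_global_proposal(image, feature, km, kdes):
    # Identify the cluster of v(I) once per test image ...
    c = int(km.predict(feature(image)[None, :])[0])
    kde = kdes[c]
    # ... then T_G(.|I) is simply sampling from that cluster's KDE.
    return lambda rng: kde.sample(random_state=int(rng.integers(2**31 - 1)))[0]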

4. Setup and Baseline Methods

In the remainder of the paper we demonstrate the proposed method in three different experimental setups. For all experiments, we use four parallel chains initialized at different random locations sampled from the prior. The reported numbers are median statistics over multiple test images except when noted otherwise.

4.1. Baseline Methods

Metropolis Hastings (MH). Described above; corresponds to α = 1. We use a symmetric diagonal Gaussian proposal distribution centered at θt.

Metropolis Hastings within Gibbs (MHWG). We use a Metropolis-Hastings scheme within a Gibbs sampler, that is, we draw from one-dimensional conditional distributions for proposing moves and the Markov chain is updated along one dimension at a time. We further use a blocked variant of this MHWG sampler, where we update blocks of dimensions at a time, and denote it by BMHWG.

Parallel Tempering (PT). We use Parallel Tempering to address the problem of sampling from multi-modal distributions (Geyer, 1991). This technique is also known as "replica exchange MCMC sampling" (Hukushima & Nemoto, 1996). We run several parallel chains at different temperatures T, each sampling π(·)^{1/T}, and at each sampling step propose to exchange two randomly chosen chains. In our experiments we run three chains at temperature levels T ∈ {1, 3, 27}, which were found to work best out of all combinations in {1, 3, 9, 27} for all experiments individually. The highest temperature level corresponds to an almost flat distribution.
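A sketch of the exchange move under the tempered targets π(·)^{1/T}; log_pi is a hypothetical log target density:

import numpy as np

def pt_swap_step(states, temps, log_pi, rng):
    # Propose to exchange two randomly chosen chains i and j. The
    # replica-exchange acceptance probability for targets pi(.)^(1/T) is
    # min(1, exp((1/T_i - 1/T_j) * (log pi(x_j) - log pi(x_i)))).
    i, j = rng.choice(len(states), size=2, replace=False)
    log_a = (1.0 / temps[i] - 1.0 / temps[j]) * (log_pi(states[j]) - log_pi(states[i]))
    if np.log(rng.uniform()) < min(0.0, log_a):
        states[i], states[j] = states[j], states[i]  # swap the two chains
    return states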

Regeneration Sampler (REG-MH). We implemented a regenerative MCMC method (Mykland et al., 1995) that performs adaptation (Gilks et al., 1998) of the proposal distribution during sampling. We use the mixture kernel (Eq. 1) as proposal distribution and adapt only the global part TG(·|I). This is initialized as the prior over θ, and at times of regeneration we fit a KDE to the already drawn samples. For comparison we used the same mixture coefficient α as for INF-MH (more details of this technique are in the supplementary material).

4.2. MCMC Diagnostics

We use established methods for monitoring the convergence of our MCMC method (Kass et al., 1998). In particular, we report the following diagnostics.¹

¹ We also monitored auto-correlation and include those results in the supplementary material.

Acceptance Rate (AR). The ratio of accepted samples to the total Markov chain length. The higher the acceptance rate, the fewer samples we need to approximate the posterior. The acceptance rate indicates how well the proposal distribution approximates the true distribution locally.

Potential Scale Reduction Factor (PSRF). The PSRF diagnostic (Gelman & Rubin, 1992; Brooks & Gelman, 1998) is derived by comparing within-chain variances with between-chain variances of sample statistics. For this, it requires independent runs of multiple chains (4 in our case) in parallel. Because our sample θ is multi-dimensional, we estimate the PSRF for each parameter dimension separately and take the maximum PSRF value as the final PSRF value. A value close to one indicates that all chains characterize the same distribution. This does not imply convergence; the chains may all collectively miss a mode. However, a PSRF value much larger than one is a certain sign of lack of convergence of the chain.
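A sketch of this diagnostic (the basic Gelman-Rubin estimator, without the degrees-of-freedom correction of the original papers):

import numpy as np

def psrf(chains):
    # chains: array of shape (m, n, d) -- m parallel chains,
    # n samples each, d parameter dimensions.
    m, n, d = chains.shape
    chain_means = chains.mean(axis=1)            # shape (m, d)
    B = n * chain_means.var(axis=0, ddof=1)      # between-chain variance
    W = chains.var(axis=1, ddof=1).mean(axis=0)  # within-chain variance
    var_hat = (n - 1) / n * W + B / n            # pooled variance estimate
    return np.sqrt(var_hat / W).max()            # max over the d dimensions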

Root Mean Square Error (RMSE). During our experiments we have access to the input parameters θ∗ that generated the image. To assess whether the posterior distribution covers the "correct" value we report the RMSE between the posterior expectation E_{p(·|I)}[G(·)] and the true value G(θ∗).

4.3. Parameter Selection

For each sampler we individually selected hyper-parameters that gave the best PSRF value after 10k iterations. In case the PSRF does not differ for multiple values, we chose the one with the highest acceptance rate. We include a detailed analysis of the baseline samplers and parameter selection in the supplementary material.

5. Experiment: Estimating Camera Extrinsics

We implement the following simple graphics scenario to create a challenging multi-modal problem. We render a cubical room of edge length 2, with a point light source in the center of the room (0, 0, 0), from the viewpoint of a camera somewhere inside the room. The camera parameters are described by its (x, y, z)-position and its orientation, specified by yaw, pitch, and roll angles. The inference process consists of estimating the posterior over these 6D camera parameters θ. See Figure 1 for two example renderings. Posterior inference is a highly multi-modal problem because the room is cubical and thus symmetric: there are 24 different camera parameters that result in the same image. This is also shown in Figure 1, where we plot the position and orientation (but not camera roll) of all camera parameters that create the same image. Rendering a 200 × 200 image with 32-bit precision using a single core of an Intel Xeon 2.66GHz machine takes about 11ms on average.

Figure 1. Two rendered room images with possible camera positions and headings that produce the same image. Not shown are the orientations; in the left example all six headings can be rolled by 90, 180, and 270 degrees for the same image.

A small amount of isotropic Gaussian noise is added to the rendered image G(θ), using a standard deviation of σ = 0.02. The posterior distribution we try to infer then reads: p(θ|I) ∝ p(I|θ) p(θ) = N(I | G(θ), σ²) Uniform(θ).
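A sketch of evaluating this (unnormalized) log posterior; here render stands in for the graphics engine G, and lo/hi for the bounds of the uniform prior, both assumptions of this sketch:

import numpy as np

def log_posterior(theta, I_obs, render, lo, hi, sigma=0.02):
    # p(theta|I) is proportional to N(I | G(theta), sigma^2) * Uniform(theta)
    if np.any(theta < lo) or np.any(theta > hi):
        return -np.inf  # zero prior probability outside the box
    resid = I_obs - render(theta)
    return -0.5 * np.sum(resid ** 2) / sigma ** 2  # log-likelihood up to a constant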

To learn the informed part of the proposal distribution from data, we computed a Histogram of Oriented Gradients (HOG) descriptor (Dalal & Triggs, 2005) from the image, using 9 gradient orientations and cells of size 20 × 20, yielding a feature vector v(I) ∈ R900. We generated 300k training images using a uniform prior over the camera extrinsic parameters, and performed k-means clustering with 5k cluster centers based on the HOG feature vector. For each cluster cell, we then computed and stored a KDE for the 6-dimensional camera parameters, following the steps in Algorithm 1. As test data, we created 30 images using extrinsic parameters sampled uniformly at random over their range.
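The stated setup (9 orientations, 20 × 20 cells on a 200 × 200 image) can be reproduced with OpenCV's HOGDescriptor when each block holds exactly one cell, giving 10 · 10 · 9 = 900 dimensions; any HOG parameters beyond those stated in the paper are assumptions of this sketch:

import numpy as np
import cv2

def hog_feature(gray_200x200):
    # winSize, blockSize, blockStride, cellSize, nbins: one cell per block
    # yields a 10 x 10 grid of cells with 9 bins each -> v(I) in R^900.
    hog = cv2.HOGDescriptor((200, 200), (20, 20), (20, 20), (20, 20), 9)
    return hog.compute(gray_200x200.astype(np.uint8)).ravel()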

5.1. Results

The results are shown in Figure 2. We observe that both MH and PT yield low acceptance rates compared to the other methods. However, parallel tempering appears to overcome the multi-modality better and improves over MH in terms of convergence. The same holds for the regeneration technique: we observe many regenerations, good convergence, and a good AR. Both INF-INDMH and INF-MH converge quickly. We compute the RMSE of the posterior estimates after 1000 iterations and find that the RMSE values of MH, PT, and REG-MH are significantly higher (Wilcoxon's signed rank test: W = 0, P < 0.001) than those of the two informed samplers. Even after 10k iterations there is still a difference, and the median RMSE value of the three samplers is still higher.

We also tested the MHWG sampler and found that it did not converge even after 100k iterations, with a PSRF around 3. This is to be expected, since single-variable updates will not traverse the multi-modal posterior distribution fast enough due to the high correlation of the camera parameters.

As expected, some knowledge of the multi-modal structure of the posterior needs to be available for the sampler to perform well. The methods INF-INDMH and INF-MH have this information and perform better than the baseline methods and REG-MH.

6. Experiment: Occluding Tiles

In a second experiment, images are rendered depicting a fixed number (6) of quadratic tiles, each placed at an image location (x, y), at a certain depth z, and with an orientation θ. We blur the image and add a small amount of Gaussian random noise (σ = 0.02). An example is depicted in Figure 3(a); note that all the tiles are of the same size, but farther away tiles appear smaller. Rendering a 200 × 200 image takes about 25ms on average.

We chose this experiment because it has properties that are commonplace in computer vision problems and is similar to the "dead leaves model" of Lee et al. (2001). It is a scene composed of several objects that are independent except for occlusion, which complicates the problem. If occlusion did not exist, the task could be readily solved using a standard OpenCV (Bradski & Kaehler, 2008) rectangle finding algorithm (minAreaRect). The output of such an algorithm can be seen in Fig. 3(c); we use this as a discriminative source of information. This problem is higher dimensional than the previous one (24 dimensions, due to 6 tiles with 4 parameters each). Since the tiles are independent except for occlusion, we use a block-Gibbs-like scheme in our sampler where we propose changes to one tile at a time.
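A sketch of extracting this discriminative information with OpenCV (OpenCV 4 API assumed; the binarization step is also an assumption about how the tile mask is obtained):

import numpy as np
import cv2

def detect_tiles(gray):
    # Binarize, find outer contours, and fit a rotated rectangle to each;
    # cv2.minAreaRect returns ((cx, cy), (w, h), angle) per detected tile.
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.minAreaRect(c) for c in contours]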

The experimental protocol is the same as before: we render 500k images, apply the OpenCV algorithm to fit rectangles, and take the four parameters it finds per rectangle as features for clustering (10k clusters). Again KDE distributions are fit to each cluster; at test time, we assign the image to the corresponding cluster and use as global proposal TG one that selects a tile and then proposes an update of all its 4 parameters. We refer to this as INF-BMHWG. α = 0.8 is found to be optimal for INF-BMHWG sampling.

6.1. Results

Figure 2. Acceptance Rates (left), PSRFs (middle), and RMSEs (right) for different sampling methods in the camera extrinsics recovery experiment. Median results over 30 test examples.

Figure 3. (a) A sample rendered image, (b) ground truth squares, and most probable estimates from 5000 samples obtained by (c) the MHWG sampler (best baseline) and (d) the INF-BMHWG sampler. (f) Posterior expectation of the square boundaries obtained by INF-BMHWG sampling. (The first 2000 samples are discarded as burn-in.)

An example result is shown in Figure 3. We found that the MH and INF-MH samplers fail entirely on this problem (not included in the figure, see supp. mat.). Both use a proposal distribution over the entire state, and due to the high dimensionality there is almost no acceptance (< 1%); thus they do not reach convergence. The MHWG sampler, updating one dimension at a time, is found to be the best among the baseline samplers with an acceptance rate of around 42%, followed by a block sampler that samples each tile separately. The OpenCV algorithm produces a reasonable initial guess but fails in occlusion cases.

The block-wise informed sampler INF-BMHWG converges more quickly, with higher acceptance rates (≈ 53%) and lower reconstruction error. The median curves over 10 test examples are shown in Figure 4; INF-BMHWG produces by far the lowest reconstruction errors. In Fig. 3(f) the posterior distribution is visualized: fully visible tiles are well localized, while the position and orientation of occluded tiles are more uncertain. Although the model is relatively simple, all the baseline samplers perform poorly, and discriminative information is crucial to enable accurate inference. Here the discriminative information is provided by a readily available heuristic in the OpenCV library.

7. Experiment: Estimating Body Shape

The last experiment is motivated by a real-world problem: estimating the 3D body shape of a person from a single static depth image. With the recent availability of cheap active depth sensors, the use of RGBD data has become ubiquitous in computer vision.

Figure 4. PSRF (left) and RMSE (right) of different methods in the occluding tiles experiment. Median results over 10 test examples.

To represent a human body we use the BlendSCAPE model (Hirshberg et al., 2012), which updates the originally proposed SCAPE model (Anguelov et al., 2005) with better training and blend weights. This model produces a 3D mesh of a human body, as shown in Figure 5 (left), as a function of shape and pose parameters. The shape parameters allow us to represent bodies of many builds and sizes, and include a statistical characterization (being roughly Gaussian). These parameters control directions in deformation space, which were learned via PCA from a corpus of roughly 2000 3D mesh models registered to scans of human bodies. The pose parameters are joint angles which indirectly control local orientations of predefined parts of the model.

Our model uses 57 pose parameters and any number of shape parameters to produce a 3D mesh with 10,777 vertices. We use the first 7 SCAPE components to represent the shape of a person. The camera viewpoint, orientation, and pose of the person are held fixed. Thus the rendering process takes θ ∈ R7, generates a 3D mesh representation, and projects it through a virtual depth camera to create a depth image of the person. This can be done at various resolutions; we chose 430 × 260, with depth values represented as 32-bit numbers in [0, 4]. On average, a full render pass takes about 28ms. We add Gaussian noise with a standard deviation of 0.02 to the created depth image. See Fig. 5 (left) for an example.

Figure 5. A sample test result showing the 3D mesh reconstruction with the first 1000 samples obtained using our INF-MH sampling method. We visualize the angular error (in degrees) between the estimated and ground truth edges, projected onto the mesh.

We used very simple low-level features for the feature representation. In order to learn the global proposal distribution, we compute depth histogram features on a 15 × 10 grid over the image: for each cell we record the mean and variance of the depth values. Additionally we add the height and width of the body silhouette as features, resulting in a feature vector v(I) ∈ R302. As normalization, each feature dimension is divided by its maximum value in the training set. We used 400k training images sampled from the standard normal prior distribution, and 10k clusters to learn the KDE proposal distributions in each cluster cell. For this experiment we also experimented with another conditional density estimation approach, using a forest of random regression trees. We use the same features as for k-means clustering but grow the regression trees using a mean squared error criterion for scoring the split functions. A forest of 10 binary trees with a depth of 15 is grown, with the constraint of having a minimum of 40 training points per leaf node. Then for each of the leaf nodes, a KDE is trained as before. At test time the regression forest yields a mixture of KDEs as the global proposal distribution. We denote this method as INF-RFMH in the experiments.
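A sketch of this 302-dimensional feature map (15 · 10 cells × 2 statistics, plus silhouette height and width); the background test below is an assumption about how the silhouette is extracted:

import numpy as np

def depth_features(depth, grid=(15, 10)):
    H, W = depth.shape
    gy, gx = grid
    feats = []
    for i in range(gy):
        for j in range(gx):
            cell = depth[i * H // gy:(i + 1) * H // gy,
                         j * W // gx:(j + 1) * W // gx]
            feats += [cell.mean(), cell.var()]  # per-cell depth statistics
    # Silhouette height and width (assumes background pixels sit at max depth).
    ys, xs = np.nonzero(depth < depth.max())
    feats += [ys.ptp() + 1, xs.ptp() + 1]
    return np.asarray(feats)                    # v(I) in R^302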

Again, we chose parameters for all samplers individually, based on empirical mixing rates. For the informed samplers, we chose α = 0.8 and a local proposal standard deviation of 0.05. The full analysis for all samplers is included in the supplementary material.

7.1. Results

We tested the different approaches on 10 test images generated from parameters drawn from the standard normal prior distribution. Figure 6 summarizes the results of the sampling methods (AR plots in supp. mat.). We make the following observations. The baseline methods MH, MHWG and PT show poorer convergence results, and MH and PT also have to cope with lower acceptance rates. Just sampling from the distribution of the discriminative step (INF-INDMH) is not enough, as the low AR indicates that the global proposals do not represent the correct posterior distribution. However, combined with a local proposal in a mixture kernel, we achieve a higher AR, faster convergence, and a decrease in RMSE. The regression forest approach converges more slowly than INF-MH. In this example, the regeneration sampler REG-MH does not improve over the simpler baseline methods. We attribute this to rare regenerations, which may be improved with more specialized methods.

We believe that our simple choice of depth image representation can also be significantly improved upon. For example, features could be computed from identified body parts, something that the simple histogram features do not take into account. In the computer vision literature some discriminative approaches for pose estimation do exist, the most prominent being the influential work on pose recovery in parts for the Kinect Xbox system (Shotton et al., 2011). In future work we plan to use similar methods to deal with pose variation and complicated dependencies between parameters and observations.

Figure 6. PSRFs (left) and RMSEs (right) for different methods in the body shape experiment. Median results over 10 test examples.


Figure 7. Box plots of three body measurements for three test subjects, computed from the first 10k samples obtained by the INF-MH sampler. Dotted lines indicate the measurements corresponding to the ground truth SCAPE parameters.

7.2. 3D Mesh Reconstruction

In Figure 5 we show a sample 3D body mesh reconstruction result using the INF-MH sampler after only 1000 iterations. We visualize the difference between the posterior mean and the ground truth 3D mesh in terms of mesh edge directions. One can observe that most differences are in the belly region and the feet of the person. The retrieved posterior distribution allows us to assess the model uncertainty. To visualize the posterior variance we record the standard deviation of the edge directions for all mesh edges; this is backprojected to achieve the visualization in Figure 5 (right). We see that the posterior variance is higher in regions of higher error. In a real-world body scanning scenario this information will be beneficial, for example when scanning from multiple viewpoints or in an experimental design scenario, where it helps in selecting the next best pose and viewpoint to record.

7.3. Body Measurements

Predicting body measurements has many applications, including clothing sizing and ergonomic design. Given pixel observations, one may wish to infer a distribution over measurements (such as height and chest circumference). Fortunately, our original shape training corpus includes a host of 47 different per-subject measurements obtained by professional anthropometrists; this allows us to relate shape parameters to measurements. Among many possible forms of regression, regularized linear regression (Zou & Hastie, 2005) was found to best predict measurements from shape parameters. This linear relationship allows us to transform any posterior distribution over SCAPE parameters into a posterior over measurements, as shown in Figure 7. We report results for three randomly chosen subjects (S1, S2, and S3) on three out of the 47 measurements. The dashed lines correspond to the ground truth values. Our estimate not only faithfully recovers the true value but also yields a characterization of the full conditional posterior.
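A sketch of this step with scikit-learn's elastic net (Zou & Hastie, 2005); the training arrays and hyperparameters below are illustrative placeholders, not the paper's data:

import numpy as np
from sklearn.linear_model import ElasticNet

# Placeholder corpus: (n, 7) SCAPE parameters and (n, 47) measurements.
train_shapes = np.random.randn(100, 7)
train_measurements = np.random.randn(100, 47)

# Regularized linear map from shape parameters to measurements.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(train_shapes, train_measurements)

# Because the map is linear, pushing posterior samples of theta through it
# yields samples from the posterior over measurements directly.
posterior_theta = np.random.randn(1000, 7)  # placeholder posterior samples
measurement_samples = model.predict(posterior_theta)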

Figure 8. Mean 3D mesh and corresponding errors and uncertainties (std. deviations) in mesh edge directions, for the same test case as in Figure 5, computed from the first 10k samples of our INF-MH sampling method with (bottom row) an occlusion mask in the image evidence. (Blue indicates small values and red indicates high values.)

7.4. Incomplete Evidence

Another advantage of using a generative model is the ability to reason with missing observations. We perform a simple experiment by occluding a portion of the observed depth image. We use the same inference and learning code, with the same parameterization and features as in the non-occlusion case, but retrain the model to account for the changes in the forward process. The result of INF-MH, computed on the first 10k samples, is shown in Fig. 8. The 3D reconstruction is reasonable even under large occlusion; the error and the edge direction variance did increase, as expected.

8. Discussion and Conclusions

This work proposes a method to incorporate discriminative methods into Bayesian inference in a principled way. We augment a sampling technique with discriminative information to enable inference with accurate global generative models. Empirical results on three challenging and diverse computer vision experiments are discussed. This sampler is applicable to general scenarios, and in this work we leverage the accurate forward process for offline training, a setting frequently found in computer vision applications.

We show that even for very simple scenarios, most baseline samplers perform poorly or fail completely. Including a global image-conditioned proposal distribution that is informed through discriminative inference improves sampling performance. We deliberately use a simple learning technique (KDEs on k-means cluster cells) to enable easy reuse in other applications.

There are some avenues to advance our method. Identifying conditional dependence structure should improve results; for example, recently Stuhlmuller et al. (2013) used structure in Bayesian networks to identify such dependencies. One assumption in our work is that we use an accurate generative model. Relaxing this assumption to allow for more general scenarios, where the generative model is known only approximately, is important future work. In particular, for high-level computer vision problems such as scene or object understanding there are no accurate generative models available yet, but there is a clear trend towards physically more accurate 3D representations of the world. This more general setting is different from the one we consider in this paper, but we believe that some ideas can be carried over. For example, we could create the informed proposal distributions from manually annotated data, as readily available in many computer vision data sets.

We believe that generative models are potentially useful in many computer vision scenarios and that the interplay between computer graphics and computer vision is a prime candidate for studying probabilistic inference and probabilistic programming (Mansinghka et al., 2013). However, current inference techniques need to be improved on many fronts: efficiency, ease of use, and generality. Our method is a step in this direction: the informed sampler leverages the power of existing discriminative and heuristic techniques to enable a principled Bayesian treatment in rich generative models. Our emphasis is on generality; we aimed to create a method that can be easily reused in other scenarios with existing codebases. The presented results are a successful example of the inversion of an involved rendering pass. In the future we plan to investigate ways to combine existing computer vision techniques with principled generative models, with the aim of being general rather than problem specific.

References

Ahn, Sungjin, Chen, Yutian, and Welling, Max. Distributed and adaptive darting Monte Carlo through regenerations. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.

Anguelov, Dragomir, Srinivasan, Praveen, Koller, Daphne, Thrun, Sebastian, Rodgers, Jim, and Davis, James. SCAPE: Shape completion and animation of people. In ACM Transactions on Graphics (TOG), volume 24, pp. 408–416. ACM, 2005.

Athreya, Krishna B and Ney, P. A new approach to the limit theory of recurrent Markov chains. Transactions of the American Mathematical Society, 245:493–501, 1978.

Black, Michael J., Fleet, David J., and Yacoob, Yaser. Robustly estimating changes in image appearance. Computer Vision and Image Understanding, 78(1):8–31, 2000.

Bradski, Gary and Kaehler, Adrian. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly, 2008.

Brooks, S. P. and Gelman, A. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7:434–455, 1998.

Dalal, Navneet and Triggs, Bill. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005.

de La Gorce, Martin, Paragios, Nikos, and Fleet, David J. Model-based hand tracking with texture, shading and self-occlusions. In Computer Vision and Pattern Recognition, 2008.

Del Pero, Luca, Bowdish, Joshua, Fried, Daniel, Kermgard, Bonnie, Hartley, Emily, and Barnard, Kobus. Bayesian geometric modeling of indoor scenes. In Computer Vision and Pattern Recognition, 2012.

Eslami, S. M. Ali, Heess, Nicolas, and Winn, John M. The Shape Boltzmann Machine: A strong model of object shape. In CVPR, pp. 406–413. IEEE, 2012.

Gelman, A. and Rubin, D. B. Inference from iterative simulation using multiple sequences. Statistical Science, 7:457–511, 1992.

Geyer, Charles J. Markov chain Monte Carlo maximum likelihood. In Proceedings of the 23rd Symposium on the Interface, pp. 156–163, 1991.

Gilks, Walter R, Roberts, Gareth O, and Sahu, Sujit K. Adaptive Markov chain Monte Carlo through regeneration. Journal of the American Statistical Association, 93(443):1045–1054, 1998.

Grenander, Ulf. Pattern Synthesis - Lectures in Pattern Theory. Springer, New York, 1976.

Hirshberg, David A, Loper, Matthew, Rachlin, Eric, and Black, Michael J. Coregistration: Simultaneous alignment and modeling of articulated 3D shape. In European Conference on Computer Vision, 2012.

Horn, Berthold K. P. Understanding image intensities. Artificial Intelligence, 8:201–231, 1977.

Hukushima, Koji and Nemoto, Koji. Exchange Monte Carlo method and application to spin glass simulations. Journal of the Physical Society of Japan, 65(6):1604–1608, 1996.

Kass, R. E., Carlin, B. P., Gelman, A., and Neal, R. M. Markov chain Monte Carlo in practice: A roundtable discussion. The American Statistician, 52:93–100, 1998.

Lee, Ann B., Mumford, David, and Huang, Jinggang. Occlusion models for natural images: A statistical study of a scale-invariant dead leaves model. International Journal of Computer Vision, 41(1-2):35–59, 2001.

Lee, Mun Wai and Cohen, Isaac. Proposal maps driven MCMC for estimating human body pose in static images. In Computer Vision and Pattern Recognition, 2004.

Liu, Jun S. Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. Springer, New York, 2001.

Mansinghka, Vikash, Kulkarni, Tejas D, Perov, Yura N, and Tenenbaum, Josh. Approximate Bayesian image interpretation using generative probabilistic graphics programs. In Advances in Neural Information Processing Systems, pp. 1520–1528, 2013.

Metropolis, Nicholas, Rosenbluth, Arianna W, Rosenbluth, Marshall N, Teller, Augusta H, and Teller, Edward. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21:1087, 1953.

Mumford, David and Desolneux, Agnes. Pattern Theory: The Stochastic Analysis of Real-World Signals. 2010.

Mykland, Per, Tierney, Luke, and Yu, Bin. Regeneration in Markov chain samplers. Journal of the American Statistical Association, 90(429):233–241, 1995.

Nummelin, Esa. A splitting technique for Harris recurrent Markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 43(4):309–318, 1978.

Oliver, Nuria, Rosario, Barbara, and Pentland, Alex. A Bayesian computer vision system for modeling human interactions. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):831–843, 2000.

Ramamoorthi, Ravi and Hanrahan, Pat. A signal-processing framework for inverse rendering. In Computer Graphics and Interactive Techniques, pp. 117–128. ACM, 2001.

Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., and Blake, A. Real-time human pose recognition in parts from single depth images. In CVPR, 2011.

Sminchisescu, Cristian and Welling, Max. Generalized darting Monte Carlo. Pattern Recognition, 44(10):2738–2748, 2011.

Stuhlmuller, Andreas, Taylor, Jessica, and Goodman, Noah. Learning stochastic inverses. In Advances in Neural Information Processing Systems, pp. 3048–3056, 2013.

Tu, Zhuowen and Zhu, Song-Chun. Image segmentation by data-driven Markov chain Monte Carlo. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(5):657–673, 2002.

Yuille, Alan and Kersten, Daniel. Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7):301–308, 2006.

Zhu, Song-Chun, Zhang, Rong, and Tu, Zhuowen. Integrating bottom-up/top-down for object recognition by data-driven Markov chain Monte Carlo. In Computer Vision and Pattern Recognition, 2000.

Zou, Hui and Hastie, Trevor. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.

Supplementary Material for
The Informed Sampler: A Discriminative Approach to Bayesian Inference in Generative Computer Vision Models


A. Baseline Methods

In the main paper we briefly introduced the baseline methods that we used in the experiments. Besides the well-known Metropolis Hastings (MH), Metropolis Hastings Within Gibbs (MHWG) and Parallel Tempering (PT) methods, we also used a regeneration sampler (REG-MH), which we describe in more detail below.

A.1. Regeneration Sampler (REG-MH)

Adapting the proposal distribution using existing MCMC samples is not straightforward, as this would potentially violate the Markov property of the chain. One approach is to identify the times of regeneration, when the chain restarts, and then to adapt the proposal distribution using the samples already drawn. Several approaches to identify good regeneration times in a general Markov chain have been proposed (Athreya & Ney, 1978; Nummelin, 1978). We build on (Mykland et al., 1995), which proposed two splitting methods for finding the regeneration times. Here, we briefly describe the method that we implemented in this study.

Let the present state of the sampler be x and let the independent global proposal distribution be TG. When y ∼ TG is accepted according to the MH acceptance rule, the probability of a regeneration is given by:

r(x, y) = max{c/w(x), c/w(y)}, if w(x) > c and w(y) > c;
r(x, y) = max{w(x)/c, w(y)/c}, if w(x) < c and w(y) < c;
r(x, y) = 1, otherwise. (2)

Here w(x) = π(x)/TG(x) and c is an arbitrary constant, which can be set to any value that maximizes the regeneration probability. At every sampling step, if a sample from the independent proposal distribution is accepted, we compute the regeneration probability using Equation 2. If a regeneration occurs, the present sample is discarded and replaced with one from the independent proposal distribution TG. We use the same mixture proposal distribution as in our informed sampling approach, where we initialize the global proposal TG with the prior distribution and at times of regeneration fit a KDE to the existing samples. This becomes the new adapted distribution TG. Refer to (Mykland et al., 1995) for more details of this regeneration technique. In the work of (Ahn et al., 2013) this regeneration technique is used with success in a Darting Monte Carlo sampler.
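A sketch of Eq. (2), computed in log space for numerical stability; log_pi and log_TG are hypothetical log-density callables:

import numpy as np

def regeneration_prob(x, y, log_pi, log_TG, log_c):
    # w(.) = pi(.) / T_G(.), handled in log space
    lw_x = log_pi(x) - log_TG(x)
    lw_y = log_pi(y) - log_TG(y)
    if lw_x > log_c and lw_y > log_c:
        return np.exp(log_c - min(lw_x, lw_y))  # max{c/w(x), c/w(y)}
    if lw_x < log_c and lw_y < log_c:
        return np.exp(max(lw_x, lw_y) - log_c)  # max{w(x)/c, w(y)/c}
    return 1.0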

B. Results

On the next pages of this supplementary material, we give an in-depth performance analysis of the various samplers and the effect of their hyperparameters. We choose the hyperparameters with the lowest PSRF value after 10k iterations, for each sampler individually. If the PSRF values are not significantly different among multiple settings, we choose the one that has the highest acceptance rate.


C. Experiment: Estimating Camera Extrinsics

C.1. Parameter Selection

Metropolis Hastings (MH). Figure 10(a) shows the median acceptance rates and PSRF values corresponding to various proposal standard deviations of plain MH sampling. Mixing gets better, and the acceptance rate gets worse, as the standard deviation increases. A standard deviation of 0.3 is selected for this sampler.

Metropolis Hastings Within Gibbs (MHWG). As mentioned in the main paper, the MHWG sampler with one-dimensional updates did not converge for any value of the proposal standard deviation. This problem has highly correlated camera parameters and is multi-modal in nature, which this sampler struggles with.

Parallel Tempering (PT). For PT sampling, we took the best performing MH sampler and used different temperature chains to improve the mixing of the sampler. Figure 10(b) shows the results corresponding to different combinations of temperature levels. The sampler with temperature levels [1, 3, 27] performed best in terms of both mixing and acceptance rate.

Effect of Mixture Coefficient in Informed Sampling (INF-MH). Due to space constraints, we did not discuss the effect of the mixture coefficient α on informed sampling in the main paper. Figure 10(c) shows the effect of the mixture coefficient α on the informed sampling INF-MH. Since there is no significant difference in PSRF values for 0 ≤ α ≤ 0.7, we chose 0.7 due to its high acceptance rate.

C.2. Auto Correlation

In Figure 9 we plot the median auto-correlation of samples obtained by the different sampling techniques, separately for each of the six extrinsic camera parameters. The informed sampling approaches (INF-MH and INF-INDMH) appear to mix much faster than the other samplers.

Figure 9. Auto-correlation of samples obtained by different sampling techniques in the camera extrinsics experiment, for each of the six extrinsic camera parameters.


Figure 10. Results of the 'Estimating Camera Extrinsics' experiment. PSRFs and acceptance rates corresponding to (a) various standard deviations of MH, (b) various temperature level combinations of PT sampling, and (c) various mixture coefficients of INF-MH sampling.


D. Experiment: Occluding Tiles

D.1. Parameter Selection

Metropolis Hastings (MH). Figure 11(a) shows the results of MH sampling. The results show poor convergence for all proposal standard deviations and a rapid decrease of the AR with increasing standard deviation. This is due to the high-dimensional nature of the problem. We selected a standard deviation of 1.1.

Blocked Metropolis Hastings Within Gibbs (BMHWG). The results of BMHWG are shown in Figure 11(b). In this sampler we update only one block of tile variables (of dimension four) in each sampling step. The results show much better performance compared to plain MH. The optimal proposal standard deviation for this sampler is 0.7.

Metropolis Hastings Within Gibbs (MHWG). Figure 11(c) shows the results of MHWG sampling. This sampler is better than BMHWG and converges much more quickly. Here a standard deviation of 0.9 is found to be best.

Parallel Tempering (PT). Figure 11(d) shows the results of PT sampling with various temperature combinations. The results show no improvement in AR over plain MH sampling, and again the temperature levels [1, 3, 27] are found to be optimal.

Effect of Mixture Coefficient in Informed Sampling (INF-BMHWG). Figure 11(e) shows the effect of the mixture coefficient α on the blocked informed sampling INF-BMHWG. Since there is no significant difference in PSRF values for 0 ≤ α ≤ 0.8, we chose 0.8 due to its high acceptance rate.

D.2. Qualitative Results

In Figure 12 we show more qualitative results for this experiment. The informed sampling approach (INF-BMHWG) is better than the best baseline (MHWG). This is still a very challenging problem, since the posterior over the parameters of occluded tiles is flat over a large region. Some of the posterior variance of the occluded tiles is already captured by the informed sampler.


Figure 11. Results of the 'Occluding Tiles' experiment. PSRFs and acceptance rates corresponding to various standard deviations of (a) MH, (b) BMHWG, and (c) MHWG; (d) various temperature level combinations of PT sampling; and (e) various mixture coefficients of our informed INF-BMHWG sampling.


Figure 12. Qualitative results of the occluding tiles experiment. From left to right: (a) given image, (b) ground truth tiles, and most probable estimates from 5000 samples obtained by (c) the MHWG sampler (best baseline) and (d) our INF-BMHWG sampler. (f) Posterior expectation of the tile boundaries obtained by INF-BMHWG sampling. (The first 2000 samples are discarded as burn-in.)


E. Experiment: Estimating Body Shape

E.1. Parameter Selection

Metropolis Hastings (MH). Figure 13(a) shows the results of MH sampling with various proposal standard deviations. A value of 0.1 is found to be best.

Metropolis Hastings Within Gibbs (MHWG). For MHWG sampling we select a proposal standard deviation of 0.3. Results are shown in Figure 13(b).

Parallel Tempering (PT). As before (results in Figure 13(c)), the temperature levels were selected to be [1, 3, 27] due to the slightly higher AR.

Effect of Mixture Coefficient in Informed Sampling (INF-MH). Figure 13(d) shows the effect of α on PSRF and AR. Since there are no significant differences in PSRF values for 0 ≤ α ≤ 0.8, we choose 0.8.

E.2. Qualitative Results

Figure 14 shows some more results of 3D mesh reconstruction using posterior samples obtained by our informed sampler INF-MH.


Figure 13. Results of the 'Body Shape Estimation' experiment. PSRFs and acceptance rates corresponding to various standard deviations of (a) MH and (b) MHWG; (c) various temperature level combinations of PT sampling; and (d) various mixture coefficients of the informed INF-MH sampling.


Figure 14. Qualitative results for the body shape experiment. Shown are 3D mesh reconstruction results with the first 1000 samples obtained using the INF-MH informed sampling method. (Blue indicates small values and red indicates high values.)


F. Results Overview


(a) Results for: Estimating Camera Extrinsics

(b) Results for: Occluding Tiles

(c) Results for: Estimating Body Shape

Figure 15. Summary of the statistics for the three experiments. Shown are, for several baseline methods and the informed samplers, the acceptance rates (left), PSRFs (middle), and RMSE values (right). All results are median results over multiple test examples.