Particle Metropolis–Hastings using gradient and Hessian information



Second-order Particle MCMC for Bayesian Parameter Inference

Johan Dahlin, Fredrik Lindsten and Thomas B. Schön∗

December 16, 2013

Abstract

We propose an improved proposal distribution in the Particle Metropolis-Hastings (PMH) algorithm for Bayesian parameter inference in nonlinear state space models (SSMs). This proposal incorporates second-order information about the posterior distribution over the system parameters, which can be extracted from the particle filter used in the PMH algorithm. This makes the algorithm scale-invariant, simpler to calibrate and shortens the burn-in phase. We also suggest improvements that reduce the computational complexity of our earlier first-order method. The complexity of the previous method is quadratic in the number of particles, whereas the new second-order method is linear.

1 Introduction

We are interested in Bayesian parameter inference in nonlinear state space models (SSMs). An SSM with latent states $x_{1:T} \triangleq \{x_t\}_{t=1}^{T}$ and measurements $y_{1:T} \triangleq \{y_t\}_{t=1}^{T}$ is defined as

xt|xt−1 ∼ fθ(xt|xt−1), (1a)

yt|xt ∼ gθ(yt|xt), (1b)

where $f_\theta(\cdot)$ and $g_\theta(\cdot)$ denote known distributions parametrised by the unknown static parameter vector $\theta \in \Theta \subseteq \mathbb{R}^d$. For simplicity, we assume that the initial state $x_0$ is known. In Bayesian inference, we are interested in computing the parameter posterior,

$$ p(\theta \mid y_{1:T}) = \frac{p_\theta(y_{1:T})\, p(\theta)}{p(y_{1:T})}, \qquad (2) $$

∗This work was supported by: the project Calibrating Nonlinear Dynamical Models (Contract number: 621-2010-5876) funded by the Swedish Research Council and CADICS, a Linnaeus Center also funded by the Swedish Research Council. JD and FL are with the Division of Automatic Control, Linköping University, Linköping, Sweden. E-mail: {johan.dahlin, lindsten}@isy.liu.se. TS is with the Division of Systems and Control, Uppsala University, Uppsala, Sweden. E-mail: [email protected].


where $p(\theta)$ denotes the prior distribution of the parameter. Here, the likelihood function can be expressed as

$$ p_\theta(y_{1:T}) = p(y_{1:T} \mid \theta) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{1:t-1}). \qquad (3) $$

For linear Gaussian SSMs, the one-step predictive distribution $p_\theta(y_t \mid y_{1:t-1})$ can be computed using a Kalman filter. However, the distributions are intractable for nonlinear and/or non-Gaussian models. Hence, the parameter posterior is intractable for many interesting problems, but it can be estimated e.g. using Sequential Monte Carlo (SMC) [7], Markov chain Monte Carlo [25] or a combination of the two. The latter solution is referred to as Particle Markov Chain Monte Carlo (PMCMC) [1, 14] and enables routine Bayesian parameter inference in general SSMs (1).

Earlier work in the area of Bayesian parameter inference includes e.g. [2], [18], [21]. PMCMC has earlier been used for nonlinear inference in e.g. finance [23], social network analysis [8] and system identification [3]. In the latter, we propose a method using Particle Metropolis-Hastings (PMH) with a proposal based on first-order information about the posterior.

In this work, we improve the performance of the PMH-algorithm by also incorporating second-order information into the proposal. This draws upon results presented in [10] for the Metropolis-Hastings (MH) algorithm. By including the Hessian, the proposal is given the ability to automatically adjust the step length during the run. This has the benefit of shortening the burn-in period and simplifies the tedious tuning, as the proposal is scale-invariant. Note that this is similar to a Newton-based optimisation algorithm, which enjoys the same invariance.

Another improvement is the application of the fixed-lag particle smoother for estimating the first-order and second-order information about the posterior. This greatly decreases the computational cost of the algorithm compared with our earlier work, from a quadratic to a linear complexity in the number of particles. The proposed method is illustrated on two SSMs, which verifies the benefits of the second-order proposal.

2 Second-order proposals

As previously stated, direct computation of the parameter posterior distribution (2) is often intractable. Instead, we make use of the Metropolis-Hastings (MH) algorithm [16, 11, 25] to sample from the posterior by the use of a Markov chain with certain properties. The chain is constructed so that its stationary distribution is the posterior $p(\theta \mid y_{1:T})$, from which we would like to sample.

The (ideal) MH-algorithm is an iterative procedure where two steps are carried out during each iteration: (i) sample parameters from a proposal distribution, $\theta' \sim q(\theta' \mid \theta)$, where $\theta$ denotes the parameters from the previous state of the Markov chain, and (ii) accept or reject the new parameters with the acceptance

probability,

$$ \alpha(\theta', \theta) = 1 \wedge \frac{p(\theta')}{p(\theta)}\, \frac{p_{\theta'}(y_{1:T})}{p_\theta(y_{1:T})}\, \frac{q(\theta \mid \theta')}{q(\theta' \mid \theta)}, \qquad (4) $$

where we introduce $a \wedge b = \min\{a, b\}$.

Recall that the likelihood $p_\theta(y_{1:T})$ is intractable for the general SSM (1). In Section 4, we discuss how to solve this particular problem, while still making sure that the Markov chain converges to the parameter posterior. This is done by replacing the intractable likelihood with an unbiased estimate, resulting in an exact approximation of the MH-algorithm [1].

In this section, we design a proposal that uses second-order information about the posterior. After this, we discuss how to construct estimators for the likelihood and the second-order information using SMC methods.

2.1 Laplace approximation of the log-posterior distribution

A proposal distribution can be constructed by using a Laplace approximation [25] of the log-posterior distribution. Consider a second-order Taylor expansion of $\log p(\theta' \mid y_{1:T})$ around $\theta$,

$$ \log p(\theta' \mid y_{1:T}) \approx \log p(\theta \mid y_{1:T}) + (\theta' - \theta)^{\top} \nabla \log p(\theta \mid y_{1:T}) + \frac{1}{2} (\theta' - \theta)^{\top} \bigl[ \nabla^2 \log p(\theta \mid y_{1:T}) \bigr] (\theta' - \theta). $$

By taking the exponential of both sides and completing the square, we obtain

$$ p(\theta' \mid y_{1:T}) \approx \mathcal{N}\bigl(\theta';\, \theta + G_T(\theta),\, W_T(\theta)\bigr), \quad \text{with} $$
$$ W_T^{-1}(\theta) \triangleq I_T(\theta) - \nabla^2 \log \pi(\theta), \qquad G_T(\theta) \triangleq W_T(\theta) \bigl[ S_T(\theta) + \nabla \log \pi(\theta) \bigr], $$

which is discussed in e.g. [25]. Here, we introduced the notation $S_T(\theta) \triangleq \nabla \log p_\theta(y_{1:T})$ and $I_T(\theta) \triangleq -\nabla^2 \log p_\theta(y_{1:T})$ for the gradient and the negative Hessian of the log-likelihood, respectively.

In [25], the authors discard the second-order information $W_T(\theta)$ from the expression by replacing it with a constant diagonal $d \times d$ matrix. Here, we instead keep the second-order information and, guided by the Laplace approximation, suggest the use of the proposal,

$$ q\bigl(\theta' \mid \theta, S_T(\theta), I_T(\theta)\bigr) = \mathcal{N}\Bigl(\theta';\, \theta + \frac{\Gamma}{2}\, G_T(\theta),\, \Gamma\, W_T(\theta)\Bigr), \qquad (5) $$

where $\Gamma = \operatorname{diag}(\gamma)$ denotes a diagonal matrix with $\gamma$ being a scalar or a $d$-vector of step length(s). We use the former in the second-order proposal because of its scale-invariance property. In the zeroth-order and first-order proposals (introduced below) a vector is often needed, to allow different step lengths for each parameter.
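To make the proposal (5) concrete, the sketch below (our illustration, not the authors' code) draws a candidate $\theta'$ given estimates of the score $S_T(\theta)$ and the information $I_T(\theta)$. It assumes a flat prior, so the prior terms $\nabla \log \pi(\theta)$ and $\nabla^2 \log \pi(\theta)$ vanish and $W_T(\theta)$ reduces to $I_T(\theta)^{-1}$, and it uses a scalar step length as in the second-order proposal.

```python
import numpy as np

def sample_second_order_proposal(theta, score, info, gamma, rng):
    """Draw theta' from the proposal (5).

    theta : current parameter, shape (d,)
    score : estimate of S_T(theta), shape (d,)
    info  : estimate of I_T(theta), shape (d, d)
    gamma : scalar step length (the second-order proposal uses a scalar)

    Assumes a flat prior, so W_T(theta) = I_T(theta)^{-1} and
    G_T(theta) = W_T(theta) S_T(theta).
    """
    W = np.linalg.inv(info)            # W_T(theta)
    G = W @ score                      # G_T(theta)
    mean = theta + 0.5 * gamma * G     # theta + (Gamma/2) G_T(theta)
    cov = gamma * W                    # Gamma W_T(theta)
    return rng.multivariate_normal(mean, cov)

# Example with a hypothetical score/information estimate for a two-parameter model.
rng = np.random.default_rng(0)
theta = np.array([0.5, 1.0])
score = np.array([1.3, -0.4])
info = np.array([[40.0, 2.0], [2.0, 15.0]])
theta_prop = sample_second_order_proposal(theta, score, info, gamma=1.0, rng=rng)
```

Replacing `W` with the identity matrix and/or `G` with zero recovers the first-order and zeroth-order variants discussed next.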

2.2 Properties of the proposal distribution

We refer to the expression in (5) as the second-order proposal, which makes use of both the gradient and the Hessian in proposing new parameters. If the Hessian of the log-posterior is replaced with a $d \times d$ identity matrix, $W_T(\theta) \equiv I_d$, a first-order proposal is obtained. Lastly, if the gradient is removed as well, $G_T(\theta) \equiv 0$, a zeroth-order proposal is obtained. This proposal distribution is equivalent to a Gaussian random walk proposal, which is a common standard choice when using the MH-algorithm.

We note in passing that the second-order proposal has a statistical and geometrical interpretation. The gradient and the negative Hessian of the log-likelihood are often referred to as the score function and Fisher's information matrix, respectively. From this perspective, the proposal in (5) is shown in [10] to be a random walk on a Riemann manifold with constant curvature, using the information matrix as the metric.

The convergence of the first-order proposal is analysed in [26], and under certain assumptions it requires $\mathcal{O}(d^{1/3})$ steps to converge to the stationary distribution. This is compared with $\mathcal{O}(d)$ steps for the zeroth-order proposal. Therefore, the first-order proposal is more efficient as the number of parameters $d$ increases. To the authors' knowledge, no analysis has been published for the second-order proposal. However, numerical comparisons are presented in Section 5, indicating clear benefits when adding second-order information.

The MH-algorithm and the second-order proposal depend on the likelihood, the gradient and the Hessian, which for the general SSM (1) are intractable. Therefore, we now continue by discussing SMC methods, which can be used to solve this problem.

3 Sequential Monte Carlo

Sequential Monte Carlo (SMC) is a family of algorithms used to sample from a sequence of probability distributions. A typical application of SMC methods is to sample from the filtering and smoothing distributions in SSMs. In this setting, we refer to SMC methods as particle filters and particle smoothers, respectively. Here, we limit ourselves to the auxiliary particle filter (APF) [22] and the fixed-lag (FL) particle smoother [13]. For more information, see e.g. [7], [5].

3.1 Auxiliary particle filter

We use the APF to compute an estimate of the likelihood and the latent states of the SSM (1). An APF targeting the smoothing distribution $p_\theta(x_{1:t} \mid y_{1:t})$ generates a particle system $\{x_{1:t}^{(i)}, w_{t|t}^{(i)}\}_{i=1}^{N}$ using $N$ particles. This can be used to estimate the smoothing distribution,

$$ p_\theta(\mathrm{d}x_{1:t} \mid y_{1:t}) \triangleq \sum_{i=1}^{N} \frac{w_{t|t}^{(i)}}{\sum_{k=1}^{N} w_{t|t}^{(k)}}\, \delta_{x_{1:t}^{(i)}}(\mathrm{d}x_{1:t}), \qquad (6) $$

where $w_{t|t}^{(i)}$ and $x_{1:t}^{(i)}$ denote the unnormalised weight and the state trajectory of particle $i$ from time 1 to $t$, respectively. Here, $\delta_{z}(\mathrm{d}x_{1:t})$ denotes the Dirac measure at the point $z$. The particle system is generated sequentially by the APF in two steps: (i) the sampling/propagation step, and (ii) the weighting step.

In the first step, the particle system from the previous time step $t-1$ is resampled and propagated to generate an unweighted particle system at time $t$. This can be seen as sampling from a proposal kernel,

$$ \{a_t^{(i)}, x_t^{(i)}\} \sim \frac{w_{t-1|t-1}^{(a_t)}}{\sum_{k=1}^{N} w_{t-1|t-1}^{(k)}}\, R_{\theta,t}\bigl(x_t \mid x_{t-1}^{a_t}\bigr), \qquad (7) $$

where we append the sampled particle to the trajectory by $x_{1:t}^{(i)} = \{x_{1:t-1}^{a_t^{(i)}}, x_t^{(i)}\}$. Here, $a_t^{(i)}$ denotes the ancestor index, i.e. the index of the particle at time $t-1$ from which $x_t^{(i)}$ originates. Furthermore, $R_{\theta,t}(x_t \mid x_{t-1}^{a_t})$ denotes some propagation kernel from which we can sample a new particle at time $t$ given the ancestor particle at time $t-1$.

In the second step, the particle weights are computed as

$$ w_{t|t}^{(i)} = W_{\theta,t}\bigl(x_t^{(i)}, x_{t-1}^{a_t^{(i)}}\bigr) \triangleq \frac{g_\theta\bigl(y_t \mid x_t^{(i)}\bigr)\, f_\theta\bigl(x_t^{(i)} \mid x_{t-1}^{a_t^{(i)}}\bigr)}{R_{\theta,t}\bigl(x_t^{(i)} \mid x_{t-1}^{a_t^{(i)}}\bigr)}. \qquad (8) $$

The particle system at time $t$ can thus be estimated recursively using the two steps of the APF.

3.2 Estimation of the likelihood

The likelihood for the general SSM (1) can be estimated using the particle systems obtained from the APF. This is done by first writing the one-step predictive density as

$$ p_\theta(y_t \mid y_{1:t-1}) = \int g_\theta(y_t \mid x_t)\, f_\theta(x_t \mid x_{t-1})\, p_\theta(x_{t-1} \mid y_{1:t-1})\, \mathrm{d}x_{t-1:t} = \int W_{\theta,t}(x_t, x_{t-1})\, R_{\theta,t}(x_t \mid x_{t-1})\, p_\theta(x_{t-1} \mid y_{1:t-1})\, \mathrm{d}x_{t-1:t}, $$

where we have multiplied and divided by the propagation kernel $R_{\theta,t}(\cdot)$. To approximate the integral, we note that the (unweighted) particle pairs $\{x_{t-1}^{a_t}, x_t\}$ are approximately drawn from $R_{\theta,t}(x_t \mid x_{t-1})\, p_\theta(x_{t-1} \mid y_{1:t-1})$. Consequently, we obtain the Monte Carlo approximation

$$ p_\theta(y_t \mid y_{1:t-1}) \approx \frac{1}{N} \sum_{i=1}^{N} W_{\theta,t}\bigl(x_t^{(i)}, x_{t-1}^{a_t^{(i)}}\bigr) = \frac{1}{N} \sum_{i=1}^{N} w_{t|t}^{(i)}. $$

By inserting this approximation into (3), we obtain the particle estimate of the likelihood,

$$ L(\theta) = \prod_{t=1}^{T} \Biggl( \frac{1}{N} \sum_{i=1}^{N} w_{t|t}^{(i)} \Biggr). \qquad (9) $$

This likelihood estimator has been studied extensively in the SMC literature. The estimator is consistent and unbiased, see e.g. [23] and Proposition 7.4.1 in [4]. Remember that the unbiasedness is an essential property for the exact approximation of the MH-algorithm and therefore also for our algorithm.
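As a concrete example of (7)-(9), the sketch below (ours, not the authors' code) implements the bootstrap choice $R_{\theta,t} = f_\theta$ for the LGSS model (20) used in Section 5, so that the weights (8) reduce to $g_\theta(y_t \mid x_t)$. It returns the logarithm of (9), accumulated in log-space for numerical stability, together with the particles, weights and ancestor indices needed by the fixed-lag smoother later on.

```python
import numpy as np

def bootstrap_pf(y, theta, N=1000, rng=None):
    """Bootstrap particle filter for the LGSS model (20):
        x_{t+1} | x_t ~ N(theta1 * x_t, theta2),  y_t | x_t ~ N(x_t, 0.1^2),
    with the known initial state x_0 = 0. Uses R_{theta,t} = f_theta, so the
    weights (8) reduce to g_theta(y_t | x_t). Returns the log of the likelihood
    estimate (9) and the stored particles, normalised weights and ancestors.
    """
    rng = rng or np.random.default_rng()
    theta1, theta2 = theta
    T = len(y)
    x = np.zeros((T + 1, N))             # particles; row 0 is x_0 = 0
    a = np.zeros((T + 1, N), dtype=int)  # ancestor indices
    w = np.ones((T + 1, N)) / N          # normalised weights
    log_lik = 0.0

    for t in range(1, T + 1):
        # (i) resample and propagate, cf. (7)
        a[t] = rng.choice(N, size=N, p=w[t - 1])
        x[t] = theta1 * x[t - 1, a[t]] + np.sqrt(theta2) * rng.standard_normal(N)
        # (ii) weight, cf. (8); y[t - 1] is y_t (Python indexing starts at 0)
        logw = -0.5 * np.log(2 * np.pi * 0.1**2) - 0.5 * ((y[t - 1] - x[t]) / 0.1) ** 2
        c = logw.max()
        w_unnorm = np.exp(logw - c)
        # contribution to log of (9): log( (1/N) * sum_i w_t^(i) )
        log_lik += c + np.log(w_unnorm.mean())
        w[t] = w_unnorm / w_unnorm.sum()

    return log_lik, x, w, a
```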

3.3 Estimation of the log-likelihood gradient

To estimate the gradient of the log-likelihood $S_T(\theta)$ using SMC methods, we employ Fisher's identity [9, 2, 19],

$$ \nabla \log p_\theta(y_{1:T}) = \int \nabla \log p_\theta(x_{1:T}, y_{1:T})\, p_\theta(x_{1:T} \mid y_{1:T})\, \mathrm{d}x_{1:T} = \mathbb{E}_\theta\bigl[ \nabla \log p_\theta(x_{1:T}, y_{1:T}) \,\big|\, y_{1:T} \bigr]. \qquad (10) $$

For the general SSM (1), we have

$$ p_\theta(x_{1:T}, y_{1:T}) = \prod_{t=1}^{T} f_\theta(x_t \mid x_{t-1})\, g_\theta(y_t \mid x_t), \qquad (11) $$

and inserting this into (10) results in

$$ \nabla \log p_\theta(y_{1:T}) = \sum_{t=1}^{T} \int \nabla \log g_\theta(y_t \mid x_t)\, p_\theta(x_t \mid y_{1:T})\, \mathrm{d}x_t + \sum_{t=1}^{T} \int \nabla \log f_\theta(x_t \mid x_{t-1})\, p_\theta(x_{t-1:t} \mid y_{1:T})\, \mathrm{d}x_{t-1:t}, $$

which depends on the one-step smoothing distribution $p_\theta(x_t \mid y_{1:T})$ and the two-step smoothing distribution $p_\theta(x_{t-1:t} \mid y_{1:T})$.

In [24], the quantity above is computed by using the APF directly or by using a forward smoother (FS) [6]. The drawback with the first approach is poor accuracy due to particle degeneracy. The second approach is computationally costly, as the FS algorithm has a computational complexity of $\mathcal{O}(N^2 T)$, compared to $\mathcal{O}(NT)$ for the APF.

In this paper, we instead make use of the FL-smoother [13, 20], which has the same computational cost as the APF but better accuracy. This is because the FL-smoother mitigates the particle degeneracy experienced by the APF. The FL-smoother relies on the assumption that the SSM (1) is mixing fast, so that we can use the approximation $p_\theta(x_t \mid y_{1:T}) \approx p_\theta(x_t \mid y_{1:\kappa_t})$, with $\kappa_t = \min\{t + \Delta, T\}$ and where $\Delta$ denotes some lag. Hence, the smoothing distribution of $x_t$ is not strongly influenced by measurements obtained after time $\kappa_t$.

By marginalisation of (6) over $x_{1:t-1}$ and $x_{t+1:\kappa_t}$, we obtain the empirical one-step smoothing distribution

$$ p_\theta(\mathrm{d}x_t \mid y_{1:\kappa_t}) \triangleq \sum_{i=1}^{N} w_{\kappa_t|\kappa_t}^{(i)}\, \delta_{x_{\kappa_t,t}^{(i)}}(\mathrm{d}x_t), \qquad (12) $$

where we use the notation $x_{\kappa_t,t}^{(i)} = x_t^{a_{\kappa_t,t}^{(i)}}$. Here, we let $a_{\kappa_t,t}^{(i)}$ denote the ancestor index at time $t$ of particle $x_{\kappa_t}^{(i)}$. Analogously, we obtain the empirical two-step smoothing distribution as

$$ p_\theta(\mathrm{d}x_{t-1:t} \mid y_{1:\kappa_t}) \triangleq \sum_{i=1}^{N} w_{\kappa_t|\kappa_t}^{(i)}\, \delta_{x_{\kappa_t,t-1:t}^{(i)}}(\mathrm{d}x_{t-1:t}). \qquad (13) $$

Inserting (11)-(13) into (10) gives the estimate of the gradient

$$ S_T(\theta) = \sum_{t=1}^{T} \bigl[ L_{1,t}(\theta) + L_{2,t}(\theta) \bigr], \qquad (14) $$

where we have introduced the quantities

$$ L_{1,t}(\theta) = \sum_{i=1}^{N} w_{\kappa_t|\kappa_t}^{(i)}\, \nabla \log g_\theta\bigl(y_t \mid x_{\kappa_t,t}^{(i)}\bigr), \qquad (15a) $$
$$ L_{2,t}(\theta) = \sum_{i=1}^{N} w_{\kappa_t|\kappa_t}^{(i)}\, \nabla \log f_\theta\bigl(x_{\kappa_t,t}^{(i)} \mid x_{\kappa_t,t-1}^{(i)}\bigr). \qquad (15b) $$

In [20], the statistical properties of the FL-smoother are analysed. The authors show that the lag $\Delta^{\star} = \log T$ minimises the mean squared error of the state estimates. It is also shown that the resulting estimates are biased, which could be a significant problem in many applications. However, in our setting, the bias is later compensated for by the accept/reject procedure in the MH-algorithm, and the invariance property is retained.

Algorithm 1 Sequential Monte Carlo (SMC) for estimation of the gradient and Hessian of the log-likelihood

Input: The system of the form (1) with measurements $y_{1:T}$. The proposal distribution $R_{\theta,t}(\cdot)$, the number of particles $N$ and the fixed lag $\Delta$.
Output: A particle estimate of the likelihood $p_\theta(y_{1:T})$, the gradient $S_T(\theta)$ and the Hessian $I_T(\theta)$ at the parameter $\theta$.

• Run the auxiliary particle filter:
  Initialise the particles $x_0^{(i)}$ for $i = 1, \ldots, N$.
  for $t = 1, \ldots, T$ do
    - Resample and propagate each particle using (7).
    - Calculate the weights for each particle using (8).
  end for
• Estimate the likelihood function using (9).
• Run the fixed-lag particle smoother:
  for $t = 1, \ldots, T$ do
    - Set $\kappa_t = \min(t + \Delta, T)$.
    - Recover the ancestor indices $a_{\kappa_t,t}^{(i)}$ and $a_{\kappa_t,t-1}^{(i)}$ for each particle.
    - Compute (15) and store the results.
    - Compute (18) and store the results.
  end for
• Estimate the gradient and the Hessian at time $T$ using (14) and (17) with the stored values from the previous steps.
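To make the fixed-lag steps of Algorithm 1 concrete, here is a sketch (ours, under the same LGSS assumptions as the filter sketch in Section 3.2) of the score estimate (14)-(15) computed from the stored particles, weights and ancestor indices. The helper `trace_ancestors` recovers the indices $a_{\kappa_t,t}^{(i)}$; in (20) the parameters only enter $f_\theta$, so the observation term (15a) vanishes, and the Hessian terms (18) could be accumulated in the same loop from the second derivatives of $\log f_\theta$.

```python
import numpy as np

def trace_ancestors(a, kappa, t):
    """For each particle at time kappa, return the index of its ancestor at time t,
    i.e. the index defining x_{kappa,t}^{(i)} in (12)-(13)."""
    idx = np.arange(a.shape[1])
    for s in range(kappa, t, -1):      # a[s] maps indices at time s to time s-1
        idx = a[s, idx]
    return idx

def fl_score_lgss(y, theta, x, w, a, delta):
    """Fixed-lag estimate (14)-(15) of the score S_T(theta) for the LGSS model (20).

    x, w, a are the particles, normalised weights and ancestor indices stored by
    a particle filter (e.g. the bootstrap_pf sketch above)."""
    theta1, theta2 = theta
    T = len(y)
    score = np.zeros(2)
    for t in range(1, T + 1):
        kappa = min(t + delta, T)
        idx_t = trace_ancestors(a, kappa, t)   # indices of x_{kappa,t}
        idx_tm1 = a[t, idx_t]                  # indices of x_{kappa,t-1}
        resid = x[t, idx_t] - theta1 * x[t - 1, idx_tm1]
        grad_log_f = np.stack([
            x[t - 1, idx_tm1] * resid / theta2,               # d log f / d theta1
            -0.5 / theta2 + 0.5 * resid**2 / theta2**2,       # d log f / d theta2
        ], axis=1)
        score += w[kappa] @ grad_log_f         # weighted sum, cf. (15b)
    return score

# Usage, assuming bootstrap_pf from the earlier sketch has been run:
# log_lik, x, w, a = bootstrap_pf(y, theta, N=1000)
# S_hat = fl_score_lgss(y, theta, x, w, a, delta=6)
```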

3.4 Estimation of the log-likelihood Hessian

The negative Hessian $I_T(\theta)$ of the log-likelihood can be estimated using SMC methods in combination with Louis' identity [15, 2],

$$ -\nabla^2 \log p_\theta(y_{1:T}) = \bigl[ \nabla \log p_\theta(y_{1:T}) \bigr]^2 - \mathbb{E}_\theta\Bigl[ \bigl[ \nabla \log p_\theta(x_{1:T}, y_{1:T}) \bigr]^2 \,\Big|\, y_{1:T} \Bigr] - \mathbb{E}_\theta\Bigl[ \nabla^2 \log p_\theta(x_{1:T}, y_{1:T}) \,\Big|\, y_{1:T} \Bigr], \qquad (16) $$

where we introduce $v^2 = v v^{\top}$ for a vector $v$.

We obtain expressions similar to those for the gradient by inserting (11)-(13) into (16),

$$ I_T(\theta) = \bigl[ S_T(\theta) \bigr]^2 - \sum_{k=1}^{4} \sum_{t=1}^{T} M_{k,t}(\theta), \qquad (17) $$

where we have introduced the following quantities

$$ M_{1,t}(\theta) = \sum_{i=1}^{N} w_{\kappa_t|\kappa_t}^{(i)} \bigl[ \nabla_\theta \log f_\theta\bigl(x_{\kappa_t,t}^{(i)} \mid x_{\kappa_t,t-1}^{(i)}\bigr) \bigr]^2, \qquad (18a) $$
$$ M_{2,t}(\theta) = \sum_{i=1}^{N} w_{\kappa_t|\kappa_t}^{(i)} \bigl[ \nabla_\theta \log g_\theta\bigl(y_t \mid x_{\kappa_t,t}^{(i)}\bigr) \bigr]^2, \qquad (18b) $$
$$ M_{3,t}(\theta) = \sum_{i=1}^{N} w_{\kappa_t|\kappa_t}^{(i)}\, \nabla_\theta^2 \log f_\theta\bigl(x_{\kappa_t,t}^{(i)} \mid x_{\kappa_t,t-1}^{(i)}\bigr), \qquad (18c) $$
$$ M_{4,t}(\theta) = \sum_{i=1}^{N} w_{\kappa_t|\kappa_t}^{(i)}\, \nabla_\theta^2 \log g_\theta\bigl(y_t \mid x_{\kappa_t,t}^{(i)}\bigr). \qquad (18d) $$

Here, we have assumed for simplicity that no single parameter appears in both the dynamics and the observation process. If such parameters exist, they can be handled by including the corresponding cross terms in (18).

3.5 SMC algorithm

In Algorithm 1, we present the complete algorithm that combines the APF and the FL-smoother to compute the estimates of the gradient and Hessian. The primary outputs from this algorithm are the estimates of the likelihood, the gradient and the Hessian given a parameter $\theta$. We continue by integrating this information into the MH-sampler previously discussed.

In our experience, the off-diagonal elements of the information matrix are often more difficult to estimate with good accuracy. Therefore, we only use the diagonal elements of the information matrix in the remainder of this work. This retains the property that the second-order proposal is scale-invariant, but without taking the curvature into account.
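In practice this amounts to zeroing the off-diagonal elements of the estimated information matrix before forming $W_T(\theta)$. A minimal sketch (ours; the lower clipping that keeps the matrix positive definite is an ad hoc safeguard, not something prescribed by the paper):

```python
import numpy as np

def diagonalise_information(info_hat, floor=1e-6):
    """Keep only the diagonal of the estimated information matrix I_T(theta).
    Each diagonal element is clipped from below so that its inverse W_T(theta)
    stays positive definite; the floor value is an arbitrary choice."""
    return np.diag(np.maximum(np.diag(info_hat), floor))
```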

4 Particle Metropolis-Hastings

From the previous development, we know how to estimate the various quantities needed for using the MH-algorithm with the second-order proposal. Recall that the exact approximation of the MH-algorithm guarantees that the stationary distribution of the Markov chain remains the parameter posterior, see [1]. This result only requires that the likelihood estimate is unbiased.

In fact, we are allowed to use the entire particle system in the proposal, see [3]. This makes it possible to use the second-order proposal, since we have demonstrated that the gradient and Hessian information can be computed using the particle system. Note that these estimates are biased, but this does not affect the invariance property, as the bias is compensated for by the accept/reject mechanism.

Algorithm 2 Second-order Particle Metropolis-Hastings (PMH) for Bayesian parameter inference in nonlinear SSMs

Input: The inputs to Algorithm 1. The number of PMH iterations $M$, the initial parameter $\theta_0$ and the proposal step lengths $\gamma$.
Output: Samples from the parameter posterior $\theta = \{\theta_1, \ldots, \theta_M\}$.

• Run Algorithm 1 to obtain $p_{\theta_0}(y_{1:T})$, $S_T(\theta_0)$ and $I_T(\theta_0)$.
• for $k = 1, \ldots, M$ do
  - Propose a new parameter $\theta' \sim q(\theta' \mid \theta_{k-1})$ using (5) with the estimates of the gradient $S_T(\theta_{k-1})$ and Hessian $I_T(\theta_{k-1})$.
  - Run Algorithm 1 to obtain $p_{\theta'}(y_{1:T})$, $S_T(\theta')$ and $I_T(\theta')$.
  - Sample $u_k \sim \mathcal{U}[0, 1]$.
  - if $u_k < \alpha(\theta', \theta_{k-1})$ given by (19) then
      accept the proposed parameter:
      set $\theta_k = \theta'$ and $p_{\theta_k}(y_{1:T}) = p_{\theta'}(y_{1:T})$,
      set $S_T(\theta_k) = S_T(\theta')$ and $I_T(\theta_k) = I_T(\theta')$.
    else
      reject the proposed parameter:
      set $\theta_k = \theta_{k-1}$ and $p_{\theta_k}(y_{1:T}) = p_{\theta_{k-1}}(y_{1:T})$,
      set $S_T(\theta_k) = S_T(\theta_{k-1})$ and $I_T(\theta_k) = I_T(\theta_{k-1})$.
    end if
  end for

Hence, we can use the MH-algorithm together with Algorithm 1 to form the final method in Algorithm 2. The acceptance probability follows from (4) as

$$ \alpha(\theta', \theta) = 1 \wedge \frac{p_{\theta'}(y_{1:T})}{p_\theta(y_{1:T})}\, \frac{p(\theta')}{p(\theta)}\, \frac{q\bigl(\theta \mid \theta', S_T(\theta'), I_T(\theta')\bigr)}{q\bigl(\theta' \mid \theta, S_T(\theta), I_T(\theta)\bigr)}. \qquad (19) $$

This is the full PMH procedure that uses the second-order proposal. The complexity of the algorithm is linear in the number of particles $N$ and in the number of iterations $M$. The user choices include the proposal kernel $R_{\theta,t}(\cdot)$, the lag $\Delta$, the number of particles $N$ and the number of iterations $M$. Also, the step sizes $\gamma$ need to be calibrated for each model; this is further discussed in the subsequent section.
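As a compact illustration of Algorithm 2, the sketch below (ours) wires the pieces together. The function `estimate_all` stands in for Algorithm 1 and is assumed to return the log-likelihood estimate, the score estimate and the negative-Hessian estimate at a given $\theta$; `log_prior` returns $\log p(\theta)$ (and $-\infty$ outside the support). A flat prior inside the support is assumed when forming the proposal moments.

```python
import numpy as np
from scipy.stats import multivariate_normal

def proposal_mean_cov(theta, score, info, gamma):
    """Mean and covariance of the second-order proposal (5), flat prior assumed."""
    W = np.linalg.inv(info)
    return theta + 0.5 * gamma * (W @ score), gamma * W

def pmh2(y, theta0, M, gamma, estimate_all, log_prior, rng=None):
    """Second-order PMH, a sketch of Algorithm 2 under the stated assumptions."""
    rng = rng or np.random.default_rng()
    d = len(theta0)
    theta = np.array(theta0, dtype=float)
    ll, S, I = estimate_all(theta, y)      # stand-in for Algorithm 1
    chain = np.zeros((M, d))

    for k in range(M):
        mean, cov = proposal_mean_cov(theta, S, I, gamma)
        theta_prop = rng.multivariate_normal(mean, cov)
        ll_prop, S_prop, I_prop = estimate_all(theta_prop, y)
        mean_rev, cov_rev = proposal_mean_cov(theta_prop, S_prop, I_prop, gamma)

        # log of the acceptance probability (19)
        log_alpha = (ll_prop - ll
                     + log_prior(theta_prop) - log_prior(theta)
                     + multivariate_normal.logpdf(theta, mean=mean_rev, cov=cov_rev)
                     - multivariate_normal.logpdf(theta_prop, mean=mean, cov=cov))

        if np.log(rng.uniform()) < log_alpha:
            theta, ll, S, I = theta_prop, ll_prop, S_prop, I_prop
        chain[k] = theta
    return chain
```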

5 Numerical illustrations

We continue by illustrating the method proposed in Algorithm 2 for parameter estimation in nonlinear SSMs. First, we consider a linear Gaussian state space (LGSS) model and then a popular stochastic volatility model with a nonlinear observation process.

We compare the three different variations of the proposal in (5), i.e. zeroth-order, first-order and second-order. The step length $\gamma$ is selected individually for each method such that the acceptance rate is about 30%. Also, we use the same step length for all parameters to simplify calibration, i.e. $\gamma$ is selected as a scalar.

5.1 Linear Gaussian state space model

Consider the linear Gaussian state space model (LGSS),

$$ x_{t+1} \mid x_t \sim \mathcal{N}\bigl(x_{t+1};\, \theta_1 x_t,\, \theta_2\bigr), \qquad (20a) $$
$$ y_t \mid x_t \sim \mathcal{N}\bigl(y_t;\, x_t,\, 0.1^2\bigr), \qquad (20b) $$

with parameters $\theta^{\star} = \{\theta_1^{\star}, \theta_2^{\star}\} = \{0.5, 1.0\}$. We use $T = 250$ time steps, $N = 1\,000$ particles, $M = 10\,000$ iterations (discarding the first $5\,000$ iterations as burn-in) and the bootstrap APF with $R_{\theta,t}(\cdot) = f_\theta(\cdot)$. The fixed lag $\Delta$ is chosen according to the rule of thumb $\Delta = \log T \approx 6$. Here, we use improper priors for the parameters, i.e. $p(\theta_1) = \mathcal{U}[-1, 1]$ and $p(\theta_2) = \mathcal{U}[0, \infty]$. The step lengths are calibrated as $\gamma^{(0)} = 0.30$, $\gamma^{(1)} = 0.16$ and $\gamma^{(2)} = 1.20$ for the zeroth-order, first-order and second-order proposals, respectively.
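To reproduce this setup, data from (20) can be generated as in the sketch below (ours, with $x_0 = 0$ and a fixed seed as arbitrary choices); the resulting series can then be fed to the filter and PMH sketches above.

```python
import numpy as np

def simulate_lgss(theta, T=250, rng=None):
    """Simulate T observations from the LGSS model (20) with x_0 = 0."""
    rng = rng or np.random.default_rng()
    theta1, theta2 = theta
    x = 0.0
    y = np.zeros(T)
    for t in range(T):
        x = theta1 * x + np.sqrt(theta2) * rng.standard_normal()  # (20a)
        y[t] = x + 0.1 * rng.standard_normal()                    # (20b)
    return y

y = simulate_lgss(theta=(0.5, 1.0), T=250, rng=np.random.default_rng(1))
```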

In the left part of Figure 1, we present the trace plots of the burn-in phase of the algorithms. We clearly see the advantage of using the second-order proposal, as it adjusts its step size quickly to reach the neighbourhood of the true parameters. The contour plots of the estimated parameter posteriors are shown in the right part of Figure 1, where we see that all proposals give similar parameter posterior estimates.

5.2 Nonlinear stochastic volatility model

Consider the Hull-White stochastic volatility model [12],

$$ x_{t+1} \mid x_t \sim \mathcal{N}\bigl(x_{t+1};\, \theta_1 x_t,\, \theta_2^2\bigr), \qquad (21a) $$
$$ y_t \mid x_t \sim \mathcal{N}\bigl(y_t;\, 0,\, 0.65^2 \exp(x_t)\bigr), \qquad (21b) $$

with parameters $\theta^{\star} = \{\theta_1^{\star}, \theta_2^{\star}\} = \{0.98, 0.16\}$. We use the same settings and priors as for the LGSS example. The step lengths are calibrated as $\gamma^{(0)} = 0.04$, $\gamma^{(1)} = 0.03$ and $\gamma^{(2)} = 0.8$, respectively.

In Figure 2, we present the burn-in trace plots and the parameter posterior distributions for the three proposals. The behaviour of the proposals is similar to the LGSS example: using the second-order proposal again shortens the burn-in while giving a similar parameter posterior estimate. Notice the mode-seeking behaviour of the first-order and second-order proposals, as their posterior estimates are more localised.

[Figure 1 appears here: three trace-plot panels (left) and three posterior contour panels (right), labelled Zeroth-order, First-order and Second-order, plotted over $\theta_1$ and $\theta_2$.]

Figure 1: The trace plots (left) of the first 50 iterations and contour plots of the parameter posterior estimates (right) from the three proposals used in Algorithm 2 on the LGSS model in (20). The dotted lines correspond to the parameters from which the data were generated.

[Figure 2 appears here: three trace-plot panels (left) and three posterior contour panels (right), labelled Zeroth-order, First-order and Second-order, plotted over $\theta_1$ and $\theta_2$.]

Figure 2: The trace plots (left) of the first 170 iterations and contour plots of the parameter posterior estimates (right) from the three proposals used in Algorithm 2 on the stochastic volatility model in (21). The dotted lines correspond to the parameters from which the data were generated.

6 Conclusions

We have proposed a novel algorithm based on PMH and FL-smoothing for Bayesian parameter inference in nonlinear SSMs. The algorithm uses first-order and second-order information in the proposal to improve the performance of the vanilla PMH-algorithm. The complexity of the proposed algorithm is linear in the number of particles, which makes it a practical alternative to other smoothing-based inference algorithms.

We have seen in the examples that using the second-order proposal shortens the burn-in phase. Also, the second-order proposal is simpler to tune, as it is scale-invariant and automatically rescales the step length in each direction. In the MH-algorithm, it is known that adding first-order information into the proposal improves the performance in high-dimensional problems. Hopefully, similar results can be established for the second-order proposal in the PMH framework.

Future work includes theoretical analysis of the convergence rate and scaling properties of the algorithm. Also, it would be interesting to explore the use of Hamiltonian MCMC [17] ideas in this setting. This would potentially improve the mixing of the Markov chain and open up the possibility of solving problems with hundreds of parameters.

References

[1] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.

[2] O. Cappé, E. Moulines, and T. Rydén. Inference in Hidden Markov Models. Springer, 2005.

[3] J. Dahlin, F. Lindsten, and T. B. Schön. Particle Metropolis Hastings using Langevin dynamics. In Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013.

[4] P. Del Moral. Feynman-Kac Formulae - Genealogical and Interacting Particle Systems with Applications. Probability and its Applications. Springer, 2004.

[5] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B, 68(3):411–436, 2006.

[6] P. Del Moral, A. Doucet, and S. Singh. Forward smoothing using sequential Monte Carlo. Pre-print, 2010. arXiv:1012.5390v1.

[7] A. Doucet and A. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan and B. Rozovsky, editors, The Oxford Handbook of Nonlinear Filtering. Oxford University Press, 2011.

[8] R. G. Everitt. Bayesian parameter estimation for latent Markov random fields and social networks. Journal of Computational and Graphical Statistics, 21(4):940–960, 2012.

[9] R. A. Fisher. Theory of statistical estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 22(05):700–725, 1925.

[10] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 73(2):1–37, 2011.

[11] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

[12] J. Hull and A. White. The pricing of options on assets with stochastic volatilities. The Journal of Finance, 42(2):281–300, 1987.

[13] G. Kitagawa and S. Sato. Monte Carlo smoothing and self-organising state-space model. In A. Doucet, N. de Freitas, and N. Gordon, editors, Sequential Monte Carlo Methods in Practice, pages 177–195. Springer, 2001.

[14] F. Lindsten and T. B. Schön. Backward simulation methods for Monte Carlo statistical inference. Foundations and Trends in Machine Learning, 6(1):1–143, August 2013.

[15] T. A. Louis. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 44(02):226–233, 1982.

[16] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[17] R. M. Neal. MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, editors, Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC Press, June 2010.

[18] B. Ninness and S. Henriksen. Bayesian system identification via Markov chain Monte Carlo techniques. Automatica, 46(1):40–51, 2010.

[19] B. Ninness, A. Wills, and T. B. Schön. Estimation of general nonlinear state-space systems. In Proceedings of the 49th IEEE Conference on Decision and Control (CDC), Atlanta, USA, December 2010.

[20] J. Olsson, O. Cappé, R. Douc, and E. Moulines. Sequential Monte Carlo smoothing with application to parameter estimation in nonlinear state space models. Bernoulli, 14(1):155–179, 2008.

[21] V. Peterka. Bayesian system identification. Automatica, 17(1):41–53, 1981.

[22] M. K. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599, 1999.

[23] M. K. Pitt, R. S. Silva, P. Giordani, and R. Kohn. On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. Journal of Econometrics, 171:134–151, 2012.

[24] G. Poyiadjis, A. Doucet, and S. S. Singh. Particle approximations of the score and observed information matrix in state space models with application to parameter estimation. Biometrika, 98(1):65–80, 2011.

[25] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2nd edition, 1999.

[26] G. O. Roberts and J. S. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998.