An Adaptive Metropolis Algorithm


Transcript of An Adaptive Metropolis Algorithm

Heikki Haario, Eero Saksman* and Johanna Tamminen

Abstract

A proper choice of the proposal distribution for MCMC methods, e.g. for the Metropolis-Hastings algorithm, is well known to be a crucial factor for the convergence of the algorithm. In this paper we introduce an adaptive Metropolis algorithm (AM), where the Gaussian proposal distribution is updated along the process using the full information accumulated so far. Due to the adaptive nature of the process, the AM algorithm is non-Markovian, but we establish here that it has the correct ergodic properties. We also include the results of our numerical tests, which indicate that the AM algorithm competes well with traditional Metropolis-Hastings algorithms, and demonstrate that AM provides an easy-to-use algorithm for practical computation.

1991 Mathematics Subject Classification: 65C05, 65U05.

Keywords: adaptive MCMC, comparison, convergence, ergodicity, Markov chain Monte Carlo, Metropolis-Hastings algorithm

1 Introduction

It is generally acknowledged that the choice of an effective proposal distribution for, e.g., the random walk Metropolis algorithm is essential in order to obtain reasonable results by simulation in a limited amount of time. This concerns both the size and the spatial orientation of the proposal distribution, which are often very difficult to choose well since the target density is unknown (see e.g. [Gelman et al. 1996], [Gilks et al. 1995], [Gilks et al. 1998], [Haario et al. 1998], and [Roberts et al. 1994]). A possible remedy is provided by adaptive algorithms, which use the history of the process in order to 'tune' the proposal distribution suitably. This has previously been done (for instance) by assuming that the state space contains an atom. The adaptation is performed only at the recurrence times to the atom in order to preserve the right ergodic properties [Gilks et al. 1998]. The adaptation criteria are then obtained by monitoring the acceptance rate.
* Supported by the Academy of Finland, Project 328371

A related interesting self-regenerative version of adaptive MCMC, based on introducing an auxiliary chain, is contained in the

recent preprints [Sahu and Zhigljavsky 1998a] and [Sahu and Zhigljavsky 1998b]. For other versions of adaptive MCMC and related work we refer, e.g., to [Evans 1991], [Fishman 1996], [Gelfand and Sahu 1994], and [Marinari and Parisi 1992], together with the references therein.

We introduce here an adaptive Metropolis algorithm (AM), which adapts continuously to the target distribution. Importantly, the adaptation affects both the size and the spatial orientation of the proposal distribution. Moreover, the new algorithm is straightforward to implement and use in practice. The definition of the AM algorithm is based on the classical random walk Metropolis algorithm [Metropolis et al. 1953] and its modification, the AP algorithm (introduced in [Haario et al. 1998]). In the AP algorithm the proposal distribution is a Gaussian distribution centered at the current state, and its covariance is calculated from a fixed finite number of the previous states. In the AM algorithm the covariance of the proposal distribution is calculated using all of the previous states. The method is easily implemented with no extra computational cost, since one may apply a simple recursion formula for the covariances involved.

An important advantage of the AM algorithm is that it uses the cumulating information right from the very beginning of the simulation. The rapid start of the adaptation ensures that the search becomes more effective already at an early stage of the simulation, which diminishes the number of function evaluations needed.

To be more exact, assume that at time t the already sampled states of the AM chain are X_0, X_1, ..., X_t, some of which may be multiple. The new proposal distribution for the next candidate point is then a Gaussian distribution with mean at the current point X_t and covariance given by s_d R, where R is the covariance matrix determined by the spatial distribution of the states X_0, X_1, ..., X_t ∈ R^d. The scaling parameter s_d depends only on the dimension d of the vectors.
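As an illustration of this proposal step, consider the following minimal Python sketch (our own code, not part of the paper; the function name and the use of NumPy are assumptions). It samples a candidate from a Gaussian centred at the current state, with covariance proportional to the empirical covariance of the whole history:

```python
import numpy as np

def am_proposal(history, rng, s_d=None):
    """Sample the next candidate from a Gaussian centred at the current
    state X_t, with covariance s_d * R, where R is the empirical
    covariance of the whole history X_0, ..., X_t."""
    history = np.asarray(history, dtype=float)   # shape (t + 1, d)
    d = history.shape[1]
    if s_d is None:
        s_d = 2.4 ** 2 / d      # scaling s_d = (2.4)^2 / d, see Section 2
    R = np.atleast_2d(np.cov(history, rowvar=False))
    return rng.multivariate_normal(history[-1], s_d * R)
```

Here `rng` is a `numpy.random.Generator`; in the full algorithm the covariance is additionally regularized and frozen during an initial period, as described in Section 2.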
This adaptation strategy forces the proposal distribution to approach an appropriately scaled Gaussian approximation of the target distribution, which increases the efficiency of the simulation. A more detailed description of the algorithm is given in Section 2 below.

One of the difficulties in constructing adaptive MCMC algorithms is to ensure that the algorithm maintains the correct ergodicity properties. We observe here (see also [Haario et al. 1998]) that the AP algorithm fails this property. Our main result, Theorem 1 below, verifies that the AM process indeed has the correct ergodicity properties. The AM chain is not Markovian, but we show that the asymptotic dependence between the elements of the chain is weak enough to apply known laws of large numbers for mixingales. Similar results may be proven also for variants of the algorithm where the covariance is computed from a suitably increasing segment of the near history.

Section 3 contains a detailed description of the AM algorithm as a stochastic process and the theorem on the ergodicity of the AM. The proof is based on an auxiliary result that is proven in Section 4. Finally, in Section 5 we present results from test simulations, where the AM algorithm is compared with traditional Metropolis-Hastings algorithms by applying both linear and non-linear, correlated and uncorrelated unimodal target distributions. Our tests seem to imply that AM performs at least as well as the traditional

algorithms for which a nearly optimal proposal distribution has been given a priori.

2 Description of the algorithm

We shall assume for simplicity that our target distribution is supported on a bounded subset S ⊂ R^d, and that it has the (unscaled) density π(x) with respect to the Lebesgue measure on S. With a slight abuse of notation we shall also denote the target distribution by π.

We now explain how the AM algorithm works. Recall from the introduction that the basic idea is to update the proposal distribution using the knowledge we have so far learned about the target distribution. Otherwise the definition of the algorithm is identical to the usual Metropolis process.

Suppose thus that at time t−1 we have sampled the states X_0, X_1, ..., X_{t−1}, where X_0 is the initial state. Then a candidate point Y is sampled from the proposal distribution q_t(· | X_0, ..., X_{t−1}), which now may depend on the whole history (X_0, X_1, ..., X_{t−1}). The candidate point Y is accepted with probability

    α(X_{t−1}, Y) = min( 1, π(Y)/π(X_{t−1}) ),

in which case we set X_t = Y, and otherwise X_t = X_{t−1}.

The proposal distribution q_t(· | X_0, ..., X_{t−1}) employed in the AM algorithm is a Gaussian distribution with mean at the current point X_{t−1} and covariance C_t = C_t(X_0, ..., X_{t−1}). Note that in the simulation only jumps into S are accepted, since we assumed that the target distribution vanishes outside S.

The crucial point regarding the adaptation is how the covariance of the proposal distribution depends on the history of the chain. In the AM algorithm this is solved by setting C_t = s_d Cov(X_0, ..., X_{t−1}) + s_d ε I_d after an initial period, where s_d is a parameter that depends only on the dimension d and ε > 0 is a constant that we may choose very small compared to the size of S. In order to start, we select an arbitrary strictly positive definite initial covariance C_0, which of course is chosen according to our best a priori knowledge (which may be quite poor).
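The accept/reject rule above is the standard Metropolis step; as a sketch (illustrative code of our own, with hypothetical names):

```python
import numpy as np

def metropolis_accept(x_prev, y, pi, rng):
    """Accept candidate y with probability min(1, pi(y) / pi(x_prev));
    on rejection the chain stays at x_prev.  pi is the (unscaled)
    target density, assumed to vanish outside the support S."""
    p_prev, p_new = pi(x_prev), pi(y)
    alpha = min(1.0, p_new / p_prev) if p_prev > 0 else 1.0
    return y if rng.random() < alpha else x_prev
```

In particular, a candidate outside S has π(Y) = 0 and is always rejected, which is how the algorithm confines the chain to S.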
We select an index t_0 > 0 for the length of an initial period and define

    C_t = { C_0,                                        t ≤ t_0,
          { s_d Cov(X_0, ..., X_{t−1}) + s_d ε I_d,     t > t_0.      (1)

The covariance C_t may be viewed as a function of t variables from R^d, taking values in the set of uniformly positive definite matrices.

Recall the definition of the empirical covariance matrix determined by points x_0, ..., x_k ∈ R^d:

    Cov(x_0, ..., x_k) = (1/k) ( Σ_{i=0}^{k} x_i x_i^T − (k+1) x̄_k x̄_k^T ),      (2)

where x̄_k = (1/(k+1)) Σ_{i=0}^{k} x_i and the elements x_i ∈ R^d are considered as column vectors. One thus obtains that in definition (1), for t ≥ t_0 + 1, the covariance C_t satisfies the recursion formula

    C_{t+1} = ((t−1)/t) C_t + (s_d/t) ( t X̄_{t−1} X̄_{t−1}^T − (t+1) X̄_t X̄_t^T + X_t X_t^T + ε I_d ).      (3)

This allows one to calculate C_t without excessive computational cost, since the mean X̄_t also satisfies an obvious recursion formula.

The choice of the length of the initial segment t_0 > 0 is free, but the bigger it is chosen, the slower the effect of the adaptation starts to take place. In a sense the size of t_0 reflects our trust in the initial covariance C_0. As S was assumed to be bounded, the covariance C_t will never blow up. The role of the parameter ε is merely to ensure that C_t does not become singular (see Remark 1 below). As a basic choice for the scaling parameter we have adopted the value s_d = (2.4)^2/d from [Gelman et al. 1996], where it was shown that in a certain sense this choice optimizes the mixing properties of the Metropolis search in the case of Gaussian targets and Gaussian proposals.

Remark 1. In our test runs the covariance C_t has not had a tendency to degenerate. Hence, in practical computations one may safely utilize definition (1) with ε = 0, although the change is negligible if ε was already chosen small enough. However, for the purposes of our theoretical considerations it is convenient to assume that ε > 0.

Remark 2. In order to avoid a slow start of the algorithm it is possible to employ special tricks. Naturally, if a priori knowledge (such as the maximum likelihood value or the approximate covariance of the target distribution) is available, it can be utilized in choosing the initial state or the initial covariance C_0. Also, in some cases it is advisable to employ a greedy start procedure: during a short initial period one updates the proposal using only the accepted states.
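The recursions for the mean and the covariance are easy to write out in code; the following NumPy sketch (our own, with hypothetical names) implements one step of formula (3) together with the mean update:

```python
import numpy as np

def update_cov(C_t, xbar_prev, X_t, t, s_d, eps):
    """One step of recursion (3): from C_t, the mean xbar_{t-1} of
    X_0, ..., X_{t-1}, and the new state X_t, compute C_{t+1} and
    the updated mean xbar_t (valid for t >= 1)."""
    xbar_t = (t * xbar_prev + X_t) / (t + 1)   # recursion for the mean
    d = len(X_t)
    C_next = ((t - 1) / t) * C_t + (s_d / t) * (
        t * np.outer(xbar_prev, xbar_prev)
        - (t + 1) * np.outer(xbar_t, xbar_t)
        + np.outer(X_t, X_t)
        + eps * np.eye(d)
    )
    return C_next, xbar_t
```

One can check that iterating this recursion reproduces definition (1), i.e. C_{t+1} = s_d Cov(X_0, ..., X_t) + s_d ε I_d.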
Afterwards the AM is run as described above. Moreover, during the early stage of the algorithm it is natural to require that it moves at least a little. If it has not moved enough during some number of iterations, the proposal distribution can be shrunk by some constant factor.

Remark 3. It is also possible to choose an integer n_0 > 1 and update the covariance only every n_0-th step (again using the whole history). This saves computer time when generating the candidate points. There is again a simple recursion formula for the covariances C_t.

3 Ergodicity of the AM chain

In the AP algorithm, which was briefly described in the introduction, the covariance C_t was calculated from the last H states only, where H ≥ 2. As an undesirable consequence, this strategy brings non-exactness to the simulation. There are several ways

to see this. One may, for instance, study the Markov chain consisting of H-tuples of consecutive variables of the AP chain, to obtain the limit distribution for the AP by a suitable projection from the equilibrium distribution of this Markov chain. Simple examples in the case of a finite state space for an analogous model show that the limiting distribution of the AP algorithm differs slightly from the target distribution. Moreover, numerical calculations in the continuous case indicate similar behaviour. An illustrating example of this phenomenon is presented in the Appendix.

It is our aim in this section to show that the AM algorithm has the right ergodic properties and hence provides correct simulation of the target distribution.

Let us start by recalling some basic notions of the theory of stochastic processes that are needed later. We first define the setup. Let (S, B, m) be a state space. For natural numbers n ≥ 1 the symbol K_n will always denote a generalized transition probability from the product S^n to S. Thus K_n is a map S^n × B → [0, 1] for which x ↦ K_n(x; A) is B^n-measurable for each A ∈ B, and K_n(x; ·) is a probability measure on (S, B) for each x ∈ S^n. In a natural way K_n defines a positive contraction from M(S^n) into M(S). Here M(S) denotes the set of finite measures on (S, B), and the norm ‖·‖ on M(S) denotes the total variation norm.
A transition probability on S corresponds to the case n = 1 in the above definition.

We assume that a sequence (K_n), n ≥ 1, of generalized transition probabilities is given. Moreover, let μ_0 be a probability distribution (the initial distribution) on S. Then the sequence (K_n) and μ_0 determine uniquely the finite dimensional distributions of the discrete time stochastic process (chain) (X_n), n ≥ 0, on S via the formula

    P(X_0 ∈ A_0, X_1 ∈ A_1, ..., X_n ∈ A_n)
      = ∫_{y_0 ∈ A_0} μ_0(dy_0) ∫_{y_1 ∈ A_1} K_1(y_0; dy_1) ∫_{y_2 ∈ A_2} K_2(y_0, y_1; dy_2)
        ··· ∫_{y_n ∈ A_n} K_n(y_0, y_1, ..., y_{n−1}; dy_n).      (4)

In fact, it is directly verified that these distributions are consistent, and the theorem of Ionescu-Tulcea (see e.g. Proposition V.1.1 of [Neveu 1965]) yields the existence of the chain (X_n) on S satisfying (4).

We shall now turn to the exact definition of the AM chain as a discrete time stochastic process. Recall that our target distribution is supported on a bounded subset S ⊂ R^d, so that π(x) ≡ 0 outside S. Thus we choose S to be our state space, equipped with the Borel sigma-algebra B(S), and choose m to be the normalized Lebesgue measure on S. The target π has the (unscaled) density π(x) with respect to the Lebesgue measure on S. For simplicity we shall assume here that the density is bounded from above on S: for some M < ∞ it holds that

    π(x) ≤ M  for x ∈ S.      (5)

Let C be a symmetric and strictly positive definite matrix on R^d and denote by N_C

the density of the mean-zero Gaussian distribution on R^d with covariance C. Thus

    N_C(x) = (2π)^{−d/2} |C|^{−1/2} exp( −(1/2) x^T C^{−1} x ).      (6)

The Gaussian proposal transition probability corresponding to the covariance C satisfies

    Q_C(x; A) = ∫_A N_C(y − x) dy,      (7)

where A ⊂ R^d is a Borel set and dy is the standard Lebesgue measure on R^d. It follows that Q_C is m-symmetric (see e.g. [Haario and Saksman 1991, Definition 2.2]).

We next recall the definition of the transition probability M_C for the Metropolis process having the target density π(x) and the proposal distribution Q_C:

    M_C(x; A) = ∫_A N_C(y − x) min( 1, π(y)/π(x) ) m(dy)
                + χ_A(x) ∫_{R^d} N_C(y − x) [ 1 − min( 1, π(y)/π(x) ) ] m(dy),      (8)

for A ∈ B(S), where χ_A denotes the characteristic function of the set A. It is easily verified that M_C defines a transition probability with state space S.

The AM chain, which we now define, is a Markov chain with infinite memory. The definition corresponds exactly to the AM algorithm introduced in Section 2.

Definition 1. Let S and π be as above, and let the initial covariance C_0 and the constant ε be given. Define the functions C_n for n ≥ 1 by formula (1). For a given initial distribution μ_0 the adaptive Metropolis chain (the AM chain) is a stochastic chain on S defined through (4) by the sequence (K_n), n ≥ 1, of generalized transition probabilities, where

    K_n(x_0, ..., x_{n−1}; A) = M_{C_n(x_0,...,x_{n−1})}(x_{n−1}; A)      (9)

for all n ≥ 1, x_i ∈ S (0 ≤ i ≤ n−1), and for subsets A ∈ B(S).

Let us turn to the study of the ergodicity properties of the AM chain. In order to be able to proceed we give some definitions. Recall first the definition of the coefficient of ergodicity [Dobrushin 1956]. Let T be a transition probability on S and set

    β(T) = sup_{ν_1, ν_2} ‖ν_1 T − ν_2 T‖ / ‖ν_1 − ν_2‖,      (10)

where the supremum is taken over distinct probability measures ν_1, ν_2 on (S, B). As usual, νT denotes the measure A ↦ ∫_S T(x; A) ν(dx).
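For intuition, on a finite state space the supremum in (10) is attained at pairs of point masses, which gives a closed-form expression; a small illustrative computation (our own sketch, not part of the paper):

```python
import numpy as np

def ergodicity_coefficient(T):
    """Dobrushin's coefficient of ergodicity for a finite transition
    matrix T (rows sum to 1): the supremum in (10) reduces to
    (1/2) * max_{i,j} sum_k |T[i, k] - T[j, k]|."""
    T = np.asarray(T, dtype=float)
    n = T.shape[0]
    return max(
        0.5 * np.abs(T[i] - T[j]).sum()
        for i in range(n)
        for j in range(n)
    )
```

For example, a matrix with identical rows (one-step convergence) has coefficient 0, while the identity matrix (no mixing at all) has coefficient 1.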

Clearly 0 ≤ β(T) ≤ 1. In the case β(T) < 1 the mapping T is a strict contraction on M(S) with respect to the metric defined by the total variation norm on M(S). From the definition it easily follows that

    β(T_1 T_2 ··· T_n) ≤ Π_{i=1}^{n} β(T_i).      (11)

The condition β(T^{k_0}) < 1 for some k_0 ≥ 1 is well known to be equivalent to the uniform ergodicity of the Markov chain having the transition probability T (compare [Nummelin 1984, Section 6.6]).

For our purposes it is useful to define the transition probability that is obtained from a generalized transition probability by 'freezing' the n−1 first variables. Hence, given a generalized transition probability K_n (where n ≥ 2) and a fixed (n−1)-tuple (y_0, y_1, ..., y_{n−2}) ∈ S^{n−1}, we denote ỹ_{n−2} = (y_0, y_1, ..., y_{n−2}) and define the transition probability K_{n,ỹ_{n−2}} by

    K_{n,ỹ_{n−2}}(x; A) = K_n(y_0, y_1, ..., y_{n−2}, x; A)      (12)

for x ∈ S and A ∈ B(S).

We are now ready to state and prove our main theorem.

Theorem 1. The AM chain (X_n) defined by the generalized transition probabilities (9) simulates properly the target distribution π regardless of the initial distribution: for any bounded and measurable function f : S → R it holds almost surely that

    lim_{n→∞} (1/(n+1)) ( f(X_0) + f(X_1) + ... + f(X_n) ) = ∫_S f(x) π(dx).

The proof of the theorem is based on the following technical auxiliary result, whose proof we postpone to the next section.

Theorem 2. Assume that the finite dimensional distributions of the stochastic process (X_n), n ≥ 0, on the state space S satisfy (4), where the sequence of generalized transition probabilities (K_n) is assumed to satisfy the following three conditions (i)–(iii):

(i) There is a fixed integer k_0 and a constant λ ∈ (0, 1) such that

    β( (K_{n,ỹ_{n−2}})^{k_0} ) ≤ λ < 1  for all ỹ_{n−2} ∈ S^{n−1} and n ≥ 2.

(ii) There is a fixed probability measure π on S and a constant c_0 > 0 such that

    ‖π K_{n,ỹ_{n−2}} − π‖ ≤ c_0/n  for all ỹ_{n−2} ∈ S^{n−1} and n ≥ 2.

(iii) The estimate for the operator norm

    ‖K_{n,ỹ_{n−2}} − K_{n+k,ỹ_{n+k−2}}‖_{M(S)→M(S)} ≤ c_1 k/n

holds, where c_1 is a fixed positive constant, n, k ≥ 1, and one assumes that the (n+k−1)-tuple ỹ_{n+k−2} is a direct continuation of the (n−1)-tuple ỹ_{n−2}.

Then, if f : S → R is bounded and measurable, it holds almost surely that

    lim_{n→∞} (1/(n+1)) ( f(X_0) + f(X_1) + ... + f(X_n) ) = ∫_S f(x) π(dx).      (13)

In what follows the auxiliary constants c_i, i = 1, 2, ..., depend on S, ε, or C_0, and their actual values are irrelevant for our purposes here.

Proof of Theorem 1. According to the above theorem it suffices to prove that the AM chain satisfies the conditions (i)–(iii). In order to check condition (i) we observe that, directly from definition (1) and by the fact that S is bounded, all the covariances C = C_n(y_0, ..., y_{n−1}) satisfy the matrix inequality

    0 < c_1 I_d ≤ C ≤ c_2 I_d.      (14)

Hence the corresponding normal densities N_C(· − x) are uniformly bounded from below on S for all x ∈ S, and (5) together with (8) trivially yield the bound

    K_{n,ỹ_{n−2}}(x; A) ≥ c_3 π(A)  for all x ∈ S and A ⊂ S,

with c_3 > 0. This easily yields (compare e.g. [Nummelin 1984, pp. 122–123]) that β(K_{n,ỹ_{n−2}}) ≤ 1 − c_3, which proves (i) with k_0 = 1.

We next verify condition (iii). To that end we assume that n ≥ 2 and observe that for a given ỹ_{n+k−2} ∈ S^{n+k−1} one has

    ‖K_{n,ỹ_{n−2}} − K_{n+k,ỹ_{n+k−2}}‖_{M(S)→M(S)} ≤ 2 sup_{y∈S, A∈B(S)} |K_{n,ỹ_{n−2}}(y; A) − K_{n+k,ỹ_{n+k−2}}(y; A)|.

Fix y ∈ S and A ∈ B(S) and introduce R_1 = C_n(y_0, ..., y_{n−2}, y) together with R_2 = C_{n+k}(y_0, ..., y_{n+k−2}, y). According to Definition 1 and formula (8) we see that

    |K_{n,ỹ_{n−2}}(y; A) − K_{n+k,ỹ_{n+k−2}}(y; A)| = |M_{R_1}(y; A) − M_{R_2}(y; A)|
      ≤ | ∫_{x∈A} (N_{R_1} − N_{R_2})(x − y) min( 1, π(x)/π(y) ) m(dx)
          + χ_A(y) ∫_{x∈R^d} (N_{R_1} − N_{R_2})(x − y) [ 1 − min( 1, π(x)/π(y) ) ] m(dx) |
      ≤ 2 ∫_{R^d} |N_{R_1}(z) − N_{R_2}(z)| dz
      ≤ 2 ∫_{R^d} dz ∫_0^1 | (d/ds) N_{R_1+s(R_2−R_1)}(z) | ds
      ≤ c_4 ‖R_1 − R_2‖,

where at the last stage we applied (14) in order to deduce that the partial derivatives of the density N_{R_1+s(R_2−R_1)} with respect to the components of the covariance are integrable over R^d, with bounds that depend only on ε, C_0 and S. Finally, it is clear from recursion formula (3) that in general ‖C_t − C_{t+1}‖ ≤ c/t for t > 1. By applying this inductively and using the uniform boundedness from above of the covariances C_t we easily see that

    ‖R_1 − R_2‖ ≤ c(S, C_0, ε) k/n,

and hence the previous estimates yield (iii).

In order to check condition (ii), fix ỹ_{n−2} ∈ S^{n−1} and denote C* = C_{n−1}(y_0, ..., y_{n−2}). It follows that ‖C* − C_n(y_0, ..., y_{n−2}, y)‖ ≤ c_5/n, where c_5 does not depend on y ∈ S. We may hence estimate as before that

    ‖K_{n,ỹ_{n−2}} − M_{C*}‖_{M(S)→M(S)} ≤ c_6/n.

Since M_{C*} is a Metropolis transition probability we have that π M_{C*} = π (see e.g. [Tierney 1994, p. 1705]), and we obtain

    ‖π − π K_{n,ỹ_{n−2}}‖ = ‖π (M_{C*} − K_{n,ỹ_{n−2}})‖ ≤ c_6/n,

which completes the proof of Theorem 1.

Let us record an expected result on the behaviour of the AM chain.

Corollary 3. Under the assumptions of Theorem 1 the covariance C_t almost surely stabilizes during the algorithm. In fact, as t → ∞ the covariance C_t converges to s_d Cov(π) + s_d ε I_d, where Cov(π) denotes the covariance of the target distribution π.

Proof. The claim follows directly from definition (1) of the covariance C_t by applying Theorem 1 repeatedly with the choices f(x) = x_i and f(x) = x_i x_j, where 1 ≤ i, j ≤ d.

Remark 4. In the definition of the AM chain one can replace the Gaussian proposals by e.g. uniform distributions on a parallelepiped. In this case the size and orientation of the parallelepiped are guided in a natural manner by the covariance C_t determined by (1) as above. Our proof of Theorem 1 remains unchanged, and we again obtain the exactness of the simulation. The only difference is that in this case the constant k_0 in condition (i) of Theorem 2 may exceed 1.
Naturally, here one has to make the necessary assumption that the set {x : π(x) > 0} does not contain components that are too much separated. In this connection the estimates provided by [Haario and Saksman 1991, Theorem 6.5(b)] are relevant.

Similar remarks apply to modifications where one adapts only certain parameters, or where some of the parameters are discrete.

Remark 5. It is clear that in the course of the AM algorithm one may also determine the covariance by using only an increasing part of the near history. For example, one may determine C_n by using only the samples X_{[n/2]}, X_{[n/2]+1}, ..., X_n. This is easily implemented in practice, and in this case Theorem 1 yields exactness of the simulation with only minor changes in the proof. Similar remarks apply also to the case where one updates the covariance only every n_0-th step (see Remarks 3 and 6). Finally, we note that in this paper we have left open the question whether the convergence of the algorithm (as established in Theorem 1) satisfies a central limit theorem.

4 Proof of Theorem 2

In this section we prove Theorem 2 by showing that a related process is a mixingale (in the sense of [McLeish 1975]) that satisfies an appropriate law of large numbers. The conditions of the theorem were tailored to apply to the AM chain on bounded subsets of R^d, but they are stated in the language of a general state space. This is advantageous since one may apply them in a more general situation, especially for variants of the AM where the state space contains both discrete and continuous parts. Our proof is based on the following basic proposition.

Proposition 4. Let the chain (X_n) on the state space S and the generalized transition probabilities (K_n) fulfil the conditions of Theorem 2. Denote by F_n = σ(X_0, X_1, ..., X_n) the sigma-algebra generated by the chain up to time n, and write λ_0 = λ^{1/k_0}. Then for all initial distributions and for any measurable and bounded function f on S it holds that

    ‖E( f(X_{n+k}) | F_n ) − ∫_S f(y) π(dy)‖_∞ ≤ c_2(c_0, λ) inf_{1≤j≤k} ( j²/(n+k−j) + λ_0^j ) ‖f‖_∞.      (15)

Proof. We may clearly assume that ∫_S f(y) π(dy) = 0, since the claim holds trivially if the function f is a constant.
Let n ≥ 2, k ≥ 2 and note that from the definition of the conditional expectation and (4) it follows that (almost surely)

    E( f(X_{n+k}) | F_n )      (16)
      = ∫_{y_{n+1}∈S} K_{n+1}(X_0, X_1, ..., X_n; dy_{n+1}) ∫_{y_{n+2}∈S} K_{n+2}(X_0, X_1, ..., X_n, y_{n+1}; dy_{n+2})
        ··· ∫_{y_{n+k}∈S} K_{n+k}(X_0, X_1, ..., X_n, y_{n+1}, ..., y_{n+k−1}; dy_{n+k}) f(y_{n+k}).

Let us denote (X_0, ..., X_n) = X̃_n. In what follows X̃_n does not interfere with the integrations and hence it may be thought of as a free variable (or constant). We also introduce the transition probability Q, where Q(y; dz) = K_{n+2}(X̃_n, y; dz). The assumption

(iii) yields for arbitrary values of X̃_n and y_{n+1}, ..., y_{n+k−1} that

    | ∫_{y_{n+k}∈S} ( K_{n+k}(X̃_n, y_{n+1}, ..., y_{n+k−1}; dy_{n+k})
        − K_{n+2}(X̃_n, y_{n+k−1}; dy_{n+k}) ) f(y_{n+k}) | ≤ c_0 ‖f‖_∞ (k−2)/(n+2).      (17)

This estimate enables us to write (16) in the form

    E( f(X_{n+k}) | F_n ) = g_k(X̃_n) + ∫_{y_{n+1}∈S} K_{n+1}(X̃_n; dy_{n+1}) ∫_{y_{n+2}∈S} K_{n+2}(X̃_n, y_{n+1}; dy_{n+2})
        ··· ∫_{y_{n+k−1}∈S} K_{n+k−1}(X̃_n, y_{n+1}, ..., y_{n+k−2}; dy_{n+k−1})
        ∫_{y_{n+k}∈S} K_{n+2}(X̃_n, y_{n+k−1}; dy_{n+k}) f(y_{n+k}),      (18)

where g_k = g_k(X̃_n) satisfies

    |g_k(X̃_n)| ≤ c_0 ‖f‖_∞ (k−2)/(n+2).

In the next step we iterate the procedure by replacing the generalized transition probability K_{n+k−1}(X̃_n, y_{n+1}, ..., y_{n+k−2}; dy_{n+k−1}) by the transition probability Q in formula (18). By continuing in this manner we obtain

    E( f(X_{n+k}) | F_n ) = ∫_{y_{n+1}∈S} K_{n+1}(X̃_n; dy_{n+1}) ∫_{y_{n+2}∈S} Q(y_{n+1}; dy_{n+2})
        ··· ∫_{y_{n+k}∈S} Q(y_{n+k−1}; dy_{n+k}) f(y_{n+k})
        + g_2(X̃_n) + g_3(X̃_n) + ... + g_k(X̃_n),

where

    g_j = ∫_{y_{n+1}∈S} K_{n+1}(X̃_n; dy_{n+1}) ∫_{y_{n+2}∈S} K_{n+2}(X̃_n, y_{n+1}; dy_{n+2})      (19)
        ··· ∫_{y_{n+j}∈S} ( K_{n+j}(X̃_n, y_{n+1}, ..., y_{n+j−1}; dy_{n+j})
            − K_{n+2}(X̃_n, y_{n+j−1}; dy_{n+j}) ) Q^{k−j} f(y_{n+j}).

Recall here that Q^{k−j} denotes the (k−j)-th iterate of the transition probability Q, and we apply the standard notation (Q^{k−j}f)(x) = ∫_S Q^{k−j}(x; dy) f(y).

Since ‖Q^{k−j}f‖_∞ ≤ ‖f‖_∞, we obtain as before from assumption (iii) that

    |g_j| ≤ c_0 ‖f‖_∞ (j−2)/(n+2).

Summing up, we have shown that

    E( f(X_{n+k}) | F_n ) = ε_{n,k} + ∫_{y_{n+1}∈S} K_{n+1}(X_0, ..., X_n; dy_{n+1}) Q^{k−1} f(y_{n+1}),      (20)

where ε_{n,k} = ε_{n,k}(X_0, ..., X_n) satisfies

    |ε_{n,k}| ≤ Σ_{j=2}^{k} c_0 ‖f‖_∞ (j−2)/(n+2) ≤ c_0 (k²/n) ‖f‖_∞.      (21)

Write next k' = [(k−1)/k_0] and notice that β(Q^{k−1}) ≤ λ^{k'} according to (i). By (ii) and the definition of Q we have

    ‖π Q^{k−1} − π‖ ≤ Σ_{j=0}^{k−2} ‖π Q^{j+1} − π Q^j‖ ≤ Σ_{j=0}^{k−2} c_0/(n+2) ≤ c_0 (k−1)/(n+2),

and hence, using these estimates and the assumption πf = 0, we may estimate

    ‖Q^{k−1}f‖_∞ = sup_{x∈S} |δ_x Q^{k−1} f| ≤ sup_{x∈S} |(δ_x − π) Q^{k−1} f| + |π Q^{k−1} f|
      ≤ 2 λ^{k'} ‖f‖_∞ + |(π Q^{k−1} − π) f| ≤ ( c_0 (k−1)/(n+2) + 2 λ^{k'} ) ‖f‖_∞.      (22)

Combining this with (20) and (21), it follows that

    ‖E( f(X_{n+k}) | F_n )‖_∞ ≤ c_1(c_0, λ) ( k²/n + λ^{[(k−1)/k_0]} ) ‖f‖_∞,      (23)

which is valid for all n, k ≥ 2.

In order to deduce the proposition, we first observe that for any index j between 1 and k the standard properties of the conditional expectation yield

    ‖E( f(X_{n+k}) | F_n )‖_∞ ≤ ‖E( f(X_{n+k}) | F_{n+k−j} )‖_∞.

Hence, by replacing n by n+k−j and k by j in the estimate (23) we finally deduce that

    ‖E( f(X_{n+k}) | F_n )‖_∞ ≤ inf_{1≤j≤k} c_1(c_0, λ) ( j²/(n+k−j) + λ^{[(j−1)/k_0]} ) ‖f‖_∞.      (24)

The claim of the proposition follows immediately from this estimate.

Proof of Theorem 2. From Proposition 4 we obtain that

    ‖E( f(X_{n+k}) − ∫_S f(y) π(dy) | F_n )‖_∞ ≤ γ(k),      (25)

where

    γ(k) ≤ c_2(c_0, λ) inf_{1≤j≤k} ( j²/(k−j) + λ_0^j ) ‖f‖_∞ ≤ c_2(c_0, f, λ) (log² k)/k,      (26)

and the last estimate is obtained by choosing j = log k / log(1/λ_0) for k ≥ k_1(λ_0).

At this stage the estimate (26) for the asymptotic independence, together with the definition of the sigma-algebra F_n, makes it clear that f(X_n) − ∫_S f(y) π(dy) is a mixingale in the sense of McLeish (see [McLeish 1975] or [Hall and Heyde 1980, p. 19]). Moreover, γ(k) ≤ C(ε) k^{ε−1} for every ε > 0. Hence we may apply directly the well-known laws of large numbers for mixingales, in the form of [Hall and Heyde 1980, Theorem 2.21, p. 41], to the sequence f(X_n) in order to obtain the desired conclusion.

Remark 6. We refer to the original article [McLeish 1975] or to the recent review article [Davidson and Jong 1997] for basic properties of mixingales. However, we point out that the proof of Theorem 2 could be concluded by elementary means, without referring to the theory of mixingales, by applying Proposition 4 to estimate the variance of the sum S_n = (1/n) Σ_{k=1}^{n} f(X_k) − ∫_S f(y) π(dy) and utilizing the boundedness of the function f. Nevertheless, the reference to mixingales is useful, since it is possible to weaken condition (iii) and still obtain Theorem 2. In this manner one obtains Theorem 1 also in the case where the covariance is calculated from a relatively slowly increasing segment of the near history only (compare Remark 5). For instance, this is the case if at time t this segment has length of order t^γ, where γ ∈ (1/2, 1).

5 Testing AM in practice and comparison with traditional methods

In this section we present results obtained from testing the AM algorithm numerically. From the practical point of view, it is important to know how accurate simulations of the target distribution one can expect from finite MCMC runs. In [Haario et al.
1998] we compared three different methods: the random walk Metropolis algorithm (M), the single component Metropolis algorithm (SC), and the adaptive proposal algorithm (AP) (see the introduction or [Haario et al. 1998] for the exact definition and more details). Recall again that the difference between the AP and AM algorithms is simply that in AP the covariance for the proposal distribution is computed only from a fixed number of previous states. Here we have made tests similar to those in [Haario et al. 1998] and included the AM algorithm in the comparison.

We have applied the AM algorithm successfully in various dimensions up to d = 200. Here we present the results of extensive tests in dimension d = 8. As target distributions


Figure 1: The results with different 8-dimensional target distributions. The top figure corresponds to the frequency of hits to the 68.3% confidence region. The lower figure corresponds to the frequency of hits to the 1-dimensional confidence region between 68.3% and 95%. The true value is denoted by a horizontal line.

we used two Gaussian distributions, correlated (π1) and uncorrelated (π2), and two non-linear 'banana' shaped distributions, moderately 'twisted' (π3) and strongly 'twisted' (π4). The number of function evaluations varied depending on the target distribution: 20000 for π1 and π2, 40000 for π3, and 80000 for π4. The burn-in period was chosen to be half of the chain length. Each test case was run 100 times in order to retrieve statistically relevant information. Hence, each accuracy criterion number is an average value over 100 repetitions. We refer to [Haario et al. 1998] for a more detailed explanation of the test procedure.

We have tried to be fair in choosing the proposal distributions for the random walk Metropolis and the single component Metropolis algorithms. For example, in the case of the Gaussian target distributions we used covariances corresponding to the targets and normalized them with the heuristic optimal scaling from [Gelman et al. 1996].

In Figure 1 the test results in dimension 8 are summarized in graphical form. In the top picture we present the mean and the error bars given by the standard deviations for the 68.3% confidence region. In the lower picture the mean and the error bars for the 1-dimensional confidence region between 68.3% and 95% are given. The 1-dimensional confidence region contains all those points whose projections onto the direction of the main semi-axis of the target distribution belong to the corresponding 1-dimensional confidence

interval. In the case of the non-linear target distributions π3 and π4 this value reflects the accuracy of the simulation in the tails of the 'banana'.

Table 1: Results obtained from the 8-dimensional non-linear distribution π4. The number of function evaluations was 80000 and the run was repeated 100 times. Mean(‖E‖) denotes the mean distance of the expectation values of the chains from the true value at the origin, calculated over 100 repetitions. Thus mean(‖E‖) = (1/100) Σ_{j=1}^{100} ( Σ_{i=1}^{d} (E_j^i)² )^{1/2}, where E_j is the mean vector of chain j. Std(‖E‖) denotes the standard deviation of the distances from the true value at the origin.

    Non-linear (π4)
    Method                 SC       M       AP      AM
    Mean(‖E‖)              6.53     7.89    4.85    2.41
    Std(‖E‖)               4.79     7.54    4.20    1.15
    Acceptance rate (%)    70.39    7.84    9.60    4.59

We present in Table 1 some standard performance criteria in the case of the most non-linear target distribution π4 (again in dimension 8). The rows of the table give the performance criteria, while the columns represent the different methods compared.

The results expressed in the figure and in the table indicate that the AM algorithm simulates the target distribution most accurately in these tests. With the Gaussian target distributions the results obtained using the AM algorithm are equally good as those using the Metropolis algorithm with an optimal proposal distribution. Moreover, in the case of the non-linear distributions the AM algorithm seems to be superior.

Acknowledgements: We thank Elja Arjas, Esa Nummelin and Antti Penttinen for useful discussions on the topics of the paper.

References

[Davidson and Jong 1997] J. Davidson and R. de Jong: Strong laws of large numbers for dependent heterogeneous processes: a synthesis of recent and new results, Econometric Rev. 16 (1997), 251-279.

[Dobrushin 1956] R. Dobrushin: Central limit theorems for non-stationary Markov chains II, Theory Prob. Appl. 1 (1956), 329-383.

[Evans 1991] M. Evans: Chaining via annealing, Ann. Statistics 19 (1991), 382-393.

[Fishman 1996] G. S. Fishman: Monte Carlo.
Concepts, algorithms and applications.Springer (1996). 15

[Gelfand and Sahu 1994] A. E. Gelfand and S. K. Sahu: On Markov chain Monte Carlo acceleration, J. Computational and Graphical Stat. 3 (1994), 261–276.

[Gelman et al. 1996] A. G. Gelman, G. O. Roberts and W. R. Gilks: Efficient Metropolis jumping rules, in Bayesian Statistics V, pp. 599–608 (eds J. M. Bernardo, J. O. Berger, A. F. David and A. F. M. Smith). Oxford University Press, 1996.

[Gilks et al. 1994] W. R. Gilks, G. O. Roberts and E. I. George: Adaptive direction sampling, The Statistician 43 (1994), 179–189.

[Gilks et al. 1995] W. R. Gilks, S. Richardson and D. J. Spiegelhalter: Introducing Markov chain Monte Carlo, pp. 1–19. In Markov Chain Monte Carlo in Practice (eds W. R. Gilks, S. Richardson and D. J. Spiegelhalter). Chapman & Hall, 1995.

[Gilks and Roberts 1995] W. R. Gilks and G. O. Roberts: Strategies for improving MCMC, pp. 75–88. In Markov Chain Monte Carlo in Practice (eds W. R. Gilks, S. Richardson and D. J. Spiegelhalter). Chapman & Hall, 1995.

[Gilks et al. 1998] W. R. Gilks, G. O. Roberts and S. K. Sahu: Adaptive Markov chain Monte Carlo, J. Amer. Statist. Assoc. (to appear).

[Haario and Saksman 1991] H. Haario and E. Saksman: Simulated annealing process in general state space, Adv. Appl. Prob. 23 (1991), 866–893.

[Haario et al. 1998] H. Haario, E. Saksman and J. Tamminen: Adaptive proposal distribution for random walk Metropolis algorithm, Reports of the Department of Mathematics, University of Helsinki, Preprint 176 (1998).

[Hall and Heyde 1980] P. Hall and C. C. Heyde: Martingale Limit Theory and Its Application. Academic Press, 1980.

[Hastings 1970] W. K. Hastings: Monte Carlo sampling methods using Markov chains and their applications, Biometrika 57 (1970), 97–109.

[Marinari and Parisi 1992] E. Marinari and G. Parisi: Simulated tempering: a new Monte Carlo scheme, Europhys. Lett. 19 (1992), 451–458.

[McLeish 1975] D. L. McLeish: A maximal inequality and dependent strong laws, Ann. Prob. 3 (1975), 829–839.

[Metropolis et al. 1953] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller and E. Teller: Equations of state calculations by fast computing machines, J. Chem. Phys. 21 (1953), 1087–1091.

[Neveu 1965] J. Neveu: Mathematical Foundations of the Calculus of Probability. Holden-Day, 1965.

[Nummelin 1984] E. Nummelin: General Irreducible Markov Chains and Non-negative Operators. Cambridge University Press, 1984.

[Roberts et al. 1994] G. O. Roberts, A. Gelman and W. R. Gilks: Weak convergence and optimal scaling of random walk Metropolis algorithms. Preprint. (http://www.stats.bris.ac.uk/MCMC/)

[Sahu and Zhigljavsky 1998a] S. K. Sahu and A. A. Zhigljavsky: Self regenerative Markov chain Monte Carlo. Preprint. (http://www.stats.bris.ac.uk/MCMC/)

[Sahu and Zhigljavsky 1998b] S. K. Sahu and A. A. Zhigljavsky: Adaption for self regenerative MCMC. Preprint. (http://www.stats.bris.ac.uk/MCMC/)

[Tierney 1994] L. Tierney: Markov chains for exploring posterior distributions, with discussion, Ann. Statistics (1994), 1701–1762.

Appendix

We present here an illustrative two-dimensional example from [Haario et al. 1998] in which the covariance $C_t$ of the AP algorithm was calculated from the last 200 states and the AP algorithm produced a considerable error in the simulation. This phenomenon underlines the importance of calculating the covariance from an increasing segment of the history, as is done in the AM algorithm. When the AM algorithm was applied to the same example it produced, as expected, a simulation free of bias. For many practical applications the error produced by the AP algorithm is, however, negligible; see [Haario et al. 1998].

Example. Consider a density f defined on the rectangle R = [−18, 18] × [−3, 3] ⊂ ℝ², containing the middle strip S = [−0.5, 0.5] × [−3, 3]. We set

$$f(x) = \begin{cases} 36 & \text{if } x \in S,\\ 1 & \text{if } x \in R \setminus S.\end{cases}$$

With this choice π(S) : π(R \ S) = 36 : 35, and hence about half of the mass is concentrated on S. Thus the AP algorithm spends about the same amount of time on S as on the rest of R. However, S and R are thin rectangles with orthogonal orientations, which forces the AP algorithm to repeatedly turn the direction of its proposal distribution. This causes a notable bias in the simulation on S (see Figure 2). In fact, the relative error in the simulation on S is 13%. There is also a slight error in the simulation near the far ends of the rectangle R. The corresponding unbiased results of the AM algorithm are presented in Figure 3.
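To make the mechanism concrete, here is a minimal Python sketch of the example density together with an AM-style sampler whose Gaussian proposal covariance is computed from the whole chain history. This is an illustration, not the paper's implementation: the warm-up length t0, the regularization eps, the covariance refresh interval and the initial covariance are assumptions of this sketch, and the history covariance is simply recomputed with np.cov rather than updated recursively. The scaling sd = 2.4²/d follows the heuristic of [Gelman et al. 1996] cited above.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Unnormalized target of the Example: 36 on the strip S,
    1 on the rest of the rectangle R, and 0 outside R."""
    if not (-18.0 <= x[0] <= 18.0 and -3.0 <= x[1] <= 3.0):
        return 0.0
    return 36.0 if -0.5 <= x[0] <= 0.5 else 1.0

def am_sample(n, x0=(0.0, 0.0), t0=1000, sd=2.4**2 / 2, eps=1e-6):
    """Random walk Metropolis whose Gaussian proposal covariance is, after an
    initial period of t0 steps, recomputed from the entire chain history.
    sd = 2.4^2/d is the Gelman et al. scaling; t0, eps and the initial
    covariance are illustrative choices, not values from the paper."""
    chain = np.empty((n, 2))
    chain[0] = x0
    C = 0.1 * np.eye(2)                  # initial proposal covariance
    for t in range(1, n):
        x = chain[t - 1]
        if t > t0 and t % 50 == 0:       # refresh occasionally, for speed
            # covariance of the full history, regularized with eps * I
            C = sd * (np.cov(chain[:t].T) + eps * np.eye(2))
        y = rng.multivariate_normal(x, C)
        # standard Metropolis acceptance; proposals outside R are rejected
        if rng.random() < min(1.0, f(y) / f(x)):
            chain[t] = y
        else:
            chain[t] = x
    return chain

chain = am_sample(20000)
# the strip S carries mass 36/71 of the total, i.e. about half
frac_S = np.mean(np.abs(chain[:, 0]) <= 0.5)
print(f"fraction of samples on S: {frac_S:.2f} (target {36/71:.2f})")
```

Replacing the full history chain[:t] with only the last 200 states, chain[max(0, t - 200):t], gives the AP-style windowed adaptation whose bias on S is shown in Figure 2.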

Figure 2: The solid line represents the 1-dimensional projection of the true target distribution of the Example; the histogram shows the result of the AP algorithm, with the covariance calculated from the 200 previous states, averaged over 100 runs of 100000 states each.

Figure 3: The solid line represents the 1-dimensional projection of the true target distribution of the Example; the histogram shows the result of the AM algorithm, averaged over 100 runs of 100000 states each.

Addresses of the authors:

H. HAARIO: Department of Mathematics, P.O. Box 4 (Yliopistonkatu 5), FIN-00014 University of Helsinki, Finland. E-mail: heikki.haario@helsinki.fi

E. SAKSMAN: Department of Mathematics, P.O. Box 4 (Yliopistonkatu 5), FIN-00014 University of Helsinki, Finland. E-mail: eero.saksman@helsinki.fi

J. TAMMINEN: Finnish Meteorological Institute, Geophysics Research, P.O. Box 503, FIN-00101 Helsinki, Finland. E-mail: johanna.tamminen@fmi.fi
