On the Convergence of the Markov Chain Simulation Method

Krishna B. Athreya ∗

Iowa State University

Hani Doss †

Ohio State University

Jayaram Sethuraman ‡

Florida State University

July 1992

Revised July 1993. Revised March 1995.

Abstract

The Markov chain simulation method has been successfully used in many problems, including some that arise in Bayesian statistics. We give a self-contained proof of the convergence of this method in general state spaces under conditions that are easy to verify.

Key words and phrases: Calculation of posterior distributions, ergodic theorem, successive substitution sampling.

Abbreviated title: Convergence of Markov Chain Simulation Method.

AMS (1991) subject classifications: Primary 60J05; secondary 65U05, 60B10.

∗ Research supported by National Science Foundation Grant NSF DMS-90-07182

† Research supported by Air Force Office of Scientific Research Grant F49620-94-1-0028

‡ Research supported by Army Research Office Grant DAAL03-90-G-0103

1 Introduction

Let π be a probability distribution on a measurable space (X, B), and suppose that we are interested in estimating characteristics of it such as π(E) or ∫ f dπ, where E ∈ B and f is a bounded measurable function. Even when π is fully specified one may have to resort to methods like Monte Carlo simulation, especially when π is not computationally tractable. For this one uses the available huge literature on generation of random variables from an explicitly or implicitly described probability distribution π. Generally these methods require X to be the real line or require that π have special features, such as a structure in terms of independent real-valued random variables. When one cannot generate random variables with distribution π, one has to be satisfied with looking for a sequence of random variables X1, X2, . . . whose distributions converge to π and using Xn with a large index n as an observation from π. An example is the classical Markov chain simulation method, discussed further below.

Let P be a transition probability function on a measurable space (X, B), i.e. P is a function on X × B such that for each x ∈ X, P(x, ·) is a probability measure on (X, B), and for each C ∈ B, P(·, C) is a measurable function on (X, B). Suppose that π is a probability measure on (X, B) which is invariant for the Markov chain, i.e.

π(C) = ∫ P(x,C) π(dx) for all C ∈ B. (1.1)

We fix a starting point x0, generate an observation X1 from P(x0, ·), generate an observation X2 from P(X1, ·), etc. This generates the Markov chain x0 = X0, X1, X2, . . .. In order to make use of the Markov chain {Xn, n = 0, 1, . . .} to get some information about π, one needs results of the form:

(a) Ergodicity: For all or for "most" starting values x, the distribution of Xn converges to π in a suitable sense, for example

(a1) Variation norm ergodicity: sup_{C∈B} |P^n(x,C) − π(C)| → 0, or

(a2) Variation norm mean ergodicity: sup_{C∈B} |(1/n) ∑_{j=0}^{n} P^j(x,C) − π(C)| → 0.

(b) Law of large numbers: For all or for most starting values x, for each C ∈ B,

(1/n) ∑_{j=0}^{n} I_C(Xj) → π(C) for a.e. realization of the chain,

and for each f with ∫ |f| dπ < ∞,

(1/n) ∑_{j=0}^{n} f(Xj) → ∫ f dπ for a.e. realization of the chain.

Then, we may estimate π for example by generating G such chains in parallel, obtaining independent observations X_n^{(1)}, . . . , X_n^{(G)}, or by running one (or a few) very long chains. In Section 3 we make some remarks on the advantages and disadvantages of these two methods.

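To make the mechanics concrete, here is a minimal sketch (ours, not from the paper) of the simulation method and of the two sampling plans just described. The kernel is an illustrative AR(1) assumption, x′ = 0.5x + N(0,1) noise, chosen because its invariant distribution N(0, 4/3) is known in closed form; the function f and all parameter values are likewise assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    def step(x):
        # One draw from P(x, .): an illustrative AR(1) kernel x' = 0.5*x + N(0,1),
        # whose invariant distribution pi is N(0, 1/(1 - 0.5**2)) = N(0, 4/3).
        return 0.5 * x + rng.normal()

    def run_chain(x0, n):
        # Generate X_1, ..., X_n starting from x0.
        xs = np.empty(n)
        x = x0
        for i in range(n):
            x = step(x)
            xs[i] = x
        return xs

    n, G = 2000, 200
    f = lambda x: x**2  # estimate the second moment of pi, which is 4/3

    # Plan 1: G parallel chains; keep only the (approximately pi-distributed)
    # last state of each, giving G independent observations.
    finals = np.array([run_chain(0.0, n)[-1] for _ in range(G)])
    print(f(finals).mean())

    # Plan 2: one long chain; average f along the whole path, as licensed by
    # the law of large numbers in condition (b).
    print(f(run_chain(0.0, n * G)).mean())

Both estimates approach ∫ f dπ = 4/3; Section 3 returns to the trade-off between the two plans.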

Thus our goal is to find conditions on a given Markov chain, or rather on its transition function P(·, ·), so that some or all of the conditions (a) and (b) above hold, assuming that P admits an invariant probability measure π. In applications of Markov chain simulation, the probability measure π of interest is by construction the invariant probability measure of the Markov chain.

When {Xn} is a Markov chain with a countable state space, say {1, 2, . . .}, and transition probability matrix P = (p_{i,j}), the existence of an invariant probability distribution π and the irreducibility condition that there exists a state i0 such that from any initial state i, there is positive probability that the chain eventually hits i0, are enough to guarantee that (i) the chain {Xn} is recurrent in an appropriate sense, (ii) conditions (b) and (a2) above hold, and (iii) when an additional aperiodicity condition also holds, then (a1) above also holds. These facts are well known; see for instance Hoel, Port and Stone (1972).

A natural question is whether this is true for general state space Markov chains. In particular, when (1.1) holds, is there a form of the irreducibility condition under which some or all of (a) and (b) above hold?

The Markov chain literature has a number of results in this direction; see Orey (1971), Athreya and Ney (1978) and Nummelin (1984). Under a condition known as Harris recurrence (see below) the existence of an invariant distribution π implies mean ergodicity (condition (a2)) and the laws of large numbers (condition (b)). Unfortunately, Harris recurrence is not an easy condition to verify in general, and it is much stronger than irreducibility.

The main point of this paper is to show that when (1.1) holds, a simple irreducibility condition ((1.4) below) is enough to yield (a2) and (b). An additional aperiodicity condition yields (a1) as well. This provides a complete generalization of the results for the countable case. It is worth noting that recurrence emerges as a consequence of (1.1) and the irreducibility condition (1.4), and is not imposed as a hypothesis.

Before stating our main theorems, we will need a few definitions. For any set C ∈ B, let Nn(C) = ∑_{m=1}^{n} I(Xm ∈ C) and N(C) = ∑_{m=1}^{∞} I(Xm ∈ C) be the number of visits to C by time n and the total number of visits to C, respectively. The expectations of Nn(C) and N(C), when the chain starts at x, are given by Gn(x,C) = ∑_{m=1}^{n} P^m(x,C) and G(x,C) = ∑_{m=1}^{∞} P^m(x,C), respectively. Define T(C) = inf{n : n > 0, Xn ∈ C} to be the first time the chain hits C after time 0. Note that Px(T(C) < ∞) > 0 is equivalent to G(x,C) > 0.

The set A ∈ B is said to be accessible from x if Px(T(A) < ∞) > 0. Let ρ be a probability measure on (X, B). The Markov chain is said to be ρ-recurrent (or Harris recurrent with respect to ρ) if for every A with ρ(A) > 0, Px(T(A) < ∞) = 1 for all x ∈ X. The chain is said to be ρ-irreducible if every set A with ρ(A) > 0 is accessible from all x ∈ X. The set A is said to be recurrent if Px(T(A) < ∞) = 1 for all x ∈ X.

For the case where the σ-field B is separable, there is a very useful equivalent definition of ρ-irreducibility of a Markov chain. In this case, we can deduce from Theorem 2.1 of Orey (1971), on the existence of "C-sets," that ρ-irreducibility of a Markov chain implies that there exist a set A ∈ B with ρ(A) > 0, an integer n0, and a number ε > 0 satisfying

Px(T(A) < ∞) > 0 for all x ∈ X, (1.2)

and

x ∈ A, C ∈ B imply P^{n0}(x,C) ≥ ε ρ(C ∩ A). (1.3)


Let ρA(C) = ρ(C ∩ A)/ρ(A). This is well defined because ρ(A) > 0. The set function ρA is a probability measure satisfying ρA(A) = 1. Note that (1.2) simply states that A is accessible from all x ∈ X, and this condition does not make reference to the probability measure ρ. Condition (1.3) states that, uniformly in x ∈ A, the n0-step transition probabilities from x into subsets of A are bounded below by ε times ρ. That (1.2) and (1.3) imply ρA-irreducibility is, of course, immediate. This alternative definition of ρA-irreducibility, which applies to nonseparable σ-fields as well, will usually be much easier to verify in Markov chain simulation problems. By replacing ρ by ρA, we can also assume with no loss of generality that ρ is a probability measure with ρ(A) = 1 when verifying Condition (1.3).

We denote the greatest common divisor of any subset M of integers by g.c.d.(M). The main results of this paper are the following two theorems, which are stated for general Markov chains. They give sufficient conditions for the Markov chain simulation method to be successful.

Theorem 1 Suppose that the Markov chain {Xn} with transition function P(x,C) has an invariant probability measure π, i.e. (1.1) holds. Suppose that there is a set A ∈ B, a probability measure ρ with ρ(A) = 1, a constant ε > 0, and an integer n0 ≥ 1 such that

π{x : Px(T(A) < ∞) > 0} = 1, (1.4)

and

P^{n0}(x, ·) ≥ ε ρ(·) for each x ∈ A. (1.5)

Suppose further that

g.c.d.{m : there is an εm > 0 such that P^m(x, ·) ≥ εm ρ(·) for each x ∈ A} = 1. (1.6)

Then there is a set D such that

π(D) = 1 and sup_{C∈B} |P^n(x,C) − π(C)| → 0 for each x ∈ D. (1.7)

Theorem 2 Suppose that the Markov chain {Xn} with transition function P(x,C) satisfies Conditions (1.1), (1.4) and (1.5). Then

sup_{C∈B} |(1/n0) ∑_{r=0}^{n0−1} P^{mn0+r}(x,C) − π(C)| → 0 as m → ∞ for [π]-almost all x, (1.8)

and hence

sup_{C∈B} |(1/n) ∑_{j=1}^{n} P^j(x,C) − π(C)| → 0 as n → ∞ for [π]-almost all x. (1.9)

Let f(x) be a measurable function on (X, B) such that ∫ π(dy)|f(y)| < ∞. Then

Px{(1/n) ∑_{j=1}^{n} f(Xj) → ∫ π(dy) f(y)} = 1 for [π]-almost all x (1.10)

and

(1/n) ∑_{j=1}^{n} Ex(f(Xj)) → ∫ π(dy) f(y) for [π]-almost all x. (1.11)

Variants of these theorems form a main core of interest in the Markov chain literature. However, most of this literature makes strong assumptions such as the existence of a recurrent set A and proves the existence of an invariant probability measure before establishing (1.7) and (1.8). Theorems 1 and 2 exploit the existence of an invariant probability measure, which is given to us "for free" in the Markov chain simulation context, and establish the ergodicity or mean ergodicity under minimal and easily verifiable assumptions. For example, we have already noted that in the context of the Markov chain simulation method, we really need to check only (1.4), (1.5), and (1.6). To show (1.4), in most cases one will establish that Px(T(A) < ∞) > 0 for all x. Condition (1.6) is usually called the aperiodicity condition and is automatically satisfied if (1.5) holds with n0 = 1. Condition (1.5) holds if for each x ∈ A, P^{n0}(x, ·) has a non-trivial absolutely continuous component with respect to some measure ρ and the associated density p^{n0}(x, y) satisfies inf_{x,y∈A} p^{n0}(x, y) > 0 for some A with ρ(A) > 0. In a remark following the proof of Lemma 3 we indicate the critical points where one can use additional information to obtain results on the rate of the convergence.

In many interesting problems, including those that arise in Bayesian statistics, described later, the state space X is not countable. Early results on ergodicity of Markov chains on general state spaces used a condition known as the Doeblin condition; see Hypothesis (D′) on p. 197 of Doob (1953), which can be stated in an equivalent way as follows. There is a probability measure φ on (X, B), an integer k, and an ε > 0 such that

P^k(x,C) ≥ ε φ(C) for all x ∈ X and all C ∈ B.

This is a very strong condition. It implies that there exists an invariant probability measure to which the Markov chain converges at a geometric rate, from any starting point.

Theorem 3 Suppose that the Markov chain satisfies the Doeblin condition. Then there exists a unique invariant probability measure π such that for all n

sup_{C∈B} |P^n(x,C) − π(C)| ≤ (1 − ε)^{(n/k)−1} for all x ∈ X.
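On a finite state space the Doeblin condition holds with k = 1 whenever some column of the transition matrix is strictly positive, and the bound of Theorem 3 can be checked directly. The following sketch (our own 3-state toy example, not from the paper) does this.

    import numpy as np

    P = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.2, 0.5]])
    # Doeblin minorization with k = 1: P(x, .) >= eps * phi(.), where
    # phi(y) = min_x P[x, y] / eps and eps = sum_y min_x P[x, y] = 0.6 here.
    eps = P.min(axis=0).sum()

    # Stationary distribution pi: normalized left eigenvector for eigenvalue 1.
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    pi = pi / pi.sum()

    Pn = np.eye(3)
    for n in range(1, 11):
        Pn = Pn @ P
        tv = 0.5 * np.abs(Pn - pi).sum(axis=1).max()  # sup_x sup_C |P^n(x,C) - pi(C)|
        print(n, tv, (1 - eps) ** (n - 1))            # Theorem 3 bound with k = 1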

A proof of Theorem 3 may be found on p. 197 of Doob (1953). The Doeblin condition, though easy to state, is very strong and rarely holds in the problems that appear in the class of applications we are considering. We note that it is equivalent to the conditions of Theorem 1, with the set A of Theorem 1 replaced by X. In its absence, one has to impose the obvious conditions of irreducibility and aperiodicity and some other extra conditions, such as recurrence, to obtain ergodicity. Standard references in this area are Orey (1971), Revuz (1975) and Nummelin (1984). An exposition suitable for our purposes can be found in Athreya and Ney (1978). Theorem 4.1 of that paper may be stated as follows.

Theorem 4 Suppose that there is a set A ∈ B, a probability measure ρ concentrated on A, and an ε with 0 < ε < 1 such that

Px(T(A) < ∞) = 1 for all x ∈ X,

and

P(x,C) ≥ ε ρ(C) for all x ∈ A and all C ∈ B.

Suppose further that there is an invariant probability measure π. Then

sup_{C∈B} |P^n(x,C) − π(C)| → 0 for all x ∈ X.

This theorem establishes ergodicity under the assumption of the existence of an invariant probability measure, but also makes the strong assumption of the existence of a recurrent set A. It is often difficult to check that a set A is recurrent. Our main results, Theorems 1 and 2, weaken this recurrence condition to just the accessibility of the set A from [π]-almost all starting points x. We believe that this makes it routine to check the conditions of our theorem in Markov chain simulation problems. (We remark that our Theorems 1 and 2 state only that convergence occurs for [π]-almost all starting points. Examples can be given to show that this is the strongest assertion that can be made, even if instead of (1.4) the set A is assumed to be accessible from all x.)

Based on the work of Nummelin (1984), Tierney (1991) gives sufficient conditions for convergence of Markov chains to their invariant distribution. The main part of his Theorem 1 may be stated as follows.

Theorem 5 Suppose that the chain has invariant probability measure π. Assume that the chain is π-irreducible and aperiodic. Then (1.7) holds.

The main difference between Theorems 1 and 5 is that in Theorem 1 the probability measure with respect to which irreducibility needs to be verified is not restricted to be the invariant measure. This distinction is more than cosmetic. To check π-irreducibility, one has to show that Px(T(A) < ∞) > 0 for all x ∈ X and all A for which π(A) > 0. For certain Markov chain simulation problems in which the state space is very complicated, it is difficult or impossible to even identify these sets, since it is difficult to get a handle on the unknown π. An example of such a situation arose in the context of Bayesian nonparametrics in Doss (1991), where the state space was the set of all distribution functions. In that paper, the Markov chain simulation method was proposed as a way to determine π, but the unknown π was sufficiently complicated that one could not determine the sets to which it gives positive measure. On the other hand, a convenient choice of ρ made it possible to check ρ-irreducibility through the equivalent conditions (1.4) and (1.5). See the discussion in Section 4 of Doss (1991).

We point out that Tierney (1991) does not give a detailed definition of aperiodicity, but refers the reader to Chapter 2.4 of Nummelin (1984), where an implicit definition of the period of a Markov chain is given. In the present paper, aperiodicity as constructively defined in (1.6) is usually easy to check: if the n0 appearing in (1.5) is 1, then (1.6) is automatic.

The statistical applications of the above include the Metropolis algorithm and its variants, which produce Markov transition functions satisfying (1.1). This algorithm was originally developed for estimating certain distributions and expectations arising in statistical physics, but can also be used in Bayesian analysis; see Tierney (1991) for a review.

However, in the usual problems of Bayesian statistics, currently the most commonly used Markov chain is one that is used to estimate the unknown joint distribution π = π_{X^{(1)},...,X^{(p)}} of the (possibly vector-valued) random variables (X^{(1)}, . . . , X^{(p)}) by updating the coordinates one at a time, as follows. We suppose that we know the conditional distributions π_{X^{(i)} | {X^{(j)}, j≠i}}, i = 1, . . . , p, or at least that we are able to generate observations from these conditional distributions. If Xm = (X_m^{(1)}, . . . , X_m^{(p)}) is the current state, the next state X_{m+1} = (X_{m+1}^{(1)}, . . . , X_{m+1}^{(p)}) of the Markov chain is formed as follows. Generate X_{m+1}^{(1)} from π_{X^{(1)} | {X^{(j)}, j≠1}}(·, X_m^{(2)}, . . . , X_m^{(p)}), then X_{m+1}^{(2)} from π_{X^{(2)} | {X^{(j)}, j≠2}}(X_{m+1}^{(1)}, ·, X_m^{(3)}, . . . , X_m^{(p)}), and so on until X_{m+1}^{(p)} is generated from π_{X^{(p)} | {X^{(j)}, j≠p}}(X_{m+1}^{(1)}, . . . , X_{m+1}^{(p−1)}, ·). If P is the transition function that produces X_{m+1} from Xm, then it is easy to see that P satisfies (1.1).
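As a concrete sketch of one sweep of this update (our illustration; the standard bivariate normal target and its correlation are assumptions chosen because both full conditionals are available in closed form):

    import numpy as np

    rng = np.random.default_rng(1)
    r = 0.8  # correlation of the toy standard bivariate normal target

    def sss_sweep(x1, x2):
        # Update each coordinate in turn from its conditional given the rest.
        # For the standard bivariate normal, pi(x1 | x2) = N(r*x2, 1 - r**2)
        # and pi(x2 | x1) = N(r*x1, 1 - r**2).
        x1 = r * x2 + np.sqrt(1 - r**2) * rng.normal()
        x2 = r * x1 + np.sqrt(1 - r**2) * rng.normal()  # uses the new x1
        return x1, x2

    x1, x2 = 0.0, 0.0
    draws = np.empty((5000, 2))
    for m in range(5000):
        x1, x2 = sss_sweep(x1, x2)
        draws[m] = (x1, x2)
    print(np.corrcoef(draws[1000:].T))  # off-diagonal entries approach 0.8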

This method is reminiscent of the simulation method described in Geman and Geman (1984). In that paper, p, the number of coordinate indices in the vector (X^{(1)}, . . . , X^{(p)}), is usually of the order of N × N where N = 256 or higher. They assume that these indices form a graph with a meaningful neighborhood structure and that π is a Gibbs distribution, so that the conditional distributions π_{X^{(i)} | {X^{(j)}, j≠i}}, i = 1, . . . , p, depend on many fewer than p − 1 coordinates. They also assume that each random variable Xi takes only a finite number k of values and that π gives positive mass to all possible k^{N²} values. Geman and Geman (1984) appeal to the ergodic theorem on Markov chains with a finite state space and prove that this simulation method works. They prove other interesting results on how this can be extended when a temperature parameter T (that can be incorporated into π) is allowed to vary. This may be the reason why the method described in the previous paragraph has come to be known as the Gibbs sampler. We consider this to be a misnomer, because no Gibbs distribution nor any graph with a nontrivial neighborhood structure supporting a Gibbs distribution is involved in this method; we will refer to it simply as successive substitution sampling.

We note that this algorithm depends on π only through the conditional distributions π_{X^{(i)} | {X^{(j)}, j≠i}}. Perhaps the first thought that comes to mind when considering this method is to ask whether or not, in general, these conditionals determine the joint distribution π. The answer is that in general they do not; we give an example at the end of Section 2. A necessary consequence of convergence of successive substitution sampling is that the joint distribution is determined by the conditionals. It is therefore clear that any theorem giving conditions guaranteeing convergence from every starting point also gives, indirectly, conditions which guarantee that the conditionals determine the joint distribution π.

We now give a very brief description of how this method is useful in some Bayesian problems. We suppose that the parameter θ has some prior distribution, that we observe a data point Y whose conditional distribution given θ is L(Y | θ), and that we wish to obtain L(θ | Y), the conditional distribution of θ given Y. It is often the case that if we consider an (unobservable) auxiliary random variable Z, then the distribution π_{θ,Z} = L(θ, Z | Y) has the property that π_{θ|Z} (= L(θ | Y, Z)) and π_{Z|θ} (= L(Z | Y, θ)) are easy to calculate. Typical examples are missing and censored data problems. If we have a conjugate family of prior distributions on θ, then we may take Z to be the missing or the censored observations, so that π_{θ|Z} is easy to calculate. Successive substitution sampling then gives a random observation with distribution (approximately) L(θ, Z | Y), and retaining the first coordinate gives an observation with distribution (approximately) equal to L(θ | Y).

Another application arises when the parameter θ is high dimensional, and we are in a nonconjugate situation. Let us write θ = (θ1, . . . , θk), so that what we wish to obtain is π_{θ1,...,θk}. Direct calculation of the posterior will involve the evaluation of a k-dimensional integral, which may be difficult to accomplish. On the other hand, application of the successive substitution sampling algorithm involves the generation of one-dimensional random variables from π_{θi | {θj, j≠i}}, which is available in closed form, except for a normalizing constant. There exist very efficient algorithms for doing this (see Gilks and Wild (1992)), and the use of these algorithms is made routine by the computer language BUGS (Thomas, Spiegelhalter and Gilks (1992)).


Results which give not only convergence of the Markov chain to its invariant distribution but also convergence at a geometric rate are obviously extremely desirable. General results establishing that the convergence rate is geometric are given in Theorem 1 of Schervish and Carlin (1992) and in Theorem 2.1 of Chan (1993). For certain models it is possible to give actual bounds for the geometric rate of convergence; see Goodman and Sokal (1989) and Amit (1991) for examples involving continuous state spaces. It is, however, important to keep in mind that for most problems arising in Bayesian statistics, checking conditions that ensure convergence at a geometric rate is an order of magnitude more difficult than checking the conditions needed for simple convergence, for example Theorems 1 and 5 in the present paper. This is because in cases where the dimension of the state space of the Markov chain is very high, it is usually extremely difficult to check the integrability conditions needed. This situation arises in Bayesian nonparametrics, for example; see Doss (1991) for an illustration.

In addition, the Markov chain may converge but not at a geometric rate. This can happen even in very simple situations. An illustration is provided in the example below, which is due to T. Sellke. Let U be a random variable on ℝ with distribution ν, which we take to be the standard Cauchy distribution. Let the conditional distribution of V given U be the Beta distribution with parameters 2 and 2, shifted so that it is centered at U, and let X = (U, V). If we start successive substitution sampling at X0 = (0, 0), then it is easy to see that U1 must be in the interval (−1, 1), and in fact, the value of U can change by at most one unit at each iteration. Thus, the distribution of Un is concentrated in the interval (−n, n). In particular,

sup_{C∈B} |P(Un ∈ C | U0 = 0) − ν(C)| ≥ ν{(−∞,−n) ∪ (n,∞)} ∼ (2/π)(1/n),

so that the rate of convergence cannot be geometric. The distribution ν could have been taken to be any distribution whose tails are "thicker than those of the exponential distribution," and in fact, we can make the rate of convergence arbitrarily slow by taking the tails of ν to be sufficiently thick.

It is not difficult to see that if we select the starting point at random from a bounded density concentrated in a neighborhood of the origin, then this example provides a simple counterexample to Theorem 3 of Tanner and Wong (1987), which asserts convergence at a geometric rate.
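The 1/n lower bound in the display above is easy to tabulate. The following sketch (ours, not from the paper) evaluates the Cauchy tail mass ν{(−∞,−n) ∪ (n,∞)}, which decays like (2/π)/n rather than geometrically.

    import numpy as np
    from scipy.stats import cauchy

    for n in [1, 10, 100, 1000, 10000]:
        tail = 2 * cauchy.sf(n)            # nu{(-inf,-n) u (n, inf)}
        print(n, tail, (2 / np.pi) / n)    # tail ~ (2/pi)/n: polynomial, not geometric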

This paper is organized as follows. Section 2 gives the proofs of Theorems 1 and 2. Section 3 discusses briefly some issues to consider when deciding how to use the output of the Markov chain to estimate π and functionals of π.

2 Ergodic Theorems for Markov Chains on General State Spaces

The proofs of Theorems 1 and 2 rest on the familiar technique of regenerative events in a Markov chain. See for instance Athreya and Ney (1978). In Section 2.1, we prove Proposition 1, in which we assume that the set A is a singleton α, so that ρ is the degenerate probability measure on {α}. We also assume that the singleton α is an aperiodic state, a condition which is stated more fully as Condition (C) in Proposition 1 below. Under these simplified assumptions we establish ergodicity and some laws of large numbers for the Markov chain.

In Section 2.2 we establish Theorem 1 as follows. In Proposition 2 we show that, when n0 = 1, under the conditions of Theorem 1, a general Markov chain can be reduced to one satisfying the above simplified assumptions of Proposition 1. This is done by enlarging the state space with an extra point ∆ and extending the Markov chain to the enlarged space. We then show that this singleton set {∆} satisfies the simplified assumptions of Proposition 1. From this it follows that the extended chain is ergodic. After this step we deduce that the original chain is also ergodic. Finally, we show how the condition n0 = 1 can be relaxed under the aperiodicity condition (1.6).

In Section 2.3 we prove Theorem 2, which asserts convergence of averages of transition functions and averages of functions of the Markov chain, without the aperiodicity assumption (1.6). The key step in the proof is to recognize that the Markov chain observed at time points which are multiples of n0 is an embedded Markov chain satisfying the conditions of Proposition 2 and with an invariant probability distribution π0 which is the restriction of π to the set A0 defined by (2.29). In the Markov chain literature, mean ergodicity is usually obtained as an elementary consequence of ergodicity in the aperiodic case and the existence of a well-defined period and cyclically moving disjoint subclasses. Our proof circumvents, in a way which we believe is new, the need for well-defined periodicity and cyclically moving disjoint subclasses.

2.1 State Spaces with a Distinguished Point

Let (X, B) be a measurable space and let {Xn, n = 0, 1, . . .} be a Markov chain with a probability transition function P(·, ·). Fix a point α in X. For convenience, we will refer to this point as the distinguished point. We will often write just α for the singleton set {α}. The number of visits to α, Nn({α}) and N({α}), will be denoted simply by Nn and N, respectively. The first time the chain visits α after time 0, namely T({α}), will be denoted simply by T. Let

C0 = {x : Px(T < ∞) = 1} = {x : Px(T = ∞) = 0} (2.1)

and

X0 = {x : Px(T < ∞) > 0} (2.2)

be the set of all states from which α can be reached with probability 1 and the set of all states from which α is accessible, respectively.

Definition 1 The state α is said to be transient if Pα(T < ∞) < 1 and recurrent if Pα(T < ∞) = 1. The state α is said to be positive recurrent if Eα(T) < ∞.

All the results of this section are part of the applied probability literature; see e.g. Asmussen (1987). We present the results below, with proofs, to make the paper self-contained.

Proposition 1 Suppose that the transition function P(x,C) satisfies the following conditions:

(A) π is an invariant probability measure for P.

(B) π{x : Px(T < ∞) > 0} = 1.

Then

π(C0) = 1 and sup_{C∈B} |(1/n) ∑_{j=0}^{n} P^j(x,C) − π(C)| → 0 for each x ∈ C0. (2.3)

Suppose in addition that

(C) g.c.d.{n : P^n(α, α) > 0} = 1.

Then

sup_{C∈B} |P^n(x,C) − π(C)| → 0 for each x ∈ C0. (2.4)

The proof of this proposition is given after the remark following the proof of Lemma 3.

Lemma 1 If Conditions (A) and (B) of Proposition 1 hold, then π(α) > 0 and α is positive recurrent.

Proof We first establish that π(α) > 0. From Condition (A) it follows that π(α) = ∫ π(dx) P^n(x, α) for n = 1, 2, . . ., and hence

n π(α) = ∫ π(dx) Gn(x, α) (2.5)

for all n. The monotone convergence theorem and Condition (B) imply that

lim_n n π(α) = ∫ π(dx) G(x, α) > 0, (2.6)

and hence π(α) > 0.

Let the Markov chain start at some x ∈ X. Let T1 = T and Tk = inf{n : n > T_{k−1}, Xn = α} for k = 2, 3, . . ., with the usual convention that the infimum of the empty set is ∞. If N < ∞ then only finitely many Tk's are finite. If N = ∞ then all the Tk's are finite. In the latter case, the Markov chain starts afresh from α at time Tk, and hence Tk − T_{k−1}, k = 2, 3, . . . are independent and identically distributed with distribution H, where H(n) = Pα(T ≤ n). These facts, the Strong Law of Large Numbers and the inequality

Nn / T_{Nn+1} ≤ Nn / n ≤ Nn / T_{Nn} (2.7)

imply that

(1/n) Nn → (1 / Eα(T)) I(N = ∞) (2.8)

[Px]-a.e., for each x ∈ X. From the bounded convergence theorem, it follows that

(1/n) Gn(x, α) = Ex((1/n) Nn) → (1 / Eα(T)) Px(N = ∞) for each x ∈ X. (2.9)

Divide both sides of (2.5) by n, take limits and compare with the above. By using the fact that π is a probability measure and applying the bounded convergence theorem, we obtain

π(α) = (1 / Eα(T)) ∫ π(dx) Px(N = ∞). (2.10)

Since π(α) > 0, it follows that ∫ π(dx) Px(N = ∞) > 0 and Eα(T) < ∞, and hence α is positive recurrent. ♦

The arguments leading to the conclusion π(α) > 0 in the above lemma were based on (2.5) and (2.6), and did not use the full force of Condition (B). The following corollary records that fact and will be used later in the proof of Lemma 6.

Corollary 1 Let π satisfy (A) of Proposition 1 and let E ∈ B be such that

π({x : G(x,E) > 0}) > 0.

Then π(E) > 0.

The fact that α is positive recurrent gives us a way of obtaining an explicit form for a finite invariant measure ν and showing that it must be a multiple of π.

Lemma 2 Let α be recurrent. Let

ν(C) = Eα(∑_{j=0}^{T−1} I(Xj ∈ C)) = ∑_{n=0}^{∞} Pα(Xn ∈ C, T > n) (2.11)

be the expected number of visits to C between consecutive visits to α, beginning from α. Then ν is an invariant measure for P(·, ·) with ν(X0^c) = 0, and is unique up to a multiplicative constant; more precisely,

ν(·) = ∫ P(x, ·) ν(dx),

and if ν′ is any other invariant measure with ν′(X0^c) = 0, then

ν′(C) = ν′(α) ν(C) for all C ∈ B.

The measure ν also has the property

ν(C0^c) = 0.

Suppose that Conditions (A) and (B) of Proposition 1 hold, so that α is positive recurrent and π is an invariant probability measure for P(·, ·) with π(X0^c) = 0. Then

ν(X) = Eα(T) < ∞

and

π(C) = ν(C) / Eα(T)

is the unique invariant probability measure with π(C0) = 1.

Proof Since ∑_{n=0}^{T−1} I(Xn = α) = 1, we have ν(α) = 1 = Pα(T < ∞). To show that ν(C0^c) = 0, notice that for all n

0 = Pα(T = ∞) = Eα(Pα(T = ∞ | X1, X2, . . . , Xn)) = Eα(P_{Xn}(T = ∞) I(T > n)).

From this it follows that

0 = Pα{P_{Xn}(T = ∞) I(T > n) > 0} = Pα{Xn ∈ C0^c, T > n}

for each n. From the definition of ν in (2.11) it now follows that ν(C0^c) = 0. We now show that ν is an invariant measure. Let f(x) be a bounded measurable function on (X, B). Then

∫ ν(dx) f(x) = ∑_{n=0}^{∞} Eα(f(Xn) I(T > n))
= f(α) + ∑_{n=1}^{∞} (Eα(f(Xn) I(T > n−1)) − Eα(f(Xn) I(T = n)))
= f(α) + ∑_{n=1}^{∞} Eα(Eα(f(Xn) I(T > n−1) | X0, X1, . . . , X_{n−1})) − ∑_{n=1}^{∞} Eα(f(Xn) I(T = n))
= f(α) − f(α) Pα(T < ∞) + ∑_{n=1}^{∞} Eα(E_{X_{n−1}}(f(Xn)) I(T > n−1))
= ∑_{n=1}^{∞} Eα(∫ P(X_{n−1}, dy) f(y) I(T > n−1))
= ∑_{n=0}^{∞} Eα(∫ P(Xn, dy) f(y) I(T > n))
= ∫_{y∈X} (∫_{x∈X} ν(dx) P(x, dy)) f(y),

where the fourth equality in the above follows from the Markov property. This shows that ν is an invariant measure.

Let ν′ be any other invariant measure for P(·, ·) satisfying ν′(X0^c) = 0. Fix C ∈ B. Then for C such that α ∉ C,

ν′(C) = ∫ ν′(dx) P(x,C)
= ν′(α) Pα(X1 ∈ C) + ∫_{x≠α} ν′(dx) Px(X1 ∈ C)
= ν′(α) Pα(X1 ∈ C) + ∫_{y∈X} ∫_{x≠α} ν′(dy) P(y, dx) Px(X1 ∈ C)
= ν′(α) Pα(X1 ∈ C) + ∫ ν′(dy) Py(X2 ∈ C, T > 1)
...
= ν′(α) ∑_{m=1}^{n} Pα(Xm ∈ C, T > m−1) + ∫ ν′(dy) Py(X_{n+1} ∈ C, T > n)
≥ ν′(α) ∑_{m=1}^{n} Pα(Xm ∈ C, T > m−1)
≥ ν′(α) ∑_{m=1}^{n} Pα(Xm ∈ C, T > m)

for each n. In the last line above we used the fact that {Xm ∈ C, T > m−1} = {Xm ∈ C, T > m}, since α ∉ C. Thus ν′(C) ≥ ν′(α) ν(C) for all C, since ν(α) = 1. Let λ(C) = ν′(C) − ν′(α) ν(C). Then λ is an invariant nonnegative measure and λ(α) = 0 since ν(α) = 1. Thus

0 = n λ(α) = ∫ Gn(x, α) λ(dx) → ∫ G(x, α) λ(dx)

by the monotone convergence theorem. Since X0 = {x : G(x, α) > 0}, this implies that λ(X0) = 0. This proves that

ν′(C) = ν′(α) ν(C), (2.12)

which shows that ν is the unique invariant measure satisfying ν(X0^c) = 0, up to a multiplicative constant.

We now assume that α is positive recurrent. Since ∑_{n=0}^{T−1} I(Xn ∈ X) = T, we have ν(X) = Eα(T) < ∞. Let π be an invariant probability measure satisfying π(X0^c) = 0. From (2.12), we have the equality

π(C) = π(α) ν(C).

From the earlier part of this proof it now follows that π is the unique invariant probability measure. ♦

In the following, we consider general measurable functions f(x) with ∫ |f(y)| π(dy) < ∞, instead of I(x = α) as was done in Lemmas 1 and 2.

Corollary 2 Let Conditions (A) and (B) of Proposition 1 hold. Let f(x) be a measurable function on (X, B) with ∫ |f(y)| π(dy) < ∞. Then π(Af) = 1, where

Af = {x : Px{(1/n) ∑_{j=1}^{n} f(Xj) → ∫ f(x) dπ(x)} = 1; Ex((1/n) ∑_{j=1}^{n} f(Xj)) → ∫ f(x) dπ(x)}.

Proof By considering positive and negative parts, we can assume that f(·) ≥ 0. Let {Tk, k = 1, 2, . . .} and Nn be as in Lemma 1. Define Un = ∑_{j=1}^{min(n, T1−1)} f(Xj) and Vr = ∑_{j=Tr}^{T_{r+1}−1} f(Xj). Note that V1, V2, . . . are i.i.d. and, from Lemma 2, E(V1) = ∫ f(x) π(dx) Eα(T1). Since f(x) ≥ 0, we have the inequality

∑_{r=1}^{Nn−1} Vr ≤ ∑_{j=1}^{n} f(Xj) ≤ Un + ∑_{r=1}^{Nn} Vr.

For x ∈ C0 (defined in (2.1)), Px(Tk < ∞ for all k ≥ 1) = 1 and hence, from (2.9) and the law of large numbers,

Nn/n → 1/Eα(T1) and (1/n) ∑_{r=1}^{Nn} Vr → Eα(V1)/Eα(T1) = ∫ f(x) π(dx)

[Px]-a.e. From Fatou's lemma, this yields

lim inf Ex((1/n) ∑_{j=1}^{n} f(Xj)) ≥ ∫ f(x) π(dx)

[π]-a.e. To obtain the reverse inequality

lim sup Ex((1/n) ∑_{j=1}^{n} f(Xj)) ≤ ∫ f(x) π(dx) (2.13)

we begin by using Wald's identity E(∑_{r=1}^{Nn} Vr) = Eα(V1) Eα(Nn) to obtain

Ex((1/n) ∑_{j=1}^{n} f(Xj)) ≤ Ex((1/n) ∑_{j=1}^{T1−1} f(Xj)) + Eα(V1) Eα(Nn/n).

We have already seen that the second term on the right hand side of the above converges to ∫ f(x) dπ(x). We now prove that Ex(∑_{j=1}^{T1−1} f(Xj)) < ∞ with [π]-probability one, and this will establish the inequality (2.13). Note that, for r = 0, 1, . . .,

∞ > Eα(∑_{j=0}^{T1−1} f(Xj)) ≥ Eα(∑_{j=r}^{T1−1} f(Xj) I(T1 > r)) = Eα(E_{Xr}(∑_{j=0}^{T1−1} f(Xj)) I(T1 > r)),

since I(T1 > r) is measurable with respect to {X1, . . . , Xr}. Thus, Ex(∑_{j=0}^{T1−1} f(Xj)) < ∞ for almost all x with respect to Pα(Xr ∈ · ; T1 > r). From Lemma 2 we conclude that ∑_{r=0}^{∞} Pα(Xr ∈ C; T > r) = Eα(T1) π(C). Thus Ex(∑_{j=0}^{T1−1} f(Xj)) < ∞ with [π]-probability one. This completes the proof of Corollary 2. ♦

To get the convergence assertions (2.3) and (2.4) of Proposition 1 we need the following lemma from renewal theory.

Lemma 3 Let {pn, n = 0, 1, . . .} be a probability distribution with p0 = 0 and let µ = ∑_{n=1}^{∞} n pn < ∞. Let {ηi, i = 1, 2, . . .} be a sequence of i.i.d. random variables with distribution {pn}. Let S0 = 0 and Sk = ∑_{j=1}^{k} ηj for k ≥ 1. Define {p_n^{(k)}, n = 1, 2, . . .}, k = 1, 2, . . ., recursively by p_n^{(1)} = pn, p_n^{(k)} = ∑_{0≤j≤n} p_j^{(k−1)} p_{n−j} = P(Sk = n). For n = 0, 1, . . ., define

rn = ∑_{k=0}^{∞} p_n^{(k)}. (2.14)

Then

(a) rn is the unique solution of the so-called renewal equations

r0 = 1, rn = ∑_{j=1}^{n} r_{n−j} pj, n = 1, 2, . . . .

Furthermore,

(b) (1/n) ∑_{j=0}^{n} rj → 1/µ as n → ∞.

If the additional condition g.c.d.{n : pn > 0} = 1 holds, then

(c) rn → 1/µ as n → ∞.


Proof It is easy to establish (a) by direct verification. To prove Part (b), we note that ∑_{j=0}^{n} rj = ∑_{k=0}^{∞} P(Sk ≤ n) = E(N(n)), where N(n) = sup{k : Sk ≤ n}. By the Strong Law of Large Numbers and the inequalities

S_{N(n)} ≤ n < S_{N(n)+1},

it follows that

N(n)/n → 1/µ w.p. 1.

Since N(n)/n ≤ 1, it follows that E(N(n)/n) = (1/n) ∑_{j=0}^{n} rj → 1/µ, which establishes (b). Part (c) is the well known discrete renewal theorem, for which there are many proofs in standard texts, some of which are purely analytic (see, e.g., Chapter XIII.10 of Feller (1950)) and others probabilistic (see, e.g., Chapter 2 of Hoel, Port, and Stone (1972)). ♦

Remark The proofs given in this paper require only the convergence of rn to 1/µ asserted in Part (c) of Lemma 3. In fact, geometric bounds on the tail behavior of the probability distribution {pn} can be used to obtain geometric bounds on the rate of convergence of rn to 1/µ (Stone (1965)). These can in turn be used to obtain results on geometric convergence in the ergodic theorem for Markov chains; see Section 2.4 of Athreya, Doss, and Sethuraman (1992).

Proof of Proposition 1 Let D be the collection of all measurable functions f on (X, B) with sup_y |f(y)| ≤ 1. Let f ∈ D. Then for any x ∈ X,

Ex(f(Xn)) = Ex(f(Xn) I(T > n)) + ∑_{k=0}^{n} Px(T = k) Eα(f(X_{n−k})), n = 0, 1, . . . . (2.15)

Let vn = Eα(f(Xn)), an = Eα(f(Xn) I(T > n)) and pn = Pα(T = n), n = 0, 1, . . .. Note that vn and an depend on the function f while pn does not. Putting x = α in (2.15) we get the important identity

vn = an + ∑_{k=0}^{n} pk v_{n−k}. (2.16)

It is not difficult to check that vn = ∑_{k=0}^{n} ak r_{n−k} is the unique solution to (2.16), where rn is as defined in (2.14). Thus

(1/n) ∑_{j=0}^{n} vj = (1/n) ∑_{j=0}^{n} ∑_{k=0}^{j} ak r_{j−k} = (1/n) ∑_{k=0}^{n} ak R_{n−k} = ∑_{k=0}^{∞} ak (R_{n−k}/n) I(k ≤ n),

where Rn = ∑_{j=0}^{n} rj. Also,

(1/µ) ∑_{j=0}^{∞} aj = Eα(∑_{j=0}^{T−1} f(Xj)) / Eα(T) = (∫ f dν) / Eα(T) = ∫ f dπ.

Thus for f ∈ D,

|(1/n) ∑_{j=0}^{n} vj − ∫ f dπ| ≤ ∑_{k=0}^{∞} |ak| |(R_{n−k}/n) I(k ≤ n) − 1/µ|
≤ 2 ∑_{j=m}^{∞} |aj| + (∑_{j=0}^{∞} |aj|) sup_{n−m≤k≤n} |Rk/n − 1/µ|
≤ 2 ∑_{j=m}^{∞} Pα(T > j) + Eα(T) sup_{n−m≤k≤n} |Rk/n − 1/µ|

for any positive integer m. Note that for fixed m, sup_{n−m≤k≤n} |Rk/n − 1/µ| → 0 as n → ∞ from Part (b) of Lemma 3, and ∑_{j=m}^{∞} Pα(T > j) → 0 as m → ∞, since α is positive recurrent. By first fixing m and letting n → ∞, and then letting m → ∞, we get

|(1/n) ∑_{j=0}^{n} vj − ∫ f dπ| → 0 uniformly in f as n → ∞. (2.17)

Let x ∈ C0. Let wn = Ex(f(Xn)), bn = Ex(f(Xn) I(T > n)) and gn = Px(T = n). Note that for a fixed x, bn → 0 as n → ∞, uniformly in f, and that gn is a probability sequence which does not depend on f. Using equation (2.15) once again, we see that wn satisfies the equation

wn = bn + ∑_{k=0}^{n} gk v_{n−k}, n = 0, 1, . . . . (2.18)

Using (2.17), we conclude that

(1/n) ∑_{j=0}^{n} wj = (1/n) ∑_{j=0}^{n} bj + ∑_{k=0}^{n} gk (1/n) ∑_{j=0}^{n−k} vj → ∫ f dπ

uniformly in f as n → ∞. This establishes (2.3) of Proposition 1.

We now use Condition (C). Under this assumption, g.c.d.{n : P^n(α, α) > 0} = 1, and thus g.c.d.{n : pn > 0} = 1; see for instance the lemma on p. 29 of Chung (1967). Thus, from Part (c) of Lemma 3 we have rn → 1/µ. Repeating the arguments leading to (2.17) and (2.18) with this stronger result on rn, we see that vn → ∫ f dπ and wn → ∫ f dπ uniformly in f ∈ D. This proves (2.4) and completes the proof of Proposition 1. ♦

2.2 Proof of Theorem 1 for General Markov Chains

We will now establish Theorem 1 under the condition that the n0 appearing in (1.5) is 1. This is stated as Proposition 2 below, and though it is technically weaker, its proof contains the heart of the arguments needed to establish Theorem 1.

Proposition 2 Suppose that A ∈ B and let ρ be a probability measure on (X, B) with ρ(A) = 1. Suppose that the transition function P(x,C) of the Markov chain {Xn} satisfies (1.1), (1.4), and (1.5), where the n0 appearing in (1.5) is equal to 1. Then there is a set D0 such that

π(D0) = 1 and sup_{C∈B} |P^n(x,C) − π(C)| → 0 for each x ∈ D0. (2.19)

Proof The proof consists of adding a point ∆ to X, defining a transition function on the enlarged space and appealing to Proposition 1.


Consider the space (X̃, B̃), where X̃ = X ∪ {∆} and B̃ is the smallest σ-field containing B and {∆}. Let ε* = ε/2, and define the function P̃(x,C) on X̃ × B̃ by

P̃(x,C) =
    P(x,C)                if x ∈ X \ A, C ∈ B,
    P(x,C) − ε*ρ(C)       if x ∈ A, C ∈ B,
    ε*                    if x ∈ A, C = {∆},
    ∫_A ρ(dz) P̃(z,C)      if x = ∆, C ∈ B.    (2.20)

Also, define the set function π̃ on (X̃, B̃) by

π̃(C) =
    π(C) − ε*ρ(C)π(A)     if C ∈ B,
    ε*π(A)                if C = {∆}.    (2.21)

It is easy to verify that P̃(x,C) and π̃(C) extend to B̃ as a transition probability function and a probability measure, respectively.

We will now show that the transition probability function P̃(x,C), together with π̃ and the distinguished point ∆, satisfy Conditions (A), (B), and (C) of Proposition 1.

Clearly P̃(∆, ∆) = ε* > 0, so that Condition (C) of Proposition 1 is satisfied. Recall that X0 = {x ∈ X : G(x,A) > 0}. Let X̃0 = {x ∈ X̃ : G̃(x, ∆) > 0}. If x ∈ X0 \ A, then G̃(x, ∆) ≥ ∫_A G̃(x, dy) P̃(y, ∆) ≥ ε* G̃(x, A) > 0 (note that, starting from x ∈ X0 \ A, the P̃-chain agrees with the P-chain until the first visit to A, so that G̃(x,A) ≥ Px(T(A) < ∞) > 0). This shows that X0 ⊂ X̃0. Since G̃(∆, ∆) ≥ P̃(∆, ∆) = ε* > 0, we also have ∆ ∈ X̃0, i.e. X̃0 ⊃ X0 ∪ {∆}. Since π(X0) = 1 by (1.4), it follows that π̃(X̃0) = 1 by (2.21). Thus, Condition (B) of Proposition 1 is satisfied.

Next, for C ∈ B, we have

∫_X̃ π̃(dx) P̃(x,C) = ∫_X (π(dx) − ε*ρ(dx)π(A)) P̃(x,C) + ε*π(A) ∫_X ρ(dx) P̃(x,C)
= ∫_X π(dx) P̃(x,C)
= ∫_X π(dx) (P(x,C) − ε*ρ(C) I(x ∈ A))
= π(C) − ε*ρ(C)π(A)
= π̃(C).

When C = {∆}, we have

∫_X̃ π̃(dx) P̃(x,∆) = ∫_X (π(dx) − ε*ρ(dx)π(A)) ε* I(x ∈ A) + ε*π(A) ∫_X ρ(dx) ε* I(x ∈ A)
= ε*π(A)
= π̃(∆).

This verifies Condition (A) of Proposition 1. Let

D̃0 = {x : x ∈ X̃, P̃x(T({∆}) < ∞) = 1}.

From Proposition 1 it follows that

π̃(D̃0) = 1 and sup_{C∈B̃} |P̃^n(x,C) − π̃(C)| → 0 for each x ∈ D̃0. (2.22)

To translate (2.22) into a result for P^n(x,C), we define a function v(x,C) on X̃ × B by

v(x,C) = I(x ∈ C) if x ∈ X, and v(x,C) = ρ(C) if x = ∆.

We may view v(x,C) as a transition function from X̃ into X. The following lemma shows how one can go from P̃^n(x,C) to P^n(x,C) and back. The proof of Proposition 2 is continued after Lemma 5.

Lemma 4 The transition functions P(x,C), P̃(x,C) and v(x,C) and the probability measures π and π̃ are related as follows:

P(x,C) = ∫_X̃ P̃(x, dy) v(y,C) = P̃(x,C) + P̃(x,∆) ρ(C) for x ∈ X, C ∈ B, (2.23)

P̃(x,C) = ∫_X v(x, dy) P̃(y,C) for x ∈ X̃, C ∈ B̃, (2.24)

P^n(x,C) = ∫_X̃ P̃^n(x, dy) v(y,C) = P̃^n(x,C) + P̃^n(x,∆) ρ(C) for x ∈ X, C ∈ B, (2.25)

and

π(C) = ∫_X̃ π̃(dx) v(x,C) = π̃(C) + π̃(∆) ρ(C) for C ∈ B. (2.26)

Proof These are proved by direct verification. For x ∈ X, C ∈ B, we have

∫_X̃ P̃(x, dy) v(y,C) = ∫_X P̃(x, dy) I(y ∈ C) + ε* I(x ∈ A) ρ(C)
= P(x,C) − ε* I(x ∈ A) ρ(C) + ε* I(x ∈ A) ρ(C)
= P(x,C).

Similarly, for x ∈ X̃, C ∈ B̃, we get

∫_X v(x, dy) P̃(y,C) = P̃(x,C) if x ∈ X, and ∫ ρ(dy) P̃(y,C) = P̃(∆, C) if x = ∆.

We prove (2.25) by induction on n. For n = 1, this is just (2.23). Assume that (2.25) has been proved for n − 1. For x ∈ X, C ∈ B, we have

∫_X̃ P̃^n(x, dy) v(y,C) = ∫_{z,y∈X̃} P̃^{n−1}(x, dz) P̃(z, dy) v(y,C)
= ∫_{z,y∈X̃, w∈X} P̃^{n−1}(x, dz) v(z, dw) P̃(w, dy) v(y,C)
= ∫_{z∈X̃, w∈X} P̃^{n−1}(x, dz) v(z, dw) P(w,C)
= ∫_{w∈X} P^{n−1}(x, dw) P(w,C)
= P^n(x,C),

where the second equality follows from (2.24), the third follows from (2.23), and the fourth from the induction step.

Finally, for C ∈ B, we notice that

∫_X̃ π̃(dx) v(x,C) = ∫_X (π(dx) − ε*π(A)ρ(dx)) v(x,C) + ε*π(A) ρ(C) = π(C).

This completes the proof of the lemma. ♦

The next lemma shows that π dominates ρ.

Lemma 5 Let C ∈ B. Then

π(C) = 0 implies ρ(C) = 0. (2.27)

Proof From the careful choice of ε* = ε/2 used to define P̃(x,C) in definition (2.20), we have

P̃(x,C) = P(x,C) − ε*ρ(C) ≥ ε*ρ(C) whenever x ∈ A and C ∈ B. (2.28)

Since π̃ is an invariant probability measure for P̃(·, ·),

π̃(C) = ∫_X̃ P̃(x,C) π̃(dx) ≥ ∫_A P̃(x,C) π̃(dx) ≥ ε*ρ(C) π̃(A) = ε*ρ(C) π(A)(1 − ε*) = π̃(∆) ρ(C)(1 − ε*),

by using the identity π̃(∆) = ε*π(A). From Lemma 1, applied to the Markov chain with transition probability function P̃(·, ·) on (X̃, B̃) and distinguished point ∆, we find that π̃(∆) > 0. This establishes (2.27). ♦

Completion of the proof of Proposition 2 Let D0 = D̃0 − {∆}. From (2.22), (2.25), and (2.26), we have π̃(D̃0) = 1, and

sup_{C∈B} |P^n(x,C) − π(C)| = sup_{C∈B} |∫_X̃ P̃^n(x, dy) v(y,C) − ∫_X̃ π̃(dy) v(y,C)| → 0 for each x ∈ D0.

This means that

π̃(X̃ − D̃0) = π̃(X − D0) = 0.

From the proof of Lemma 5, it follows that

ρ(X − D0) = 0.

Now, from the definition of π̃(·) in (2.21),

π(X − D0) = π̃(X − D0) + ε*ρ(X − D0)π(A) = 0.

This completes the proof that π(D0) = 1 and

sup_{C∈B} |P^n(x,C) − π(C)| → 0 for all x ∈ D0. ♦

We now drop the condition n0 = 1 and prove Theorem 1.


Proof of Theorem 1 Let M = {m : there is an εm > 0 such that inf_{x∈A} P^m(x, ·) ≥ εm ρ(·)}. Then g.c.d.(M) = 1. Let m1, m2 ∈ M. Then for x ∈ A and C ∈ B,

P^{m1+m2}(x,C) ≥ ∫_A P^{m2}(y,C) P^{m1}(x, dy) ≥ ε_{m2} ρ(C) ε_{m1} ρ(A) = ε_{m1} ε_{m2} ρ(C).

Thus m1 + m2 ∈ M. Since g.c.d.(M) = 1, there is an integer L such that m ≥ L implies that m ∈ M. Now from (1.4), for [π]-a.e. x, there is an s (which may depend on x) such that P^s(x,A) > 0. Fix an m ∈ M. For any integer k such that km − s ≥ L, we have P^{km}(x,A) ≥ ∫_A P^{km−s}(y,A) P^s(x, dy) ≥ ε_{km−s} ρ(A) P^s(x,A) > 0. Thus for [π]-a.e. x, P^{km}(x,A) > 0 for all large k. This means that the Markov chain {X_{nm}, n = 0, 1, . . .} satisfies all the conditions (1.1), (1.4) and (1.5) with n0 = 1. From Proposition 2, there is a D0 such that π(D0) = 1 and for each x ∈ D0, ∆_{km}(x) := sup_{C∈B} |P^{km}(x,C) − π(C)| → 0 as k → ∞. We also have P^r(y, D0^c) = 0 for 0 ≤ r ≤ m − 1 for [π]-a.e. y, since 0 = π(D0^c) = ∫ P^r(y, D0^c) π(dy) for 0 ≤ r ≤ m − 1. For any n, write n = km + r with 0 ≤ r ≤ m − 1. Let D1 = {x : P^r(x, D0^c) = 0, 0 ≤ r ≤ m − 1}. Then π(D1) = 1 and for x ∈ D1,

∆n(x) ≤ ∫_{D0} sup_{C∈B} |P^{km}(y,C) − π(C)| P^r(x, dy) ≤ ∫_{D0} sup_{C∈B} |P^{km}(y,C) − π(C)| ∑_{r=0}^{m−1} P^r(x, dy),

which goes to zero as k → ∞ by the bounded convergence theorem. Since n → ∞ implies k → ∞, this proves that ∆n(x) → 0 for x ∈ D1, where π(D1) = 1. ♦

2.3 Proof of Theorem 2 for General Markov Chains

As mentioned earlier, the key to the proof of Theorem 2 is to recognize an embedded Markov chain which satisfies the conditions of Theorem 1. The proof of Theorem 2 is completed after Lemma 9.

Let Ym = X_{mn0}, m = 0, 1, . . ., and put Q(x,C) = P^{n0}(x,C) for x ∈ X and C ∈ B. The subsequence {Y0, Y1, . . .} is a Markov chain with transition probability function Q(x,C) and we will call it the embedded Markov chain. Define

Ar = {x : ∑_{m=1}^{∞} P^{mn0−r}(x,A) > 0}, r = 0, 1, . . . , n0 − 1. (2.29)

Since P^{n0}(x,A) ≥ ε for all x ∈ A, one can also define Ar by

Ar = {x : ∑_{m=k}^{∞} P^{mn0−r}(x,A) > 0}

for any k ≥ 1, i.e. Ar is the set of all points from which A is accessible at time points which are of the form mn0 − r for all large m, and A0 is the set of all points from which A is accessible in the embedded Markov chain.

Lemma 6 below shows that the embedded Markov chain satisfies the conditions of Theorem 1 with the normalized restriction of π to A0 as its invariant probability measure.

Lemma 6 Under the conditions of Theorem 2,

π(A0) > 0.

Let

π0(C) = π(C ∩ A0) / π(A0).

The embedded Markov chain {Y0, Y1, . . .} satisfies Conditions (1.4) and (1.5) of Theorem 1, with π0 as an invariant probability measure and with the n0 appearing in (1.5) equal to 1.

Proof Condition (1.4) states that π({x : Px(T(A) < ∞) > 0}) = 1. Just the fact that this probability is positive and Condition (1.1) allow us to use Corollary 1 to conclude that π(A) > 0. Condition (1.5) implies that A ⊂ A0. Thus π(A0) > 0, and hence π0 is a well defined probability measure. Clearly,

π(C) = ∫ π(dx) Q(x,C) for all C ∈ B, (2.30)

π0(A0) = 1, (2.31)

and

Q(x, ·) ≥ ε ρ(·) for all x ∈ A. (2.32)

Notice that

∑_{2≤m<∞} Q^m(x,A) = ∫_X Q(x, dy) ∑_{1≤m<∞} Q^m(y,A) = ∫_{A0} Q(x, dy) ∑_{1≤m<∞} Q^m(y,A).

Hence Q(x,A0) > 0 implies that ∑_{2≤m<∞} Q^m(x,A) > 0, i.e. x ∈ A0. In other words,

x ∉ A0 implies Q(x,A0) = 0. (2.33)

From (2.30) and (2.33) we have the equality

π(A0) = ∫_X π(dx) Q(x,A0) = ∫_{A0} π(dx) Q(x,A0),

which implies that Q(x,A0) = 1 for [π]-almost all x ∈ A0. Hence

∫_X π0(dx) Q(x,C) = (1/π(A0)) ∫_{A0} π(dx) Q(x,C) = (1/π(A0)) ∫_X π(dx) Q(x, C ∩ A0) = π0(C). (2.34)

Equations (2.34), (2.31) and (2.32) establish the lemma. ♦

Define

πr(C) = ∫_{A0} π0(dx) P^r(x,C)

for r = 1, 2, . . . , n0 − 1, and

π̄(C) = (1/n0) ∑_{r=0}^{n0−1} πr(C).

Note that πr is the distribution of Xr when Y0 = X0 has initial distribution π0. The next lemma shows that averages of the n0 successive transition functions of the chain converge to π̄(C) for [π0]-almost all x.


Lemma 7 Define

B0 = {x : x ∈ A0, sup_{C∈B} |P^{mn0}(x,C) − π0(C)| → 0 as m → ∞}. (2.35)

Under the conditions of Theorem 2,

π0(B0) = 1. (2.36)

Moreover, for each x ∈ B0,

sup_{C∈B} |P^{mn0+r}(x,C) − πr(C)| → 0 as m → ∞ for r = 0, 1, . . . , n0 − 1, (2.37)

and hence

sup_{C∈B} |(1/n0) ∑_{r=0}^{n0−1} P^{mn0+r}(x,C) − π̄(C)| → 0 as m → ∞. (2.38)

Proof From Lemma 6, the embedded Markov chain satisfies the conditions of Theorem 1 with the n0 appearing in (1.5) equal to 1. From Proposition 2 it follows that

sup_{C∈B} |P^{mn0}(x,C) − π0(C)| → 0 as m → ∞

for [π0]-almost all x. This establishes (2.36). For r = 0, 1, . . . , n0 − 1 and x ∈ B0,

sup_{C∈B} |P^{mn0+r}(x,C) − πr(C)| = sup_{C∈B} |∫_X (P^{mn0}(x, dy) − π0(dy)) P^r(y,C)| ≤ sup_{D∈B} |P^{mn0}(x,D) − π0(D)| → 0

as m → ∞, establishing (2.37). ♦

The next lemma shows that the conclusions of the previous lemma hold [π̄]-almost everywhere.

Lemma 8 Under the conditions of Theorem 2,

πr(Ar) = 1 for r = 1, . . . , n0 − 1,

and (2.38) holds for [π̄]-almost all x.

Proof Consider the original Markov chain X0, X1, . . .. Let E ∈ B and let π0(E) = 1. Then

∫_X π1(dx) P^{n0−1}(x,E) = ∫_{x∈X} ∫_{y∈A0} π0(dy) P(y, dx) P^{n0−1}(x,E) = ∫_{A0} π0(dy) P^{n0}(y,E) = π0(E) = 1,

and hence P^{n0−1}(x,E) = 1 for [π1]-almost all x. In particular we take E = B0 and rewrite the conclusion as π1(B1) = 1, where the sets Br are defined by

Br = {x : P^{n0−r}(x,B0) = 1}, r = 1, 2, . . . , n0 − 1.


Let x ∈ B1. Then

∑_{m≥2} P^{mn0−1}(x,A) ≥ ∫_{B0} P^{n0−1}(x, dy) ∑_{m≥2} P^{(m−1)n0}(y,A) > 0,

and hence x ∈ A1. Thus B1 ⊂ A1. Similarly, πr(Br) = 1 and Br ⊂ Ar for all r. Notice that for x ∈ B1,

sup_{C∈B} |P^{mn0+r+n0−1}(x,C) − πr(C)| ≤ ∫_{y∈X} sup_{C∈B} |P^{mn0+r}(y,C) − πr(C)| P^{n0−1}(x, dy) = ∫_{y∈B0} sup_{C∈B} |P^{mn0+r}(y,C) − πr(C)| P^{n0−1}(x, dy).

From (2.37) it follows that

sup_{C∈B} |P^{mn0+r+n0−1}(x,C) − πr(C)| → 0 as m → ∞ for [π1]-almost all x. (2.39)

As a consequence, as m → ∞,

sup_{C∈B} |(1/n0) ∑_{r=0}^{n0−1} P^{mn0+r}(x,C) − π̄(C)| = sup_{C∈B} |(1/n0) ∑_{r=0}^{n0−1} (P^{mn0+r}(x,C) − π_{r+1}(C))| → 0 (2.40)

for [π1]-a.e. x.

A similar argument shows that (2.38) holds for [πr]-almost all x and all r, and hence for [π̄]-almost all x. ♦

We now establish that π̄ = π by using the full force of Condition (1.4).

Lemma 9 Under the conditions of Theorem 2, πr is the restriction of π to Ar, for r = 1, 2, . . . , n0 − 1, and

π̄ = π.

Proof We have already shown that πr(Ar) = 1, r = 1, 2, . . . , n0 − 1. We will now show that A0, . . . , A_{n0−1} act like cyclically moving subsets, in the sense that

x ∈ A0^c implies P(x,A1) = 0.

Suppose that P(x,A1) > 0. Then

∑_{m≥1} P^{mn0}(x,A) ≥ ∫_{A1} P(x, dy) ∑_{m≥1} P^{mn0−1}(y,A) > 0,

which implies that x ∈ A0. Thus x ∈ A0^c implies that P(x,A1) = 0. Now for C ∈ B,

π1(C) = π1(C ∩ A1)
= (1/π(A0)) ∫_{A0} π(dx) P(x, C ∩ A1)
= (1/π(A0)) ∫_X π(dx) P(x, C ∩ A1)
= π(C ∩ A1) / π(A0).

Since π1(A1) = 1, this implies that π(A1) = π(A0) and that π1 is the restriction of π to A1. A similar conclusion holds for πr for other values of r.

We now use the full force of Condition (1.4), which can be restated as π(∪_{r=0}^{n0−1} Ar) = 1. This together with the fact that πr is the restriction of π to Ar, r = 0, 1, . . . , n0 − 1, implies that the probability measures π and π̄ are absolutely continuous with respect to each other. From this observation and Lemma 8, for any C ∈ B,

H_{mn0}(x,C) = (1/(mn0)) ∑_{j=1}^{mn0} P^j(x,C) → π̄(C)

for [π]-almost all x. Now,

π(C) = ∫_X π(dx) H_{mn0}(x,C) → ∫_X π(dx) π̄(C) = π̄(C) as m → ∞.

This shows that π̄ = π. ♦

We now complete the proof of Theorem 2.

Proof of Theorem 2 It is clear that Lemmas 8 and 9 establish conclusions (1.8) and (1.9) of Theorem 2. Let f(x) be a measurable function satisfying ∫ |f(x)| π(dx) < ∞. From a slight extension of Corollary 2, as applied to the embedded Markov chain for the averages of f(·) over the whole chain, we obtain

π0(Bf) = 1,

where

Bf = {x : Px{(1/n) ∑_{j=1}^{n} f(Xj) → ∫ f(x) dπ(x)} = 1; Ex((1/n) ∑_{j=1}^{n} f(Xj)) → ∫ f(x) dπ(x)}.

From the argument at the beginning of the proof of Lemma 8 we have P^{n0−1}(x, Bf) = 1 for [π1]-almost all x. The definition of Bf is such that if P^{n0−1}(x, Bf) = 1 then x ∈ Bf. Hence π1(Bf) = 1, and similarly πr(Bf) = 1 for r = 2, 3, . . .. This together with the fact that π̄ = π establishes (1.10) and (1.11). ♦

2.4 Remarks on Successive Substitution Sampling

Theorems 1 and 2 pertain to arbitrary Markov chains. We now give a result that facilitates the use of our theorems when the Markov chain used is the one obtained from the successive substitution sampling algorithm, which is the most commonly used Markov chain in Bayesian statistics. We assume that for each i, the conditional distributions π_{X^{(i)} | {X^{(j)}, j≠i}} have densities, say p_{X^{(i)} | {X^{(j)}, j≠i}}, with respect to some dominating measure ρi.

Theorem 6 Consider the successive substitution sampling algorithm for generating observations from the joint distribution π of (X^{(1)}, . . . , X^{(p)}), as described in Section 1. Suppose that for each i = 1, . . . , p there is a set Ai with ρi(Ai) > 0, and a δ > 0, such that for each i = 1, . . . , p

p_{X^{(i)} | {X^{(j)}, j≠i}}(x^{(1)}, . . . , x^{(p)}) > 0 (2.41)

whenever

x^{(1)} ∈ A1, . . . , x^{(i)} ∈ Ai, and x^{(i+1)}, . . . , x^{(p)} are arbitrary,

and

p_{X^{(i)} | {X^{(j)}, j≠i}}(x^{(1)}, . . . , x^{(p)}) > δ whenever x^{(j)} ∈ Aj, j = 1, . . . , p. (2.42)

Then Conditions (1.4) and (1.5) are satisfied with n0 = 1. Thus, (1.6) is also satisfied, and the conclusions of Theorems 1 and 2 hold.

Proof Let pi = p_{X^{(i)} | {X^{(j)}, j≠i}} and A = A1 × · · · × Ap. The transition function used in successive substitution sampling is given by

P(x,A) = ∫_A p1(y^{(1)}, x^{(2)}, . . . , x^{(p)}) p2(y^{(1)}, y^{(2)}, x^{(3)}, . . . , x^{(p)}) · · · pp(y^{(1)}, . . . , y^{(p)}) dρ1(y^{(1)}) · · · dρp(y^{(p)}).

It is now easy to see that Condition (2.41) verifies (1.4) for all starting points x and that (2.42) verifies (1.5) with n0 = 1. ♦

We note that Condition (2.41) is often checked for all (x^{(1)}, . . . , x^{(p)}).

Conditional Distributions Need Not Determine the Joint Distribution

In Section 1, we described how to form a transition function from the two conditional distributions π_{X1|X2} and π_{X2|X1} obtained from a bivariate distribution π. We mentioned that for a Markov chain with such a transition function to converge in distribution to π, it is necessary that π_{X1|X2} and π_{X2|X1} determine π. Some researchers have pondered the question of when the conditional distributions determine the joint distribution. Besag (1974) noted that uniqueness is guaranteed if the distributions are discrete and give positive probability to a rectangle set.

One can give a simple nondegenerate example to show that in general, the two conditional distributions do not determine the joint distribution. Let X1 have a density function p(x) such that

∑_{−∞<m<∞} p(m + r) = c_r < ∞ for each r ∈ [0, 1).

The density function p(x) = (1/2) exp(−|x|), for instance, satisfies this condition. Let π_{X2|X1} be the distribution that puts masses

1/2 at x1 + 1 and 1/2 at x1 − 1. (2.43)

This determines the other conditional distribution π_{X1|X2}, which puts masses

p(x2 + 1)/(p(x2 + 1) + p(x2 − 1)) at x2 + 1 and p(x2 − 1)/(p(x2 + 1) + p(x2 − 1)) at x2 − 1. (2.44)


It can be seen that the two conditional distributions (2.44) and (2.43) do not uniquely determine a joint distribution for (X1, X2). Fix r ∈ [0, 1) and consider the discrete distribution p_r on the points m + r, m = . . . , −1, 0, 1, . . ., defined by p_r(m + r) = (1/c_r) p(m + r). Let Y1(r) be distributed according to p_r, and let the conditional distribution of Y2(r) given Y1(r) be the distribution defined in (2.43). It is easy to see that the distribution of Y1(r) given Y2(r) is that given in (2.44), and the joint distribution of (Y1(r), Y2(r)) has the same conditional distributions as (X1, X2).

It is even possible to find joint distributions with continuous marginals for which the conditionals are given by (2.44) and (2.43). Let f(r) be any probability density on [0, 1). Let R have density function f(r) and put (Z1, Z2) = (Y1(R), Y2(R)). Clearly the conditional distributions of (Z1, Z2) are as in (2.44) and (2.43). The marginal distribution function of Z1 is given by

P(Z1 ≤ x) = ∫_{[0,1)} (∑_{m : m+r≤x} p_r(m + r)) f(r) dr = ∫_{−∞}^{x} (p(y) f(y − [y]) / c_{y−[y]}) dy.

A similar expression can be written down for the distribution function of Z2. Notice that Z1 and Z2 have density functions.

3 Remarks on the Sampling Plan

In Section 1 we mentioned that there are a number of ways of using the Markov chain to estimate π or some aspect of π. One can generate G independent chains, each of length n, and retain the last observation from each chain, obtaining a sample X_n^{[1]}, X_n^{[2]}, . . . , X_n^{[G]} of independent variables. At another extreme, one can generate a very long sample X0, X1, X2, . . . , X_{nG}, and use Xn, X_{2n}, . . . , X_{Gn}, which form a nearly i.i.d. sequence from π. This is at approximately the same cost in CPU time. (Clearly intermediate solutions are possible.) If the objective is to estimate an expectation ∫ f(x) π(dx), then there is no reason to discard the intermediate values from a long chain, and one can use

(1/(n(G − 1))) ∑_{i=n+1}^{nG} f(Xi). (3.1)

The almost sure convergence of (3.1) follows from Theorem 2 under the assumption ∫ |f(x)| π(dx) < ∞ (note that we do not need the aperiodicity condition (1.6)). Thus, from the point of view of estimating a particular expectation ∫ f(x) π(dx) or probability, it is clear that use of (3.1) is preferable, and so it is natural to ask why one should bother to prove results such as (1.7). In the Bayesian framework, there is another aspect that must be considered, which is that generally, in the exploratory stage, one is interested in calculating posterior distributions and densities for a large number of prior distributions. It will usually not be feasible to run a separate Markov chain for each prior of interest (the time needed is often on the order of several minutes for each prior). Instead, one will want to get a sequence of random variables X1, . . . , Xr distributed according to the posterior distribution with respect to some fixed prior, and then use that same sequence to estimate the posterior with respect to many other priors. (We discuss how this may be done in the next paragraph.) The important point here is that if there are a large number of priors involved, then the manipulations of the sequence X1, . . . , Xr to produce the posterior for each prior must be done very quickly. This restricts the size of r, and so one will generally want the sequence X1, . . . , Xr to be independent. This precludes running a very long chain and taking sample averages as in (3.1). Instead, one will want to generate independent chains and retain the last random variable in each chain, or take a long chain and retain only random variables at equally spaced intervals.

We now discuss in more detail how one might use one sequence X1, . . . , Xr to calculate posteriors with respect to many priors. We depart from the notation of the paper and switch to the notation usually used in Bayesian analysis. Suppose that {ν_h} is a family of priors for the parameter θ; here h lies in some interval and we think of it as a hyperparameter for the prior. Suppose that we are in the dominated case, i.e. there is a likelihood function l_X(θ), where X now represents the data.

Let ν_{h,X} be the posterior distribution of θ when the prior is ν_h. We know that ν_{h,X} is dominated by ν_h and

(dν_{h,X}/dν_h)(θ) = c_h(X) l_X(θ),

where c_h(X) is a normalizing constant.

Consider the case where we can generate observations θ1, θ2, . . . , θr from ν_{0,X} and therefore estimate ∫f(θ) dν_{0,X}(θ) by (1/r) ∑_{i=1}^{r} f(θi). We will indicate now how we can obtain estimates of ∫f(θ) dν_{h,X}(θ) for h ≠ 0.

Suppose that ν_h is dominated by ν_0. Then it is clear that ν_{h,X} is dominated by ν_{0,X} and

(dν_{h,X}/dν_{0,X})(θ) = (c_h(X)/c_0(X)) (dν_h/dν_0)(θ),

since the likelihood l_X(θ) cancels. We may write

∫f(θ) dν_{h,X}(θ) = ∫f(θ) (dν_{h,X}/dν_{0,X})(θ) dν_{0,X}(θ) = (c_h(X)/c_0(X)) ∫f(θ) (dν_h/dν_0)(θ) dν_{0,X}(θ).

Substituting f(θ) ≡ 1 in the above, we can obtain the constant c_h(X)/c_0(X) and write

∫f(θ) dν_{h,X}(θ) = ∫f(θ) (dν_h/dν_0)(θ) dν_{0,X}(θ) / ∫(dν_h/dν_0)(θ) dν_{0,X}(θ).

Thus, we may estimate ∫f(θ) dν_{h,X}(θ) by

∑_{i=1}^{r} f(θi) w_{h,i}, where w_{h,i} = (dν_h/dν_0)(θi) / ∑_{j=1}^{r} (dν_h/dν_0)(θj).

This is the well-known “ratio estimate” in importance sampling theory. The key here is that its calculation requires only knowledge of the ratio dν_{h,X}/dν_{0,X} up to a multiplicative constant. See Hastings (1970).
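In code, updating from the prior ν_0 to a prior ν_h is a single pass over the saved draws. The sketch below is our own hypothetical illustration, not the authors' implementation: it takes ν_h to be the N(h, 1) prior, so that (dν_h/dν_0)(θ) = exp(hθ − h²/2), and forms the self-normalized weights w_{h,i}.

    import numpy as np

    def ratio_estimate(theta, f, dnu_ratio):
        # Estimate the expectation of f under nu_{h,X} from draws
        # theta_1, ..., theta_r from nu_{0,X}, using the weights w_{h,i}.
        w = dnu_ratio(theta)     # (d nu_h / d nu_0)(theta_i), up to a constant
        w = w / w.sum()          # self-normalization removes c_h(X)/c_0(X)
        return np.sum(f(theta) * w)

    # Stand-in draws from nu_{0,X}; in practice these come from the Markov chain.
    rng = np.random.default_rng(2)
    theta = rng.normal(loc=1.0, scale=0.5, size=500)
    for h in (0.0, 0.5, 1.0):
        ratio = lambda t, h=h: np.exp(h * t - h * h / 2.0)   # d nu_h / d nu_0 for N(h,1) vs N(0,1)
        print(h, ratio_estimate(theta, lambda t: t, ratio))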

Now in some Bayesian problems, for instance problems with missing or censored data, the likelihood function l_X(θ) is either extremely difficult or impossible to calculate. (An example of this arises in Doss (1991).) The fact that this likelihood cancels means that estimating the expectation under the prior ν_h requires only the recomputation of the r weights, and this can be done very fast.

It will often be the case that we wish to consider not just one function, but rather a family of functions. As a simple example, if we wish to estimate the entire posterior distribution of θ, then in effect we wish to consider f_t(θ) = I(θ ≤ t) for a fine grid of values of t. For some applications we have been able to do the computations fast enough to dynamically display the estimates of the posterior distributions ∫I(θ ≤ t) dν_{h,X}(θ) as h varies, using the program Lisp-Stat described in Tierney (1991), for r as large as 500 on a workstation doing about 1.5 million floating point operations per second. See Doss and Narasimhan (1994).
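With the weights w_{h,i} in hand, evaluating the estimate of ∫I(θ ≤ t) dν_{h,X}(θ) over an entire grid of t values is just a weighted empirical distribution function: one sort followed by a cumulative sum. A minimal sketch, again with hypothetical names of our own:

    import numpy as np

    def weighted_cdf(theta, w, t_grid):
        # Evaluate sum_i w_i * I(theta_i <= t) at every t in t_grid.
        order = np.argsort(theta)
        cum_w = np.cumsum(w[order])                    # running totals of the weights
        idx = np.searchsorted(theta[order], t_grid, side="right")
        return np.where(idx > 0, cum_w[idx - 1], 0.0)  # 0 to the left of all draws

    # With uniform weights this is the usual empirical distribution function.
    rng = np.random.default_rng(3)
    theta = rng.normal(size=500)
    w = np.full(theta.size, 1.0 / theta.size)
    print(weighted_cdf(theta, w, np.array([-1.0, 0.0, 1.0])))   # roughly 0.16, 0.50, 0.84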

Acknowledgements

We are very grateful to the reviewers for their helpful remarks, and especially to the associate editor for a particularly careful reading of the paper and for pointing out some simplifications in our proofs.

References

Amit, Y. (1991). On rates of convergence of stochastic relaxation for Gaussian and non-Gaussian distributions. J. Multivariate Anal. 38 82–99.

Asmussen, S. (1987). Applied Probability and Queues. Wiley, New York.

Athreya, K. B., Doss, H., and Sethuraman, J. (1992). A proof of convergence of the Markov chain simulation method. Technical Report No. 868, Department of Statistics, Florida State University.

Athreya, K. B. and Ney, P. (1978). A new approach to the limit theory of recurrent Markov chains. Trans. Amer. Math. Soc. 245 493–501.

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. J. Royal Statist. Soc. Ser. B 36 192–236.

Chan, K. S. (1993). Asymptotic behavior of the Gibbs sampler. J. Amer. Statist. Assoc. 88 320–326.

Chung, K. L. (1967). Markov Chains. Second Edition, Springer Verlag, New York.

Doob, J. L. (1953). Stochastic Processes. Wiley, New York.

Doss, H. (1991). Bayesian nonparametric estimation for incomplete data via successive substitution sampling. Technical Report No. M850, Department of Statistics, Florida State University.

Doss, H. and Narasimhan, B. (1994). Bayesian Poisson regression using the Gibbs sampler: Sensitivity analysis through dynamic graphics. Technical Report No. 895, Department of Statistics, Florida State University.

Feller, W. (1950). An Introduction to Probability Theory and Its Applications, Volume I. Third edition, John Wiley, New York.


Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence 6 721–741.

Gilks, W. R. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Appl. Statist. 41 337–348.

Goodman, J. and Sokal, A. D. (1989). Multigrid Monte Carlo method. Conceptual foundations. Phys. Rev. D 40 2035–2071.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 97–109.

Hoel, P., Port, S., and Stone, C. (1972). Introduction to Stochastic Processes. Houghton Mifflin, Boston.

Karlin, S. and Taylor, H. M. (1975). A First Course in Stochastic Processes. Academic Press, New York.

Nummelin, E. (1984). General Irreducible Markov Chains and Non-Negative Operators. Cambridge University Press, Cambridge.

Orey, S. (1971). Limit Theorems for Markov Chain Transition Probabilities. Van Nostrand, New York.

Revuz, D. (1975). Markov Chains. North-Holland, Amsterdam.

Schervish, M. and Carlin, B. (1992). On the convergence of successive substitution sampling. J. Computational and Graphical Statistics 1 111–127.

Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). J. Amer. Statist. Assoc. 82 528–550.

Thomas, A., Spiegelhalter, D., and Gilks, W. (1992). BUGS: A program to perform Bayesian inference using Gibbs sampling. In Bayesian Statistics 4 (ed. J. Bernardo, J. Berger, A. Dawid, and A. F. M. Smith), 837–842. Clarendon Press, Oxford, UK.

Tierney, L. (1991). Markov chains for exploring posterior distributions. Technical Report No. 560, School of Statistics, University of Minnesota.

Tierney, L. (1991). Lisp-Stat. Wiley, New York.
