
ON THE ADAPTIVE CONTROL OF A PARTIALLY

OBSERVABLE BINARY MARKOV DECISION PROCESS

Emmanuel Fernández-Gaucherand, Aristotle Arapostathis, and Steven I. Marcus

Department of Electrical and Computer Engineering

The University of Texas at Austin

Austin, Texas 78712-1084

1. Introduction. As noted in [AM], despite the considerable amount of work in adaptive stochastic control (see [KUM1] for a survey), there has been little work on problems with incomplete or noisy state observations, aside from the linear case. A first step in this direction was taken in [AM], in which the adaptive estimation of the state of a finite state Markov chain with incomplete state observations, and in which the state transition probabilities depend on unknown parameters, is studied. In this context, the adaptive estimation problem is that of computing recursive estimates of the conditional probability vector of the state at time t given the observations up to time t, when the transition matrix P is not completely known (i.e., it depends on a vector of unknown parameters θ; this dependence is expressed as P(θ)). The approach to this problem which we adopted in [AM] has been widely used in linear

filtering: we use the previously derived recursive filter for the conditional probabilities, and we

simultaneously recursively estimate the parameters, plugging the parameter estimates into the

filter. This adaptive estimation algorithm is then analyzed via the Ordinary Differential Equation

(ODE) Method. That is, it is shown that the convergence of the parameter estimation algorithm

can be analyzed by studying an "averaged" ordinary differential equation. The most crucial and

difficult aspect of the proof is that of showing that, for each value of the unknown parameter,

an augmented Markov process has a unique invariant measure, and that the expectations of

certain functions of the augmented state converge geometrically to the same expectations under

the invariant measure. The convergence of the recursive parameter estimates is studied, and

optimality of the adaptive state estimator is proved, for a given criterion.

In this paper, we take some initial steps in the direction of applying similar techniques

to adaptive finite state Markov decision (control) problems with incomplete observations. One

intriguing set of problems for which some results are available when the parameters are known are those involving quality control and machine replacement, and we study the properties of

these problems in this paper, with the eventual aim of developing optimal adaptive stochastic

controllers. However, the presence of feedback makes this stochastic adaptive control problem

much more difficult than the adaptive estimation problem of [AM]. We consider here only problems

involving two states and two actions; for some more general analysis, see [FAM].

2. The Two-State Binary Replacement Problem. Consider the situation in which a machine/production process deteriorates over time under operation. The ("core") state of the machine is such that 0 is the more desirable ("as new" or "good") state, and 1 represents the failed (or "bad") state. The efficiency of the machine, or the quality of the items produced, is a function of the core state. Items are produced at the beginning of each time epoch t, for t ∈ N₀ := {0} ∪ N,


and at that moment one of two decisions must be made: continue to produce (0) or repair/replace

(1); the word "binary" refers to the fact that there are only two actions. Under production, the

machine may deteriorate to the "bad" state 1, and once in that state it stays there if no re-

pair/replace action is taken. If the decision to repair/replace is taken, then the machine will be

in the "good" state by the beginning of the next time epoch w.p.1. Imperfect observations of the

state of the system are available while producing.

Putting this problem in the standard framework of partially observed Markov decision processes (POMDP) (see, e.g., [BE], [MO], [HLM]), we have the following model. Let (Ω, F, P) be a probability space and let X = Y = {0,1}, U = {0,1}. The system's state process will be modelled as a finite state controlled Markov chain with ("core") state space X and action space U, with 2 × 2 transition matrices {P(u)}_{u∈U}. Thus, the core process is given by a random process {x_t} on (Ω, F, P), t ∈ N₀, where, for a sequence of U-valued random variables {u_k}_{k=0}^∞ on (Ω, F, P), the controls (or decisions),

$$P\{x_{t+1} = j \mid x_t = i, x_{t-1}, \ldots, x_0;\; u_t, u_{t-1}, \ldots, u_0\} = [P(u_t)]_{i,j} =: p_{i,j}(u_t), \qquad t \in N_0.$$

Only partial observations of {x_t}_{t∈N₀} are available, in the form of a random process {y_t}_{t∈N₀} taking values in the ("message") space Y. The sequence of events is assumed as follows: at time epoch t the system is in state x_t, observation y_t becomes available, and action u_t is taken; transition to a state x_{t+1} has occurred by the beginning of time epoch t+1, another observation y_{t+1} becomes available, and then a new decision u_{t+1} is made; and so on. The core and observation

processes are related probabilistically as

$$P\{y_{t+1} = k \mid y_t, \ldots, y_0;\; x_{t+1} = i, \ldots, x_0;\; u_t, \ldots, u_0\} = P\{y_{t+1} = k \mid x_{t+1} = i, u_t\} =: q_{i,k}(u_t), \qquad t \in N_0,$$

$$P\{y_0 = k \mid x_0 = i\} =: q^{(0)}_{i,k},$$

which leads to the definition of a collection of 2 × 2 observation matrices Q_0 and {Q(u)}_{u∈U} such that

$$Q(u) := \big[\,q_{i,k}(u)\,\big]_{i \in X,\, k \in Y}; \qquad Q_0 := \big[\,q^{(0)}_{i,k}\,\big]_{i \in X,\, k \in Y}.$$

It is assumed that the probability distribution of the initial state, P_0 := [P{x_0 = i}]_{i∈X} ∈ Δ, is available for decision making, where Δ := {p ∈ ℝ² : p^{(i)} ≥ 0, 1'p = 1}; here, 1 = [1,1]', "prime" denotes transposition, and p^{(i)} denotes the ith component of p. We endow Δ with the metric topology induced by the norm ‖·‖₁ given by ‖p‖₁ := |p^{(1)}| + |p^{(2)}|. Define recursively the information spaces

$$H_0 := \Delta \times Y; \qquad H_t := H_{t-1} \times U \times Y, \ t \in N; \qquad H_\infty := H_0 \times (U \times Y)^\infty,$$

each equipped with its respective product topology. An element h_t ∈ H_t, t ∈ N₀, is called an observable history and represents the information available for decision making at time epoch t. It is straightforward to show that X, Y, U, and Δ are Borel spaces, and hence so is H_t, t ∈ N₀ (see [BS, p. 119]); this leads to a well-defined probabilistic structure [BS].

Let μ_t : H_t → U be a measurable map. An admissible control law, policy, or strategy μ is a sequence of maps {μ_t(·)}_{t∈N₀}, or {μ_0(·), ..., μ_n(·)} for a finite horizon. When μ_t(·) = μ(·) for all


values of t, then the policy is said to be stationary. Let c : U × X → [0, M] be a given measurable map, where M is a positive scalar; c(u, x) is interpreted as the cost incurred given that the system was in state x and control action u was selected. To each admissible strategy μ and probability distribution p_0 of the initial state, the following expected costs are associated.

Finite Horizon:
$$J_n(\mu, p_0) := E^{\mu}_{p_0}\Big[\sum_{t=0}^{n} c(u_t, x_t)\Big] \qquad \text{(FH)}$$

Discounted Cost:
$$J_\beta(\mu, p_0) := \lim_{n \to \infty} E^{\mu}_{p_0}\Big[\sum_{t=0}^{n} \beta^t c(u_t, x_t)\Big], \qquad 0 < \beta < 1 \qquad \text{(DC)}$$

Average Cost:
$$J(\mu, p_0) := \limsup_{n \to \infty}\, E^{\mu}_{p_0}\Big[\frac{1}{n+1}\sum_{t=0}^{n} c(u_t, x_t)\Big] \qquad \text{(AC)}$$

where E^{μ}_{p_0} is the expectation with respect to the (unique) probability measure on H_∞ induced by p_0 and the strategy μ, or an appropriate marginal (see [BS, pp. 140-144 and 249]). For a strategy μ, {u_t = μ_t(h_t)}_{t≥0} is the control process governing state transitions. The optimal control (or decision) problem is that of selecting an (optimal) admissible strategy such that one of the above criteria is minimized over all admissible strategies. The optimal (DC) cost function is obtained as Γ_β(p_0) := inf{J_β(μ, p_0) : μ is an admissible strategy}, for each p_0 ∈ Δ. Similarly, denote by Γ(·, n) and Γ(·) the optimal cost functions for the horizon-n (FH) and (AC) cases, respectively.
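The three criteria above can be checked numerically by simulation. The following Python sketch (all function and variable names are ours, not the paper's, and Q[0] is used as a stand-in for the initial observation matrix Q_0) estimates finite-horizon proxies of the (DC) and (AC) costs for a given admissible policy acting on the observation history.

```python
import numpy as np

def estimate_costs(P, Q, c, policy, p0, beta=0.95, horizon=2000, runs=100, seed=0):
    """Monte Carlo sketch of the (DC) and (AC) criteria for a fixed policy.

    P[u] and Q[u] are the 2x2 transition and observation matrices, c[u][x] the
    one-stage cost, and policy(obs_history) returns an action in {0, 1}.
    """
    rng = np.random.default_rng(seed)
    dc_runs, ac_runs = [], []
    for _ in range(runs):
        x = rng.choice(2, p=p0)               # core state x_0 ~ p_0
        y = rng.choice(2, p=Q[0][x])          # initial observation (stand-in for Q_0)
        obs, dc, total = [y], 0.0, 0.0
        for t in range(horizon):
            u = policy(obs)                   # decision from observations so far
            dc += beta ** t * c[u][x]
            total += c[u][x]
            x = rng.choice(2, p=P[u][x])      # core state transition
            y = rng.choice(2, p=Q[u][x])      # next noisy observation
            obs.append(y)
        dc_runs.append(dc)
        ac_runs.append(total / horizon)
    return np.mean(dc_runs), np.mean(ac_runs)
```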

It is well known that a separation principle holds for the problems listed above (see [BS],

[KV]). Specifically, let {p_{t|t}}_{t∈N₀} denote the conditional probability distribution process, whose ith component is given by

$$p^{(i)}_{t|t} := P\{x_t = i \mid y_t, \ldots, y_0;\; u_{t-1}, \ldots, u_0\}, \quad t \in N; \qquad p^{(i)}_{0|0} := P_0^{(i)}.$$

Then, assuming that P{h_t, u_t, y_{t+1} = k} ≠ 0 for t ∈ N₀ and for each k ∈ Y, and using Bayes' rule, it is easily shown that (see also [KV, Sect. 6.6], [AS])

$$p_{t+1|t+1} = \sum_{k \in Y} \frac{\tilde Q_k(u_t)\, P'(u_t)\, p_{t|t}}{\mathbf 1'\, \tilde Q_k(u_t)\, P'(u_t)\, p_{t|t}}\; I[y_{t+1} = k] \qquad (2.1)$$

for each sample path, where I[A] denotes the indicator function of the event A and the 2 × 2 matrices \tilde Q_k(u) are given by \tilde Q_k(u) := diag{q_{i,k}(u)}. Note that p_{t|t} is a function of (y_t, p_{t-1|t-1}, u_{t-1}).
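The update (2.1) is a standard Bayes step and is easy to code directly. The following sketch (with illustrative container names, not the authors' code) performs one update of the conditional probability vector.

```python
import numpy as np

def belief_update(p, u, y_next, P, Qtilde):
    """One step of the filter (2.1): p_{t+1|t+1} from p_{t|t}, u_t, and y_{t+1}.

    P[u] is the transition matrix P(u) and Qtilde[u][k] = diag{q_{i,k}(u)}.
    """
    unnormalized = Qtilde[u][y_next] @ P[u].T @ p   # \tilde Q_k(u) P'(u) p_{t|t}
    return unnormalized / unnormalized.sum()        # divide by 1' times the numerator
```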

Define a separated admissible law for time epoch t as a measurable map μ_t : Δ → U; separated admissible strategies are defined similarly as before. It is straightforward to show that the process {p_{t|t}} obtained via feedback using a separated strategy is a Markov process [AS]. A separated admissible law μ can be regarded as an admissible law via h_t ↦ μ_t(p_{t|t}), where p_{t|t} is obtained from h_t by applying (2.1) recursively. Then the partially observed, finite horizon problem (FH) is equivalent (i.e., equal minimum costs for each p_0 ∈ Δ) to the completely observed finite horizon problem, with state space Δ, of finding a separated admissible strategy which minimizes


$$J'_n(\mu, p_0) := E^{\mu}_{p_0}\Big[\sum_{t=0}^{n} p'_{t|t}\, c(u_t)\Big] \qquad \text{(FH')}$$

where u_t := μ_t(p_{t|t}) and c(u) = [c(u, i)]_{i∈X}. Similarly, (DC') and (AC') are defined.

Returning to the binary replacement problem with two states, the model of the POMDP takes the form

$$P(0) = \begin{bmatrix} 1-\theta & \theta\\ 0 & 1 \end{bmatrix}, \qquad P(1) = \begin{bmatrix} 1 & 0\\ 1 & 0 \end{bmatrix}, \qquad (2.2)$$

$$Q := Q(0) = Q(1) = \begin{bmatrix} q & 1-q\\ 1-q & q \end{bmatrix}, \qquad (2.3)$$

where θ ∈ [0,1] and q ∈ (0.5, 1) (i.e., the core state is strictly partially observed). Actually, it is natural to expect that θ is some small positive number. Note that if the decision to replace has been taken at time epoch t, then the state of the machine is known at the beginning of time epoch t + 1; hence the observation at this time is irrelevant, and thus Q(1) can be chosen arbitrarily. For the observation quality structure under decision u = 0, the probability of making a correct observation of the state is q. The conditional probability vector can be written as p_{t|t} = [1 − p_{t|t}, p_{t|t}]', where (with a slight abuse of notation) p_{t|t} is the conditional probability of the process being in the bad state. From (2.1)-(2.3), with p_{0|0} = p_0 := [0 1]P_0, we get

$$p_{t+1|t+1} = T(1, p_{t|t}, u_t)\, y_{t+1} + T(0, p_{t|t}, u_t)\,(1 - y_{t+1}) \qquad (2.4)$$

where u_t is the decision made at time epoch t, and for p ∈ [0,1]

$$V(0, p, 0) := q(1-p)(1-\theta) + (1-q)\big[\,p(1-\theta) + \theta\,\big] \qquad (2.5)$$
$$V(1, p, 0) := (1-q)(1-p)(1-\theta) + q\big[\,p(1-\theta) + \theta\,\big] \qquad (2.6)$$
$$V(0, p, 1) := 1-q; \qquad V(1, p, 1) := q \qquad (2.7)$$
$$T(0, p, 0) := \frac{(1-q)\big[\,p(1-\theta) + \theta\,\big]}{V(0, p, 0)}, \qquad T(1, p, 0) := \frac{q\big[\,p(1-\theta) + \theta\,\big]}{V(1, p, 0)} \qquad (2.8)$$
$$T(k, p, 1) := 0; \qquad k = 0, 1. \qquad (2.9)$$

Here, given an a priori probability p of the system being in the bad state, V(k, p, u) is interpreted as the (one-step ahead) conditional probability of the observation being k, under decision u. Similarly, T(k, p, u) is interpreted as the a posteriori conditional probability of the system being in the bad state, given that decision u was made, observation k was obtained, and an a priori probability p. Note that {p_{t|t}} = {p_{t|t}(θ)}.

We now study some important properties of T(k, ·, 0), k = 0, 1. Define

$$f_0(p) := T(0, p, 0) - p, \qquad f_1(p) := T(1, p, 0) - p.$$

Thus, the roots of f_0(·) and f_1(·) are the fixed points of T(0, ·, 0) and T(1, ·, 0), respectively. The pertinent quadratic equation for f_0(·) is

$$\xi^2\,(2q-1)(1-\theta) \;-\; \xi\,\big[(2q-1)(1-\theta) + (1-q)\theta\big] \;+\; (1-q)\theta = 0,$$


and for θ ≠ 1 its roots are

$$\xi'_0 = 1; \qquad \xi_0 = \frac{(1-q)\theta}{(2q-1)(1-\theta)}.$$

For θ = 1, both roots are equal to 1. Replacing q by (1 − q) above, the expressions corresponding to f_1(·) are obtained. Then, the following can be shown (see [FAM]).

Lemma 2.1: For q ∈ (0.5, 1) and θ ∈ (0, 1), the following holds:

(a) T(k, ·, u) is monotone increasing in [0, 1), for each u = 0, 1 and k = 0, 1;

(b) T(0, ·, 0) < T(1, ·, 0), in [0, 1);

(c) p < T(1, p, 0), for p ∈ [0, 1);

(d) if 1/(2−θ) < q, then ξ_0 ∈ (0, 1), and p_1 < T(0, p_1, 0) and T(0, p_2, 0) < p_2, for p_1 ∈ [0, ξ_0) and p_2 ∈ (ξ_0, 1);

(e) if q < 1/(2−θ), then ξ_0 > 1 and p < T(0, p, 0) for p ∈ [0, 1).

Remark 2.1: When θ = 1 (not a very interesting situation), it follows that T(0, ·, 0) = T(1, ·, 0) ≡ 1 on [0, 1]. Also, parts (c) and (d) above can be interpreted as meaning that an observation of the process being in the bad state is trusted more than an observation of the process being in the good state, in general.
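The maps V and T of (2.5)-(2.9), the nontrivial fixed point ξ₀, and parts of Lemma 2.1 can be checked numerically. The sketch below uses the illustrative values q = 0.8 and θ = 0.1 (so that 1/(2−θ) < q); it is not the authors' code.

```python
import numpy as np

def V(k, p, u, q, theta):
    """One-step-ahead probability of observing k, eqs. (2.5)-(2.7)."""
    if u == 1:
        return 1 - q if k == 0 else q
    bad = p * (1 - theta) + theta          # prob. the next core state is bad under "produce"
    good = (1 - p) * (1 - theta)
    return q * good + (1 - q) * bad if k == 0 else (1 - q) * good + q * bad

def T(k, p, u, q, theta):
    """A posteriori probability of the bad state, eqs. (2.8)-(2.9)."""
    if u == 1:
        return 0.0
    bad = p * (1 - theta) + theta
    num = (1 - q) * bad if k == 0 else q * bad
    return num / V(k, p, u, q, theta)

q, theta = 0.8, 0.1                        # illustrative values with 1/(2-theta) < q
xi0 = (1 - q) * theta / ((2 * q - 1) * (1 - theta))
print(abs(T(0, xi0, 0, q, theta) - xi0) < 1e-12)                        # xi0 is a fixed point of T(0,.,0)
print(all(T(1, p, 0, q, theta) > p for p in np.linspace(0, 0.99, 50)))  # Lemma 2.1(c)
```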

For an optimization problem within some class of decision processes, it is of great importance

to establish qualitative properties on the structure of the optimal policies. Such insight may aid

in accelerating the computation of optimal policies for specific problems, and also hints at what

"good," possibly easily implementable, control laws should be for the physical system being

modelled [HS]. Results of this nature can be found, e.g., in machine replacement problems [W2],

[RO], [BE], competing queues schemes [SM], etc. Also, when an adaptive control scheme based on

a Certainty Equivalence Principle is to be studied, structural specification of optimal (or "good")

policies is of critical importance in order to determine the policies over which adaptation must

take place. This is well known for the self-tuning regulator [KV]; for other applications, see also

[MMS], [KUM2].

For the cost structure of our problem, we choose for production c(0, i) = C_i, with C_0 = 0, C_1 = C, and for replacement c(1, i) = c(1) = R, 0 < C < R < ∞. Considering a (DC') criterion, it is straightforward to show that it is always optimal to produce at p = 0 [AKO], [W1]. Thus, since the region Δ_r ⊂ Δ in which it is optimal to repair is convex [LO1], optimal strategies will be monotone in p ∈ [0, 1], and hence to each such optimization problem there corresponds an α ∈ (0, 1] such that the structure of an optimal stationary strategy is as shown in Figure 1. Furthermore, it is easily shown that α = 1, i.e., it is optimal to produce for all p ∈ [0, 1], if and only if R is sufficiently large [W1]. Such separated policies are said to be of the control-limit type, with the corresponding α being the "control-limit"; a numerical sketch of this structure is given following Figure 1.

Figure 1: Control-limit policy structure (PRODUCE for p below the control limit α, REPLACE above).
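The control-limit structure can be reproduced by value iteration for the (DC') problem on a grid of beliefs. The sketch below reuses V and T from the previous snippet; the cost values C = 1, R = 3, the discount factor, and the grid size are illustrative choices, not taken from the paper.

```python
import numpy as np

def control_limit(q=0.8, theta=0.1, C=1.0, R=3.0, beta=0.95, n_grid=401, iters=300):
    """Value-iteration sketch for the (DC') problem over the belief p in [0, 1].

    Returns an estimate of the control limit alpha (smallest p where replacing wins).
    """
    grid = np.linspace(0.0, 1.0, n_grid)
    W = np.zeros(n_grid)
    for _ in range(iters):
        produce = np.array([
            C * p + beta * sum(V(k, p, 0, q, theta) * np.interp(T(k, p, 0, q, theta), grid, W)
                               for k in (0, 1))
            for p in grid
        ])
        replace = R + beta * np.interp(0.0, grid, W)   # T(k, p, 1) = 0: replacement resets the belief
        W = np.minimum(produce, replace)
    repl = np.nonzero(replace <= produce)[0]           # grid points where replacing is optimal
    return grid[repl[0]] if repl.size else 1.0

print("estimated control limit alpha ~", control_limit())
```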


Under a control-limit stationary feedback strategy with nonvoid replacement region, i.e., α ∈ (0, 1), it is clear from Lemma 2.1 that the time to replace will be finite, and uniformly bounded with respect to the initial probability distribution, if 1/2 < q < 1/(2−θ). However, since θ is expected to be "small," this condition only says that the system is close to being completely unobservable, and thus periodic replacement should be scheduled. In general, when the quality of the observations may be more significant, e.g., 1/(2−θ) < q, then we have the following.

Lemma 2.2: Let q ∈ (0.5, 1), and let μ be a control-limit stationary policy with nonvoid replacement region. Then there exists an M ∈ N such that, under feedback by μ, M consecutive observations of the core process being in the bad state result in at least one replace action, independently of the initial probability of being in the bad state.

Proof: Define recursively T^{n+1}(1, p, 0) := T(1, T^n(1, p, 0), 0), for n ∈ N, and T^1(1, p, 0) := T(1, p, 0), for p ∈ [0, 1]. Now, 1 is the only stable fixed point of T(1, ·, 0): for any p ∈ [0, 1], T^n(1, p, 0) → 1 as n → ∞, since by (c) in Lemma 2.1, T^{n−1}(1, p, 0) < T^n(1, p, 0). Then, for fixed α ∈ (0, 1), there is a minimum M' ∈ N such that α < T^{M'}(1, 0, 0). By part (a) in Lemma 2.1, α < T^{M'}(1, p, 0) for any p ∈ [0, 1]. Taking M = M' + 1 to account for a starting probability of being in the bad state within the replace region, the result follows. Q.E.D.
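The integer M' in the proof can be computed by iterating T(1, ·, 0), as in the following sketch (the control limit α = 0.5 is an illustrative value; T is the function from the earlier snippet).

```python
def replacement_bound(alpha, q=0.8, theta=0.1):
    """Smallest M' with alpha < T^{M'}(1, 0, 0), as in the proof of Lemma 2.2.

    Lemma 2.2 then takes M = M' + 1.
    """
    p, m = 0.0, 0
    while p <= alpha:
        p = T(1, p, 0, q, theta)   # one more consecutive observation of the bad state
        m += 1
    return m

print(replacement_bound(alpha=0.5))   # 2 for these illustrative values
```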

3. The Adaptive Two-State Binary Replacement Problem. If the parameter θ in P(θ) (see (2.2)) is unknown, we cannot compute p_{t|t}, nor can we determine optimal strategies. The "certainty equivalence" approach which we will adopt involves recursively computing estimates θ̂_t of the unknown parameter at each decision epoch t, and using the latest available estimate in the filtering equation (2.4) to compute p_{t+1|t+1}(θ̂_{t+1}) =: p̂_{t+1|t+1}, where the decision u_t is made taking θ̂_t as if it were the true parameter. For decision making, it is assumed that a set of stationary separated strategies CL = {μ(·, θ)}_{θ∈[0,1]}, parameterized by θ, is available; these policies will be restricted to be of the control-limit type with 0 < α(θ) < 1 (not necessarily optimal for each value of θ). We denote by Θ = [0, 1] the parameter space in which θ takes its values. The dependence of the transition matrix, one-step ahead probabilities, and a posteriori bad-state probabilities on θ will be expressed in the form P(u, θ), V(k, p, u, θ), and T(k, p, u, θ), respectively; of course, from (2.2) and (2.5)-(2.9), we see that only P(0, θ), V(k, p, 0, θ), and T(k, p, 0, θ) depend on θ. Also, the dependence of a policy (or of the "control-limit") in Figure 1 on θ is denoted by μ(p, θ) (or α(θ)). We also let θ_0 denote the (unknown) true value of the parameter, which we assume to be constant.

The certainty equivalence approach to this control problem is similar to that used for adaptive estimation in [AM], with the additional complication of feedback here. For a policy μ(·, θ) ∈ CL, the adaptive algorithm takes the form:

$$e_t = y_t - V\big(1,\ \hat p_{t-1|t-1},\ \mu(\hat p_{t-1|t-1}, \hat\theta_t),\ \hat\theta_t\big) \qquad (3.1)$$
$$\hat\theta_{t+1} = \pi_\Theta\Big(\hat\theta_t + \tfrac{1}{t+1}\, R_t^{-1}\, \psi_t\, e_t\Big) \qquad (3.2)$$
$$\hat p_{t+1|t+1} = T\big(1,\ \hat p_{t|t},\ \mu(\hat p_{t|t}, \hat\theta_{t+1}),\ \hat\theta_{t+1}\big)\, y_{t+1} + T\big(0,\ \hat p_{t|t},\ \mu(\hat p_{t|t}, \hat\theta_{t+1}),\ \hat\theta_{t+1}\big)\,(1 - y_{t+1}) \qquad (3.3)$$

where p̂_{0|0} = p_0 and θ̂_1 ∈ [0, 1] is arbitrary. Here, e_t is the prediction error, R_t is a positive definite

matrix which modifies the search direction, ψ_t is an approximation of the negative gradient of e_t


with respect to θ (evaluated at θ̂_t), and V(k, p, u, θ) and T(k, p, u, θ) are given in (2.5)-(2.9). Note that the values that θ̂_t, p̂_{t+1|t+1}, x_t, and y_t take depend implicitly on the whole path {θ̂_l}_{l=1}^{t+1}. The map π_Θ is a projection onto the parameter space. Its inclusion is necessary since otherwise θ̂_t will not necessarily be in Θ. The recursive estimate (3.1)-(3.3) is of the type analyzed by Kushner and Shwartz [KUS1], [KUS2], [KUS3], [KS], and Ljung and Söderström [LS]. The objective is first to prove convergence of θ̂_t to θ_0 in an appropriate sense, and then to prove that the long-run average cost (or asymptotically discounted cost, see [FAM], [HLM]) due to the adaptive policy is the same as would have been incurred if the true parameter had been known.

We use a Gauss-Newton search direction computed via

$$R_{t+1} = R_t + \tfrac{1}{t+1}\,\big[\psi_t\,\psi_t' - R_t\big], \qquad R_0 = I. \qquad (3.4)$$

It is useful first to write (3.1) and (3.3) for the constant parameter sequence {θ̂_l ≡ θ} [KUS2]. That is, for each θ and μ(·, θ) ∈ CL, we define the process {x_t(θ), y_t(θ), p_t(θ)}, where {x_t(θ)} is governed by P(μ(p_t(θ), θ), θ_0), {y_t(θ)} is related to {x_t(θ)} by Q defined in (2.3), and {p_t(θ)} is defined recursively by

$$p_{t+1}(\theta) = T\big[1,\ p_t(\theta),\ \mu(p_t(\theta), \theta),\ \theta\big]\; y_{t+1}(\theta) + T\big[0,\ p_t(\theta),\ \mu(p_t(\theta), \theta),\ \theta\big]\;\big(1 - y_{t+1}(\theta)\big), \qquad (3.5)$$

with p_0(θ) = p_0, and

$$e_t(\theta) = y_t(\theta) - V\big[1,\ p_{t-1}(\theta),\ \mu(p_{t-1}(\theta), \theta),\ \theta\big]. \qquad (3.6)$$

In [AM], the approximate gradient ψ_t is obtained by deriving an equation for ∂e_t(θ)/∂θ (for fixed θ) and then evaluating at θ = θ̂_t. Thus,

$$\partial e_t(\theta)/\partial\theta = \partial y_t(\theta)/\partial\theta - \partial V\big[1,\ p_{t-1}(\theta),\ \mu(p_{t-1}(\theta), \theta),\ \theta\big]/\partial\theta.$$

Contrary to the situation considered in [AM], the above derivatives are not well defined for all sample paths. This problem arises due to the discontinuity of μ(p, θ) in its first argument at p = α(θ), and the dependence on θ of both α(θ) and p_t(θ). Preliminary simulation results suggest a smooth dependence of α(θ) on θ. This, combined with the fact that both {y_t(θ)} and {μ(p_t(θ), θ)} take only two values, makes the following statement seem plausible: using an approximation that treats these derivatives as if they exist everywhere and are equal to zero, the algorithm will exhibit a satisfactory performance, as measured by the expectation with respect to an invariant measure (shown to exist uniquely, for each value of θ, in the sequel). The rationale behind the above is the fact that the expectation operator results in additional "smoothing," which may enable us to show that ∂/∂θ E^θ[p_t(θ)] = E^θ[∂p_t(θ)/∂θ], a fundamental result in the developments in [AM], where E^θ is the expectation operator with respect to the invariant measure. Then, using (2.6) and (2.8), we have that

$$\partial e_t(\theta)/\partial\theta = -(2q-1)\big[\,1 - p_{t-1}(\theta) + (1-\theta)\,\partial p_{t-1}(\theta)/\partial\theta\,\big], \quad \text{if } \mu(p_{t-1}(\theta), \theta) = 0, \qquad (3.7)$$
$$\partial e_t(\theta)/\partial\theta = 0, \quad \text{if } \mu(p_{t-1}(\theta), \theta) = 1. \qquad (3.8)$$

Using the aforementioned approximations, we obtain ζ_t(θ) := ∂p_t(θ)/∂θ from (3.5) as


$$\zeta_{t+1}(\theta) = \Big\{\gamma\big[1, p_t(\theta), \mu(p_t(\theta),\theta), \theta\big]\,\zeta_t(\theta) + \delta\big[1, p_t(\theta), \mu(p_t(\theta),\theta), \theta\big]\Big\}\, y_{t+1}(\theta)$$
$$\qquad\qquad + \Big\{\gamma\big[0, p_t(\theta), \mu(p_t(\theta),\theta), \theta\big]\,\zeta_t(\theta) + \delta\big[0, p_t(\theta), \mu(p_t(\theta),\theta), \theta\big]\Big\}\,\big(1 - y_{t+1}(\theta)\big), \qquad (3.9)$$

where ζ_0(θ) := 0, and

$$\gamma(k, p, 0, \theta) = \frac{q(1-q)(1-\theta)}{[V(k, p, 0, \theta)]^2}, \qquad \delta(k, p, 0, \theta) = \frac{q(1-q)(1-p)}{[V(k, p, 0, \theta)]^2}, \qquad k = 0, 1,$$
$$\gamma(k, p, 1, \theta) = \delta(k, p, 1, \theta) = 0, \qquad k = 0, 1.$$

ψ_t in (3.2) is thus calculated as

$$\psi_t = (2q-1)\big[\,1 - \hat p_{t-1|t-1} + (1 - \hat\theta_t)\,\hat\zeta_{t-1}\,\big], \quad \text{if } \mu(\hat p_{t-1|t-1}, \hat\theta_t) = 0, \qquad (3.10)$$
$$\psi_t = 0, \quad \text{if } \mu(\hat p_{t-1|t-1}, \hat\theta_t) = 1, \qquad (3.11)$$

where

$$\hat\zeta_{t+1} = \Big\{\gamma\big[1, \hat p_{t|t}, \mu(\hat p_{t|t}, \hat\theta_{t+1}), \hat\theta_{t+1}\big]\,\hat\zeta_t + \delta\big[1, \hat p_{t|t}, \mu(\hat p_{t|t}, \hat\theta_{t+1}), \hat\theta_{t+1}\big]\Big\}\, y_{t+1}$$
$$\qquad\qquad + \Big\{\gamma\big[0, \hat p_{t|t}, \mu(\hat p_{t|t}, \hat\theta_{t+1}), \hat\theta_{t+1}\big]\,\hat\zeta_t + \delta\big[0, \hat p_{t|t}, \mu(\hat p_{t|t}, \hat\theta_{t+1}), \hat\theta_{t+1}\big]\Big\}\,(1 - y_{t+1}), \qquad (3.12)$$

and ζ̂_0 := 0. Equations (3.1)-(3.4), (3.10)-(3.12) constitute the adaptive algorithm.
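The algorithm can be summarized in code. The sketch below performs one round of (3.1)-(3.4) and (3.10)-(3.12) when a new observation arrives; the estimate held at the start of the round plays the role of θ̂_{t+1} in (3.3) and (3.12), the Gauss-Newton weight R is kept scalar, V and T are the functions from the Section 2 snippet, and the control-limit law μ(p, θ) is passed in as an assumed function. It is a sketch, not the authors' implementation.

```python
def gamma(k, p, u, q, theta):
    """dT/dp from the definitions following (3.9); zero under replacement."""
    if u == 1:
        return 0.0
    return q * (1 - q) * (1 - theta) / V(k, p, 0, q, theta) ** 2

def delta(k, p, u, q, theta):
    """dT/dtheta from the definitions following (3.9); zero under replacement."""
    if u == 1:
        return 0.0
    return q * (1 - q) * (1 - p) / V(k, p, 0, q, theta) ** 2

def adaptive_round(p_hat, zeta_hat, theta_hat, R, y_new, t, q, policy):
    """One round of (3.1)-(3.4), (3.10)-(3.12) for a newly received observation y_new."""
    u = policy(p_hat, theta_hat)
    e = y_new - V(1, p_hat, u, q, theta_hat)                                          # (3.1)
    psi = (2 * q - 1) * (1 - p_hat + (1 - theta_hat) * zeta_hat) if u == 0 else 0.0   # (3.10)-(3.11)
    a = 1.0 / (t + 1)
    theta_next = min(max(theta_hat + a * psi * e / R, 0.0), 1.0)                      # (3.2), projected onto [0, 1]
    R_next = R + a * (psi ** 2 - R)                                                   # (3.4), scalar case
    p_next = T(1, p_hat, u, q, theta_hat) * y_new + T(0, p_hat, u, q, theta_hat) * (1 - y_new)   # (3.3)
    zeta_next = ((gamma(1, p_hat, u, q, theta_hat) * zeta_hat + delta(1, p_hat, u, q, theta_hat)) * y_new
                 + (gamma(0, p_hat, u, q, theta_hat) * zeta_hat + delta(0, p_hat, u, q, theta_hat)) * (1 - y_new))  # (3.12)
    return p_next, zeta_next, theta_next, R_next
```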

To facilitate the analysis, we define the Markov chain ξ_t := (x_t, y_t, p̂_{t|t}, ζ̂_t) and, for fixed θ, the Markov chain ξ_t(θ) = (x_t(θ), y_t(θ), p_t(θ), ζ_t(θ)). Then Σ := {0,1} × {0,1} × [0,1] × ℝ is the state space of either of these processes, and we let B_Σ denote its Borel σ-algebra. With this notation, we can write

$$\eta_{t+1} = \pi_C\big(\eta_t + a_t\, G(\eta_t, \xi_t)\big) \qquad (3.13)$$

with C a convex, closed set, a_t = 1/(t+1), and η_t = (θ̂_t, R_t). Notice that {(η_t, ξ_{t−1})} is a Markov chain. The ODE approach proceeds as follows: let τ_k := Σ_{i=1}^{k−1} a_i and denote by η̄^0(·) the piecewise linear function with value η_k at τ_k. Define the shifted functions η̄^k(·) by η̄^k(τ) = η̄^0(τ + τ_k), τ ≥ 0, and observe that η̄^k(0) = η_k. The idea is then to show either weak or w.p.1 convergence, as k → ∞, of the sequence {η̄^k(·)} to the solutions of an ordinary differential equation (ODE) associated with the algorithm (3.13). The study of the asymptotic behavior of {η_k} is thus

reduced to the analysis of the associated ODE.

In [AM], we utilized a theorem of Kushner [KUS2, Th. 3] in order to show almost sure convergence for the adaptive estimation algorithm. Define the "partial" transition function, for A ∈ B_Σ,

$$q^1(A \mid \xi, \eta) = P\{\xi_{n+1} \in A \mid \xi_n = \xi,\ \eta_{n+1} = \eta\},$$

and recursively, for j > 1,

$$q^j(A \mid \xi, \eta) = \int_\Sigma q^{j-1}(A \mid \xi', \eta)\; q^1(d\xi' \mid \xi, \eta).$$

This is thus the homogeneous probability transition function of the Markov process ξ_t(θ) for fixed θ, since the transitions of ξ_t depend on η only through the θ̂-component and not through R_t; because of this, we will sometimes use the


notation q^j(A | ξ, θ) for the same transition function. One of the crucial hypotheses in Kushner's

theorem is:

(H1) There exists a function Ḡ on C such that, for each n ∈ N₀ and η ∈ C, the function

$$\sum_{j=n}^{\infty} a_{j+1}\,\Big(E\big\{G(\eta, \xi_j(\theta)) \mid \xi_n(\theta)\big\} - \bar G(\eta)\Big)$$

is well defined and bounded in norm by K a_{n+1}, for some K < ∞, w.p.1.

In [AM], (H1) is proved by a detailed analysis of the sample paths of {ξ_t(θ)}, which turn

out to be stable in a very strong sense. In the adaptive control problem considered here, this is not true, and a different approach toward verifying (H1) must be taken. In addition, it is much more difficult to show for this problem that a given candidate for Ḡ(·) is continuously differentiable, and this is part of the hypothesis if one is to directly apply Kushner's theorem.

However, (H1) can be shown via the geometric ergodicity of {ξ_t(θ)}, which is defined as follows (cf. [ME], [NUM], [OR]). For each θ, an invariant probability measure for the process {ξ_t(θ)} is a probability measure π(·, θ) on B_Σ such that

$$\pi(A, \theta) = \int_\Sigma \pi(d\xi', \theta)\, q^1(A \mid \xi', \theta), \qquad A \in B_\Sigma.$$

Define the total variation norm ‖λ_1 − λ_2‖ for probability measures λ_1, λ_2 on B_Σ by

$$\|\lambda_1 - \lambda_2\| := \sup_f \Big|\int f\, d\lambda_1 - \int f\, d\lambda_2\Big|,$$

where the supremum is taken over all Borel functions f : Σ → [−1, 1].

Definition 3.1: Fix θ ∈ [0, 1]. The Markov chain {ξ_t(θ)} is uniformly geometrically ergodic if there exist r < 1, A < ∞, and a measure π(·, θ) on B_Σ such that, for every probability measure λ_0 on B_Σ and k ∈ N,

$$\Big\|\int \lambda_0(d\xi)\, q^k(\cdot \mid \xi, \theta) - \pi(\cdot, \theta)\Big\| \le A\, r^k.$$

If Ḡ(η) := ∫ G(η, ξ') π(dξ', θ), then hypothesis (H1) is clearly a consequence of uniform geometric ergodicity (roughly, since a_{j+1} ≤ a_{n+1} for j ≥ n, geometric decay of E{G(η, ξ_j(θ)) | ξ_n(θ)} − Ḡ(η) makes the tail sum in (H1) of order a_{n+1}), which we will concentrate on proving. The following is a consequence of [OR, Proposition 6.1 and Theorem 7.1].

Proposition 3.1: Fix θ ∈ [0, 1]. Suppose that the Markov chain {ξ_t(θ)} is aperiodic and that there exists a σ-finite measure φ on B_Σ such that Doeblin's condition is satisfied: for each A ∈ B_Σ such that φ(A) > 0, there exist n > 0 and ε > 0 such that

$$\sum_{k=1}^{n} q^k(A \mid \xi, \theta) \;\ge\; \varepsilon, \qquad \text{for all } \xi \in \Sigma. \qquad (3.14)$$

Then the Markov chain {ξ_t(θ)} has an invariant probability measure π(·, θ) and is uniformly geometrically ergodic.

In the terminology of Orey [OR], Doeblin's condition implies that the Markov chain is uniformly φ-recurrent. In the next section, we verify the hypotheses of Proposition 3.1, thus proving uniform geometric ergodicity of {ξ_t(θ)} for each θ ∈ [0, 1], and hence verifying (H1).


4. Uniform Geometric Ergodicity of {ξ_t(θ)}. In this section we consider, for each fixed θ ∈ [0, 1] and fixed policy μ(·, θ) ∈ CL, the Markov chain {ξ_t(θ) = (x_t(θ), y_t(θ), p_t(θ), ζ_t(θ))} defined above. In order to simplify the notation, let z_t(θ) = (p_t(θ), ζ_t(θ)); then z_t(θ) takes values in S := [0, 1] × ℝ (B_S denoting its Borel σ-algebra) and satisfies an equation of the form

$$z_{t+1}(\theta) = X\big[1, z_t(\theta), \mu(p_t(\theta), \theta), \theta\big]\, y_{t+1}(\theta) + X\big[0, z_t(\theta), \mu(p_t(\theta), \theta), \theta\big]\,\big(1 - y_{t+1}(\theta)\big), \qquad (4.1)$$

with the obvious definitions of X. We also define, for A ∈ B_S and i, j, k, m = 0, 1,

$$\big[\Gamma(A \mid z)\big]_{2k+i,\; 2m+j} := P\big\{\xi_{t+1}(\theta) \in (j, m, A) \mid \xi_t(\theta) = (i, k, z)\big\}. \qquad (4.2)$$

The transition function can then be written in the form, for A ∈ B_S and z = (p, ζ),

$$\Gamma(A \mid z) = \begin{bmatrix} P(\mu(p,\theta), \theta_0)\,Q & P(\mu(p,\theta), \theta_0)\,[I - Q]\\[2pt] P(\mu(p,\theta), \theta_0)\,Q & P(\mu(p,\theta), \theta_0)\,[I - Q] \end{bmatrix} \cdot \bar I(A, z), \qquad (4.3)$$

where

$$\bar I(A, z) = \begin{bmatrix} I\big(X[0, z, \mu(p,\theta), \theta] \in A\big)\cdot I_2 & 0\\[2pt] 0 & I\big(X[1, z, \mu(p,\theta), \theta] \in A\big)\cdot I_2 \end{bmatrix}; \qquad (4.4)$$

here, Q := diag(q, 1−q) and I_2 denotes the 2 × 2 identity matrix. Using (2.2), we thus have

$$\Gamma(A \mid z) = \begin{bmatrix} (1-\theta_0)q & \theta_0(1-q) & (1-\theta_0)(1-q) & \theta_0 q\\ 0 & 1-q & 0 & q\\ (1-\theta_0)q & \theta_0(1-q) & (1-\theta_0)(1-q) & \theta_0 q\\ 0 & 1-q & 0 & q \end{bmatrix} \cdot \bar I(A, z), \qquad (4.5a)$$

for μ(p, θ) = 0, while if μ(p, θ) = 1,

$$\Gamma(A \mid z) = \begin{bmatrix} q & 0 & 1-q & 0\\ q & 0 & 1-q & 0\\ q & 0 & 1-q & 0\\ q & 0 & 1-q & 0 \end{bmatrix} \cdot \bar I(A, z). \qquad (4.5b)$$
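The stochastic part of (4.3)-(4.5) is easy to assemble and check numerically. The following sketch (with illustrative names and parameter values) builds the 4 × 4 probability factor and verifies that its rows sum to one.

```python
import numpy as np

def transition_block(p, theta, theta0, q, policy):
    """Probability factor of (4.3): 4x4 matrix over (observation, core state) pairs.

    Rows correspond to (y_t, x_t) and columns to (y_{t+1}, x_{t+1}), grouped by the
    observation as in (4.3)-(4.5); `policy` is an assumed control-limit law.
    """
    u = policy(p, theta)
    P = np.array([[1 - theta0, theta0], [0.0, 1.0]]) if u == 0 else np.array([[1.0, 0.0], [1.0, 0.0]])
    Qd = np.diag([q, 1.0 - q])                        # the diagonal matrix Q of (4.3)
    block = np.hstack([P @ Qd, P @ (np.eye(2) - Qd)])
    return np.vstack([block, block])                  # the two row blocks coincide

M = transition_block(0.3, 0.1, 0.1, 0.8, lambda p, th: 0)
print(np.allclose(M.sum(axis=1), 1.0))                # every row sums to one
```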

Proposition 4.1: Let q ∈ (0.5, 1), θ ∈ [0, 1], and a policy μ(·, θ) ∈ CL be fixed. Then for every t ∈ N there exists ε_t > 0 such that, for all ξ_0(θ),

$$P\big\{y_1(\theta) = 1,\ y_2(\theta) = 1,\ \ldots,\ y_t(\theta) = 1 \mid \xi_0(\theta)\big\} \;\ge\; \varepsilon_t. \qquad (4.6)$$

Proof: Notice first that (omitting, for simplicity, the explicit dependence on θ) for k = 0, 1,

$$P\{y_{t+1} = k \mid y_t, x_t, p_t, \zeta_t\} = P\{y_{t+1} = k \mid x_{t+1} = 0,\ y_t, x_t, p_t, \zeta_t\}\; P\{x_{t+1} = 0 \mid y_t, x_t, p_t, \zeta_t\}$$
$$\qquad\qquad + P\{y_{t+1} = k \mid x_{t+1} = 1,\ y_t, x_t, p_t, \zeta_t\}\; P\{x_{t+1} = 1 \mid y_t, x_t, p_t, \zeta_t\}. \qquad (4.7)$$

An easy calculation from (4.5) shows that there is a number β > 0 such that the quantity in (4.7) is greater than or equal to β for all ξ_t. Now, letting {ȳ_t = 1} := {y_1 = 1, y_2 = 1, ..., y_t = 1}, we have

$$P\{\bar y_{t+1} = 1 \mid \xi_0\} = P\{y_{t+1} = 1 \mid \xi_0,\ \bar y_t = 1\}\cdot P\{\bar y_t = 1 \mid \xi_0\}. \qquad (4.8)$$


By the smoothing property of conditional expectations, the first quantity on the right hand side of (4.8) can be written via indicator functions as

$$E\Big[\,E\big[\,I(y_{t+1} = 1) \mid \xi_0,\ \bar y_t = 1,\ x_t,\ p_t\,\big] \;\Big|\; \xi_0,\ \bar y_t = 1\,\Big] \;\ge\; \beta.$$

Performing this calculation recursively, using the Markov property of {ξ_t(θ)}, shows that the proposition follows if we let ε_t = β^t. Q.E.D.

In order to verify the hypotheses of Proposition 3.1, we now define a measure φ on B_Σ. This is in terms of the "recurrent" state (0, 0, 0, 0) ∈ Σ, which is interpreted as follows. Notice from (2.4), (3.5), and (3.9) that if μ(p_t(θ), θ) = 1, then x_{t+1}(θ) = p_{t+1}(θ) = ζ_{t+1}(θ) = 0, while there is a positive probability that y_{t+1}(θ) = 0. We then define φ to be a measure on B_Σ such that φ({(0, 0, 0, 0)}) = 1 and φ(A) = 0 if (0, 0, 0, 0) ∉ A.

Theorem 4.1: Let q ∈ (0.5, 1), θ ∈ [0, 1], and a policy μ(·, θ) ∈ CL be fixed. Then {ξ_t(θ)} satisfies (H1).

Proof: As noted in Section 3, we need only verify the hypotheses of Proposition 3.1. With φ defined as above, the only sets A ∈ B_Σ with φ(A) > 0 are those containing (0, 0, 0, 0); hence, Doeblin's condition need only be verified for A = {(0, 0, 0, 0)}. This follows easily if we take (a) n in Doeblin's condition to be M + 1, where M is given by Lemma 2.2 for the fixed policy μ(·, θ); and (b) ε in Doeblin's condition to be ε_n, where ε_n is defined in Proposition 4.1 and n is defined in (a). The only remaining hypothesis of Proposition 3.1 is the aperiodicity of {ξ_t(θ)}, which follows easily from the definition (see [OR, pp. 12-15]). Q.E.D.

5. Conclusions. This paper represents the beginning stages of a program to address the adap-

tive control of partially observable Markov decision processes (POMDP) with finite state, action,

and observation spaces. We have reviewed the results on the control of POMDP with known

parameters, and, in particular, the results on the control of quality control/machine replacement

models. We have chosen to study the adaptive control of a problem with simple structure: the

two-state binary replacement problem. An adaptive control algorithm was defined, and initial

results in the direction of using the ODE method were obtained. The next steps, on which work is in progress, are (1) the verification of the other hypotheses of Kushner's theorem (or the proof of a modification of the theorem for this algorithm); (2) the analysis of the limit points of the ODE and of the convergence of the sequence of parameter estimates {θ̂_t}; (3) an analysis of the

optimality of the adaptive control algorithm.

The principal contribution of this paper has been the derivation of some ergodic properties

for the process {ξ_t(θ)}, for policies μ(·, θ) ∈ CL. It should be noted that such properties are to some degree independent of the particular {ζ_t} process used: the same would be true for any process with

a "recurrent" state and which can be put in a form similar to (3.9).

Acknowledgments: The authors would like to thank Dr. Sean Meyn of the Australian National University for helping us to understand geometric ergodicity and to prove Theorem 4.1.

This research was supported in part by the Air Force Office of Scientific Research under

grant AFOSR-86-0029, in part by the National Science Foundation under grant ECS-8617860,

in part by the DoD Joint Services Electronics Program through the Air Force Office of Scien-

tific Research (AFSC) Contract F49620-86-C-0045, and in part by the Texas Higher Education

Advanced Technology Program.


References

[AKO] V.A. Andriyanov, I.A. Kogan and G.A. Umnov, "Optimal Control of a Partially Observable Discrete Markov Process," Aut. Remot. C., 4, 1980, 555-561.

[AM] A. Arapostathis and S.I. Marcus, "Analysis of an Identification Algorithm Arising in the Adaptive Estimation of Markov Chains," Mathematics of Control, Signals and Systems, to appear.

[AS] K.J. Åström, "Optimal Control of Markov Processes with Incomplete State Information," J. Math. Anal. Appl., 10, 1965, 174-205.

[BE] D.P. Bertsekas, Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall, Englewood Cliffs, New Jersey, 1987.

[BS] D.P. Bertsekas and S.E. Shreve, Stochastic Optimal Control: The Discrete Time Case, Academic Press, New York, NY, 1978.

[FAM] E. Fernández-Gaucherand, A. Arapostathis, and S.I. Marcus, "On the Adaptive Control of a Partially Observable Markov Decision Process," Proc. 27th IEEE Conf. on Decision and Control, Austin, Texas, 1988.

[HLM] O. Hernández-Lerma, "Adaptive Markov Control Processes," preprint, 1987.

[HS] D.P. Heyman and M.J. Sobel, Stochastic Models in Operations Research, Vol. II: Stochastic Optimization, McGraw-Hill, New York, 1984.

[KS] H.J. Kushner and A. Shwartz, "An Invariant Measure Approach to the Convergence of Stochastic Approximations with State Dependent Noise," SIAM J. Control and Optim., 22, 1984, 13-27.

[KUM1] P.R. Kumar, "A Survey of Some Results in Stochastic Adaptive Control," SIAM J. Control and Optim., 23, 1985, 329-380.

[KUM2] P.R. Kumar, "Optimal Adaptive Control of Linear Quadratic Gaussian Systems," SIAM J. Control and Optim., 21, 1983, 163-178.

[KUS1] H.J. Kushner, "Stochastic Approximation with Discontinuous Dynamics and State Dependent Noise: w.p.1 and Weak Convergence," J. Math. Anal. Appl., 82, 1981, 527-542.

[KUS2] H.J. Kushner, "An Averaging Method for Stochastic Approximations with Discontinuous Dynamics, Constraints, and State Dependent Noise," in Recent Advances in Statistics, Rizvi, Rustagi and Siegmund, Eds., Academic Press, New York, 1983, 211-235.

[KUS3] H.J. Kushner, Approximation and Weak Convergence Methods for Random Processes, MIT Press, Cambridge, MA, 1984.

[KV] P.R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification and Adaptive Control, Prentice-Hall, Englewood Cliffs, New Jersey, 1986.

[LO1] W.S. Lovejoy, "On the Convexity of Policy Regions in Partially Observed Systems," Operations Research, 35, 1987, 619-621.

[LS] L. Ljung and T. Söderström, Theory and Practice of Recursive Identification, MIT Press, Cambridge, MA, 1983.

[ME] S.P. Meyn, "Ergodic Theorems for Discrete Time Stochastic Systems Using a Generalized Stochastic Lyapunov Function," preprint, 1988.

[MMS] D.J. Ma, A.M. Makowski and A. Shwartz, "Estimation and Optimal Control for Constrained Markov Chains," Proc. 25th IEEE Conf. on Decision and Control, Athens, Greece, 1986, 994-999.

[MO] G.E. Monahan, "A Survey of Partially Observable Markov Decision Processes: Theory, Models, and Algorithms," Management Sci., 28, 1982, 1-16.

[NUM] E. Nummelin, General Irreducible Markov Chains and Non-Negative Operators, Cambridge University Press, New York, 1984.

[OR] S. Orey, Limit Theorems for Markov Chain Transition Probabilities, Van Nostrand Reinhold Mathematical Studies 34, London, 1971.

[RO] S.M. Ross, Introduction to Stochastic Dynamic Programming, Academic Press, New York, NY, 1983.

[SM] A. Shwartz and A.M. Makowski, "An Optimal Adaptive Scheme for Two Competing Queues with Constraints," 7th Intern. Conf. on Analysis and Optimization of Systems, Antibes, France, 1986, 515-532.

[W1] C.C. White, "A Markov Quality Control Process Subject to Partial Observation," Management Sci., 23, 1977, 843-852.

[W2] C.C. White, "Optimal Control-Limit Strategies for a Partially Observed Replacement Problem," Internat. J. Systems Science, 10, 1979, 321-331.