
Model-based Reinforcement Learning with State Aggregation

Cosmin Paduraru, Robert Kaplow, Doina Precup and Joelle Pineau

McGill University

Abstract. We address the problem of model-based reinforcement learning in infinite state spaces. One of the simplest and most popular approaches is state aggregation: discretize the state space, build a transition model over the resulting aggregate states, then use this model to compute a policy. In this paper, we provide theoretical results that bound the performance of model-based reinforcement learning with state aggregation as a function of the number of samples used to learn the model and the quality of the discretization. To the best of our knowledge, these are the first sample complexity results for model-based reinforcement learning in continuous state spaces. We also investigate how our bounds compare with the empirical performance of the analyzed method.

1 Introduction

Data-efficient reinforcement learning methods have been the focus of much recent research, due to their practical importance. Model-based reinforcement learning (MBRL) is widely accepted as a potential solution for problems where data-efficiency is important. The main idea is to use samples of experience to build a model of the environment; this model can then be used to compute a policy using a variety of methods, e.g., value iteration, policy iteration, approximate dynamic programming, etc. MBRL has been extensively studied for Markov Decision Processes (MDPs) with finite state spaces. For example, Kearns & Singh (1999) proved that the convergence rates of Q-learning and MBRL methods are the same. Strehl et al. (2006a, 2006b) provided PAC-style guarantees for a variety of MBRL methods. Mannor et al. (2007) studied the bias and variance of the model estimates in the context of policy evaluation.

In infinite state spaces, a standard approach to MBRL uses state aggregation. The state space is grouped into partitions, a transition model is learned over those partitions, and then used to optimize a policy. Kuvayev & Sutton (1996) demonstrated empirically that using Dyna-style updates with a learned state aggregation model can significantly speed up on-line, value-based reinforcement learning. The Parti-game algorithm (Moore & Atkeson, 1995) and other variable resolution dynamic programming methods (e.g., Munos & Moore, 2002) are aimed at finding a good state aggregation. They analyze the error of the value function estimates; however, there is no analysis of the performance of the greedy policy induced by the learned value function. Least-squares policy iteration (Lagoudakis & Parr, 2003) builds an expectation model of the next state based on a batch of data. It treats the case of linear function approximation, but in the special case of state aggregation, the expectation model and the transition model are identical. However, no sample complexity bounds are provided for the algorithm.

There are also approaches to data-efficient reinforcement learning that process an entire batch of data at once, but do not build a transition model explicitly, such as experience replay (Lin, 1992) or fitted value iteration (Ernst et al., 2005). In particular, Antos et al. (2007) provide PAC-style bounds for the performance of fitted value iteration in continuous state spaces.

In this paper, we provide a theoretical analysis of MBRL with state aggregation. We prove a bound on the L∞ loss of a policy based on an approximate model. The bound has two terms: one depending on the quality of the state aggregation, and one depending on the quality of the transition model between different aggregate states. The bound highlights the intuitive trade-off between the resolution of the aggregation and the number of samples needed to compute a good policy.

We note that almost all previous error bounds present in the literature on value function approximation bound the error in the value function estimation. In contrast, our bound directly measures the quality of the policy induced by the value estimate. In other words, we quantify how well the policy computed on the approximate model will perform in the original MDP. This is more directly related to the actual performance obtained when using this method. We further bound the second term based on the L1 norm of the error in the estimation of the transition probability distributions, which allows us to quantify the relationship between the performance of the algorithm and the number of samples available to learn the model. To our knowledge, these are the first PAC-style guarantees for MBRL in infinite state spaces. We also illustrate the empirical performance of the method in contrast with the theoretical bounds.

2 MBRL with State Aggregation

We assume that the agent has to solve a Markov Decision Process M = 〈S, A, P, R〉, where S is the set of states, A is a set of actions, P is the transition probability distribution, with P(s′|s,a) denoting the probability of the transition from s to s′ under action a, and R : S×A → R is the reward function. In this paper we assume that A is finite, but S may be finite or infinite. In the latter case, we assume P to be a probability density function. We also assume a given discrete partition Ω of the state space S.

The goal of the agent is to compute an optimal policy π*, a mapping from states to actions which optimizes the expected long-term return. The optimal state-action value function, Q* : S×A → R, reflects the optimal expected return for every state-action pair, and is the solution to the well-known Bellman equations:

Q*(s,a) = R(s,a) + γ ∑_{s′} P(s′|s,a) max_{a′} Q*(s′,a′),   ∀ s,a,

where γ ∈ (0,1) is the discount factor. If the state space is continuous, this result can be generalized to the Hamilton-Jacobi-Bellman equations, under mild conditions, which allow replacing the sum above by an integral and the max over actions by a sup (Puterman, 1994). Value iteration-type algorithms work by turning this set of equations into update rules, and applying these iteratively until the value estimates stabilize.
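For a finite MDP with a known model (such as the partition-based models used later in the paper), this update rule can be written in a few lines. The sketch below is only illustrative; it assumes the model is stored as NumPy arrays P[a, s, s′] and R[s, a], and all function and variable names are ours, not the paper's.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration for a finite MDP.

    P: array of shape (A, S, S), with P[a, s, sp] = P(sp | s, a).
    R: array of shape (S, A), immediate rewards R(s, a).
    Returns the state-action value function Q of shape (S, A).
    """
    num_states, num_actions = R.shape
    Q = np.zeros((num_states, num_actions))
    while True:
        V = Q.max(axis=1)  # max over a' of Q(s', a')
        # Bellman update: Q(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
        Q_new = R + gamma * np.einsum('asz,z->sa', P, V)
        if np.max(np.abs(Q_new - Q)) <= tol:
            return Q_new
        Q = Q_new
```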

We assume that the agent has access to a set of sampled transitions (s, a, s′), but does not know the transition model P. We also assume that the reward function R is given. This assumption is made for the clarity of the results, and can be lifted easily.

In MBRL, the set of samples is used to estimate a transition model over partitions, P̂_Ω(ω′|ω,a), by computing the empirical frequency of transitions from ω to ω′ under action a in the data. If no transition is observed for a given (ω,a) pair, the model is given some default initial value. In our later experiments we will consider two such initializations: the uniform model, in which a uniformly random transition to all partitions is assumed, and a loop-back model, in which we assume a transition back to ω with probability 1.
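A minimal sketch of this estimation step, assuming the samples have already been discretized into (ω, a, ω′) index triples; the function and argument names below are ours.

```python
import numpy as np

def estimate_partition_model(samples, num_partitions, num_actions, default="uniform"):
    """Empirical transition model over partitions.

    samples: iterable of (w, a, w_next) index triples (already discretized).
    default: "uniform" or "loopback", used for (w, a) pairs with no observed data.
    Returns P_hat of shape (num_actions, num_partitions, num_partitions).
    """
    counts = np.zeros((num_actions, num_partitions, num_partitions))
    for w, a, w_next in samples:
        counts[a, w, w_next] += 1.0

    P_hat = np.zeros_like(counts)
    for a in range(num_actions):
        for w in range(num_partitions):
            n = counts[a, w].sum()
            if n > 0:
                P_hat[a, w] = counts[a, w] / n      # empirical frequencies
            elif default == "uniform":
                P_hat[a, w] = 1.0 / num_partitions  # uniform initialization
            else:
                P_hat[a, w, w] = 1.0                # loop-back initialization
    return P_hat
```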

Once the model is estimated, value iteration (or other methods) can be used to compute a value function, Q̂_Ω, and a policy that is greedy with respect to this value function, π̂*_Ω.
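Combining the two sketches above (all names are ours; R_omega stands for the known reward function written over partitions), the planning step just described is roughly:

```python
P_hat = estimate_partition_model(samples, num_partitions, num_actions, default="uniform")
Q_hat = value_iteration(P_hat, R_omega, gamma=0.95)  # R_omega: array of shape (num_partitions, num_actions)
pi_hat = Q_hat.argmax(axis=1)                        # greedy policy over partitions
```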

3 Theoretical Results

Our goal is to bound the difference between the optimal value function, Q*, and the true value of the policy learned in the partition-based model, Q^{π̂*_Ω}, in the MDP M. We will achieve our bound in two steps. In the first part, we compare the optimal policy for the "exact" partition-based model to π*, the optimal policy in the original MDP. Intuitively, the "exact" model is the model that would be learned with an infinite amount of data, and we define it formally below. We show that the performance of the optimal partition-based policy will depend on a measure of the quality of the partition. For the second part, we will bound the distance between the performance of π̂*_Ω and the performance of the optimal policy for the "exact" partition-based model. The bound will of course depend on the number of samples used to learn the approximate model. Putting the two bounds together allows us to describe the performance of π̂*_Ω in the original MDP M.

Given MDP M, a density function over the state space d : S → R, and a finite partition of the state space Ω, we define the exact partition-based model P_Ω as follows: for any ω, ω′ ∈ Ω,

P_Ω(ω_{t+1} = ω′ | ω_t = ω, a_t = a, d) = P(s_{t+1} ∈ ω′ | s_t ∈ ω, a_t = a, s_t ∼ d).

For simplicity, we will mostly use the shorthand notation P_Ω(ω′|ω,a,d) = P_Ω(ω_{t+1} = ω′ | ω_t = ω, a_t = a, d). The dependency on d, which had to be made explicit, means that s_t is generated from d. In practice, d could be the stationary distribution of some policy, or some other distribution over states that is used to generate the set of training instances.

In order to keep the proofs more readable, we assume that the rewards in the original MDP M are a function of the partition only. Namely, we assume that R(s,a) = R(ω(s),a), where ω(s) is the partition to which s belongs. Therefore, we can just define R_Ω(ω,a) = R(s,a) for some s ∈ ω. We will later analyze what happens when this assumption is not met.

Now we can define π*_Ω to be the optimal policy of M_Ω = (Ω, A, P_Ω, R_Ω), Q_Ω^{π*_Ω} to be the action-value function of π*_Ω when executed in M_Ω, and Q^{π*_Ω} to be the value of π*_Ω when executed in M.

The following results use L_p norms, defined as ‖f‖_p = (∫ |f(x)|^p dx)^{1/p}. The limit as p goes to ∞ of the L_p norm is called the L_∞ norm, defined as ‖f‖_∞ = sup_x |f(x)|.

The first result describes how well π*_Ω will perform in M.

Lemma 1. In the setting described above we have

‖Q* − Q^{π*_Ω}‖_∞ ≤ (2γ / (1 − γ)) ‖Q_Ω^{π*_Ω}‖_∞ max_{a,ω} sup_{s∈ω} ∑_{ω′∈Ω} sup_{x∈ω} |P(ω′|s,a) − P(ω′|x,a)|.

The proof of this result can be found in Appendix A.

Now we analyze what happens when we use a learned partition model instead of the exact partition model. This is an instance of the general problem of computing the optimal policy for one finite MDP and running it in another finite MDP that has the same state space, but a different transition model. Thus, we can adapt Lemma 1 to deal with this case by simply considering that there is a one-to-one mapping between states and partitions. If we define the learned MDP as M̂_Ω = (Ω, A, P̂_Ω, R_Ω), where P̂_Ω is the learned transition model, following the exact same steps as in the proof of Lemma 1 results in

‖Q_Ω^{π̂*_Ω} − Q_Ω^{π*_Ω}‖_∞ ≤ (2γ / (1 − γ)) ‖Q_Ω^{π*_Ω}‖_∞ max_{a,ω} ∑_{ω′∈Ω} |P̂_Ω(ω′|ω,a) − P_Ω(ω′|ω,a)|.   (1)

The sum appearing in the bound above is simply the L1 distance between the learned partition-based model and the exact partition-based model. To get a sample complexity result, we would like to establish a high-probability bound for the maximum value of this distance. Such a bound is established by the following result, proven in Appendix B.

Lemma 2. For any finite MDP M and any ε > 0, if z ≥ 20|S|/ε², then

P( max_{s,a} ‖P(·|s,a) − P̂(·|s,a)‖_1 ≤ ε ) ≥ (1 − 3e^{−zε²/25})^{|S||A|} P(min(N) ≥ z).

The bound depends on the state visitation vector N, an |S||A|-dimensional random vector whose (s,a) component equals N(s,a) (the number of times the state-action pair (s,a) was visited during data collection). We note that the probability of error goes to 0 exponentially as the number of samples z → ∞. Also, z does not really depend on the size of the state space, but on the worst branching factor of any action (i.e., the maximum number of states to which it can transition). In many MDPs the transitions are sparse, so the branching factor will be much lower than |S|.

Before we state our main result, we also need to bound ‖Q^{π̂*_Ω} − Q_Ω^{π̂*_Ω}‖_∞. This can be done by adapting the second part of the proof of Lemma 1 to show that, for any policy π,

‖Q^π − Q_Ω^π‖_∞ ≤ (γ / (1 − γ)) ‖Q_Ω^π‖_∞ max_{a,ω} sup_{s∈ω} ∑_{ω′∈Ω} sup_{x∈ω} |P(ω′|s,a) − P(ω′|x,a)|

and using π = π̂*_Ω inside the inequality. Also note that, since π*_Ω is optimal in M_Ω, ‖Q_Ω^{π̂*_Ω}‖_∞ ≤ ‖Q_Ω^{π*_Ω}‖_∞.

Putting this together with Lemma 1, equation (1) and Lemma 2, and applying the triangle inequality, we obtain our main result:

Theorem 1. For any ε > 0 and z ≥ 20|Ω|/ε², we have

P( ‖Q^{π̂*_Ω} − Q*‖_∞ ≤ (2γ / (1 − γ)) ‖Q_Ω^{π*_Ω}‖_∞ (2Δ_Ω + ε) ) ≥ (1 − 3e^{−zε²/25})^{|Ω||A|} P(min(N) ≥ z),

where

Δ_Ω = max_{a,ω} sup_{s∈ω} ∑_{ω′∈Ω} sup_{x∈ω} |P(ω′|s,a) − P(ω′|x,a)|.

The quantity Δ_Ω appearing in Theorem 1 is a measure of how well the partition Ω "fits in" with the true model of the MDP M. The more similar the states in the same partition are in terms of their transition model, the lower Δ_Ω will be. This connects the theory with the practical reasoning that, for model-based reinforcement learning with state aggregation, state-action pairs should be grouped according to whether they transition to similar states. This reasoning (and perhaps the theoretical results as well) could potentially be extended to other forms of model approximation.
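For a finite MDP and a fixed partition, Δ_Ω can be computed directly from its definition. The sketch below assumes the true model is available as an array P[a, s, s′] and the partition as an array part[s] of partition indices; these names are our own, not the paper's.

```python
import numpy as np

def partition_quality(P, part, num_partitions):
    """Delta_Omega = max_{a,w} sup_{s in w} sum_{w'} sup_{x in w} |P(w'|s,a) - P(w'|x,a)|."""
    num_actions, num_states, _ = P.shape
    # Aggregate next-state probabilities into partition probabilities P(w' | s, a).
    P_to_part = np.zeros((num_actions, num_states, num_partitions))
    for w in range(num_partitions):
        P_to_part[:, :, w] = P[:, :, part == w].sum(axis=2)

    delta = 0.0
    for a in range(num_actions):
        for w in range(num_partitions):
            states = np.where(part == w)[0]
            if states.size == 0:
                continue
            rows = P_to_part[a, states]                # |w| x num_partitions
            col_min, col_max = rows.min(axis=0), rows.max(axis=0)
            # For each s in w: sum over w' of sup_{x in w} |P(w'|s,a) - P(w'|x,a)|
            per_state = np.maximum(rows - col_min, col_max - rows).sum(axis=1)
            delta = max(delta, float(per_state.max()))
    return delta
```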

The quantity P(min(N) ≥ z) measures the probability of getting at least z samples for each state-action pair. N is the random vector describing how many times each state-action pair is seen; it is multinomially distributed with parameters m and d_Ω, where m is the total number of samples used and d_Ω is the distribution induced by d on the partition space.

For a fixed number of states and actions, P(min(N) ≥ z) will clearly converge to 1 as the number of samples m goes to infinity. On the other hand, the exact value of P(min(N) ≥ z) is difficult to evaluate for reasonably large values of m. To the best of our knowledge, there is no simple closed-form formula for this quantity or a tight approximation of it. Corrado (2007) presents a method that permits "rapid calculation of exact probabilities" for the minimum of a multinomial. However, this method requires storing matrices of size m×m, so it is only applicable to moderate sample sizes. The search for a closed-form approximation to P(min(N) ≥ z), which would also allow us to express m as a function of the desired error ε and a confidence parameter δ, is left for future work.
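Although no convenient closed form is known, P(min(N) ≥ z) is straightforward to estimate by simulation for a given m and sampling distribution. A small Monte Carlo sketch, under the assumption that state-action pairs are drawn i.i.d. from a known distribution p (uniform in the example); the names are ours.

```python
import numpy as np

def prob_min_count_at_least(m, p, z, num_sims=10000, seed=0):
    """Monte Carlo estimate of P(min(N) >= z) for N ~ Multinomial(m, p)."""
    rng = np.random.default_rng(seed)
    counts = rng.multinomial(m, p, size=num_sims)  # num_sims x len(p)
    return float(np.mean(counts.min(axis=1) >= z))

# Example: 64 aggregate states, 4 actions, uniform sampling, 10 samples per pair on average.
k = 64 * 4
print(prob_min_count_at_least(m=10 * k, p=np.ones(k) / k, z=5))
```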

Let us now discuss what happens when the rewards are either unknown or do not depend on the partition alone. For the first situation, rewards could be learned from experience, and the bound in Equation 1 would include a term that measures the error in the estimated rewards. This term would be added to the error in the transition model. Note, however, that in most applications the rewards are provided by the programmer, and therefore known. The second type of situation, when the rewards are a function of S but not of Ω, seems more likely to occur in practice. In this case, the term max_{a,ω} sup_{s,x∈ω} |R(s,a) − R(x,a)|, measuring how good Ω is in terms of expressing rewards, would have to be added to Δ_Ω.

Fig. 1. Random MDP results. Shown is the average value of max_{s,a} |Q* − Q^{π̂*}| as a function of the number of data exploration steps, for SARSA and for the model-based method with the uniform and loop-back initializations; panel (a): τ = 0.0, panel (b): τ = 1.0.

4 Experimental Results

In this section we provide experiments illustrating how the performance of model-based methods varies with the amount of available experience and the quality of the partition.

4.1 Empirical Results for Randomly Generated Finite MDPs

The first experiments use randomly generated MDPs with finite state spaces, in which the model can be represented exactly. This allows us to ignore the effect of the quality of the partition and to focus on how the number of samples used to learn the model affects performance.

We used a suite of MDPs with randomly generated transition and reward models. In order to produce environments that are more similar to typical RL tasks, some of these MDPs were designed to have a 2D lattice-like structure. The lattice has n² states, where n is the length of the side, and four actions. Each state has a set of four neighboring states; the effect of each action, when the environment is fully lattice-like, is to take the agent to the corresponding neighboring state with probability 0.8; with probability 0.2, there is a uniformly random transition to one of the corresponding next state's neighbors. The degree to which the transition model is lattice-like depends on the parameter τ ∈ [0,1], which denotes the probability that an action will behave randomly. Thus, with probability τ the effect of the action is to take the agent randomly to one of m successor states (the successor states for each state are uniformly randomly chosen in advance). For example, τ = 1 means that the lattice structure has no effect. The reward for each state-action pair is 0 with probability 0.9; with probability 0.1, the reward is drawn uniformly from [0,1]. The discount factor is γ = 0.95.
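A sketch of one way to generate such a test MDP, following the description above; details that the paper does not specify, such as lattice neighbors wrapping around at the edges, are our assumptions.

```python
import numpy as np

def make_random_lattice_mdp(n=8, m=5, tau=0.0, seed=0):
    """Random MDP with n*n states and 4 actions, mixing a noisy 2D lattice with random successors.

    With probability 1 - tau an action behaves as a noisy lattice move; with probability tau
    it jumps uniformly to one of m pre-chosen random successor states.
    Returns P of shape (4, S, S) and R of shape (S, 4).
    """
    rng = np.random.default_rng(seed)
    S = n * n
    # Torus wrap assumed; also assumes n >= 3 so the four neighbors of a state are distinct.
    neighbors = np.zeros((S, 4), dtype=int)  # up, down, left, right
    for s in range(S):
        r, c = divmod(s, n)
        neighbors[s] = [((r - 1) % n) * n + c, ((r + 1) % n) * n + c,
                        r * n + (c - 1) % n, r * n + (c + 1) % n]
    successors = np.array([rng.choice(S, size=m, replace=False) for _ in range(S)])

    P = np.zeros((4, S, S))
    for s in range(S):
        for a in range(4):
            lattice = np.zeros(S)
            lattice[neighbors[s, a]] += 0.8                  # intended neighbor
            lattice[neighbors[neighbors[s, a]]] += 0.2 / 4   # noise: a neighbor of the next state
            random_part = np.zeros(S)
            random_part[successors[s]] = 1.0 / m             # uniformly random successor
            P[a, s] = (1 - tau) * lattice + tau * random_part
    # Rewards: 0 with probability 0.9, otherwise drawn uniformly from [0, 1].
    R = rng.uniform(0.0, 1.0, size=(S, 4)) * (rng.random((S, 4)) < 0.1)
    return P, R
```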

The model is learned using maximum likelihood with the uniform and loop-back priors as described in Section 2. The data is generated by sampling state-action pairs uniformly at random, then sampling the next state from the true distribution.
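A sketch of this data-collection scheme, assuming the true model is available as an array P[a, s, s′]; the names are ours.

```python
import numpy as np

def collect_uniform_samples(P, num_samples, seed=0):
    """Sample (s, a) uniformly at random and s' from the true transition model."""
    rng = np.random.default_rng(seed)
    num_actions, num_states, _ = P.shape
    samples = []
    for _ in range(num_samples):
        s = int(rng.integers(num_states))
        a = int(rng.integers(num_actions))
        s_next = int(rng.choice(num_states, p=P[a, s]))
        samples.append((s, a, s_next))
    return samples
```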

We computed π̂*, the optimal policy for the learned model, by doing value iteration with the learned transition model and the (known) reward function. Value iteration was run on the learned model until the maximum change in the state-action values was ≤ 0.000001. Then we evaluated the performance of π̂* in the real environment, by using policy evaluation. More precisely, we computed max_{s,a} |Q* − Q^{π̂*}|, where Q* is the optimal value function of the true MDP. As a baseline, we compared against the well-known SARSA algorithm (Sutton & Barto, 1998), an on-line, on-policy, model-free algorithm. We ran SARSA for the same number of steps as the number of samples collected, and we used the same scheme to compare the policy produced by SARSA to the optimal policy. The learning rate α was set to 0.001 for SARSA (as it seemed to perform best in initial experiments), and we used ε-greedy exploration with ε = 0.1.
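The evaluation step can be written compactly: solve the linear policy-evaluation equations for the greedy policy of the learned model in the true MDP, then compare with Q*. A sketch reusing the hypothetical value_iteration helper sketched in Section 2; all names are ours.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma=0.95):
    """Exact policy evaluation in a finite MDP: solve (I - gamma * P_pi) V = R_pi."""
    num_states, num_actions = R.shape
    P_pi = P[policy, np.arange(num_states)]  # (S, S): P(s' | s, policy(s))
    R_pi = R[np.arange(num_states), policy]  # (S,)
    V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R_pi)
    # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
    return R + gamma * np.einsum('asz,z->sa', P, V)

def policy_loss(P, R, Q_hat, gamma=0.95):
    """max_{s,a} |Q*(s,a) - Q^{pi_hat}(s,a)| for the greedy policy of Q_hat."""
    Q_star = value_iteration(P, R, gamma)  # hypothetical helper sketched in Section 2
    pi_hat = Q_hat.argmax(axis=1)
    Q_pi_hat = evaluate_policy(P, R, pi_hat, gamma)
    return float(np.max(np.abs(Q_star - Q_pi_hat)))
```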

The results for 64 states and connectivity m = 5, with τ = 0.0 and τ = 1.0, are shown in Figure 1. The results were averaged over 60 independent runs, each with a different random MDP. The error bars were vanishingly small and were removed for clarity.

As expected, the value of max_{s,a} |Q* − Q^{π̂*}| decreases as the number of data samples increases. With little or no data, the performance is different for the two initial models, but this effect disappears as the agent starts to see samples for all state-action pairs. With no data, the algorithm has the same performance as an agent that always chooses the action with the highest immediate reward. We also notice that the overall shape of the graph is similar for the lattice-like structure and the completely random MDP. The model-based approach clearly outperforms the SARSA baseline. This can be largely attributed to the fact that SARSA only gets on-policy data and only does one update per sample. We also experimented with batch experience-replay methods. The results are not shown here, since the learning curves were similar to those of the model-based methods, as is to be expected for discrete MDPs. The effect of the initialization for the model seemed insignificant.

Next we examine the effect of varying the number of states on the speed of convergence. We created random MDPs with |S| ∈ {20, 50, 100, 200}, four actions and m = |S|/10. For each value of |S|, we plotted the number of samples required in order to have max_{s,a} |Q* − Q^{π̂*}| < 0.5 for more than 90% of the MDPs. The results are averaged over 20 random MDPs of each size and are shown in Figure 2.

Our results displayed a linear trend in the number of samples required to obtain a good policy as the number of states increases, while the number of actions is kept fixed and the connectivity is kept to a fixed fraction of the number of states. More such experiments would confirm whether this is indeed a general trend.

Fig. 2. Number of samples required for max_{s,a} |Q* − Q^{π̂*}| < 0.5 in more than 90% of the randomly generated MDPs, as a function of the size of the state space.

We also looked at how the estimated model improves with the number of samples. In Figure 3, we show how the L1 error of the estimated model decreases with the number of samples. Plugging these errors into the bound in Theorem 1 results in values that are far from the empirical performance of π̂*; this, however, is expected for bounds of this type.

Fig. 3. The L1 error in estimating the model, max_{s,a} ‖P(·|s,a) − P̂(·|s,a)‖_1, as a function of the number of data exploration steps, for the uniform and loop-back model initializations.

4.2 Empirical Results for MDPs with Continuous State Space

We tested the algorithm on a modification of the well-known Mountain Car domain (Moore, 1990). The left and right actions were made stochastic, with Gaussian noise added with standard deviation σ = 0.001. The goal region was the standard Mountain Car goal of x ≥ 0.5, with the added restriction that the agent stops at the top of the mountain (|v| < 0.01). The reward was -1 everywhere except for 0 at the goal, and a reward of -100 was given if the agent exceeded 10000 time steps without reaching the goal. The discretization divided the position and velocity dimensions into 40 partitions each, giving a total of 1600 partitions. Value iteration was run on the learned model until the maximum change in the state-action value function was ≤ 0.0001, up to a maximum of 100 steps.
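A sketch of the corresponding uniform discretization, mapping a continuous (position, velocity) pair to one of the 40 × 40 = 1600 aggregate states; the Mountain Car state bounds used here are the conventional ones and are an assumption on our part.

```python
import numpy as np

# Conventional Mountain Car bounds (assumed, not specified in the paper).
POS_MIN, POS_MAX = -1.2, 0.6
VEL_MIN, VEL_MAX = -0.07, 0.07
BINS = 40

def discretize(position, velocity):
    """Map a continuous Mountain Car state to a partition index in [0, BINS * BINS)."""
    i = int(np.clip((position - POS_MIN) / (POS_MAX - POS_MIN) * BINS, 0, BINS - 1))
    j = int(np.clip((velocity - VEL_MIN) / (VEL_MAX - VEL_MIN) * BINS, 0, BINS - 1))
    return i * BINS + j
```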

We ran our algorithm with both model initialization methods. The average return of the resulting policy was estimated by averaging 10 independent runs of Monte Carlo policy evaluation on the underlying environment, where each episode was run for a maximum of 10000 steps. We compare our policy to the policy obtained by running SARSA with α = 0.02 and ε = 0.1 until convergence. The results are averaged over 20 independent trials and are shown in Figure 4. The behavior is qualitatively very similar to that of the random MDPs.

Fig. 4. Results for stochastic Mountain Car with stopping at the top: average return as a function of the number of data exploration steps, for the model-based method with the uniform and loop-back initializations and for converged SARSA. MBRL converged to a good estimate with about 10 samples per state-action pair on average.

4.3 Discussion

The empirical results provide intuition with respect to the amount of data required in order to compute a good policy, when that policy is computed using the model-based approach with a discrete model. In the experiments, we considered a favorable scenario by assuming that we can sample each state-action pair uniformly. We chose this distribution as it is guaranteed to sample each state-action pair, which is required for satisfying Theorem 1. In practice, the distribution will likely not be uniform, as the data is typically collected by running some policy in the environment. This can lead to situations where important regions of the state space have low probability. In fact, we have observed in experiments not reported here that not visiting some part of the state space can have hazardous effects, even if that region is not important for the optimal policy. Note that in this case, our bound will also predict high errors, as the maximum L1 error will remain high.

5 Conclusions and future work

We presented the first sample complexity results for model-based reinforcement learning with state aggregation. Our bounds highlight the trade-off between the quality of the partition and the number of samples needed for a good approximation.

There are several avenues for future work. The quality of the PAC result used to estimate the L1 error in the model can be tightened further, using Azuma's inequality (this is an avenue which we are actively pursuing at the moment). There are several other ways in which the results can be tightened. Instead of using the L∞ norm, we could work with an Lp norm. Munos (2007) provides a theoretical analysis of approximate value iteration with the Lp norm, which can be used as a step in this direction. Also, instead of using the maximum L1 error, we could likely work with a version weighted by a desired distribution.

Our current approach assumes that the samples on which the model is built are drawn i.i.d. In practice, these samples may have come from experience obtained by solving a previous task. In this case, the correlation between samples must be taken into account (as in Mannor et al., 2007). Our results do not extend immediately, but can likely be adapted with more work.

The general case of linear function approximation is not covered by our results. In this case, there are still open questions in the field about how to build a transition model. However, the idea of having identical feature vectors for "similar" states should still be useful.

Bibliography

Antos, A., Munos, R., & Szepesvari, C. (2007). Value-iteration based fitted policy iteration: Learning with a single trajectory. IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning.

Corrado, C. J. (2007). The exact joint distribution for the multinomial maximum and minimum and the exact distribution for the multinomial range.

Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 503–556.

Kearns, M., & Singh, S. (1999). Finite-sample rates of convergence for Q-learning and indirect methods. Advances in Neural Information Processing Systems 11.

Kuvayev, L., & Sutton, R. S. (1996). Model-based reinforcement learning with an approximate, learned model. Proceedings of the 9th Yale Workshop on Adaptive and Learning Systems (pp. 101–105).

Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107–1149.

Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8, 293–321.

Mannor, S., Simester, D., Sun, P., & Tsitsiklis, J. N. (2007). Biases and variance in value function estimates. Management Science, 53, 308–322.

Moore, A., & Atkeson, C. (1995). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning, 21.

Moore, A. W. (1990). Efficient memory-based learning for robot control. Doctoral dissertation, Cambridge, UK.

Munos, R., & Moore, A. (2002). Variable resolution discretization in optimal control. Machine Learning, 291–323.

Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. Wiley.

Strehl, A. L., Li, L., & Littman, M. L. (2006). Incremental model-based learners with formal learning-time guarantees. Proceedings of the 22nd UAI.

Strehl, A. L., Li, L., & Littman, M. L. (2006). PAC reinforcement learning bounds for RTDP and Rand-RTDP. Proceedings of the AAAI'06 Workshop on Learning for Search.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.

Appendix

A Proof of Lemma 1

Using the triangle inequality:

‖Q^{π*_Ω} − Q*‖_∞ ≤ ‖Q^{π*_Ω} − Q_Ω^{π*_Ω}‖_∞ + ‖Q_Ω^{π*_Ω} − Q*‖_∞.   (2)

First we bound the second term, relating the optimal value function of M_Ω to the optimal value function of M. Let us arbitrarily choose a partition ω ∈ Ω and a state s ∈ ω. We use the following Bellman equations:

Q*(s,a) = R(s,a) + γ ∫_S P(s′|s,a) max_{a′} Q*(s′,a′) ds′

and

Q_Ω^{π*_Ω}(ω,a) = R_Ω(ω,a) + γ ∑_{ω′∈Ω} P_Ω(ω′|ω,a,d) max_{a′} Q_Ω^{π*_Ω}(ω′,a′).

Let Δ₁ = |Q*(s,a) − Q_Ω^{π*_Ω}(ω,a)|. Since we assumed that the reward R(s,a) depends only on the partition ω(s), we have

Δ₁ = γ | ∫_S P(s′|s,a) max_{a′} Q*(s′,a′) ds′ − ∑_{ω′∈Ω} P_Ω(ω′|ω,a,d) max_{a′} Q_Ω^{π*_Ω}(ω′,a′) |.

By adding and subtracting ∫_S P(s′|s,a) max_{a′} Q_Ω^{π*_Ω}(ω(s′),a′) ds′ inside the absolute value, and applying the triangle inequality, we obtain Δ₁ ≤ A + B, where

A = γ | ∫_S P(s′|s,a) [ max_{a′} Q*(s′,a′) − max_{a′} Q_Ω^{π*_Ω}(ω(s′),a′) ] ds′ |

and

B = γ | ∑_{ω′∈Ω} max_{a′} Q_Ω^{π*_Ω}(ω′,a′) [ ∫_{ω′} P(s′|s,a) ds′ − P_Ω(ω′|ω,a,d) ] |.

By using Jensen’s inequality to move the absolute value inside the integral, upperbounding the absolute value of the value function difference by its L∞ norm, and ob-serving that the transition density integrates to 1, we can upper bound A by

A ≤ γ‖Qπ∗ΩΩ −Q∗‖∞

In order to upper bound B, we will need to re-write P_Ω(ω′|ω,a,d). Expanding the shorthand notation, we have

P_Ω(ω′|ω,a,d) = P(s_{t+1} ∈ ω′ | s_t ∈ ω, a_t = a, d)

= P(s_{t+1} ∈ ω′, s_t ∈ ω | a_t = a, d) / P(s_t ∈ ω | a_t = a, d)

= (1 / P(s_t ∈ ω | a_t = a, d)) ∫_ω P(s_{t+1} ∈ ω′, s_t = s | a_t = a, d) ds

= (1 / P(s_t ∈ ω | a_t = a, d)) ∫_ω P(s_t = s | a_t = a, d) P(s_{t+1} ∈ ω′ | s_t = s, a_t = a, d) ds

= (1 / P(s_t ∈ ω | d)) ∫_ω P(s_t = s | d) P(s_{t+1} ∈ ω′ | s_t = s, a_t = a, d) ds.

If we denote the first factor in the integral by d(s), and we realize that we do not need to condition on d in the second factor because we know the state, we get (using the shorthand notation again)

P_Ω(ω′|ω,a,d) = (1 / P(ω|d)) ∫_ω d(s) P(ω′|s,a) ds.

Using this equation, the fact that (1 / P(ω|d)) ∫_ω d(x) dx = 1, Jensen's inequality and the upper bound |max_{a′} Q_Ω^{π*_Ω}(ω′,a′)| ≤ ‖Q_Ω^{π*_Ω}‖_∞, we have

B ≤ γ ‖Q_Ω^{π*_Ω}‖_∞ | ∑_{ω′∈Ω} [ P(ω′|s,a) − (1 / P(ω|d)) ∫_ω d(x) P(ω′|x,a) dx ] |

≤ γ ‖Q_Ω^{π*_Ω}‖_∞ ∑_{ω′∈Ω} sup_{x∈ω} |P(ω′|s,a) − P(ω′|x,a)| · (1 / P(ω|d)) ∫_ω d(x) dx

= γ ‖Q_Ω^{π*_Ω}‖_∞ ∑_{ω′∈Ω} sup_{x∈ω} |P(ω′|s,a) − P(ω′|x,a)|.

Taking the supremum over states and the maximum over actions and partitions, we get

‖Q* − Q_Ω^{π*_Ω}‖_∞ ≤ γ ‖Q* − Q_Ω^{π*_Ω}‖_∞ + γ ‖Q_Ω^{π*_Ω}‖_∞ max_{a,ω} sup_{s∈ω} ∑_{ω′∈Ω} sup_{x∈ω} |P(ω′|s,a) − P(ω′|x,a)|,

hence

‖Q* − Q_Ω^{π*_Ω}‖_∞ ≤ (γ / (1 − γ)) ‖Q_Ω^{π*_Ω}‖_∞ max_{a,ω} sup_{s∈ω} ∑_{ω′∈Ω} sup_{x∈ω} |P(ω′|s,a) − P(ω′|x,a)|.   (3)

Now let us bound ‖Q^{π*_Ω} − Q_Ω^{π*_Ω}‖_∞, which relates the performance of π*_Ω in M_Ω (the MDP where we learn it) to its performance in M (the MDP where we will apply it). We will not use the optimality Bellman equations, because π*_Ω is not optimal in M. Instead, we will use the standard policy-based Bellman equations, where we do not sum over actions because π*_Ω is deterministic (being the greedy policy w.r.t. Q_Ω^{π*_Ω}). Thus, we have:

Q_Ω^{π*_Ω}(ω,a) = R_Ω(ω,a) + γ ∑_{ω′∈Ω} P_Ω(ω′|ω,a,d) Q_Ω^{π*_Ω}(ω′, π*_Ω(ω′))

and

Q^{π*_Ω}(s,a) = R(s,a) + γ ∫_S P(s′|s,a) Q^{π*_Ω}(s′, π*_Ω(s′)) ds′.

(π*_Ω(s) for a state s will be defined as π*_Ω(ω), where ω is the partition that includes s.) At this point, we can use the same calculations as the ones above, since we can still bound |Q_Ω^{π*_Ω}(ω′, π*_Ω(ω′))| by ‖Q_Ω^{π*_Ω}‖_∞ and, if s ∈ ω, we can upper bound |Q^{π*_Ω}(s, π*_Ω(s)) − Q_Ω^{π*_Ω}(ω, π*_Ω(ω))| by ‖Q^{π*_Ω} − Q_Ω^{π*_Ω}‖_∞. Thus, we have, exactly as above,

‖Q_Ω^{π*_Ω} − Q^{π*_Ω}‖_∞ ≤ (γ / (1 − γ)) ‖Q_Ω^{π*_Ω}‖_∞ max_{a,ω} sup_{s∈ω} ∑_{ω′∈Ω} sup_{x∈ω} |P(ω′|s,a) − P(ω′|x,a)|.   (4)

Inserting equations (3) and (4) into (2) completes the proof of the lemma.

B Proof of Lemma 2

A good starting point for our proof is provided by the following result, a direct adaptation of Lemma 1 on page 13 of "Non-parametric Density Estimation: The L1 View" by Devroye and Gyorfi (1985):

Lemma 3. If each state-action pair (s,a) is visited N(s,a) times then, for any ε ∈ (0,1), if N(s,a) ≥ 20|S|/ε² we have

P( ‖P(·|s,a) − P̂(·|s,a)‖_1 ≥ ε ) ≤ 3e^{−N(s,a)ε²/25}.

Note that Lemma 3 only applies to the transitions starting from a single state, whereas we wish to bound the maximum of the L1 distances starting from all states and actions. Before we move on to doing that, let us define Δ_{s,a} = ‖P(·|s,a) − P̂(·|s,a)‖_1. Assuming that we have m total samples (for all state-action pairs), we have

P(max_{s,a} Δ_{s,a} ≥ ε) = 1 − P(max_{s,a} Δ_{s,a} ≤ ε).

Next we marginalize over all possible values of N. We denote the domain of N by D(N), and the set containing all possible choices of N such that each component is bigger than some z by M(z). Using the additional notation F = P(max_{s,a} Δ_{s,a} ≤ ε), we have

F = P(∀ s,a : Δ_{s,a} ≤ ε)

= ∑_{x∈D(N)} P(∀ s,a : Δ_{s,a} ≤ ε | N = x) P(N = x)

= ∑_{x∈D(N)} P(N = x) [ ∏_{s,a} P(Δ_{s,a} ≤ ε | N(s,a) = x(s,a)) ]

≥ ∑_{x∈M(z)} P(N = x) [ ∏_{s,a} P(Δ_{s,a} ≤ ε | N(s,a) = x(s,a)) ]

≥ ∑_{x∈M(z)} [ ∏_{s,a} (1 − 3e^{−x(s,a)ε²/25}) ] P(N = x)

≥ ∑_{x∈M(z)} (1 − 3e^{−zε²/25})^{|S||A|} P(N = x)

= (1 − 3e^{−zε²/25})^{|S||A|} P(min(N) ≥ z).

In the equations above, x(s,a) is the component of x corresponding to state s and action a. Also, because we applied Lemma 3, we need to have z ≥ 20|S|/ε².