Efficient Sampling in Approximate Dynamic Programming Algorithms

Cristiano Cervellera∗   Marco Muselli†

Abstract

Dynamic Programming (DP) is known to be a standard optimization tool for solving Stochastic Optimal Control (SOC) problems, either over a finite or an infinite horizon of stages. Under very general assumptions, commonly employed numerical algorithms are based on approximations of the cost-to-go functions, by means of suitable parametric models built from a set of sampling points in the d-dimensional state space. Here the problem of sample complexity, i.e., how "fast" the number of points must grow with the input dimension in order to obtain an accurate estimate of the cost-to-go functions in typical DP approaches such as value iteration and policy iteration, is discussed. It is shown that a choice of the sampling based on low-discrepancy sequences, commonly used for efficient numerical integration, makes it possible to achieve, under suitable hypotheses, an almost linear sample complexity, thus helping to mitigate the curse of dimensionality of the approximate DP procedure.

Keywords: stochastic optimal control problem, dynamic programming, sample complexity, deterministic learning, low-discrepancy sequences.

1 Introduction

Consider the following special case of Markovian decision process, formulated under very general hypotheses. A dynamic system evolves through discrete temporal stages according to the general stochastic state equation

xt+1 = f(xt, ut, θt), t = 0, 1, . . .

where xt ∈ Xt ⊂ Rd is a state vector, ut ∈ Ut ⊂ Rm is a control vector and θt ∈ Θt ⊂ Rq is a random vector. Suppose the random vectors θt are characterized by a probability measure P(θt) with density p(θt), defined on the Borel σ-algebra of Rq.

The purpose of the control is to minimize a cost function which has an additive form over the various stages. We consider two versions of this optimization problem, namely the T-stage stochastic optimal control (T-SOC) problem and the discounted infinite-horizon stochastic optimal control (∞-SOC) problem.

In both cases, as the decision problem is Markovian, we want to derive control functions in a closed-loop form, i.e., the control vector at each stage must be a function µt (usually called policy) of the current state vector

ut = µt(xt), t = 0, 1, . . .

∗Istituto di Studi sui Sistemi Intelligenti per l'Automazione - Consiglio Nazionale delle Ricerche - Via de Marini 6, 16149 Genova, Italy - Email: [email protected]

†Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni - Consiglio Nazionale delle Ricerche - Via de Marini 6, 16149 Genova, Italy - Email: [email protected]


T-stage Stochastic Optimal Control (T-SOC) problem

We want to find the optimal control law u◦ = col(µ◦0(x0), . . . , µ◦T−1(xT−1)) that minimizes

F(u) = Eθ [ Σ_{t=0}^{T−1} h(xt, ut, θt) + hT(xT) ]

subject to the constraints

µt(xt) ∈ Ut,   t = 0, . . . , T − 1

and

xt+1 = f(xt, µt(xt), θt),   t = 0, . . . , T − 1

where x0 is a given initial state, θ = col(θ0, . . . , θT−1), u = col(u0, . . . , uT−1), h(xt, ut, θt) is the cost paid at the single stage t and hT(xT) is the cost associated with the final stage¹.

¤

Discounted Infinite-horizon Stochastic Optimal Control (∞-SOC) problem

When the number of stages T is not limited, we usually look for stationary policies, i.e., policies that do not change from stage to stage. We consider here discounted problems, where the effect of future costs is weighted by a parameter β ∈ [0, 1). Therefore, the cost to be minimized takes on the following form

lim_{T→∞} Eθ [ Σ_{t=0}^{T} β^t h(xt, ut, θt) ]

subject to

ut = µ(xt),   t = 0, 1, . . .

and

xt+1 = f(xt, ut, θt),   t = 0, 1, . . .

¤

The Dynamic Programming (DP) algorithm, introduced by Bellman [1], is the standard tool for the solution of SOC problems, as is documented by the large amount of studies devoted to this method through the years. The basic idea underlying the DP procedure is to define, at each stage t, a function, commonly named cost-to-go or value function², which quantifies the cost that has to be paid from that stage on to the end of the time horizon. The basics of the recursive solution for SOC problems are introduced and discussed in several classic references on DP methods and applications (see, for example, [1, 2, 3]). Among the most recent surveys on DP techniques and Markov decision processes in general, two excellent monographs are [4, 5].

Although efficient variations of the DP procedure exist for the deterministic version of the SOC problem, such as Differential Dynamic Programming [6], the presence of the random vectors θt makes the DP equations analytically solvable only when certain assumptions on the dynamic system and on the cost function are satisfied³. For the general case we must look for approximate numerical solutions, i.e., we must accept sub-optimal policies based on an approximation of the cost and possibly of the control functions. Several numerical algorithms have been proposed for the approximate solution of the DP procedure (see, e.g., [7, 8, 9, 10, 11]). However, if the problem is stated under very general hypotheses, any method based on discretization suffers from an exponential growth of the computational requirements (usually called curse of dimensionality) which prevents finding accurate solutions for nontrivial dimensions d (see, e.g., [11]). Still, despite the unavoidable curse of dimensionality, there is the need to find computationally tractable methods that can be effectively applied to the SOC context, possibly introducing hypotheses on the regularity of the functions involved.

¹In many cases hT(xT) ≡ 0.
²In the following, we will use the term "cost-to-go".
³Typically, these assumptions are the classic "LQ hypotheses" (linear system equation and quadratic cost).

A very general algorithm is based on the approximation of the cost-to-go function by means of some fixed-structure parametric architecture, which is "trained" on the basis of sample points coming from the discretization of the state space. There are many examples of such an approach, where different structures are employed: among others, polynomial approximators [7], splines [10], multivariate adaptive regression splines [12] and neural networks [13] (in the last case the term neuro-dynamic programming is often used).

Once a suitably "rich" class of approximating architectures is chosen, the choice of the sample points is the most critical issue of the procedure. In general, finding the best function (i.e., the one that is closest to the true unknown cost-to-go function) inside the class of models corresponds to an estimation process that is usually performed by adopting a local or global optimization algorithm, which aims at finding the minimum point (in the space of approximator parameters) of a nonlinear function measuring the error between the actual cost-to-go function and its current approximation.

The present paper deals with the curse of dimensionality related to sample complexity, i.e., the rate at which the number of sampling points must grow in order to achieve a desired accuracy of the estimation. The most common sampling technique used in the literature is the "full uniform" grid, i.e., the uniform discretization of each component of the state space into a fixed number of values. This clearly leads to a curse of dimensionality: if each of the d components of the state space is discretized by means of q equally spaced values, the number of points of the grid is equal to q^d. Therefore, for the purpose of proving that the estimation problem can be solved with non-exponential sample complexity, more refined sampling schemes have to be investigated.

For what concerns deterministic sampling, a promising approach is based on the use of Orthogonal Arrays [12], where Multivariate Adaptive Regression Splines (MARS) are employed as approximating architectures. Orthogonal Arrays are a family of subsets of the full uniform grid whose size needs to grow only polynomially with the dimension d. However, theoretical results on the convergence of the estimation process are not currently available.

For what concerns random sampling, interesting theoretical results on functional estimation come from the field of Statistical Learning Theory (SLT) [14], which deals with the general problem of learning functional dependences from empirical data. In a typical learning problem, the data are generated randomly, according to some probability, by an external source. Under suitable hypotheses on the structure of the class of models, it has been proven that the sample complexity of the estimation is quadratic, almost independently of the dimension d. This is consistent with the typical quadratic convergence of various algorithms based on Monte Carlo discretization techniques, such as integration of multivariate functions [15].

In the present work the sample complexity issue is faced by employing a deterministic version of learning theory, first developed in its general context in [16]. Applied to dynamic programming, this approach leads to the estimation of the cost-to-go functions on the basis of quasi-random sampling of the state space. It is possible to prove that, under mild regularity conditions, an almost linear convergence of the estimation error can be achieved. We point out that the method has already proved to be successful in practice for the solution of high-dimensional problems, such as optimal reservoir planning and inventory forecasting ([17, 18, 19]).

The work is organized as follows. In Section 2 the theory of deterministic learning is reported, and bounds on the estimation error are derived. In Section 3 algorithms for the finite horizon and the discounted infinite horizon case, based on the learning of the cost-to-go functions, are presented. Section 4 contains results on the application of deterministic learning to approximate DP. In Section 5 simulation results are presented. Finally, the Appendix contains proofs, figures and tables.

2 Deterministic learning

We summarize briefly the main results of the deterministic learning framework that will be considered in the following in the context of SOC problems. A detailed treatment can be found in [16], together with proofs of the theorems.

Consider the following problem of function estimation: a functional dependence of the form y = g(x), where x ∈ X ⊂ Rd and y ∈ Y ⊂ R, has to be learnt from a set of samples (xL, yL) ∈ (XL × Y L), where xL = {x0, . . . , xL−1}, yL = {y0, . . . , yL−1}, yl = g(xl). In particular, we define:

1. A family of parameterized functions Γ = {ψ(x, α) : α ∈ Λ ⊂ Rk}, which are the models used for learning.

2. A risk functional R(α) which measures the difference between the true function and the model over X

R(α) = ∫_X ℓ(g(x), ψ(x, α)) dx   (1)

where ℓ : (Y × Y) → R is a loss function⁴ that measures the difference between the function g and its approximation at any point of X.

3. A deterministic algorithm by which a sequence of points xL ∈ XL, xL = {x0, . . . , xL−1}, is generated.

If the class of models Γ is sufficiently rich, we can drive R(α) to zero. As previously noted, we will not discuss the approximation problem related to the choice of Γ (a discussion on different nonlinear models and their advantageous approximation properties can be found, e.g., in [20]). Also note that the results presented in the following can be easily extended to Y ⊂ Rk by simply considering the k single components independently.

The target of the estimation problem is to find α∗ ∈ Λ such that R(α∗) = min_{α∈Λ} R(α). If the minimum does not exist, the problem consists in finding α∗ ∈ Λ such that R(α∗) < inf_{α∈Λ} R(α) + ε for a given ε > 0.

As we know g only at the points of the sequence xL, we try to minimize R on the basis of the available data. In particular, we choose a training algorithm that corresponds to minimizing the empirical risk given L observation samples

Remp(α, xL) = (1/L) Σ_{l=0}^{L−1} ℓ(yl, ψ(xl, α))

and define αL as the parameter vector obtained after the minimization of Remp(α, xL).

In order to measure the difference between the actual value of the risk R(αL) after the training and the best achievable risk R(α∗), we define r(αL) as

r(αL) = R(αL) − R(α∗)
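As a simple illustration of this estimation scheme (this example is not from the paper; the target function g, the feature map and the sample are arbitrary placeholder choices), with a quadratic loss and a model ψ(x, α) = α · φ(x) that is linear in the parameters, the minimization of the empirical risk reduces to an ordinary least-squares problem:

```python
import numpy as np

def empirical_risk_minimizer(x_sample, y_sample, features):
    """Minimize R_emp(alpha) = (1/L) * sum_l (y_l - alpha . phi(x_l))^2 in closed form."""
    Phi = np.array([features(x) for x in x_sample])          # L x k design matrix
    alpha_L, *_ = np.linalg.lstsq(Phi, np.asarray(y_sample), rcond=None)
    return alpha_L

# Hypothetical example: learn g(x) = sin(2*pi*x1) + x2 on [0, 1)^2 with a quadratic polynomial model.
features = lambda x: np.array([1.0, x[0], x[1], x[0] ** 2, x[0] * x[1], x[1] ** 2])
xL = np.random.default_rng(0).random((200, 2))               # a random design here; Sec. 2.1 motivates low-discrepancy designs
yL = np.sin(2 * np.pi * xL[:, 0]) + xL[:, 1]
alpha_L = empirical_risk_minimizer(xL, yL, features)
```

With a model that is nonlinear in α (e.g., a neural network), the same empirical risk is minimized by an iterative routine instead of a single least-squares solve.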

Definition 2.1 We say that the training algorithm A is deterministically consistent if r(αL) → 0 as L → ∞.

⁴The loss function must be symmetric and satisfy ℓ(z, z) = 0 and ℓ(z1, z2) > 0 for z1 ≠ z2.


The term sample complexity will be used to denote the rate of convergence of r(αL) as L grows. For a given class of models Γ, we can adopt this rate as an efficiency measure of the chosen sequence xL.

2.1 Deterministic learning rates based on discrepancy

We present some theoretical results based on a measure of spread of points called discrepancy, commonly employed in numerical analysis [21] and probability [22]. In particular, it can be proved [16] that the sample complexity is directly related to how uniformly the deterministic sequence covers the input space as the number of points L grows. For this reason, a special family of deterministic sequences which yield almost linear convergence of the discrepancy is considered. Such sequences are usually referred to as low-discrepancy sequences, and are commonly used for numerical integration methods. A detailed description of their construction can be found in [23].

In the following we will assume that X = [0, 1)d (i.e., the d-dimensional semi-closed unit cube). The results can be extended to other intervals of Rd, or to more complex input spaces, by suitable transformations [21].
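For concreteness, the following sketch generates points of a Halton sequence, one classical low-discrepancy construction (Halton, Sobol' and Niederreiter sequences are the ones used in the experiments of Sec. 5). The code is purely illustrative; in practice an existing quasi-Monte Carlo library would normally be used.

```python
import numpy as np

def van_der_corput(n, base):
    """n-th element of the van der Corput sequence in the given base (radical-inverse function)."""
    q, denom = 0.0, 1.0
    while n > 0:
        n, remainder = divmod(n, base)
        denom *= base
        q += remainder / denom
    return q

def halton(L, d):
    """First L points of the d-dimensional Halton sequence in [0, 1)^d; dimension i uses the i-th prime as base."""
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
              53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113][:d]
    return np.array([[van_der_corput(l + 1, b) for b in primes] for l in range(L)])

xL = halton(600, 9)   # e.g., a 600-point design for a 9-dimensional state space
```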

For each vertex of a given subinterval B = ∏_{i=1}^{d} [ai, bi] of X, we can define a binary label by assigning '0' to every ai and '1' to every bi. For every function ϕ : X → R we define ∆(ϕ, B) as the alternating sum of ϕ computed at the vertices of B, i.e.,

∆(ϕ, B) = Σ_{x ∈ eB} ϕ(x) − Σ_{x ∈ oB} ϕ(x)

where eB is the set of vertices with an even number of '1's in their label, and oB is the set of vertices with an odd number of '1's.

Definition 2.2 Let ℘ be any partition of X into subintervals. The variation of ϕ on X in the sense of Vitali is defined by

V^(d)(ϕ) = sup_℘ Σ_{B∈℘} |∆(ϕ, B)|   (2)

If the partial derivatives of ϕ are continuous on X, it is possible to write V^(d)(ϕ) in an easier way as

V^(d)(ϕ) = ∫_0^1 · · · ∫_0^1 | ∂^d ϕ / (∂x1 · · · ∂xd) | dx1 · · · dxd   (3)

where xi is the i-th component of x.

For 1 ≤ k ≤ d and 1 ≤ i1 < i2 < · · · < ik ≤ d, let V^(k)(ϕ, i1, . . . , ik) be the variation in the sense of Vitali of the restriction of ϕ to the k-dimensional face {(x1, . . . , xd) ∈ X : xi = 1 for i ≠ i1, . . . , ik}.

Definition 2.3 The variation of ϕ on X in the sense of Hardy and Krause is defined by

V_HK(ϕ) = Σ_{k=1}^{d} Σ_{1≤i1<i2<···<ik≤d} V^(k)(ϕ, i1, . . . , ik)   (4)
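As a purely illustrative numerical check (not part of the paper), for d = 2 and a smooth function the Vitali variation (3) and the Hardy-Krause variation (4) can be approximated by finite differences; the grid resolution n below is an arbitrary choice.

```python
import numpy as np

def hardy_krause_variation_2d(phi, n=400):
    """Approximate V_HK(phi) on [0, 1]^2 via (3)-(4): the Vitali variation of phi plus the
    variations of its restrictions to the faces x2 = 1 and x1 = 1."""
    t = np.linspace(0.0, 1.0, n + 1)
    h = 1.0 / n
    X1, X2 = np.meshgrid(t, t, indexing="ij")
    F = phi(X1, X2)
    # Vitali variation V^(2): integral of |d^2 phi / (dx1 dx2)| over the unit square.
    mixed = np.diff(np.diff(F, axis=0), axis=1) / h ** 2
    v2 = np.sum(np.abs(mixed)) * h ** 2
    # One-dimensional variations of the restrictions to the faces x2 = 1 and x1 = 1.
    v1_face_x2 = np.sum(np.abs(np.diff(phi(t, np.ones_like(t)))))
    v1_face_x1 = np.sum(np.abs(np.diff(phi(np.ones_like(t), t))))
    return v1_face_x2 + v1_face_x1 + v2

# Example: phi(x1, x2) = x1 * x2 gives V^(2) = 1 and two face variations equal to 1, so V_HK = 3.
print(hardy_krause_variation_2d(lambda x1, x2: x1 * x2))
```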

The following assumption plays a basic role in deriving an upper bound for the convergence rate of r(αL).


Assumption 2.1 The loss function ℓ, the function g and the family Γ of models are such that

V = sup_{α∈Λ} V′_HK(α) < ∞

where V′_HK(α) = V_HK(ℓ(g(x), ψ(x, α))).

In fact, we have the following result [16].

Theorem 2.1 Suppose Assumption 2.1 holds. Then, if low-discrepancy sequences are adopted for the sample xL, we have

r(αL) ≤ O( V (log L)^{d−1} / L )   (5)

thus the training algorithm is deterministically consistent.

¤

3 Approximate dynamic programming algorithms

We present general approximate dynamic programming (ADP) schemes, both for the T-SOC and the ∞-SOC problems, which are heavily based on the estimation of an unknown function from a finite set of samples. Therefore, all the results of the previous section can be directly applied to analyze the efficiency of the methods, related to the goodness of the chosen sampling of the state space.

3.1 T-SOC Problems

The computation of the optimal controls for each vector xt at stage t can be obtained recursively by the following well-known DP equations

J◦t(xt) = min_{ut ∈ Ut} Eθt[ h(xt, ut, θt) + J◦t+1[f(xt, ut, θt)] ],   t = T − 1, . . . , 0

J◦T(xT) ≜ hT(xT)

J◦t(xt) is called the cost-to-go function, and it represents the optimal cost that has to be paid starting from the state xt in order to reach stage T. It is possible to prove [5] that J◦0(x0) corresponds to the optimal cost of the T-SOC problem.

3.1.1 The Approximate Dynamic Programming Algorithm

We formalize a general scheme for Approximate Dynamic Programming (ADP) based on the approximation of the cost-to-go functions by means of parametric models. As stated in the introduction, it is generally impossible to solve the DP equations analytically; therefore we consider a numerical solution for which a discretization of the feasible state spaces is needed. This permits us to compute estimated values of the cost-to-go functions at the points of such a discretization, and to approximate them at the remaining points of the feasible sets.

For this purpose we define

xLt = {xt,l ∈ Xt : l = 1, . . . , L} , t = 1, . . . , T − 1

as a sample of L points xt,l chosen in Xt, for each stage t.


The algorithm is based, at stage t, on approximations Jt+1 of the cost-to-go functions, having the form of generic parameterized functions with a fixed structure ψ(x, α), where α ∈ Λ ⊂ Rk is a set of "free" parameters to be optimized⁵. Examples of such functions are feedforward neural networks [24], radial basis function networks [25], hinging hyperplanes [26], etc. Then, we define

Jt(xt, α◦t) = ψ(xt, α◦t)

where α◦ indicates that the model has been optimized, as will be described in the following. Using the approximation Jt+1(xt+1, α◦t+1), we can write the DP equation as

J◦t(xt,l) = min_{ut ∈ Ut} Eθt[ h(xt,l, ut, θt) + Jt+1[f(xt,l, ut, θt), α◦t+1] ]   (6)

for each xt,l ∈ xLt.

Here and in the following, J◦t(xt,l) denotes the approximated value of the true cost-to-go function computed through (6) at the sample point xt,l. The goodness of this approximation is affected by the use of Jt+1, by the impossibility of computing the true minimum exactly and by the need of estimating the expected value over θt through an average over a finite number of realizations of the random vectors.

When the L values J◦t(xt,l), l = 1, . . . , L, are computed, we can build the approximation Jt by optimizing the empirical risk, as defined in Sec. 2. For instance, if we employ a quadratic loss function, which corresponds to a typical mean square error (MSE) criterion, we have

α◦t = arg min_{αt} (1/L) Σ_{l=1}^{L} [ J◦t(xt,l) − Jt(xt,l, αt) ]²

In this way, we are able to evaluate the cost-to-go function at each point of Xt, which is required when computing J◦t−1 at stage t − 1.

Notice that the various cost-to-go approximations can be obtained entirely off-line. The on-line policy can be obtained by applying a reoptimization procedure, which involves the use of the DP equations and the approximations of the cost-to-go functions Jt(xt, α◦t) obtained off-line. Specifically, at a given state xt, the optimal vector u◦t is derived through the following minimization

u◦t = arg min_{ut ∈ Ut} Eθt[ h(xt, ut, θt) + Jt+1[f(xt, ut, θt), α◦t+1] ]
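The finite-horizon scheme above can be summarized by the following sketch. All problem-specific ingredients (the dynamics f, the stage cost h, the terminal cost hT, the grid of candidate controls, the noise realizations used in place of the expectation Eθt, and the feature map playing the role of ψ(x, α)) are placeholder assumptions; the experiments of Sec. 5 use neural network models trained by the Levenberg-Marquardt method, whereas here the model is linear in the parameters, so the empirical risk minimization reduces to a least-squares solve.

```python
import numpy as np

def adp_finite_horizon(f, h, h_T, samples, controls, noise, features, T):
    """Backward ADP recursion: at each stage fit J_t(x, alpha_t) = alpha_t . features(x)
    to the values computed from Eq. (6), with the expectation over theta_t replaced by
    an average over the given noise realizations."""
    alphas = [None] * (T + 1)

    def J(t, x):
        # Approximate cost-to-go at stage t (terminal cost at t = T).
        return h_T(x) if t == T else features(x) @ alphas[t]

    for t in range(T - 1, -1, -1):
        targets = []
        for x in samples[t]:
            # Eq. (6): minimize over a finite grid of candidate controls.
            q_values = [np.mean([h(x, u, th) + J(t + 1, f(x, u, th)) for th in noise])
                        for u in controls]
            targets.append(min(q_values))
        # Empirical risk minimization with quadratic loss (least squares here).
        Phi = np.array([features(x) for x in samples[t]])
        alphas[t], *_ = np.linalg.lstsq(Phi, np.array(targets), rcond=None)
    return alphas
```

The on-line reoptimization then repeats the same one-stage minimization at the visited state xt, using the stored parameter vectors alphas.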

3.2 ∞-SOC Problems

In the infinite horizon case, we look for stationary policies µ◦t(xt) = µ◦(xt), to which there corresponds a single stationary cost-to-go function J◦(xt).

We assume that, for every stage t, Xt ≡ X, Ut ≡ U and Θt ≡ Θ, where X is such that f(x, u, θ) ∈ X for every x ∈ X, u ∈ U and θ ∈ Θ. In this way, we can drop the subscript t and write the well-known Bellman equation, which provides, once solved, the optimal cost-to-go function for the infinite horizon case:

J◦(x) = min_{u ∈ U} Eθ[ h(x, u, θ) + βJ◦[f(x, u, θ)] ]   (7)

where the same function J◦ appears on both the left-hand and the right-hand side of (7). The difference with the finite horizon case is that now we have to solve a functional equation.

Different methods have been proposed to obtain J◦. Among the most popular and successful ones, we can cite algorithms relying on the approximation of the cost-to-go function through approximating architectures. The book by Bertsekas and Tsitsiklis [13] is an excellent reference for a survey of the aforementioned methods. In particular, we consider here two quite general approaches, namely (i) approximate value iteration and (ii) approximate policy iteration.

⁵For simplicity, we assume that the set of possible approximating functions ψ remains unchanged stage after stage.


3.2.1 Approximate value iteration

The solving algorithm has basically the same structure as the ADP procedure described for the finite horizon case, but here an iterative version is considered. In particular, the generic k-th iteration is based on the use of an approximation of the cost-to-go function J◦

Jk(x, α◦k) = ψ(x, α◦k)

that is obtained from step k − 1 in the following way.

Consider again a sample xL = {xl ∈ X : l = 1, . . . , L} of L points chosen⁶ in X. Then, for each state xl ∈ xL, compute

J◦k(xl) = min_{u ∈ U} Eθ[ h(xl, u, θ) + βJk−1[f(xl, u, θ), α◦k−1] ]   (8)

Next, build the cost-to-go approximation for the k-th iteration by minimizing the empirical risk

α◦k = arg min_α (1/L) Σ_{l=1}^{L} [ J◦k(xl) − Jk(xl, α) ]²

After a sufficient number of iterations, the online control vector ut for a generic state xt can be obtained by taking the argument of the minimum in (8), having replaced xl by xt.
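A corresponding sketch of the approximate value iteration loop, under the same kind of placeholder assumptions (finite control grid, a fixed set of noise realizations standing in for the expectation, linear-in-parameters model), could read as follows.

```python
import numpy as np

def approximate_value_iteration(f, h, beta, samples, controls, noise, features, n_iter):
    """Iterate Eq. (8): compute targets with the current fitted cost-to-go model,
    then refit the model by minimizing the empirical quadratic risk."""
    alpha = np.zeros(len(features(samples[0])))      # initial cost-to-go identically zero
    for _ in range(n_iter):
        targets = []
        for x in samples:
            q_values = [np.mean([h(x, u, th) + beta * (features(f(x, u, th)) @ alpha)
                                 for th in noise])
                        for u in controls]
            targets.append(min(q_values))
        Phi = np.array([features(x) for x in samples])
        alpha, *_ = np.linalg.lstsq(Phi, np.array(targets), rcond=None)
    return alpha
```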

3.2.2 Approximate policy iteration

This method involves, at the k-th iteration and for each state xl ∈ xL, the use of estimates of the cost-to-go function J◦k(xl) evaluated by using the current policy µk. Ideally, this would correspond to

J◦k(xl) = Eθ Σ_{t=0}^{∞} β^t h(xt,l, µk(xt,l), θt)   (9)

where xt+1,l = f(xt,l, µk(xt,l), θt) and x0,l = xl.

In practice the system is simulated by using the current policy µk; if a sufficiently long finite horizon of T stages is chosen and Q realizations of a sequence of random vectors {θ1,q, . . . , θT,q}, q = 1, . . . , Q, are employed, the infinite-horizon cost-to-go function J◦k is estimated as

J◦k(xl) = (1/Q) Σ_{q=1}^{Q} Σ_{t=0}^{T−1} β^t h(xt,l,q, µk(xt,l,q), θt,q)   (10)

where xt,l,q = f(xt−1,l,q, µk(xt−1,l,q), θt,q) and x0,l,q = xl. This phase is usually called policy evaluation.

Then, we build the cost-to-go approximation Jk(x, α◦k) = ψ(x, α◦k) corresponding to the k-th iteration in the usual way

α◦k = arg min_α (1/L) Σ_{l=1}^{L} [ J◦k(xl) − Jk(xl, α) ]²

Finally, we improve the policy at the states in xL

µk+1(xl) = arg min_{u ∈ U} Eθ[ h(xl, u, θ) + βJk[f(xl, u, θ), α◦k] ]   (11)

6We suppose, for the sake of simplicity, that the sample xL remains the same for every k.


to be used in the policy evaluation phase at the (k + 1)-th iteration. For what concerns the value of the control function at points outside xL, one can use (11) replacing xl with the state that is actually reached through the simulation. Since this can be too computationally intensive, especially when T is large, an attractive alternative is to approximate also the control functions µk by parameterized models, on the basis of the available L pairs [xl, µk(xl)], i.e., we build approximations µk(x, αµ◦k) where

αµ◦k = arg min_{αµ} (1/L) Σ_{l=1}^{L} ‖µk(xl) − µk(xl, αµ)‖²

In this way, we have an immediate evaluation of the "optimal" control vector for any given state, at the price of a further level of sub-optimality. It may be noticed that all the properties of the sampling methods discussed in the previous section clearly hold also for the problem of approximating the policies.
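The following sketch puts the steps together for one iteration: policy evaluation by simulation as in (10), the least-squares fit of the cost-to-go model, and policy improvement as in (11) on the sample points. Everything problem-specific is again a placeholder assumption; in particular, the policy is kept in tabular form on xL and extended off the sample by a crude nearest-point rule, whereas the text above suggests either re-solving (11) at the visited states or fitting a parameterized model of the policy.

```python
import numpy as np

def nearest_policy(x, samples, policy):
    """Stand-in for evaluating the current policy off the sample x^L:
    copy the control of the nearest sample point."""
    return policy[int(np.argmin([np.linalg.norm(x - s) for s in samples]))]

def policy_iteration_sweep(f, h, beta, samples, controls, noise_seqs, features, policy):
    """One approximate policy iteration step; policy maps sample indices to controls."""
    # Policy evaluation, Eq. (10): simulate T stages for each of the Q noise sequences.
    targets = []
    for x0 in samples:
        costs = []
        for theta_seq in noise_seqs:                 # Q sequences of length T
            x, total = x0, 0.0
            for t, th in enumerate(theta_seq):
                u = nearest_policy(x, samples, policy)
                total += beta ** t * h(x, u, th)
                x = f(x, u, th)
            costs.append(total)
        targets.append(np.mean(costs))
    # Fit the cost-to-go model by minimizing the empirical quadratic risk (least squares).
    Phi = np.array([features(x) for x in samples])
    alpha, *_ = np.linalg.lstsq(Phi, np.array(targets), rcond=None)
    # Policy improvement, Eq. (11), over a finite grid of candidate controls.
    one_step_noise = [th for seq in noise_seqs for th in seq]
    new_policy = {}
    for l, x in enumerate(samples):
        q_values = [np.mean([h(x, u, th) + beta * (features(f(x, u, th)) @ alpha)
                             for th in one_step_noise]) for u in controls]
        new_policy[l] = controls[int(np.argmin(q_values))]
    return alpha, new_policy
```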

3.3 Performance issues

For what concerns T-SOC problems, it is reasonable to assume that the errors given by the approximation of the cost-to-go functions have a direct impact on the quality of the policy we obtain at the end of the DP algorithm. In the case of discounted problems, it is actually possible to derive bounds on the quality of the performance.

It is well known that, if we consider their "exact" versions (as opposed to the "approximate" ones), both the value iteration and the policy iteration algorithms are proven to converge, in the case of discounted cost, to the true stationary optimal cost-to-go J◦.

In the case of approximate versions such as those described in this work, it is still possible to derive some bounds on their performance. In particular, for value iteration we can easily obtain (see [13], pag. 333) that Jk approaches J◦ as k → ∞ within an absolute error of ε/(1 − β), where ε is defined as the following upper bound

‖Jk − J◦k‖∞ ≤ ε

and J◦k is obtained as in (8). As for policy iteration, we have (see [13], pag. 276) that J◦k approaches J◦ as k → ∞ within an absolute error of (δ + 2βε)/(1 − β)², where again

‖Jk − J◦k‖∞ ≤ ε

with J◦k obtained⁷ as in (9). Here δ is an error term that depends on the fact that we cannot in general perform an exact minimization in (11), and possibly also on the use of the approximations µk(x, αµ◦k) of the control functions µk.

Both approaches share the requirement that the functions J◦k, even if they are defined differently for each method, must be approximated as closely as possible in order to guarantee a good performance.

As previously said, ε depends on (i) how rich the class of parameterized models chosen for the approximation is and (ii) how well we can estimate, inside this class, the element closest to the "true" function we want to approximate. It is then clear that a good sampling method, i.e., one that does not lead to an exponential sample complexity, is essential for the efficiency and the accuracy of approximate dynamic programming algorithms.⁸ In the next section the application of deterministic learning to the framework of DP algorithms is presented, stating the conditions under which, as we have seen in Sec. 2, this can lead to an almost linear sample complexity when low-discrepancy sequences are employed.

⁷When using a simulated version involving a finite horizon of T stages and Q realizations of the random vectors, a further error term should formally be added to ε. For simplicity, we will assume that Q and T are large enough for this term to be neglected.

4 Deterministic learning for dynamic programming algorithms

In this section we analyze the variation of the cost-to-go functions involved in the ADP procedures described in Sec. 3, and present sufficient conditions under which the finiteness of the variation is guaranteed. If this is the case, the almost linear sample complexity given by low-discrepancy sequences can be attained.

For simplicity, we suppose there exists a homeomorphism⁹ between either the sets Xt (finite horizon) or X (infinite horizon) and [0, 1)d. According to this assumption, henceforth we can generically use X ≡ [0, 1)d to denote the input space for every stage.

For a generic function φ(z) : S ⊂ Rp → R, we introduce the following notation: for 1 ≤ k ≤ p and 1 ≤ i1 ≤ i2 ≤ · · · ≤ ik ≤ p

∂i1,...,ik φ ≜ ∂^k φ / (∂zi1 · · · ∂zik)

where zi is the i-th component of z.

Next, we define the following class WM(S) of functions, for M ∈ R

WM(S) = { f : S ⊂ Rp → H ⊂ R such that ∂i1,...,ik f is continuous on S and |∂i1,...,ik f| ≤ M for 1 ≤ k ≤ p and 1 ≤ i1 ≤ i2 ≤ · · · ≤ ik ≤ p }

Note that, when S = X, it follows easily by (3) that all the elements of WM(S) have finite variation in the sense of Hardy and Krause, as defined in (4).

Lemma 4.1 Consider the generic composite function g′(ξ1, . . . , ξd) = g(γ1(ξ1, . . . , ξd), . . . , γs(ξ1, . . . , ξd)), where γj : [0, 1)d → Hj ⊂ R for 1 ≤ j ≤ s. Suppose the following conditions hold

1. g ∈ WM(∏_j Hj) for some M ∈ R;

2. γj ∈ WM′([0, 1)d) for all 1 ≤ j ≤ s and some M′ ∈ R.

Then there is M′′ ∈ R such that g′ ∈ WM′′([0, 1)d), i.e., g′ has finite variation in the sense of Hardy and Krause on [0, 1)d.

The proof of this lemma can be found in [16]. From Lemma 4.1 two important cases directly follow.

Corollary 4.1 Consider two functions γ1 : [0, 1)d → H1 ⊂ R and γ2 : [0, 1)d → H2 ⊂ R. If γ1 ∈ WM(X), γ2 ∈ WM′(X) for some M, M′ ∈ R, then both γ1 + γ2 and γ1γ2 have finite variation on X in the sense of Hardy and Krause.

⁸It must be pointed out that the convergence results discussed in the previous section actually hold for Lp norms. In particular, they are well suited to the L2 norm that is standard in the DP literature. It is easy to verify, though, that convergence in the Lp norm implies convergence in the supremum norm provided the involved functions are sufficiently regular. In particular, this is true for the conditions on the variation we have considered.

9A bijective function ζ such that both ζ and ζ−1 are continuous.


4.1 Application to approximate dynamic programming algorithms

4.1.1 The finite horizon case

Consider the estimation problem arising from the ADP algorithm described in Sec. 3.1 and, in particular, the estimation of the cost-to-go approximations J◦t, defined as in (6), by means of the parameterized models Jt(xt, αt).

By Lemma 4.1 we can see that Assumption 2.1 is verified, in the ADP context, under the following conditions

1. The loss function ℓ belongs to WM(Y × Y) for some finite M;

2. The cost-to-go approximations J(xt, αt) are such that, for all t, ∂i1,...,ik J is continuous for any αt ∈ Λ and

sup_{αt∈Λ} |∂i1,...,ik J| < ∞

for all 1 ≤ k ≤ d and all 1 ≤ i1 ≤ . . . ≤ ik ≤ d;

3. The cost-to-go approximations J◦t(xt) belong to WM′(X) for some M′ ∈ R and all t.

Conditions 1 and 2 are verified by commonly employed loss functions¹⁰ and standard approximators¹¹, respectively.

In the following we discuss sufficient hypotheses for condition 3 to be satisfied.

For all t we define J◦t(xt) = Jt(xt, α◦t) as the trained cost-to-go approximation for stage t, and UC = { µ : µi ∈ WM(X) for all 1 ≤ i ≤ m }, where µi is the i-th component of µ.

Assumption 4.1
a) For all t, and all θt ∈ Θ, we have fi ∈ WMi(X × Ut), where fi is the i-th component of f, and h ∈ WM′(X × Ut);
b) for t = T, J◦T ≡ hT ∈ WM(X). For all t = T − 1, . . . , 1, J◦t ∈ WM(X).

For each point xt, we define the argument of the minimum in (6) as u◦t = µ◦t(xt). In general, the optimal solution µ◦t, i.e., the one obtained pointwise for each xt, belongs to UC only under particular assumptions of convexity on h and f (e.g., see [27]).

Assumption 4.2
For all t, there exists µ∗t ∈ UC that is ε-close (in some suitable norm) to the optimal pointwise solution µ◦t, for any ε > 0.

It is known that, under mild hypotheses, for any measurable function ζ there is an arbitrarily close function in C∞ (see, e.g., [28]). Therefore, the only requirement on µ◦t implied by Assumption 4.2 is that it admits an arbitrarily close approximation with bounded derivatives.

It is easy to see that, even when f and h are not convex, the regularities imposed on them by Assumption 4.1 imply that also the cost-to-go function obtained with µ∗t can be made arbitrarily close to J◦t(xt) (obtained with µ◦t). The next theorem contains conditions on the applicability of the results of Sec. 2, and considers an arbitrarily close approximation to J◦t(xt) obtained by constraining the control functions to belong to UC. For the sake of readability, we will employ the same notation J◦t(xt) also for this approximation.

Theorem 4.1 Suppose Assumptions 4.1 and 4.2 hold. Then we have J◦t ∈ WM([0, 1)d) for some M ∈ R and all t, i.e., J◦t has finite variation in the sense of Hardy and Krause.

The proof is contained in the Appendix.

¹⁰Such as (v1 − v2)^p.
¹¹Such as neural networks, radial basis function networks, support vector machines (for proper behaviour of the kernel function), etc. [16].


4.2 The Infinite Horizon Case

We consider the estimation of the functions J◦k by means of parameterized models Jk(x, αk), as described in Sec. 3.2.

For what concerns approximate value iteration, it is evident that the results of Theorem 4.1, due to the structure of (8), can still be applied, provided we replace the subscripts related to the temporal stage t with those related to the iteration k of the algorithm.

As for approximate policy iteration, we consider the estimation of the functions J◦k(x) given a policy µk(x) in their "exact" form, i.e.,

J◦k(x) = Eθ Σ_{t=0}^{∞} [ β^t h(xt, µk(xt), θt) ]

where x0 = x. The case where a finite T is employed and the expected value is approximated through Q realizations of the random sequences can be derived in a straightforward way.

Assumption 4.3
a) for all θ ∈ Θ, we have fi ∈ WMi(X × U), where fi is the i-th component of f;
b) for all θ ∈ Θ, we have h ∈ WM′(X × U);
c) for all k, we have µk ∈ UC;
d) βM_{f,1}^d M_{µ,1}^d < 1, where M_{f,1} = sup_θ max_{n=1,...,d} |∂i fn| (for i = 1, . . . , d + m) and M_{µ,1} = max_k sup_θ max_{n=1,...,m} |∂i µk,n| (for i = 1, . . . , d), respectively.

We have the following result.

Theorem 4.2 Suppose Assumption 4.3 holds. Then we have J◦k ∈ WM([0, 1)d) for some M ∈ R, i.e., J◦k has finite variation in the sense of Hardy and Krause.

The proof is contained in the Appendix.

Notice that Assumption 4.3c is automatically verified when we approximate the control functions µk by the usual commonly employed parameterized models, as described in Subsection 3.2.

Assumption 4.3d is a bound on the absolute value of the first-order partial derivatives of f and µk, and it can theoretically always be fulfilled by a suitable scaling of the input space and/or by choosing a sufficiently small β.

In any case, when we consider the actual implementation of the approximate policy iteration method, where the true cost-to-go function values are estimated through simulation over a finite horizon and averaging over a finite number of random sequences, point d of Assumption 4.3 is no longer required, i.e., finiteness of the variation of the functions f, µk, h, stated in points a-c, is enough to guarantee finiteness of the variation of J◦k.

5 Experimental results

In this section the use of low-discrepancy sequences in typical ADP contexts is presented. We consider two high-dimensional test problems, specifically a 9-dimensional inventory forecasting problem and a 30-dimensional water reservoir optimal management problem. For both models we compare the performance obtained with low-discrepancy sequences and with random sequences, considering both the finite horizon and the discounted infinite horizon case.


The inventory forecasting model

In this 9-dimensional problem, the aim of the control is to satisfy the demand of 3 items while keeping the storage levels as small as possible. Therefore, the cost function for each stage t is V-shaped, and has the following structure:

h(xt, ut, θt) = Σ_{j=1}^{3} [ ωj max{xt+1,j, 0} − πj min{0, −xt+1,j} ]

where

xt+1,j = xt,j + ut,j − xt,j+3 · θt,j   for j = 1, 2, 3
xt+1,j = xt+1,j+3 · θt,j   for j = 4, 5, 6
xt+1,j = qj · θt,j   for j = 7, 8, 9

The components xt,j are the item levels in period t when j = 1, 2, 3, the forecasts for the demand of each item in period t + 1 when j = 4, 5, 6, and the forecasts for the demand of each item in period t + 2 when j = 7, 8, 9. The random value θt,j represents the correction between the forecast and the true demand of the items; the 9-dimensional vector θt has a lognormal distribution. qj is a constant, ωj ≥ 0 is the holding cost parameter for item j, while πj ≥ 0 is the backorder cost parameter for item j. In order to deal with a differentiable cost, in our tests we used "smooth" functions that approximate the ideal "V" shape of the cost. Specifically, we have

Q+(z, δ) = 0   for z ≤ 0
Q+(z, δ) = (1/(4δ²)) z³ − (1/(16δ³)) z⁴   for 0 < z < 2δ
Q+(z, δ) = z − δ   for z ≥ 2δ

Q−(z, δ) = −z − δ   for z ≤ −2δ
Q−(z, δ) = −(1/(4δ²)) z³ − (1/(16δ³)) z⁴   for −2δ < z < 0
Q−(z, δ) = 0   for z ≥ 0

Note that when δ → 0, we have Q+(z, δ) → max{0, z} and Q−(z, δ) → −min{0, z}. Further details on the model can be found in [12].
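In code, the two smoothing functions read as follows (a small sketch; the NumPy vectorization is just a convenience, not something prescribed by the model description).

```python
import numpy as np

def q_plus(z, delta):
    """Smooth approximation of max{0, z}: zero for z <= 0, a polynomial blend on (0, 2*delta),
    and z - delta for z >= 2*delta."""
    z = np.asarray(z, dtype=float)
    blend = z ** 3 / (4 * delta ** 2) - z ** 4 / (16 * delta ** 3)
    return np.where(z <= 0, 0.0, np.where(z < 2 * delta, blend, z - delta))

def q_minus(z, delta):
    """Smooth approximation of -min{0, z}; by symmetry Q-(z, delta) = Q+(-z, delta)."""
    return q_plus(-np.asarray(z, dtype=float), delta)
```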

The water reservoir network model

The water reservoir network we consider consists of 10 basins, each one affected by stochastic inflows and controlled by means of water releases. The objectives of the control are (i) to keep the level of the water in the reservoirs, at the beginning of the new stage, as close as possible to a target value smaller than the maximum capacity and (ii) to minimize the cost (maximize the benefit) represented by a function g of the water releases and/or water levels (e.g., power generation, irrigation, etc.).

The state equation for the j-th reservoir, j = 1, . . . , 10, can be written as

xt+1,j = xt,j + Σ_{i∈Uj} rt,i − rt,j + εt,j

where xt,j, for j = 1, . . . , 10, is the amount of water in the j-th reservoir at the beginning of stage t, rt,j is the amount of water released from the j-th reservoir during stage t, εt,j is the net inflow into the j-th reservoir during stage t, and Uj is the set of indexes corresponding to the reservoirs which release water into reservoir j.

The stochastic inflows εt are modeled through an autoregressive system of order 2, affected by a random correction θt that follows a standard normal distribution. Thus, at each stage t, the values of the inflows of stages t − 1 and t − 2 have to be included into the state vector, which leads to a 30-dimensional problem.


For what concerns the remaining 20 components of the state vector, we can define, for j = 1, . . . , 10, xt,j+10 = εt−1,j and xt,j+20 = εt−2,j, and simply write

xt+1,j+10 = εt,j,   xt+1,j+20 = xt,j+10

The cost function has the following structure

h(xt, rt, εt) = Σ_{j=1}^{10} |xt+1,j − x̄j| − Σ_{j=1}^{10} pj Q+(rt,j, δj)

where x̄j is the target level for the j-th reservoir and pj is a coefficient for the benefit due to the release of water from reservoir j. Again, the absolute value has been approximated by using Q+ and Q−. Q+ has also been used to model a benefit which becomes relevant for "large" values of the water releases, depending on the value of δj. Suitable constraints enforce the requirements that each release must be positive and limited by a maximal pumpage capability, and that it can never exceed the amount of water at the beginning of the period plus the water that flows in from the upstream reservoirs.

The configuration of the network is depicted in Figure 1, while the target levels for the various reservoirs are reported in Table 1. A detailed description of the model can be found in [17].

Finite horizon tests

For what concerns the finite horizon case, we have implemented the ADP procedure for both the 9-dimensional and the 30-dimensional models to solve the SOC problem over T = 3 stages.

The state spaces Xt for the various stages have been discretized by 18 different sequences, 9 based on a random extraction with uniform distribution and 9 based on different kinds of low-discrepancy sequences¹².

For both discretization types, 3 basic sequences with L = 600 points (9-dimensional case) and L = 2000 points (30-dimensional case) have been generated, while the other sequences are obtained as subsets of these basic samples, giving rise to 3 sequences of size L = 200 and L = 1000, and 3 of size L = 400 and L = 1500, respectively. This allows us to test the improvement of the solution when new points are added to the discretization. The 18 sequences thus obtained, originally contained in the d-dimensional hypercube [0, 1]^d, have been scaled to fit the state spaces for the various stages t.

The expected values of the costs with respect to the random vectors have been approximated by averaging over a finite number of realizations drawn from the proper distribution (i.e., the lognormal one for the inventory problem and the standard normal one for the reservoir management problem). Specifically, we used 8 vectors for the 9-dimensional problem and 10 vectors for the 30-dimensional one.

For what concerns the models for the approximation of the cost-to-go functions, we used feedforward one-hidden-layer perceptron networks, i.e., nonlinear mappings of the form

ψ(x, α) = Σ_{i=1}^{ν} ci σ(x⊤ai + bi)

where ν is the number of "neural units", ai ∈ Rd, bi ∈ R and ci ∈ R are the "free" parameters of the network and σ(·) is the activation function.
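In code, such a network amounts to the following sketch (the random parameter values shown are arbitrary; in the experiments the parameters are optimized by the Levenberg-Marquardt method, which is not reproduced here).

```python
import numpy as np

def psi(x, A, b, c, sigma):
    """One-hidden-layer perceptron: psi(x, alpha) = sum_i c_i * sigma(x . a_i + b_i),
    where alpha = (a_1, ..., a_nu, b, c) collects all the free parameters."""
    return c @ sigma(A @ x + b)      # A: nu x d, b: nu, c: nu

# Example with nu = 7 logarithmic-sigmoid units on a 9-dimensional input (cf. the inventory problem).
rng = np.random.default_rng(0)
nu, d = 7, 9
A, b, c = rng.normal(size=(nu, d)), rng.normal(size=nu), rng.normal(size=nu)
logsig = lambda z: 1.0 / (1.0 + np.exp(-z))
value = psi(np.zeros(d), A, b, c, logsig)
```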

¹²The implementation of these kinds of sequences comes from various sources, including [29], http://www.csit.fsu.edu/∼burkardt/m src/halton/halton.html, http://ldsequences.sourceforge.net.


For the inventory problem, we used ν = 7 and σ(·) having the form of a logarithmic sigmoid

σ(z) = 1 / (1 + e^{−z}).

For what concerns the 30-dimensional problem, we used ν = 15 and σ(·) having the form of a hyperbolic tangent

σ(z) = (e^z − e^{−z}) / (e^z + e^{−z}).

At each stage t, the same initialized network has been trained with each of the 18 training sets, by minimizing the empirical risk as described in Sec. 3, using the same algorithm (the Levenberg-Marquardt method) with the same number of training iterations.

To test the goodness of a solution (i.e., the goodness of the approximate cost-to-go functions obtained by the various discretization methods), 10 initial vectors x0,i, i = 1, . . . , 10, have been chosen in a set X0 (the ranges for the various components of the state are reported in Tables 2 and 3). For each point, the optimal cost has been computed by means of reoptimization, averaging over 20 different "online" random sequences. Finally, the average of these 200 costs has been used to measure the performance of a given discretization method.

Tables 4a and 5a contain the comparison among the 18 training sets, based on the value of the mean cost defined as above, for the inventory forecasting and the reservoir management¹³ problem, respectively. Tables 4b and 5b contain the average costs for both kinds of discretization. In the tables and in the following, "UR-n" means "uniform random sequence number n", while "LD-n" means "low-discrepancy sequence¹⁴ number n".

Infinite horizon tests

For the infinite horizon case, we tested the goodness of the approximation of the cost-to-go functions in a generic step of a policy iteration algorithm. As discussed in Sec. 3, this is the most critical issue for a good performance of the algorithm.

For the tests we employed the same kinds of discretization used for the finite horizon case, again leading to 18 different sampling schemes with the same sizes, suitably scaled to the state space X, bounded by the values reported in Tables 6 and 7.

For each point xl of the discretization, we have computed the corresponding value of J◦(xl), estimating the infinite-horizon expected value by simulation of the system over a finite horizon of T = 20 stages and averaging over a finite number Q = 10 of realizations of the random vectors. For what concerns the value of the controls at the various stages, we employed a feedforward neural network trained, up to the same level of accuracy, on the basis of the values of the samples µ(xl), l = 1, . . . , L (as described in Sec. 3.2), coming from a heuristic policy µ. In particular, we have µi(x) = xi+3 − xi, i = 1, . . . , 3, for the inventory problem and µi(x) = x̄i − xi, i = 1, . . . , 3, for the reservoir management problem. The networks used to approximate the controls have ν = 10 both for the inventory and the reservoir problem. Then, the values of J◦(xl) thus obtained for the 18 discretizations have been used to train the approximations of the cost-to-go functions. For both the 9-dimensional and the 30-dimensional problems we employed feedforward neural networks with hyperbolic tangent activation functions, with ν = 5 and ν = 15 neural units respectively. For a fair comparison, we trained all the networks corresponding to the same size L up to the same level of accuracy of the empirical risk. The goodness of the approximation has been tested by computing the root mean squared (RMS) error between the output of the networks and the function J◦ computed in Ltest = 200 points coming from a discretization of the state space based on a randomized Latin hypercube design [30]. The choice of such a discretization, which is a hybrid between a purely random and a deterministic uniform covering of the space, aims at obtaining unbiased results.

¹³Note that for the reservoir management problem we consider benefits, therefore good performance is represented by a highly negative cost.

¹⁴Specifically, LD-1 is a Niederreiter sequence, LD-2 a Halton sequence and LD-3 a Sobol' sequence for the inventory model, while LD-1 and LD-2 are two different Sobol' sequences and LD-3 is a Halton sequence for the reservoir model.

The RMS errors obtained for the inventory forecasting problem with the 18 discretization schemes are shown in Table 8a, while Table 9a contains the RMS errors for the reservoir model. Tables 8b and 9b contain the average RMS errors for both kinds of discretization.

Comments

For what concerns the finite horizon tests, Tables 4a and 5a show that low-discrepancy sequences generally perform better than random ones in the context of both the 9-dimensional and the 30-dimensional problem. In fact, not only do they provide the lowest costs, but they also give the lowest average (between the two kinds of sequences) for every sample size, as Tables 4b and 5b show.

Furthermore, only LD sequences show an actual average improvement of the solution as the sample size increases. Looking at the single discretizations, though, this is not true for LD-3 in the 9-dimensional case. However, that training set provides for L = 400 a cost that is better than all of those coming from random sequences.

Also for what concerns the infinite horizon results, Table 8a shows that low-discrepancy sequences clearly outperform the random ones in the approximation of the cost-to-go functions in the 9-dimensional case. In fact, for every size L, the average RMS error of UR sequences is higher than that of LD sequences, as Table 8b shows.

For what concerns the 30-dimensional example, LD sequences still appear as generally providing better results, for they give the lowest costs for each L and always present an improvement as the sample size grows (Table 9a). Yet, as Table 9b shows, UR sequences have a better average RMS for L = 1500, mainly due to the performance of the Halton sequence.

In general, considering both kinds of tests, the advantages of a discretization based on low discrepancy with respect to one involving random sequences are evident in the 9-dimensional examples, while they appear to be less marked for the 30-dimensional cases, even if LD sequences still clearly result as the best alternative. This is reasonable, since for such high dimensions we can expect to have a definitely more uniform covering of the space only with a very large number of points (which is basically what we want to avoid). In particular, some kinds of LD sequences (e.g., the Sobol' sequences) appear to perform better than others, though there is actually no way of performing an a priori choice among the different methods for generating LD sequences. This seems to represent one of the main limits of the low-discrepancy approach; another one derives from the fact that LD sequences are definitely more difficult to implement than uniform random ones.

Finally, for what concerns the 30-dimensional reservoirs problem, LD sequences appear to perform better in the finite horizon case. This may be due to the fact that the variation of the cost-to-go function in the policy iteration method can become quite large for high dimensions of the state space (as shown in the proof of Theorem 4.2). This seems to suggest that the value iteration method should be chosen for high-dimensional infinite-horizon problems.

6 Conclusions

Sample complexity in approximate Dynamic Programming algorithms, where cost-to-go functions are approximated by parameterized models, has been considered both for the case of finite horizon and discounted infinite horizon problems.


Deterministic learning has been considered in order to face the curse of dimensionality arising from the discretization of the state space. By employing special deterministic point sets, known as low-discrepancy sequences, it is possible to prove convergence of the estimation of the cost-to-go functions at an almost linear rate, which is better than classic results deriving from pure Monte Carlo sampling. The hypotheses under which the almost linear rate is guaranteed in a DP context have been investigated, and sufficient conditions on the regularity of the cost-to-go functions in order to avoid the curse of dimensionality for sample complexity have been derived.

The method has then been applied to different examples of high-dimensional models, performing both finite and infinite horizon tests. The results show that low-discrepancy sequences generally outperform purely random sequences, in accordance with the theory presented in this paper.

Appendix

Proof of Theorem 4.1

We write the t-th approximate cost-to-go function in this form

J◦t(xt) = Eθt[ h(xt, µ∗t(xt), θt) + J◦t+1[f(xt, µ∗t(xt), θt)] ]   (12)
        = ∫_{Θt} h(xt, µ∗t(xt), θt) p(θt) dθt + ∫_{Θt} J◦t+1[f(xt, µ∗t(xt), θt)] p(θt) dθt
        = h̄(xt) + J̄◦t+1(xt)

We have by Corollary 4.1 that J◦t ∈ WM([0, 1)d) for some M ∈ R if both h̄ and J̄◦t+1 have finite variation on [0, 1)d.

For what concerns h̄, we have, for k = 1, . . . , d and 1 ≤ i1 ≤ i2 ≤ · · · ≤ ik ≤ d,

∂i1,...,ik h̄ = ∂^k/(∂xt,i1 · · · ∂xt,ik) ∫_{Θt} h(xt, µ∗t(xt), θt) p(θt) dθt = ∫_{Θt} [ ∂^k h(xt, µ∗t(xt), θt) / (∂xt,i1 · · · ∂xt,ik) ] p(θt) dθt

and

|∂i1,...,ik h̄| = | ∫_{Θt} [ ∂^k h(xt, µ∗t(xt), θt) / (∂xt,i1 · · · ∂xt,ik) ] p(θt) dθt | ≤ ∫_{Θt} | ∂^k h(xt, µ∗t(xt), θt) / (∂xt,i1 · · · ∂xt,ik) | p(θt) dθt

After a change of variables, we can write

h(xt, µ∗t(xt), θt) = h(γi1(xt), . . . , γik+m(xt), θt)

where

γij(xt) ≜ xt,ij for 1 ≤ j ≤ k,   γij(xt) ≜ µ∗t,j−k(xt) for k + 1 ≤ j ≤ k + m

It follows from Lemma 4.1, Assumption 4.1a and Assumption 4.2 that for any θt the function

∂^k h(xt, µ∗t(xt), θt) / (∂xt,i1 · · · ∂xt,ik)

has finite variation. Then we define

M(θt) = sup_{xt,i1,...,xt,ik} | ∂^k h(xt, µ∗t(xt), θt) / (∂xt,i1 · · · ∂xt,ik) |

It is easy to see that ∂i1,...,ik h̄ is continuous, and |∂i1,...,ik h̄| is bounded by

sup_{θt∈Θ} M(θt)

This proves that h̄ has bounded variation on [0, 1)d.

For what concerns J̄◦t+1, we can proceed in the same way as for h̄. By operating the same change of variables we write

J◦t+1[f(xt, µ∗t(xt), θt)] = J◦t+1[f1(γi1(xt), . . . , γik+m(xt), θt), . . . , fd(γi1(xt), . . . , γik+m(xt), θt)]

and obtain, by Lemma 4.1, Assumption 4.1 and Assumption 4.2, that J̄◦t+1 has bounded variation on X. As a consequence, the same assertion is true also for J◦t.

¤

Proof of Theorem 4.2

We write the estimated cost given µk as

J◦k(x) = lim_{T→∞} Eθ Σ_{t=0}^{T} [ β^t h(xt, µk(xt), θt) ] = lim_{T→∞} ∫_{Θ^T} Σ_{t=0}^{T} [ β^t h(xt, µk(xt), θt) ] p(θ) dθ

Consider the term of the sum corresponding to t = 1, i.e., h(x1, µk(x1), θ1). We can write the explicit dependence on x as

h1(x) = h(f(x, µk(x), θ0), µk(f(x, µk(x), θ0)), θ1)

In general, if we consider the "unfolded" expression of the t-th term, we obtain a long recursion made of terms where the function f is nested t times, and µk is nested up to t + 1 times.

Deriving an expression for |∂i1,...,ij ht| involves computing the partial derivatives, up to the j-th order, of the aforementioned "unfolded" form. By tedious algebra and iterative application of the chain rule for differentiating compositions of functions, we can see that this leads to a term that is bounded, for each sequence θ0, . . . , θt, by

|∂i1,...,ij ht| ≤ δ(j, d, m, M_{f,2}, . . . , M_{f,j}, M_{µ,2}, . . . , M_{µ,j}, M_{h,1}, . . . , M_{h,j}) t^{γ(j,d,m)} M_{f,1}^{jt} M_{µ,1}^{j(t+1)}

where M_{f,q}, M_{h,q} and M_{µ,q} correspond to sup_θ max_{n=1,...,d} |∂i1,...,iq fn|, sup_θ |∂i1,...,iq h| and max_k max_{n=1,...,m} |∂i1,...,iq µk,n|, respectively, while δ, γ are functions that do not depend on t; δ is finite provided that M_{f,q}, M_{µ,q}, M_{h,q} are finite (which is true from conditions a-c of Assumption 4.3).

Thus, we can write

|∂i1,...,ij J◦k| ≤ lim_{T→∞} Σ_{t=0}^{T} δ t^{γ(j,d,m)} β^t M_{f,1}^{jt} M_{µ,1}^{j(t+1)}

which converges to a finite value provided βM_{f,1}^j M_{µ,1}^j < 1.

¤


Figures and Tables

Figure 1: Reservoir Network.

Reservoir:                1    2    3    4    5    6    7    8    9    10
Target water level (*):  200  250  260  270  220  420  200  500  180  340

Table 1: Target water levels for the 30-dimensional reservoir management problem ((*) in 10^5 cubic meters)

Bounds of X0:   x1    x2    x3   x4   x5   x6   x7   x8   x9
Xmin0:         −20   −24   −15    0    0    0    0    0    0
Xmax0:          20    24    15   20   24   15   13   16   10

Table 2: Bounds for the inventory forecasting problem - finite horizon tests


Components 1-15
Bounds of X0:   x1    x2    x3    x4    x5    x6    x7    x8    x9   x10   x11   x12   x13   x14   x15
Xmin0:         180   230   240   250   200   400   180   480   160   320    16    15    16    17    15
Xmax0:         220   270   280   290   240   440   220   520   200   360    38    37    37    37    37

Components 16-30
Bounds of X0:  x16   x17   x18   x19   x20   x21   x22   x23   x24   x25   x26   x27   x28   x29   x30
Xmin0:          14    14    14    14     8    14    10    12    11    12     3     3     1     3     2
Xmax0:          21    22    22    21    11    44    42    41    43    43    23    25    25    25    15

Table 3: Bounds for the water reservoirs problem - finite horizon tests

(a) Cost
Training Set    L=200     L=400     L=600
UR-1            70.6397   56.0255   52.3873
UR-2            48.9437   48.5632   51.0478
UR-3            54.8784   77.9804   49.0067
LD-1            67.4341   55.5062   45.2913
LD-2            55.7225   51.6289   43.5876
LD-3            51.0034   47.8010   56.1214

(b) Average cost
Sample size     URS       LDS
L=200           58.1539   58.0533
L=400           60.8563   51.6454
L=600           50.8139   48.3334

Table 4: Inventory forecasting problem: costs (a) and average costs (b)

(a) Cost
Training Set    L=1000      L=1500      L=2000
UR-1            −347.4490   −348.9913   −345.6562
UR-2            −347.3970   −346.0732   −342.1767
UR-3            −210.7020   −348.6975   −352.0993
LD-1            −348.3664   −349.9061   −351.9973
LD-2            −344.6842   −351.0062   −352.8830
LD-3            −343.4628   −346.8174   −351.7950

(b) Average cost
Sample size     URS         LDS
L=1000          −301.8493   −345.5045
L=1500          −347.9207   −349.2432
L=2000          −346.6441   −352.2251

Table 5: Reservoirs network management problem: costs (a) and average costs (b)

Bounds of X:   x1   x2   x3   x4   x5   x6   x7   x8   x9
Xmin:          −5   −5   −5    0    0    0    0    0    0
Xmax:           5    5    5   10   10   10   10   10   10

Table 6: Bounds for the inventory forecasting problem - infinite horizon tests

Components 1-15
Bounds of X:    x1    x2    x3    x4    x5    x6    x7    x8    x9   x10   x11   x12   x13   x14   x15
Xmin:          140   190   200   210   160   360   140   440   120   280    16    15    16    17    15
Xmax:          260   310   320   330   280   480   260   560   240   400    38    37    37    37    37

Components 16-30
Bounds of X:   x16   x17   x18   x19   x20   x21   x22   x23   x24   x25   x26   x27   x28   x29   x30
Xmin:           14    14    14    14     8    14    10    12    11    12     3     3     1     3     2
Xmax:           21    22    22    21    11    44    42    41    43    43    23    25    25    25    15

Table 7: Bounds for the water reservoirs problem - infinite horizon tests


(a) RMS Error
Training Set    L=200    L=400    L=600
UR-1            0.4501   0.8116   0.3984
UR-2            0.8846   0.5581   0.3758
UR-3            0.3357   0.8526   0.2649
LD-1            0.3666   0.2756   0.2230
LD-2            0.5861   0.3237   0.1794
LD-3            0.4516   0.4590   0.1904

(b) Average RMS Error
Sample size     URS      LDS
L=200           0.5568   0.4681
L=400           0.7407   0.3528
L=600           0.3464   0.1976

Table 8: Inventory forecasting problem: RMS errors (a) and average RMS (b)

(a) RMS Error
Training Set    L=1000    L=1500    L=2000
UR-1            43.6156   32.3719   27.8988
UR-2            40.9879   29.8102   29.1203
UR-3            35.0081   38.3010   27.8337
LD-1            37.7873   33.5182   26.9206
LD-2            33.9177   33.8786   26.0140
LD-3            38.6658   34.8530   30.4339

(b) Average RMS Error
Sample size     URS       LDS
L=1000          39.8705   36.7903
L=1500          33.4944   34.0833
L=2000          28.2843   27.7895

Table 9: Reservoirs management problem: RMS errors (a) and average RMS (b)


References

[1] R. Bellman, Dynamic Programming. Princeton: Princeton University Press, 1957.

[2] R. Bellman and S. Dreyfus, Applied Dynamic Programming. Princeton: Princeton University Press, 1962.

[3] R. E. Larson, State Increment Dynamic Programming. New York: Elsevier Publ. Co., 1968.

[4] M. Puterman, Markov Decision Processes. New York: Wiley, 1994.

[5] D. Bertsekas, Dynamic Programming and Optimal Control (2nd Edition), vol. I. Belmont: Athena Scientific, 2000.

[6] D. Jacobson and D. Mayne, Differential Dynamic Programming. New York: Academic,1970.

[7] R. Bellman, R. Kalaba, and B. Kotkin, "Polynomial approximation - a new computational technique in dynamic programming allocation processes," Math. Comp., vol. 17, pp. 155–161, 1963.

[8] D. Bertsekas, "Convergence of discretization procedures in dynamic programming," IEEE Trans. on Automatic Control, vol. 20, pp. 415–419, 1975.

[9] E. Foufoula-Georgiou and P. Kitanidis, "Gradient dynamic programming for stochastic optimal control of multidimensional water resources systems," Water Resour. Res., vol. 24, pp. 1345–1359, 1988.

[10] S. Johnson, J. Stedinger, C. Shoemaker, Y. Li, and J. Tejada-Guibert, "Numerical solution of continuous-state dynamic programs using linear and spline interpolation," Oper. Res., vol. 41, pp. 484–500, 1993.

[11] C. Chow and J. Tsitsiklis, "An optimal multigrid algorithm for continuous state discrete time stochastic control," IEEE Trans. on Automatic Control, vol. 36, pp. 898–914, 1991.

[12] V. Chen, D. Ruppert, and C. Shoemaker, "Applying experimental design and regression splines to high-dimensional continuous-state stochastic dynamic programming," Oper. Res., vol. 47, pp. 38–53, 1999.

[13] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming. Belmont: Athena Scientific,1996.

[14] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1995.

[15] J. M. Hammersley and D. C. Handscomb, Monte Carlo Methods. London: Methuen, 1964.

[16] C. Cervellera and M. Muselli, "Deterministic design for neural network learning: An approach based on discrepancy," IEEE Trans. on Neural Networks, vol. 15, pp. 533–543, 2004.

[17] C. Cervellera, V. C. Chen, and A. Wen, "Optimization of a large-scale water reservoir network by stochastic dynamic programming with efficient state space discretization," European Journal of Operational Research, vol. 171, no. 3, pp. 1139–1151, 2006.

[18] C. Cervellera, V. Chen, and A. Wen, "Neural network and regression spline value function approximations for stochastic dynamic programming," Computers and Operations Research, 2006.


[19] M. Baglietto, C. Cervellera, T. Parisini, M. Sanguineti, and R. Zoppoli, "Neural approximators, dynamic programming and stochastic approximation," Proc. 19th Amer. Contr. Conf., pp. 3304–3308, 2000.

[20] R. Zoppoli, M. Sanguineti, and T. Parisini, "Approximating networks and extended Ritz method for the solution of functional optimization problems," Journ. of Optim. Theory and Appl., vol. 112, pp. 403–439, 2002.

[21] K.-T. Fang and Y. Wang, Number-theoretic Methods in Statistics. London: Chapman & Hall, 1994.

[22] N. Alon and J. Spencer, The Probabilistic Method. New York: Wiley, 2000.

[23] H. Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods. Philadelphia: SIAM, 1992.

[24] A. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,”IEEE Trans. on Information Theory, vol. 39, pp. 930–945, 1993.

[25] P. Niyogi and F. Girosi, "On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions," Neural Computation, vol. 8, pp. 819–842, 1996.

[26] L. Breiman, "Hinging hyperplanes for regression, classification and function approximation," IEEE Trans. on Inf. Theory, vol. 39, pp. 993–1013, 1993.

[27] N. Stokey, R. Lucas, and E. Prescott, Recursive Methods in Economic Dynamics. Cambridge: Harvard University Press, 1989.

[28] R. M. Dudley, Real Analysis and Probability. Pacific Grove, CA: Wadsworth & Brooks/Cole, 1989.

[29] P. Bratley, B. L. Fox, and H. Niederreiter, "Programs to generate Niederreiter's low-discrepancy sequences," ACM Transactions on Mathematical Software, vol. 20, no. 4, pp. 494–495, 1994.

[30] V. C. P. Chen, K.-L. Tsui, R. R. Barton, and J. K. Allen, "A review of design and modeling in computer experiments," in Handbook in Industrial Statistics (C. R. Rao and Ravi Khattree, eds.), pp. 231–261, Elsevier Science, 2003.