An Optimal Approximate Dynamic Programming Algorithm for the Energy Dispatch Problem with Grid-Level Storage
Juliana M. Nascimento and Warren B. Powell
Department of Operations Research and Financial Engineering, Princeton University
September 7, 2009
Abstract
We consider the problem of modeling hourly dispatch and energy allocation decisions in the presence of grid-level storage. The model makes it possible to capture hourly, daily and seasonal variations in wind, solar and demand, while also modeling the presence of hydro-electric storage to smooth energy production over both daily and seasonal cycles. The problem is formulated as a stochastic, dynamic program, where we approximate the value of stored energy. We provide a rigorous convergence proof for an approximate dynamic programming algorithm, which can capture the presence of both the amount of energy held in storage as well as other exogenous variables. Our algorithm exploits the natural concavity of the problem to avoid any need for explicit exploration policies.
1 Introduction
There is a long history of modeling energy dispatch using deterministic optimization models
to capture the economic behavior of energy allocation and electric power dispatch. Linear pro-
gramming has been used for energy policy modeling in models such as NEMS (NEMS (2003)),
MARKAL (Fishbone & Abilock (1981)), MESSAGE III (Messner & Strubegger (1995)) and
Meta*Net (Lamont (1997)). These models typically work in very aggregate time steps such as
a year or more, which require that average values be used for intermittent energy sources such
as wind and solar. However, it is well known (Lamont (2008)) that intermittent energy sources
such as wind and solar look far more attractive when we use average production over periods
as short as a few hours. Averaging reduces the peaks, where energy is often lost because we are unable to use brief bursts, and raises the valleys, which determine the amount of spinning reserve that must be maintained to supply demand in the event that these sources drop to their minimum. To accurately model the economics of intermittent energy, it is necessary to model energy supply and demand in time increments no greater than an hour.
Some authors recognize the importance of capturing hourly wind variations, but still use
deterministic linear programming models which are allowed to make storage and allocation
decisions knowing the entire future trajectory (Castronuovo & Lopes (2004)). Garcia-Gonzalez
et al. (2008) and Brown et al. (2008) use the idea of stochastic wind scenarios, but allow
decisions to “see” the wind in the future within a particular scenario (violating what are
known as nonanticipativity conditions). These models will also overstate the value of energy
from intermittent sources by being able to make adjustments to other sources of energy in
anticipation of future fluctuations in wind and solar. For this reason, it is important to capture
not only the variability of wind and solar, but also the uncertainty. Furthermore, uncertainty arises from a number of sources, such as demand, prices and rainfall, among others.
We address the problem of allocating energy resources over both the electric power grid
(generally referred to as the electric power dispatch problem) as well as other forms of energy
allocation (conversion of biomass to liquid fuels, conversion of petroleum to gasoline or elec-
tricity, and the use of natural gas in home heating or electric power generation). Determining
how much energy can be converted from each source to serve each type of demand can be
modeled as a series of single-period linear programs linked by a single, scalar storage variable,
as would be the case with grid-level storage.
Let Rt be the scalar quantity of stored energy on hand, and let Wt be a vector-valued
stochastic process describing exogenously varying parameters such as available energy from
wind and solar, demand of different types (with hourly, daily and weekly patterns), energy
prices and rainfall. We assume that $W_t$ is Markovian to be able to model processes such as wind, prices and demand, where the future (say, the price $p_{t+1}$ at time $t+1$) depends on the current price $p_t$ plus an exogenous change $\hat{p}_{t+1}$, allowing us to write $p_{t+1} = p_t + \hat{p}_{t+1}$. If $w_t$ is the wind at time $t$, $p_t$ is a price (or vector of prices), and $D_t$ is the demand, we would let $W_t = (w_t, p_t, D_t)$, where the future of each stochastic process depends on the current value.
The state of our system, then, is given by St = (Rt,Wt).
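Although the paper develops the model abstractly, the Markov structure of $W_t$ is easy to simulate. Below is a minimal Python sketch (the random-walk dynamics and parameter values are illustrative placeholders, not the paper's calibrated processes) of generating the exogenous process $W_t = (w_t, p_t, D_t)$ and the state $S_t = (R_t, W_t)$:

```python
import random

def sample_exogenous(W, rng):
    """One step of the Markov process W_t = (wind, price, demand).

    Each component evolves from its current value plus an exogenous change
    (e.g. p_{t+1} = p_t + phat_{t+1}); the dynamics here are hypothetical."""
    wind, price, demand = W
    wind = max(0.0, wind + rng.gauss(0.0, 1.0))       # wind level
    price = max(0.0, price + rng.gauss(0.0, 0.5))     # p_{t+1} = p_t + phat
    demand = max(0.0, demand + rng.gauss(0.0, 2.0))   # demand with noise
    return (wind, price, demand)

def simulate_states(R0, W0, T, seed=0):
    """Generate the pre-decision states S_t = (R_t, W_t); the storage level
    stays fixed here since no decisions are being made yet."""
    rng = random.Random(seed)
    states = [(R0, W0)]
    for _ in range(T):
        R, W = states[-1]
        states.append((R, sample_exogenous(W, rng)))
    return states

states = simulate_states(R0=10, W0=(5.0, 30.0, 50.0), T=24)
```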
In our problem, xt is a vector-valued control giving the allocation of different sources of
energy (oil, coal, wind, solar, nuclear, etc.) over different energy pathways (conversion to
electricity, transmission over the electric power grid, conversion of biomass to liquid fuel) and
to satisfy different types of demands. xt also includes the amount of energy stored in an
aggregate, grid-level storage device such as a reservoir for hydroelectric power.
An optimal policy is described by Bellman's equation

$$V_t(S_t) = \max_{x_t \in \mathcal{X}_t} \bigl( C(S_t, x_t) + \gamma\, \mathbb{E}[V_{t+1}(S_{t+1}) \mid S_t] \bigr), \qquad (1)$$

where $S_{t+1} = f(S_t, x_t, W_{t+1})$ and $C(S_t, x_t)$ is a contribution function that is linear in $x_t$. We use the convention that any variable indexed by $t$ is $\mathcal{F}_t$-measurable, where $\mathcal{F}_t$ is the sigma-algebra generated by $W_1, W_2, \ldots, W_t$.
We are going to assume that Wt is a set of discrete random variables, which includes
exogenous energy supply (such as wind and solar), demand, prices, rainfall and any other
uncertain parameters. When the supplies and demands are discrete (in particular, when each can be written as a scaling factor times an integer), it is possible to show that $V_t(S_t) = V_t(R_t, W_t)$ can be written as a piecewise linear function in $R_t$. We treat $x_t$ as a continuous variable,
but we will show that the optimal policy involves solving a linear program which returns an
optimal xt that is also defined on a discrete set of outcomes (the extreme point solutions of
the linear program).
In this case, we can ensure that the solution xt also takes on discrete values, but we solve
the optimization problem as a continuous problem using a linear programming solver. Thus,
we can view xt and Rt as continuous, while ensuring that they always take on discrete values.
Although $S_t$ may have only five or ten dimensions, this is often enough to render
(1) computationally intractable. There is an extensive literature in approximate dynamic
programming which assumes either discrete action spaces, or low dimensional, continuous
controls. For our problem, xt can easily have hundreds or thousands of dimensions, depending
on the level of aggregation. In addition, we are unable to compute the expectation ana-
lytically, introducing the additional complication of maximizing an expectation that is itself
computationally intractable. The problem of vector-valued state, outcome and action spaces
is known as the three curses of dimensionality (Powell (2007)).
We propose a Monte Carlo-based algorithm based on the principles of approximate dynamic programming. We assume that $C(S_t, x_t)$ is linear in $x_t$ given $S_t$. Now define $F_t(R_t)$ by

$$F_t(R_t) = \max_{x_t \in \mathcal{X}_t} C(S_t, x_t)$$

subject to a flow conservation constraint involving the stored energy $R_t$. It is well known (see,
e.g. Vanderbei (1997)) that Ft(Rt) is piecewise linear concave in Rt. Backward induction
can be used to then show that the value function Vt(St) = Vt(Rt,Wt) is piecewise linear
and concave in the resource dimension Rt. We exploit this concavity to show almost sure
convergence of an ADP algorithm that uses pure exploitation. This is important for higher
dimensional problems where exploration steps, which may be useful for proving convergence,
can be extremely slow in practice (imagine inserting an exploration step when there are
millions of states). This algorithmic strategy has proven to be very successful for industrial
applications (see, for example, Godfrey & Powell (2002), Topaloglu & Powell (2006)) and
most recently in an energy model that is the motivating application for this paper (Powell et
al. (2009)), but none of these prior papers provided any formal convergence results.
Our work is motivated by a particular energy application, but the problem class is more
general. For example, the theory would apply to problems in logistics involving allocation
of a single commodity from multiple suppliers to multiple consumers, with a single inven-
tory linking time periods. We were recently introduced to such an application in Europe
involving the distribution of cement provided by multiple suppliers to numerous construction
projects. There are many other applications involving the distribution of a single product.
This technique has also been applied in transportation to problems with multiple resource
classes (Topaloglu & Powell (2006)), but without any formal convergence results.
Our algorithmic strategy is limited to problems with linear contribution functions subject
to constraints which can be discretized. For this problem class, Ft(Rt) can be described as
a piecewise linear, concave function. We exploit this property to show that we will visit,
infinitely often, the slopes near the optimal solution, allowing us to prove convergence in our
estimates of these slopes. Concavity allows us to avoid visiting other points infinitely often.
Fortunately, there is an extremely wide class of problems that fit these conditions.
Our proof technique combines the convergence proof for a monotone mapping in Bertsekas
& Tsitsiklis (1996) with the proof of the SPAR algorithm in Powell et al. (2004). Current
proofs of convergence for approximate dynamic programming algorithms such as Q-learning
(Tsitsiklis (1994), Jaakkola et al. (1994)) and optimistic policy iteration (Tsitsiklis (2002))
require that we visit states (and possibly actions) infinitely often. A convergence proof for
a Real Time Dynamic Programming (Barto et al. (1995)) algorithm that considers a pure
exploitation scheme is provided in Bertsekas & Tsitsiklis (1996) [Prop. 5.3 and 5.4], but it
assumes that expected values can be computed and the initial approximations are optimistic,
which produces a type of forced exploration. We make no such assumptions, but it is important
to emphasize that our result depends on the concavity of the optimal value functions.
Others have taken advantage of concavity (convexity for minimization) in the stochastic
control literature. Hernandez-Lerma & Runggladier (1994) presents a number of theoretical
results for general stochastic control problems for the case with i.i.d. noise (an assumption we
do not make), but does not present a specific algorithm that could be applied to our problem
class. There is an extensive literature on stochastic control which assumes linear, additive
noise (see, for example, Bertsekas (2005) and the references cited there). For our problem,
noise arises in costs and, in particular, the constraints.
Other approaches to deal with our problem class would be different flavors of Benders decomposition (Van Slyke & Wets (1969), Higle & Sen (1991), Chen & Powell (1999)) and sample average approximation (SAA) (Shapiro (2003)). However, techniques based on stochastic programming such as Benders require that we create scenario trees to capture the trajectory of
information. We are interested in modeling our problem over a full year to capture seasonal
variations, producing a problem with 8,760 time periods. On the other hand, SAA relies on
generating random samples outside of the optimization problems and then solving the cor-
responding deterministic problems using an appropriate optimization algorithm. Numerical
experiments with the SAA approach applied to problems where an integer solution is required
can be found in Ahmed & Shapiro (2002).
We begin in section 2 with a brief summary of the energy policy model SMART. Section
3 gives the optimality equations and describes some properties of the value function. Section
4 describes the algorithm in detail. Section 5 gives the convergence proof, which is the heart
of the paper. Section 6 describes the use of this algorithm in SMART, and shows that the
approximate dynamic programming algorithm almost perfectly matches the solution provided
by an optimal linear programming model for a deterministic problem. Section 7 concludes the
paper.
2 An energy modeling application
Our problem is based on the SMART model (Powell et al. (2009)), a stochastic, multiscale model of multiple energy supplies and demands, and an energy
conversion network that captures both electric power transmission as well as activities such as
biomass conversion. This model includes hydroelectric storage which is able to store energy
(in the form of water) over long periods of time. This makes it possible to study both the
storage of wind energy (which can be used to generate electricity to pump water back into the
reservoir) along with seasonal rainfall. In this section, we provide only a sketch of this model,
since we only need the structure of the problem and specific mathematical properties such as
concavity.
We assume that the exogenous information process $W_t$ is Markov, and that it includes both the demand for energy as well as any other exogenous information. The demand is vector-valued, representing different types of energy demand, which we represent using $D_t = (D_{t1}, \ldots, D_{t,B^d})$, where $B^d$ is the number of different sources of demand. The demand does not need to be fully satisfied, but no backlogging is allowed (our energy model assumes that energy can be imported at a high price). $W_t$ also includes wind and solar energy production, energy commodity prices, and rainfall. We assume that all elements of $W_t \in \mathcal{W}_t$, including demands, have finite support, which means that they have been discretized at an appropriate level.
We denote by $R_t$ the scalar storage level right before a decision is taken. To simplify the presentation of the algorithm, we discretize the storage, and let the amount in storage be given by $\eta R_t$, where $\eta$ is an appropriately chosen scalar parameter and $R_t$ is a nonnegative integer. If the units of all the forms of energy are chosen carefully, we can let $\eta = 1$. We model the problem as if $R_t$ is continuous, but argue below that it will always take on integer values.

We define $S_t = (W_t, R_t)$ to be the pre-decision state (the information just before we make a decision), where we use $S_t$ and $(W_t, R_t)$ interchangeably. The decision $x_t = (x^d_t, x^s_t, x^{tr}_t)$ has three components. The first component, $x^d_t$, is a vector that represents how much to sell to satisfy the demands $D_t$. The second component, $x^s_t$, is also a vector that represents how much to buy from $B^s$ different energy sources (coal, nuclear, natural gas, hydroelectric, ethanol, biomass, and so on). Finally, $x^{tr}_t$ is a scalar that denotes the amount of energy that should be transferred (discarded) in case the storage level is too high, even after demand has been satisfied.
The feasible set, which depends on the current state $(W_t, R_t)$, is given by

$$\mathcal{X}_t(W_t, R_t) = \Bigl\{ x^d_t \in \mathbb{R}^{B^d},\; x^s_t \in \mathbb{R}^{B^s},\; x^{tr}_t \in \mathbb{R} : \sum_{i=1}^{B^d} x^d_{ti} + x^{tr}_t \leq \eta R_t,$$
$$0 \leq x^d_{ti} \leq D_{ti}, \quad i = 1, \ldots, B^d,$$
$$0 \leq x^s_{ti} \leq M_{ti}, \quad i = 1, \ldots, B^s,$$
$$0 \leq x^s_{ti} \leq u^s_t(W_t), \quad i = 1, \ldots, B^s,$$
$$0 \leq x^{tr}_t \leq u^{tr}_t(W_t) \Bigr\},$$

where $u^s_t(\cdot)$ and $u^{tr}_t(\cdot)$ are nonnegative functions which take on discrete values and $M_{ti}$ is a deterministic bound. The first constraint indicates that backlogging is not allowed, the second indicates that we do not sell more than is demanded, and the third indicates that each supplier can only offer a limited amount of resources. The last two constraints impose upper bounds that vary with the information vector $W_t$.
The contribution generated by the decision is given by

$$C_t(W_t, R_t, x_t) = -\sum_{i=1}^{B^d} c^u_{ti}(W_t)\,(D_{ti} - x^d_{ti}) - c^o_t(W_t)\Bigl(R_t - \sum_{i=1}^{B^d} x^d_{ti}\Bigr) - c^{tr}_t(W_t)\, x^{tr}_t - \sum_{i=1}^{B^s} c^s_{ti}(W_t)\, x^s_{ti} + \sum_{i=1}^{B^d} c^d_{ti}(W_t)\, x^d_{ti},$$

where $c^u_{ti}(\cdot)$, $c^o_t(\cdot)$, $c^{tr}_t(\cdot)$, $c^s_{ti}(\cdot)$ and $c^d_{ti}(\cdot)$ are nonnegative scalar functions. The first three terms represent underage, overage and transfer costs, respectively. The fourth term represents the buying cost and the last one represents the money that is made satisfying demand.
The decision takes the system to a new storage level, denoted by the scalar $R^x_t$, given by

$$R^x_t = f^x(R_t, x_t) \qquad (2)$$
$$\phantom{R^x_t} = R_t - \sum_{i=1}^{B^d} x^d_{ti} + \sum_{i=1}^{B^s} x^s_{ti} - x^{tr}_t, \qquad (3)$$

where $f^x(R_t, x_t)$ is our transition function that returns the post-decision resource state.
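To see concretely why a single-period dispatch problem of this type is piecewise linear and concave in the storage level, consider a stripped-down instance (a sketch with hypothetical prices, ignoring purchases and transfers): stored energy is allocated across demands, the LP reduces to a fractional knapsack, and the marginal value of an extra stored unit is non-increasing:

```python
def F(R, demands):
    """Max one-period contribution from allocating R discrete units of
    stored energy across demands; demands is a list of (quantity, value)
    pairs. Greedy by value is optimal for this fractional-knapsack LP."""
    total, remaining = 0.0, R
    for qty, val in sorted(demands, key=lambda d: -d[1]):  # best price first
        x = min(qty, remaining)          # amount of this demand served
        total += val * x
        remaining -= x
        if remaining == 0:
            break
    return total

demands = [(3, 10.0), (4, 6.0), (5, 2.0)]   # hypothetical (D_i, c^d_i) pairs
values = [F(R, demands) for R in range(13)]
increments = [values[R] - values[R - 1] for R in range(1, 13)]
# concavity: the marginal value of one more stored unit never increases
assert all(a >= b for a, b in zip(increments, increments[1:]))
```

The increments here are the demand prices sorted in decreasing order, which is exactly the piecewise linear concave structure the paper exploits.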
If the problem is deterministic, it can be formulated as a single linear program of the form

$$\max_x \sum_{t=0}^{T} C_t(W_t, R_t, x_t) \qquad (4)$$

subject to $x_t \in \mathcal{X}_t(W_t, R_t)$ and the transition equation (3). In section 6, we compare against the optimal solution obtained from this formulation (again, assuming $W_t$ is deterministic) as a benchmark. When the problem is stochastic, we look for a decision function $X^\pi_t(S_t)$ (or policy) that solves

$$\max_{\pi \in \Pi} \mathbb{E} \sum_{t=0}^{T} C_t(W_t, R_t, X^\pi_t(S_t)). \qquad (5)$$
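The objective in (5) cannot be evaluated in closed form, but any fixed policy can be estimated by averaging sample-path contributions. A minimal Monte Carlo sketch (the myopic policy, prices and dynamics below are toy placeholders, not the paper's model):

```python
import random

def evaluate_policy(policy, contribution, transition, R0, sample_W, T,
                    n_paths, seed=0):
    """Monte Carlo estimate of E sum_t C_t(W_t, R_t, X^pi_t(S_t)) for a
    fixed policy; all model functions are supplied by the caller."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        R = R0
        for t in range(T + 1):
            W = sample_W(t, rng)              # exogenous information W_t
            x = policy(W, R)                  # decision X^pi_t(S_t)
            total += contribution(W, R, x)    # C_t(W_t, R_t, x_t)
            R = transition(R, x, W)           # post-decision level + inflow
    return total / n_paths

# Toy instance: each period, sell min(storage, demand) at the current price.
est = evaluate_policy(
    policy=lambda W, R: min(R, W[1]),
    contribution=lambda W, R, x: W[0] * x,    # price * amount sold
    transition=lambda R, x, W: R - x + 2,     # fixed inflow of 2 units
    R0=5,
    sample_W=lambda t, rng: (rng.uniform(1.0, 3.0), rng.randint(0, 4)),
    T=10, n_paths=200,
)
```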
In this paper, we design the decision function using approximate dynamic programming, where $X^\pi_t(S_t)$ is given by

$$X^\pi_t(S_t) = \arg\max_{x_t \in \mathcal{X}_t} \bigl( C(S_t, x_t) + \gamma\, \mathbb{E}[\bar{V}_{t+1}(S_{t+1}) \mid S_t] \bigr),$$

where $\bar{V}_t(s)$ is an approximation of the value function at state $s$. This strategy introduces two challenges. The first is the presence of an expectation (which cannot be computed exactly) within the optimization problem. We can remove the expectation by using the concept of the post-decision state, $S^x_t = (W_t, R^x_t)$, which is the state immediately after we have made a decision but before any new information has arrived. Since $W_t$ is not affected by a decision, we only have to worry about $R^x_t$, which is the storage level after we have decided how much to add or subtract. The pre-decision state is then computed using

$$R_{t+1} = R^x_t + \hat{R}_{t+1},$$

where $\hat{R}_{t+1}$ might represent exogenous rainfall into our reservoir ($\hat{R}_{t+1}$ would be part of $W_{t+1}$). The post-decision state (see Powell (2007), chapters 4, 11 and 12) allows us to solve problems of the form

$$X^\pi_t(S_t) = \arg\max_{x_t \in \mathcal{X}_t} \bigl( C(S_t, x_t) + \gamma \bar{V}_{t+1}(W_t, R^x_t) \bigr) \qquad (6)$$
which is a deterministic linear program, where $R^x_t$ is given by equation (3).

The second challenge is computing the value function $V_t(S^x_t) = V_t(W_t, R^x_t)$. $V_t(W_t, R)$ can be represented exactly as a piecewise linear function of $R$ indexed by $W_t$. Throughout our presentation, we will write $V_t(W_t, R)$ for $R = 0, 1, \ldots, B^R$, by which we mean $R^x_t = \eta R$. We will treat $R$ as if it is continuous, but it is easy to show by induction that $V^x_t(W_t, R^x_t)$ is piecewise linear and concave in $R^x_t$. Since energy supplies and demands are all assumed to be discretized (with the same discretization parameter $\eta$), the optimal solution of the linear program in equation (6) has the property that the optimal solution $x_t$ will also be discrete, and can be represented as integers scaled by $\eta$. For this reason, assuming the initial amount in storage also takes on one of these values (such as zero), we are assured that $R^x_t$ (and $R_t$) will always take on discrete values, even if we treat them as continuous variables.
For our energy application, concavity of Vt(Wt, R) in R is a significant structural property.
In section 6, we report on comparisons against the optimal solution from a deterministic linear
program. To match this solution, we had to choose η so that BR was on the order of 10,000,
which means that the value function approximation could have as many as 10,000 segments
(and one of these for each of 8,760 time periods). However, our algorithm requires that we
visit only a small number of these (for each value of Wt), accelerating the algorithm by a factor
of 1000. For this reason, it would not affect the computational performance of the algorithm if
we were to discretize the function into a million segments.
Our algorithmic strategy can be extended to more complex problems in an approximate
way. In Powell et al. (2009), energy is stored in a hydroelectric reservoir. If we extended the
model to handle storage using compressed air, flywheels, batteries and compressed hydrogen,
we would lose our convergence proof, because we would have to approximate the value of this vector
of energy storage resources using separable, piecewise linear approximations. Topaloglu &
Powell (2006) does exactly this for a problem managing a fleet of vehicles of different types,
and shows that the use of separable, piecewise linear value functions provides very good, but
not optimal results.
3 The Optimal Value Functions
We define, recursively, the optimal value functions associated with the storage class. We denote by $V^*_t(W_t, R_t)$ the optimal value function around the pre-decision state $(W_t, R_t)$ and by $V^x_t(W_t, R^x_t)$ the optimal value function around the post-decision state $(W_t, R^x_t)$.

At time $t = T$, since it is the end of the planning horizon, the value of being in any state $(W_T, R^x_T)$ is zero. Hence, $V^x_T(W_T, R^x_T) = 0$. At time $t - 1$, for $t = T, \ldots, 1$, the value of being in any post-decision state $(W_{t-1}, R^x_{t-1})$ does not involve the solution of a deterministic optimization problem, only an expectation, since the next pre-decision state $(W_t, R_t)$ depends only on the information that first becomes available at $t$. On the other hand, the value of being in any pre-decision state $(W_t, R_t)$ does not involve expectations, since the next post-decision state $(W_t, R^x_t)$ is a deterministic function of the decision; it only requires the solution of a deterministic optimization problem. Therefore,

$$V^x_{t-1}(W_{t-1}, R^x_{t-1}) = \mathbb{E}\bigl[ V^*_t(W_t, R_t) \mid (W_{t-1}, R^x_{t-1}) \bigr], \qquad (7)$$
$$V^*_t(W_t, R_t) = \max_{x_t \in \mathcal{X}_t(W_t, R_t)} \bigl( C_t(W_t, R_t, x_t) + \gamma V^x_t(W_t, R^x_t) \bigr). \qquad (8)$$
In the remainder of the paper, we only use the value function $V^x_t(S^x_t)$ defined around the post-decision state variable, since it allows us to make decisions by solving a deterministic problem as in (8).
We show that $V^x_t(W_t, R)$ is concave and piecewise linear in $R$ with $B^R$ discrete breakpoints. This structural property, combined with the optimization/expectation inversion, is the foundation of our algorithmic strategy and its proof of convergence.

For $W_t \in \mathcal{W}_t$, let $v_t(W_t) = (v_t(W_t, 1), \ldots, v_t(W_t, B^R))$ be a vector representing the slopes of a function $V_t(W_t, \cdot) : [0, \infty) \to \mathbb{R}$ that is concave and piecewise linear with breakpoints $R = 1, \ldots, B^R$. It is easy to see that the function $F^*_t(v_t(W_t), W_t, \cdot)$, where

$$F^*_t(v_t(W_t), W_t, R_t) = \max_{x_t, y_t} \; C_t(W_t, R_t, x_t) + \gamma \sum_{r=1}^{B^R} v_t(W_t, r)\, y_{tr}$$
$$\text{subject to } x_t \in \mathcal{X}_t(W_t, R_t), \quad \sum_{r=1}^{B^R} y_{tr} = f^x(R_t, x_t), \quad 0 \leq y_t \leq 1,$$

is also concave and piecewise linear with breakpoints $R = 1, \ldots, B^R$, since the demand vector $D_t$ and the upper bounds $u^s_t(\cdot)$, $u^{tr}_t(\cdot)$ and $M_{ti}$ in the constraint set $\mathcal{X}_t(W_t, R_t)$ are all discrete. Moreover, the optimal solution $(x^*_t, y^*_t)$ to the linear programming problem that defines $F^*_t(v_t(W_t), W_t, R_t)$ does not depend on $V_t(W_t, 0)$ and is a discrete vector whenever the argument $R_t$ is discrete. We also have that $F^*_t(v_t(W_t), W_t, R_t)$ is bounded for all $W_t \in \mathcal{W}_t$.
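Because everything is discretized, $F^*_t$ can be evaluated by brute force in a toy instance. The sketch below (with hypothetical prices and a simplified buy/sell decision in place of the paper's full constraint set) exploits the fact that, with monotone decreasing slopes, the optimal $y_t$ fills the segments in order, so the value of a post-decision level is just the sum of the leading slopes:

```python
def pwl_value(slopes, rho):
    """Value of a concave piecewise linear function with unit breakpoints,
    given by its slopes v(1..B^R): with monotone decreasing slopes the
    optimal y fills segments in order, so sum the first floor(rho) slopes
    plus the fractional piece."""
    k = int(rho)
    frac = rho - k
    val = sum(slopes[:k])
    if frac > 0 and k < len(slopes):
        val += frac * slopes[k]
    return val

def F_star(slopes, R, demands, price_buy, max_buy, gamma=1.0):
    """Brute-force F*_t for a toy model: sell an integer s <= min(R, demand)
    at a unit revenue, buy an integer b <= max_buy at price_buy; the
    post-decision level is R - s + b. Prices are hypothetical."""
    best = float("-inf")
    D, revenue = demands
    for s in range(min(R, D) + 1):
        for b in range(max_buy + 1):
            contrib = revenue * s - price_buy * b
            best = max(best, contrib + gamma * pwl_value(slopes, R - s + b))
    return best

slopes = [9.0, 7.0, 4.0, 1.0]          # monotone decreasing v_t(W_t, r)
vals = [F_star(slopes, R, demands=(3, 8.0), price_buy=5.0, max_buy=2)
        for R in range(5)]
```

Evaluating `vals` over successive storage levels shows non-increasing increments, i.e. $F^*_t$ is itself concave and piecewise linear in $R_t$, which is exactly the inductive step behind Proposition 1 below.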
We use $F^*_t$ to prove the following proposition about the optimal value function.

Proposition 1. For $t = 0, \ldots, T$ and information vector $W_t \in \mathcal{W}_t$, the optimal value function $V^x_t(W_t, R)$ is concave and piecewise linear with breakpoints $R = 1, \ldots, B^R$. We denote its slopes by $v^*_t(W_t) = (v^*_t(W_t, 1), \ldots, v^*_t(W_t, B^R))$, where, for $R = 1, \ldots, B^R$ and $t < T$, $v^*_t(W_t, R)$ is given by

$$v^*_t(W_t, R) = V^x_t(W_t, R) - V^x_t(W_t, R - 1)$$
$$= \mathbb{E}\bigl[ F^*_{t+1}(v^*_{t+1}(W_{t+1}), W_{t+1}, R + \hat{R}_{t+1}(W_{t+1})) - F^*_{t+1}(v^*_{t+1}(W_{t+1}), W_{t+1}, R - 1 + \hat{R}_{t+1}(W_{t+1})) \mid W_t \bigr]. \qquad (9)$$
Proof: The proof is by backward induction on $t$. The base case $t = T$ holds as $V^x_T(W_T, \cdot)$ is equal to zero for all $W_T \in \mathcal{W}_T$. For $t < T$, the proof is obtained by noting that

$$V^x_t(W_t, R^x_t) = \mathbb{E}\bigl[ \gamma V^x_{t+1}(W_{t+1}, 0) + F^*_{t+1}(v^*_{t+1}(W_{t+1}), W_{t+1}, R^x_t + \hat{R}_{t+1}(W_{t+1})) \mid W_t \bigr].$$
Due to the concavity of $V^x_t(W_t, \cdot)$, the slope vector $v^*_t(W_t)$ is monotone decreasing, that is, $v^*_t(W_t, R) \geq v^*_t(W_t, R + 1)$ (throughout, we use $v^*_t(W_t)$ to refer to the slopes of the value function $V^x_t(W_t, R^x_t)$ defined around the post-decision state variable). Moreover, throughout the paper, we work with the translated version $V^*_t(W_t, \cdot)$ of $V^x_t(W_t, \cdot)$ given by

$$V^*_t(W_t, R + y) = \sum_{r=1}^{R} v^*_t(W_t, r) + y\, v^*_t(W_t, R + 1),$$

where $R$ is a nonnegative integer and $0 \leq y \leq 1$, since the optimal solution $(x^*_t, y^*_t)$ associated with $F^*_{t+1}(v^*_{t+1}(W_{t+1}), W_{t+1}, R)$ does not depend on $V^x_t(W_t, 0)$.
We next introduce the dynamic programming operator $H$ associated with the storage class. We define $H$ using the slopes of piecewise linear functions instead of the functions themselves.

Let $v = \{v_t(W_t) : t = 0, \ldots, T,\; W_t \in \mathcal{W}_t\}$ be a set of slope vectors, where $v_t(W_t) = (v_t(W_t, 1), \ldots, v_t(W_t, B^R))$. The dynamic programming operator $H$ associated with the storage class maps a set of slope vectors $v$ into a new set $Hv$ as follows. For $t = 0, \ldots, T - 1$, $W_t \in \mathcal{W}_t$ and $R = 1, \ldots, B^R$,

$$(Hv)_t(W_t, R) = \mathbb{E}\bigl[ F^*_{t+1}(v_{t+1}(W_{t+1}), W_{t+1}, R + \hat{R}_{t+1}(W_{t+1})) - F^*_{t+1}(v_{t+1}(W_{t+1}), W_{t+1}, R - 1 + \hat{R}_{t+1}(W_{t+1})) \mid W_t \bigr]. \qquad (10)$$

It is well known that the optimal value function in a dynamic program is unique. This is a trivial result for the finite horizon problems that we consider in this paper. Therefore, the set of slopes $v^*$ corresponding to the optimal value functions $V^*_t(W_t, R)$ for $t = 0, \ldots, T$, $W_t \in \mathcal{W}_t$ and $R = 1, \ldots, B^R$ is unique. Moreover, the dynamic programming operator $H$ defined by (10) is assumed to satisfy the following conditions. Let $v = \{v_t(W_t)\}$ and $\tilde{v} = \{\tilde{v}_t(W_t)\}$, for $t = 0, \ldots, T$ and $W_t \in \mathcal{W}_t$, be sets of slope vectors such that $v_t(W_t) = (v_t(W_t, 1), \ldots, v_t(W_t, B^R))$ and $\tilde{v}_t(W_t) = (\tilde{v}_t(W_t, 1), \ldots, \tilde{v}_t(W_t, B^R))$ are monotone decreasing and $v_t(W_t) \leq \tilde{v}_t(W_t)$. Then, for $W_t \in \mathcal{W}_t$ and $R = 1, \ldots, B^R$:

$$(Hv)_t(W_t) \text{ is monotone decreasing,} \qquad (11)$$
$$(Hv)_t(W_t, R) \leq (H\tilde{v})_t(W_t, R), \qquad (12)$$
$$(Hv)_t(W_t, R) - \eta e \leq (H(v - \eta e))_t(W_t, R) \leq (H(v + \eta e))_t(W_t, R) \leq (Hv)_t(W_t, R) + \eta e, \qquad (13)$$

where $\eta$ is a positive integer and $e$ is a vector of ones. The conditions in equations (11) and (13) imply that the mapping $H$ is continuous (see the discussion in Bertsekas & Tsitsiklis (1996), page 158).

The dynamic programming operator $H$ and the associated conditions (11)-(13) are used later on to construct deterministic sequences that are provably convergent to the optimal slopes.
4 The SPAR-Storage Algorithm
We propose a pure exploitation algorithm, namely the SPAR-Storage Algorithm, that provably
learns the optimal decisions to be taken at parts of the state space that can be reached by an
optimal policy, which are determined by the algorithm itself. This is accomplished by learning the slopes of the optimal value functions at important parts of the state space, through the construction of value function approximations $\bar{V}^n_t(W_t, \cdot)$ that are concave and piecewise linear with breakpoints $R = 1, \ldots, B^R$. The approximation is represented by its slopes $\bar{v}^n_t(W_t) = (\bar{v}^n_t(W_t, 1), \ldots, \bar{v}^n_t(W_t, B^R))$. Figure 1 illustrates the idea. In order to do so, the algorithm combines Monte Carlo simulation in a pure exploitation scheme and stochastic approximation integrated with a projection operation.

Figure 1: Optimal value function $V^*_t(W_t, \cdot)$ (unknown) and the constructed piecewise linear approximation $\bar{V}^n_t(W_t, \cdot)$

Figure 2 describes the SPAR-Storage algorithm. The algorithm requires an initial concave piecewise linear value function approximation $\bar{V}^0(W_t, \cdot)$, represented by its slopes $\bar{v}^0_t(W_t) = (\bar{v}^0_t(W_t, 1), \ldots, \bar{v}^0_t(W_t, B^R))$, for each information vector $W_t \in \mathcal{W}_t$. The initial slope vector $\bar{v}^0_t(W_t)$ therefore has to be monotone decreasing; for example, it is valid to set all the initial slopes equal to zero. For completeness, since we know that the optimal value function at the end of the horizon is equal to zero, we set $\bar{v}^n_T(W_T, R) = 0$ for all iterations $n$, information vectors $W_T \in \mathcal{W}_T$ and asset levels $R = 1, \ldots, B^R$. The algorithm also requires an initial asset level to be used in all iterations. Thus, for all $n \geq 0$, $R^{x,n}_{-1}$ is set to be a nonnegative integer, as in STEP 0b.
At the beginning of each iteration $n$, the algorithm observes a sample realization of the information sequence $W^n_0, \ldots, W^n_T$, as in STEP 1. The sample can be obtained from a sample generator or actual data. After that, the algorithm goes over time periods $t = 0, \ldots, T$.
STEP 0: Algorithm initialization:
  STEP 0a: Initialize $\bar{v}^0_t(W_t)$, monotone decreasing, for $t = 1, \ldots, T - 1$ and $W_t \in \mathcal{W}_t$.
  STEP 0b: Set $R^{x,n}_{-1} = r$, a nonnegative integer, for all $n \geq 0$.
  STEP 0c: Set $n = 1$.
STEP 1: Sample/observe the information sequence $W^n_0, \ldots, W^n_T$.
Do for $t = 0, \ldots, T$:
  STEP 2: Compute the pre-decision asset level: $R^n_t = R^{x,n}_{t-1} + \hat{R}_t(W^n_t)$.
  STEP 3: Find the optimal solution $x^n_t$ of
    $\max_{x_t \in \mathcal{X}_t(W^n_t, R^n_t)} C_t(W^n_t, R^n_t, x_t) + \gamma \bar{V}^{n-1}_t(W^n_t, f^x(R^n_t, x_t))$.
  STEP 4: Compute the post-decision asset level: $R^{x,n}_t = f^x(R^n_t, x^n_t)$.
  STEP 5: Update slopes. If $t < T$ then:
    STEP 5a: Observe $\hat{v}^n_{t+1}(R^{x,n}_t)$ and $\hat{v}^n_{t+1}(R^{x,n}_t + 1)$. See (14).
    STEP 5b: For $W_t \in \mathcal{W}_t$ and $R = 1, \ldots, B^R$:
      $z^n_t(W_t, R) = (1 - \alpha^n_t(W_t, R))\, \bar{v}^{n-1}_t(W_t, R) + \alpha^n_t(W_t, R)\, \hat{v}^n_{t+1}(R)$.
    STEP 5c: Perform the projection operation $\bar{v}^n_t = \Pi_C(z^n_t)$. See (18).
STEP 6: Increase $n$ by one and go to STEP 1.

Figure 2: SPAR-Storage Algorithm
First, the pre-decision asset level $R^n_t$ is computed, as in STEP 2. Then the decision $x^n_t$, which is optimal with respect to the current pre-decision state $(W^n_t, R^n_t)$ and value function approximation $\bar{V}^{n-1}_t(W^n_t, \cdot)$, is taken, as stated in STEP 3. We have that

$$\bar{V}^n_t(W_t, R + y) = \sum_{r=1}^{R} \bar{v}^n_t(W_t, r) + y\, \bar{v}^n_t(W_t, R + 1),$$

where $R$ is a nonnegative integer and $0 \leq y \leq 1$. Next, taking into account the decision, the algorithm computes the post-decision asset level $R^{x,n}_t$, as in STEP 4.
Time period $t$ is concluded by updating the slopes of the value function approximation. Steps 5a-5c describe the procedure and figure 3 illustrates it. Sample slopes relative to the post-decision states $(W^n_t, R^{x,n}_t)$ and $(W^n_t, R^{x,n}_t + 1)$ are observed, see STEP 5a and figure 3a. After that, these samples are used to update the approximation slopes $\bar{v}^{n-1}_t(W^n_t)$, through a temporary slope vector $z^n_t(W^n_t)$. This procedure requires the use of a state-dependent stepsize rule, denoted by $\alpha^n_t(W_t, R)$, and it may lead to a violation of the property that the slopes are monotonically decreasing, see STEP 5b and figure 3b. Thus, a projection operation $\Pi_C$ is performed to restore the property, and updated slopes $\bar{v}^n_t(W^n_t)$ are obtained, see STEP 5c and figure 3c.
After the end of the planning horizon T is reached, the iteration counter is incremented,
as in STEP 6, and a new iteration is started from STEP 1.
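Putting the steps together, the following is a compact Python sketch of the SPAR-Storage loop on a toy instance (a single information state, hypothetical prices, storage capped at $B^R$, and a simplified one-sweep stand-in for the projection $\Pi_C$; it illustrates the mechanics of STEPs 1-6, not the paper's full model):

```python
import random

B_R, T, GAMMA = 6, 3, 1.0
PRICE = [4.0, 9.0, 6.0, 5.0]      # hypothetical sale prices, one per period

def stage_value(t, R, slopes):
    """Optimal value of the stage-t problem at storage level R: sell
    s <= R units at PRICE[t], then collect the approximate value of the
    post-decision level R - s encoded by the slopes (zero beyond B^R)."""
    return max(PRICE[t] * s + GAMMA * sum(slopes[: max(R - s, 0)])
               for s in range(R + 1))

def restore_concavity(v):
    """Simplified stand-in for the projection Pi_C of STEP 5c: one sweep
    forcing the slope vector to be monotone decreasing."""
    out = list(v)
    for r in range(1, len(out)):
        out[r] = min(out[r], out[r - 1])
    return out

def spar_storage(n_iters=300, seed=1):
    rng = random.Random(seed)
    # v[t][r-1]: slope approximation of the time-t post-decision value
    v = [[0.0] * B_R for _ in range(T + 1)]        # v[T] stays zero
    counts = [[0] * B_R for _ in range(T + 1)]
    for _ in range(n_iters):
        Rhat = [rng.randint(0, 2) for _ in range(T + 1)]  # STEP 1 (inflows)
        Rx = 2                                     # STEP 0b: initial level
        for t in range(T):
            R = min(Rx + Rhat[t], B_R)             # STEP 2 (capacity B^R)
            s = max(range(R + 1),                  # STEP 3: greedy decision
                    key=lambda s: PRICE[t] * s + GAMMA * sum(v[t][: R - s]))
            Rx = R - s                             # STEP 4
            for Rq in (Rx, Rx + 1):                # STEP 5a: sample slopes (14)
                if not 1 <= Rq <= B_R:
                    continue
                vhat = (stage_value(t + 1, Rq + Rhat[t + 1], v[t + 1])
                        - stage_value(t + 1, Rq - 1 + Rhat[t + 1], v[t + 1]))
                counts[t][Rq - 1] += 1
                a = 1.0 / counts[t][Rq - 1]        # stepsize 1/(visit count)
                v[t][Rq - 1] = (1 - a) * v[t][Rq - 1] + a * vhat  # STEP 5b
            v[t] = restore_concavity(v[t])         # STEP 5c (simplified)
    return v

slopes = spar_storage()
```

Note the pure exploitation structure: only the slopes at and next to the visited post-decision level are ever updated, which is exactly the behavior the convergence proof must handle.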
We have that $x^n_t$ is easily computed, since it is the first component of the optimal solution to the linear programming problem in the definition of $F^*_t(\bar{v}^{n-1}_t(W^n_t), W^n_t, R^n_t)$. Moreover, given our assumptions and the properties of $F^*_t$ discussed in the previous section, it is clear that $R^n_t$, $x^n_t$ and $R^{x,n}_t$ are all integer valued. We also know that they are bounded. Therefore, the sequences of decisions, pre- and post-decision states generated by the algorithm, which are denoted by $\{x^n_t\}_{n \geq 0}$, $\{S^n_t\}_{n \geq 0} = \{(W^n_t, R^n_t)\}_{n \geq 0}$ and $\{S^{x,n}_t\}_{n \geq 0} = \{(W^n_t, R^{x,n}_t)\}_{n \geq 0}$, respectively, have at least one accumulation point. Since these are sequences of random variables, their accumulation points, denoted by $x^*_t$, $S^*_t$ and $S^{x,*}_t$, respectively, are also random variables.
The sample slopes used to update the approximation slopes are obtained by replacing the expectation and the slopes $v^*_{t+1}(W_{t+1})$ of the optimal value function in (9) by a sample realization of the information $W^n_{t+1}$ and the current slope approximation $\bar{v}^{n-1}_{t+1}(W^n_{t+1})$, respectively. Thus, for $t = 0, \ldots, T - 1$, the sample slope is given by

$$\hat{v}^n_{t+1}(R) = F^*_{t+1}(\bar{v}^{n-1}_{t+1}(W^n_{t+1}), W^n_{t+1}, R + \hat{R}_{t+1}(W^n_{t+1})) - F^*_{t+1}(\bar{v}^{n-1}_{t+1}(W^n_{t+1}), W^n_{t+1}, R - 1 + \hat{R}_{t+1}(W^n_{t+1})). \qquad (14)$$
The update procedure is then divided into two parts. First, a temporary set of slope vectors $z^n_t = \{z^n_t(W_t, R) : W_t \in \mathcal{W}_t,\; R = 1, \ldots, B^R\}$ is produced, combining the current approximation
3a: Current approximate function, optimal decision and sampled slopes
3b: Temporary approximate function with violation of concavity
3c: Level projection operation: updated approximate function with concavity restored
Figure 3: Update procedure of the approximate slopes
and the sample slopes using the stepsize rule $\alpha^n_t(W_t, R)$. We have that
$$\alpha^n_t(W_t, R) = \alpha^n_t \, \mathbb{1}_{\{W_t = W^n_t\}} \big( \mathbb{1}_{\{R = R^{x,n}_t\}} + \mathbb{1}_{\{R = R^{x,n}_t + 1\}} \big),$$
where $\alpha^n_t$ is a scalar between 0 and 1 that can depend only on information that became available up until iteration $n$ and time $t$. Moreover, on the event that $(W^*_t, R^{x,*}_t)$ is an accumulation point of $\{(W^n_t, R^{x,n}_t)\}_{n\ge 0}$, we make the standard assumptions that
$$\sum_{n=1}^{\infty} \alpha^n_t(W^*_t, R^{x,*}_t) = \infty \quad \text{a.s.}, \qquad (15)$$
$$\sum_{n=1}^{\infty} \big(\alpha^n_t(W^*_t, R^{x,*}_t)\big)^2 \le B^{\alpha} < \infty \quad \text{a.s.}, \qquad (16)$$
where $B^{\alpha}$ is a constant. Clearly, the rule $\alpha^n_t = 1/N^n_V(W^*_t, R^{x,*}_t)$ satisfies all the conditions, where $N^n_V(W^*_t, R^{x,*}_t)$ is the number of visits to state $(W^*_t, R^{x,*}_t)$ up until iteration $n$. Furthermore, for all positive integers $N$,
$$\prod_{n=N}^{\infty} \big(1 - \alpha^n_t(W^*_t, R^{x,*}_t)\big) = 0 \quad \text{a.s.} \qquad (17)$$
The proof of (17) follows directly from the fact that $\log(1 + x) \le x$.
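The behavior required by (15)-(17) can be checked numerically for the visit-count rule. The snippet below (an illustration only, with $\alpha_n = 1/n$) confirms that the sum of stepsizes diverges, the sum of squares stays bounded, and the product $\prod(1-\alpha_n)$ telescopes to zero:

```python
# Numeric illustration (not part of the algorithm) that the visit-count
# stepsize alpha_n = 1/n satisfies conditions (15)-(17):
#   sum alpha_n diverges, sum alpha_n^2 is bounded, prod(1 - alpha_n) -> 0.

N = 100_000
harmonic = sum(1.0 / n for n in range(1, N + 1))       # grows like log N
sum_sq   = sum(1.0 / n**2 for n in range(1, N + 1))    # stays below pi^2 / 6
prod = 1.0
for n in range(2, N + 1):          # start at n = 2 so every factor is nonzero
    prod *= (1.0 - 1.0 / n)        # telescopes to exactly 1/N

print(round(harmonic, 2))          # roughly log(N) + 0.577
print(round(sum_sq, 4))            # bounded, near 1.6449
print(round(prod, 8))              # 1/N, tending to zero as N grows
```

The telescoping product $\prod_{n=2}^{N}(1-1/n) = 1/N$ makes condition (17) transparent for this rule.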
The second part is the projection operation, where the temporary slope vector $z^n_t(W_t)$, which may not be monotone decreasing, is transformed into another slope vector $\bar{v}^n_t(W_t)$ that has this structural property. The projection operator imposes the desired property by simply forcing the violating slopes to be equal to the newly updated ones. For $W_t \in \mathcal{W}_t$ and $R = 1, \ldots, B^R$, the projection is given by
$$\Pi_C(z^n_t)(W_t, R) = \begin{cases} z^n_t(W^n_t, R^{x,n}_t), & \text{if } W_t = W^n_t,\ R < R^{x,n}_t,\ z^n_t(W_t, R) \le z^n_t(W^n_t, R^{x,n}_t), \\ z^n_t(W^n_t, R^{x,n}_t + 1), & \text{if } W_t = W^n_t,\ R > R^{x,n}_t + 1,\ z^n_t(W_t, R) \ge z^n_t(W^n_t, R^{x,n}_t + 1), \\ z^n_t(W_t, R), & \text{otherwise.} \end{cases} \qquad (18)$$
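A minimal sketch of the level projection (18), assuming 0-based resource indices and a single information vector (so the $W_t = W^n_t$ condition is implicit); this is an illustration, not the paper's implementation:

```python
def project_concave(z, R_star):
    """Sketch of the level projection (18): restore monotone decreasing
    slopes after the entries R_star and R_star + 1 have been updated.

    `z` is a list of temporary slopes z[R] for R = 0, ..., BR - 1 (0-based).
    Slopes left of R_star that fell to or below z[R_star] are raised to it;
    slopes right of R_star + 1 that rose to or above z[R_star + 1] are
    lowered to it. All other slopes are left untouched.
    """
    v = list(z)
    for R in range(R_star):                 # R < R_star
        if v[R] <= z[R_star]:
            v[R] = z[R_star]
    for R in range(R_star + 2, len(z)):     # R > R_star + 1
        if v[R] >= z[R_star + 1]:
            v[R] = z[R_star + 1]
    return v

# Example: an update at R_star = 1 broke monotonicity at R = 0 and R = 3.
z = [6, 8, 4, 5]
print(project_concave(z, 1))   # [8, 8, 4, 4], monotone decreasing again
```

Note that the projection only moves slopes toward the two freshly updated values, which is why the updated vector stays inside the compact set discussed below.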
For $t = 0, \ldots, T$, information vector $W_t \in \mathcal{W}_t$ and asset level $R = 1, \ldots, B^R$, the sequence of slopes of the value function approximation generated by the algorithm is denoted by $\{\bar{v}^n_t(W_t, R)\}_{n\ge 0}$. Moreover, as the function $F^*_t(\bar{v}^{n-1}_t(W^n_t), W^n_t, R^n_t)$ is bounded and the stepsizes $\alpha^n_t$ are between 0 and 1, we can easily see that the sample slopes $\hat{v}^n_t(R)$, the temporary slopes $z^n_t(W_t, R)$ and, consequently, the approximate slopes $\bar{v}^n_t(W_t, R)$ are all bounded. Therefore, the slope sequence $\{\bar{v}^n_t(W_t, R)\}_{n\ge 0}$ has at least one accumulation point, as the projection operation guarantees that the updated vectors of slopes are elements of a compact set. The accumulation points are random variables and are denoted by $\bar{v}^*_t(W_t, R)$, as opposed to the deterministic optimal slopes $v^*_t(W_t, R)$.
The ability of the SPAR-Storage algorithm to avoid visiting all possible values of $R_t$ was significant in our energy application. In our energy model in Powell et al. (2009), $R_t$ was the energy stored in a hydroelectric reservoir which served as a source of storage for the entire grid. The algorithm required that we discretize $R_t$ into approximately 10,000 elements. However, exploiting concavity means that we only have to visit a small number of these elements a large number of times for each value of $W_t$.
An important practical issue is that we effectively have to visit the entire support of $W_t$ infinitely often. If $W_t$ contains a vector of as few as five or ten dimensions, this can be computationally expensive. In a practical application, we would use statistical methods to produce more robust estimates using smaller numbers of observations. For example, we could use the hierarchical aggregation strategy suggested by George et al. (2008) to produce estimates of $V(W,R)$ at different levels of aggregation of $W$. These estimates can then be combined using a simple weighting formula. It is beyond the scope of this paper to analyze the convergence properties of such a strategy, but it is important to realize that these strategies are available to overcome the discretization of $W_t$.
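A hedged sketch of the kind of weighting formula involved; the inverse-mean-squared-error weights below are a simplification for illustration and should not be read as the exact scheme of George et al. (2008):

```python
# Hedged sketch of combining value estimates across aggregation levels of W,
# in the spirit of George et al. (2008). The inverse-MSE weights are a
# simplification for illustration only, not the exact published scheme.

def combine_estimates(estimates, mse):
    """Weight each aggregation level's estimate by 1/MSE and normalize."""
    weights = [1.0 / m for m in mse]
    total = sum(weights)
    return sum(w * e for w, e in zip(weights, estimates)) / total

# Three hypothetical levels: disaggregate (noisy) through fully aggregate
# (stable but biased). Numbers are made up for the example.
estimates = [12.0, 10.0, 9.0]   # estimates of V(W, R) at each level
mse       = [4.0, 2.0, 1.0]     # estimated mean squared errors per level
print(round(combine_estimates(estimates, mse), 3))
```

The design idea is that aggregate levels dominate early, when disaggregate estimates are based on few observations, and fade as the disaggregate MSE shrinks.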
5 Convergence Analysis
We start this section by presenting the convergence results we want to prove. The major result is the almost sure convergence of the approximation slopes corresponding to states that are visited infinitely often. On the event that $(W^*_t, R^{x,*}_t)$ is an accumulation point of $\{(W^n_t, R^{x,n}_t)\}_{n\ge 0}$, we obtain
$$\bar{v}^n_t(W^*_t, R^{x,*}_t) \to v^*_t(W^*_t, R^{x,*}_t) \quad \text{and} \quad \bar{v}^n_t(W^*_t, R^{x,*}_t + 1) \to v^*_t(W^*_t, R^{x,*}_t + 1) \quad \text{a.s.}$$
As a byproduct of the previous result, we show that, for $t = 0, \ldots, T$, on the event that $(W^*_t, R^*_t, x^*_t)$ is an accumulation point of $\{(W^n_t, R^n_t, x^n_t)\}_{n\ge 0}$,
$$x^*_t = \arg\max_{x_t \in \mathcal{X}_t(W^*_t, R^*_t)} C_t(W^*_t, R^*_t, x_t) + \gamma V^*_t\big(W^*_t, f^x(R^*_t, x_t)\big) \quad \text{a.s.} \qquad (19)$$
where $V^*_t(W^*_t, \cdot)$ is the translated optimal value function.
Equation (19) implies that the algorithm has learned almost surely an optimal decision for all states that can be reached by an optimal policy. This implication can be easily justified as follows. Pick $\omega$ in the sample space; we omit the dependence of the random variables on $\omega$ for the sake of clarity. For $t = 0$, since $R^{x,n}_{-1} = r$, a given constant, for all iterations of the algorithm, we have that $R^{x,*}_{-1} = r$. Moreover, all the elements in $\mathcal{W}_0$ are accumulation points of $\{W^n_0\}_{n\ge 0}$, as $\mathcal{W}_0$ has finite support. Thus, (19) tells us that the accumulation points $x^*_0$ of the sequence $\{x^n_0\}_{n\ge 0}$ along the iterations with pre-decision state $(W^*_0, R^{x,*}_{-1} + \hat{R}_0(W^*_0))$ are in fact an optimal policy for period 0 when the information is $W^*_0$. This implies that all accumulation points $R^{x,*}_0 = f^x(R^{x,*}_{-1} + \hat{R}_0(W^*_0), x_0)$ of $\{R^{x,n}_0\}_{n\ge 0}$ are post-decision resource levels that can be reached by an optimal policy. By the same token, for $t = 1$, every element in $\mathcal{W}_1$ is an accumulation point of $\{W^n_1\}_{n\ge 0}$. Hence, (19) tells us that the accumulation points $x^*_1$ of the sequence $\{x^n_1\}_{n\ge 0}$ along iterations with $(W^n_1, R^n_1) = (W^*_1, R^{x,*}_0 + \hat{R}_1(W^*_1))$ are indeed an optimal policy for period 1 when the information is $W^*_1$ and the pre-decision resource level is $R^*_1 = R^{x,*}_0 + \hat{R}_1(W^*_1)$. As before, the accumulation points $R^{x,*}_1 = f^x(R^*_1, x_1)$ of $\{R^{x,n}_1\}_{n\ge 0}$ are post-decision resource levels that can be reached by an optimal policy. The same reasoning can be applied for $t = 2, \ldots, T$.
5.1 Outline of the Convergence Proofs
Our proofs build on the ideas presented in Bertsekas & Tsitsiklis (1996) and in Powell et
al. (2004). The first proves convergence assuming that all states are visited infinitely often.
The authors do not consider a concavity-preserving step, which is the key element that has
allowed us to obtain a convergence proof when a pure exploitation scheme is considered.
Although the framework in Powell et al. (2004) also considers the concavity of the optimal
value functions in the resource dimension, the use of a projection operation to restore concavity
and a pure exploitation routine, their proof is restricted to two-stage problems. The difference
is significant. In Powell et al. (2004), it was possible to assume that Monte Carlo estimates
of the true value function were unbiased, a critical assumption in that paper. In our paper,
estimates of the marginal value of additional resource at time $t$ depend on a value function approximation at $t + 1$. Since this is an approximation, the estimates of marginal values at time $t$ are biased.
The main idea for achieving convergence of the approximation slopes to the optimal ones is to construct deterministic sequences of slopes, namely $\{L^k_t(W_t, R)\}_{k\ge 0}$ and $\{U^k_t(W_t, R)\}_{k\ge 0}$, that are provably convergent to the slopes of the optimal value functions. These sequences are based on the dynamic programming operator $H$, as introduced in (10). We then use these sequences to prove almost surely that, for all $k \ge 0$,
$$L^k_t(W^*_t, R^{x,*}_t) \le \bar{v}^n_t(W^*_t, R^{x,*}_t) \le U^k_t(W^*_t, R^{x,*}_t), \qquad (20)$$
$$L^k_t(W^*_t, R^{x,*}_t + 1) \le \bar{v}^n_t(W^*_t, R^{x,*}_t + 1) \le U^k_t(W^*_t, R^{x,*}_t + 1), \qquad (21)$$
on the event that the iteration $n$ is sufficiently large and $(W^*_t, R^{x,*}_t)$ is an accumulation point of $\{(W^n_t, R^{x,n}_t)\}_{n\ge 0}$, which implies the convergence of the approximation slopes to the optimal ones.
Establishing (20) and (21) requires several intermediate steps that need to take into consideration the pure exploitation nature of our algorithm and the concavity-preserving operation. We give all the details in the proof of $L^k_t(W^*_t, R^{x,*}_t) \le \bar{v}^n_t(W^*_t, R^{x,*}_t)$ and $L^k_t(W^*_t, R^{x,*}_t + 1) \le \bar{v}^n_t(W^*_t, R^{x,*}_t + 1)$. The upper bound inequalities are obtained using a symmetrical argument.
First, we define two auxiliary stochastic sequences of slopes, namely the noise and the bounding sequences, denoted by $\{s^n_t(W_t, R)\}_{n\ge 0}$ and $\{l^n_t(W_t, R)\}_{n\ge 0}$, respectively. The first sequence represents the noise introduced by the observation of the sample slopes, which replaces the observation of true expectations and the optimal slopes. The second one is a convex combination of the deterministic sequence $L^k_t(W_t, R)$ and the transformed sequence $(HL^k)_t(W_t, R)$.

We then define the set $\mathcal{S}_t$ to contain the states $(W^*_t, R^{x,*}_t)$ and $(W^*_t, R^{x,*}_t + 1)$, such that $(W^*_t, R^{x,*}_t)$ is an accumulation point of $\{(W^n_t, R^{x,n}_t)\}_{n\ge 0}$ and the projection operation decreased the corresponding unprojected slopes only finitely often. This is not the set of all accumulation points, since there are some points where the slope may have been decreased infinitely often.
The stochastic sequences $s^n_t(W_t, R)$ and $l^n_t(W_t, R)$ are used to show that, on the event that the iteration $n$ is big enough and $(W^*_t, R^{x,*}_t)$ is an element of the random set $\mathcal{S}_t$,
$$\bar{v}^{n-1}_t(W^*_t, R^{x,*}_t) \ge l^{n-1}_t(W^*_t, R^{x,*}_t) - s^{n-1}_t(W^*_t, R^{x,*}_t) \quad \text{a.s.}$$
Then, on $\{(W^*_t, R^{x,*}_t) \in \mathcal{S}_t\}$, convergence to zero of the noise sequence, the convex combination property of the bounding sequence and the monotone decreasing property of the approximate slopes give us
$$L^k_t(W^*_t, R^{x,*}_t) \le \bar{v}^{n-1}_t(W^*_t, R^{x,*}_t) \quad \text{a.s.}$$
Note that this inequality does not cover all the accumulation points of $\{(W^n_t, R^{x,n}_t)\}_{n\ge 0}$, since it is restricted to states in the set $\mathcal{S}_t$. Nevertheless, this inequality and some properties of the projection operation are used to fulfill the requirements of a bounding technical lemma, which is used repeatedly to obtain the desired lower bound inequalities for all accumulation points.
In order to prove (19), we note that
$$F_t(\bar{v}^{n-1}_t(W^n_t), W^n_t, R^n_t, x_t) = C_t(W^n_t, R^n_t, x_t) + \gamma \bar{V}^{n-1}_t(W^n_t, f^x(R^n_t, x_t))$$
is a concave function of $x_t$ and $\mathcal{X}_t(W^n_t, R^n_t)$ is a convex set. Therefore,
$$0 \in \partial F_t(\bar{v}^{n-1}_t(W^n_t), W^n_t, R^n_t, x^n_t) - \mathcal{X}^N_t(W^n_t, R^n_t, x^n_t),$$
where $x^n_t$ is the optimal decision of the optimization problem in STEP 3a of the algorithm, $\partial F_t(\bar{v}^{n-1}_t(W^n_t), W^n_t, R^n_t, x^n_t)$ is the subdifferential of $F_t(\bar{v}^{n-1}_t(W^n_t), W^n_t, R^n_t, \cdot)$ at $x^n_t$ and $\mathcal{X}^N_t(W^n_t, R^n_t, x^n_t)$ is the normal cone of $\mathcal{X}_t(W^n_t, R^n_t)$ at $x^n_t$. This inclusion and the first convergence result are then combined to prove, as shown in section 5.4, that
$$0 \in \partial F_t(v^*_t(W^*_t), W^*_t, R^*_t, x^*_t) - \mathcal{X}^N_t(W^*_t, R^*_t, x^*_t).$$
5.2 Technical Elements
In this section, we set the stage for the convergence proofs by defining some technical elements. We start with the definition of the deterministic sequence $\{L^k_t(W_t, R)\}_{k\ge 0}$. For this, we let $B^v$ be a deterministic integer that bounds $|\hat{v}^n_t(R)|$, $|z^n_t(W_t, R)|$, $|\bar{v}^n_t(W_t, R)|$ and $|v^*_t(W_t, R)|$ for all $W_t \in \mathcal{W}_t$ and $R = 1, \ldots, B^R$. Then, for $t = 0, \ldots, T - 1$, $W_t \in \mathcal{W}_t$ and $R = 1, \ldots, B^R$, we have that
$$L^0_t(W_t, R) = v^*_t(W_t, R) - 2B^v \quad \text{and} \quad L^{k+1}_t(W_t, R) = \frac{L^k_t(W_t, R) + (HL^k)_t(W_t, R)}{2}. \qquad (22)$$
At the end of the planning horizon $T$, $L^k_T(W_T, R) = 0$ for all $k \ge 0$. The proposition below introduces the required properties of the deterministic sequence $\{L^k_t(W_t, R)\}_{k\ge 0}$. Its proof is deferred to the appendix.

Proposition 2. For $t = 0, \ldots, T - 1$, information vector $W_t \in \mathcal{W}_t$ and resource levels $R = 1, \ldots, B^R$,
$$L^k_t(W_t, R) \text{ is monotone decreasing in } R, \qquad (23)$$
$$(HL^k)_t(W_t, R) \ge L^{k+1}_t(W_t, R) \ge L^k_t(W_t, R), \qquad (24)$$
$$L^k_t(W_t, R) < v^*_t(W_t, R), \quad \text{and} \quad \lim_{k\to\infty} L^k_t(W_t, R) = v^*_t(W_t, R). \qquad (25)$$
The deterministic sequence $\{U^k_t(W_t, R)\}_{k\ge 0}$ is defined in a symmetrical way. It also has the properties stated in proposition 2, with the inequality signs reversed.
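To see the mechanics of recursion (22), the toy sketch below replaces the dynamic programming operator $H$ with a stand-in contraction whose fixed point is $v^*$ (an assumption made purely for illustration) and checks that $L^k$ climbs from $v^* - 2B^v$ up to $v^*$, as proposition 2 asserts:

```python
# Toy illustration of recursion (22): L^{k+1} = (L^k + (H L^k)) / 2.
# H below is a stand-in contraction with fixed point v_star; the paper's H
# is the dynamic programming operator, which we do not reproduce here.

v_star = 5.0
H = lambda L: 0.5 * (L + v_star)   # toy operator: H L >= L whenever L <= v*

B_v = 10.0
L = v_star - 2 * B_v               # L^0 = v* - 2 B^v, as in (22)
for k in range(80):
    L_next = 0.5 * (L + H(L))      # averaging step of (22)
    assert H(L) >= L_next >= L     # property (24) along the way
    L = L_next

print(abs(L - v_star) < 1e-6)      # True: L^k increases monotonically to v*
```

Each iteration shrinks the gap to $v^*$ by a constant factor for this toy $H$, which mirrors the monotone convergence in (25).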
We move on to define the random index $N$ that is used to indicate when an iteration of the algorithm is large enough for convergence analysis purposes. Let $N$ be the smallest integer such that all states (actions) visited (taken) by the algorithm after iteration $N$ are accumulation points of the sequence of states (actions) generated by the algorithm. In fact, $N$ can be required to satisfy other constraints of the type: if an event did not happen infinitely often, then it did not happen after $N$. Since we need $N$ to be finite almost surely, the number of additional constraints has to be finite.
We introduce the set of iterations, namely $\mathcal{N}_t(W_t, R)$, that keeps track of the effects produced by the projection operation. For $W_t \in \mathcal{W}_t$ and $R = 1, \ldots, B^R$, let $\mathcal{N}_t(W_t, R)$ be the set of iterations in which the unprojected slope corresponding to state $(W_t, R)$, that is, $z^n_t(W_t, R)$, was too large and had to be decreased by the projection operation. Formally,
$$\mathcal{N}_t(W_t, R) = \{n \in \mathbb{N} : z^n_t(W_t, R) > \bar{v}^n_t(W_t, R)\}.$$
For example, based on figure 3c, $n \in \mathcal{N}_t(W^n_t, R^n_t + 2)$. A related set is the set of states $\mathcal{S}_t$. A state $(W_t, R)$ is an element of $\mathcal{S}_t$ if $(W_t, R)$ is equal to an accumulation point $(W^*_t, R^{x,*}_t)$ of $\{(W^n_t, R^{x,n}_t)\}_{n\ge 0}$ or is equal to $(W^*_t, R^{x,*}_t + 1)$. Its corresponding approximate slope also has to satisfy the condition $z^n_t(W_t, R) \le \bar{v}^n_t(W_t, R)$ for all $n \ge N$, that is, the projection operation decreased the corresponding unprojected slope only finitely often.
We close this section dealing with measurability issues. Let $(\Omega, \mathcal{F}, \text{Prob})$ be the probability space under consideration. The sigma-algebra $\mathcal{F}$ is defined by $\mathcal{F} = \sigma\{(W^n_t, x^n_t),\ n \ge 1,\ t = 0, \ldots, T\}$. Moreover, for $n \ge 1$ and $t = 0, \ldots, T$,
$$\mathcal{F}^n_t = \sigma\Big( \{(W^m_{t'}, x^m_{t'}),\ 0 < m < n,\ t' = 0, \ldots, T\} \cup \{(W^n_{t'}, x^n_{t'}),\ t' = 0, \ldots, t\} \Big).$$
Clearly, $\mathcal{F}^n_t \subset \mathcal{F}^n_{t+1}$ and $\mathcal{F}^n_T \subset \mathcal{F}^{n+1}_0$. Furthermore, given the initial slopes $\bar{v}^0_t(W_t)$ and the initial resource level $r$, we have that $R^n_t$, $R^{x,n}_t$ and $\alpha^n_t$ are in $\mathcal{F}^n_t$, while $\hat{v}^n_{t+1}(R^x_t)$, $z^n_t(W_t)$ and $\bar{v}^n_t(W_t)$ are in $\mathcal{F}^n_{t+1}$. A pointwise argument is used in all the proofs of almost sure convergence presented in this paper. Thus, zero-measure events are discarded on an as-needed basis.
5.3 Almost sure convergence of the slopes
We prove that the approximation slopes produced by the SPAR-Storage algorithm converge
almost surely to the slopes of the optimal value functions of the storage class for states that
can be reached by an optimal policy. This result is stated in theorem 1 below. Along with
the proof of the theorem, we present the noise and the bounding stochastic sequences and
introduce three technical lemmas. Their proofs are given in the appendix so that the main
reasoning is not disrupted.
Before we present the theorem establishing the convergence of the value function, we introduce the following technical condition. Given $k \ge 0$ and $t = 0, \ldots, T - 1$, we assume there exists a positive random variable $N^k_t$ such that on $\{n \ge N^k_t\}$,
$$(HL^k)_t(W^n_t, R^n_t) \le (H\bar{v}^{n-1})_t(W^n_t, R^n_t) \le (HU^k)_t(W^n_t, R^n_t) \quad \text{a.s.}, \qquad (26)$$
$$(HL^k)_t(W^n_t, R^n_t + 1) \le (H\bar{v}^{n-1})_t(W^n_t, R^n_t + 1) \le (HU^k)_t(W^n_t, R^n_t + 1) \quad \text{a.s.} \qquad (27)$$
Proving this condition is nontrivial. It draws on a proof technique in Bertsekas & Tsitsiklis
(1996), Section 4.3.6, although this technique requires an exploration policy that ensures that
each state is visited infinitely often. Nascimento & Powell (2009) (section 6.1) show how this
proof can be adapted to a similar concave optimization problem that uses pure exploration.
The proof for our problem is essentially the same, and is not duplicated here.
Theorem 1. Assume the stepsize conditions (15)–(16). Also assume (26) and (27). Then, for all $k \ge 0$ and $t = 0, \ldots, T$, on the event that $(W^*_t, R^{x,*}_t)$ is an accumulation point of $\{(W^n_t, R^{x,n}_t)\}_{n\ge 0}$, the sequences of slopes $\{\bar{v}^n_t(W^*_t, R^{x,*}_t)\}_{n\ge 0}$ and $\{\bar{v}^n_t(W^*_t, R^{x,*}_t + 1)\}_{n\ge 0}$ generated by the SPAR-Storage algorithm for the storage class converge almost surely to the optimal slopes $v^*_t(W^*_t, R^{x,*}_t)$ and $v^*_t(W^*_t, R^{x,*}_t + 1)$, respectively.
Proof: As discussed in section 5.1, since the deterministic sequences $\{L^k_t(W_t, R^x_t)\}_{k\ge 0}$ and $\{U^k_t(W_t, R^x_t)\}_{k\ge 0}$ do converge to the optimal slopes, the convergence of the approximation sequences is obtained by showing that for each $k \ge 0$ there exists a nonnegative random variable $N^{*,k}_t$ such that on the event that $n \ge N^{*,k}_t$ and $(W^*_t, R^{x,*}_t)$ is an accumulation point of $\{(W^n_t, R^{x,n}_t)\}_{n\ge 0}$, we have
$$L^k_t(W^*_t, R^{x,*}_t) \le \bar{v}^n_t(W^*_t, R^{x,*}_t) \le U^k_t(W^*_t, R^{x,*}_t) \quad \text{a.s.},$$
$$L^k_t(W^*_t, R^{x,*}_t + 1) \le \bar{v}^n_t(W^*_t, R^{x,*}_t + 1) \le U^k_t(W^*_t, R^{x,*}_t + 1) \quad \text{a.s.}$$
We concentrate on the inequalities $L^k_t(W^*_t, R^{x,*}_t) \le \bar{v}^n_t(W^*_t, R^{x,*}_t)$ and $L^k_t(W^*_t, R^{x,*}_t + 1) \le \bar{v}^n_t(W^*_t, R^{x,*}_t + 1)$. The upper bounds are obtained using a symmetrical argument.
The proof is by backward induction on $t$. The base case $t = T$ is trivial, as $L^k_T(W_T, R) = \bar{v}^n_T(W_T, R) = 0$ for all $W_T \in \mathcal{W}_T$, $R = 1, \ldots, B^R$, $k \ge 0$ and iterations $n \ge 0$. Thus, we can pick, for example, $N^{*,k}_T = N$, where $N$, as defined in section 5.2, is a random variable that denotes when an iteration of the algorithm is large enough for convergence analysis purposes. The backward induction proof is completed when we prove, for $t = T - 1, \ldots, 0$ and $k \ge 0$, that there exists $N^{*,k}_t$ such that on the event that $n \ge N^{*,k}_t$ and $(W^*_t, R^{x,*}_t)$ is an accumulation point of $\{(W^n_t, R^{x,n}_t)\}_{n\ge 0}$,
$$L^k_t(W^*_t, R^{x,*}_t) \le \bar{v}^n_t(W^*_t, R^{x,*}_t) \quad \text{and} \quad L^k_t(W^*_t, R^{x,*}_t + 1) \le \bar{v}^n_t(W^*_t, R^{x,*}_t + 1) \quad \text{a.s.} \qquad (28)$$
Given the induction hypothesis for $t + 1$, the proof for time period $t$ is divided into two parts. In the first part, we prove for all $k \ge 0$ that there exists a nonnegative random variable $N^k_t$ such that
$$L^k_t(W^*_t, R^{x,*}_t) \le \bar{v}^{n-1}_t(W^*_t, R^{x,*}_t) \quad \text{a.s. on } \{n \ge N^k_t\} \cap \{(W^*_t, R^{x,*}_t) \in \mathcal{S}_t\}. \qquad (29)$$
Its proof is by induction on $k$. Note that it only applies to states in the random set $\mathcal{S}_t$. Then, again for $t$, we take on the second part, which takes care of the states not covered by the first
part, proving the existence of a nonnegative random variable $N^{*,k}_t$ such that the lower bound inequalities are true on $\{n \ge N^{*,k}_t\}$ for all accumulation points of $\{(W^n_t, R^{x,n}_t)\}_{n\ge 0}$.

We start the backward induction on $t$. Pick $\omega \in \Omega$. We omit the dependence of the random elements on $\omega$ for compactness. Remember that the base case $t = T$ is trivial and we pick $N^{*,k}_T = N$. We also pick, for convenience, $N^k_T = N$.
Induction Hypothesis: Given $t = T - 1, \ldots, 0$, assume, for $t + 1$ and all $k \ge 0$, the existence of integers $N^k_{t+1}$ and $N^{*,k}_{t+1}$ such that, for all $n \ge N^k_{t+1}$, (29) is true, and, for all $n \ge N^{*,k}_{t+1}$, the inequalities in (28) hold true for all accumulation points $(W^*_t, R^{x,*}_t)$.

Part 1:

For our fixed time period $t$, we prove, for any $k$, the existence of an integer $N^k_t$ such that, for $n \ge N^k_t$, inequality (29) is true. The proof is by forward induction on $k$.

We start with $k = 0$. For every state $(W_t, R)$, we have that $-B^v \le v^*_t(W_t, R) \le B^v$, implying, by definition, that $L^0_t(W_t, R) \le -B^v$. Therefore, (29) is satisfied for all $n \ge 1$, since we know that $\bar{v}^{n-1}_t(W_t, R)$ is bounded by $-B^v$ and $+B^v$ for all iterations. Thus, $N^0_t = \max(1, N^{*,0}_{t+1}) = N^{*,0}_{t+1}$.

The induction hypothesis on $k$ assumes that there exists $N^k_t$ such that, for all $n \ge N^k_t$, (29) is true. Note that we can always make $N^k_t$ larger than $N^{*,k}_{t+1}$, thus we assume that $N^k_t \ge N^{*,k}_{t+1}$. The next step is the proof for $k + 1$.
Before we move on, we depart from our pointwise argument in order to define the stochastic noise sequence and state a lemma describing an important property of this sequence. We start by defining, for $R = 1, \ldots, B^R$, the random variable $s^n_{t+1}(R) = (H\bar{v}^{n-1})_t(W^n_t, R) - \hat{v}^n_{t+1}(R)$, which measures the error incurred by observing a sample slope. Using $s^n_{t+1}(R)$, we define for each $W_t \in \mathcal{W}_t$ the stochastic noise sequence $\{s^n_t(W_t, R)\}_{n\ge 0}$. We have that $s^n_t(W_t, R) = 0$ on $\{n < N^k_t\}$ and, on $\{n \ge N^k_t\}$, $s^n_t(W_t, R)$ is equal to
$$\max\Big(0,\ \big(1 - \alpha^n_t(W_t, R)\big)\, s^{n-1}_t(W_t, R) + \alpha^n_t(W_t, R)\, s^n_{t+1}\big(R^{x,n}_t \mathbb{1}_{\{R \le R^{x,n}_t\}} + (R^{x,n}_t + 1)\, \mathbb{1}_{\{R > R^{x,n}_t\}}\big)\Big).$$
The sample slopes are defined in a way such that
$$\mathbb{E}\big[s^n_{t+1}(R) \mid \mathcal{F}^n_t\big] = 0. \qquad (30)$$
This conditional expectation is called the unbiasedness property. This property, together with the martingale convergence theorem and the boundedness of both the sample slopes and the approximate slopes, is crucial for proving that the noise introduced by the observation of the sample slopes, which replaces the observation of true expectations, goes to zero as the number of iterations of the algorithm goes to infinity, as stated in the next lemma.
Lemma 5.1. On the event that $(W^*_t, R^{x,*}_t)$ is an accumulation point of $\{(W^n_t, R^{x,n}_t)\}_{n\ge 0}$, we have that
$$s^n_t(W^*_t, R^{x,*}_t) \to 0 \quad \text{and} \quad s^n_t(W^*_t, R^{x,*}_t + 1) \to 0 \quad \text{a.s.} \qquad (31)$$
Proof: Given in the appendix.
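The mechanism behind lemma 5.1 can be illustrated by simulation: smoothing zero-mean, bounded noise with stepsizes $\alpha_n = 1/n$ drives a recursion of the same form as the noise sequence toward zero. The snippet below is an illustration of this averaging effect, not the lemma's proof:

```python
# Simulation illustrating the mechanism behind Lemma 5.1: with zero-mean,
# bounded noise and stepsizes alpha_n = 1/n, the smoothed noise recursion
# s^n = max(0, (1 - alpha_n) s^{n-1} + alpha_n * xi_n) is driven toward 0.
# This is an illustration of the averaging effect, not the lemma's proof.
import random

random.seed(0)
s = 0.0
for n in range(1, 200_001):
    alpha = 1.0 / n
    xi = random.uniform(-1.0, 1.0)          # zero-mean, bounded sample noise
    s = max(0.0, (1.0 - alpha) * s + alpha * xi)

print(s < 0.05)   # the tail of the smoothed noise sequence is small
```

The stepsize conditions (15)-(16) are exactly what make this averaging kill the noise: enough total weight to forget early errors, but squares small enough that fresh noise cannot accumulate.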
Returning to our pointwise argument, where we have fixed $\omega \in \Omega$, we use the convention that the minimum of an empty set is $+\infty$. Let
$$\delta^k = \min\left\{ \frac{(HL^k)_t(W^*_t, R^{x,*}_t) - L^k_t(W^*_t, R^{x,*}_t)}{4} : (W^*_t, R^{x,*}_t) \in \mathcal{S}_t,\ (HL^k)_t(W^*_t, R^{x,*}_t) > L^k_t(W^*_t, R^{x,*}_t) \right\}.$$
If $\delta^k < +\infty$, we define an integer $N^L \ge N^k_t$ to be such that
$$\prod_{m=N^k_t}^{N^L - 1} \big(1 - \alpha^m_t(W^*_t, R^{x,*}_t)\big) \le \frac{1}{4} \quad \text{and} \quad s^{n-1}_t(W^*_t, R^{x,*}_t) \le \delta^k, \qquad (32)$$
for all $n \ge N^L$ and states $(W^*_t, R^{x,*}_t) \in \mathcal{S}_t$. Such an $N^L$ exists because both (17) and (31) are true. If $\delta^k = +\infty$ then, for all states $(W^*_t, R^{x,*}_t) \in \mathcal{S}_t$, $(HL^k)_t(W^*_t, R^{x,*}_t) = L^k_t(W^*_t, R^{x,*}_t)$, since (24) tells us that $(HL^k)_t(W^*_t, R^{x,*}_t) \ge L^k_t(W^*_t, R^{x,*}_t)$. Thus, $L^{k+1}_t(W^*_t, R^{x,*}_t) = L^k_t(W^*_t, R^{x,*}_t)$ and we define the integer $N^L$ to be equal to $N^k_t$. We let $N^{k+1}_t = \max(N^L, N^{*,k+1}_{t+1})$ and show that (29) holds for $n \ge N^{k+1}_t$.
We pick a state $(W^*_t, R^{x,*}_t) \in \mathcal{S}_t$. If $L^{k+1}_t(W^*_t, R^{x,*}_t) = L^k_t(W^*_t, R^{x,*}_t)$, then inequality (29) follows from the induction hypothesis. We therefore concentrate on the case where $L^{k+1}_t(W^*_t, R^{x,*}_t) > L^k_t(W^*_t, R^{x,*}_t)$.
First, we depart one more time from the pointwise argument to introduce the stochastic bounding sequence. We also state a lemma combining this sequence with the stochastic noise sequence. For $W_t \in \mathcal{W}_t$ and $R = 1, \ldots, B^R$, we have on $\{n < N^k_t\}$ that $l^n_t(W_t, R) = L^k_t(W_t, R)$ and, on $\{n \ge N^k_t\}$,
$$l^n_t(W_t, R) = \big(1 - \alpha^n_t(W_t, R)\big)\, l^{n-1}_t(W_t, R) + \alpha^n_t(W_t, R)\, (HL^k)_t(W_t, R).$$
The next lemma states that the noise and the stochastic bounding sequences can be used to provide a bound for the approximate slopes as follows.

Lemma 5.2. On $\{n \ge N^k_t\} \cap \{(W^*_t, R^{x,*}_t) \in \mathcal{S}_t\}$,
$$\bar{v}^{n-1}_t(W^*_t, R^{x,*}_t) \ge l^{n-1}_t(W^*_t, R^{x,*}_t) - s^{n-1}_t(W^*_t, R^{x,*}_t) \quad \text{a.s.} \qquad (33)$$
Proof: Given in the appendix.
Back to our fixed $\omega$, a simple inductive argument proves that $l^n_t(W_t, R)$ is a convex combination of $L^k_t(W_t, R)$ and $(HL^k)_t(W_t, R)$. Therefore we can write
$$l^{n-1}_t(W^*_t, R^{x,*}_t) = b^{n-1}_t L^k_t(W^*_t, R^{x,*}_t) + (1 - b^{n-1}_t)(HL^k)_t(W^*_t, R^{x,*}_t),$$
where $b^{n-1}_t = \prod_{m=N^k_t}^{n-1} \big(1 - \alpha^m_t(W^*_t, R^{x,*}_t)\big)$. For $n \ge N^{k+1}_t \ge N^L$, we have $b^{n-1}_t \le 1/4$. Moreover, $L^k_t(W^*_t, R^{x,*}_t) \le (HL^k)_t(W^*_t, R^{x,*}_t)$. Thus, using (22) and the definition of $\delta^k$, we obtain
$$\begin{aligned} l^{n-1}_t(W^*_t, R^{x,*}_t) &\ge \tfrac{1}{4} L^k_t(W^*_t, R^{x,*}_t) + \tfrac{3}{4} (HL^k)_t(W^*_t, R^{x,*}_t) \\ &= \tfrac{1}{2} L^k_t(W^*_t, R^{x,*}_t) + \tfrac{1}{2} (HL^k)_t(W^*_t, R^{x,*}_t) + \tfrac{1}{4}\big((HL^k)_t(W^*_t, R^{x,*}_t) - L^k_t(W^*_t, R^{x,*}_t)\big) \\ &\ge L^{k+1}_t(W^*_t, R^{x,*}_t) + \delta^k. \end{aligned} \qquad (34)$$
Combining (33) and (34), we obtain, for all $n \ge N^{k+1}_t \ge N^L$,
$$\bar{v}^{n-1}_t(W^*_t, R^{x,*}_t) \ge L^{k+1}_t(W^*_t, R^{x,*}_t) + \delta^k - s^{n-1}_t(W^*_t, R^{x,*}_t) \ge L^{k+1}_t(W^*_t, R^{x,*}_t) + \delta^k - \delta^k = L^{k+1}_t(W^*_t, R^{x,*}_t),$$
where the last inequality follows from (32).
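The convex combination property invoked above is easy to verify numerically: iterating $l^n = (1-\alpha_n) l^{n-1} + \alpha_n (HL^k)$ reproduces exactly $b^{n-1} L^k + (1-b^{n-1})(HL^k)$ with $b^{n-1} = \prod_m (1-\alpha_m)$. Toy scalars, illustration only:

```python
# Numeric check of the convex-combination property of the bounding sequence:
# l^n = (1 - a_n) l^{n-1} + a_n * HL  implies  l^n = b_n L + (1 - b_n) HL,
# with b_n = prod(1 - a_m). Toy scalar values for illustration only.

L_val, HL_val = -3.0, 2.0                  # stand-ins for L^k and (H L^k)
alphas = [1.0 / n for n in range(2, 12)]   # any stepsizes in (0, 1)

l = L_val                                  # before the first update, l = L^k
b = 1.0
for a in alphas:
    l = (1.0 - a) * l + a * HL_val         # bounding-sequence recursion
    b *= (1.0 - a)                         # running product b^{n-1}

# The iterate agrees with the closed-form convex combination:
assert abs(l - (b * L_val + (1.0 - b) * HL_val)) < 1e-12
print(round(l, 6), round(b, 6))   # once b <= 1/4, l >= L/4 + 3*HL/4
```

With $\alpha_m = 1/m$ the product $b$ telescopes, so the weight on $L^k$ shrinks quickly, which is exactly what the threshold $b^{n-1}_t \le 1/4$ in (32) exploits.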
Part 2:

We continue to consider the $\omega$ picked in the beginning of the proof of the theorem. In this part, we take care of the states $(W^*_t, R^{x,*}_t)$ that are accumulation points but are not in $\mathcal{S}_t$. In contrast to part 1, the proof technique here is not by forward induction on $k$. We rely entirely on the definition of the projection operation and on the elements defined in section 5.2, as this part of the proof is all about states for which the projection operation decreased the corresponding approximate slopes infinitely often, which might happen when some of the optimal slopes are equal. Of course this fact is not verifiable in advance, as the optimal slopes are unknown.

Remember that at iteration $n$ and time period $t$, we observe the sample slopes $\hat{v}^n_{t+1}(R^{x,n}_t)$ and $\hat{v}^n_{t+1}(R^{x,n}_t + 1)$, and it is always the case that $\hat{v}^n_{t+1}(R^{x,n}_t) \ge \hat{v}^n_{t+1}(R^{x,n}_t + 1)$, implying that the resulting temporary slope $z^n_t(W^n_t, R^{x,n}_t)$ is at least as large as $z^n_t(W^n_t, R^{x,n}_t + 1)$. Therefore, according to our projection operator, the updated slopes $\bar{v}^n_t(W^n_t, R^{x,n}_t)$ and $\bar{v}^n_t(W^n_t, R^{x,n}_t + 1)$ are always equal to $z^n_t(W^n_t, R^{x,n}_t)$ and $z^n_t(W^n_t, R^{x,n}_t + 1)$, respectively. Due to our stepsize rule, as described in section 4, the slopes corresponding to $(W^n_t, R^{x,n}_t)$ and $(W^n_t, R^{x,n}_t + 1)$ are the only ones updated due to a direct observation of sample slopes at iteration $n$, time period $t$. All the other slopes are modified only if a violation of the monotone decreasing property occurs. Therefore, the slopes corresponding to states with information vector $W_t \in \mathcal{W}_t$ different than $W^n_t$, no matter the resource level $R = 1, \ldots, B^R$, remain the same at iteration $n$, time period $t$, that is, $\bar{v}^{n-1}_t(W_t, R) = z^n_t(W_t, R) = \bar{v}^n_t(W_t, R)$. On the other hand, it is always the case that the temporary slopes corresponding to states with information vector $W^n_t$ and resource levels smaller than $R^{x,n}_t$ can only be increased by the projection operation. If necessary, they are increased to be equal to $\bar{v}^n_t(W^n_t, R^{x,n}_t)$. Similarly, the temporary slopes corresponding to states with information vector $W^n_t$ and resource levels greater than $R^{x,n}_t + 1$ can only be decreased by the projection operation. If necessary, they are decreased to be equal to $\bar{v}^n_t(W^n_t, R^{x,n}_t + 1)$; see figure 3c.
Keeping the previous discussion in mind, it is easy to see that, for each $W^*_t \in \mathcal{W}_t$, if $R^{Min}_t$ is the minimum resource level such that $(W^*_t, R^{Min}_t)$ is an accumulation point of $\{(W^n_t, R^{x,n}_t)\}_{n\ge 0}$, then the slope corresponding to $(W^*_t, R^{Min}_t)$ could only be decreased by the projection operation in a finite number of iterations, as a decreasing requirement could only originate from a resource level smaller than $R^{Min}_t$. However, no state with information vector $W^*_t$ and resource level smaller than $R^{Min}_t$ is visited by the algorithm after iteration $N$ (as defined in section 5.2), since only accumulation points are visited after $N$. We thus have that $(W^*_t, R^{Min}_t)$ is an element of the set $\mathcal{S}_t$, showing that $\mathcal{S}_t$ is a proper set.
Hence, for all states $(W^*_t, R^{x,*}_t)$ that are accumulation points of $\{(W^n_t, R^{x,n}_t)\}_{n\ge 0}$ and are not elements of $\mathcal{S}_t$, there exists another state $(W^*_t, \tilde{R}^{x,*}_t)$, where $\tilde{R}^{x,*}_t$ is the maximum resource level smaller than $R^{x,*}_t$ such that $(W^*_t, \tilde{R}^{x,*}_t) \in \mathcal{S}_t$. We argue that for all resource levels $R$ with $\tilde{R}^{x,*}_t < R \le R^{x,*}_t$, we have that $|\mathcal{N}_t(W^*_t, R)| = \infty$. Figure 4 illustrates the situation.
Figure 4: Illustration of technical elements related to the projection operation
As introduced in section 5.2, we have that $\mathcal{N}_t(W^*_t, R) = \{n \in \mathbb{N} : z^n_t(W^*_t, R) > \bar{v}^n_t(W^*_t, R)\}$. By definition, the sets $\mathcal{S}_t$ and $\mathcal{N}_t(W^*_t, R)$ share the following relationship: given that $(W^*_t, R)$ is an accumulation point of $\{(W^n_t, R^{x,n}_t)\}_{n\ge 0}$, then $|\mathcal{N}_t(W^*_t, R)| = \infty$ if and only if the state $(W^*_t, R)$ is not an element of $\mathcal{S}_t$. Therefore, $|\mathcal{N}_t(W^*_t, R^{x,*}_t)| = \infty$, otherwise $(W^*_t, R^{x,*}_t)$ would be an element of $\mathcal{S}_t$. If $\tilde{R}^{x,*}_t = R^{x,*}_t - 1$, we are done. If $\tilde{R}^{x,*}_t < R^{x,*}_t - 1$, we have to consider two cases, namely $(W^*_t, R^{x,*}_t - 1)$ is an accumulation point and $(W^*_t, R^{x,*}_t - 1)$ is not an accumulation point. For the first case, we have that $|\mathcal{N}_t(W^*_t, R^{x,*}_t - 1)| = \infty$ from the fact that this state is not an element of $\mathcal{S}_t$. For the second case, since $(W^*_t, R^{x,*}_t - 1)$ is not an accumulation point, its corresponding slope is never updated due to a direct observation of sample slopes for $n \ge N$, by the definition of $N$. Moreover, every time the slope of $(W^*_t, R^{x,*}_t)$
is decreased due to a projection (which is coming from the left), the slope of $(W^*_t, R^{x,*}_t - 1)$ has to be decreased as well. Therefore, $\mathcal{N}_t(W^*_t, R^{x,*}_t) \cap \{n \ge N\} \subseteq \mathcal{N}_t(W^*_t, R^{x,*}_t - 1) \cap \{n \ge N\}$, implying that $|\mathcal{N}_t(W^*_t, R^{x,*}_t - 1)| = \infty$. We then apply the same reasoning for states $(W^*_t, R^{x,*}_t - 2), \ldots, (W^*_t, \tilde{R}^{x,*}_t + 1)$, obtaining that the corresponding sets of iterations have an infinite number of elements. The same reasoning applies to states $(W^*_t, R^{x,*}_t + 1)$ that are not in $\mathcal{S}_t$.
We state a lemma that is the key element for the proof of Part 2, once again departing from the pointwise argument.

Lemma 5.3. Given an information vector $W_t \in \mathcal{W}_t$ and a resource level $R = 1, \ldots, B^R - 1$, if for all $k \ge 0$ there exists an integer random variable $N^k(W_t, R)$ such that $L^k_t(W_t, R) \le \bar{v}^{n-1}_t(W_t, R)$ almost surely on $\{n \ge N^k(W_t, R)\} \cap \{|\mathcal{N}_t(W_t, R+1)| = \infty\}$, then for all $k \ge 0$ there exists another integer random variable $N^k(W_t, R+1)$ such that $L^k_t(W_t, R+1) \le \bar{v}^{n-1}_t(W_t, R+1)$ almost surely on $\{n \ge N^k(W_t, R+1)\}$.

Proof: Given in the appendix.
Using the properties of the projection operator, we return to the proof of Part 2 and to our fixed $\omega$. Pick $k \ge 0$ and a state $(W^*_t, R^{x,*}_t)$ that is an accumulation point but is not in $\mathcal{S}_t$. The same applies if $(W^*_t, R^{x,*}_t + 1) \notin \mathcal{S}_t$. Consider the state $(W^*_t, \tilde{R}^{x,*}_t)$, where $\tilde{R}^{x,*}_t$ is the maximum resource level smaller than $R^{x,*}_t$ such that $(W^*_t, \tilde{R}^{x,*}_t) \in \mathcal{S}_t$. This state satisfies the condition of lemma 5.3 with $N^k(W^*_t, \tilde{R}^{x,*}_t) = N^k_t$ (from part 1 of the proof). Thus, we can apply this lemma in order to obtain, for all $k \ge 0$, an integer $N^k(W^*_t, \tilde{R}^{x,*}_t + 1)$ such that $L^k_t(W^*_t, \tilde{R}^{x,*}_t + 1) \le \bar{v}^{n-1}_t(W^*_t, \tilde{R}^{x,*}_t + 1)$ for all $n \ge N^k(W^*_t, \tilde{R}^{x,*}_t + 1)$.
After that, we use lemma 5.3 again, this time considering the state $(W^*_t, \tilde{R}^{x,*}_t + 1)$. Note that the first application of lemma 5.3 gave us the integer $N^k(W^*_t, \tilde{R}^{x,*}_t + 1)$, necessary to fulfill the conditions of this second usage of the lemma. We repeat the same reasoning, applying lemma 5.3 successively to the states $(W^*_t, \tilde{R}^{x,*}_t + 2), \ldots, (W^*_t, R^{x,*}_t - 1)$. In the end, we obtain, for each $k \ge 0$, an integer $N^k(W^*_t, R^{x,*}_t)$ such that $L^k_t(W^*_t, R^{x,*}_t) \le \bar{v}^{n-1}_t(W^*_t, R^{x,*}_t)$ for all $n \ge N^k(W^*_t, R^{x,*}_t)$. Figure 5 illustrates this process.
Finally, if we pick $N^{*,k}_t$ to be greater than the $N^k_t$ of part 1 and greater than $N^k(W^*_t, R^{x,*}_t)$
Figure 5: Successive applications of lemma 5.3
and $N^k(W^*_t, R^{x,*}_t + 1)$ for all accumulation points $(W^*_t, R^{x,*}_t)$ that are not in $\mathcal{S}_t$, then (28) is true for all accumulation points and $n \ge N^{*,k}_t$.
5.4 Optimality of the Decisions

We finish the convergence analysis by proving that, with probability one, the algorithm learns an optimal decision for all states that can be reached by an optimal policy.

Theorem 2. Assume the conditions of Theorem 1 are satisfied. For $t = 0, \ldots, T$, on the event that $(W^*_t, R^*_t, \bar{v}^*_t, x^*_t)$ is an accumulation point of the sequence $\{(W^n_t, R^n_t, \bar{v}^{n-1}_t, x^n_t)\}_{n\ge 1}$ generated by the SPAR-Storage algorithm, $x^*_t$ is almost surely an optimal solution of
$$\max_{x_t \in \mathcal{X}_t(W^*_t, R^*_t)} F_t(v^*_t(W^*_t), W^*_t, R^*_t, x_t), \qquad (35)$$
where
$$F_t(v^*_t(W^*_t), W^*_t, R^*_t, x_t) = C_t(W^*_t, R^*_t, x_t) + \gamma V^*_t\Big(W^*_t,\ R^*_t - \sum_{i=1}^{B^d_t} x^d_{ti} + \sum_{i=1}^{B^s_t} x^s_{ti} - x^{tr}_t\Big).$$
Proof: Fix $\omega \in \Omega$. As before, the dependence on $\omega$ is omitted. At each iteration $n$ and time $t$ of the algorithm, the decision $x^n_t$ in STEP 3 of the algorithm is an optimal solution to the optimization problem $\max_{x_t \in \mathcal{X}_t(W^n_t, R^n_t)} F_t(\bar{v}^{n-1}_t(W^n_t), W^n_t, R^n_t, x_t)$. Since $F_t(\bar{v}^{n-1}_t(W^n_t), W^n_t, R^n_t, \cdot)$ is concave and $\mathcal{X}_t(W^n_t, R^n_t)$ is convex, we have that
$$0 \in \partial F_t(\bar{v}^{n-1}_t(W^n_t), W^n_t, R^n_t, x^n_t) - \mathcal{X}^N_t(W^n_t, R^n_t, x^n_t),$$
where ∂Ft(vn−1t (W n
t ),W nt , R
nt , x
nt ) is the subdifferential of Ft(v
n−1t (W n
t ),W nt , R
nt , ·) at xnt and
XNt (W n
t , Rnt , x
nt ) is the normal cone of Xt(W n
t , Rnt ) at xnt .
Then, by passing to the limit, we can conclude that each accumulation point $(W^*_t, R^*_t, \bar v^*_t, x^*_t)$ of the sequence $(W^n_t, R^n_t, \bar v^{n-1}_t, x^n_t)_{n \ge 1}$ satisfies the condition $0 \in \partial F_t(\bar v^*_t(W^*_t), W^*_t, R^*_t, x^*_t) + \mathcal{X}^N_t(W^*_t, R^*_t, x^*_t)$.
We now derive an expression for the subdifferential. We have that
\[
\partial F_t(\bar v^*_t(W^*_t), W^*_t, R^*_t, x_t) = \nabla C_t(W^*_t, R^*_t, x_t) + \gamma\, \partial \bar V^*_t(W^*_t, R^*_t + A \cdot x_t),
\]
where $A' = (\underbrace{-1, \ldots, -1}_{B^d}, \underbrace{1, \ldots, 1}_{B^s}, -1)$, that is, $A \cdot x_t = -\sum_{i=1}^{B^d_t} x^d_{ti} + \sum_{i=1}^{B^s_t} x^s_{ti} - x^{tr}_t$. From (Bertsekas et al., 2003, Proposition 4.2.5), for $x_t \in \mathbb{N}^{B^d + B^s + 1}$,
\[
\partial \bar V^*_t(W^*_t, R^*_t + A \cdot x_t) = \big\{(A_1 y, \ldots, A_{B^d + B^s + 1}\, y)^T : y \in [\bar v^*_t(W^*_t, R^*_t + A \cdot x_t + 1),\, \bar v^*_t(W^*_t, R^*_t + A \cdot x_t)]\big\}.
\]
Therefore, as $x^*_t$ is integer valued,
\[
\partial F_t(\bar v^*_t(W^*_t), W^*_t, R^*_t, x^*_t) = \nabla C_t(W^*_t, R^*_t, x^*_t) + \big\{\gamma A y : y \in [\bar v^*_t(W^*_t, R^*_t + A \cdot x^*_t + 1),\, \bar v^*_t(W^*_t, R^*_t + A \cdot x^*_t)]\big\}.
\]
Since $(W^*_t, R^*_t + A \cdot x^*_t)$ is an accumulation point of $(W^n_t, R^{x,n}_t)_{n \ge 0}$, it follows from theorem 1 that
\[
\bar v^*_t(W^*_t, R^*_t + A \cdot x^*_t) = v^*_t(W^*_t, R^*_t + A \cdot x^*_t) \quad\text{and}\quad \bar v^*_t(W^*_t, R^*_t + A \cdot x^*_t + 1) = v^*_t(W^*_t, R^*_t + A \cdot x^*_t + 1).
\]
Hence, $\partial F_t(\bar v^*_t(W^*_t), W^*_t, R^*_t, x^*_t) = \partial F_t(v^*_t(W^*_t), W^*_t, R^*_t, x^*_t)$ and
\[
0 \in \partial F_t(v^*_t(W^*_t), W^*_t, R^*_t, x^*_t) + \mathcal{X}^N_t(W^*_t, R^*_t, x^*_t),
\]
which proves that $x^*_t$ is an optimal solution of (35).
6 Numerical experiments
We have implemented the SPAR-Storage algorithm in an energy policy model called SMART
(Stochastic Multiscale model for the Analysis of energy Resources, Technology and policy),
described in Powell et al. (2009) (the numerical results reported here are based on this paper).
The problem involved modeling the allocation of 15 types of energy resources, where some were converted to electricity and distributed over the grid, while the others (notably transportation fuels and home heating) required capturing the conversion and movement of physical fuels to meet various demands. The model represents energy supplies and demand for the western
United States at an aggregate level.
The model captures hourly variations in wind, solar and demands, over a one-year horizon.
Each time period involves a fairly small linear program (comparable to the model in section 2)
that is linked over time through the presence of a single, grid-level storage facility in the form
of a reservoir for hydroelectric power. Water inflow to the reservoir, in the form of rainfall
and snow melt, is a highly seasonal process with typically high input in the late winter and
spring. Generating electricity from hydroelectric power is extremely cheap on the margin, but
there are benefits to holding water in the reservoir until it is needed in the summer. There is
a limit on how much water can be stored. In the optimal solution, the tendency is to begin
storing water late in the winter, hitting the maximum just before we begin drawing it down for
the summer months with no rainfall. This means that the algorithm has to be smart enough
to realize, in March, that there is a positive value for storing water that is not needed until
August, thousands of time periods later.
We undertook two sets of experiments to test the performance of the algorithm. The first considered a deterministic dataset, which could be solved optimally as a single linear program over all 8,760 time periods simultaneously. This is a good test of the algorithm, since the
ADP algorithm is not allowed to take advantage of the deterministic future. The deterministic
model was run with a single rainfall scenario, and two demand patterns: the first used constant
demand over the entire year, while the second used a sinusoidal curve to test more complex
behaviors for storing and releasing water. The ADP algorithm was run for 200 iterations
(a) Optimal, constant demand; (b) ADP, constant demand; (c) Optimal, sinusoidal demand; (d) ADP, sinusoidal demand.
Figure 6: Optimal reservoir levels (a) and (c) and reservoir levels from the ADP algorithm (b) and (d) for constant and sinusoidal demand (from Powell et al. (2009)).
(requiring approximately 5 seconds per iteration for the entire year), and produced an objective
function within 0.02 percent of the optimal solution produced by the linear programming
model. Despite this close agreement, we were interested in seeing the actual behavior in
terms of the amount held in storage, since small differences in the objective function can hide
significant differences in the decisions.
Figure 6 shows the rainfall, demand and, most important, the amount of water in storage
at each time period as produced by the ADP algorithm, and the optimal solution from the
linear programming model. Clearly there is very close agreement. Note also that in both runs,
the ADP solution was able to replicate the behavior of the optimal solution where water was
held in storage for over 2000 hours.
The ADP algorithm, of course, can also handle uncertainty. This was tested in the form
of stochastic rainfall. Figure 7 shows the reservoir level produced by the ADP algorithm
(as a single, sample path taken from the last iteration). Also shown is the optimal solution
Figure 7: Reservoir level produced by ADP (last iteration) for stochastic precipitation, and optimal reservoir levels produced by the LP when optimized over individual scenarios (from Powell et al. (2009)).
produced by the linear programming model when optimized over each sample path for rainfall.
Each of these so-called optimal solutions is allowed to see into the future. Not surprisingly,
the optimal, prescient solutions for each sample path are able to hold off longer from storing
water, whereas the ADP solution, quite reasonably, stores water earlier in the season because
of the possibility that the rainfall might be lower than expected.
The ADP-based algorithm in the SMART model was found to offer some attractive fea-
tures. First, while this paper focuses purely on storage tested over a one-year horizon, in the
SMART model it was used in the context of an energy policy model that extended for 20
years (over 175,000 time periods) with run times of only a few hours on a PC. The same ADP
algorithm described in this paper was also used to model energy investment decisions (how
many megawatts of capacity should we have for ethanol, wind, solar, coal, natural gas, ...), but
in this case, we used separable approximations of a nonseparable value function. For this set-
ting, we can no longer prove convergence, but we did obtain results within 0.24 percent of the
linear programming optimal solution, suggesting that the general algorithmic strategy scales
to vector-valued resource allocation problems. This finding is in line with other algorithmic
research with this class of algorithms (in particular, see Topaloglu & Powell (2006)).
7 Summary
We propose an approximate dynamic programming algorithm using pure exploitation for the
problem of planning energy flows over a year, in hourly increments, in the presence of a single
storage device, taking advantage of our ability to represent the value function as a piecewise
linear function. We are able to prove almost sure convergence of the algorithm using a pure
exploitation strategy. This ability is critical for this application, since we discretized each
value function (for 8,760 time periods) into thousands of breakpoints. Rather than needing
to visit all the possible values of the storage level many times, we only need to visit a small
number of these points (for each value of our information vector Wt) many times.
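The piecewise linear value function with concavity maintenance described above can be sketched in a few lines. This is our own minimal illustration of the idea, not the SPAR-Storage code: the value of stored energy is carried as a vector of slopes, one per unit of storage, and after each stochastic update a projection restores the monotone-decreasing (concave) slope property.

```python
# Minimal sketch of a slope update with a level projection: a sample slope
# vhat is smoothed into breakpoint r, and any resulting monotonicity
# violations are removed by setting the violating slopes equal to the new one.
def update_slopes(v, r, alpha, vhat):
    """Smooth vhat into slope r with stepsize alpha, then restore
    v[0] >= v[1] >= ... (concavity of the piecewise linear function)."""
    v = list(v)
    z = (1 - alpha) * v[r] + alpha * vhat      # unprojected slope
    v[r] = z
    for i in range(r - 1, -1, -1):             # slopes to the left must be >= z
        if v[i] < v[i + 1]:
            v[i] = v[i + 1]
    for i in range(r + 1, len(v)):             # slopes to the right must be <= z
        if v[i] > v[i - 1]:
            v[i] = v[i - 1]
    return v

v = [5.0, 4.0, 3.0, 2.0]
v = update_slopes(v, 2, alpha=0.5, vhat=10.0)  # sample pushes slope 3.0 up
assert all(v[i] >= v[i + 1] for i in range(len(v) - 1))  # concavity restored
print(v)
```

The pure-exploitation result hinges on exactly this structure: updates at visited breakpoints propagate information to unvisited ones through the projection.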
A key feature of our algorithm is that it is able to handle high-dimensional decisions in each
time period. For the energy application, this means that we can solve the relatively small (but
with hundreds of dimensions) decision problem that determines how much energy is coming
from each energy source and being converted to serve different types of energy demands. We
are able to solve sequences of deterministic linear programs very quickly by formulating the
value function around the post-decision state variable.
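The single-period decision problem formed around the post-decision state can be illustrated as follows. This is our own toy version with invented names and a one-dimensional decision; in the paper the same structure is a linear program with hundreds of dimensions.

```python
# Toy single-period problem: choose how much stored energy to release now,
# trading current revenue against a concave piecewise-linear approximation
# Vbar of the value of the post-decision resource level.
def Vbar(v, R):
    # value of holding R integer units, given marginal slopes v[0..B-1]
    return sum(v[:R])

def best_release(v, R, price, gamma=1.0):
    """Integer release h in [0, R] maximizing price*h + gamma*Vbar(R - h),
    where R - h is the post-decision resource level."""
    return max(range(R + 1), key=lambda h: price * h + gamma * Vbar(v, R - h))

v = [5.0, 4.0, 3.0, 2.0, 1.0]               # marginal value of each stored unit
assert best_release(v, 5, price=3.5) == 3   # keep the units worth more than 3.5
assert best_release(v, 5, price=10.0) == 5  # high price: release everything
```

Because the approximation is concave, the decision reduces to comparing the current price against the marginal slopes, which is why the per-period problems stay small and fast.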
Our numerical experiments with the SMART model suggest that the resulting algorithm
is quite fast, and when applied to a deterministic problem, produced solutions that closely
match the optimal solution produced by a linear program. This is a true test of the ADP algorithm, since it requires that it be able to determine when to start storing water in the winter so that we can meet demands late in the summer, thousands of hours later. The
algorithm requires approximately five seconds to do a forward pass over all 8,760 hourly time
periods within a year, and high quality solutions are obtained within 100 or 200 iterations.
Additional research is required to develop a practical algorithm in the presence of uncer-
tainty when the information vector Wt is large, since we have to estimate a piecewise linear
value function for each possible value of Wt. There are a number of statistical techniques that
can be used to address this problem. A possible line of investigation, described in George et
al. (2008), uses hierarchical aggregation, where a piecewise linear value function is estimated
for Wt at different levels of aggregation, and an estimate of the value function is formed using
a weighted sum of these different approximations. As the algorithm progresses, the weights
are adjusted so that higher weights are put on more disaggregate levels. This strategy ap-
pears promising, but we need to show a) that this strategy would still produce a convergent
algorithm and b) that it actually works in practice.
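The hierarchical-aggregation idea sketched above can be made concrete with a simplified weighting rule. The function below is our own toy rendering, not the exact scheme of George et al. (2008): estimates at each aggregation level are combined with weights inversely proportional to an estimate of each level's total squared error.

```python
# Combine value estimates from several aggregation levels of Wt; as the
# disaggregate estimate's error shrinks with more observations, its weight
# grows, shifting the combined estimate toward the disaggregate level.
def combined_estimate(estimates, sq_errors):
    """estimates[g]: value estimate at aggregation level g;
    sq_errors[g]: estimated total squared error (bias^2 + variance)."""
    w = [1.0 / max(e, 1e-12) for e in sq_errors]
    s = sum(w)
    return sum(wi / s * vi for wi, vi in zip(w, estimates))

# Early on, the disaggregate estimate (10.0) is noisy -> aggregate dominates.
early = combined_estimate([10.0, 12.0], sq_errors=[100.0, 1.0])
# Later, the disaggregate error shrinks -> its weight dominates.
late = combined_estimate([10.0, 12.0], sq_errors=[0.1, 1.0])
assert abs(early - 12.0) < 0.5 and abs(late - 10.0) < 0.5
```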
Acknowledgements
The authors would like to acknowledge the valuable comments and suggestions of Andrzej Ruszczynski. This research was supported in part by AFOSR contract FA9550-08-1-0195.
References

Ahmed, S. & Shapiro, A. (2002), The sample average approximation method for stochastic programs with integer recourse, E-print available at http://www.optimization-online.org.

Barto, A. G., Bradtke, S. J. & Singh, S. P. (1995), 'Learning to act using real-time dynamic programming', Artificial Intelligence, Special Volume on Computational Research on Interaction and Agency 72, 81–138.

Bertsekas, D. (2005), Dynamic Programming and Stochastic Control Vol. I, Athena Scientific.

Bertsekas, D. & Tsitsiklis, J. (1996), Neuro-Dynamic Programming, Athena Scientific, Belmont, MA.

Bertsekas, D., Nedic, A. & Ozdaglar, E. (2003), Convex Analysis and Optimization, Athena Scientific, Belmont, Massachusetts.

Brown, P. D., Peas-Lopes, J. A. & Matos, M. A. (2008), 'Optimization of pumped storage capacity in an isolated power system with large renewable penetration', IEEE Transactions on Power Systems 23(2), 523–531.

Castronuovo, E. D. & Lopes, J. A. P. (2004), 'On the optimization of the daily operation of a wind-hydro power plant', IEEE Transactions on Power Systems 19(3), 1599–1606.

Chen, Z.-L. & Powell, W. B. (1999), 'A convergent cutting-plane and partial-sampling algorithm for multistage stochastic linear programs with recourse', Journal of Optimization Theory and Applications 102(3), 497–524.

Fishbone, L. & Abilock, H. (1981), 'MARKAL - A linear programming model for energy systems analysis: Technical description of the BNL version', International Journal of Energy Research 5, 353–375.

Garcia-Gonzalez, J., de la Muela, R. M. R., Santos, L. M. & Gonzalez, A. M. (2008), 'Stochastic joint optimization of wind generation and pumped-storage units in an electricity market', IEEE Transactions on Power Systems 23(2), 460–468.

George, A., Powell, W. B. & Kulkarni, S. (2008), 'Value function approximation using multiple aggregation for multiattribute resource management', Journal of Machine Learning Research, pp. 2079–2111.

Godfrey, G. & Powell, W. B. (2002), 'An adaptive, dynamic programming algorithm for stochastic resource allocation problems I: Single period travel times', Transportation Science 36(1), 21–39.

Hernandez-Lerma, O. & Runggladier, W. (1994), 'Monotone approximations for convex stochastic control problems', J. Math. Syst., Estim. and Control 4, 99–140.

Higle, J. & Sen, S. (1991), 'Stochastic decomposition: An algorithm for two stage linear programs with recourse', Mathematics of Operations Research 16(3), 650–669.

Jaakkola, T., Jordan, M. I. & Singh, S. P. (1994), Convergence of stochastic iterative dynamic programming algorithms, in J. D. Cowan, G. Tesauro & J. Alspector, eds, 'Advances in Neural Information Processing Systems', Vol. 6, Morgan Kaufmann Publishers, San Francisco, pp. 703–710.

Lamont, A. (1997), User's Guide to the META*Net Economic Modeling System Version 2.0, Technical report, Lawrence Livermore National Laboratory.

Lamont, A. D. (2008), 'Assessing the long-term system value of intermittent electric generation technologies', Energy Economics 30(3), 1208–1231.

Messner, S. & Strubegger, M. (1995), User's Guide for MESSAGE III, Technical report, International Institute for Applied Systems Analysis (IIASA).

Nascimento, J. M. & Powell, W. B. (2009), 'An optimal approximate dynamic programming algorithm for the lagged asset acquisition problem', Mathematics of Operations Research 34(1), 210–237.

NEMS (2003), The National Energy Modeling System: An Overview 2003, Technical report, Energy Information Administration, Office of Integrated Analysis and Forecasting, U.S. Department of Energy, Washington, DC 20585.

Powell, W. B. (2007), Approximate Dynamic Programming: Solving the Curses of Dimensionality, John Wiley and Sons, New York.

Powell, W. B., George, A., Lamont, A. & Stewart, J. (2009), 'SMART: A stochastic multiscale model for the analysis of energy resources, technology and policy', Informs Journal on Computing.

Powell, W. B., Ruszczynski, A. & Topaloglu, H. (2004), 'Learning algorithms for separable approximations of stochastic optimization problems', Mathematics of Operations Research 29(4), 814–836.

Shapiro, A. (2003), Monte Carlo sampling methods, in A. Ruszczynski & A. Shapiro, eds, 'Handbooks in Operations Research and Management Science: Stochastic Programming', Vol. 10, Elsevier, Amsterdam, pp. 353–425.

Shiryaev, A. (1996), Probability Theory, Vol. 95 of Graduate Texts in Mathematics, Springer-Verlag, New York.

Topaloglu, H. & Powell, W. B. (2006), 'Dynamic programming approximations for stochastic, time-staged integer multicommodity flow problems', Informs Journal on Computing 18(1), 31–42.

Tsitsiklis, J. (2002), 'On the convergence of optimistic policy iteration', Journal of Machine Learning Research 3, 59–72.

Tsitsiklis, J. N. (1994), 'Asynchronous stochastic approximation and Q-learning', Machine Learning 16, 185–202.

Van Slyke, R. & Wets, R. (1969), 'L-shaped linear programs with applications to optimal control and stochastic programming', SIAM Journal of Applied Mathematics 17(4), 638–663.

Vanderbei, R. (1997), Linear Programming: Foundations and Extensions, Kluwer's International Series.
Appendix
The appendix consists of a brief summary of notation to assist with reading the proofs and
the proofs themselves.
Notation
For each random element, we provide its measurability.
• Filtrations
$\mathcal{F} = \sigma\big((W^n_t, x^n_t),\ n \ge 1,\ t = 0, \ldots, T\big)$.
$\mathcal{F}^n_t = \sigma\big(\{(W^m_{t'}, x^m_{t'}),\ 0 < m < n,\ t' = 0, \ldots, T\} \cup \{(W^n_{t'}, x^n_{t'}),\ t' = 0, \ldots, t\}\big)$.
• Post-decision state $(W^n_t, R^{x,n}_t)$
$R^{x,n}_t \in \mathcal{F}^n_t$: resource level after the decision. Integer and bounded by 0 and $B^R$.
$W^n_t \in \mathcal{F}^n_t$: Markovian information vector. Independent of the resource level.
$D^n_t \in \mathcal{F}^n_t$: demand vector. Nonnegative and integer valued.
$\mathcal{W}_t$: finite support set of $W^n_t$.
• Slopes (monotone decreasing in $R$ and bounded)
$v^*_t(W_t, R)$: slope of the optimal value function at $(W_t, R)$.
$z^n_t(W_t, R) \in \mathcal{F}^n_{t+1}$: unprojected slope of the value function approximation at $(W_t, R)$.
$\bar v^n_t(W_t, R) \in \mathcal{F}^n_{t+1}$: slope of the value function approximation at $(W_t, R)$.
$\bar v^*_t(W_t, R) \in \mathcal{F}$: accumulation point of $\{\bar v^n_t(W_t, R)\}_{n \ge 0}$.
$\hat v^n_t(R) \in \mathcal{F}^n_t$: sample slope at $R$.
• Stepsizes (bounded by 0 and 1, sum is $+\infty$, sum of the squares is $< +\infty$)
$\alpha^n_t \in \mathcal{F}^n_t$ and $\alpha^n_t(W_t, R) = \alpha^n_t\big(1_{\{W_t = W^n_t,\, R = R^{x,n}_t\}} + 1_{\{W_t = W^n_t,\, R = R^{x,n}_t + 1\}}\big)$.
• Finite random variable $N$. Iteration big enough for convergence analysis.
• Set of iterations and states (due to the projection operation)
$\mathcal{N}_t(W_t, R) \in \mathcal{F}$: iterations in which the unprojected slope at $(W_t, R)$ was decreased.
$\mathcal{S}_t \in \mathcal{F}$: states in which the projection had not decreased the unprojected slopes i.o.
• Dynamic programming operator $H$.
• Deterministic slope sequence $\{L^k_t(W_t, R)\}_{k \ge 0}$. Converges to $v^*_t(W_t, R)$.
• Error variable $\hat s^n_{t+1}(R) \in \mathcal{F}^n_{t+1}$.
• Stochastic noise sequence $\{\bar s^n_t(W_t, R)\}_{n \ge 0}$.
• Stochastic bounding sequence $\{l^n_t(W_t, R)\}_{n \ge 0}$.
Proofs of Lemmas

We start with the proof of proposition 2, then we present the proofs for the lemmas of the convergence analysis section.

Proof: [Proof of proposition 2] We use the notational convention that $L^k$ is the entire set of slopes $L^k = \{L^k_t(W_t) : t = 0, \ldots, T,\ W_t \in \mathcal{W}_t\}$, where $L^k_t(W_t) = (L^k_t(W_t, 1), \ldots, L^k_t(W_t, B^R))$.

We start by showing (23). Clearly, $L^k_T(W_T)$ is monotone decreasing (MD) for all $k \ge 0$ and $W_T \in \mathcal{W}_T$, since $L^k_T(W_T)$ is a vector of zeros. Thus, using condition (11), we have that $(HL^k)_{T-1}(W_{T-1})$ is MD. We keep in mind that, for $t = 0, \ldots, T - 1$, $L^0_t(W_t)$ is MD, as $L^0 = v^* - 2B^v e$. By definition, we have that $L^1_{T-1}(W_{T-1})$ is MD. A simple induction argument shows that $L^k_{T-1}(W_{T-1})$ is MD. Using the same argument for $t = T - 2, \ldots, 0$, we show that $(HL^k)_t(W_t)$ and $L^k_t(W_t)$ are MD.
We now show (24). Since $v^*$ is the unique fixed point of $H$, we have $L^0 = Hv^* - 2B^v e$. Moreover, from condition (13), we have that $(Hv^*)_t(W_t, R) - 2B^v \le (H(v^* - 2B^v e))_t(W_t, R) = (HL^0)_t(W_t, R)$ for all $W_t \in \mathcal{W}_t$ and $R = 1, \ldots, B^R_t$. Hence, $L^0 \le HL^0$ and
\[
L^0 \le L^1 = \frac{L^0 + HL^0}{2} \le HL^0.
\]
Suppose that (24) holds true for some $k > 0$. We shall prove it for $k + 1$. The induction hypothesis and (23) tell us that (12) holds true. Hence, $HL^{k+1} \ge HL^k \ge L^{k+1}$ and $HL^{k+1} \ge L^{k+2} \ge L^{k+1}$.

Finally, we show (25). A simple inductive argument shows that $L^k \le v^*$ for all $k \ge 0$. Thus, as the sequence is monotone and bounded, it is convergent. Let $L$ denote the limit. It is clear that $L_t(W_t)$ is monotone decreasing. Therefore, conditions (11)-(13) are also true when applied to $L$. Hence, as shown in (Bertsekas & Tsitsiklis, 1996, pages 158-159), we have that
\[
\|HL^k - HL\|_\infty \le \|L^k - L\|_\infty.
\]
With this inequality, it is straightforward to see that
\[
\lim_{k \to \infty} HL^k = HL.
\]
Therefore, as in the proof of (Bertsekas & Tsitsiklis, 1996, Lemma 3.4),
\[
L = \frac{L + HL}{2}.
\]
It follows that $L = v^*$, as $v^*$ is the unique fixed point of $H$.
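The averaging iteration $L^{k+1} = (L^k + HL^k)/2$ in proposition 2 is easy to visualize numerically. The operator $H$ below is a made-up monotone max-norm contraction with a known fixed point, chosen only for illustration; it is not the paper's dynamic programming operator.

```python
# Numerical illustration: started below the fixed point with L0 <= H(L0),
# the averaged iterates increase monotonically to the unique fixed point v*.
import numpy as np

def H(v):
    # toy monotone 0.5-contraction with fixed point v* = [2, 1]
    vstar = np.array([2.0, 1.0])
    return vstar + 0.5 * (v - vstar)

L = np.array([-4.0, -5.0])             # start well below the fixed point
prev = L.copy()
for _ in range(60):
    L = (L + H(L)) / 2                 # L^{k+1} = (L^k + H L^k)/2
    assert np.all(L >= prev - 1e-12)   # monotone increase, as in (24)
    prev = L.copy()
assert np.allclose(L, [2.0, 1.0], atol=1e-6)   # converged to v*, as in (25)
```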
Each lemma assumes all the conditions imposed and all the results obtained before its
statement in the proof of Theorem 1. To improve the comprehension of each proof, all the
assumptions are presented beforehand.
Proof: [Proof of lemma 5.1]

Assumptions: Assume stepsize conditions (15)-(16).

Fix $\omega \in \Omega$. Omitting the dependence on $\omega$, let $(W^*_t, R^{x,*}_t)$ be an accumulation point of $(W^n_t, R^{x,n}_t)_{n \ge 0}$. In order to simplify notation, let $\bar s^n_t(W^*_t, R^{x,*}_t)$ be denoted by $\bar s^{*,n}_t$ and $\alpha^n_t(W^*_t, R^{x,*}_t)$ be denoted by $\alpha^{*,n}_t$. Furthermore, let
\[
\theta^n_{t+1} = \hat s^n_{t+1}\big(R^n_t\, 1_{\{R^{x,*}_t \le R^{x,n}_t\}} + (R^n_t + 1)\, 1_{\{R^{x,*}_t > R^{x,n}_t\}}\big).
\]
We have, for $n \ge 1$,
\[
(\bar s^{*,n}_t)^2 \le \big[(1 - \alpha^{*,n}_t)\, \bar s^{*,n-1}_t + \alpha^{*,n}_t \theta^n_{t+1}\big]^2 = (\bar s^{*,n-1}_t)^2 - 2\alpha^{*,n}_t (\bar s^{*,n-1}_t)^2 + A^n_t, \tag{36}
\]
where $A^n_t = 2\alpha^{*,n}_t \bar s^{*,n-1}_t \theta^n_{t+1} + (\alpha^{*,n}_t)^2 (\theta^n_{t+1} - \bar s^{*,n-1}_t)^2$. We want to show that
\[
\sum_{n=1}^\infty A^n_t = 2\sum_{n=1}^\infty \alpha^{*,n}_t \bar s^{*,n-1}_t \theta^n_{t+1} + \sum_{n=1}^\infty (\alpha^{*,n}_t)^2 (\theta^n_{t+1} - \bar s^{*,n-1}_t)^2 < \infty.
\]
It is trivial to see that both $\bar s^{*,n-1}_t$ and $\theta^n_{t+1}$ are bounded. Thus, $(\theta^n_{t+1} - \bar s^{*,n-1}_t)^2$ is bounded and (16) tells us that
\[
\sum_{n=1}^\infty (\alpha^{*,n}_t)^2 (\theta^n_{t+1} - \bar s^{*,n-1}_t)^2 < \infty. \tag{37}
\]
Define a new sequence $\{g^n_{t+1}\}_{n \ge 0}$, where $g^0_{t+1} = 0$ and $g^n_{t+1} = \sum_{m=1}^n \alpha^{*,m}_t \bar s^{*,m-1}_t \theta^m_{t+1}$. We can easily check that $\{g^n_{t+1}\}_{n \ge 0}$ is a $\mathcal{F}^n_T$-martingale bounded in $L^2$. Measurability is obvious. The martingale equality follows from repeated conditioning and the unbiasedness property. Finally, the $L^2$-boundedness, and consequently the integrability, can be obtained by noticing that $(g^n_{t+1})^2 = (g^{n-1}_{t+1})^2 + 2 g^{n-1}_{t+1} \alpha^{*,n}_t \bar s^{*,n-1}_t \theta^n_{t+1} + (\alpha^{*,n}_t)^2 (\bar s^{*,n-1}_t \theta^n_{t+1})^2$. From the martingale equality and boundedness of $\bar s^{*,n-1}_t$ and $\theta^n_{t+1}$, we get
\[
\mathbb{E}\big[(g^n_{t+1})^2 \mid \mathcal{F}^{n-1}_T\big] \le (g^{n-1}_{t+1})^2 + C\, \mathbb{E}\big[(\alpha^{*,n}_t)^2 \mid \mathcal{F}^{n-1}_T\big],
\]
where $C$ is a constant. Hence, taking expectations and repeating the process, we obtain, from the stepsize assumption (16) and $\mathbb{E}[(g^0_{t+1})^2] = 0$,
\[
\mathbb{E}\big[(g^n_{t+1})^2\big] \le \mathbb{E}\big[(g^{n-1}_{t+1})^2\big] + C\, \mathbb{E}\big[(\alpha^{*,n}_t)^2\big] \le \mathbb{E}\big[(g^0_{t+1})^2\big] + C \sum_{m=1}^n \mathbb{E}\big[(\alpha^{*,m}_t)^2\big] < \infty.
\]
Therefore, the $L^2$-Bounded Martingale Convergence Theorem (Shiryaev, 1996, page 510) tells us that
\[
\sum_{n=1}^\infty \alpha^{*,n}_t \bar s^{*,n-1}_t \theta^n_{t+1} < \infty. \tag{38}
\]
Inequalities (37) and (38) show us that $\sum_{n=1}^\infty A^n_t < \infty$, and so, it is valid to write
\[
A^n_t = \sum_{m=n}^\infty A^m_t - \sum_{m=n+1}^\infty A^m_t.
\]
Therefore, as $-2\alpha^{*,n}_t (\bar s^{*,n-1}_t)^2 \le 0$, inequality (36) can be rewritten as
\[
(\bar s^{*,n}_t)^2 + \sum_{m=n+1}^\infty A^m_t \le (\bar s^{*,n-1}_t)^2 + \sum_{m=n}^\infty A^m_t. \tag{39}
\]
Thus, the sequence $\{(\bar s^{*,n-1}_t)^2 + \sum_{m=n}^\infty A^m_t\}_{n \ge 1}$ is decreasing and bounded from below, as $\sum_{m=1}^\infty A^m_t < \infty$. Hence, it is convergent. Moreover, as $\sum_{m=n}^\infty A^m_t \to 0$ when $n \to \infty$, we can conclude that $\{\bar s^{*,n}_t\}_{n \ge 0}$ converges.
Finally, as inequality (36) holds for all $n \ge 1$, it yields
\[
\begin{aligned}
(\bar s^{*,n}_t)^2 &\le (\bar s^{*,n-1}_t)^2 - 2\alpha^{*,n}_t (\bar s^{*,n-1}_t)^2 + A^n_t \\
&\le (\bar s^{*,n-2}_t)^2 - 2\alpha^{*,n-1}_t (\bar s^{*,n-2}_t)^2 + A^{n-1}_t - 2\alpha^{*,n}_t (\bar s^{*,n-1}_t)^2 + A^n_t \\
&\;\;\vdots \\
&\le -2\sum_{m=1}^n \alpha^{*,m}_t (\bar s^{*,m-1}_t)^2 + \sum_{m=1}^n A^m_t.
\end{aligned}
\]
Passing to the limits we obtain:
\[
\limsup_{n \to \infty}\, (\bar s^{*,n}_t)^2 + 2\sum_{m=1}^\infty \alpha^{*,m}_t (\bar s^{*,m-1}_t)^2 \le \sum_{m=1}^\infty A^m_t < \infty.
\]
This implies, together with the convergence of $\{\bar s^{*,n}_t\}_{n \ge 0}$, that $\sum_{m=1}^\infty \alpha^{*,m}_t (\bar s^{*,m-1}_t)^2 < \infty$. On the other hand, stepsize assumption (15) tells us that $\sum_{m=1}^\infty \alpha^{*,m}_t = \infty$. Hence, there must exist a subsequence of $\{\bar s^{*,n}_t\}_{n \ge 0}$ that converges to zero. Therefore, as every subsequence of a convergent sequence converges to its limit, it follows that $\{\bar s^{*,n}_t\}_{n \ge 0}$ converges to zero.
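The recursion analyzed in lemma 5.1 can also be simulated. The sketch below is our own toy illustration (bounded zero-mean uniform noise, stepsizes $\alpha_n = 1/n$ satisfying (15)-(16)), not part of the proof: the smoothed noise is driven to zero precisely because the stepsizes sum to infinity while their squares do not.

```python
# Simulate s_n = (1 - alpha_n) s_{n-1} + alpha_n * theta_n with zero-mean
# bounded noise theta_n and alpha_n = 1/n; s_n becomes the running average
# of the noise and converges to zero.
import random

random.seed(0)
s = 5.0                                  # arbitrary start
for n in range(1, 200001):
    alpha = 1.0 / n
    theta = random.uniform(-1.0, 1.0)    # bounded, zero-mean
    s = (1 - alpha) * s + alpha * theta
assert abs(s) < 0.05
```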
Proof: [Proof of lemma 5.2]

Assumptions: Given $t = 0, \ldots, T - 1$, $k \ge 0$ and integer $N^k_t$, assume on $n \ge N^k_t$ that (29), (26) and (27) hold true.

Again, fix $\omega \in \Omega$ and omit the dependence on $\omega$. The proof is by induction on $n$. Let $(W^*_t, R^{x,*}_t)$ be in $\mathcal{S}_t$. The proof for the base case $n = N^k_t$ is immediate from the fact that $\bar s^{N^k_t - 1}_t(W^*_t, R^{x,*}_t) = 0$, $l^{N^k_t - 1}_t(W^*_t, R^{x,*}_t) = L^k_t(W^*_t, R^{x,*}_t)$ and by the assumption that (29) holds true for $n \ge N^k_t$. Now suppose (33) holds for $n > N^k_t \ge N$; we need to prove it for $n + 1$.
In order to simplify the notation, let $\alpha^n_t(W^*_t, R^{x,*}_t)$ be denoted by $\alpha^n_t$ and $\bar v^n_t(W^*_t, R^{x,*}_t)$ be denoted by $\bar v^n_t$. We use the same shorthand notation for $z^n_t(W^*_t, R^{x,*}_t)$, $l^n_t(W^*_t, R^{x,*}_t)$ and $\bar s^n_t(W^*_t, R^{x,*}_t)$ as well. Keep in mind that the set of iterations $\mathcal{N}_t(W^*_t, R^{x,*}_t)$ is finite and, for all $n \ge N^k_t \ge N$, $\bar v^n_t(W^*_t, R^{x,*}_t) \ge z^n_t(W^*_t, R^{x,*}_t)$. We consider three different cases:
Case 1: $W^*_t = W^n_t$ and $R^{x,*}_t = R^{x,n}_t$.

In this case, $(W^*_t, R^{x,*}_t)$ is the state being visited by the algorithm at iteration $n$ at time $t$. Thus,
\[
\begin{aligned}
\bar v^n_t \ge z^n_t &= (1 - \alpha^n_t)\bar v^{n-1}_t + \alpha^n_t \hat v^n_{t+1}(R^{x,n}_t) \\
&\ge (1 - \alpha^n_t)(l^{n-1}_t - \bar s^{n-1}_t) + \alpha^n_t \hat v^n_{t+1}(R^{x,n}_t) - \alpha^n_t (H\bar v^{n-1})_t(W^n_t, R^{x,n}_t) + \alpha^n_t (H\bar v^{n-1})_t(W^n_t, R^{x,n}_t) \quad (40) \\
&\ge (1 - \alpha^n_t)(l^{n-1}_t - \bar s^{n-1}_t) - \alpha^n_t \hat s^n_{t+1}(R^{x,n}_t) + \alpha^n_t (HL^k)_t(W^n_t, R^{x,n}_t) \quad (41) \\
&= l^n_t - \big((1 - \alpha^n_t)\bar s^{n-1}_t + \alpha^n_t \hat s^n_{t+1}(R^{x,n}_t)\big) \quad (42) \\
&\ge l^n_t - \max\big(0,\, (1 - \alpha^n_t)\bar s^{n-1}_t + \alpha^n_t \hat s^n_{t+1}(R^{x,n}_t)\big) \\
&= l^n_t - \bar s^n_t. \quad (43)
\end{aligned}
\]
The first inequality is due to the construction of the set $\mathcal{S}_t$, while (40) is due to the induction hypothesis. Inequalities (26) and (27) for $n \ge N^k_t$ explain (41). Finally, (42) and (43) come from the definition of the stochastic sequences $l^n_t$ and $\bar s^n_t$, respectively.
Case 2: $W^*_t = W^n_t$ and $R^{x,*}_t = R^{x,n}_t + 1$.

This case is analogous to the previous one, except that we use the sample slope $\hat v^n_{t+1}(R^{x,n}_t + 1)$ instead of $\hat v^n_{t+1}(R^{x,n}_t)$. We also consider the $(W^n_t, R^{x,n}_t + 1)$ component, instead of $(W^n_t, R^{x,n}_t)$.
Case 3: Else.

Here the state $(W^*_t, R^{x,*}_t)$ is not being updated at iteration $n$ at time $t$ due to a direct observation of sample slopes. Then, $\alpha^n_t = 0$ and, hence,
\[
l^n_t = l^{n-1}_t \quad\text{and}\quad \bar s^n_t = \bar s^{n-1}_t.
\]
Therefore, from the construction of the set $\mathcal{S}_t$ and the induction hypothesis,
\[
\bar v^n_t \ge z^n_t = \bar v^{n-1}_t \ge l^{n-1}_t - \bar s^{n-1}_t = l^n_t - \bar s^n_t.
\]
Proof: [Proof of lemma 5.3]

Assumptions: Assume stepsize conditions (15)-(16). Moreover, assume for all $k \ge 0$ and integer $N^k_t$ that inequalities (26) and (27) hold true on $n \ge N^k_t$. Finally, assume there exists an integer random variable $N^k(W_t, R)$ such that $\bar v^{n-1}_t(W_t, R) \ge L^k_t(W_t, R)$ on $n \ge N^k(W_t, R)$, and that $|\mathcal{N}_t(W_t, R + 1)| = \infty$.

Pick $\omega \in \Omega$ and omit the dependence of the random elements on $\omega$. We start by showing, for each $k \ge 0$, that there exists an integer $N^{k,s}_t$ such that $\bar v^{n-1}_t(W_t, R + 1) \ge L^k_t(W_t, R + 1) - \bar s^{n-1}_t(W_t, R + 1)$ for all $n \ge N^{k,s}_t$. Then, we show, for all $\epsilon > 0$, that there is an integer $N^{k,\epsilon}_t$ such that $\bar v^{n-1}_t(W_t, R + 1) \ge L^k_t(W_t, R + 1) - \epsilon$ for all $n \ge N^{k,\epsilon}_t$. Finally, using these results, we prove existence of an integer $N^k(W_t, R + 1)$ such that $\bar v^{n-1}_t(W_t, R + 1) \ge L^k_t(W_t, R + 1)$ for all $n \ge N^k(W_t, R + 1)$.
Pick $k \ge 0$. Let $N^{k,s}_t = \min\{n \in \mathcal{N}_t(W_t, R + 1) : n \ge N^k(W_t, R)\} + 1$, where $N^k(W_t, R)$ is the integer such that $\bar v^{n-1}_t(W_t, R) \ge L^k_t(W_t, R)$ for all $n \ge N^k(W_t, R)$. Given the projection properties discussed in the proof of Theorem 1, $\mathcal{N}_t(W_t, R + 1)$ is infinite and $\bar v^{N^{k,s}_t - 1}_t(W_t, R) = \bar v^{N^{k,s}_t - 1}_t(W_t, R + 1)$. Therefore, $N^{k,s}_t$ is well defined. Redefine the noise sequence $\{\bar s^n_t(W_t, R + 1)\}_{n \ge 0}$ introduced in the proof of Theorem 1 using $N^{k,s}_t$ instead of $N^k_t$. We prove that $\bar v^{n-1}_t(W_t, R + 1) \ge L^k_t(W_t, R + 1) - \bar s^{n-1}_t(W_t, R + 1)$ for all $n \ge N^{k,s}_t$ by induction on $n$.
For the base case $n - 1 = N^{k,s}_t - 1$, from our choice of $N^{k,s}_t$ and the monotone decreasing property of $L^k_t(W_t)$, we have that
\[
\bar v^{n-1}_t(W_t, R + 1) = \bar v^{n-1}_t(W_t, R) \ge L^k_t(W_t, R) \ge L^k_t(W_t, R + 1) = L^k_t(W_t, R + 1) - \bar s^{n-1}_t(W_t, R + 1).
\]
Now, we suppose $\bar v^{n-1}_t(W_t, R + 1) \ge L^k_t(W_t, R + 1) - \bar s^{n-1}_t(W_t, R + 1)$ is true for $n - 1 > N^{k,s}_t$ and prove it for $n$. We have to consider two cases:
Case 1: $n \in \mathcal{N}_t(W_t, R + 1)$.

In this case, a projection operation took place at iteration $n$. This fact and the monotone decreasing property of $L^k_t(W_t)$ give us
\[
\bar v^{n-1}_t(W_t, R + 1) = \bar v^{n-1}_t(W_t, R) \ge L^k_t(W_t, R) \ge L^k_t(W_t, R + 1) \ge L^k_t(W_t, R + 1) - \bar s^{n-1}_t(W_t, R + 1).
\]

Case 2: $n \notin \mathcal{N}_t(W_t, R + 1)$.

The analysis of this case is analogous to the proof of inequality (33) of lemma 5.2 for $(W_t, R + 1) \in \mathcal{S}_t$. The difference is that we consider $L^k_t$ instead of the stochastic bounding sequence $\{l^n_t(W_t, R + 1)\}_{n \ge 0}$, and we note that
\[
(1 - \alpha^n_t(W_t, R + 1))\, L^k_t(W_t, R + 1) + \alpha^n_t(W_t, R + 1)\, (HL^k)_t(W_t, R + 1) \ge L^k_t(W_t, R + 1).
\]
Hence, we have proved that for all $k \ge 0$ there exists an integer $N^{k,s}_t$ such that $\bar v^{n-1}_t(W_t, R + 1) \ge L^k_t(W_t, R + 1) - \bar s^{n-1}_t(W_t, R + 1)$ for all $n \ge N^{k,s}_t$.

We move on to show, for all $k \ge 0$ and $\epsilon > 0$, that there is an integer $N^{k,\epsilon}_t$ such that $\bar v^{n-1}_t(W_t, R + 1) \ge L^k_t(W_t, R + 1) - \epsilon$ for all $n \ge N^{k,\epsilon}_t$. We have to consider two cases: (i) $(W_t, R + 1)$ is equal to an accumulation point $(W^*_t, R^{x,*}_t)$ or $(W^*_t, R^{x,*}_t + 1)$, and (ii) $(W_t, R + 1)$ is equal to neither an accumulation point $(W^*_t, R^{x,*}_t)$ nor $(W^*_t, R^{x,*}_t + 1)$. For the first case, lemma 5.1 tells us that $\bar s^n_t(W_t, R + 1)$ goes to zero. Then, there exists $N^\epsilon > 0$ such that $\bar s^n_t(W_t, R + 1) < \epsilon$ for all $n \ge N^\epsilon$. Therefore, we just need to choose $N^{k,\epsilon}_t = \max(N^{k,s}_t, N^\epsilon)$. For the second case, $\alpha^n_t(W_t, R + 1) = 0$ for all $n \ge N^{k,s}_t$ and $\bar s^{N^{k,s}_t - 1}_t(W_t, R + 1) = 0$. Thus, $\bar s^n_t(W_t, R + 1) = \bar s^{N^{k,s}_t - 1}_t(W_t, R + 1) = 0$ for all $n \ge N^{k,s}_t$, and we just have to choose $N^{k,\epsilon}_t = N^{k,s}_t$.
We are ready to conclude the proof. For that matter, we use the result of the previous paragraph. Pick $k > 0$. Let $\epsilon = v^*_t(W_t, R + 1) - L^k_t(W_t, R + 1) > 0$. Since $\{L^k_t(W_t, R + 1)\}_{k \ge 0}$ increases to $v^*_t(W_t, R + 1)$, there exists $k' > k$ such that $v^*_t(W_t, R + 1) - L^{k'}_t(W_t, R + 1) < \epsilon/2$. Thus, $L^{k'}_t(W_t, R + 1) - L^k_t(W_t, R + 1) > \epsilon/2$ and the result of the previous paragraph tells us that there exists $N^{k',\epsilon/2}_t$ such that $\bar v^{n-1}_t(W_t, R + 1) \ge L^{k'}_t(W_t, R + 1) - \epsilon/2 > L^k_t(W_t, R + 1) + \epsilon/2 - \epsilon/2 = L^k_t(W_t, R + 1)$ for all $n \ge N^{k',\epsilon/2}_t$. Therefore, we just need to choose