
Capacity of Channels with Memory and Feedback: Encoder Properties and Dynamic Programming

Charalambos D. Charalambous
Department of Electrical and Computer Engineering
University of Cyprus
Email: [email protected]

Christos K. Kourtellaris
Department of Electrical and Computer Engineering
University of Cyprus
Email: [email protected]

Christoforos Hadjicostis
Department of Electrical and Computer Engineering
University of Cyprus
Email: [email protected]

Abstract—This paper is concerned with capacity formulae for channels with memory and feedback, properties of the capacity achieving encoder, and dynamic programming for designing optimal encoders. The source is general and the techniques discussed include outputs of dynamic systems whose conditional probability distribution depends causally on the channel output and encoder law. First, encoder strategies are identified to maximize directed information between the source and the channel output. Second, various definitions of information capacity are introduced via directed information, and converse coding theorems are derived. Encoder properties which lead to a tight upper bound on achievable rates are identified. Specifically, it is shown that channel inputs need to be independent of past channel outputs. Third, the form of the capacity achieving encoder is described. The encoder law is a functional of the a posteriori distribution of the source output given a sequence of observable channel outputs. Here a generalization of the Posterior Matching Scheme to channels with memory and feedback is shown to hold. Finally, dynamic programming is discussed, identifying analogies with optimal stochastic control under partial information.

I. INTRODUCTION

Capacity of channels with feedback and associated coding theorems are often classified into Discrete Memoryless Channels (DMC) and channels with memory. For channels with memory and feedback the measure of information often employed is the so-called directed information, which accounts for causality and direction of information flow; it was introduced by Massey [1] and subsequently applied by Kramer [2], and this directional measure of information is attributed to Marko [3]. Shannon [4] and Dobrushin [5] derived formulas for the capacity of DMC and established coding theorems, while Ebert [6] and Cover and Pombra [7] characterized the capacity of Gaussian channels with memory and feedback, showing that memory can increase capacity. Tatikonda [8] generalized the information density of [9] to channels with memory and feedback, and employed dynamic programming to characterize the capacity achieving input distribution for certain types of channels. Chen and Berger [10] analyzed limited memory channels with feedback, in which the channel output {Y_i} and the channel input-output pair {X_i, Y_i} are assumed to be first order Markov processes; they maximized directed information via dynamic programming, presented a formula for channel capacity, and derived sufficient conditions under which coding theorems can be shown.

Recently, Shayevitz and Feder [11], [12], [13] introduced the so-called Posterior Matching Scheme (PMS), a recursive encoding scheme that achieves the capacity of DMC with feedback. This scheme goes back to the idea put forward by Horstein [14], who describes a scheme that achieves the capacity of a discrete memoryless symmetric channel with feedback. The PMS is further investigated by Gorantla and Coleman [15] for DMC with feedback.

This paper is concerned with channels with memory and feedback, under general conditions on the channel kernel and source kernel, and presents results along the following directions.

1) Encoder properties for the information capacity achieving distribution and tight bounds on the converse to the coding theorem;

2) Generalization of the PMS to design encoders which achieve the information capacity of channels with memory and feedback;

Future investigation will address maximization of directed information via dynamic programming using separated encoder strategies. The material discussed generalizes current and past research in the area of capacity of channels with memory and feedback. Specifically, the material on encoder properties for the capacity achieving distribution and the tight bounds on the converse to the coding theorem states that maximizing directed information over encoder strategies which are non-Markov with respect to the source is equivalent to maximizing over Markov encoding strategies. Moreover, a tight upper bound on the converse to the coding theorem is established if and only if the probability distribution of the encoder output is independent of past channel outputs. This is indeed a generalization of the capacity achieving channel input distribution of DMC, which states that Prob(X_i ≤ x_i | Y_0, Y_1, . . . , Y_{i−1}) = Prob(X_i ≤ x_i), ∀i, where X_i is the channel input and Y_i is the corresponding channel output. The material on the PMS describes a coding scheme which achieves the maximization of directed information of channels with memory and feedback, hence obtaining generalized results in comparison to [11], [12], [13], [15]. The material on maximizing directed information via dynamic programming is motivated by optimal stochastic control with partial information, in which separated strategies are employed [17], [18]. An information state is identified which carries all the information available in any channel output sequence.

Although no results are presented establishing the direct part of the coding theorem, it is believed that the work of Tatikonda [8], [16] is in principle applicable, although such an achievable rate might not be tight. This question will be addressed in subsequent treatment of the problems described in this paper.

II. PROBLEM FORMULATION

In this section the various blocks of the communication system of Fig. II.1 are defined on abstract alphabets (Polish spaces). Define the time set of non-negative integers Z_+ ≜ {0, 1, 2, . . .} and the finite set of integers Z^n_+ ≜ {0, 1, . . . , n}, n ∈ Z_+, and assume all processes (introduced below) are defined on a complete probability space (Ω, F(Ω), P) with filtration {F_t : t ∈ Z^n_+}. The alphabets of the source output, channel input, channel output and decoder output are sequences of Polish spaces {W_t : t = 0, 1, . . . , n}, {X_t : t = 0, 1, . . . , n}, {Y_t : t = 0, 1, . . . , n} and {Ŵ_t : t = 0, 1, . . . , n}, respectively (e.g., W_t, X_t, Y_t, Ŵ_t are complete separable metric spaces). Moreover, these abstract alphabets are associated with their corresponding measurable spaces (W_t, B(W_t)), (X_t, B(X_t)), (Y_t, B(Y_t)) and (Ŵ_t, B(Ŵ_t)) (e.g., B(X_t) is the Borel σ-algebra of subsets of the set X_t generated by closed sets). Thus, sequences are identified with the product measurable spaces

(W_{0,n}, B(W_{0,n})) ≜ ×_{k=0}^{n} (W_k, B(W_k)),
(X_{0,n}, B(X_{0,n})) ≜ ×_{k=0}^{n} (X_k, B(X_k)),
(Y_{0,n}, B(Y_{0,n})) ≜ ×_{k=0}^{n} (Y_k, B(Y_k)),
(Ŵ_{0,n}, B(Ŵ_{0,n})) ≜ ×_{k=0}^{n} (Ŵ_k, B(Ŵ_k)).

The source output, channel input, channel output, and decoder output are processes denoted by

W^n ≜ {W_t : t = 0, 1, . . . , n}, W : Z^n_+ × Ω → W_t,
X^n ≜ {X_t : t = 0, 1, . . . , n}, X : Z^n_+ × Ω → X_t,
Y^n ≜ {Y_t : t = 0, 1, . . . , n}, Y : Z^n_+ × Ω → Y_t,
Ŵ^n ≜ {Ŵ_t : t = 0, 1, . . . , n}, Ŵ : Z^n_+ × Ω → Ŵ_t,

where the subscript denotes the time evolution of these processes. Probability measures on any measurable space (Z, B(Z)) are denoted by M_1(Z). The definitions of stochastic kernels and conditional independence are given below.

Definition 2.1: Consider the measurable spaces (X, B(X)), (Y, B(Y)), (Z, B(Z)).
i) A stochastic kernel is a mapping q : B(Y) × X → [0, 1] satisfying the following two properties:
1) for every x ∈ X, the set function q(·; x) is a probability measure (possibly finitely additive) on B(Y);
2) for every F ∈ B(Y), the function q(F; ·) is B(X)-measurable.
The set of all such stochastic kernels is denoted by Q(Y; X).
ii) The σ-algebra B(Z) is called conditionally independent of B(X) given B(Y) if and only if the stochastic kernel q ∈ Q(Z; X × Y) satisfies

q(A; x, y) = q(A; y), ∀A ∈ B(Z), for almost all x ∈ X, y ∈ Y.

If (X, B(X)), (Y, B(Y)), (Z, B(Z)) are associated with random variables X : (Ω, F) → (X, B(X)), Y : (Ω, F) → (Y, B(Y)), Z : (Ω, F) → (Z, B(Z)), then the definition is equivalent to q(dz; x, y) = q(dz; y), for almost all x ∈ X, y ∈ Y. Such conditional independence is denoted by (X, Z) ⊥ Y and it is equivalent to X ↔ Y ↔ Z forming a Markov chain in both directions.

[Fig. II.1. General Communication System with Feedback: Source (W_t) → Encoder (X_t) → Channel (Y_t) → Decoder (Ŵ_t), with the channel output fed back to the encoder.]

Given the communication channel of Fig. II.1, the different blocks of this system are defined below.
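To make Definition 2.1 concrete, the following is a minimal finite-alphabet sketch (the alphabets, random kernels and tolerance are illustrative assumptions, not part of the paper): a stochastic kernel q ∈ Q(Z; X × Y) is stored as an array of conditional probability vectors, and conditional independence of B(Z) from B(X) given B(Y), i.e., X ↔ Y ↔ Z, amounts to those vectors not depending on x.

```python
import numpy as np

# Minimal finite-alphabet illustration of Definition 2.1 (not from the paper):
# a stochastic kernel q(dz; x, y) is stored as an array q[x, y, z] whose rows
# sum to one, and conditional independence of Z from X given Y means that
# q[x, y, :] does not depend on x.

rng = np.random.default_rng(0)

def random_kernel(nx, ny, nz):
    """Random stochastic kernel q(dz; x, y) on finite alphabets."""
    q = rng.random((nx, ny, nz))
    return q / q.sum(axis=-1, keepdims=True)

def is_conditionally_independent(q, tol=1e-12):
    """Check q(A; x, y) = q(A; y) for all x, i.e., X <-> Y <-> Z."""
    return np.allclose(q, q[:1, :, :], atol=tol)

# A kernel that ignores x (so Z is conditionally independent of X given Y) ...
q_y = random_kernel(1, 3, 4)               # depends on y only
q_ci = np.repeat(q_y, 2, axis=0)           # copied across the x-axis
# ... and one that genuinely depends on x.
q_dep = random_kernel(2, 3, 4)

print(is_conditionally_independent(q_ci))   # True
print(is_conditionally_independent(q_dep))  # False (almost surely)
```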

A. Definition of Sub-Systems

Throughout this paper it is assumed that the σ-algebras σ{X^{−1}} = σ{Y^{−1}} = σ{W^{−1}} = {∅, Ω}.

Information Source: The information source is a sequence of stochastic kernels {P_j(dw_j; w^{j−1}, y^{j−1}, x^{j−1}) ∈ Q(W_j; W_{0,j−1} × Y_{0,j−1} × X_{0,j−1}) : j = 0, 1, . . . , n}, n ∈ Z_+.

Channel Encoder: The encoder is a sequence of stochastic kernels {P_j(dx_j; x^{j−1}, w^j, y^{j−1}) ∈ Q(X_j; X_{0,j−1} × W_{0,j} × Y_{0,j−1}) : j = 0, 1, . . . , n}, n ∈ Z_+.

The set of admissible randomized encoders satisfying power constraints, denoted by Q^{nm}_{ad}[0, n] ⊆ {Q(X_j; X_{0,j−1} × W_{0,j} × Y_{0,j−1}) : j = 0, 1, . . . , n}, is called non-Markov with respect to the source if for each time j ∈ Z^n_+ the randomized encoder depends on the entire history of the source symbols W^j = w^j in addition to Y^{j−1} = y^{j−1}, X^{j−1} = x^{j−1}. The set of admissible randomized encoder strategies, denoted by Q_{ad}[0, n] ⊆ {Q(X_j; W_j × Y_{0,j−1}) : j = 0, 1, . . . , n}, is called Markov with respect to the source if for each time j ∈ Z^n_+ the randomized encoder depends on the symbol W_j = w_j in addition to Y^{j−1} = y^{j−1}.

A deterministic encoder is a sequence of delta measures and hence it is identified by a sequence of functions {e_j : X_{0,j−1} × W_{0,j} × Y_{0,j−1} → X_j : x_j = e_j(x^{j−1}, w^j, y^{j−1}), e_j measurable, j = 0, 1, . . . , n}. The set of deterministic encoder strategies denoted by E^{nm}_{ad}[0, n] is called non-Markov with respect to the source if for each time j ∈ Z^n_+, X_j is B(W^j) × B(X^{j−1}) × B(Y^{j−1})-measurable. Thus, for each realization Y^{j−1} = y^{j−1} the encoder strategy e_j(·, ·, ·) is a function of the past realizations W^j = w^j. The set of deterministic encoder strategies denoted by E_{ad}[0, n] is called Markov with respect to the source if for each time j ∈ Z^n_+, X_j is B(W_j) × B(Y^{j−1})-measurable.

Communication Channel: A communication channel is a sequence of stochastic kernels {P_j(dy_j; y^{j−1}, x^j, w^j) ∈ Q(Y_j; Y_{0,j−1} × X_{0,j} × W_{0,j}) : j = 0, 1, . . . , n}, n ∈ Z_+.

Channel Decoder: The decoder is a sequence of stochastic kernels {P_j(dŵ_j; ŵ^{j−1}, y^j) ∈ Q(Ŵ_j; Ŵ_{0,j−1} × Y_{0,j}) : j = 0, 1, . . . , n}, n ∈ Z_+. A deterministic decoder is a sequence of delta measures identified by a sequence of functions {d_j : Ŵ_{0,j−1} × Y_{0,j} → Ŵ_j : ŵ_j = d_j(ŵ^{j−1}, y^j), d_j measurable, j = 0, 1, . . . , n}.

Definition 2.2: An (n, M_n, ε_n) code for the channel consists of the following.

1) A set of messages M_n ≜ {1, 2, . . . , M_n} and a class of encoders (deterministic or random), measurable mappings {ϕ_i : M_n × Y_{0,i−1} → X_i : i = 0, 1, . . . , n−1}, that transform each message W ∈ M_n into a channel input X^{n−1} ∈ X_{0,n−1}. For example, ϕ ∈ E_{ad}[0, n−1] is the set of encoding strategies {ϕ_i : i = 0, 1, . . . , n−1} such that {X_i = ϕ_i(W, Y^{i−1}) : i = 0, 1, . . . , n−1}. Note that the more general strategies satisfy {ϕ_i(W, X^{i−1}, Y^{i−1}) : i = 0, 1, . . . , n−1} = {ϕ_i(W, Y^{i−1}) : i = 0, 1, . . . , n−1}.

2) A class of decoder measurable mappings d : Y_{0,n−1} → M_n such that the average probability of decoding error satisfies

P^n_e ≜ (1/M_n) Σ_{w ∈ M_n} Prob(Ŵ ≠ w | W = w) = ε_n,

where Ŵ = d(Y^{n−1}).

Definition 2.3: R is an ε-achievable rate if there exists an (n, M_n, ε_n) code satisfying limsup_{n→∞} ε_n ≤ ε and liminf_{n→∞} (1/n) log M_n ≥ R. The supremum of all ε-achievable rates R for all 0 ≤ ε < 1 is defined as the channel capacity.

B. Directed Information

Given a source, an encoder and a channel, one can define the joint probability measure as follows:

P_{0,n}(dw^n, dx^n, dy^n) = ⊗_{i=0}^{n} P_i(dy_i; y^{i−1}, x^i, w^i) ⊗ P_i(dx_i; y^{i−1}, x^{i−1}, w^i) ⊗ P_i(dw_i; y^{i−1}, x^{i−1}, w^{i−1}).

From the joint probability measure any other marginal or joint probability measures can be obtained by integration. The directed information from the source output to the channel output is defined below.

I(W^n → Y^n) ≜ Σ_{i=0}^{n} I(W^i; Y_i | Y^{i−1})
             = Σ_{i=0}^{n} ∫ log [ P_i(dy_i; y^{i−1}, w^i) / P_i(dy_i; y^{i−1}) ] P_{0,i}(dy^i, dw^i)    (II.1)
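For finite alphabets, (II.1) can be evaluated by brute-force enumeration. The sketch below is illustrative only: the particular Markov source, feedback encoder and unit-memory channel are assumptions chosen to make the computation concrete, not models taken from the paper.

```python
import itertools
import math
from collections import defaultdict

# A minimal finite-alphabet sketch of the directed information
# I(W^n -> Y^n) = sum_i I(W^i; Y_i | Y^{i-1}) in (II.1), computed by
# exhaustive enumeration.  The source, encoder and channel below are
# illustrative assumptions.

n = 3                                  # horizon: time indices 0,...,n-1
W = X = Y = (0, 1)                     # binary alphabets

def p_source(w, w_prev):               # first-order Markov source kernel
    return 0.9 if w == w_prev else 0.1

def encoder(w_seq, y_seq):             # Markov (in the source) feedback encoder
    return w_seq[-1] ^ (y_seq[-1] if y_seq else 0)

def p_channel(y, y_prev, x):           # unit-memory channel kernel P(y_i; y_{i-1}, x_i)
    eps = 0.1 if (y_prev is None or y_prev == 0) else 0.3
    return 1 - eps if y == x else eps

# Joint distribution P(w^n, y^n) induced by source, encoder and channel.
joint = defaultdict(float)
for w_seq in itertools.product(W, repeat=n):
    for y_seq in itertools.product(Y, repeat=n):
        p = 1.0
        for i in range(n):
            p *= p_source(w_seq[i], w_seq[i - 1] if i > 0 else 0)
            x = encoder(w_seq[: i + 1], list(y_seq[:i]))
            p *= p_channel(y_seq[i], y_seq[i - 1] if i > 0 else None, x)
        joint[(w_seq, y_seq)] = p

def marginal(cond):                    # probability of an event given as a predicate
    return sum(p for k, p in joint.items() if cond(*k))

# I(W^n -> Y^n) = sum_i E[ log P(y_i | y^{i-1}, w^i) / P(y_i | y^{i-1}) ].
di = 0.0
for (w_seq, y_seq), p in joint.items():
    if p == 0.0:
        continue
    for i in range(n):
        num = marginal(lambda w, y: w[: i + 1] == w_seq[: i + 1] and y[: i + 1] == y_seq[: i + 1]) \
            / marginal(lambda w, y: w[: i + 1] == w_seq[: i + 1] and y[:i] == y_seq[:i])
        den = marginal(lambda w, y: y[: i + 1] == y_seq[: i + 1]) \
            / marginal(lambda w, y: y[:i] == y_seq[:i])
        di += p * math.log2(num / den)

print("I(W^n -> Y^n) =", di, "bits over", n, "uses")
```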

Two problems are formulated and solutions are sought. The first problem is to find optimal encoders which maximize the directed information (II.1). The second problem is to find an encoder which achieves the information capacity defined via (II.1). Both are defined below.

C. Pay-offs

Problem 2.4: (Maximizing Directed Information)

(a) Randomized Encoders: Given an admissible encoder class Q^{nm}_{ad}[0, n], find {P*_j(dx_j; x^{j−1}, w^j, y^{j−1}) : j = 0, 1, . . . , n} ∈ Q^{nm}_{ad}[0, n] which maximizes the directed information

J_{0,n}({P*_j}_{j=0}^{n}) ≜ max_{ {P_j}_{j=0}^{n} ∈ Q^{nm}_{ad}[0,n] } (1/(n+1)) I(W^n → Y^n).

(b) Deterministic Encoders: Given an admissible class of encoders E^{nm}_{ad}[0, n], find {e*_j(x^{j−1}, w^j, y^{j−1}) : j = 0, 1, . . . , n} ∈ E^{nm}_{ad}[0, n] which maximizes the directed information

J_{0,n}({e*_j}_{j=0}^{n}) ≜ max_{ {e_j}_{j=0}^{n} ∈ E^{nm}_{ad}[0,n] } (1/(n+1)) I(W^n → Y^n).

Problem 2.5: (Achieving Information Capacity)


Given an admissible set of source and channel inputs A_{ad}[0, n] and a definition of information capacity over a finite horizon,

C_{0,n} ≜ sup_{(W^n, X^n) ∈ A_{ad}[0,n]} (1/(n+1)) I(W^n → Y^n),

find an encoder, either randomized or deterministic, which achieves the information capacity C_{0,n}.

III. MAXIMIZATION OF DIRECTED INFORMATION OVER ENCODER STRATEGIES

Consider Problem 2.4 (b) of maximizing the directed information over the class of deterministic encoder strategies {e_j(x^{j−1}, w^j, y^{j−1}) : j = 0, 1, . . . , n} ∈ E^{nm}_{ad}[0, n]. The information structure of the encoder at any time j ∈ Z^n_+ is {(x^{j−1}, w^j, y^{j−1}) : j = 0, 1, . . . , n}, and a specific strategy {e_0, . . . , e_n} ∈ E^{nm}_{ad}[0, n] gives

x_j = e_j(x^{j−1}, w^j, y^{j−1}) = e_j(e_0(x^{−1}, w_0, y^{−1}), e_1(x^0, w^1, y^0), . . . , e_{j−1}(x^{j−2}, w^{j−1}, y^{j−2}), w^j, y^{j−1}), j ∈ Z^n_+.

The first goal is to identify general conditions so that maximizing I(W^n → Y^n) over an encoder with information structure {(x^{j−1}, w^j, y^{j−1}) : j = 0, 1, . . . , n} is equivalent to maximizing I(W^n → Y^n) over an encoder with information structure {(w_j, y^{j−1}) : j = 0, 1, . . . , n}. Thus, under these conditions, encoder strategies are given by {e_j(w_j, y^{j−1}) : j = 0, 1, . . . , n} ∈ E_{ad}[0, n].

The following conditions are important.

Assumption 3.1: The information source is restricted to a sequence of stochastic kernels

P_j(dw_j; w^{j−1}, y^{j−1}, x^{j−1}) = P_j(dw_j; w_{j−1}, y^{j−1}, x_{j−1}), for almost all (w^{j−2}, x^{j−2}), j = 0, 1, . . . , n, n ∈ Z_+.

Assumption 3.2: The communication channel is restricted to a sequence of stochastic kernels

P_j(dy_j; y^{j−1}, x^j, w^j) = P_j(dy_j; y^{j−1}, x_j, w_j), for almost all (w^{j−1}, x^{j−1}), j = 0, 1, . . . , n, n ∈ Z_+.

Theorem 3.3: Under Assumptions 3.1 and 3.2 the following hold.

(a) Randomized Encoders: The sequence of optimal encoder strategies maximizing I(W^n → Y^n) over Q^{nm}_{ad}[0, n] has the form

P*_j(dx_j; w^j, y^{j−1}, x^{j−1}) = P_j(dx_j; w_j, y^{j−1}), for almost all (w^{j−1}, x^{j−1}), j = 0, 1, . . . , n, n ∈ Z_+,

and

J_{0,n}({P*_j}_{j=0}^{n}) ≜ sup_{ {P_j(dx_j; x^{j−1}, w^j, y^{j−1})}_{j=0}^{n} ∈ Q^{nm}_{ad}[0,n] } I(W^n → Y^n)
                         = sup_{ {P_j(dx_j; w_j, y^{j−1})}_{j=0}^{n} ∈ Q_{ad}[0,n] } I(W^n → Y^n).

(b) Deterministic Encoders: The sequence of optimal encoder strategies maximizing I(W^n → Y^n) over E^{nm}_{ad}[0, n] has the form

e*_j(x^{j−1}, w^j, y^{j−1}) = g_j(w_j, y^{j−1}), j = 0, 1, . . . , n, n ∈ Z_+,

and

J_{0,n}({e*_j}_{j=0}^{n}) ≜ sup_{ {e_j(w^j, x^{j−1}, y^{j−1})}_{j=0}^{n} ∈ E^{nm}_{ad}[0,n] } I(W^n → Y^n)
                         = sup_{ {g_j(w_j, y^{j−1})}_{j=0}^{n} ∈ E_{ad}[0,n] } I(W^n → Y^n).

Proof. The derivation is based on stochastic optimal control techniques.

The point to be made in Theorem 3.3 is that, under Assumptions 3.1 and 3.2, maximizing directed information over non-Markov strategies is equivalent to maximizing it over Markov (with respect to the source) strategies, for both deterministic and randomized strategies. This property of encoders will help identify a definition of operational capacity, e.g., the direct part of the coding theorem, and its converse, which will lead to an upper bound on achievable rates.

The next definition of separated encoder strategies is often employed in stochastic control systems with partial information. Recently, such strategies have been analyzed via the so-called PMS [11], [12], [13] to design encoders which achieve the capacity of memoryless channels with feedback. One of the objectives of this paper is to generalize the PMS to channels with memory and feedback.

Definition 3.4: (Separated Encoder Strategies) Define the conditional distribution Π^x(dw_j; y^{j−1}) ≜ Prob(W_j ∈ dw_j | Y^{j−1} = y^{j−1}), j ∈ Z^n_+.

(a) Randomized Encoders: A randomized encoder {P_j}_{j=0}^{n} ∈ Q_{ad}[0, n] is called separated if P_j(dx_j; w_j, y^{j−1}) depends on Y^{j−1} = y^{j−1} only through the conditional distribution Π^x(dw_j; y^{j−1}), j ∈ Z^n_+. The set of separated randomized encoder strategies is denoted by Q^{sep}_{ad}[0, n].

(b) Deterministic Encoders: A deterministic encoder {g_j}_{j=0}^{n} ∈ E_{ad}[0, n] is called separated if x_j = g_j(w_j, y^{j−1}) depends on Y^{j−1} = y^{j−1} only through the conditional distribution Π^x(dw_j; y^{j−1}), j ∈ Z^n_+. The set of separated deterministic encoder strategies is denoted by E^{sep}_{ad}[0, n].

Thus, for any {g_j}_{j=0}^{n} ∈ E^{sep}_{ad}[0, n], the encoder strategy at time j is of the form g_j(w_j, y^{j−1}) = g_j(w_j, Π^x(dw_j; y^{j−1})). Such separated encoder strategies are well analyzed in stochastic control problems with partial information [17], [18]. The connection is the following. Although one starts with a partially observable stochastic control problem, by identifying an information state (a quantity that carries the same information as the observations), in this case the conditional distribution, the partially observable problem is converted into a fully observable problem with pay-off expressed as a functional of the information state. The resulting equivalent optimization problem is to control the information state, via separated control strategies, in order to incur the best possible pay-off. By analogy, one can express the directed information I(W^n → Y^n) in terms of the information state Π^x(dw_j; y^{j−1}) and then employ separated encoder strategies to maximize it, subject to a dynamic recursion satisfied by the information state. In principle, this methodology will lead to a principle of optimality and an associated dynamic programming recursion satisfied by the optimal cost-to-go.
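A minimal sketch of such an information-state recursion, for a finite-alphabet first-order Markov source and a unit-memory channel driven by a separated deterministic encoder (all of which are illustrative assumptions, not models from the paper), is a Bayes update followed by a one-step prediction through the source kernel:

```python
import numpy as np

# A minimal sketch (illustrative assumptions) of the information-state
# recursion for Pi^x(dw_j; y^{j-1}): a Bayes update of the posterior over the
# current source symbol, followed by a one-step prediction through the source
# kernel.  The source alphabet is finite and the channel kernel
# P(y_j; y_{j-1}, x_j) is known.

A = np.array([[0.9, 0.1],                    # source transition P(w_{j+1} | w_j)
              [0.2, 0.8]])

def p_channel(y, y_prev, x):                 # unit-memory channel P(y_j; y_{j-1}, x_j)
    eps = 0.1 if y_prev == 0 else 0.3
    return 1 - eps if y == x else eps

def information_state_update(pi, encoder_map, y, y_prev):
    """One step: pi = P(W_j = . | y^{j-1})  ->  P(W_{j+1} = . | y^j).

    encoder_map[w] is the channel input a (separated) encoder sends when the
    current source symbol is w, for the given channel-output history.
    """
    likelihood = np.array([p_channel(y, y_prev, encoder_map[w]) for w in range(len(pi))])
    posterior = pi * likelihood
    posterior /= posterior.sum()             # Bayes update: P(W_j = . | y^j)
    return posterior @ A                     # predict: P(W_{j+1} = . | y^j)

# Usage: start from the source's initial distribution and fold in channel outputs.
pi = np.array([0.5, 0.5])
for y_prev, y in [(0, 1), (1, 1), (1, 0)]:   # an assumed output history
    pi = information_state_update(pi, encoder_map={0: 0, 1: 1}, y=y, y_prev=y_prev)
print("information state P(W_j = . | y^{j-1}):", pi)
```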

IV. MUTUAL AND DIRECTED INFORMATION FOR CHANNELS WITH MEMORY AND FEEDBACK

The mutual information between two random sequences W^n and Y^n, denoted by I(W^n; Y^n), is a measure of the average information the sequence Y^n conveys about the sequence W^n. Since it is symmetric, it is also a measure of the average information the sequence W^n conveys about the sequence Y^n. A particular decomposition of mutual information, described in [3], is the following:

I(W^n; Y^n) ≜ E[ log ( P_{0,n}(dy^n; w^n) / P_{0,n}(dy^n) ) ]
            = Σ_{i=0}^{n} ∫ D( P_i(·; y^{i−1}, w^i) || P_i(·; y^{i−1}) ) P_i(dw_i; w^{i−1}, y^{i−1}) P_{0,i−1}(dy^{i−1}, dw^{i−1})
            + Σ_{i=0}^{n} ∫ D( P_i(·; w^{i−1}, y^{i−1}) || P_i(·; w^{i−1}) ) P_{0,i−1}(dw^{i−1}, dy^{i−1})
            = I(W^n → Y^n) + I(W^n ← Y^n),

where D(P || Q) ≜ ∫ log(dP/dQ) dP is the relative entropy of the probability measure P with respect to Q (which may take the value +∞), and

I(W^n → Y^n) ≜ Σ_{i=0}^{n} I(W^i; Y_i | Y^{i−1})    (IV.2)
I(W^n ← Y^n) ≜ Σ_{i=0}^{n} I(Y^{i−1}; W_i | W^{i−1})    (IV.3)

The quantity I(W^n → Y^n) represents the average information in the direction W^n → Y^n (feedforward) via the sequence of channels {P_i(dy_i; y^{i−1}, w^i) : i = 0, 1, . . . , n}, while I(W^n ← Y^n) represents the average information in the direction W^n ← Y^n (feedback) via the sequence of channels {P_i(dw_i; w^{i−1}, y^{i−1}) : i = 0, 1, . . . , n}. Given a general channel with memory and feedback, the quantity which is often used to derive the operational meaning of capacity is I(W^n → Y^n).
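The decomposition I(W^n; Y^n) = I(W^n → Y^n) + I(W^n ← Y^n) can be checked numerically on a toy joint distribution; the binary random joint distribution below is only an illustrative assumption, while the identity itself holds for any joint distribution of (W^n, Y^n).

```python
import numpy as np

# Numerical check (toy, binary alphabets) of Marko's decomposition:
#   I(W^n; Y^n) = I(W^n -> Y^n) + I(W^n <- Y^n).
# The random joint distribution is an arbitrary illustrative choice.

n = 3
rng = np.random.default_rng(1)
p = rng.random((2,) * (2 * n))            # axes 0..n-1: W_0..W_{n-1}; n..2n-1: Y_0..Y_{n-1}
p /= p.sum()

def H(axes):
    """Entropy (bits) of the marginal over the given axes of the joint p."""
    keep = tuple(sorted(axes))
    drop = tuple(a for a in range(2 * n) if a not in keep)
    m = p.sum(axis=drop) if drop else p
    m = m[m > 0]
    return float(-(m * np.log2(m)).sum())

W = lambda i: list(range(i))              # axes of (W_0, ..., W_{i-1})
Y = lambda i: list(range(n, n + i))       # axes of (Y_0, ..., Y_{i-1})

mi = H(W(n)) + H(Y(n)) - H(W(n) + Y(n))   # I(W^n; Y^n)

di_fwd = sum(H(W(i + 1) + Y(i)) + H(Y(i + 1)) - H(W(i + 1) + Y(i + 1)) - H(Y(i))
             for i in range(n))           # sum_i I(W^i; Y_i | Y^{i-1})
di_bwd = sum(H(Y(i) + W(i)) + H(W(i + 1)) - H(Y(i) + W(i + 1)) - H(W(i))
             for i in range(n))           # sum_i I(Y^{i-1}; W_i | W^{i-1})

print(mi, di_fwd + di_bwd)                # the two numbers agree
```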

Definition 4.1: The finite time information capacity is defined by

C^1_{0,n} ≜ sup_{(W^n, X^n) ∈ A_{ad}[0,n]} (1/(n+1)) I(W^n → Y^n).    (IV.4)

Moreover,

C^1_∞ ≜ liminf_{n→∞} sup_{(W^n, X^n) ∈ A_{ad}[0,n]} (1/(n+1)) I(W^n → Y^n).

The set A_{ad}[0, n] describes the constraints on the source and encoder.

The reason for the above definition is that often, when the channel has memory, the upper bound I(W^n → Y^n) ≤ I(X^n → Y^n) may not be finite and/or W^i ↔ (X^i, Y^{i−1}) ↔ Y_i, i = 0, 1, . . . , n, may not be a Markov chain.

Suppose Assumptions 3.1 and 3.2 hold. Then

I(W^n → Y^n) = Σ_{i=0}^{n} E[ log ( P_i(dy_i; y^{i−1}, w^i) / P_i(dy_i; y^{i−1}) ) ]
             = Σ_{i=0}^{n} E[ log ( P_i(dy_i; y^{i−1}, w_i, x_i) / P_i(dy_i; y^{i−1}) ) ]
             = Σ_{i=0}^{n} ∫ log ( P_i(dy_i; y^{i−1}, x_i, w_i) / P_i(dy_i; y^{i−1}) ) P_i(dy_i; y^{i−1}, x_i, w_i) P_i(dx_i, dw_i | y^{i−1}) P_{0,i−1}(dy^{i−1})
             = Σ_{i=0}^{n} I(W_i, X_i; Y_i | Y^{i−1}),

where P_i(dy_i; y^{i−1}) = ∫ P_i(dy_i; y^{i−1}, w_i, x_i) P_i(dx_i, dw_i | y^{i−1}). A converse coding theorem can be derived in which Σ_{i=0}^{n} I(W_i, X_i; Y_i | Y^{i−1}) is maximized over {P_i(dx_i, dw_i | y^{i−1}) : i = 0, 1, . . . , n}.

The next theorem helps clarify the implications of Y^{i−1} ↔ W^{i−1} ↔ W_i forming a Markov chain for i = 0, 1, . . . , n on various notions of information capacity for which operational meanings can be sought.

Theorem 4.2: The following statements are equivalent.
1) P_{0,n}(dy^n; w^n) = ⊗_{j=0}^{n} P_j(dy_j; y^{j−1}, w^j), a.s., n ∈ Z_+.
2) Y_j ↔ (W^j, Y^{j−1}) ↔ (W_{j+1}, W_{j+2}, . . . , W_n) forms a Markov chain for j = 0, 1, . . . , n, n ∈ Z_+.
3) I(W^n; Y^n) = I(W^n → Y^n), n ∈ Z_+.
4) I(W^n ← Y^n) = 0, n ∈ Z_+.
5) Y^j ↔ W^j ↔ W_{j+1} forms a Markov chain for j = 0, 1, . . . , n, n ∈ Z_+.

The next assumption is equivalent to any of the statements of Theorem 4.2.

Assumption 4.3: Y^{i−1} ↔ W^{i−1} ↔ W_i forms a Markov chain for i = 0, 1, . . . , n. Equivalently, P_i(dw_i; w^{i−1}, y^{i−1}) = P_i(dw_i; w^{i−1}), for almost all y^{i−1}, i = 0, 1, . . . , n.


Clearly, any channel of the form Y_i = g_i(W^i, Y^{i−1}) + f_i(W^{i−1}, Y^{i−1}) V_i, where {V_i} is any noise such that P(dw_i; w^{i−1}, v^{i−1}) = P(dw_i; w^{i−1}), for almost all v^{i−1}, satisfies Assumption 4.3. For example, a random process {W_i} defined via W_{i+1} = f(W^i, B_i), i ∈ Z^n_+, with W_0 a random variable, in which {B_i}, {V_i}, W_0 are mutually independent, satisfies Assumption 4.3.

Note that under Assumption 4.3, I(W^n; Y^n) = I(W^n → Y^n) ≜ Σ_{i=0}^{n} I(W^i; Y_i | Y^{i−1}).

Also note that for a given encoder strategy g ≜ {g_i(w_i, y^{i−1}) : i = 0, 1, . . . , n}, the channel output process Y^n = {Y^g_j : j = 0, 1, . . . , n} depends on the specific strategy.

Assumption 4.4: W^i ↔ (Y^{i−1}, X_i) ↔ Y_i is a Markov chain for i = 0, 1, . . . , n. Equivalently, P_i(dy_i; y^{i−1}, x^i, w^i) = P_i(dy_i; y^{i−1}, x_i), a.s., for i = 0, 1, . . . , n.

Note that any channel of the form Y_i = f_i(Y_{i−1}, X_i, V_i), with {X_i = e_i(X^{i−1}, W^i, Y^{i−1}) : i = 0, 1, . . . , n} ∈ E^{nm}_{ad}[0, n], for which V_i is independent of W^i, satisfies Assumption 4.4.

Next, the directed information I(W^n → Y^n) will be related to I(X^n → Y^n). Under Assumption 4.4, given an encoder strategy e ∈ E^{nm}_{ad}[0, n],

I(W^n → Y^n) = Σ_{i=0}^{n} I(W^i; Y_i | Y^{i−1})
             = Σ_{i=0}^{n} E[ log ( P_i(dy_i; y^{i−1}, w^i) / P_i(dy_i; y^{i−1}) ) ]
             = Σ_{i=0}^{n} E[ log ( P_i(dy_i; y^{i−1}, w^i, x^i) / P_i(dy_i; y^{i−1}) ) ]
             = Σ_{i=0}^{n} E[ log ( P_i(dy_i; y^{i−1}, x_i) / P_i(dy_i; y^{i−1}) ) ]
             = I(X^n → Y^n).
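The identity I(W^n → Y^n) = I(X^n → Y^n) can be verified by enumeration on a toy example; the binary Markov source, feedback encoder and unit-memory channel below (which satisfies Assumption 4.4, since its kernel depends only on (y_{i−1}, x_i)) are illustrative assumptions, not models from the paper.

```python
import itertools
import math
from collections import defaultdict

# Small enumeration check (illustrative kernels, binary alphabets) of the
# identity derived above: under Assumption 4.4 and a deterministic feedback
# encoder, I(W^n -> Y^n) = I(X^n -> Y^n).

n = 3
def p_source(w, w_prev): return 0.85 if w == w_prev else 0.15
def encoder(w_seq, y_seq): return w_seq[-1] ^ (y_seq[-1] if y_seq else 0)
def p_channel(y, y_prev, x):                       # depends only on (y_{i-1}, x_i)
    eps = 0.1 if y_prev in (None, 0) else 0.3
    return 1 - eps if y == x else eps

joint = defaultdict(float)                         # P(w^n, x^n, y^n)
for ws in itertools.product((0, 1), repeat=n):
    for ys in itertools.product((0, 1), repeat=n):
        xs, p = [], 1.0
        for i in range(n):
            xs.append(encoder(ws[: i + 1], ys[:i]))
            p *= p_source(ws[i], ws[i - 1] if i else 0)
            p *= p_channel(ys[i], ys[i - 1] if i else None, xs[i])
        joint[(ws, tuple(xs), ys)] += p

def H(f):
    """Entropy (bits) of the marginal of f(w^n, x^n, y^n)."""
    m = defaultdict(float)
    for k, p in joint.items():
        m[f(*k)] += p
    return -sum(p * math.log2(p) for p in m.values() if p > 0)

def directed_info(prefix):                         # sum_i I(prefix^i; Y_i | Y^{i-1})
    total = 0.0
    for i in range(n):
        A = lambda w, x, y, i=i: prefix(w, x)[: i + 1]
        total += (H(lambda w, x, y, i=i, A=A: (A(w, x, y), y[:i]))
                  + H(lambda w, x, y, i=i: y[: i + 1])
                  - H(lambda w, x, y, i=i, A=A: (A(w, x, y), y[: i + 1]))
                  - H(lambda w, x, y, i=i: y[:i]))
    return total

print(directed_info(lambda w, x: w), directed_info(lambda w, x: x))   # equal
```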

The last equality shows how one can define information capacity using the channel input and the channel kernel, independently of the source output.

Definition 4.5: Suppose Assumption 4.4 holds. The finite time capacity is defined by

C^2_{0,n} = sup_{ {P_i(dx_i; y^{i−1})}_{i=0}^{n} ∈ Q^{pc}_{ad}[0,n] } (1/(n+1)) I(X^n → Y^n),

where Q^{pc}_{ad}[0, n] is the power constraint set. Moreover,

C^2_∞ = liminf_{n→∞} sup_{ {P_i(dx_i; y^{i−1})}_{i=0}^{n} ∈ Q^{pc}_{ad}[0,n] } (1/(n+1)) I(X^n → Y^n).

A coding theorem related to Definition 4.5 is derived in [8].

Assumption 4.6: X^{i−1} ↔ (X_i, Y^{i−1}) ↔ Y_i is a Markov chain for i = 0, 1, . . . , n. Equivalently, P_i(dy_i; y^{i−1}, x^i) = P_i(dy_i; y^{i−1}, x_i), a.s., for i = 0, 1, . . . , n.

Suppose Assumption 4.6 holds. Then

I(X^i; Y_i | Y^{i−1}) = I(X_i; Y_i | Y^{i−1}), i = 0, 1, . . . , n,    (IV.5)

and hence

Σ_{i=0}^{n} I(X^i; Y_i | Y^{i−1}) = Σ_{i=0}^{n} I(X_i; Y_i | Y^{i−1}).    (IV.6)

Definition 4.7: Suppose Assumptions 4.4 and 4.6 hold. The finite time capacity is defined by

C^3_{0,n} = sup_{ {P_i(dx_i; y^{i−1})}_{i=0}^{n} ∈ Q^{pc}_{ad}[0,n] } (1/(n+1)) Σ_{i=0}^{n} I(X_i; Y_i | Y^{i−1}).

Moreover,

C^3_∞ = liminf_{n→∞} sup_{ {P_i(dx_i; y^{i−1})}_{i=0}^{n} ∈ Q^{pc}_{ad}[0,n] } (1/(n+1)) Σ_{i=0}^{n} I(X_i; Y_i | Y^{i−1}).

The next statements and definition of information capacity with memory are a natural generalization of those for discrete-time memoryless channels with feedback. It appears that the structure of the encoder, which gives an upper bound on the information capacity of memoryless channels with feedback, holds for channels with memory as well. Suppose Assumption 4.6 holds. Then

Σ_{i=0}^{n} I(X^i; Y_i | Y^{i−1}) = Σ_{i=0}^{n} I(X_i; Y_i | Y^{i−1})
                                  = Σ_{i=0}^{n} [ H(Y_i | Y^{i−1}) − H(Y_i | Y^{i−1}, X_i) ]
                                  = Σ_{i=0}^{n} [ H(X_i | Y^{i−1}) − H(X_i | Y^i) ]    (IV.7)
                                  ≤ Σ_{i=0}^{n} [ H(X_i) − H(X_i | Y^i) ]    (IV.8)
                                  = Σ_{i=0}^{n} I(X_i; Y^i).    (IV.9)

Note that the inequality in (IV.8) holds with equality if and only if X_i ⊥ Y^{i−1}, ∀i ∈ Z^n_+ (i.e., X_i is independent of Y^{i−1}, ∀i ∈ Z^n_+). An innovations encoder which employs least-squares estimation to estimate the source symbols has this property by construction.
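The role of the independence condition X_i ⊥ Y^{i−1} can be seen numerically on a toy example (binary alphabets, an assumed unit-memory channel, and two assumed input kernels, none of which come from the paper): an input kernel that depends on past outputs gives a strict inequality in (IV.8), while one with P_i(dx_i; y^{i−1}) = P_i(dx_i) makes the bound tight.

```python
import itertools
import math
from collections import defaultdict

# Toy numeric check of sum_i I(X_i; Y_i | Y^{i-1}) <= sum_i I(X_i; Y^i) in
# (IV.7)-(IV.9), with equality when the channel input distribution does not
# depend on past channel outputs.  All kernels are illustrative assumptions.

n = 3

def p_channel(y, y_prev, x):                       # unit-memory channel
    eps = 0.05 if y_prev in (None, 0) else 0.25
    return 1 - eps if y == x else eps

def joint_dist(p_input):                           # build P(x^n, y^n) from kernels
    joint = defaultdict(float)
    for xs in itertools.product((0, 1), repeat=n):
        for ys in itertools.product((0, 1), repeat=n):
            p = 1.0
            for i in range(n):
                p *= p_input(xs[i], ys[:i]) * p_channel(ys[i], ys[i - 1] if i else None, xs[i])
            joint[(xs, ys)] += p
    return joint

def cond_mi(joint, f_a, f_b, f_c):                 # I(A; B | C) for functions of (xs, ys)
    def H(*fs):
        marg = defaultdict(float)
        for (xs, ys), p in joint.items():
            marg[tuple(f(xs, ys) for f in fs)] += p
        return -sum(p * math.log2(p) for p in marg.values() if p > 0)
    return H(f_a, f_c) + H(f_b, f_c) - H(f_a, f_b, f_c) - H(f_c)

def both_sides(p_input):
    joint = joint_dist(p_input)
    lhs = sum(cond_mi(joint, lambda x, y, i=i: x[i], lambda x, y, i=i: y[i],
                      lambda x, y, i=i: y[:i]) for i in range(n))
    rhs = sum(cond_mi(joint, lambda x, y, i=i: x[i], lambda x, y, i=i: y[: i + 1],
                      lambda x, y: ()) for i in range(n))
    return lhs, rhs

# Input kernel that depends on the past channel outputs (strict inequality) ...
print(both_sides(lambda x, y_past: 0.8 if x == (y_past[-1] if y_past else 0) else 0.2))
# ... and one that is independent of them (the bound is tight).
print(both_sides(lambda x, y_past: 0.5))
```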


Theorem 4.8: Suppose Assumptions 4.4 and 4.6 hold. Define the restricted set of input distributions

Q^{pci}_{ad}[0, n] ≜ { P_i(dx_i; y^{i−1}) ∈ Q^{pc}_{ad}[0, n] : P_i(dx_i; y^{i−1}) = P_i(dx_i), a.s., i = 0, 1, . . . , n }.    (IV.10)

Any achievable rate R satisfies

R ≤ liminf_{n→∞} (1/n) log M_n ≤ liminf_{n→∞} sup_{ {P_i(dx_i; y^{i−1})}_{i=0}^{n−1} ∈ Q^{pci}_{ad}[0,n−1] } (1/n) Σ_{i=0}^{n−1} I(X_i; Y^i).    (IV.11)

Proof. Follows from the above discussion and Fano's inequality.

Finally, another definition is given which makes use of the previous upper bound on C^3_{0,n}.

Definition 4.9: Suppose Assumptions 4.4 and 4.6 hold. The finite time capacity is defined by

C^4_{0,n} = sup_{ {P_i(dx_i; y^{i−1})}_{i=0}^{n} ∈ Q^{pci}_{ad}[0,n] } (1/(n+1)) Σ_{i=0}^{n} I(X_i; Y_i | Y^{i−1}).

Moreover,

C^4_∞ = liminf_{n→∞} sup_{ {P_i(dx_i; y^{i−1})}_{i=0}^{n} ∈ Q^{pci}_{ad}[0,n] } (1/(n+1)) Σ_{i=0}^{n} I(X_i; Y_i | Y^{i−1}).

V. GENERALIZED POSTERIOR MATCHING SCHEMES FOR CHANNELS WITH MEMORY AND FEEDBACK

Here it is shown how to design an encoder so that the directed information including the encoder, I(W^n → Y^n), is precisely equal to the supremum in C^3_{0,n}. Suppose Assumptions 3.1 and 3.2 hold and, in addition, P_j(dy_j; y^{j−1}, x^j, w^j) = P_j(dy_j; y^{j−1}, x_j), a.s., j ∈ Z^n_+. Let {P*_i(dx_i; y^{i−1}) : i = 0, 1, . . . , n} ∈ Q^{pc}_{ad}[0, n] be the sequence of stochastic kernels which achieves the supremum of C^3_{0,n}, and let F*_{X_i|Y^{i−1}}(x_i) be its corresponding conditional distribution function. Consider an encoder of the form

x*_i = g*_i(w_i, y^{i−1}) = g^{*,s}_i(w_i, P_i(dw_i; y^{i−1})), i = 0, 1, . . . , n, in Q^{sep}_{ad}[0, n],    (V.12)

where P_i(dw_i; y^{i−1}) is a stochastic kernel, and denote by F_{W_i|Y^{i−1}}(w_i) its corresponding conditional distribution function. Define the posterior matching scheme

X*_i = g^{*,s}_i(W_i, F_{W_i|Y^{i−1}}(W_i)) = F^{*,−1}_{X_i|Y^{i−1}} ∘ F_{W_i|Y^{i−1}}(W_i), i = 0, 1, . . . , n.    (V.13)

This scheme corresponds to an encoder transmitting at each i ∈ Z^n_+ the symbol X*_i via the mapping g^{*,s}_i(·, ·). The following hold at each i ∈ Z^n_+.

1) For a fixed Y^{i−1} = y^{i−1}, F_{W_i|Y^{i−1}}(W_i) is a random variable uniformly distributed on the interval [0, 1). Hence, it is independent of y^{i−1}.

2) For a fixed Y^{i−1} = y^{i−1}, F^{*,−1}_{X_i|Y^{i−1}}(·) is the inverse of a distribution function applied to a uniformly distributed random variable. Hence, it transforms the uniform random variable U_i = F_{W_i|Y^{i−1}}(W_i) into a random variable X*_i having the finite time capacity achieving distribution F*_{X_i|Y^{i−1}}(x_i). That is, for a fixed Y^{i−1} = y^{i−1}, F^{*,−1}_{X_i|Y^{i−1}} ∘ F_{W_i|Y^{i−1}}(W_i) is a random variable distributed according to F*_{X_i|Y^{i−1}}.
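The two properties above are exactly the uniformization and inverse-CDF steps of (V.13). The following one-step sketch assumes, purely for illustration, a Gaussian capacity-achieving conditional distribution F*_{X_i|Y^{i−1}} and a grid-based posterior for W_i; neither modelling choice comes from the paper.

```python
import numpy as np
from scipy.stats import norm

# A minimal one-step sketch of the posterior matching map in (V.13),
#   X*_i = F*^{-1}_{X_i|Y^{i-1}} ( F_{W_i|Y^{i-1}}(W_i) ),
# under illustrative assumptions: the capacity-achieving conditional
# distribution is taken to be Gaussian N(mu, sigma^2), and the posterior of
# W_i given y^{i-1} is represented by weights on a finite grid.

grid = np.linspace(-3.0, 3.0, 601)            # support points for W_i
posterior = np.exp(-0.5 * (grid - 0.7) ** 2)  # assumed posterior P(W_i = w | y^{i-1})
posterior /= posterior.sum()

def F_W_given_past(w):
    """Posterior CDF F_{W_i|Y^{i-1}}(w) evaluated from the grid weights."""
    return posterior[grid <= w].sum()

def pms_encoder(w, mu=0.0, sigma=1.0):
    """X*_i = F*^{-1}_{X_i|Y^{i-1}}( F_{W_i|Y^{i-1}}(w) ) with a Gaussian target."""
    u = np.clip(F_W_given_past(w), 1e-12, 1 - 1e-12)   # uniformized source symbol
    return norm.ppf(u, loc=mu, scale=sigma)            # inverse of the target CDF

# Property check: pushing source samples drawn from the posterior through the
# encoder produces (approximately) samples with the target distribution.
rng = np.random.default_rng(2)
w_samples = rng.choice(grid, p=posterior, size=50_000)
x_samples = np.array([pms_encoder(w) for w in w_samples])
print("target mean/std: 0, 1   empirical (approx.):",
      x_samples.mean().round(3), x_samples.std().round(3))
```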

Moreover, the above PMS yields the following identities when substituted into I(W^n → Y^n) = Σ_{i=0}^{n} I(W^i; Y_i | Y^{i−1}):

I(W^n → Y^n) (∗1)= Σ_{i=0}^{n} I(W^i; Y_i | Y^{i−1})
             (∗1)= Σ_{i=0}^{n} E[ log ( P_i(dy_i; y^{i−1}, w^i) / P_i(dy_i; y^{i−1}) ) ]
             (∗2)= Σ_{i=0}^{n} E[ log ( P_i(dy_i; y^{i−1}, w^i, x^{i,∗}) / P_i(dy_i; y^{i−1}) ) ]
             (∗3)= Σ_{i=0}^{n} E[ log ( P_i(dy_i; y^{i−1}, w^i, x*_i) / P_i(dy_i; y^{i−1}) ) ]
             (∗4)= Σ_{i=0}^{n} E[ log ( P_i(dy_i; y^{i−1}, x*_i) / P_i(dy_i; y^{i−1}) ) ]
             (∗5)= Σ_{i=0}^{n} ∫ log ( P_i(dy_i; y^{i−1}, x*_i) / P_i(dy_i; y^{i−1}) ) P_i(dy_i; y^{i−1}, x*_i) P_i(dx*_i; y^{i−1}) P_{0,i−1}(dy^{i−1})
             (∗6)= C^3_{0,n},    (V.14)

where (∗1) holds by definition, (∗2) holds because, knowing the strategy g ∈ E^{sep}_{ad}[0, n] and (w^i, y^{i−1}), P_i(dy_i; y^{i−1}, w^i) = P_i(dy_i; y^{i−1}, w^i, x^i), a.s., (∗3) and (∗4) hold due to the assumptions at the start of the section, (∗5) follows by definition, and (∗6) is obtained because X*_i is distributed according to F*_{X_i|Y^{i−1}}(x_i), the capacity achieving distribution.

It is important to note that the above PMS can be specialized to

X*_i = g^{*,s}_i(W_i, F_{W_i|Y^{i−1}}(W_i)) = F^{*,−1}_{X_i} ∘ F_{W_i|Y^{i−1}}(W_i), i = 0, 1, . . . , n,    (V.15)

in which case {F*_{X_i}(x_i)}_{i=0}^{n} is the distribution of the capacity achieving kernel {P_i(dx_i; y^{i−1}) = P_i(dx_i)}_{i=0}^{n} ∈ Q^{pci}_{ad}[0, n] of C^4_{0,n}. For memoryless channels with feedback one has the identity {F*_{X_i|Y^{i−1}}(x_i) = F*_{X_i}(x_i)}_{i=0}^{n}.

The point to be made here is that C^4_{0,n} is the definition of capacity of channels with memory and feedback that is closest to that of a DMC with feedback.

VI. CONCLUSION AND FUTURE WORK

The future work should concentrate on the following items.
1) Derive the dynamic programming recursion for maximizing directed information I(W^n → Y^n) using separated encoding strategies (a toy sketch of what such a recursion might look like is given after this list);
2) Show the direct part of the coding theorem;
3) Generalize the conditions to the case when the channel depends on another random process which can be used as side information either at the decoder or the encoder;
4) Introduce specific examples.
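The dynamic programming recursion of item 1 is left open in the paper. Purely as an illustration of what such a recursion might look like, the toy sketch below assumes a binary first-order Markov source, a memoryless binary symmetric channel, a small family of separated deterministic encoder maps selected per stage as a function of the information state π_j = Prob(W_j = 1 | y^{j−1}), and a discretized information state; the value function is computed by a backward Bellman recursion with per-stage reward I(W_j; Y_j | y^{j−1}). None of these modelling choices come from the paper.

```python
import numpy as np

# Toy backward dynamic programming sketch for item 1 (NOT the paper's
# recursion; all modelling choices are illustrative assumptions): binary
# Markov source, memoryless BSC with crossover eps, and per-stage choice of a
# deterministic encoder map g: {0,1} -> {0,1} as a function of the discretized
# information state pi = P(W_j = 1 | y^{j-1}).

A = np.array([[0.9, 0.1],
              [0.2, 0.8]])                       # source kernel P(w_{j+1} | w_j)
eps = 0.1                                        # BSC crossover probability
n = 5                                            # horizon
grid = np.linspace(0.0, 1.0, 201)                # discretized information state
encoders = [{0: 0, 1: 1}, {0: 1, 1: 0}, {0: 0, 1: 0}]   # a few candidate maps g

def p_y_given_w(y, w, g):
    x = g[w]
    return 1 - eps if y == x else eps

def step(pi, g, y):
    """Per-stage reward contribution and next information state for output y."""
    pw = np.array([1 - pi, pi])                          # P(W_j = . | y^{j-1})
    py = sum(pw[w] * p_y_given_w(y, w, g) for w in (0, 1))
    post = np.array([pw[w] * p_y_given_w(y, w, g) for w in (0, 1)]) / py
    pi_next = float(post @ A[:, 1])                      # predict P(W_{j+1}=1 | y^j)
    reward = sum(pw[w] * p_y_given_w(y, w, g) *
                 np.log2(p_y_given_w(y, w, g) / py) for w in (0, 1))
    return reward, py, pi_next

V = np.zeros_like(grid)                                  # terminal value V_n(pi) = 0
for j in reversed(range(n)):                             # backward recursion
    V_new = np.empty_like(grid)
    for k, pi in enumerate(grid):
        best = -np.inf
        for g in encoders:
            total = 0.0
            for y in (0, 1):
                r, py, pi_next = step(pi, g, y)
                total += r + py * np.interp(pi_next, grid, V)
            best = max(best, total)
        V_new[k] = best
    V = V_new

pi0 = 0.5                                                # assumed initial state
print("DP value (directed information over", n, "stages):", np.interp(pi0, grid, V))
```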

REFERENCES

[1] J. Massey, Causality, Feedback and Directed Information, in Proceedings of the 1990 IEEE International Symposium on Information Theory and its Applications, pp. 303–305, Hawaii, U.S.A., Nov. 27–30, 1990.

[2] G. Kramer, Directed Information for Channels with Feedback, Ph.D. Thesis, Swiss Federal Institute of Technology, Diss. ETH No. 12656, 1998.

[3] H. Marko, The Bidirectional Communication Theory: A Generalization of Information Theory, IEEE Transactions on Communications, vol. COM-21, no. 12, pp. 1345–1351, 1973.

[4] C. E. Shannon, The Zero Error Capacity of a Noisy Channel, IRE Transactions on Information Theory, vol. 2, no. 3, pp. 112–124, 1956.

[5] R. L. Dobrushin, Information Transmission in a Channel with Feedback, Theory of Probability and its Applications, vol. 3, no. 4, pp. 367–383, 1958.

[6] P. Ebert, The Capacity of the Gaussian Channel with Feedback, Bell System Technical Journal, vol. 49, pp. 1705–1712, 1970.

[7] T. M. Cover and S. Pombra, Gaussian Feedback Capacity, IEEE Transactions on Information Theory, vol. 35, no. 1, pp. 37–43, 1989.

[8] S. Tatikonda, Control Over Communication Constraints, Ph.D. Dissertation, M.I.T., Cambridge, MA, 2000.

[9] S. Verdu and T. S. Han, A General Formula for Channel Capacity, IEEE Transactions on Information Theory, vol. 40, no. 6, pp. 1147–1157, 1994.

[10] J. Chen and T. Berger, The Capacity of Finite-State Markov Channels with Feedback, IEEE Transactions on Information Theory, vol. 51, no. 3, pp. 780–798, 2005.

[11] O. Shayevitz and M. Feder, Communication with Feedback via Posterior Matching, in Proceedings of ISIT, pp. 391–395, Nice, France, 2007.

[12] O. Shayevitz and M. Feder, The Posterior Matching Feedback Scheme: Capacity Achieving and Error Analysis, in Proceedings of ISIT, Toronto, Canada, 2008.

[13] O. Shayevitz and M. Feder, Achieving the Empirical Capacity Using Feedback: Memoryless Additive Models, IEEE Transactions on Information Theory, vol. 55, no. 3, pp. 1269–1295, 2009.

[14] M. Horstein, Sequential Transmission Using Noiseless Feedback, IEEE Transactions on Information Theory, vol. 9, no. 3, pp. 136–143, 1963.

[15] S. K. Gorantla and T. P. Coleman, On Reversible Markov Chains and Maximization of Directed Information, in Proceedings of ISIT, pp. 216–220, Austin, Texas, U.S.A., 2010.

[16] S. Tatikonda and S. Mitter, The Capacity of Channels with Feedback, IEEE Transactions on Information Theory, vol. 55, no. 1, pp. 323–349, 2009.

[17] C. D. Charalambous and R. J. Elliott, Certain Classes of Nonlinear Partially Observable Stochastic Optimal Control Problems with Explicit Optimal Control Laws Equivalent to LEQG/LQG Problems, IEEE Transactions on Automatic Control, vol. 42, no. 4, pp. 482–497, 1997.

[18] C. D. Charalambous and F. Rezaei, Stochastic Uncertain Systems Subject to Relative Entropy Constraints: Induced Norms and Monotonicity Properties of Minimax Games, IEEE Transactions on Automatic Control, vol. 52, no. 4, pp. 647–663, 2007.
