Sign-Regressor Adaptive Filtering Algorithms Using Averaged Iterates and Observations
arXiv:1212.5185v1 [math.OC] 20 Dec 2012
Sign-Error Adaptive Filtering Algorithms for Markovian
Parameters∗
Araz Hashemi,† G. Yin,‡ Le Yi Wang§
December 20, 2012
Abstract
Motivated by reduction of computational complexity, this work develops sign-error adaptive filtering algorithms for estimating time-varying system parameters. Different from the previous work on sign-error algorithms, the parameters are time-varying and their dynamics are modeled by a discrete-time Markov chain. A distinctive feature of the algorithms is the multi-time-scale framework for characterizing parameter variations and algorithm updating speeds. This is realized by considering the stepsize of the estimation algorithms and a scaling parameter that defines the transition rates of the Markov jump process. Depending on the relative time scales of these two processes, suitably scaled sequences of the estimates are shown to converge to either an ordinary differential equation, or a set of ordinary differential equations modulated by random switching, or a stochastic differential equation, or stochastic differential equations with random switching. Using weak convergence methods, convergence and rates of convergence of the algorithms are obtained for all these cases.
Key Words. Sign-error algorithms, regime-switching models, stochastic approxima-tion, mean squares errors, convergence, tracking properties.
EDICS. ASP-ANAL
Brief Title. Sign-Error Algorithms for Markovian Parameters
∗This research was supported in part by the Army Research Office under grant W911NF-12-1-0223.
†Department of Mathematics, Wayne State University, Detroit, MI 48202, [email protected].
‡Department of Mathematics, Wayne State University, Detroit, MI 48202, [email protected].
§Department of Electrical and Computer Engineering, Wayne State University, Detroit, MI 48202, ly-
1 Introduction
Adaptive filtering algorithms have been studied extensively, thanks to their simple recur-
sive forms and wide applicability for diversified practical problems arising in estimation,
identification, adaptive control, and signal processing [26].
Recent rapid advancement in science and technology has introduced many emerging ap-
plications in which adaptive filtering is of substantial utility, including consensus controls,
networked systems, and wireless communications; see [1, 2, 4, 5, 8, 7, 12, 13, 14, 16, 17, 18,
19, 20, 23, 24, 27]. One typical scenario of such new domains of applications is that the
underlying systems are inherently time varying and their parameter variations are stochastic
[29, 30, 31]. One important class of such stochastic systems involves systems whose randomly
time-varying parameters can be described by Markov chains. For example, networked sys-
tems include communication channels as part of the system topology. Channel connections,
interruptions, data transmission queuing and routing, packet delays and losses, are always
random. Markov chain models become a natural choice for such systems. For control strategy adaptation and performance optimization, it is essential to capture time-varying system parameters during their operations, which leads to the problem of identifying Markovian regime-switching systems pursued in this paper.
When data acquisition, signal processing, and algorithm implementation are subject to resource limitations, it is highly desirable to reduce data complexity. This is especially important when data shuffling involves communication networks. This understanding has motivated the main theme of this paper: using sign-error updating schemes, which carry much reduced data complexity, in adaptive filtering algorithms, without detrimental effects on parameter estimation accuracy and convergence rates.
In our recent work, we developed a sign-regressor algorithm for adaptive filters [28]. The
current paper further develops sign-error adaptive filtering algorithms. It is well-known that
sign algorithms have the advantage of reduced computational complexity. The sign operator
reduces the implementation of the algorithms to bits in data communications and simple bit
shifts in multiplications. As such, sign algorithms are highly appealing for practical appli-
cations. The work [11] introduced sign algorithms and has inspired much of the subsequent
developments in the field. On the other hand, employing sign operators in adaptive algo-
rithms has introduced substantial challenges in establishing convergence properties and error
bounds.
A distinctive feature of the algorithms introduced in this paper is the multi-time-scale
framework for characterizing parameter variations and algorithm updating speeds. This is
realized by considering the stepsize of the estimation algorithms and a scaling parameter
that defines the transition rates of the Markov jump process. Depending on the relative
time scales of these two processes, suitably scaled sequences of the estimates are shown to
converge to either an ordinary differential equation, or a set of ordinary differential equations
modulated by random switching, or a stochastic differential equation, or stochastic differen-
tial equations with random switching. Using weak convergence methods, convergence and
rates of convergence of the algorithms are obtained for all these cases.
The rest of the paper is arranged as follows. Section 2 formulates the problems and
introduces the two-time-scale framework. The main algorithms are presented in Section 3.
Mean-squares errors on parameter estimators are derived. By taking appropriate continuous-
time interpolations, Section 4 establishes convergence properties of interpolated sequences of
estimates from the adaptive filtering algorithms. Our analysis is based on weak convergence
methods. The convergence properties are obtained by using martingale averaging techniques.
Section 5 further investigates the rates of convergence. Suitably interpolated sequences are
shown to converge to either stochastic differential equations or randomly-switched stochastic
differential equations, depending on relations between the two time scales. Numerical results
by simulation are presented to demonstrate the performance of our algorithms in Section 6.
2 Problem Formulation
Let
y_n = ϕ′_n α_n + e_n,  n = 0, 1, …,  (1)
where ϕ_n ∈ R^r is the sequence of regression vectors, e_n ∈ R is a sequence of zero-mean random variables representing the error or noise, α_n ∈ R^r is the time-varying true parameter process, and y_n ∈ R is the sequence of observation signals at time n.
Estimates of α_n are denoted by θ_n and are given by the following adaptive filtering algorithm, which uses a sign operator on the prediction error:

θ_{n+1} = θ_n + µ ϕ_n sgn(y_n − ϕ′_n θ_n),  (2)
where sgn(y) is defined as sgn(y) := 1_{y>0} − 1_{y<0} for y ∈ R. We impose the following assumptions.
(A1) α_n is a discrete-time homogeneous Markov chain with state space

M = {a_1, …, a_{m_0}}, a_i ∈ R^r, i = 1, …, m_0,  (3)

and whose transition probability matrix is given by

P^ε = I + εQ,  (4)

where ε > 0 is a small parameter, I is the m_0 × m_0 identity matrix, and Q = (q_{ij}) ∈ R^{m_0×m_0} is an irreducible generator (i.e., Q satisfies q_{ij} ≥ 0 for i ≠ j and Σ_{j=1}^{m_0} q_{ij} = 0 for each i = 1, …, m_0) of a continuous-time Markov chain. For simplicity, assume that the initial distribution of the Markov chain α_n is given by P(α_0 = a_i) = p_{0,i}, which is independent of ε for each i = 1, …, m_0, where p_{0,i} ≥ 0 and Σ_{i=1}^{m_0} p_{0,i} = 1.
(A2) The sequence of signals {(ϕ_n, e_n)} is uniformly bounded, stationary, and independent of the parameter process {α_n}. Let F_n be the σ-algebra generated by {(ϕ_j, e_j), α_j : j < n; α_n}, and denote the conditional expectation with respect to F_n by E_n.
(A3) For each i = 1, …, m_0, define

g_n := ϕ_n sgn(ϕ′_n[α_n − θ_n] + e_n),
g_n(θ, i) := ϕ_n sgn(ϕ′_n[a_i − θ] + e_n) I_{α_n=a_i},
ḡ_n(θ, i) := E_n g_n(θ, i).  (5)

For each n and i, there is an A_n^{(i)} ∈ R^{r×r} such that, given α_n = a_i,

ḡ_n(θ, i) = A_n^{(i)}(a_i − θ) I_{α_n=a_i} + o(|a_i − θ| I_{α_n=a_i}),  EA_n^{(i)} = A^{(i)}.  (6)
(A4) There is a sequence of non-negative real numbers {φ(k)} with Σ_k φ^{1/2}(k) < ∞ such that for each n, each j > n, and some K > 0,

|E_n A_j^{(i)} − A^{(i)}| ≤ K φ^{1/2}(j − n)  (7)

uniformly in i = 1, …, m_0.
Remark 2.1 Let us take a moment to justify the practicality of the assumptions. The boundedness assumption in (A2) is fairly mild. For example, we may use a truncated Gaussian process. In addition, it is possible to accommodate unbounded signals by treating martingale difference sequences (which make the proofs slightly simpler).
In (A3), we consider that while g_n(θ, i) is not smooth w.r.t. θ, its conditional expectation ḡ_n(θ, i) can be a smooth function of θ. The condition (6) indicates that ḡ_n(θ, i) is locally (near a_i) linearizable. For example, this is satisfied if the conditional joint density of (ϕ_n, e_n) with respect to {ϕ_j, e_j, j < n, ϕ_n} is differentiable with bounded derivatives; see [6] for more discussion. Finally, (A4) is essentially a mixing condition which indicates that the remote past and distant future are asymptotically independent. Hence we may work with correlated signals as long as the correlation decays sufficiently quickly between iterates.
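To fix ideas, here is a minimal scalar (r = 1) sketch of the observation model (1) and the sign-error update (2), with the parameter driven by a two-state chain under P^ε = I + εQ. All numerical values (state space, step size µ, transition rate, noise level) are illustrative assumptions, not quantities from the paper.

```python
import random

random.seed(0)

def sgn(y):
    # sgn(y) = 1_{y>0} - 1_{y<0}
    return (y > 0) - (y < 0)

M = [-1.0, 1.0]          # assumed state space of the Markov parameter
eps, mu = 0.002, 0.05    # assumed transition-rate scale and stepsize
q_leave = 0.5            # assumed |q_ii| for the two-state generator

state = 0
theta = 0.0
for n in range(4000):
    alpha = M[state]
    phi = random.gauss(0.0, 1.0)   # regressor phi_n
    e = random.gauss(0.0, 0.5)     # observation noise e_n
    y = phi * alpha + e            # observation model (1)
    theta += mu * phi * sgn(y - phi * theta)   # sign-error update (2)
    if random.random() < eps * q_leave:        # jump under P_eps = I + eps*Q
        state = 1 - state
```

Note that each update only needs the sign of the prediction error, which is the source of the reduced data complexity discussed in the introduction.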
3 Mean Squares Error Bounds
Denote the sequence of estimation errors by θ̃_n := α_n − θ_n. We proceed to obtain bounds for the mean squares error in terms of the transition rate of the parameter ε and the adaptation rate of the algorithm µ.

Theorem 3.1 Assume (A1)–(A4). Then there is an N_ε > 0 such that for all n ≥ N_ε,

E|θ̃_n|² = E|α_n − θ_n|² = O(µ + ε + ε²/µ).  (8)
Proof. Define a function by V(x) = (x′x)/2. Observe that

θ̃_{n+1} = α_{n+1} − θ_{n+1} = θ̃_n − µ ϕ_n sgn(ϕ′_n θ̃_n + e_n) + (α_{n+1} − α_n),  (9)

so

E_nV(θ̃_{n+1}) − V(θ̃_n) = E_n θ̃′_n[(α_{n+1} − α_n) − µ ϕ_n sgn(ϕ′_n θ̃_n + e_n)]
+ E_n|(α_{n+1} − α_n) − µ ϕ_n sgn(ϕ′_n θ̃_n + e_n)|².  (10)
By (A2), the Markov chain α_n is independent of (ϕ_n, e_n) and I_{α_n=a_i} is F_n-measurable. Since the transition matrix is of the form P^ε = I + εQ, we obtain

E_n(α_{n+1} − α_n) = Σ_{i=1}^{m_0} E(α_{n+1} − a_i | α_n = a_i) I_{α_n=a_i}
= Σ_{i=1}^{m_0} [Σ_{j=1}^{m_0} a_j(δ_{ij} + εq_{ij}) − a_i] I_{α_n=a_i} = O(ε).  (11)
Similarly,

E_n|α_{n+1} − α_n|² = Σ_{j=1}^{m_0} Σ_{i=1}^{m_0} |a_j − a_i|² I_{α_n=a_i} P(α_{n+1} = a_j | α_n = a_i)
= Σ_{j=1}^{m_0} Σ_{i=1}^{m_0} |a_j − a_i|² I_{α_n=a_i}(δ_{ij} + εq_{ij}) = O(ε).  (12)
Note that |θ̃_n| = |θ̃_n| · 1 ≤ (|θ̃_n|² + 1)/2, so

O(ε)|θ̃_n| ≤ O(ε)(V(θ̃_n) + 1).  (13)
Since the signals {(ϕ_n, e_n)} are bounded, we have

E_n|(α_{n+1} − α_n) − µ ϕ_n sgn(ϕ′_n θ̃_n + e_n)|² = E_n|α_{n+1} − α_n|² + O(µ² + µε)[V(θ̃_n) + 1].  (14)

Applying (14) to (10), we arrive at

E_nV(θ̃_{n+1}) − V(θ̃_n) = −µ E_n θ̃′_n ϕ_n sgn(ϕ′_n θ̃_n + e_n) + E_n θ̃′_n(α_{n+1} − α_n)
+ E_n|α_{n+1} − α_n|² + O(µ² + µε)[V(θ̃_n) + 1].  (15)
Note also that by (A3),

µ E_n θ̃′_n ϕ_n sgn(ϕ′_n θ̃_n + e_n) = µ Σ_{i=1}^{m_0} E_n θ̃′_n ϕ_n sgn(ϕ′_n θ̃_n + e_n) I_{α_n=a_i}
= µ Σ_{i=1}^{m_0} E_n θ̃′_n A_n^{(i)} θ̃_n I_{α_n=a_i} + µ o(θ̃_n)
= µ Σ_{i=1}^{m_0} θ̃′_n[A_n^{(i)} − A^{(i)}]θ̃_n I_{α_n=a_i} + µ Σ_{i=1}^{m_0} θ̃′_n A^{(i)} θ̃_n I_{α_n=a_i} + µ o(θ̃_n).  (16)
To treat the first three terms in (15), we define the following perturbed Liapunov functions:

V_1^µ(θ, n) := −µ Σ_{j=n}^∞ Σ_{i=1}^{m_0} E_n θ′[A_j^{(i)} − A^{(i)}]θ I_{α_j=a_i},
V_2^µ(θ, n) := Σ_{j=n}^∞ θ′ E_n(α_{j+1} − α_j),
V_3^µ(n) := Σ_{j=n}^∞ E_n(α_{n+1} − α_n)′(α_{j+1} − α_j).  (17)
By virtue of (A4), we have

|V_1^µ(θ, n)| ≤ µ Σ_{i=1}^{m_0} K|θ|² Σ_{j=n}^∞ φ^{1/2}(j − n) ≤ O(µ)[V(θ) + 1].  (18)
Note also that the irreducibility of Q implies that of I + εQ for sufficiently small ε > 0. Thus there is an N_ε such that for all k ≥ N_ε, |(I + εQ)^k − 𝟙ν_ε| ≤ λ_c^k for some 0 < λ_c < 1, where ν_ε denotes the stationary distribution associated with the transition matrix I + εQ and 𝟙 = (1, …, 1)′. Note that the difference of the (j + 1 − n)- and (j − n)-step transition matrices is given by

(I + εQ)^{j+1−n} − (I + εQ)^{j−n} = [(I + εQ) − I](I + εQ)^{j−n}
= [(I + εQ) − I][(I + εQ)^{j−n} − 𝟙ν_ε] + [(I + εQ) − I]𝟙ν_ε
= (εQ)[(I + εQ)^{j−n} − 𝟙ν_ε].

The last line above follows from the fact Q𝟙 = 0, hence [(I + εQ) − I]𝟙ν_ε = 0. Thus

Σ_{j=n}^∞ |(I + εQ)^{j+1−n} − (I + εQ)^{j−n}| ≤ O(ε) Σ_{j=n}^∞ λ_c^{j−n} = O(ε).  (19)
The foregoing estimates lead to Σ_{j=n}^∞ E_n(α_{j+1} − α_j) = O(ε), and as a result

|V_2^µ(θ, n)| ≤ O(ε)(V(θ) + 1),  (20)

and similarly

|V_3^µ(n)| = O(ε),  (21)

so all the perturbations can be made small.
Now, we note that

E_nV_1^µ(θ̃_{n+1}, n+1) − V_1^µ(θ̃_n, n)
= E_nV_1^µ(θ̃_{n+1}, n+1) − E_nV_1^µ(θ̃_n, n+1) + E_nV_1^µ(θ̃_n, n+1) − V_1^µ(θ̃_n, n),  (22)

where

E_nV_1^µ(θ̃_n, n+1) − V_1^µ(θ̃_n, n) = µ Σ_{i=1}^{m_0} θ̃′_n[A_n^{(i)} − A^{(i)}]θ̃_n I_{α_n=a_i},  (23)
and

E_nV_1^µ(θ̃_{n+1}, n+1) − E_nV_1^µ(θ̃_n, n+1)
= µ Σ_{j=n+1}^∞ Σ_{i=1}^{m_0} E_n(θ̃_{n+1} − θ̃_n)′[A_j^{(i)} − A^{(i)}]θ̃_{n+1} I_{α_n=a_i}
+ µ Σ_{j=n+1}^∞ Σ_{i=1}^{m_0} E_n θ̃′_n[A_j^{(i)} − A^{(i)}](θ̃_{n+1} − θ̃_n) I_{α_n=a_i}.  (24)
Using (11), we have

E_n|θ̃_{n+1} − θ̃_n| ≤ E_n|α_{n+1} − α_n| + µ E_n|ϕ_n sgn(ϕ′_n θ̃_n + e_n)| = O(ε + µ).  (25)
Thus, in view of (A4),

|µ Σ_{j=n+1}^∞ Σ_{i=1}^{m_0} E_n θ̃′_n E_{n+1}[A_j^{(i)} − A^{(i)}](θ̃_{n+1} − θ̃_n) I_{α_n=a_i}| ≤ O(µ² + µε)[V(θ̃_n) + 1],  (26)

and

|µ Σ_{j=n+1}^∞ Σ_{i=1}^{m_0} E_n(θ̃_{n+1} − θ̃_n)′E_{n+1}[A_j^{(i)} − A^{(i)}]θ̃_{n+1} I_{α_n=a_i}| ≤ O(µ² + µε)[V(θ̃_n) + 1].  (27)
Putting together (22)–(27), we establish that

E_nV_1^µ(θ̃_{n+1}, n+1) − V_1^µ(θ̃_n, n) = µ Σ_{i=1}^{m_0} E_n θ̃′_n[A_n^{(i)} − A^{(i)}]θ̃_n I_{α_n=a_i} + O(µ² + µε)[V(θ̃_n) + 1].  (28)
Likewise, we can obtain

E_nV_2^µ(θ̃_{n+1}, n+1) − V_2^µ(θ̃_n, n) = −E_n θ̃′_n(α_{n+1} − α_n) + O(ε² + µ²),  (29)

and

E_nV_3^µ(n+1) − V_3^µ(n) = −E_n|α_{n+1} − α_n|² + O(ε²).  (30)
Now we define

W(θ, n) = V(θ) + V_1^µ(θ, n) + V_2^µ(θ, n) + V_3^µ(n).

Since each A^{(i)} is a stable matrix, there is a λ > 0 such that θ′A^{(i)}θ ≥ λV(θ) for each i. Thus we may take λ such that −µ Σ_{i=1}^{m_0} θ′A^{(i)}θ I_{α_n=a_i} − µ o(θ) ≤ −λµV(θ). Using this along with (10), (16), (28)–(30), and the inequality O(µε) = O(µ² + ε²), we arrive at

E_nW(θ̃_{n+1}, n+1) − W(θ̃_n, n)
= −µ Σ_{i=1}^{m_0} θ̃′_nA^{(i)}θ̃_n I_{α_n=a_i} − µ o(θ̃_n) + O(µ² + ε²)[V(θ̃_n) + 1]
≤ −λµV(θ̃_n) + O(µ² + ε²)[V(θ̃_n) + 1]
≤ −λµW(θ̃_n, n) + O(µ² + ε²)[W(θ̃_n, n) + 1].  (31)
Choose µ and ε small enough so that there is a λ_0 > 0 satisfying λ_0 ≤ λ and

−λµ + O(µ²) + O(ε²) ≤ −λ_0µ.

Then we obtain

E_nW(θ̃_{n+1}, n+1) ≤ (1 − λ_0µ)W(θ̃_n, n) + O(µ² + ε²).

Note that there is an N_ε > 0 such that (1 − λ_0µ)^n ≤ O(µ) for n ≥ N_ε. Taking expectations in the iteration for W(θ̃_n, n) and iterating on the resulting inequality yield

EW(θ̃_{n+1}, n+1) ≤ (1 − λ_0µ)^n W(θ̃_0, 0) + O(µ + ε²/µ).

Thus

EW(θ̃_{n+1}, n+1) ≤ O(µ + ε²/µ).

Finally, applying (18)–(21) again, we also obtain

EV(θ̃_{n+1}) ≤ O(µ + ε + ε²/µ).

Thus the desired result follows. □
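The bound (8) can be illustrated numerically. The following Monte Carlo sketch measures the empirical tail MSE of the scalar sign-error iteration: with a frozen parameter (ε = 0) only the O(µ) term survives and the error is small, while with ε = µ the error stays bounded. All constants (states, rates, noise level) are illustrative assumptions.

```python
import random

def tail_mse(mu, eps, n_iter=4000, seed=1):
    # Empirical E|alpha_n - theta_n|^2 over the second half of the run for
    # the scalar sign-error algorithm; all numerical values are illustrative.
    rng = random.Random(seed)
    states = [-1.0, 1.0]
    state, theta = 1, 0.0
    acc = cnt = 0
    for n in range(n_iter):
        alpha = states[state]
        phi = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 0.5)
        resid = phi * (alpha - theta) + e
        theta += mu * phi * ((resid > 0) - (resid < 0))   # update (2)
        if n >= n_iter // 2:
            acc += (alpha - theta) ** 2
            cnt += 1
        if eps > 0 and rng.random() < eps * 0.5:  # leave state at rate eps|q_ii|
            state = 1 - state
    return acc / cnt

frozen = tail_mse(mu=0.02, eps=0.0)       # only the O(mu) term remains
switching = tail_mse(mu=0.02, eps=0.02)   # eps = mu, as in Theorem 3.1
```

The frozen-parameter run isolates the O(µ) stepsize noise; the switching run adds the contribution of parameter jumps.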
4 Convergence Properties
4.1 Switching ODE Limit: µ = O(ε)
We assume the adaptation rate and the transition frequency are of the same order, that is
µ = O(ε). For simplicity, we take µ = ε. To study the asymptotic properties of the sequence
{θn}, we take a continuous-time interpolation of the process. Define
θ^µ(t) = θ_n, α^µ(t) = α_n, for t ∈ [nµ, nµ + µ).
We proceed to prove that θµ(·) converges weakly to a system of randomly switching ordinary
differential equations.
Theorem 4.1 Assume (A1)–(A4) hold and ε = µ. Then the process (θ^µ(·), α^µ(·)) converges weakly to (θ(·), α(·)) such that α(·) is a continuous-time Markov chain generated by Q and the limit process θ(·) satisfies the Markov-switched ordinary differential equation

θ̇(t) = A(α(t))(α(t) − θ(t)), θ(0) = θ_0.  (32)
The theorem is established through a series of lemmas. We begin by using a truncation device to bound the estimates. Define S_N = {θ ∈ R^r : |θ| ≤ N} to be the ball with radius N, and q^N(·) as a truncation function that is equal to 1 for θ ∈ S_N, equal to 0 for θ ∉ S_{N+1}, and sufficiently smooth in between. Then we modify algorithm (2) so that

θ^N_{n+1} := θ^N_n + µ ϕ_n sgn(y_n − ϕ′_n θ^N_n) q^N(θ^N_n),  n = 0, 1, …,  (33)

is now a bounded sequence of estimates. As before, define

θ^{N,µ}(t) := θ^N_n for t ∈ [µn, µn + µ).
We shall first show that the sequence {θN,µ(·), αµ(·)} is tight, and thus by Prohorov’s
theorem we may extract a convergent subsequence. We will then show the limit satisfies a
switched differential equation. Lastly, we let the truncation bound N grow and show the
untruncated sequence given by (2) is also weakly convergent.
Lemma 4.2 The sequence (θN,µ(·), αµ(·)) is tight in D([0,∞) : Rr ×M).
Proof of Lemma 4.2. Note that the sequence α^µ(·) is tight by virtue of [33, Theorem 4.3]. In addition, α^µ(·) converges weakly to a Markov chain generated by Q. To proceed, we examine the asymptotics of the sequence θ^{N,µ}(·). For any T < ∞ and any 0 ≤ t ≤ T, use E_t^µ to denote the conditional expectation with respect to the σ-algebra F_t^µ. We have that for any δ > 0 and t, s > 0 satisfying s ≤ δ,

E_t^µ|θ^{N,µ}(t+s) − θ^{N,µ}(t)|² ≤ E_t^µ |µ Σ_{k=t/µ}^{(t+s)/µ−1} ϕ_k sgn(y_k − ϕ′_k θ^N_k) q^N(θ^N_k)|²
≤ µ² E_t^µ Σ_{j=t/µ}^{(t+s)/µ−1} Σ_{k=t/µ}^{(t+s)/µ−1} ϕ′_j ϕ_k sgn(y_j − ϕ′_j θ^N_j) sgn(y_k − ϕ′_k θ^N_k) q^N(θ^N_j) q^N(θ^N_k)
≤ µ² Σ_{j=t/µ}^{(t+s)/µ−1} Σ_{k=t/µ}^{(t+s)/µ−1} E_t^µ|ϕ_j|² E_t^µ|ϕ_k|² ≤ O(s²) ≤ O(δ²).  (34)

Hence

lim_{δ→0} limsup_{µ→0} { sup_{0≤s≤δ} E[E_t^µ|θ^{N,µ}(t+s) − θ^{N,µ}(t)|²] } = 0.

Applying the tightness criterion of [15, p. 47], the tightness is proved. □
Since (θ^{N,µ}(·), α^µ(·)) is tight, it is sequentially compact. By virtue of Prohorov's theorem, we can extract a weakly convergent subsequence. Select such a subsequence and still denote it by (θ^{N,µ}(·), α^µ(·)) for notational simplicity. Denote the limit by (θ^N(·), α(·)). We proceed to characterize the limit process.
Lemma 4.3 The sequence (θ^{N,µ}(·), α^µ(·)) converges weakly to (θ^N(·), α(·)), which is a solution of the martingale problem with operator

L_1^N f(θ^N, a_i) := ∇f′(θ^N, a_i)A^{(i)}[a_i − θ^N]q^N(θ^N) + Σ_{j=1}^{m_0} q_{ij} f(θ^N, a_j),  (35)

where for each i ∈ M, f(·, i) ∈ C_0^1 (C¹ functions with compact support).
Proof. To derive the martingale limit, we need only show that for each C¹ function with compact support f(·, i), each bounded and continuous function h(·), each t, s > 0, each positive integer κ, and each t_i ≤ t for i ≤ κ,

Eh(θ^N(t_i), α(t_i) : i ≤ κ)[f(θ^N(t+s), α(t+s)) − f(θ^N(t), α(t)) − ∫_t^{t+s} L_1^N f(θ^N(τ), α(τ)) dτ] = 0.  (36)
To verify (36), we use the processes indexed by µ. As before, note that

θ^{N,µ}(t+s) − θ^{N,µ}(t) = Σ_{k=t/µ}^{(t+s)/µ−1} µ ϕ_k sgn(ϕ′_k[α_k − θ^N_k] + e_k) q^N(θ^N_k).  (37)

Subdivide the interval with the end points t/µ and (t+s)/µ − 1 by choosing m_µ such that m_µ → ∞ as µ → 0 but δ_µ = µm_µ → 0. By the smoothness of f(·, i), it is readily seen that as µ → 0,

Eh(θ^{N,µ}(t_i), α^µ(t_i) : i ≤ κ)[f(θ^{N,µ}(t+s), α^µ(t+s)) − f(θ^{N,µ}(t), α^µ(t))]
→ Eh(θ^N(t_i), α(t_i) : i ≤ κ)[f(θ^N(t+s), α(t+s)) − f(θ^N(t), α(t))].  (38)
Next, we insert a term to examine the changes in the parameter α and the estimate θ^N separately:

lim_{µ→0} Eh(θ^{N,µ}(t_i), α^µ(t_i) : i ≤ κ)[f(θ^{N,µ}(t+s), α^µ(t+s)) − f(θ^{N,µ}(t), α^µ(t))]
= lim_{µ→0} Eh(θ^{N,µ}(t_i), α^µ(t_i) : i ≤ κ) Σ_{lδ_µ=t}^{t+s} [f(θ^N_{lm_µ+m_µ}, α_{lm_µ+m_µ}) − f(θ^N_{lm_µ}, α_{lm_µ})]
= lim_{µ→0} Eh(θ^{N,µ}(t_i), α^µ(t_i) : i ≤ κ)[ Σ_{lδ_µ=t}^{t+s} [f(θ^N_{lm_µ+m_µ}, α_{lm_µ+m_µ}) − f(θ^N_{lm_µ+m_µ}, α_{lm_µ})]
+ Σ_{lδ_µ=t}^{t+s} [f(θ^N_{lm_µ+m_µ}, α_{lm_µ}) − f(θ^N_{lm_µ}, α_{lm_µ})] ].  (39)
First, we work with the last term in (39). By using a Taylor expansion on each interval indexed by l, we have

lim_{µ→0} Eh(θ^{N,µ}(t_i), α^µ(t_i) : i ≤ κ) Σ_{lδ_µ=t}^{t+s} [f(θ^N_{lm_µ+m_µ}, α_{lm_µ}) − f(θ^N_{lm_µ}, α_{lm_µ})]
= lim_{µ→0} Eh(θ^{N,µ}(t_i), α^µ(t_i) : i ≤ κ) Σ_{lδ_µ=t}^{t+s} [ δ_µ (1/m_µ) Σ_{k=lm_µ}^{lm_µ+m_µ−1} ∇f′(θ^N_{lm_µ}, α_{lm_µ}) ϕ_k sgn(ϕ′_k(α_k − θ^N_k) + e_k) q^N(θ^N_k)
+ Σ_{k=lm_µ}^{lm_µ+m_µ−1} [∇f′(θ^{N,+}_{lm_µ}, α_{lm_µ}) − ∇f′(θ^N_{lm_µ}, α_{lm_µ})](θ^N_{k+1} − θ^N_k) q^N(θ^N_k) ],  (40)

where θ^{N,+}_{lm_µ} is a point on the line segment joining θ^N_{lm_µ} and θ^N_{lm_µ+m_µ}. Since

|θ^N_{lm_µ+m_µ} − θ^N_{lm_µ}| = O(δ_µ)

and ∇f′(·, i) is smooth, the last term in (40) is o(1) in probability as µ → 0. To work with the first term, we insert the conditional expectation E_k and apply (6) to obtain
lim_{µ→0} Eh(θ^{N,µ}(t_i), α^µ(t_i) : i ≤ κ) Σ_{lδ_µ=t}^{t+s} δ_µ (1/m_µ) Σ_{k=lm_µ}^{lm_µ+m_µ−1} ∇f′(θ^N_{lm_µ}, α_{lm_µ}) E_k[ϕ_k sgn(ϕ′_k(α_k − θ^N_k) + e_k)] q^N(θ^N_k)
= lim_{µ→0} Eh(θ^{N,µ}(t_i), α^µ(t_i) : i ≤ κ) Σ_{lδ_µ=t}^{t+s} Σ_{j=1}^{m_0} δ_µ (1/m_µ) Σ_{k=lm_µ}^{lm_µ+m_µ−1} ∇f′(θ^N_{lm_µ}, α_{lm_µ}) [A_k^{(j)}(a_j − θ^N_k) + o(|α_k − θ^N_k|)] q^N(θ^N_k) I_{α_k=a_j}.  (41)
Then for small µ,

E (1/m_µ) Σ_{k=lm_µ}^{lm_µ+m_µ−1} ∇f′(θ^N_{lm_µ}, α_{lm_µ}) o(|α_k − θ^N_k|) ≤ K E|α_{lm_µ} − θ_{lm_µ}| = O(µ^{1/2}).  (42)
Letting µlm_µ → τ, then by (7),

lim_{µ→0} Eh(θ^{N,µ}(t_i), α^µ(t_i) : i ≤ κ) Σ_{lδ_µ=t}^{t+s} Σ_{j=1}^{m_0} (δ_µ/m_µ) Σ_{k=lm_µ}^{lm_µ+m_µ−1} ∇f′(θ^N_{lm_µ}, α_{lm_µ}) A_k^{(j)}(a_j − θ^N_k) q^N(θ^N_k) I_{α_k=a_j}
= lim_{µ→0} Eh(θ^{N,µ}(t_i), α^µ(t_i) : i ≤ κ) Σ_{lδ_µ=t}^{t+s} Σ_{j=1}^{m_0} (δ_µ/m_µ) Σ_{k=lm_µ}^{lm_µ+m_µ−1} ∇f′(θ^N_{lm_µ}, α_{lm_µ}) [A^{(j)}(a_j − θ^N_k) + [A_k^{(j)} − A^{(j)}](a_j − θ^N_k)] q^N(θ^N_k) I_{α_k=a_j}
= lim_{µ→0} Eh(θ^{N,µ}(t_i), α^µ(t_i) : i ≤ κ) Σ_{lδ_µ=t}^{t+s} Σ_{j=1}^{m_0} δ_µ ∇f′(θ^N_{lm_µ}, α_{lm_µ}) (1/m_µ) Σ_{k=lm_µ}^{lm_µ+m_µ−1} A^{(j)}(a_j − θ^N_k) q^N(θ^N_k) I_{α_k=a_j}
= Eh(θ^N(t_i), α(t_i) : i ≤ κ) ∫_t^{t+s} ∇f′(θ^N(τ), α(τ)) A(α(τ))[α(τ) − θ^N(τ)] q^N(θ^N(τ)) dτ.  (43)
Likewise, we can obtain

lim_{µ→0} Eh(θ^{N,µ}(t_i), α^µ(t_i) : i ≤ κ) Σ_{lδ_µ=t}^{t+s} [f(θ^N_{lm_µ+m_µ}, α_{lm_µ+m_µ}) − f(θ^N_{lm_µ+m_µ}, α_{lm_µ})]
= Eh(θ^N(t_i), α(t_i) : i ≤ κ)[∫_t^{t+s} Qf(θ^N(τ), α(τ)) dτ].  (44)
Combining (40)–(44) with (39), we have established (36) as desired, completing the proof of
the lemma. �
Completion of the Proof of Theorem 4.1. From Lemma 4.3, the truncated sequence θ^N(·) satisfies the switched ODE θ̇^N(t) = A(α(t))[α(t) − θ^N(t)]q^N(θ^N(t)), θ^N(0) = θ_0. Next, letting N → ∞, we show that the limit of the untruncated sequence θ(·) and the limit of θ^N(·) as N → ∞ are the same. The argument is similar to that of [17, pp. 249–250]; we explain the main steps below. Let P^0(·) and P^N(·) be the measures induced by θ(·) and θ^N(·), respectively. Since the martingale problem with operator L_1^N has a unique solution, the associated differential equation has a unique solution for each initial condition and P^0(·) is unique. For each T < ∞ and t ≤ T, P^0(·) agrees with P^N(·) on all Borel subsets of the set of paths in D[0,∞) with values in S_N. By using P^0(sup_{t≤T}|θ(t)| ≤ N) → 1 as N → ∞, and the weak convergence of θ^{N,µ}(·) to θ^N(·), we conclude that θ^µ(·) converges weakly to θ(·). Thus the proof of Theorem 4.1 is completed. □
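The limit dynamics (32) can be visualized by integrating the switched ODE directly. The sketch below uses hypothetical scalar gains A(a_i) and a fixed switching path; in the theorem these would come from the linearization in (A3) and a sample path of the chain generated by Q.

```python
# Euler integration of the switching ODE (32) in a scalar illustration.
# The gains A[.] and the switching path are assumed values for the sketch.
A = {-1.0: 0.8, 1.0: 1.2}                          # A(alpha) > 0 (stability)
switches = [(0.0, -1.0), (3.0, 1.0), (6.0, -1.0)]  # (time, new state)

def alpha_of(t):
    # piecewise-constant sample path alpha(t)
    a = switches[0][1]
    for (s, v) in switches:
        if t >= s:
            a = v
    return a

dt, T = 0.001, 9.0
theta, t = 0.0, 0.0
while t < T:
    a = alpha_of(t)
    theta += dt * A[a] * (a - theta)   # d/dt theta = A(alpha)(alpha - theta)
    t += dt
```

Between switching instants, θ(t) relaxes exponentially toward the current state α(t) at rate A(α(t)), which is the qualitative tracking behavior seen in the simulations of Section 6.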
Remark 4.4 The following calculation will be used for both the slow and fast Markov chain cases. The result is essentially one about two-time-scale Markov chains considered in [33]. Define a probability vector p^ε_n = (P(α_n = a_1), …, P(α_n = a_{m_0})) ∈ R^{1×m_0}. Note that p^ε_0 = (p_{0,1}, …, p_{0,m_0}) (independent of ε). Because the Markov chain is time homogeneous, (P^ε)^n is the n-step transition probability matrix with P^ε = I + εQ. Then, for some 0 < λ_1 < 1,

p^ε_n = p(εn) + O(ε + λ_1^n),  0 ≤ n ≤ O(1/ε),  (45)

where p(t) = (p_1(t), …, p_{m_0}(t)) is the probability vector of the continuous-time Markov chain with generator Q such that for all t ≥ 0

dp(t)/dt = p(t)Q,  p(0) = p_0,  (46)

and p_0 is the initial probability. In addition,

(P^ε)^{n−n_0} = Ξ(εn_0, εn) + O(ε + λ_1^{n−n_0}),  (47)

where, with t_0 = εn_0 and t = εn, Ξ(t_0, t) satisfies

dΞ(t_0, t)/dt = Ξ(t_0, t)Q,  Ξ(t_0, t_0) = I.  (48)

Define the continuous-time interpolation α^ε(t) of α_n as

α^ε(t) := α_n for t ∈ [nε, nε + ε).  (49)

Then α^ε(·) converges weakly to α(·), which is a continuous-time Markov chain generated by Q with state space M. The expectation Eα_n can be approximated by

Eα_n = α^*(εn) + O(ε + λ_1^n), for n ≤ O(1/ε),  where α^*(εn) := Σ_{j=1}^{m_0} a_j p_j(εn).
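Relation (45) is easy to check numerically: iterate p^ε_{n+1} = p^ε_n(I + εQ) and compare with a fine Euler discretization of (46) at t = εn. The generator Q and initial law below are the ones used in the numerical examples of Section 6; the values of ε, n, and the Euler step are illustrative choices.

```python
# Numerical check of (45): the n-step law under P_eps = I + eps*Q stays
# O(eps) close to the continuous-time law p(eps*n).
Q = [[-0.6, 0.4, 0.2],
     [0.2, -0.5, 0.3],
     [0.4, 0.1, -0.5]]
p0 = [0.75, 0.125, 0.125]
eps, n = 0.01, 200

def left_mult(p, G):
    # (p G)_j = sum_i p_i G_ij
    return [sum(p[i] * G[i][j] for i in range(3)) for j in range(3)]

# discrete-time law: p_n = p0 (I + eps*Q)^n
p_disc = list(p0)
for _ in range(n):
    pq = left_mult(p_disc, Q)
    p_disc = [p_disc[j] + eps * pq[j] for j in range(3)]

# continuous-time law: Euler steps on dp/dt = p(t) Q up to t = eps*n
h = 1e-4
p_cont = list(p0)
for _ in range(int(round(eps * n / h))):
    pq = left_mult(p_cont, Q)
    p_cont = [p_cont[j] + h * pq[j] for j in range(3)]

gap = max(abs(a - b) for a, b in zip(p_disc, p_cont))
```

Both iterations preserve total probability (rows of Q sum to zero), and the gap between the two laws is of the order of ε, consistent with (45).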
4.2 Slowly-Varying Markov Chain: ε ≪ µ
In this case, since the Markov chain changes so slowly, the time-varying parameter process
is essentially a constant. To facilitate the discussion and to fix notation, we take ε = µ1+∆
for some ∆ > 0 in what follows.
The analysis is similar to the ε = O(µ) case. Begin by defining the continuous-time interpolation as before. While a truncation device is still needed, we omit it and assume the iterates are bounded for notational brevity. The tightness of {θ^µ(·)} can be verified as in Lemma 4.2. To characterize the weak limit, we note that the estimates from the previous section remain valid, except those involving the Markov chain α_k. Thus we need only examine (from the second-to-last line of (43))
lim_{µ→0} Eh(θ^{N,µ}(t_i), α^µ(t_i) : i ≤ κ) Σ_{lδ_µ=t}^{t+s} Σ_{j=1}^{m_0} δ_µ ∇f′(θ^N_{lm_µ}, α_{lm_µ}) (1/m_µ) Σ_{k=lm_µ}^{lm_µ+m_µ−1} A^{(j)} a_j I_{α_k=a_j}
= lim_{µ→0} Eh(θ^{N,µ}(t_i), α^µ(t_i) : i ≤ κ) Σ_{lδ_µ=t}^{t+s} Σ_{j=1}^{m_0} δ_µ ∇f′(θ^N_{lm_µ}, α_{lm_µ}) (1/m_µ) Σ_{k=lm_µ}^{lm_µ+m_µ−1} A^{(j)} a_j E_{lm_µ} I_{α_k=a_j}
= lim_{µ→0} Eh(θ^{N,µ}(t_i), α^µ(t_i) : i ≤ κ) Σ_{lδ_µ=t}^{t+s} Σ_{j=1}^{m_0} δ_µ ∇f′(θ^N_{lm_µ}, α_{lm_µ})
× (1/m_µ) Σ_{k=lm_µ}^{lm_µ+m_µ−1} Σ_{i_1=1}^{m_0} Σ_{i_0=1}^{m_0} A^{(j)} a_j P(α_k = a_j | α_{lm_µ} = a_{i_1}) P(α_{lm_µ} = a_{i_1} | α_0 = a_{i_0}) P(α_0 = a_{i_0})
→ Eh(θ^N(t_i), α(t_i) : i ≤ κ) Σ_{i_0=1}^{m_0} ∫_t^{t+s} ∇f′(θ(τ), α(τ)) A^{(i_0)} a_{i_0} P(α_0 = a_{i_0}) dτ.  (50)
To obtain the last line above, we have used that for lm_µ ≤ k ≤ lm_µ + m_µ, since ε = µ^{1+∆},
ε(lm_µ + m_µ) ≤ µ^∆(t + s) + µ^∆δ_µ → 0 as µ → 0, so by Remark 4.4,

(P^ε)^{k−lm_µ} = Ξ(εlm_µ, εk) + O(ε + λ_1^{k−lm_µ}) → I as µ → 0,
(P^ε)^{lm_µ} = Ξ(0, εlm_µ) + O(ε + λ_1^{lm_µ}) → I as µ → 0.
We omit the details but present the main result as follows.

Theorem 4.5 Assume (A1)–(A4) hold, and ε = µ^{1+∆} for some ∆ > 0. Then θ^µ(·) converges weakly to θ(·) such that θ(·) is the unique solution of the differential equation

dθ(t)/dt = Σ_{i=1}^{m_0} A^{(i)}(a_i − θ(t)) P(α_0 = a_i), θ(0) = θ_0.  (51)
4.3 Fast-Varying Markov Chain: µ ≪ ε
The idea for the fast varying chain is that the parameter changes so fast that it quickly
approaches the stationary distribution of the Markov chain. As a result, the limit dynamic
system is one that is averaged out with respect to the stationary distribution of the Markov
chain. In this section, we take ε = µ^γ with 1/2 < γ < 1. Then, letting µlm_µ → τ as in the proof of Theorem 4.1, we have ε(k − lm_µ) = µ^γ(k − lm_µ) → ∞. Thus, for some 0 < λ_1 < 1,

Ξ_{ij}(εlm_µ, εk) = ν_j + O(ε + λ_1^{k−lm_µ}),

where ν = (ν_1, …, ν_{m_0}) is the stationary distribution of the continuous-time Markov chain with generator Q, and Ξ_{ij}(s_1, s_2) denotes the ij-th entry of the matrix Ξ(s_1, s_2). Therefore, we
can show that as µ → 0,
lim_{µ→0} Eh(θ^{N,µ}(t_i), α^µ(t_i) : i ≤ κ) Σ_{lδ_µ=t}^{t+s} Σ_{j=1}^{m_0} δ_µ ∇f′(θ^N_{lm_µ}, α_{lm_µ}) (1/m_µ) Σ_{k=lm_µ}^{lm_µ+m_µ−1} A^{(j)} a_j E_{lm_µ} I_{α_k=a_j}
→ Eh(θ^N(t_i), α(t_i) : i ≤ κ) Σ_{j=1}^{m_0} ∫_t^{t+s} ∇f′(θ(τ), α(τ)) A^{(j)} a_j ν_j dτ.  (52)
Theorem 4.6 Assume (A1)–(A4) hold, and ε = µ^γ for some 1/2 < γ < 1. Then θ^µ(·) converges weakly to θ(·) such that θ(·) is the unique solution of the differential equation

dθ(t)/dt = Σ_{j=1}^{m_0} ν_j A^{(j)}(a_j − θ(t)), θ(0) = θ_0.  (53)
5 Rates of Convergence
5.1 Scaled Errors: ε = µ
Define u_n := θ̃_n/√µ = (α_n − θ_n)/√µ. Then

u_{n+1} = u_n − √µ ϕ_n sgn(ϕ′_n θ̃_n + e_n) + (α_{n+1} − α_n)/√µ.  (54)
In view of Theorem 3.1, there is an N_µ such that E|α_n − θ_n|² = O(µ) for n ≥ N_µ, with which we can show {u_n : n ≥ N_µ} is tight. In addition, take N_µ large enough that, by (19),

Σ_{j=n}^∞ E_n(α_{j+1} − α_j) = O(µ).  (55)

Then define

u^µ(t) := u_n for t ∈ [(n − N_µ)µ, (n − N_µ)µ + µ).

We can then proceed to the study of the asymptotic distribution of u^µ(·). As before, a truncation device may be employed; for notational simplicity, boundedness will be assumed here.
Lemma 5.1 The sequence {uµ(·)} is tight in D([0,∞);Rr).
Proof. Note that

u^µ(t+s) − u^µ(t) = −√µ Σ_{k=t/µ}^{(t+s)/µ−1} g_k + (α_{(t+s)/µ} − α_{t/µ})/√µ.  (56)

Note that we have used the convention that t/µ denotes the integer part of t/µ in the above. Use E_t^µ to denote the conditional expectation with respect to the σ-algebra F_t^µ = σ{u^µ(τ) : τ ≤ t}. Then by (55),

E_t^µ|u^µ(t+s) − u^µ(t)|² ≤ K E_t^µ|Σ_{k=t/µ}^{(t+s)/µ−1} √µ g_k|² + O(√µ).  (57)
Now we examine

E_t^µ|Σ_{k=t/µ}^{(t+s)/µ−1} √µ g_k|² = Σ_{i=1}^{m_0} E_t^µ µ Σ_{k=t/µ}^{(t+s)/µ−1} Σ_{j=t/µ}^{(t+s)/µ−1} g′_k g_j I_{α_k=a_i=α_j}
≤ Σ_{i=1}^{m_0} E_t^µ µ Σ_{k=t/µ}^{(t+s)/µ−1} Σ_{j=t/µ}^{(t+s)/µ−1} [A_k^{(i)}θ̃_k + o(θ̃_k)]′[A_j^{(i)}θ̃_j + o(θ̃_j)] I_{α_k=a_i} I_{α_j=a_i}
≤ Σ_{i=1}^{m_0} E_t^µ Kµ |Σ_{k=t/µ}^{(t+s)/µ−1} [(A_k^{(i)} − A^{(i)})θ̃_k + o(θ̃_k)] I_{α_k=a_i}|²
+ Σ_{i=1}^{m_0} E_t^µ Kµ |Σ_{k=t/µ}^{(t+s)/µ−1} A^{(i)}θ̃_k I_{α_k=a_i}|².  (58)
Since E|θ̃_k|² = O(µ) for k large (µ small), in the last term of (58) we have

E Σ_{i=1}^{m_0} E_t^µ Kµ |Σ_{k=t/µ}^{(t+s)/µ−1} A^{(i)}θ̃_k I_{α_k=a_i}|² ≤ Kµ Σ_{i=1}^{m_0} E Σ_{k=t/µ}^{(t+s)/µ−1} |θ̃_k|² I_{α_k=a_i} ≤ O(µ)s.  (59)
For the first term we use the mixing inequality of (A4):

E Σ_{i=1}^{m_0} E_t^µ Kµ |Σ_{k=t/µ}^{(t+s)/µ−1} [(A_k^{(i)} − A^{(i)})θ̃_k + o(θ̃_k)] I_{α_k=a_i}|²
≤ E Σ_{i=1}^{m_0} E_t^µ Kµ Σ_{k=t/µ}^{(t+s)/µ−1} Σ_{j=t/µ}^{(t+s)/µ−1} [(A_k^{(i)} − A^{(i)})θ̃_k]′[(A_j^{(i)} − A^{(i)})θ̃_j] I_{α_k=a_i=α_j} + Kµ Σ_{k=t/µ}^{(t+s)/µ−1} E|θ̃_k|²
≤ E Σ_{i=1}^{m_0} Kµ [E_t^µ Σ_{k=t/µ}^{(t+s)/µ−1} |(A_k^{(i)} − A^{(i)})√µ u_k| Σ_{j≥k} |(A_j^{(i)} − A^{(i)})√µ u_j|] I_{α_k=a_i=α_j} + O(µ)s
≤ O(µ)s.  (60)
For any T < ∞ and any 0 ≤ t ≤ T,

lim_{δ→0} limsup_{µ→0} { sup_{0≤s≤δ} E[E_t^µ|u^µ(t+s) − u^µ(t)|²] } = 0,

so {u^µ(·)} is tight. □
Note that g_k(a_i, i) = ϕ_k sgn(ϕ′_k[a_i − a_i] + e_k) = ϕ_k sgn(e_k). The following is a variant of the well-known central limit theorem for mixing processes; see [3] or [10] for details.

Lemma 5.2 Define ξ_k := ϕ_k sgn(e_k). Then

√µ Σ_{j=0}^{(t/µ)−1} ξ_j converges weakly to a Brownian motion w(t)  (61)

with covariance Σt, where the covariance Σ is given by

Σ = Eξ_0ξ′_0 + Σ_{j=1}^∞ Eξ_jξ′_0 + Σ_{j=1}^∞ Eξ_0ξ′_j.  (62)
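When {(ϕ_n, e_n)} is i.i.d. (as in the numerical examples of Section 6), the lag terms in (62) vanish and Σ reduces to Eξ_0ξ′_0 = Eϕ_0ϕ′_0, since sgn(e_0)² = 1 almost surely. A scalar Monte Carlo sketch of this lag-0 term, with assumed N(0, 1) regressors and N(0, 0.25) noise:

```python
import random

rng = random.Random(7)
N = 200_000
acc = 0.0
for _ in range(N):
    phi = rng.gauss(0.0, 1.0)
    e = rng.gauss(0.0, 0.5)
    xi = phi * ((e > 0) - (e < 0))   # xi_k = phi_k sgn(e_k)
    acc += xi * xi                   # lag-0 term of (62)
Sigma_hat = acc / N                  # should be close to E[phi^2] = 1
```

The sign of the noise contributes nothing to the variance here; the asymptotic covariance is that of the regressor alone.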
Theorem 5.3 u^µ(·) converges weakly to u(·) such that u(·) is the solution of

du(t) = −A(α(t))u dt − Σ^{1/2}dw,  (63)

where w(·) is a standard Brownian motion.
Proof. As usual, extract a convergent subsequence of u^µ(·) (still denoted by u^µ(·)) with limit u(·). We will show that for each s, t > 0, the limit process satisfies

u(t+s) − u(t) = −∫_t^{t+s} A(α(τ))u(τ) dτ − ∫_t^{t+s} Σ^{1/2} dw(τ).  (64)

Note from (56),

u^µ(t+s) − u^µ(t) = −√µ Σ_{k=t/µ}^{(t+s)/µ−1} g_k + O(√µ)
= Σ_{i=1}^{m_0} [−√µ Σ_{k=t/µ}^{(t+s)/µ−1} g_k] I_{α_k=a_i} + O(√µ).  (65)
Define g_k(i) := g_k I_{α_k=a_i}, ḡ_k(i) := E_k g_k(i), and

∆_k(i) := [g_k(i) − g_k(a_i, i)] − [ḡ_k(i) − ḡ_k(a_i, i)].

We then expand the (negative of the) inside of the sum indexed by i in (65) as

√µ Σ_{k=t/µ}^{(t+s)/µ−1} g_k(i)
= Σ_{k=t/µ}^{(t+s)/µ−1} √µ g_k(a_i, i) + Σ_{k=t/µ}^{(t+s)/µ−1} √µ [ḡ_k(i) − ḡ_k(a_i, i)] + Σ_{k=t/µ}^{(t+s)/µ−1} √µ ∆_k(i)
= Σ_{k=t/µ}^{(t+s)/µ−1} √µ ξ_k + Σ_{k=t/µ}^{(t+s)/µ−1} µ[A_k^{(i)} u_k + o(|u_k|)] + Σ_{k=t/µ}^{(t+s)/µ−1} √µ ∆_k(i),  (66)

where ξ_k = ϕ_k sgn(e_k).
Note that for the second term above we used ḡ_k(a_i, i) = o(θ̃_k) = o(√µ|u_k|) by (A3). First, we show the last term in (66) is o(1). Since ∆_k(i) is a martingale difference, we have

E|Σ_{k=t/µ}^{(t+s)/µ−1} √µ ∆_k(i)|² = Σ_{k=t/µ}^{(t+s)/µ−1} µ E|∆_k(i)|²
= Σ_{k=t/µ}^{(t+s)/µ−1} µ E[g_k(i) − g_k(a_i, i)]′[g_k(i) − g_k(a_i, i)]
+ Σ_{k=t/µ}^{(t+s)/µ−1} µ E[ḡ_k(i) − ḡ_k(a_i, i)]′[ḡ_k(i) − ḡ_k(a_i, i)].  (67)
The boundedness of ϕ_k and u_k implies √µ ϕ′_k u_k → 0 in probability uniformly in k as µ → 0. Hence, for the first term in (67),

Σ_{k=t/µ}^{(t+s)/µ−1} µ E[g_k(i) − g_k(a_i, i)]′[g_k(i) − g_k(a_i, i)]
= Σ_{k=t/µ}^{(t+s)/µ−1} µ E ϕ′_k ϕ_k [sgn(√µ ϕ′_k u_k + e_k) − sgn(e_k)]² → 0 as µ → 0.  (68)
Using (A3) and (A4), along with the boundedness of u_k, the second term of (67) gives

Σ_{k=t/µ}^{(t+s)/µ−1} µ E[ḡ_k(i) − ḡ_k(a_i, i)]′[ḡ_k(i) − ḡ_k(a_i, i)]
= Σ_{k=t/µ}^{(t+s)/µ−1} µ E[√µ A_k^{(i)} u_k + o(√µ|u_k|)]′[√µ A_k^{(i)} u_k + o(√µ|u_k|)]
= Σ_{k=t/µ}^{(t+s)/µ−1} µ² E|(A_k^{(i)} − A^{(i)})u_k + A^{(i)}u_k + o(|u_k|)|²
≤ µ²K Σ_{k=t/µ}^∞ φ(k − t/µ) + µ² Σ_{k=t/µ}^{(t+s)/µ−1} K → 0 as µ → 0.  (69)
Hence

E|Σ_{k=t/µ}^{(t+s)/µ−1} √µ ∆_k(i)|² → 0 as µ → 0.
Next, in the second term of (66) we have

Σ_{k=t/µ}^{(t+s)/µ−1} µ[A_k^{(i)} u_k + o(|u_k|)]
= A^{(i)} Σ_{k=t/µ}^{(t+s)/µ−1} µ u_k + Σ_{k=t/µ}^{(t+s)/µ−1} µ(A_k^{(i)} − A^{(i)})u_k + Σ_{k=t/µ}^{(t+s)/µ−1} µ o(|u_k|).  (70)
Similar to the previous section, choose a sequence m_µ such that m_µ → ∞ as µ → 0 but δ_µ/√µ = √µ m_µ → 0. Then

Σ_{k=t/µ}^{(t+s)/µ−1} µ u_k = Σ_{lδ_µ=t}^{t+s} δ_µ u_{lm_µ} + Σ_{lδ_µ=t}^{t+s} δ_µ (1/m_µ) Σ_{k=lm_µ}^{lm_µ+m_µ−1} [u_k − u_{lm_µ}].  (71)
Since u_k − u_{lm_µ} = O(δ_µ/√µ) for lm_µ ≤ k < lm_µ + m_µ, the second term above goes to 0 in probability, uniformly in t. Similarly, by (A3),

Σ_{k=t/µ}^{(t+s)/µ−1} µ(A_k^{(i)} − A^{(i)})u_k = Σ_{lδ_µ=t}^{t+s} δ_µ (1/m_µ) Σ_{k=lm_µ}^{lm_µ+m_µ−1} (A_k^{(i)} − A^{(i)})u_k → 0.  (72)

Likewise, Σ_{k=t/µ}^{(t+s)/µ−1} µ o(|u_k|) → 0 in probability uniformly in t.
Hence, putting the above estimates together, we obtain

u(t+s) − u(t) = lim_{µ→0} [u^µ(t+s) − u^µ(t)]
= lim_{µ→0} { Σ_{i=1}^{m_0} [−A^{(i)} Σ_{lδ_µ=t}^{t+s} δ_µ u_{lm_µ}] I_{α_k=a_i} − Σ_{k=t/µ}^{(t+s)/µ−1} √µ ξ_k }
= −∫_t^{t+s} A(α(τ))u(τ) dτ − ∫_t^{t+s} Σ^{1/2} dw(τ).  (73)

□
5.2 Scaled Errors: ε ≪ µ
The analysis for the cases ε ≪ µ and ε ≫ µ is similar to that for ε = O(µ). We omit
the details and present the main results. Recall that in the ε ≪ µ case, the parameter is
essentially a constant and thus we look to the initial distribution to determine the asymptotic
properties.
Define

α^* := Σ_{i=1}^{m_0} a_i P(α_0 = a_i),  v_n := (α^* − θ_n)/√µ,
v^µ(t) := v_n for t ∈ [(n − N_µ)µ, (n − N_µ)µ + µ),
A^{(*)} := Σ_{i=1}^{m_0} A^{(i)} P(α_0 = a_i).
Then we have the following:
Theorem 5.4 Assume ε = µ^{1+∆} for some ∆ > 0. Then v^µ(·) converges weakly to v(·) such that v(·) is the solution of

dv(t) = −A^{(*)}v dt − Σ^{1/2}dw,  (74)

where w(·) is a standard Brownian motion.
5.3 Scaled Errors: ε ≫ µ
Again, the idea here is that the parameter varies so quickly that its law rapidly approaches the stationary distribution ν = (ν_1, …, ν_{m_0}). Thus we look to the expectation against the stationary distribution to determine the asymptotic properties.
Define

ᾱ := Σ_{i=1}^{m_0} a_i ν_i,  z_n := (ᾱ − θ_n)/√µ,
z^µ(t) := z_n for t ∈ [(n − N_µ)µ, (n − N_µ)µ + µ),
Ā := Σ_{i=1}^{m_0} A^{(i)} ν_i.
We have the following result.
Theorem 5.5 Assume ε = µ^γ for some 1/2 < γ < 1. Then $z^\mu(\cdot)$ converges weakly to $z(\cdot)$ such that $z(\cdot)$ is the solution of
\[
dz(t) = -\bar{A} z\,dt - \Sigma^{1/2}\,dw, \tag{75}
\]
where w(·) is a standard Brownian motion.
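To make the limit concrete: (75) is an Ornstein-Uhlenbeck-type equation, which can be simulated with the Euler-Maruyama scheme. The sketch below treats the scalar case; the values chosen for Ā (here `a_bar`) and Σ^{1/2} (here `sigma_half`) are illustrative placeholders, not quantities computed from the model.

```python
import numpy as np

def euler_maruyama_ou(a_bar, sigma_half, z0, dt, n_steps, rng):
    """Euler-Maruyama path of dz = -a_bar * z dt - sigma_half * dw."""
    z = np.empty(n_steps + 1)
    z[0] = z0
    for k in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt))   # Brownian increment over dt
        z[k + 1] = z[k] - a_bar * z[k] * dt - sigma_half * dw
    return z

rng = np.random.default_rng(0)
# Placeholder coefficients standing in for A-bar and Sigma^{1/2}.
path = euler_maruyama_ou(a_bar=1.0, sigma_half=0.5, z0=2.0, dt=0.01,
                         n_steps=2000, rng=rng)
# The mean decays like z0 * exp(-a_bar * t); the stationary variance
# of the limit process is sigma_half**2 / (2 * a_bar).
```

In the vector case one would replace `a_bar` by the matrix Ā and `sigma_half` by a matrix square root of Σ.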
6 Numerical Examples
Here we demonstrate the performance of the sign-error (SE) algorithm and compare it with the sign-regressor (SR) and least mean squares (LMS) algorithms (see [28, 32], respectively). We fix the step size µ = 0.05 and consider three cases: ε = (3/5)µ (so that ε = O(µ)); ε = µ² (a slowly varying Markov chain); and ε = √µ (a fast Markov chain).
We use state space M = {−1, 0, 1} with transition matrix $P^\varepsilon = I + \varepsilon Q$, where
\[
Q = \begin{pmatrix} -0.6 & 0.4 & 0.2 \\ 0.2 & -0.5 & 0.3 \\ 0.4 & 0.1 & -0.5 \end{pmatrix}
\]
is the generator of a continuous-time Markov chain whose stationary distribution is ν = (1/3, 1/3, 1/3). Hence $\bar{\alpha} = \sum_{i=1}^3 a_i\nu_i = 0$. We take the initial distribution for α0 to be (3/4, 1/8, 1/8), so $\alpha^* = \sum_{i=1}^3 a_i P(\alpha_0 = a_i) = -0.625$. The sequences {ϕn} and {en} are i.i.d. N(0, 1) and N(0, 0.25), respectively. We proceed to observe 1000 iterations of the algorithm for the cases
Figure 1: Markov chain parameter process and estimates obtained by adaptive filtering with ε = O(µ).
Figure 2: Markov chain parameter process and estimates obtained by adaptive filtering with ε ≪ µ.
Figure 3: Markov chain parameter process and estimates obtained by adaptive filtering with ε ≫ µ.
ε = O(µ) and ε ≫ µ, and 10,000 iterations for the case ε ≪ µ (in order to illustrate some variations of the parameter).
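The experiment just described is straightforward to sketch. The snippet below uses the standard constant-step forms of the three recursions, namely $\theta_{n+1} = \theta_n + \mu\varphi_n\,\mathrm{sgn}(y_n - \varphi_n\theta_n)$ for SE, $\theta_{n+1} = \theta_n + \mu\,\mathrm{sgn}(\varphi_n)(y_n - \varphi_n\theta_n)$ for SR, and $\theta_{n+1} = \theta_n + \mu\varphi_n(y_n - \varphi_n\theta_n)$ for LMS; the paper's exact recursions and random draws may differ, so this is an illustrative reconstruction, not the code behind the figures.

```python
import numpy as np

rng = np.random.default_rng(1)

mu = 0.05                            # step size
eps = 0.6 * mu                       # the eps = O(mu) case
states = np.array([-1.0, 0.0, 1.0])  # state space M
Q = np.array([[-0.6, 0.4, 0.2],
              [0.2, -0.5, 0.3],
              [0.4, 0.1, -0.5]])
P = np.eye(3) + eps * Q              # transition matrix P^eps = I + eps*Q
n_iter = 1000

# Sample the Markov parameter alpha_n; initial distribution (3/4, 1/8, 1/8).
idx = rng.choice(3, p=[0.75, 0.125, 0.125])
alpha = np.empty(n_iter)
for n in range(n_iter):
    alpha[n] = states[idx]
    idx = rng.choice(3, p=P[idx])

phi = rng.normal(0.0, 1.0, n_iter)   # regressors {phi_n}, i.i.d. N(0, 1)
e = rng.normal(0.0, 0.5, n_iter)     # noise {e_n}, N(0, 0.25)
y = phi * alpha + e                  # observations

theta_se = theta_sr = theta_lms = 0.0
se_path = np.empty(n_iter)
sr_path = np.empty(n_iter)
lms_path = np.empty(n_iter)
for n in range(n_iter):
    theta_se += mu * phi[n] * np.sign(y[n] - phi[n] * theta_se)    # sign-error
    theta_sr += mu * np.sign(phi[n]) * (y[n] - phi[n] * theta_sr)  # sign-regressor
    theta_lms += mu * phi[n] * (y[n] - phi[n] * theta_lms)         # LMS
    se_path[n] = theta_se
    sr_path[n] = theta_sr
    lms_path[n] = theta_lms
```

Overlaying `alpha` with the three estimate paths reproduces the qualitative behavior discussed below: the SE iterates jump to parameter changes faster but fluctuate more between jumps.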
To observe the tracking behavior of the SE algorithm, in comparison to the SR and LMS
algorithms, we overlay the respective plots for each case. When ε = O(µ), the LMS and
SR estimates tend to be approximately equal, while the SE estimates show more deviations
from the other estimates. The SE algorithm responds to changes in the parameter more quickly, whereas the LMS and SR algorithms adhere to the parameter more closely while it remains constant. In the ε ≪ µ case, this behavior is repeated. While all three estimates track
Figure 4: Scaled error z_n with fast-varying Markov chain (ε ≫ µ): diffusion behavior.
Figure 5: Average of parameter process and estimates over time with ε ≫ µ.
the parameter closely, the LMS and SR estimates deviate from the parameter less than the
SE estimates between jumps of the parameter.
In the ε ≫ µ case, none of the algorithms can track the parameter well at each iterate. However, when we observe the scaled error $z_n$ against the stationary distribution of the Markov chain, the expected diffusion behavior is displayed. Examining the cumulative averages of the parameter and of the estimates, we note that the parameter average quickly converges to ᾱ. The LMS and SR estimate averages adhere closely to the parameter average, while the SE estimate average deviates slightly more.
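The claim that the parameter average settles near ᾱ can be checked numerically: compute ν directly from νQ = 0 and compare it with the running average of a simulated fast chain. A minimal sketch (the least-squares solve and the chain sampler are our own illustrative choices, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)

mu = 0.05
eps = np.sqrt(mu)                    # fast chain: eps >> mu
states = np.array([-1.0, 0.0, 1.0])
Q = np.array([[-0.6, 0.4, 0.2],
              [0.2, -0.5, 0.3],
              [0.4, 0.1, -0.5]])
P = np.eye(3) + eps * Q

# Stationary distribution: solve nu Q = 0 subject to sum(nu) = 1
# (equivalently nu P = nu, since P = I + eps*Q).
A = np.vstack([Q.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
nu = np.linalg.lstsq(A, b, rcond=None)[0]
alpha_bar = states @ nu              # = 0 for this Q (nu is uniform)

# Running average of the chain should settle near alpha_bar.
idx = 0
samples = np.empty(1000)
for n in range(1000):
    samples[n] = states[idx]
    idx = rng.choice(3, p=P[idx])
cum_avg = np.cumsum(samples) / np.arange(1, 1001)
```

For this Q the column sums vanish, so ν is uniform and ᾱ = 0, matching the value used in the discussion above.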
References
[1] A. Benveniste, M. Goursat, and G. Ruget, Analysis of stochastic approximation schemes with discontinuous and dependent forcing terms with applications to data communication algorithms, IEEE Trans. Automat. Control, AC-25 (1980), 1042-1058.
[2] A. Benveniste, M. Metivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximations, Springer-Verlag, Berlin, 1990.
[3] P. Billingsley, Convergence of Probability Measures, J. Wiley, New York, 1968.
[4] H.-F. Chen, Stochastic Approximation and Its Applications, Kluwer Academic, Dordrecht, Netherlands, 2002.
[5] H.-F. Chen and G. Yin, Asymptotic properties of sign algorithms for adaptive filtering, IEEE Trans. Automat. Control, 48 (2003), 1545-1556.
[6] H.-F. Chen and G. Yin, On asymptotic properties of a constant-step-size sign-error algorithm for adaptive filtering, Science in China Series F: Information Sciences, 45 (2002), 321-334.
[7] E. Eweda and O. Macchi, Quadratic mean and almost-sure convergence of unbounded stochastic approximation algorithms with correlated observations, Ann. Inst. Henri Poincaré, 19 (1983), 235-255.
[8] E. Eweda, A tight upper bound of the average absolute error in a constant step size sign algorithm, IEEE Trans. Acoust. Speech Signal Process., ASSP-37 (1989), 1774-1776.
[9] E. Eweda, Convergence of the sign algorithm for adaptive filtering with correlated data, IEEE Trans. Inform. Theory, IT-37 (1991), 1450-1457.
[10] S.N. Ethier and T.G. Kurtz, Markov Processes: Characterization and Convergence, Wiley, New York, 1986.
[11] A. Gersho, Adaptive filtering with binary reinforcement, IEEE Trans. Inform. Theory, IT-30 (1984), 191-199.
[12] L. Guo, Time-varying Stochastic Systems: Stability, Estimation and Control, Jilin Science Tech. Press, Changchun, 1993.
[13] M.L. Honig, U. Madhow, and S. Verdu, Adaptive blind multiuser detection, IEEE Trans. Inform. Theory, 41 (1995), 944-960.
[14] M.L. Honig and H.V. Poor, Adaptive interference suppression in wireless communication systems, in Wireless Communications: Signal Processing Perspectives, H.V. Poor and G.W. Wornell, Eds., Prentice Hall, 1998.
[15] H.J. Kushner, Approximation and Weak Convergence Methods for Random Processes, with Applications to Stochastic Systems Theory, MIT Press, Cambridge, MA, 1984.
[16] H.J. Kushner and A. Shwartz, Weak convergence and asymptotic properties of adaptive filters with constant gains, IEEE Trans. Inform. Theory, IT-30 (1984), 177-182.
[17] H.J. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed., Springer-Verlag, New York, NY, 2003.
[18] C.P. Kwong, Dual sign algorithm for adaptive filtering, IEEE Trans. Comm., COM-34 (1986), 1272-1275.
[19] L. Ljung, Analysis of stochastic gradient algorithms for linear regression problems, IEEE Trans. Inform. Theory, IT-30 (1984), 151-160.
[20] O. Macchi and E. Eweda, Convergence analysis of self-adaptive equalizers, IEEE Trans. Inform. Theory, IT-30 (1984), 161-176.
[21] M. Metivier and P. Priouret, Applications of a Kushner and Clark lemma to general classes of stochastic algorithms, IEEE Trans. Inform. Theory, IT-30 (1984), 140-151.
[22] G.V. Moustakides, Exponential convergence of products of random matrices: application to adaptive algorithms, Internat. J. Adaptive Control Signal Process., 12 (1998), 579-597.
[23] H.V. Poor and X. Wang, Code-aided interference suppression for DS/CDMA communications – Part I: Interference suppression capability, and Part II: Parallel blind adaptive implementations, IEEE Trans. Comm., 45 (1997), 1101-1111 and 1112-1122.
[24] N.A.M. Verhoeckx, H.C. van den Elzen, F.A.M. Snijders, and P.J. van Gerwen, Digital echo cancellation for baseband data transmission, IEEE Trans. Acoust. Speech Signal Process., 761-781.
[25] L.Y. Wang, G. Yin, J.-F. Zhang, and Y.L. Zhao, System Identification with Quantized Observations: Theory and Applications, Birkhäuser, Boston, 2010.
[26] B. Widrow and S.D. Stearns, Adaptive Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1985.
[27] G. Yin, Adaptive filtering with averaging, in Adaptive Control, Filtering and Signal Processing, IMA Volumes in Mathematics and Its Applications, Vol. 74, 375-396, K. Astrom, G. Goodwin, and P.R. Kumar, Eds., Springer-Verlag, New York, 1995.
[28] G. Yin, A. Hashemi, and L.Y. Wang, Sign-regressor adaptive filtering algorithms for Markovian parameters, preprint, 2011.
[29] G. Yin, C. Ion, and V. Krishnamurthy, How does a stochastic optimization/approximation algorithm adapt to a randomly evolving optimum/root with jump Markov sample paths, Math. Programming, Ser. B, 120 (2009), 67-99.
[30] G. Yin, V. Krishnamurthy, and C. Ion, Iterate-averaging sign algorithms for adaptive filtering with applications to blind multiuser detection, IEEE Trans. Inform. Theory, 49 (2003), 657-671.
[31] G. Yin, V. Krishnamurthy, and C. Ion, Regime switching stochastic approximation algorithms with application to adaptive discrete stochastic optimization, SIAM J. Optim., 14 (2004), 1187-1215.
[32] G. Yin and V. Krishnamurthy, Least mean square algorithms with Markov regime-switching limit, IEEE Trans. Automat. Control, 50 (2005), 577-593.
[33] G. Yin and Q. Zhang, Discrete-time Markov Chains: Two-time-scale Methods and Applications, Springer, New York, NY, 2005.
[34] G. Yin, Q. Zhang, and G. Badowski, Discrete-time singularly perturbed Markov chains: Aggregation, occupation measures, and switching diffusion limit, Adv. in Appl. Probab., 35 (2003), 449-476.
[35] G. Yin and C. Zhu, Hybrid Switching Diffusions: Properties and Applications, Springer, New York, 2010.