Nonlinear Wavelet Estimation of Time-Varying Autoregressive Processes



Rainer Dahlhaus(1), Michael H. Neumann(2) and Rainer von Sachs(3)

Abstract. We consider nonparametric estimation of the parameter functions $a_i(\cdot)$, $i = 1,\ldots,p$, of a time-varying autoregressive process. Choosing an orthonormal wavelet basis representation of the functions $a_i$, the empirical wavelet coefficients are derived from the time series data as the solution of a least squares minimization problem. In order to allow the $a_i$ to be functions of inhomogeneous regularity, we apply nonlinear thresholding to the empirical coefficients and obtain locally smoothed estimates of the $a_i$. We show that the resulting estimators attain the usual minimax $L_2$-rates up to a logarithmic factor, simultaneously in a large scale of Besov classes. The finite-sample behaviour of our procedure is demonstrated by application to two typical simulated examples.

1991 Mathematics Subject Classification. Primary 62M10; secondary 62F10.

Key words and phrases. Nonstationary processes, time series, wavelet estimators, time-varying autoregression, nonlinear thresholding.

Running head. Estimation of time-varying processes.

(1) Institut für Angewandte Mathematik, Universität Heidelberg, Im Neuenheimer Feld 294, D-69120 Heidelberg, Germany
(2) SFB 373, Humboldt-Universität zu Berlin, Spandauer Straße 1, D-10178 Berlin, Germany
(3) Fachbereich Mathematik, Universität Kaiserslautern, Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany

1. Introduction

Stationary models have always been the main focus of interest in the theoretical treatment of time series analysis. For several reasons autoregressive models form a very important class of stationary models: they can be used for modeling a wide variety of situations (for example data which show a periodic behavior), there exist several efficient estimates which can be calculated via simple algorithms (Levinson-Durbin algorithm, Burg algorithm), and the asymptotic properties, including the properties of model selection criteria, are well understood.

Frequently, autoregressive models have also been used for modeling data that show a certain type of nonstationary behaviour, by fitting AR models on small segments. This method is, for example, often used in signal analysis for coding a signal (linear predictive coding) or for modeling data in speech analysis. The underlying assumption then is that the data come from an autoregressive process with time-varying coefficients.

Suppose we have observations $\{X_1,\ldots,X_T\}$ from a zero-mean autoregressive process with time-varying coefficients $a_1(\cdot),\ldots,a_p(\cdot)$. To get a tractable frame for our asymptotic analysis we assume that the functions $a_i$ are supported on the interval $[0,1]$ and connected to the underlying time series by an appropriate rescaling. This leads to the model
$$X_{t,T} + \sum_{i=1}^{p} a_i(t/T)\,X_{t-i,T} = \sigma(t/T)\,\varepsilon_t, \qquad t = 1,\ldots,T, \tag{1.1}$$
where the $\varepsilon_t$ are independent and identically distributed with $E\varepsilon_t = 0$ and $\mathrm{var}(\varepsilon_t) = 1$. To make this definition complete, assume that $X_0,\ldots,X_{1-p}$ are random variables from a stationary AR($p$) process with parameters $a_1(0),\ldots,a_p(0)$. As usual in nonparametric regression, we focus on estimating the whole functions $a_i$, although, strictly speaking, the intermediate values $a_i(s)$ for $(t-1)/T < s < t/T$ are not identifiable. This time-varying autoregressive model is a special locally stationary process as defined in Dahlhaus (1997). However, for the main results of this paper we only use the representation (1.1) and not the general properties of a locally stationary process, such as an analogue of Cramér's representation.

The estimation problem now consists of estimating the parameter functions $a_i(\cdot)$. Very often these functions are estimated at a fixed time point $t_0/T$ by fitting a stationary model in a neighborhood of $t_0$, e.g. by estimating $a_1(t_0/T),\ldots,a_p(t_0/T)$ with the classical Yule-Walker (or Burg) estimate over the segment $X_{t_0-N,T},\ldots,X_{t_0+N,T}$, where $N/T$ is small. This method has the disadvantage that it automatically leads to a smooth estimate of $a_i(\cdot)$. Sudden changes in the $a_i(\cdot)$, as they are quite common e.g. in signal analysis, cannot be detected by this method. Moreover, the performance of this method depends on the appropriate choice of the segmentation parameter $N$. Instead, in this paper we develop an automatic alternative which avoids this a priori choice and adapts to local smoothness characteristics of the $a_i(\cdot)$.

Our approach consists in a nonlinear wavelet method for the estimation of the coefficients $a_i(\cdot)$. This concept, based on orthogonal series expansions, has recently been introduced into the nonparametric regression estimation problem by Donoho and Johnstone (1992), and has proven very useful when the class of functions to be estimated exhibits a varying degree of smoothness. Some generalizations can be found in Brillinger (1994), Johnstone and Silverman (1997), Neumann and Spokoiny (1995) and Neumann and von Sachs (1995). As usual, the unknown functions, i.e. the $a_i(u)$, are expanded in orthogonal series w.r.t. a particularly chosen orthonormal basis of $L_2[0,1]$, a wavelet basis. Basically, the basis functions are generated by dilations and translations of the so-called scaling function $\varphi$ and wavelet function $\psi$, which are both localized in spatial position (i.e. temporal, here) and frequency. These basis functions, unlike most of the "traditional" ones (Fourier, (non-local) polynomials, etc.), are able to optimally compress both functions with rather homogeneous smoothness over the whole domain (like Hölder or $L_2$-Sobolev) and members of certain inhomogeneous smoothness classes like $L_p$-Sobolev or Besov $B^m_{p,q}$ with $p < 2$. Note that the better a signal is compressed (i.e. the smaller the number of coefficients needed to represent it), the better the performance of an estimator of the signal which is optimally tuned w.r.t. the bias-variance trade-off. A strong theoretical justification for the merits of using wavelet bases in this context has been given by Donoho (1993): it was shown that wavelets provide unconditional bases for a wide variety of these inhomogeneous smoothness classes, which yields that wavelet estimators can be optimal in the abovementioned sense.

To actually achieve this optimality, traditional linear series estimation rules, which are known to be optimal only in the case of homogeneous smoothness, have to be modified nonlinearly: in the homogeneous case the coefficients of each resolution level $j$ are essentially of the same order of magnitude, and the loss due to a levelwise inclusion/exclusion rule, as opposed to a componentwise rule, is only small. Under strong inhomogeneity, however, the coefficients within a fixed level may differ considerably in their orders of magnitude, and there may be significant coefficients on higher levels which should be included by a suitably chosen inclusion rule. Surprisingly enough, this is possible by simple and intuitive schemes which are based on comparing the size of the empirical (i.e. estimated) coefficients with their variability. Such nonlinear rules can dramatically outperform linear ones in the mentioned cases of sparse signals (i.e. signals from inhomogeneous function classes represented in an unconditional basis).

In this work, we apply these locally adaptive estimation procedures to the particular problem of estimating autoregression coefficients which are functions of time. A basic problem in this situation is to obtain adequate empirical wavelet coefficients. For example, if one made a wavelet expansion of the function $\hat a_i(\cdot)$, where $\hat a_i(t_0/T)$ is the Yule-Walker estimate on a segment (as described above), then the information on irregularities of the functions $a_i(\cdot)$ would already be lost (since the segment estimate is smooth). No thresholding procedure of the empirical wavelet coefficients would recover it. To overcome this problem we suggest in this paper to use the empirical wavelet coefficients obtained from the solution of a least squares minimization problem. In a second step, soft or hard thresholding is applied. We show that in this situation our nonlinear wavelet estimator attains the usual near-optimal minimax rate of $L_2$-convergence in a large scale of Besov classes. The full procedure requires consistent estimators for the variance of the empirical coefficients. In particular, a consistent estimator of the variance function is needed (cf. Section 3), e.g. based on the squared residuals of a local AR fit.
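To make the model (1.1) concrete, the following minimal Python sketch simulates such a process (this code is ours, not part of the paper; the pre-sample values $X_0,\ldots,X_{1-p}$ are approximated by a burn-in with the coefficients frozen at $a_i(0)$ instead of an exact draw from the stationary AR($p$) distribution, and all function names are our own choices):

```python
import numpy as np

def simulate_tvar(a_funcs, sigma_func, T, burn_in=500, seed=None):
    """Simulate X_1, ..., X_T from model (1.1):
        X_t + sum_i a_i(t/T) X_{t-i} = sigma(t/T) * eps_t,  eps_t i.i.d. N(0, 1).
    Pre-sample values are approximated by a burn-in with coefficients frozen at a_i(0)."""
    rng = np.random.default_rng(seed)
    p = len(a_funcs)
    x = np.zeros(burn_in + T)
    for t in range(p, burn_in + T):
        u = max((t - burn_in + 1) / T, 0.0)   # rescaled time in [0, 1]; 0 during burn-in
        a = np.array([f(u) for f in a_funcs])
        # x[t-p:t][::-1] = (X_{t-1}, ..., X_{t-p})
        x[t] = -a @ x[t - p:t][::-1] + sigma_func(u) * rng.standard_normal()
    return x[burn_in:]

# Example 1 of Section 4: piecewise constant a_1, constant a_2, unit innovation variance.
a1 = lambda u: -1.69 if u <= 0.6 else -1.38
a2 = lambda u: 0.81
X = simulate_tvar([a1, a2], lambda u: 1.0, T=1024, seed=0)
```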

Finally, with this adaptive estimation of the time-varying autoregression coefficients, we immediately provide a semiparametric estimate of the resulting time-dependent spectral density of the process given by (1.1). An alternative, fully nonparametric approach for estimating the so-called evolutionary spectrum of a general locally stationary process (as defined in Dahlhaus (1997)) has been delivered by Neumann and von Sachs (1997); it is based on nonlinear thresholding in a two-dimensional wavelet basis.

The content of our paper is organized as follows: while in the next section we describe the details of our set-up and present the main result, in Section 3 the statistical properties of the empirical coefficients are given. Section 4 shows the finite-sample behaviour of our procedure applied to two typical (simulated) time-varying autoregressive processes. Section 5 deals with the proof of the main theorem. The remaining Sections 6-8 collect some auxiliary results, both of independent interest and used in this particular context to derive the main proof (of Section 5).

2. Assumptions and the main result

Before we develop nonlinear wavelet estimators for the functions $a_i$, we describe the general set-up. First we introduce an appropriate orthonormal basis of $L_2[0,1]$. Assume we have a scaling function $\varphi$ and a so-called wavelet $\psi$ such that $\{2^{l/2}\varphi(2^l\cdot - k)\}_{k\in\mathbb{Z}} \cup \{2^{j/2}\psi(2^j\cdot - k)\}_{j\ge l,\,k\in\mathbb{Z}}$ forms an orthonormal basis of $L_2(\mathbb{R})$. The construction of such functions $\varphi$ and $\psi$, which are compactly supported, is described in Daubechies (1988). It is well known that the boundary-corrected Meyer wavelets (Meyer (1991)) or those developed by Cohen, Daubechies and Vial (1993) form orthonormal bases of $L_2[0,1]$. In both approaches Daubechies' wavelets are used to construct an orthonormal basis of $L_2[0,1]$, essentially by truncation of the above functions to the interval $[0,1]$ and a subsequent orthonormalization step. Throughout this paper either of these bases can be used; it is denoted by $\{\varphi_{lk}\}_{k\in I^0_l} \cup \{\psi_{jk}\}_{j\ge l,\,k\in I_j}$, where $\varphi_{lk}(x) = 2^{l/2}\varphi(2^l x - k)$ and $\psi_{jk}(x) = 2^{j/2}\psi(2^j x - k)$, with certain modifications of those functions that have a support extending beyond the interval $[0,1]$. It is known that $\#I_j = 2^j$, and that $\#I^0_l = 2^l$ for the Cohen, Daubechies, Vial (CDV) bases, whereas for the Meyer bases $\#I^0_l = 2^l + N$ for some integer $N$ depending on the regularity of the wavelet basis. For reasons of notational simplicity, in the sequel we restrict ourselves to the CDV bases only.

Accordingly, we can expand $a_i$ in an orthogonal series
$$a_i = \sum_{k\in I^0_l}\alpha^{(i)}_{lk}\varphi_{lk} + \sum_{j\ge l}\sum_{k\in I_j}\beta^{(i)}_{jk}\psi_{jk}, \tag{2.1}$$
where $\alpha^{(i)}_{lk} = \int a_i(u)\varphi_{lk}(u)\,du$ and $\beta^{(i)}_{jk} = \int a_i(u)\psi_{jk}(u)\,du$ are the usual Fourier coefficients, also called wavelet coefficients.

Assume a degree of smoothness $m_i$ for the function $a_i$, i.e. $a_i$ is a member of a Besov class $B^{m_i}_{p_i,q_i}(C)$ defined below. In accordance with this, we choose compactly supported wavelet functions of regularity $r > m := \max\{m_i\}$, that is

(A1) (i) $\varphi$ and $\psi$ are $C^r[0,1]$ and have compact support,
(ii) $\int\varphi(t)\,dt = 1$, $\int\psi(t)\,t^k\,dt = 0$ for $0 \le k \le r$.

The first step in each wavelet analysis is the definition of empirical versions of the wavelet coefficients. We define the empirical coefficients simply as a least squares estimator, i.e. as a minimizer of
$$\sum_{t=p+1}^{T}\Bigl( X_{t,T} + \sum_{i=1}^{p}\Bigl[ \sum_{k\in I^0_l}\alpha^{(i)}_{lk}\varphi_{lk}(t/T) + \sum_{j=l}^{j^*-1}\sum_{k\in I_j}\beta^{(i)}_{jk}\psi_{jk}(t/T) \Bigr] X_{t-i,T} \Bigr)^2, \tag{2.2}$$
where the choice of $j^*$ will be specified below. Since $\{\varphi_{lk}\}_k \cup \{\psi_{jk}\}_{l\le j\le j^*-1,\,k}$ forms a basis of the subspace $V_{j^*}$ of $L_2[0,1]$, this amounts to an approximation of $a_i$ in just this space $V_{j^*}$.

In the present paper we propose to apply nonlinear smoothing rules to the coefficients $\tilde\beta^{(i)}_{jk}$. It is well known (cf. Donoho and Johnstone (1992)) that linear estimators can be optimal w.r.t. the rate of convergence as long as the underlying smoothness of $a_i$ is not too inhomogeneous. This situation changes considerably if the smoothness varies strongly over the domain. Then we have the new effect that even at higher resolution scales a small number of coefficients cannot be neglected, whereas the overwhelming majority of them is much smaller than the noise level. This kind of sparsity of non-negligible coefficients is responsible for the need of a nonlinear estimation rule. Two commonly used rules to treat the coefficients are

1) hard thresholding: $\delta^{(h)}(\tilde\beta^{(i)}_{jk},\lambda) = \tilde\beta^{(i)}_{jk}\,I\bigl(|\tilde\beta^{(i)}_{jk}| \ge \lambda\bigr)$

and

2) soft thresholding: $\delta^{(s)}(\tilde\beta^{(i)}_{jk},\lambda) = \bigl(|\tilde\beta^{(i)}_{jk}| - \lambda\bigr)_+\,\mathrm{sgn}(\tilde\beta^{(i)}_{jk})$.

To treat these coefficients in a statistically appropriate manner, we have to tune the estimator in accordance with their distribution. It turns out that, at the finest resolution scales, this distribution actually depends on the (unknown) distribution of the $X_{t,T}$'s, whereas we can hope for asymptotic normality if $2^j = o(T)$. We show in Section 3 that we do not lose asymptotic efficiency of the estimator if we truncate the series at some level $j = j(T)$ with $2^{j(T)} \asymp T^{1/2}$. To give a definite rule, we choose the highest resolution level $j^*-1$ such that $2^{j^*-1} \le T^{1/2} < 2^{j^*}$, i.e. we restrict our analysis to coefficients $\tilde\alpha^{(i)}_{lk}$ ($k\in I^0_l$, $i = 1,\ldots,p$) and $\tilde\beta^{(i)}_{jk}$ ($j \ge l$, $2^j \le T^{1/2}$, $k\in I_j$, $i = 1,\ldots,p$). Unlike in ordinary regression it is not possible in the autoregression problem considered here to include coefficients from resolution scales $j$ up to $2^j = o(T)$. This is due to the fact that the empirical coefficients cannot be reduced to sums of independent (or sufficiently weakly dependent) random variables, which results in some additional bias term.

Finally, we build an estimator of $a_i$ by applying the inverse wavelet transform to the nonlinearly modified coefficients.

Before we state our main result, we introduce some more assumptions. The constant $C$ used here and in the following is assumed to be positive, but need not be the same at each occurrence.

(A2) There exists some $\gamma \ge 0$ such that
$$|\mathrm{cum}_n(\varepsilon_t)| \le C^n (n!)^{1+\gamma} \quad\text{for all } n, t.$$

(A3) There exists a $\delta > 0$ with
$$1 + \sum_{i=1}^{p} a_i(s)\,z^i \ne 0 \quad\text{for all } |z| \le 1 + \delta \text{ and all } s\in[0,1].$$
Furthermore, $\sigma$ is assumed to be continuous with $C_1 \le \sigma(s) \le C_2$ on $[0,1]$.

Remark 1. Note that, besides the obvious case of the normal distribution, many of the distributions that can be found in textbooks satisfy (A2) for an appropriate choice of $\gamma$. In Johnson and Kotz (1970) we can find closed forms of higher order cumulants of the exponential, gamma and inverse Gaussian distributions, which show that this condition is satisfied with $\gamma = 0$. The need for a positive $\gamma$ occurs in the case of heavier-tailed distributions, which could arise as the distribution of a sum of weakly dependent random variables; see, e.g., Statulevicius and Jakimavicius (1989).

(A3) implies uniform continuity of the covariances of $\{X_{t,T}\}$ (Lemma 8.1). We conjecture that the continuity in (A3) can, e.g., be relaxed to piecewise continuity.

In the following we derive a rate for the risk of the proposed estimator uniformly over certain smoothness classes. It is well known that nonlinearly thresholded wavelet estimators have the potential to adapt to spatial inhomogeneity. Accordingly, we consider Besov classes as functional classes which admit functions with this feature. Furthermore, Besov spaces represent the most convenient scale of functional spaces in the context of wavelet methods, since the corresponding norm is equivalent to a certain norm in the sequence space of coefficients of a sufficiently regular wavelet basis. For an introduction to the theory of Besov spaces $B^m_{p,q}$ see, e.g., Triebel (1990). Here $m \ge 1$ denotes the degree of smoothness and $p, q$ ($1 \le p, q \le \infty$) specify the norm in which smoothness is measured. These classes contain traditional Hölder and $L_2$-Sobolev smoothness classes, obtained by setting $p = q = \infty$ and $p = q = 2$, respectively. Moreover, they embed other interesting functional spaces like Sobolev spaces $W^m_p$, for which the inclusions $B^m_{p,p} \subseteq W^m_p \subseteq B^m_{p,2}$ in the case $1 < p \le 2$, and $B^m_{p,2} \subseteq W^m_p \subseteq B^m_{p,p}$ if $2 \le p < \infty$, hold true; see, e.g., Theorem 6.4.4 in Bergh and Löfström (1976).

For convenience, we define our functional class by constraints on the sequences of wavelet coefficients. Fix any positive constants $C_{ij}$, $i = 1,\ldots,p$, $j = 1, 2$. We will assume that $a_i$ lies in the following set of functions
$$\mathcal{F}_i = \Bigl\{ f = \sum_k \alpha_{lk}\varphi_{lk} + \sum_{j,k}\beta_{jk}\psi_{jk} \;\Big|\; \|\alpha_{l\cdot}\|_{\infty} \le C_{i1},\ \|\beta_{\cdot\cdot}\|_{m_i,p_i,q_i} \le C_{i2} \Bigr\},$$
where
$$\|\beta_{\cdot\cdot}\|_{m,p,q} = \Bigl( \sum_{j\ge l}\Bigl[ 2^{jsp}\sum_{k\in I_j}|\beta_{jk}|^p \Bigr]^{q/p} \Bigr)^{1/q},$$
$s = m + 1/2 - 1/p$. It is well known that the class $\mathcal{F}_i$ lies between functional classes $B^{m_i}_{p_i,q_i}(c)$ and $B^{m_i}_{p_i,q_i}(C)$, for appropriate constants $c$ and $C$; see Theorem 1 in Donoho and Johnstone (1992) for the Meyer bases, and Theorem 4.2 of Cohen, Dahmen and DeVore (1995) for the CDV bases.

To have enough regularity, we restrict ourselves to

(A4) $\tilde s_i > 1$, where $\tilde s_i = m_i + 1/2 - 1/\tilde p_i$, with $\tilde p_i = \min\{p_i, 2\}$.

In the case of normally distributed coefficients $\tilde\beta^{(i)}_{jk} \sim N(\beta^{(i)}_{jk},\sigma^2)$ a very popular method is to apply thresholds $\lambda = \sigma\sqrt{2\log n}$, where $n$ is the number of these coefficients. As shown in Donoho et al. (1995), the application of these thresholds leads to an estimator which is simultaneously near-optimal in a wide variety of smoothness classes. Because of the heteroskedasticity of the empirical coefficients in our case, we have to modify the above rule slightly. Let
$$J_T = \bigl\{ (j,k) \,\big|\, l \le j,\ 2^j \le T^{1/2},\ k\in I_j \bigr\}$$
and let $\sigma^2_{ijk}$ be the variance of the empirical coefficient $\tilde\beta^{(i)}_{jk}$. Then any threshold $\lambda_{ijk}$ satisfying
$$\sigma_{ijk}\sqrt{2\log(\#J_T)} \le \lambda_{ijk} = O\bigl(T^{-1/2}\sqrt{\log T}\bigr) \tag{2.3}$$
would be appropriate. Particular such choices are the "individual thresholds"
$$\lambda_{ijk} = \sigma_{ijk}\sqrt{2\log(\#J_T)}$$
and the "universal threshold"
$$\lambda^{(i)}_T = \sigma^{(i)}_T\sqrt{2\log(\#J_T)}, \qquad \sigma^{(i)}_T = \max_{(j,k)\in J_T}\{\sigma_{ijk}\}.$$
Let $\hat\lambda_{ijk}$ be estimators of $\lambda_{ijk}$ or $\lambda^{(i)}_T$, respectively, which satisfy at least the following minimal condition

(A5) (i) $\sum_{(j,k)\in J_T} P\bigl(\hat\lambda_{ijk} < \gamma_T\lambda_{ijk}\bigr) = O(T^{\kappa})$, where $\kappa < 1/(2m_i+1)$, for some sequence $\gamma_T \to 1$,
(ii) $\sum_{(j,k)\in J_T} P\bigl(\hat\lambda_{ijk} > CT^{-1/2}\sqrt{\log T}\bigr) = O(T^{-1})$.

With such thresholds $\hat\lambda_{ijk}$ we build the estimator
$$\hat a_i(u) = \sum_{k\in I^0_l}\tilde\alpha^{(i)}_{lk}\varphi_{lk}(u) + \sum_{(j,k)\in J_T}\delta^{(\cdot)}\bigl(\tilde\beta^{(i)}_{jk},\hat\lambda_{ijk}\bigr)\psi_{jk}(u), \tag{2.4}$$
where $\delta^{(\cdot)}$ stands for $\delta^{(h)}$ or $\delta^{(s)}$, respectively.

Finally we would like to impose an additional condition on the matrix $D$ defined by (7.4) in Section 7.1. Basically, this matrix is the analogue of the $p\times(T-p)$ matrix $((X_{t-m}))_{t=p+1,\ldots,T;\,m=1,\ldots,p}$, as arising in the classical Yule-Walker equations, which describe the corresponding least squares problem for a stationary AR($p$) process $\{X_t\}$. Here, we assume additionally that

(A6) $E\|(D'D)^{-1}\|^{2+\delta} = O(T^{-2-\delta})$ for some $\delta > 0$.

Theorem 2.1. (i) Assume (A1) through (A5). Then
$$\sup_{a_i\in\mathcal{F}_i}\bigl\{ E\bigl( \|\hat a_i - a_i\|^2_{L_2[0,1]} \wedge C \bigr) \bigr\} = O\bigl( (\log T / T)^{2m_i/(2m_i+1)} \bigr).$$
(ii) If in addition (A6) is fulfilled, then
$$\sup_{a_i\in\mathcal{F}_i}\bigl\{ E\|\hat a_i - a_i\|^2_{L_2[0,1]} \bigr\} = O\bigl( (\log T / T)^{2m_i/(2m_i+1)} \bigr).$$
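As an illustration of the two thresholding rules and of the universal threshold appearing in (2.3), here is a small Python sketch (ours, not the authors' code); `sigma_hat` stands for whatever estimate of $\sigma_{ijk}$, or of its maximum $\sigma^{(i)}_T$, is plugged in:

```python
import numpy as np

def hard_threshold(beta, lam):
    """delta^(h)(beta, lambda) = beta * 1{|beta| >= lambda}."""
    beta = np.asarray(beta, dtype=float)
    return beta * (np.abs(beta) >= lam)

def soft_threshold(beta, lam):
    """delta^(s)(beta, lambda) = (|beta| - lambda)_+ * sgn(beta)."""
    beta = np.asarray(beta, dtype=float)
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def universal_threshold(sigma_hat, n_coeffs):
    """Universal rule lambda = sigma_hat * sqrt(2 * log(#J_T)), cf. (2.3)."""
    return sigma_hat * np.sqrt(2.0 * np.log(n_coeffs))
```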

Remark 2. Even without (A6) we can show that $D'D$ is close to its expectation $ED'D$, and hence $\lambda_{\min}(D'D)$ is bounded away from zero, except on an event with very small probability. To take this event into account, the somewhat unusual truncated loss function is introduced in part (i) of the above theorem.

Remark 3. In our estimator (2.4) we restricted ourselves to a fixed primary resolution level $l$, that is, $l$ does not change with growing sample size $T$. In principle, we could allow $l$ to increase with $T$ at a sufficiently slow rate. This was already considered, e.g., by Hall and Patil (1995) in a different context. We expect the same rate for the risk of our estimator (2.4) as long as $2^{l(T)} = O(T^{1/(2m+1)})$, which can be shown similarly to the methods in Hall and Patil (1995).

It is known that the rate $T^{-2m/(2m+1)}$ is minimax for estimating a function with degree of smoothness $m$ in a variety of settings (regression, density estimation, spectral density estimation). Although we do not have a rigorous proof of its optimality in the present context, we conjecture that we cannot do better in estimating the $a_i$'s.

Analogously to Donoho and Johnstone (1992) we could attain exactly the rate $T^{-2m_i/(2m_i+1)}$ by the use of level-dependent thresholds $\lambda^{(i)}(j,T,\mathcal{F}_i)$. These thresholds, however, would depend on the assumed degree of smoothness $m_i$, and it seems to be difficult to determine them in a fully data-driven way. In a simple model with Gaussian white noise Donoho and Johnstone (1995) showed that full adaptivity can be reached by minimization of an empirical version of the risk, using Stein's unbiased estimator of risk. Because of our rather strong version of asymptotic normality we are convinced that we could attain this optimal rate of convergence in the same way. Let us, however, note that the "log-thresholds" are much easier to apply, with only the small loss of a logarithmic factor in the rate. The perhaps surprising fact that a single estimator is optimal within some logarithmic factor in a large scale of smoothness classes can be explained by a methodology quite different from conventional smoothing techniques: rather than aiming at an asymptotic balance between squared bias and variance of the estimator, which usually leads to the optimal rate of convergence, we perform something like an informal significance test on the coefficients. This leads to a slightly oversmoothed, but nevertheless near-optimal estimator.

3. Statistical properties of the empirical coefficients

Before we prove the main theorem in Section 5, we give an exact definition of the empirical coefficients and state some of their statistical properties.

First note that our estimator, as a truncated orthogonal series estimator with nonlinearly modified empirical coefficients, involves two smoothing methodologies: one part of the smoothing is due to the truncation above some level $j^*$. Whereas such a truncation amounts to a linear, spatially non-adaptive technique, the more important smoothing is due to the pre-test-like thresholding step applied to the coefficients below the level $j^*$. This step aims at selecting those coefficients whose absolute value lies significantly above the noise level and sorting the others out.

From the definition of the Besov norm we obtain (cf. Theorem 8 in Donoho et al. (1995))
$$\sup_{a_i\in\mathcal{F}_i}\Bigl\{ \sum_{j\ge j^*}\sum_k |\beta^{(i)}_{jk}|^2 \Bigr\} = O\bigl(2^{-2j^*\tilde s_i}\bigr), \tag{3.1}$$
where $\tilde s_i = m_i + 1/2 - 1/\min\{p_i,2\}$. Hence, our loss due to the truncation is of order $T^{-2m_i/(2m_i+1)}$ if $j^*$ is chosen such that $2^{-2j^*\tilde s_i} = O(T^{-2m_i/(2m_i+1)})$. According to our assumption that $\tilde s_i > 1$, it can be shown by simple algebra that $j^*$ with $2^{j^*-1} \le T^{1/2} < 2^{j^*}$ is large enough.

A first observation about the statistical behavior of the empirical coefficients is stated by the following assertion.

Proposition 3.1. Assume (A1) through (A4), and (A6). Then
(i) $E(\tilde\alpha^{(i)}_{lk} - \alpha^{(i)}_{lk})^2 = O(T^{-1})$,
(ii) $E(\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk})^2 = O(T^{-1})$
hold uniformly in $i$, $k$ and $j < j^*$.

In view of the nonlinear structure of the estimator, the above assertion will not be strong enough to derive an efficient estimate for the rate of the risk of the estimator. If the empirical coefficients were Gaussian, then the number of $O(2^{j^*})$ coefficients would be dramatically reduced by thresholding with thresholds that are larger by a factor of $\sqrt{2\log(\#J_T)}$ than the noise level. If we want to tune this thresholding method in accordance with our particular case of non-Gaussian coefficients, we have to investigate their tail behavior. Hence, we state asymptotic normality of the coefficients with a special emphasis on moderate and large deviations. To prove the following theorem we decompose the empirical coefficients into a certain quadratic form and some remainder terms of smaller order of magnitude. Then we derive upper estimates for the cumulants of these quadratic forms, which provide asymptotic normality in terms of large deviations due to a lemma by Rudzkis, Saulis and Statulevicius (1978); see Lemma 6.2 in Section 6.

It turns out that we can state asymptotic normality for empirical coefficients $\tilde\beta^{(i)}_{jk}$ with $(j,k)$ from the following set of indices. Let, for arbitrarily small $\kappa$, $0 < \kappa < 1/2$,
$$\tilde J_T = \bigl\{ (j,k) \,\big|\, 2^j \le T^{\kappa},\ j < j^*,\ k\in I_j \bigr\}.$$

Proposition 3.2. Assume (A1) through (A4). Then
$$P\bigl( (\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk})/\sigma_{ijk} \ge x \bigr) = (1 - \Phi(x)) + o\bigl(\min\{1-\Phi(x),\Phi(x)\}\bigr) + O(T^{-\lambda})$$
uniformly in $(j,k)\in\tilde J_T$, $x\in\mathbb{R}$, for arbitrary $\lambda < \infty$.

We now derive the asymptotic variances of the $\tilde\beta^{(i)}_{jk}$'s. For simplicity of notation, again w.l.o.g. restricted to the treatment of the CDV bases, we identify $\psi_1,\ldots,\psi_\eta$ ($\eta = 2^{j^*}$) with $\varphi_{l1},\ldots,\varphi_{l,2^l},\psi_{l1},\ldots,\psi_{l,2^l},\ldots,\psi_{j^*-1,1},\ldots,\psi_{j^*-1,2^{j^*-1}}$ and $\tilde\beta^{(i)}_1,\ldots,\tilde\beta^{(i)}_\eta$ with $\tilde\alpha^{(i)}_{l1},\ldots,\tilde\alpha^{(i)}_{l,2^l},\tilde\beta^{(i)}_{l1},\ldots,\tilde\beta^{(i)}_{l,2^l},\ldots,\tilde\beta^{(i)}_{j^*-1,1},\ldots,\tilde\beta^{(i)}_{j^*-1,2^{j^*-1}}$, respectively.

Furthermore, let
$$c(s,k) := \int_{-\pi}^{\pi}\frac{\sigma^2(s)}{2\pi}\Bigl| 1 + \sum_{j=1}^{p}a_j(s)\exp(i\lambda j) \Bigr|^{-2}\exp(i\lambda k)\,d\lambda. \tag{3.2}$$
$c(s,k)$ is the local covariance of lag $k$ at time $s\in[0,1]$ (cf. Lemma 8.1).

Proposition 3.3. Assume (A1) through (A4) and (A6). Then
$$\mathrm{var}\bigl(\tilde\beta^{(i)}_u\bigr) = T^{-1}\bigl( A^{-1}BA^{-1} \bigr)_{p(u-1)+i,\,p(u-1)+i} + o(T^{-1}), \tag{3.3}$$
where
$$A_{p(u-1)+k,\,p(v-1)+l} = \int\psi_u(s)\psi_v(s)\,c(s,k-l)\,ds, \qquad B_{p(u-1)+k,\,p(v-1)+l} = \int\psi_u(s)\psi_v(s)\,\sigma^2(s)\,c(s,k-l)\,ds.$$
Furthermore, $A^{-1}BA^{-1} \ge E^{-1}$, where
$$E_{p(u-1)+k,\,p(v-1)+l} = \int\psi_u(s)\psi_v(s)\,(\sigma^2(s))^{-1}c(s,k-l)\,ds.$$
The eigenvalues of $E$ are uniformly bounded.

Remark 4. The above form of $A$ and $B$ suggests different estimates for the variances of $\tilde\beta^{(i)}_u$ and therefore also for the thresholds. One possibility is to use (3.3) and plug in a preliminary estimate ($\sigma^2(s)$ may be estimated by a local sum of squared residuals). Another possibility is to use a nonparametric estimate of the local covariances $c(s,k)$. However, these suggestions require further investigation.

4. Some numerical examples

Before proving our main theorem we want to apply the procedure to two simulated autoregressive processes of order $p = 2$, both of length $T = 1024 = 2^{10}$:
$$X_{t,T} + a_1(t/T)X_{t-1,T} + a_2(t/T)X_{t-2,T} = \varepsilon_t, \qquad t = 1,\ldots,T,$$
where the $\varepsilon_t$ are i.i.d. standard normal, i.e. $E\varepsilon_t = 0$ and $\mathrm{var}(\varepsilon_t) = 1$. In both examples the autoregressive parameters $a_i = a_i(t/T)$, $i = 1, 2$, are functions which change over time, i.e. our simulated examples are realizations of a nonstationary process which follows the model (1.1).

In the first example, $a_1(u) = -1.69$ for $u \le 0.6$ and $a_1(u) = -1.38$ for $u > 0.6$, whereas $a_2(u) = 0.81$ for all $0 \le u \le 1$, i.e. the first coefficient is a piecewise constant function with a jump at $u = 0.6$ and the second coefficient is constant over time. This gives a time-varying spectral density of the process $\{X_{t,T}\}$ which has a peak at $\pi/9$ for $t \le 0.6T$ and at $4\pi/9$ for $t > 0.6T$ (see Figure 1, bottom right plot).

We have applied our estimation procedure using Haar wavelets and fixing the scale of our least squares (LS) procedure to be $j^* = 5$, i.e. $\eta = 32$. Afterwards we feed the resulting solution $\tilde\alpha^{(i)}$, for each $i = 1, 2$ a vector of length $\eta$ (cf. also equation (7.2)), into our fast wavelet transform, apply hard thresholding to all resulting wavelet coefficients $\tilde\beta^{(i)}_{jk}$ on scales $j = 0,\ldots,4$, and apply the fast inverse wavelet transform up to scale 10, our original sample scale. Hereby, we use a universal data-driven $\sqrt{2\log\eta}$ threshold based on an empirical variance estimate from the finest wavelet scale $j^*-1 = 4$.

[Figure 1: nine panels showing, for a_1 and a_2, the true function, the preliminary LS solution and the wavelet estimator (axis: Time), together with the true AR spectrum, the AR spectrum by the preliminary LS fit and the AR spectrum by wavelets (axes: Time, Frequency).]

Figure 1. Example 1: Preliminary LS solution, wavelet threshold estimator and true function for a_1, a_2 and resulting AR(2) spectrum.

In Figure 1 we show, for $a_1$ (left column) and $a_2$ (middle column), in the upper row the solution $\tilde\alpha^{(i)}$ of the LS procedure (without performing nonlinear wavelet thresholding), for comparison upsampled (interpolated) to scale 10. In the middle row the nonlinear wavelet estimator (on scale 10) is shown, and in the bottom row the corresponding true function, all on an equispaced grid of resolution $T^{-1} = 2^{-10}$ of the interval $[0,1]$. In the right column, by grey-scale images in the time-frequency plane, we plot the resulting time-varying semiparametric spectral density, based on the respective (estimated and true) autoregressive coefficient functions. Note that the darker the scale, the higher the value of the two-dimensional object as a function of time and frequency.

Note that although the number of samples used for denoising by nonlinear wavelet thresholding is comparatively small ($\eta = 32$ only), this second step delivers an additional significant contribution, whose smoothness differs considerably from that of the LS solution alone.

We did not try different threshold rules, which possibly could improve a bit on the denoising. We found that the simple automatic universal rule is quite satisfactory, also because it is in accordance with the theoretically admissible range of thresholds given by (2.3). Of course, in one or the other realization we observed that one of the coefficients containing only noise was randomly not set to zero, which, not surprisingly, had some disturbing effect on the visual appearance of the estimator, in particular of the constant autoregressive coefficient. Also, both in this and the next example we did not observe any significant difference between using hard or soft thresholding.

Our second example is a slight modification of both our first example and the one to be found in Dahlhaus (1997): again, the second autoregressive coefficient is constant over time; however, the first one shows a smooth time variation of different phase and oscillation between the imposed jumps at $u = 0.25$ and $u = 0.75$. This was achieved by choosing $a_1(u) = -1.8\cos(1.5 - \cos(4\pi u + \pi))$ for $u \le 0.25$ and for $u > 0.75$, and $a_1(u) = -1.8\cos(3 - \cos(4\pi u + \pi/2))$ for $0.25 < u \le 0.75$, whereas again $a_2(u) = 0.81$ for all $0 \le u \le 1$.

A simulation of this process with $T = 1024$ is shown in Figure 2. It is the same realization that was used for the estimation procedure. Clearly one can observe the nonstationary behaviour of this process.

Here, we chose as wavelet basis a (periodic) Daubechies wavelet with $N = 4$ vanishing moments, i.e. filter length $2N = 8$, and we chose $\eta = 64$, i.e. $j^* = 6$. Note that for this specific example we replaced wavelets on the interval by a traditional periodic basis simply for reasons of computational convenience, as our chosen example is periodic with respect to time. However, we do not expect a big difference in performance between these two bases. In Figure 3 we have plotted again the LS solutions, the estimators based on wavelet hard thresholding with the same universal threshold rule as before, and the true functions, both for $a_1$, $a_2$ and for the resulting time-varying autoregressive spectrum.
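For concreteness, the two-step procedure used in the examples can be sketched in Python as follows (our own minimal implementation, not the authors' software; all function names are ours). It uses the fact that for the Haar basis the least-squares problem (2.2) decouples into ordinary AR($p$) least-squares fits on the $2^{j^*}$ dyadic blocks; the block-wise estimates are then denoised by a Haar transform with hard thresholding, the noise level being estimated from the finest detail scale as described above:

```python
import numpy as np

def block_ls_ar(x, p, n_blocks):
    """Block-wise least-squares AR(p) fits on the 2^{j*} dyadic segments.
    For the Haar basis these block estimates carry the same information as the
    scaling coefficients at level j* (up to the factor 2^{-j*/2}); cross-block
    lag terms at the segment boundaries are ignored in this sketch."""
    L = len(x) // n_blocks
    a_hat = np.zeros((n_blocks, p))
    for b in range(n_blocks):
        seg = x[b * L:(b + 1) * L]
        y = seg[p:]                                            # response X_t
        D = np.column_stack([seg[p - i:len(seg) - i] for i in range(1, p + 1)])
        a_hat[b] = -np.linalg.lstsq(D, y, rcond=None)[0]       # X_t = -sum_i a_i X_{t-i} + eps_t
    return a_hat                                               # a_hat[b, i-1] ~ a_i on block b

def haar_dwt(c):
    """Orthonormal Haar analysis down to a single scaling coefficient."""
    details = []
    c = np.asarray(c, dtype=float)
    while len(c) > 1:
        even, odd = c[0::2], c[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        c = (even + odd) / np.sqrt(2.0)
    return c, details[::-1]              # scaling coefficient, detail levels coarse to fine

def haar_idwt(c, details):
    for d in details:                    # coarse to fine
        new = np.empty(2 * len(c))
        new[0::2] = (c + d) / np.sqrt(2.0)
        new[1::2] = (c - d) / np.sqrt(2.0)
        c = new
    return c

def denoise_coefficient_path(a_path):
    """Hard-threshold the Haar detail coefficients of one estimated coefficient
    path with the universal rule; the noise level is estimated from the finest
    detail scale (scale j* - 1), as in the examples."""
    c0, details = haar_dwt(a_path)
    n = sum(len(d) for d in details)
    sigma_hat = np.std(details[-1])
    lam = sigma_hat * np.sqrt(2.0 * np.log(n))
    details = [d * (np.abs(d) >= lam) for d in details]
    return haar_idwt(c0, details)

# Example 1 style usage (X from the simulation sketch in Section 1, p = 2, eta = 32 blocks):
# a_hat = block_ls_ar(X, p=2, n_blocks=32)
# a1_smooth = denoise_coefficient_path(a_hat[:, 0])
# a2_smooth = denoise_coefficient_path(a_hat[:, 1])
```

The result is a piecewise constant estimate on the $2^{j^*}$ blocks, which can then be upsampled to the original sampling grid as in Figures 1 and 3.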

[Figure 2: plot of the simulated process data of Example 2 against t = 0,...,1024; values range roughly between -40 and 40.]

Figure 2. Example 2: Realisation of a stretch of length T = 1024.

[Figure 3: panels showing, for a_1 and a_2, the preliminary LS solution, the wavelet estimator and the true function (axis: Time), together with the AR spectrum by pre-LS, by wavelets and the true AR spectrum (axes: Time, Frequency).]

Figure 3. Example 2: Preliminary LS solution, wavelet threshold estimator and true function for a_1, a_2 and resulting AR(2) spectrum.
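The time-varying AR spectra in the right-hand columns of Figures 1 and 3 follow directly from the coefficient functions via the local spectral density implicit in (3.2), $f(u,\lambda) = \sigma^2(u)(2\pi)^{-1}|1 + \sum_{j=1}^{p}a_j(u)\exp(i\lambda j)|^{-2}$. A small Python sketch (ours, under the same assumptions as the previous sketches) evaluating this on a time-frequency grid:

```python
import numpy as np

def tvar_spectrum(a_paths, sigma2_path, n_freq=128):
    """Time-varying AR spectrum f(u, lambda) = sigma^2(u)/(2*pi)
       * |1 + sum_j a_j(u) * exp(1j*lambda*j)|^{-2}
    on a grid of rescaled times u (rows of a_paths) and frequencies in [-pi, pi]."""
    lam = np.linspace(-np.pi, np.pi, n_freq)
    n_time, p = a_paths.shape
    spec = np.empty((n_time, n_freq))
    for t in range(n_time):
        transfer = 1.0 + sum(a_paths[t, j - 1] * np.exp(1j * lam * j)
                             for j in range(1, p + 1))
        spec[t] = sigma2_path[t] / (2.0 * np.pi) / np.abs(transfer) ** 2
    return lam, spec

# e.g. with the thresholded estimates from the sketch above and sigma^2 = 1:
# lam, f_hat = tvar_spectrum(np.column_stack([a1_smooth, a2_smooth]),
#                            np.ones(len(a1_smooth)))
```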

5. Proof of the main theorem

To simplify the treatment of some particular remainder terms which occasionally arise in the following proofs, as e.g. in the decomposition (7.5), we introduce the following notation.

Definition 5.1. We write $Z_T = \tilde O(\gamma_T)$ if for each $\lambda < \infty$ there exists a $C = C(\lambda)$ such that
$$P\bigl( |Z_T| > C\gamma_T \bigr) \le CT^{-\lambda}.$$
(If we use this notation simultaneously for an increasing number of random variables, we mean the existence of a universal constant only depending on $\lambda$.)

Proof of Theorem 2.1. We prove only (ii). The proof of (i) without the additional assumption (A6) is very similar, because the stochastic properties of the $\tilde\beta^{(i)}_{jk}$'s are then nearly the same. The only difference is that we cannot guarantee the finiteness of moments of the $\tilde\beta^{(i)}_{jk}$'s, and therefore we need the truncation in the loss function.

Using the monotonicity of $\delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},\cdot)$ in the second argument we get
$$\bigl( \delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},\hat\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2 \le
\begin{cases}
(\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk})^2 + \bigl(\delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk}\bigr)^2, & \text{if } \hat\lambda_{ijk} < \gamma_T\lambda_{ijk},\\[2pt]
\bigl(\delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk}\bigr)^2 + \bigl(\delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},CT^{-1/2}\sqrt{\log T}) - \beta^{(i)}_{jk}\bigr)^2, & \text{if } \gamma_T\lambda_{ijk} \le \hat\lambda_{ijk} \le CT^{-1/2}\sqrt{\log T},\\[2pt]
\bigl(\delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},CT^{-1/2}\sqrt{\log T}) - \beta^{(i)}_{jk}\bigr)^2 + (\beta^{(i)}_{jk})^2, & \text{if } \hat\lambda_{ijk} > CT^{-1/2}\sqrt{\log T},
\end{cases}$$

which implies the decomposition
$$\begin{aligned}
E\|\hat a_i - a_i\|^2 &\le \sum_k E(\tilde\alpha^{(i)}_{lk} - \alpha^{(i)}_{lk})^2 + \sum_{(j,k)\in J_T} E\bigl(\delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},\hat\lambda_{ijk}) - \beta^{(i)}_{jk}\bigr)^2 + \sum_{j\ge j^*}\sum_{k\in I_j}(\beta^{(i)}_{jk})^2\\
&\le \sum_k E(\tilde\alpha^{(i)}_{lk} - \alpha^{(i)}_{lk})^2 + \sum_{(j,k)\in J_T} E\bigl(\delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk}\bigr)^2 + \sum_{(j,k)\in J_T} E\bigl(\delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},CT^{-1/2}\sqrt{\log T}) - \beta^{(i)}_{jk}\bigr)^2\\
&\quad + \sum_{(j,k)\in J_T} E\,I\bigl(\hat\lambda_{ijk} < \gamma_T\lambda_{ijk}\bigr)(\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk})^2 + \sum_{(j,k)\in J_T}(\beta^{(i)}_{jk})^2\,P\bigl(\hat\lambda_{ijk} > CT^{-1/2}\sqrt{\log T}\bigr) + \sum_{j\ge j^*}\sum_{k\in I_j}(\beta^{(i)}_{jk})^2\\
&= S_1 + \ldots + S_6. \qquad (5.1)
\end{aligned}$$
By (i) of Proposition 3.1 we get immediately
$$S_1 = O(T^{-1}). \tag{5.2}$$
Let $(j,k)\in\tilde J_T$. We choose a constant $\kappa_{jk}$ such that
$$\delta^{(\cdot)}(\beta,\gamma_T\lambda_{ijk}) \ge \beta^{(i)}_{jk} \quad\text{if } \beta - \beta^{(i)}_{jk} > \kappa_{jk}, \qquad \delta^{(\cdot)}(\beta,\gamma_T\lambda_{ijk}) \le \beta^{(i)}_{jk} \quad\text{if } \beta - \beta^{(i)}_{jk} < \kappa_{jk}.$$
(W.l.o.g. we assume $\delta^{(\cdot)}(\kappa_{jk} + \beta^{(i)}_{jk},\gamma_T\lambda_{ijk}) \ge \beta^{(i)}_{jk}$.)

Let $\kappa_T = CT^{-1/2}\sqrt{\log T}$ for some appropriate $C$. Then we decompose the terms occurring in the sum $S_2$ as follows:
$$S^{jk}_{21} = E\,I\bigl( \kappa_{jk} \le \tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk} < \kappa_T \bigr)\bigl( \delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2,$$
$$S^{jk}_{22} = E\,I\bigl( -\kappa_T < \tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk} < \kappa_{jk} \bigr)\bigl( \delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2$$
and
$$S^{jk}_{23} = E\,I\bigl( |\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk}| \ge \kappa_T \bigr)\bigl( \delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2.$$

Using Proposition 3.2 we get, with $\xi^{(i)}_{jk} \sim N(\beta^{(i)}_{jk},\sigma^2_{ijk})$, due to integration by parts w.r.t. $x$,
$$\begin{aligned}
S^{jk}_{21} &= -\int\Bigl[ I(\kappa_{jk} \le x < \kappa_T)\bigl( \delta^{(\cdot)}(\beta^{(i)}_{jk}+x,\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2 \Bigr]\,d\bigl\{ P\bigl( \tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk} \ge x \bigr) \bigr\}\\
&= \int\bigl\{ P\bigl( \tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk} \ge x \bigr) \bigr\}\,d\Bigl[ I(\kappa_{jk} \le x < \kappa_T)\bigl( \delta^{(\cdot)}(\beta^{(i)}_{jk}+x,\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2 \Bigr] + P\bigl( \tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk} \ge \kappa_{jk} \bigr)\bigl( \delta^{(\cdot)}(\beta^{(i)}_{jk}+\kappa_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2\\
&\le C_T\Bigl\{ \int\bigl\{ P\bigl( \xi^{(i)}_{jk} - \beta^{(i)}_{jk} \ge x \bigr) \bigr\}\,d\Bigl[ I(\kappa_{jk} \le x < \kappa_T)\bigl( \delta^{(\cdot)}(\beta^{(i)}_{jk}+x,\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2 \Bigr] + P\bigl( \xi^{(i)}_{jk} - \beta^{(i)}_{jk} \ge \kappa_{jk} \bigr)\bigl( \delta^{(\cdot)}(\beta^{(i)}_{jk}+\kappa_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2 \Bigr\} + O(T^{-\lambda})\\
&= C_T\,E\,I\bigl( \kappa_{jk} \le \xi^{(i)}_{jk} - \beta^{(i)}_{jk} < \kappa_T \bigr)\bigl( \delta^{(\cdot)}(\xi^{(i)}_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2 + O(T^{-\lambda})
\end{aligned}$$
for some $C_T \to 1$. Analogously we get
$$S^{jk}_{22} \le C_T\,E\,I\bigl( -\kappa_T \le \xi^{(i)}_{jk} - \beta^{(i)}_{jk} < \kappa_{jk} \bigr)\bigl( \delta^{(\cdot)}(\xi^{(i)}_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2 + O(T^{-\lambda}).$$
Finally, we have for any $\delta_1$ with $0 < \delta_1 < \delta$ and $\delta$ as in (A6) that
$$S^{jk}_{23} \le \bigl( P\bigl( |\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk}| \ge \kappa_T \bigr) \bigr)^{1-2/(2+\delta_1)}\bigl( E\bigl| \delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr|^{2+\delta_1} \bigr)^{2/(2+\delta_1)} = O(T^{-\lambda}),$$
which implies
$$E\bigl( \delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2 \le C_T\,E\bigl( \delta^{(\cdot)}(\xi^{(i)}_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2 + O(T^{-\lambda}). \tag{5.3}$$
From Lemma 1 in Donoho and Johnstone (1994) we can immediately derive the formula
$$E\bigl( \delta^{(\cdot)}(\xi^{(i)}_{jk},\lambda) - \beta^{(i)}_{jk} \bigr)^2 \le C\Bigl( \sigma^2_{ijk}\,\varphi\bigl(\tfrac{\lambda}{\sigma_{ijk}}\bigr)\bigl(\tfrac{\lambda}{\sigma_{ijk}} + 1\bigr) + \min\{(\beta^{(i)}_{jk})^2,\lambda^2\} \Bigr), \tag{5.4}$$
where $\varphi$ denotes the standard normal density. This implies, by Theorem 7 in Donoho et al. (1995), that
$$\sum_{(j,k)\in\tilde J_T} E\bigl( \delta^{(\cdot)}(\xi^{(i)}_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2 = O\Bigl( T^{-1}(\#\tilde J_T)^{1-\gamma_T^2}\sqrt{\log T} + \sum_{(j,k)\in\tilde J_T}\min\{(\beta^{(i)}_{jk})^2,(\gamma_T\lambda_{ijk})^2\} \Bigr) = O\bigl( (\log T/T)^{2m_i/(2m_i+1)} \bigr).$$
Therefore, in conjunction with (5.3), we obtain that
$$\sum_{(j,k)\in\tilde J_T} E\bigl( \delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2 = O\bigl( (\log T/T)^{2m_i/(2m_i+1)} \bigr). \tag{5.5}$$

Further we get, because of $|\delta^{(\cdot)}(x,\lambda) - x| \le \lambda$, that
$$\sum_{(j,k)\in J_T\setminus\tilde J_T} E\bigl( \delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2 \le \sum_{(j,k)\in J_T\setminus\tilde J_T}\bigl[ 2E(\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk})^2 + 2(\gamma_T\lambda_{ijk})^2 \bigr] = \#(J_T\setminus\tilde J_T)\,O(T^{-1}\log T).$$
If we choose $\kappa$ in the definition of $\tilde J_T$ in such a way that $\kappa < 1/(2m_i+1)$, we obtain, by $\#(J_T\setminus\tilde J_T) = O(T^{\kappa})$, that
$$\sum_{(j,k)\in J_T\setminus\tilde J_T} E\bigl( \delta^{(\cdot)}(\tilde\beta^{(i)}_{jk},\gamma_T\lambda_{ijk}) - \beta^{(i)}_{jk} \bigr)^2 = O\bigl( T^{-2m_i/(2m_i+1)} \bigr). \tag{5.6}$$
By analogous considerations we can show that
$$S_3 = O\bigl( (\log T/T)^{2m_i/(2m_i+1)} \bigr). \tag{5.7}$$
From (7.14) and (7.22) we have
$$\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk} = \tilde O\bigl( T^{-1/2}\sqrt{\log T} + 2^{-j/2}T^{-1/2}\log T \bigr),$$
which implies by (A5)(i) and Lemma 8.2 that
$$S_4 = O\bigl( T^{-1}(\log T)^2 \bigr)\sum_{(j,k)\in J_T} P\bigl( \hat\lambda_{ijk} < \gamma_T\lambda_{ijk} \bigr) + C\sum_{(j,k)\in J_T}\bigl( P\bigl( |\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk}| > CT^{-1/2}\log T \bigr) \bigr)^{\delta_1/(2+\delta_1)}\bigl( E|\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk}|^{2+\delta_1} \bigr)^{2/(2+\delta_1)} = O\bigl( T^{-2m_i/(2m_i+1)} \bigr). \tag{5.8}$$
The relation
$$S_5 = O\bigl( T^{-2m_i/(2m_i+1)} \bigr) \tag{5.9}$$
is obvious, due to (A5)(ii). Finally, it can be shown by simple algebra that
$$S_6 = O\bigl( 2^{-2j^*\tilde s_i} \bigr) = O\bigl( T^{-2m_i/(2m_i+1)} \bigr), \tag{5.10}$$
which completes the proof.

6. Asymptotic normality of quadratic forms

In this section we list the basic technical lemmas which are necessary to prove asymptotic normality or to find stochastic estimates for quadratic forms. First, we quote a lemma that provides upper estimates for the cumulants of quadratic forms that satisfy a certain condition on their cumulant sums. This result is a generalization of Lemma 2 in Rudzkis (1978), which was formulated specifically for quadratic forms that occur in periodogram-based kernel estimators of a spectral density. We obtain a slightly improved estimate, which turns out to be important, e.g., for certain quadratic forms with sparse matrices.

We consider the quadratic form
$$Q_T = X_T'AX_T,$$
where $X_T = (X_1,\ldots,X_T)'$, $A = ((a_{ij}))_{i,j=1,\ldots,T}$, $a_{ij} = a_{ji}$. Further, let
$$\bar Q_T = Y_T'AY_T,$$
where $Y_T = (Y_1,\ldots,Y_T)'$ is a zero-mean Gaussian vector with the same covariance matrix as $X_T$.

Lemma 6.1. Assume $EX_t = 0$ and, for some $\gamma \ge 0$,
$$\sup_{1\le t_1\le T}\Bigl\{ \sum_{t_2,\ldots,t_k=1}^{T}\bigl| \mathrm{cum}(X_{t_1},\ldots,X_{t_k}) \bigr| \Bigr\} \le C^k(k!)^{1+\gamma} \quad\text{for all } T \text{ and } k = 2, 3, \ldots$$
Then, for $n \ge 2$,
$$\mathrm{cum}_n(Q_T) = \mathrm{cum}_n(\bar Q_T) + R_n,$$
where
(i) $|\mathrm{cum}_n(\bar Q_T)| \le \mathrm{var}(\bar Q_T)\,2^{n-2}(n-1)!\,\bigl[ \lambda_{\max}(A)\lambda_{\max}(\mathrm{Cov}(X_T)) \bigr]^{n-2}$,
(ii) $R_n \le 2^{n-2}C^{2n}\bigl((2n)!\bigr)^{1+\gamma}\max_{s,t}\{|a_{st}|\}\,\tilde A\,\|A\|_{\infty}^{n-2}$, with $\tilde A = \sum_s\max_t\{|a_{st}|\}$ and $\|A\|_{\infty} = \max_s\bigl( \sum_t|a_{st}| \bigr)$.

The proof of this lemma is given in Neumann (1996).

Using the above lemma we obtain useful estimates for the cumulants, which can be used to derive asymptotic normality. For the reader's convenience we quote two basic lemmas on the asymptotic distribution of $Q_T$. The first one, which is due to Rudzkis, Saulis and Statulevicius (1978), states asymptotic normality under a certain relation between the variance and the higher order cumulants of $Q_T$. Even if such a favorable relation is not given, we can still get estimates for probabilities of large deviations on the basis of the second lemma, which is due to Bentkus and Rudzkis (1980).

Lemma 6.2. (Rudzkis, Saulis, Statulevicius (1978)) Assume, for some $\Delta_T \to 0$,
$$\bigl| \mathrm{cum}_n\bigl( Q_T/\sqrt{\mathrm{var}(Q_T)} \bigr) \bigr| \le (n!)^{1+\gamma}\,\Delta_T^{\,n-2} \quad\text{for } n = 3, 4, \ldots$$
Then
$$\frac{P\bigl( \pm(Q_T - EQ_T)/\sqrt{\mathrm{var}(Q_T)} \ge x \bigr)}{1 - \Phi(x)} \longrightarrow 1$$

holds uniformly over $0 \le x \le \delta_T$, where $\delta_T = o\bigl( \Delta_T^{-1/(3+6\gamma)} \bigr)$.

Lemma 6.3. (Bentkus, Rudzkis (1980)) Let
$$|\mathrm{cum}_n(Q_T)| \le \Bigl( \frac{n!}{2} \Bigr)^{1+\gamma}H_T\,\Delta_T^{-(n-2)} \quad\text{for } n = 2, 3, \ldots$$
Then, for $x \ge 0$,
$$P(\pm Q_T \ge x) \le \exp\Bigl( -\frac{x^2}{2\bigl[ H_T + \bigl( x/\Delta_T^{1/(1+2\gamma)} \bigr)^{(1+2\gamma)/(1+\gamma)} \bigr]} \Bigr) \le
\begin{cases}
\exp\bigl( -\tfrac{x^2}{4H_T} \bigr), & \text{if } 0 \le x \le \bigl( H_T^{1+\gamma}\Delta_T \bigr)^{1/(1+2\gamma)},\\[2pt]
\exp\bigl( -\tfrac{1}{4}(x\Delta_T)^{1/(1+\gamma)} \bigr), & \text{if } x \ge \bigl( H_T^{1+\gamma}\Delta_T \bigr)^{1/(1+2\gamma)}.
\end{cases}$$

7. Derivation of the asymptotic distribution of the empirical coefficients

7.1. Preparatory considerations. Before we turn directly to the proofs of Propositions 3.1 through 3.3, we represent the empirical coefficients in a form that allows us to recognize easily the nature of every remainder term. Note that throughout the rest of the paper, for notational convenience, we omit the double index in the sequence $\{X_{t,T}\}$, i.e. in the following let $X_t := X_{t,T}$.

Although it is essential for our procedure to have a multiresolution basis, i.e. empirical coefficients from different resolution levels, it turns out to be easier to analyze the statistical behavior of such coefficients coming from a single level. Since the empirical coefficients of the multiresolution basis can be obtained as linear combinations of coefficients of an appropriate monoresolution basis, we are able to derive their asymptotic distribution.

Since both $\{\varphi_{l1},\ldots,\varphi_{l,2^l},\psi_{l1},\ldots,\psi_{l,2^l},\ldots,\psi_{j^*-1,1},\ldots,\psi_{j^*-1,2^{j^*-1}}\}$ and $\{\varphi_{j^*1},\ldots,\varphi_{j^*,2^{j^*}}\}$ are orthonormal bases of the same space $V_{j^*}$, the minimization of (2.2) is equivalent to that of
$$\sum_{t=p+1}^{T}\Bigl( X_t + \sum_{i=1}^{p}\Bigl[ \sum_{k\in I^0_{j^*}}\alpha^{(i)}_{j^*k}\varphi_{j^*k}(t/T) \Bigr] X_{t-i} \Bigr)^2. \tag{7.1}$$
Assume for a moment that $D'D$ is positive definite, which is indeed true with a probability exceeding $1 - O(T^{-\lambda})$. The solution $\tilde\alpha = (\tilde\alpha^{(1)}_{j^*1},\ldots,\tilde\alpha^{(p)}_{j^*1},\ldots,\tilde\alpha^{(1)}_{j^*\eta},\ldots,\tilde\alpha^{(p)}_{j^*\eta})'$, $\eta = \#I^0_{j^*} = 2^{j^*}$, can be written as the least squares estimator
$$\tilde\alpha = (D'D)^{-1}D'Y \tag{7.2}$$
in the linear model
$$Y = D\alpha + \zeta, \tag{7.3}$$

where $Y = (X_{p+1},\ldots,X_T)'$,
$$D = -\begin{pmatrix}
\varphi_{j^*1}(\tfrac{p+1}{T})X_p & \cdots & \varphi_{j^*1}(\tfrac{p+1}{T})X_1 & \cdots & \varphi_{j^*\eta}(\tfrac{p+1}{T})X_p & \cdots & \varphi_{j^*\eta}(\tfrac{p+1}{T})X_1\\
\varphi_{j^*1}(\tfrac{p+2}{T})X_{p+1} & \cdots & \varphi_{j^*1}(\tfrac{p+2}{T})X_2 & \cdots & \varphi_{j^*\eta}(\tfrac{p+2}{T})X_{p+1} & \cdots & \varphi_{j^*\eta}(\tfrac{p+2}{T})X_2\\
\vdots & \ddots & \vdots & \ddots & \vdots & \ddots & \vdots\\
\varphi_{j^*1}(\tfrac{T}{T})X_{T-1} & \cdots & \varphi_{j^*1}(\tfrac{T}{T})X_{T-p} & \cdots & \varphi_{j^*\eta}(\tfrac{T}{T})X_{T-1} & \cdots & \varphi_{j^*\eta}(\tfrac{T}{T})X_{T-p}
\end{pmatrix}, \tag{7.4}$$
$$\alpha = (\alpha^{(1)}_{j^*1},\ldots,\alpha^{(p)}_{j^*1},\ldots,\alpha^{(1)}_{j^*\eta},\ldots,\alpha^{(p)}_{j^*\eta})'$$
and $\zeta = (\zeta_{p+1},\ldots,\zeta_T)'$ (we write $\zeta$ for the residual vector).

The residual term in (7.3) can, for $t = p+1,\ldots,T$, be written as
$$\zeta_t = X_t - (D\alpha)_{t-p} = -\sum_{i=1}^{p}a_i(t/T)X_{t-i} + \varepsilon_t + \sum_{i=1}^{p}\sum_{k\in I^0_{j^*}}\alpha^{(i)}_{j^*k}\varphi_{j^*k}(t/T)X_{t-i} = \sum_{i=1}^{p}R_i(t/T)X_{t-i} + \varepsilon_t,$$
where
$$R_i(u) = -a_i(u) + \sum_{k\in I^0_{j^*}}\alpha^{(i)}_{j^*k}\varphi_{j^*k}(u) = -\sum_{j\ge j^*}\sum_{k\in I_j}\beta^{(i)}_{jk}\psi_{jk}(u).$$
With the definitions
$$S = \Bigl( \sum_{i=1}^{p}R_i\bigl(\tfrac{p+1}{T}\bigr)X_{p+1-i},\;\ldots,\;\sum_{i=1}^{p}R_i\bigl(\tfrac{T}{T}\bigr)X_{T-i} \Bigr)' \quad\text{and}\quad e = (\varepsilon_{p+1},\ldots,\varepsilon_T)',$$
we decompose the right-hand side of (7.2) as
$$\tilde\alpha = (D'D)^{-1}D'D\alpha + (ED'D)^{-1}D'e + \bigl[ (D'D)^{-1} - (ED'D)^{-1} \bigr]D'e + (D'D)^{-1}D'S = \alpha + T_1 + T_2 + T_3. \tag{7.5}$$
Because of the abovementioned relation between the two orthonormal bases of $V_{j^*}$, there exists an orthonormal $(\eta\times\eta)$-matrix $\Gamma$ with
$$(\varphi_{l1},\ldots,\varphi_{l,2^l},\psi_{l1},\ldots,\psi_{l,2^l},\ldots,\psi_{j^*-1,1},\ldots,\psi_{j^*-1,2^{j^*-1}})' = \Gamma\,(\varphi_{j^*1},\ldots,\varphi_{j^*\eta})'.$$
This implies
$$(\alpha^{(i)}_{j^*1},\ldots,\alpha^{(i)}_{j^*\eta})\,(\varphi_{j^*1},\ldots,\varphi_{j^*\eta})' = (\alpha^{(i)}_{j^*1},\ldots,\alpha^{(i)}_{j^*\eta})\,\Gamma'\,(\varphi_{l1},\ldots,\varphi_{l,2^l},\psi_{l1},\ldots,\psi_{j^*-1,2^{j^*-1}})'.$$

Hence, having the least squares estimator $(\tilde\alpha^{(i)}_{j^*1},\ldots,\tilde\alpha^{(i)}_{j^*\eta})$ according to the basis $\{\varphi_{j^*1},\ldots,\varphi_{j^*\eta}\}$, we obtain the least squares estimator in model (2.2) as
$$\bigl( \tilde\alpha^{(i)}_{l1},\ldots,\tilde\alpha^{(i)}_{l,2^l},\tilde\beta^{(i)}_{l1},\ldots,\tilde\beta^{(i)}_{l,2^l},\ldots,\tilde\beta^{(i)}_{j^*-1,1},\ldots,\tilde\beta^{(i)}_{j^*-1,2^{j^*-1}} \bigr)' = \Gamma\,\bigl( \tilde\alpha^{(i)}_{j^*1},\ldots,\tilde\alpha^{(i)}_{j^*\eta} \bigr)'.$$
In other words, every empirical coefficient $\tilde\beta^{(i)}_{jk}$ which is part of the solution to (2.2) can be written as
$$\tilde\beta^{(i)}_{jk} = \nu_{ijk}'\tilde\alpha, \tag{7.6}$$
where $\|\nu_{ijk}\|_{\ell_2} = 1$. (Analogously, $\tilde\alpha^{(i)}_{lk} = \nu_{ik}'\tilde\alpha$.)

7.2. Proofs of Propositions 3.1, 3.2 and 3.3.

Proof of Proposition 3.1. For notational convenience we write down the proof for empirical coefficients $\tilde\beta^{(i)}_{jk}$ only. The proof for the $\tilde\alpha^{(i)}_{lk}$'s is analogous. According to (7.5) we have
$$\tilde\beta^{(i)}_{jk} = \beta^{(i)}_{jk} + \nu_{ijk}'T_1 + \nu_{ijk}'T_2 + \nu_{ijk}'T_3. \tag{7.7}$$
From (i) and (iii) of Lemma 8.3 we conclude
$$E(\nu_{ijk}'T_1)^2 = \nu_{ijk}'(ED'D)^{-1}\mathrm{Cov}(D'e)(ED'D)^{-1}\nu_{ijk} \le \|\nu_{ijk}\|_2^2\,\|(ED'D)^{-1}\|_2^2\,\|\mathrm{Cov}(D'e)\|_2 = O(T^{-1}). \tag{7.8}$$
The vector $\nu_{ijk}$ has a length of support of $O(2^{j^*-j})$, which implies
$$\sum_l|(\nu_{ijk})_l| \le \|\nu_{ijk}\|_2\sqrt{\#\{l \mid (\nu_{ijk})_l \ne 0\}} = O\bigl( 2^{(j^*-j)/2} \bigr). \tag{7.9}$$
We have, by Taylor expansion of the matrix $(D'D)^{-1}$, $T_2 = T_{21} + T_{22}$, where
$$T_{21} = (ED'D)^{-1}\bigl( (ED'D) - D'D \bigr)(ED'D)^{-1}D'e$$
and
$$\|T_{22}\|_2 = \tilde O\bigl( \|(ED'D)^{-1}\|_2^3\,\|(ED'D) - D'D\|_2^2\,\|D'e\|_2 \bigr).$$
Using (i) of Lemma 8.3, (8.8) and (8.9) we get
$$\|T_{21}\|_{\infty} \le \|(ED'D)^{-1}\|_{\infty}^2\,\|(ED'D) - D'D\|_{\infty}\,\|D'e\|_{\infty} = \tilde O\bigl( 2^{j^*/2}T^{-1}\log T \bigr). \tag{7.10}$$
Since we have enough moment assumptions, we obtain the analogous rate, but without the logarithmic factor, for the second moment of $\nu_{ijk}'T_{21}$, i.e.
$$E(\nu_{ijk}'T_{21})^2 = O\bigl( 2^{j^*-j}\,2^{j^*}\,T^{-2} \bigr). \tag{7.11}$$
Further, we have
$$\nu_{ijk}'T_{22} = \tilde O\bigl( 2^{3j^*/2}T^{-3/2}\log T \bigr). \tag{7.12}$$
Using (i) of Lemma 8.3 and (i) of Lemma 8.4 we get
$$\|(D'D)^{-1}\|_2 \le \|(ED'D)^{-1}\|_2 + \|(D'D)^{-1} - (ED'D)^{-1}\|_2 = O(T^{-1}) + \tilde O\bigl( 2^{j^*/2}T^{-3/2}\sqrt{\log T} \bigr),$$

which yields, in conjunction with Lemma 8.5, that
$$\nu_{ijk}'T_3 = O\bigl( \|(D'D)^{-1}\|_2\|D'S\|_2 \bigr) = \tilde O\Bigl( \bigl( 2^{-j^*\min\{\tilde s_i\}} + T^{-1/2}2^{-j^*\min\{m_i-1/2-1/(2p_i)\}} \bigr)\sqrt{\log T} \Bigr) = \tilde O\bigl( T^{-1/2-\epsilon} \bigr) \tag{7.13}$$
for some $\epsilon > 0$. Now we infer from (7.7), (7.8) and (7.11) through (7.13), which are in part $\tilde O$-results rather than estimates for the expectations, that
$$E\,I(\Omega_0)\bigl( (\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk})^2 \bigr) = O(T^{-1}),$$
where $\Omega_0$ is an appropriate event with $P(\Omega_0) \ge 1 - O(T^{-\lambda})$ for $\lambda < \infty$ chosen arbitrarily large. This implies, in conjunction with Lemma 8.2, with $0 < \delta_1 < \delta$, that
$$E\,I(\Omega_0^c)\bigl( (\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk})^2 \bigr) \le \bigl( E|\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk}|^{2+\delta_1} \bigr)^{2/(2+\delta_1)}\bigl( P(\Omega_0^c) \bigr)^{1-2/(2+\delta_1)} = O(T^{-1}),$$
which finishes the proof.

Proof of Proposition 3.2. It will turn out that the asymptotic distribution of $\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk}$ is essentially determined by the behavior of $\nu_{ijk}'T_1$. By (7.9), (7.10), (7.12) and (7.13) from the proof of Proposition 3.1 we infer that
$$\nu_{ijk}'(T_2 + T_3) = \tilde O\bigl( 2^{-j/2}T^{-1/2}\log T + T^{-1/2-\epsilon} \bigr) \tag{7.14}$$
for some $\epsilon > 0$.

First, note that the process $\{X_{t,T}\}$ admits an MA($\infty$) representation
$$X_{t,T} = \sum_{s=0}^{\infty}\psi_{t,T}(s)\,\varepsilon_{t-s} \tag{7.15}$$
with
$$\sum_{s=0}^{\infty}\sup_{t,T}\{|\psi_{t,T}(s)|\} \le C \quad\text{for all } T;$$
see Künsch (1995).

Now we turn to the derivation of the asymptotic distribution of $\nu_{ijk}'T_1$. It is clear that, because of the MA($\infty$) representation of the process, $\nu_{ijk}'T_1$ can be rewritten as $\sum_{u,v}A_{u,v}\varepsilon_u\varepsilon_v$ for some symmetric matrix $A = A(i,j,k)$. In the following, without writing down the explicit form of this matrix, we derive upper estimates for $\|A\|_{\infty}$ and $\tilde A = \sum_u\max_v\{|A_{u,v}|\}$.

We have
$$\nu_{ijk}'T_1 = -\sum_{t=p+1}^{T}\varepsilon_t\sum_{l=1}^{p}X_{t-l}\sum_{u=1}^{\eta}\varphi_{j^*u}(t/T)\sum_v\bigl( (ED'D)^{-1} \bigr)_{p(u-1)+l,\,v}(\nu_{ijk})_v = \sum_{l,s}\Bigl[ \sum_t\varepsilon_t\varepsilon_{t-l-s}\,w_t(l,s) \Bigr], \tag{7.16}$$

where
$$w_t(l,s) = -\psi_{t-l}(s)\sum_{u=1}^{\eta}\varphi_{j^*u}(t/T)\sum_v\bigl( (ED'D)^{-1} \bigr)_{p(u-1)+l,\,v}(\nu_{ijk})_v.$$
If we write the expression in brackets on the right-hand side of (7.16) as $\sum_{ij}\tilde W_{ij}\varepsilon_i\varepsilon_j$, we obtain, by $\sup_v\{|(\nu_{ijk})_v|\} = O(2^{-(j^*-j)/2})$, that
$$\|\tilde W\|_{\infty} = O\bigl( T^{-1}\sup_t\{|\psi_{t-l}(s)|\}\,2^{j/2} \bigr). \tag{7.17}$$
We can also rewrite $w_t(l,s)$ as
$$w_t(l,s) = -\psi_{t-l}(s)\sum_v(\nu_{ijk})_v\sum_u\bigl( (ED'D)^{-1} \bigr)_{v,\,p(u-1)+l}\,\varphi_{j^*u}(t/T),$$
which implies, by $\sum_v|(\nu_{ijk})_v| = O(2^{(j^*-j)/2})$ and by $\sum_t\varphi_{j^*u}(t/T) = O(2^{-j^*/2}T)$, that
$$\sum_i\sup_j\{|\tilde W_{ij}|\} = \sum_t|w_t(l,s)| = O(2^{-j/2}). \tag{7.18}$$
Because of (A3), the summation over $s$ does not affect the rates in (7.17) and (7.18), and neither does the (finite) sum over $l$. Hence, with the notation of Lemma 6.1, we obtain
$$\|A\|_{\infty} = O(T^{-1}2^{j/2}), \tag{7.19}$$
$$\tilde A = O(2^{-j/2}). \tag{7.20}$$
Let $(j,k)\in\tilde J_T$. Using Lemma 6.1 we obtain
$$\bigl| \mathrm{cum}_n(\nu_{ijk}'T_1) \bigr| \le C^nT^{-1}(n!)^{2+2\gamma}\bigl( T^{-1}2^{j/2} \bigr)^{n-2}, \tag{7.21}$$
which implies by Lemma 6.2
$$P\bigl( \pm(\nu_{ijk}'T_1)/\sigma_{ijk} \ge x \bigr) = (1 - \Phi(x))(1 + o(1)) \tag{7.22}$$
uniformly in $0 \le x \le \delta_T$, $\delta_T \ge T^{\epsilon}$ for some $\epsilon > 0$. This relation can obviously be extended to $x\in(-\infty,\delta_T]$.

Recall that
$$\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk} = \nu_{ijk}'T_1 + \tilde O\bigl( T^{-1/2-\epsilon} \bigr) \tag{7.23}$$
holds for some $\epsilon > 0$. Therefore we have for arbitrary $\lambda < \infty$ that
$$P\bigl( \pm(\nu_{ijk}'T_1)/\sigma_{ijk} \ge x + CT^{-\epsilon} \bigr) - CT^{-\lambda} \le P\bigl( \pm(\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk})/\sigma_{ijk} \ge x \bigr) \le P\bigl( \pm(\nu_{ijk}'T_1)/\sigma_{ijk} + CT^{-\epsilon} \ge x \bigr) + CT^{-\lambda},$$
which implies
$$P\bigl( \pm(\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk})/\sigma_{ijk} \ge x \bigr) = \bigl[ 1 - \Phi(x) \bigr](1 + o(1)) + O\bigl( |\Phi(x) - \Phi(x + CT^{-\epsilon})| \bigr) + O\bigl( |\Phi(x) - \Phi(x - CT^{-\epsilon})| \bigr) + O(T^{-\lambda}). \tag{7.24}$$

Fix any $c > 1$. For $x \le c$ we have obviously
$$|\Phi(x) - \Phi(x + CT^{-\epsilon})| \le CT^{-\epsilon}\varphi(0) = o(1 - \Phi(x)). \tag{7.25}$$
For $c < x \le (2\lambda\log T)^{1/2}$ we obtain by a formula for Mill's ratio (see Johnson and Kotz (1970, vol. 2, p. 278)) that
$$|\Phi(x) - \Phi(x + CT^{-\epsilon})| \le CT^{-\epsilon}\varphi(x) \le CT^{-\epsilon}x\Bigl( 1 - \frac{1}{x^2} \Bigr)^{-1}(1 - \Phi(x)) \le CT^{-\epsilon}x\Bigl( 1 - \frac{1}{c^2} \Bigr)^{-1}(1 - \Phi(x)) = o(1 - \Phi(x)). \tag{7.26}$$
The third term on the right-hand side of (7.24) can be treated analogously. For $x > C(2\lambda\log T)^{1/2}$ we have obviously
$$P\bigl( \pm(\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk})/\sigma_{ijk} \ge x \bigr) = O(T^{-\lambda}) = (1 - \Phi(x))(1 + o(1)) + O(T^{-\lambda}), \tag{7.27}$$
which completes the proof.

Proof of Proposition 3.3. Because of $ET_1 = 0$ we have
$$\mathrm{Cov}(T_1) = ET_1T_1' = (ED'D)^{-1}\,\mathrm{Cov}(D'e)\,(ED'D)^{-1},$$
which implies by (ii) and (iii) of Lemma 8.3 that
$$\|\mathrm{Cov}(T_1) - F^{-1}GF^{-1}\|_{\infty} = o(T^{-1}),$$
where
$$F = \Bigl( \Bigl\{ T\int\varphi_{j^*u}(s)\varphi_{j^*v}(s)\,c(s,k-l)\,ds \Bigr\}_{p(u-1)+k,\,p(v-1)+l} \Bigr) \quad\text{and}\quad G = \Bigl( \Bigl\{ T\int\varphi_{j^*u}(s)\varphi_{j^*v}(s)\,\sigma^2(s)\,c(s,k-l)\,ds \Bigr\}_{p(u-1)+k,\,p(v-1)+l} \Bigr).$$
This yields
$$\|\mathrm{Cov}(\Gamma T_1) - \Gamma F^{-1}\Gamma'\,\Gamma G\Gamma'\,\Gamma F^{-1}\Gamma'\|_{\infty} = \|\mathrm{Cov}(\Gamma T_1) - A^{-1}BA^{-1}\|_{\infty} = o(T^{-1}).$$
Further, due to (6.13) we have
$$E\bigl( \nu_{ijk}'(T_2 + T_3) \bigr)^2 = o(T^{-1}),$$
which proves the first assertion (3.3).

The matrix $\begin{pmatrix} B & A\\ A & E \end{pmatrix}$ is non-negative definite, which leads with Theorem 12.2.21(5) of Graybill (1983) to $A^{-1}BA^{-1} \ge E^{-1}$. Furthermore, we have with $x\in\mathbb{C}^{p\eta}$
$$\bar x'Ex = \int_0^1\int_{-\pi}^{\pi}|A(s,\lambda)|^2(\sigma^2(s))^{-1}\Bigl| \sum_{u,k}x_{p(u-1)+k}\,\psi_u(s)\exp(i\lambda k) \Bigr|^2 d\lambda\,ds \le C\int_0^1\int_{-\pi}^{\pi}\Bigl| \sum_{u,k}x_{p(u-1)+k}\,\psi_u(s)\exp(i\lambda k) \Bigr|^2 d\lambda\,ds = 2\pi C\|x\|^2,$$
which implies that the eigenvalues of $E$ are uniformly bounded.

8. Appendix

In order to preserve a clear presentation of our results, we put some of the technical calculations into this separate section. We assume throughout this section that the assumptions (A1) through (A5) are satisfied. Let $\Sigma_{t,T} = \mathrm{Cov}\bigl( (X_{t-1,T},\ldots,X_{t-p,T})' \bigr)$.

Lemma 8.1. By (A3), with some constants $C_1, C_2 > 0$,
(i) $\lambda_{\max}(\Sigma_{t,T}) \le C_2$ and $\lambda_{\min}(\Sigma_{t,T}) \ge C_1 + o(1)$, where the $o(1)$ is uniform in $t$,
(ii) there exists some function $g$, with $g(s) \to 0$ as $s \to 0$, such that
$$\|\Sigma_{t_1,T} - \Sigma_{t_2,T}\| \le g\Bigl( \frac{t_1 - t_2}{T} \Bigr) \quad\text{for all } t_1, t_2, T,$$
(iii) $c(s,k-l)$ is uniformly continuous in $s$ and
$$\lim_{T\to\infty,\; t/T\to s}\mathrm{cov}(X_{t-l,T},X_{t-k,T}) = c(s,k-l).$$

Proof. Completely analogously to the proof of Theorem 2.3 in Dahlhaus (1996) we can show that $X_{t,T}$ has the representation
$$X_{t,T} = \int_{-\pi}^{\pi}\exp(i\lambda t)\,A^0_{t,T}(\lambda)\,d\xi(\lambda) \quad\text{with}\quad \sup_{t,\lambda}\bigl| A^0_{t,T}(\lambda) - A(t/T,\lambda) \bigr| = o(1),$$
where $\xi(\lambda)$ is a process with mean zero and orthonormal increments,
$$A^0_{t,T}(\lambda) = \frac{1}{\sqrt{2\pi}}\sum_{l=0}^{\infty}\psi_{t,T}(l)\exp(-i\lambda l),$$
with $\psi_{t,T}(l)$ given by the MA($\infty$) representation (7.15), and
$$A(s,\lambda) = \frac{\sigma(s)}{\sqrt{2\pi}}\Bigl( 1 + \sum_{j=1}^{p}a_j(s)\exp(-i\lambda j) \Bigr)^{-1}.$$

Then
$$\mathrm{cov}(X_{t-l,T},X_{t-k,T}) = \int_{-\pi}^{\pi}\exp(i\lambda(k-l))\,A^0_{t-l,T}(\lambda)\,A^0_{t-k,T}(-\lambda)\,d\lambda.$$
Since $A(s,\lambda)$ is uniformly continuous in $s$, this is equal to
$$\int_{-\pi}^{\pi}\exp(i\lambda(k-l))\,|A(s,\lambda)|^2\,d\lambda + o(1) = c(s,k-l) + o(1), \qquad\text{for } t/T\to s,$$
which implies (iii). Analogously, we get (ii). Furthermore, we have for $x = (x_1,\ldots,x_p)'\in\mathbb{C}^p$
$$\bar x'\Sigma_{t,T}x = \int_{-\pi}^{\pi}\Bigl| \sum_{j=1}^{p}x_j\exp(-i\lambda j)A^0_{t-j,T}(\lambda) \Bigr|^2 d\lambda = \int_{-\pi}^{\pi}|A(t/T,\lambda)|^2\Bigl| \sum_{j=1}^{p}x_j\exp(-i\lambda j) \Bigr|^2 d\lambda + \|x\|^2\,o(1).$$
Under (A3) there exist constants with $C_1 \le |A(s,\lambda)| \le C_2$ uniformly in $s$ and $\lambda$, which implies (i).

Lemma 8.2. Assume additionally (A6) and let $0 < \delta_1 < \delta$. Then
(i) $E|\tilde\alpha^{(i)}_{lk} - \alpha^{(i)}_{lk}|^{2+\delta_1} = O(1)$,
(ii) $E|\tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk}|^{2+\delta_1} = O(1)$
hold uniformly in $i$, $k$ and $j < j^*$.

Proof. (i) In this part we derive estimates for the moments of $\|D'e\|$ and $\|D'S\|$, which will be used later in this proof. Using the MA($\infty$) representation of $\{X_t\}$ we can write $(D'e)_{p(u-1)+k}$ as a quadratic form $\varepsilon'A\varepsilon$ for some $A = A(p,k)$, where $\varepsilon = (\varepsilon_T,\ldots,\varepsilon_1,\varepsilon_0,\varepsilon_{-1},\ldots)'$ is an infinite-dimensional vector according to the MA($\infty$) representation of $\{X_t\}$. Since, however, the proof of Lemma 6.1 does not depend on the dimension of the matrix $A$, we can apply this lemma also to this infinite-dimensional case. We obtain, using the notation of Lemma 6.1, that
$$\tilde A = O\bigl( 2^{-j^*/2}T \bigr), \qquad \max\{|a_{st}|\} \le \|A\|_{\infty} = O\bigl( 2^{j^*/2} \bigr),$$
which implies
$$\bigl| \mathrm{cum}_n\bigl( (D'e)_{p(u-1)+k} \bigr) \bigr| \le C^n(n!)^{2+2\gamma}\,T\,(2^{j^*/2})^{n-2} \quad\text{for } n \ge 2.$$
Since $E(D'e)_{p(u-1)+k} = 0$, we get, for even $s$, that
$$E\bigl| (D'e)_{p(u-1)+k} \bigr|^s = O\Bigl( \sum_{r=1}^{s}\;\sum_{\substack{i_1+\ldots+i_r = s\\ i_j\ge 1}}\;\prod_{j=1}^{r}\bigl| \mathrm{cum}_{i_j}\bigl( (D'e)_{p(u-1)+k} \bigr) \bigr| \Bigr) \le C(s)\,T^{s/2}.$$

Now we obtain, with $\eta = O(2^{j^*})$,
$$E\|D'e\|^s = E\Bigl( \sum_{u,k}(D'e)^2_{p(u-1)+k} \Bigr)^{s/2} \le (\eta p)^{s/2-1}\sum_{u,k}E(D'e)^s_{p(u-1)+k} = O\Bigl( (\eta p)^{s/2}\max_{u,k}\bigl\{ E(D'e)^s_{p(u-1)+k} \bigr\} \Bigr) = O\bigl( 2^{j^*s/2}T^{s/2} \bigr). \tag{8.1}$$
Now we treat the quantity $\|D'S\|$ in an analogous way. $(D'S)_{p(u-1)+k}$ is a quadratic form in $X = (X_1,\ldots,X_T)'$ with a matrix $A$ which satisfies, according to (8.11),
$$\tilde A = O\Bigl( \sum_t|\varphi_{j^*u}(t/T)|\sum_i|R_i(t/T)| \Bigr) = O\Bigl( \sum_i\sqrt{\sum_t\varphi_{j^*u}(t/T)^2}\sqrt{\sum_tR_i(t/T)^2} \Bigr) = O\Bigl( T\bigl( 2^{-j^*\min\{\tilde s_i\}} + T^{-1/2}2^{-j^*\min\{m_i-1/2-1/(2p_i)\}} \bigr) \Bigr) = O(T^{1/2})$$
and, by (8.10),
$$\|A\|_{\infty} = O\Bigl( 2^{j^*/2}\sum_i\|R_i\|_{\infty} \Bigr) = O(2^{j^*/2}).$$
Therefore, we get by Lemma 6.1 that
$$\bigl| \mathrm{cum}_n\bigl( (D'S)_{p(u-1)+k} \bigr) \bigr| \le C^n(n!)^{2+2\gamma}\,T\,(2^{j^*/2})^{n-2} \quad\text{for } n \ge 2,$$
which implies, in conjunction with $E(D'S)_{p(u-1)+k} = O(\tilde A) = O(T^{1/2})$, that
$$E\|D'S\|^s = O\bigl( 2^{j^*s/2}T^{s/2} \bigr). \tag{8.2}$$
(ii) According to (7.5) we have $\tilde\alpha - \alpha = (D'D)^{-1}(D'e + D'S)$, which yields that
$$\begin{aligned}
E\bigl| \tilde\beta^{(i)}_{jk} - \beta^{(i)}_{jk} \bigr|^{2+\delta_1} &= E\bigl| \nu_{ijk}'(\tilde\alpha - \alpha) \bigr|^{2+\delta_1} \le E\bigl( \|(D'D)^{-1}\|_2(\|D'e\|_2 + \|D'S\|_2) \bigr)^{2+\delta_1}\\
&\le \bigl( E\|(D'D)^{-1}\|^{2+\delta} \bigr)^{\frac{2+\delta_1}{2+\delta}}\Bigl( E\bigl( \|D'e\| + \|D'S\| \bigr)^{\frac{(2+\delta_1)(2+\delta)}{\delta-\delta_1}} \Bigr)^{1-\frac{2+\delta_1}{2+\delta}}\\
&= O\bigl( T^{-(2+\delta_1)} \bigr)\,O\bigl( (2^{j^*/2}T^{1/2})^{2+\delta_1} \bigr) = O\bigl( (2^{j^*/2}T^{-1/2})^{2+\delta_1} \bigr) = O(1).
\end{aligned}$$

Lemma 8.3. Let $j^* = j^*(T) \to \infty$ and $2^{j^*} = o(T)$. Then
(i) $\|(ED'D)^{-1}\|_{\infty} = O(T^{-1})$,
(ii) $\bigl\| (ED'D)^{-1} - \bigl( \bigl\{ T\int\varphi_{j^*u}(s)\varphi_{j^*v}(s)c(s,k-l)\,ds \bigr\}_{p(u-1)+k,\,p(v-1)+l} \bigr)^{-1} \bigr\|_{\infty} = o(T^{-1})$,
(iii) $\bigl\| \mathrm{Cov}(D'e) - \bigl( \bigl\{ T\int\varphi_{j^*u}(s)\varphi_{j^*v}(s)\sigma^2(s)c(s,k-l)\,ds \bigr\}_{p(u-1)+k,\,p(v-1)+l} \bigr) \bigr\|_{\infty} = o(T)$
hold uniformly in $u, v, k, l$.

Proof. (i) Let $M = T\,\mathrm{Diag}[M_1,\ldots,M_\eta]$, where $M_u = \Sigma_{t,T}$ for any $t$ with $t/T\in\mathrm{supp}(\varphi_{j^*u})$. Because of $M^{-1} = T^{-1}\mathrm{Diag}[M_1^{-1},\ldots,M_\eta^{-1}]$ we get by (i) and (ii) of Lemma 8.1 that
$$\|M^{-1}\|_{\infty} = O(T^{-1}). \tag{8.3}$$
Further, we have, by $j^* = j^*(T) \to \infty$ and $2^{j^*} = o(T)$, that
$$(ED'D - M)_{p(u-1)+k,\,p(v-1)+l} = \sum_{t=p+1}^{T}\varphi_{j^*u}\bigl(\tfrac{t}{T}\bigr)\varphi_{j^*v}\bigl(\tfrac{t}{T}\bigr)\bigl[ (\Sigma_{t,T})_{kl} - (M_u)_{kl} \bigr] + \Bigl[ \sum_{t=p+1}^{T}\varphi_{j^*u}\bigl(\tfrac{t}{T}\bigr)\varphi_{j^*v}\bigl(\tfrac{t}{T}\bigr) - T\delta_{uv} \Bigr](M_u)_{kl} = o(T) \tag{8.4}$$
holds uniformly in $u, v, k, l$. Since $\varphi_{j^*u}$ and $\varphi_{j^*v}$ have disjoint support for $|u-v| \ge C$, we get $(ED'D)_{kl} = 0$ for $|k-l| \ge Cp$. Therefore we obtain by (8.4)
$$\|ED'D - M\|_{\infty} = o(T). \tag{8.5}$$
Because of (8.3) and (8.5) there exists a $T_0$ such that
$$\|M^{-1/2}(ED'D - M)M^{-1/2}\| \le C < 1 \quad\text{for all } T \ge T_0.$$
Therefore, by the spectral decomposition of $(I + M^{-1/2}(ED'D - M)M^{-1/2})$, the following inversion formula holds:
$$(ED'D)^{-1} = \Bigl[ M^{1/2}\bigl( I + M^{-1/2}(ED'D - M)M^{-1/2} \bigr)M^{1/2} \Bigr]^{-1} = M^{-1/2}\Bigl[ I + \sum_{s=1}^{\infty}(-1)^s\bigl( M^{-1/2}(ED'D - M)M^{-1/2} \bigr)^s \Bigr]M^{-1/2}, \tag{8.6}$$
which implies (i).
(ii) It can be shown in the same way as (8.4) that
$$\Bigl\| (ED'D) - \Bigl( \Bigl\{ T\int\varphi_{j^*u}(s)\varphi_{j^*v}(s)c(s,k-l)\,ds \Bigr\}_{p(u-1)+k,\,p(v-1)+l} \Bigr) \Bigr\|_{\infty} = o(T), \tag{8.7}$$

which implies, analogously to (8.6),
$$\Bigl\| (ED'D)^{-1} - \Bigl( \Bigl\{ T\int\varphi_{j^*u}(s)\varphi_{j^*v}(s)c(s,k-l)\,ds \Bigr\} \Bigr)^{-1} \Bigr\|_{\infty} = \Bigl\| (ED'D)^{-1}\sum_{s=1}^{\infty}(-1)^s\Bigl[ \bigl( ED'D - (\{\ldots\}) \bigr)(ED'D)^{-1} \Bigr]^s \Bigr\|_{\infty} = o(T^{-1}).$$
(iii) Obviously we have $ED'e = 0$, which implies
$$\mathrm{cov}\bigl( (D'e)_{p(u-1)+k},(D'e)_{p(v-1)+l} \bigr) = \sum_{s,t=p+1}^{T}\varphi_{j^*u}\bigl(\tfrac{s}{T}\bigr)\varphi_{j^*v}\bigl(\tfrac{t}{T}\bigr)E\varepsilon_s\varepsilon_tX_{s-k}X_{t-l} = \sum_{s=p+1}^{T}\varphi_{j^*u}\bigl(\tfrac{s}{T}\bigr)\varphi_{j^*v}\bigl(\tfrac{s}{T}\bigr)E\varepsilon_s^2\,EX_{s-k}X_{s-l} = T\int\varphi_{j^*u}(s)\varphi_{j^*v}(s)\sigma^2(s)c(s,k-l)\,ds + o(T).$$
The corresponding result in the $\|\cdot\|_{\infty}$-norm follows from the same reasoning leading to (8.5).

Lemma 8.4. It holds that
(i) $\bigl\| (D'D)^{-1} - (ED'D)^{-1} \bigr\|_{\infty} = \tilde O\bigl( 2^{j^*/2}T^{-3/2}\sqrt{\log T} \bigr)$,
(ii) $\|D'e\|_2^2 = \tilde O\bigl( 2^{j^*}T\log T \bigr)$.

Proof. (i) First, observe that by (A2) and the MA($\infty$) representation of $\{X_t\}$,
$$\begin{aligned}
\sum_{t_2,\ldots,t_k=1}^{T}\bigl| \mathrm{cum}(X_{t_1},\ldots,X_{t_k}) \bigr| &= \sum_{t_2,\ldots,t_k=1}^{T}\Bigl| \mathrm{cum}\Bigl( \sum_{s_1=-\infty}^{t_1}\psi_{t_1}(t_1-s_1)\varepsilon_{s_1},\ldots,\sum_{s_k=-\infty}^{t_k}\psi_{t_k}(t_k-s_k)\varepsilon_{s_k} \Bigr) \Bigr|\\
&\le \sum_{s=-\infty}^{t_1}\;\sum_{t_2,\ldots,t_k=s\vee 1}^{T}|\psi_{t_1}(t_1-s)|\cdots|\psi_{t_k}(t_k-s)|\,|\mathrm{cum}_k(\varepsilon_s)|\\
&\le \sup_s\{|\mathrm{cum}_k(\varepsilon_s)|\}\sum_{s=0}^{\infty}|\psi_{t_1}(s)|\Bigl( \sum_{t=s\vee 1}^{T}|\psi_t(t-s)| \Bigr)^{k-1} \le C^{2k}(k!)^{1+\gamma}.
\end{aligned}$$

We see that
$$(D'D)_{p(u-1)+k,\,p(v-1)+l} = \sum_{t=p+1}^{T}\varphi_{j^*u}(t/T)\varphi_{j^*v}(t/T)X_{t-k}X_{t-l}$$
is a quadratic form with a matrix $A$ satisfying, in the notation of Lemma 6.1,
$$\|A\|_{\infty} = O(2^{j^*}), \qquad \tilde A = O(T).$$
This implies by Lemma 6.1 that
$$\bigl| \mathrm{cum}_n\bigl( (D'D)_{p(u-1)+k,\,p(v-1)+l} \bigr) \bigr| \le C^n(n!)^{2+2\gamma}(2^{j^*})^{n-1}T \le \Bigl( \frac{n!}{2} \Bigr)^{1+(1+2\gamma)}H_T\,\Delta_T^{-(n-2)},$$
where $H_T \asymp 2^{j^*}T$ and $\Delta_T \asymp 2^{-j^*}$. Hence, we get by Lemma 6.3 that
$$P\Bigl( \bigl| (D'D)_{p(u-1)+k,\,p(v-1)+l} - (ED'D)_{p(u-1)+k,\,p(v-1)+l} \bigr| \ge x \Bigr) \le \exp\Bigl( -C\frac{x^2}{2^{j^*}T} \Bigr)$$
for $0 \le x \le (H_T^{1+\tilde\gamma}\Delta_T)^{1/(1+2\tilde\gamma)}$, with $\tilde\gamma = 1 + 2\gamma$. Since
$$\bigl( H_T^{1+\tilde\gamma}\Delta_T \bigr)^{1/(1+2\tilde\gamma)} \ge 2^{j^*\tilde\gamma/(1+2\tilde\gamma)}\,T^{(1+\tilde\gamma)/(1+2\tilde\gamma)} \ge 2^{j^*/2}T^{1/2},$$
we get
$$(D'D)_{p(u-1)+k,\,p(v-1)+l} - (ED'D)_{p(u-1)+k,\,p(v-1)+l} = \tilde O\bigl( 2^{j^*/2}T^{1/2}\sqrt{\log T} \bigr).$$
Since $\varphi_{j^*u}$ and $\varphi_{j^*v}$ have disjoint support for $|u-v| \ge C$, we immediately obtain
$$\|D'D - ED'D\|_{\infty} = \tilde O\bigl( 2^{j^*/2}T^{1/2}\sqrt{\log T} \bigr), \tag{8.8}$$
which yields, in conjunction with (i) of Lemma 8.3,
$$\|(D'D)^{-1} - (ED'D)^{-1}\|_{\infty} \le \|(ED'D)^{-1}\|_{\infty}\sum_{s=1}^{\infty}\bigl( \|D'D - ED'D\|_{\infty}\|(ED'D)^{-1}\|_{\infty} \bigr)^s = O(T^{-1})\,\tilde O\bigl( 2^{j^*/2}T^{1/2}\sqrt{\log T}\;T^{-1} \bigr) = \tilde O\bigl( 2^{j^*/2}T^{-3/2}\sqrt{\log T} \bigr).$$
(ii) From similar arguments we obtain
$$(D'e)_{p(u-1)+k} = \tilde O\bigl( T^{1/2}\sqrt{\log T} \bigr), \tag{8.9}$$
which implies (ii).

Lemma 8.5. It holds that
$$\|D'S\|_2^2 = \tilde O\Bigl( T^2\bigl( 2^{-2j^*\min\{\tilde s_i\}} + T^{-1}2^{-j^*\min\{2m_i-1-1/p_i\}} \bigr)\log T \Bigr).$$

Proof. Because of our assumption $m_i + 1/2 - 1/\tilde p_i > 1$ we get
$$\|R_i\|_{\infty} = O\Bigl( \sum_{j\ge j^*}2^{j/2}\max_k\{|\beta^{(i)}_{jk}|\} \Bigr) = O\Bigl( \sum_{j\ge j^*}2^{j/2}2^{-js_i} \Bigr) = O\bigl( 2^{-j^*(m_i-1/p_i)} \bigr) \tag{8.10}$$
and
$$\mathrm{TV}(R_i) = O\Bigl( \sum_{j\ge j^*}2^{j/2}\sum_k|\beta^{(i)}_{jk}| \Bigr) = O\Bigl( \sum_{j\ge j^*}2^{j/2}\Bigl( \sum_k|\beta^{(i)}_{jk}|^{p_i} \Bigr)^{1/p_i}2^{j(1-1/p_i)} \Bigr) = O\Bigl( \sum_{j\ge j^*}2^{j/2}2^{-js_i}2^{j(1-1/p_i)} \Bigr) = O\bigl( 2^{-j^*(m_i-1)} \bigr),$$
which implies
$$\Bigl| T^{-1}\sum_{t=1}^{T}(R_i(t/T))^2 - \|R_i\|^2_{L_2[0,1]} \Bigr| \le \sum_{t=1}^{T}\int_{(t-1)/T}^{t/T}\bigl| R_i(t/T) + R_i(u) \bigr|\,\bigl| R_i(t/T) - R_i(u) \bigr|\,du = \sum_t O\Bigl( T^{-1}\|R_i\|_{\infty}\,\mathrm{TV}(R_i)\big|_{[\frac{t-1}{T},\frac{t}{T})} \Bigr) = O\bigl( T^{-1}2^{-j^*(2m_i-1-1/p_i)} \bigr).$$
Since we know from Theorem 8 in Donoho et al. (1995) that
$$\|R_i\|^2_{L_2[0,1]} = \sum_{j\ge j^*}\sum_k|\beta^{(i)}_{jk}|^2 = O\bigl( 2^{-2j^*\tilde s_i} \bigr),$$
we have that
$$T^{-1}\sum_{t=1}^{T}(R_i(t/T))^2 = O\bigl( 2^{-2j^*\tilde s_i} + T^{-1}2^{-j^*(2m_i-1-1/p_i)} \bigr). \tag{8.11}$$
Now,
$$(D'S)_{p(u-1)+k} = \sum_{t=p+1}^{T}\varphi_{j^*u}(t/T)X_{t-k}\sum_{i=1}^{p}X_{t-i}R_i(t/T) = \tilde O\bigl( 2^{j^*/2}\sqrt{\log T} \bigr)\sum_{t/T\in\mathrm{supp}(\varphi_{j^*u})}\;\sum_{i=1}^{p}|R_i(t/T)|,$$

which implies
$$\begin{aligned}
\|D'S\|_2^2 &= \tilde O\bigl( 2^{j^*}\log T \bigr)\sum_{i=1}^{p}\sum_{u=1}^{\eta}\Bigl( \sum_{t/T\in\mathrm{supp}(\varphi_{j^*u})}|R_i(t/T)| \Bigr)^2 = \tilde O\bigl( 2^{j^*}\log T \bigr)\sum_{i=1}^{p}\sum_{u=1}^{\eta}\Bigl( \sum_{t/T\in\mathrm{supp}(\varphi_{j^*u})}R_i(t/T)^2 \Bigr)\,T2^{-j^*}\\
&= \tilde O\Bigl( T^2\bigl( 2^{-2j^*\min\{\tilde s_i\}} + T^{-1}2^{-j^*\min\{2m_i-1-1/p_i\}} \bigr)\log T \Bigr).
\end{aligned}$$

Acknowledgment. We thank three anonymous referees and the Associate Editor for their suggestions that helped to improve the presentation of our work. Also we are indebted to Markus Monnerjahn, who contributed the software for the numerical simulation results.

References

1. Bentkus, R. and Rudzkis, R. (1980). On exponential estimates of the distribution of random variables. Liet. Mat. Rinkinys 20, 15-30. (In Russian.)
2. Bergh, J. and Löfström, J. (1976). Interpolation Spaces: An Introduction. Springer, Berlin.
3. Brillinger, D. R. (1994). Some asymptotics of wavelets in the stationary error case. Technical Report, Department of Statistics, University of California, Berkeley.
4. Cohen, A., Dahmen, W. and DeVore, R. (1995). Multiscale decompositions on bounded domains. IGPM-Preprint 114, TH Aachen.
5. Cohen, A., Daubechies, I. and Vial, P. (1993). Wavelets on the interval and fast wavelet transforms. Appl. Comp. Harmonic Anal. 1, 54-81.
6. Dahlhaus, R. (1997). Fitting time series models to nonstationary processes. Ann. Statist. 25, 1-37.
7. Dahlhaus, R. (1996). On the Kullback-Leibler information divergence of locally stationary processes. Stoch. Proc. Appl. 62, 139-168.
8. Daubechies, I. (1988). Orthonormal bases of compactly supported wavelets. Comm. Pure Appl. Math. 41, 909-996.
9. Donoho, D. L. (1993). Unconditional bases are optimal bases for data compression and statistical estimation. Appl. Comp. Harmonic Anal. 1, 100-115.
10. Donoho, D. L. and Johnstone, I. M. (1992). Minimax estimation via wavelet shrinkage. Technical Report No. 402, Department of Statistics, Stanford University; Ann. Statist., to appear.
11. Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425-455.
12. Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. J. Am. Statist. Ass. 90, 1200-1224.
13. Donoho, D. L., Johnstone, I. M., Kerkyacharian, G. and Picard, D. (1995). Wavelet shrinkage: asymptopia? (with discussion). J. R. Statist. Soc., Ser. B 57, 301-369.
14. Graybill, F. A. (1983). Matrices with Applications in Statistics. Wadsworth, Belmont, CA.
15. Hall, P. and Patil, P. (1995). Formulae for mean integrated squared error of nonlinear wavelet-based density estimators. Ann. Statist. 23, 905-928.
16. Johnson, N. L. and Kotz, S. (1970). Distributions in Statistics. Continuous Univariate Distributions - 1, 2. Wiley, New York.
17. Johnstone, I. M. and Silverman, B. W. (1997). Wavelet threshold estimators for data with correlated noise. J. Royal Statist. Soc., Ser. B 59, 319-351.

18. Künsch, H. R. (1995). A note on causal solutions for locally stationary AR-processes. ETH Zürich.
19. Meyer, Y. (1991). Ondelettes sur l'intervalle. Revista Matemática Iberoamericana 7 (2), 115-133.
20. Neumann, M. H. (1996). Spectral density estimation via nonlinear wavelet methods for stationary non-Gaussian time series. J. Time Ser. Anal. 17, 601-633.
21. Neumann, M. H. and von Sachs, R. (1995). Wavelet thresholding: beyond the Gaussian i.i.d. situation. In Lecture Notes in Statistics 103: Wavelets and Statistics, A. Antoniadis (ed.), 301-329.
22. Neumann, M. H. and von Sachs, R. (1997). Wavelet thresholding in anisotropic function classes and application to adaptive estimation of evolutionary spectra. Ann. Statist. 25, 38-76.
23. Neumann, M. H. and Spokoiny, V. G. (1995). On the efficiency of wavelet estimators under arbitrary error distributions. Math. Methods Statist. 4, 137-166.
24. Rudzkis, R. (1978). Large deviations for estimates of spectrum of stationary series. Lithuanian Math. J. 18, 214-226.
25. Rudzkis, R., Saulis, L. and Statulevicius, V. (1978). A general lemma on probabilities of large deviations. Lithuanian Math. J. 18, 226-238.
26. Statulevicius, V. A. and Jakimavicius, D. A. (1989). Theorems on large deviations for dependent random variables. Soviet Math. Dokl. 38, 442-445.
27. Triebel, H. (1990). Theory of Function Spaces II. Birkhäuser, Basel.