Mixing Strategies for Density Estimation


By Yuhong Yang

Iowa State University

Abstract

General results on adaptive density estimation are obtained with respect to any countable collection of estimation strategies under Kullback-Leibler and squared $L_2$ losses. It is shown that, without knowing which strategy works best for the underlying density, a single strategy can be constructed by mixing the proposed ones so as to be adaptive in terms of statistical risks. A consequence is that, under some mild conditions, an asymptotically minimax-rate adaptive estimator exists for a given countable collection of density classes; i.e., a single estimator can be constructed to be simultaneously minimax-rate optimal for all the function classes being considered. A demonstration is given for high-dimensional density estimation on $[0,1]^d$, where the constructed estimator adapts to smoothness and interaction order over some piecewise Besov classes and is consistent for all densities with finite entropy.

AMS 1991 subject classifications. Primary 62G07; secondary 62B10, 62C20, 94A29.

Key words and phrases. Density estimation, rates of convergence, adaptation with respect to estimation strategies, minimax adaptation.

1. Introduction. In recent years, there has been increasing interest in adaptive function estimation. The main objective, if possible, is to construct a single estimator so that it is automatically asymptotically optimal in terms of a minimax risk for each function class in a given collection. Adaptive function estimators were constructed, for example, by Efroimovich and Pinsker (1984) and Efroimovich (1985) for ellipsoidal classes; by Härdle and Marron (1985) using adaptive kernel estimators for some Lipschitz classes; and by Donoho et al. and others using wavelet analysis for Besov classes (see, e.g., [15]). General schemes have also been proposed for the construction of adaptive estimators. Barron and Cover (1991) derived general adaptation risk bounds for density estimation based on the minimum description length (MDL) criterion. These bounds were used to demonstrate adaptive properties of the MDL criterion, including adaptation over classical function classes using metric entropies. Lepskii (1991) gave some sufficient conditions ensuring the existence of minimax-rate adaptive estimators and constructed adaptive estimators specifically for ellipsoidal classes under $L_p$ loss for $2 < p \le \infty$. In addition to the MDL criterion, other adaptation schemes by model selection have been developed, including very general penalized contrast criteria in Birgé and Massart (1996) and Barron, Birgé and Massart (1999), with a variety of interesting applications; penalized maximum likelihood criteria in Yang and Barron (1998); and complexity penalized criteria based on V-C theory (see, e.g., Devroye, Györfi and Lugosi (1996, Chapter 18) and Lugosi and Nobel (1996)). Functional aggregation of estimators to adapt to within order $n^{-1/4}$ in $L_2$ risk is proposed in Juditsky and Nemirovski (1996).

Our interest in this work concerns adaptivity in a most general sense. The question we intend to address is: given a countable collection of estimation strategies (regardless of how they have been obtained), is it possible to find an adaptive strategy so that it automatically performs as well as the best one in the list in an asymptotic sense? Such a strategy will be said to be adaptive with respect to the collection of the original ones. In the related context of estimating a functional, negative results have been obtained showing that optimal-rate adaptation may not be possible (see Lepskii (1991) and Brown and Low (1996)). Here we give positive results for global density estimation. Differently from previous work on adaptation, no specific properties are required here on the collection of strategies. Thus the advantages of a list of possibly completely different strategies can be combined in terms of statistical risks, and, if desired, adaptive strategies constructed using various available schemes (e.g., automated kernel smoothing, wavelet procedures, smoothing splines, neural net estimation, etc.) can also be included in the list for even more adaptivity. The benefit of considering such a list of very different procedures could be substantial, especially for high-dimensional density estimation, where, to overcome the curse of dimensionality, searching over different characterizations of functions is desired for better accuracy (see Section 4 for a demonstration).

Estimation strategies are often derived for specific function classes. For a collection of such strategies constructed to be minimax optimal for the corresponding target classes, adaptation with respect to the strategies as explained above implies minimax adaptation with respect to the target classes. In this sense, the notion of adaptation with respect to a collection of strategies is more general than minimax adaptation with respect to a collection of density classes. Results on minimax adaptation will be given as consequences of the main results on combining strategies.

During the revision of an earlier version of this paper, an editor and an associate editor brought to our attention independent research of Catoni (1997), completed after the submission of this work, in which a result similar in spirit to our Theorem 1 under the K-L loss was given.

Density estimation is closely related to universal coding, as illustrated in Barron (1987), Clarke and Barron (1990), Barron and Cover (1991), Yang (1996) (a formal statement is given as Lemma 2.6), and Haussler and Opper (1997). This relationship, as we learned from Barron (1987) and Barron and Cover (1991), will be used for our construction of an adaptive strategy. For adaptation under the squared $L_2$ loss, some results used in our analysis come from Yang and Barron (1999), which derives minimax rates of convergence for a fixed general function class. Some recent results on universal coding are redundancy bounds for Bayes hierarchical coding in Feder and Merhav (1996) and redundancy bounds for individual sequences using a sequential procedure for binary tree sources in Willems et al. (1995).

Finally, it is worth mentioning that Bayesian model averaging methods have also been proposed to combine various models (see, e.g., Kass and Raftery (1995) and Berger and Pericchi (1996)). Our method permits but does not require the estimators in the models to be obtained in a Bayesian framework, which sometimes has difficulties in the choice of insensitive priors on the parameters. In addition, our adaptation recipe works for combining estimation strategies even when some or all of them are not model-based procedures.

1.1. Some setups. Let $X_1, X_2, \ldots, X_n$ be i.i.d. observations with density $f(x)$, $x \in \mathcal{X}$, with respect to a $\sigma$-finite measure $\mu$. Here the space $\mathcal{X}$ is general and could be of any dimension. The goal is to estimate the unknown density $f$ based on the data.

Let $\| g_1 - g_2 \|_2 = \left( \int |g_1 - g_2|^2 \, d\mu \right)^{1/2}$ be the $L_2$ distance between functions $g_1$ and $g_2$ with respect to $\mu$. The Kullback-Leibler (K-L) divergence between two densities $f$ and $g$ is defined as $D(f \,\|\, g) = \int f \log (f/g) \, d\mu$. Both $D(f \,\|\, \hat{f})$ and $\| f - \hat{f} \|_2^2$ will be considered as loss functions.

In this paper, a density estimation strategy $\delta$ refers to an estimation procedure producing density estimators $\hat{f}_{\delta,0},\ \hat{f}_{\delta,1}(x; X_1),\ \ldots,\ \hat{f}_{\delta,n-1}(x; X_1, \ldots, X_{n-1}),\ \ldots$ based on observations $X^0, X^1, \ldots, X^{n-1}, \ldots$, respectively (here $\hat{f}_{\delta,0}$ is an initial guess based on no data ($X^0$), and $X^i = (X_1, \ldots, X_i)$ for $i \ge 1$). Let
\[
R_{seq}(f; n; \delta) = \frac{1}{n+1} \sum_{i=0}^{n} E D(f \,\|\, \hat{f}_{\delta,i})
\]
denote the average cumulative risk for estimating $f$ using strategy $\delta$ up to $n$ observations. This notion of risk (sometimes called redundancy or regret) is considered by many in the context of data compression, prediction, gambling and computational learning theory (see, e.g., Clarke and Barron (1990) and Barron and Xie (1997) for asymptotics on finite-dimensional models, and Yang and Barron (1999, Section 3) and Haussler and Opper (1997) for rates of convergence over a given density class). It is a reasonable and stable discrepancy measure for evaluating different strategies. The individual risk $E D(f \,\|\, \hat{f}_{\delta,n})$ at sample size $n$, denoted $R(f; n; \delta)$, will also be considered. Similarly define $r_{seq}(f; n; \delta)$ and $r(f; n; \delta)$ for the squared $L_2$ loss.

A minimax risk measures the difficulty of estimation in a uniform sense. Let $l$ be a chosen loss function; then for a density estimator $\hat{f}$, the risk is $E\, l(f, \hat{f})$. Let $\mathcal{F}$ be a class of densities. The minimax risk for estimating a density in $\mathcal{F}$ at sample size $n$ is defined as
\[
R(\mathcal{F}; l; n) = \min_{\hat{f}} \max_{f \in \mathcal{F}} E\, l(f, \hat{f}),
\]
where the minimization is over all density estimators.

The symbol "$\asymp$" will be used to mean "of the same order," i.e., $a_n \asymp b_n$ if $a_n / b_n$ is bounded above and away from zero.

The paper is organized as follows. In Section 2, we present results on adaptation with respect to estimation strategies; in Section 3, minimax adaptation results for function classes are given. A demonstration of the results is provided in Section 4. A generalization for prediction with dependent data is given in Section 5. The proofs of the results are given in Section 6.

2. Adaptation with respect to estimation strategies. Let $\{\delta_j, j \ge 1\}$ be any collection of density estimation strategies. Here the index set $\{j \ge 1\}$ is allowed to degenerate to a finite set. As mentioned earlier, there is no restriction at all on the choice of the strategies; they could be proposed for different purposes, for different classes, and/or under different assumptions. Some of them could be based only on heuristics yet have practical significance. Strategy $\delta_j$ produces density estimators $\hat{f}_{j,0},\ \hat{f}_{j,1}(x; X_1),\ \ldots$ based on observations $X^0, X^1, \ldots$, respectively.

2.1. Adaptation under K-L risk. The following is a simple yet powerful recipe, given in Yang (1996), for obtaining an adaptive strategy by mixing $\{\delta_j, j \ge 1\}$. Let $\pi = \{\pi_j, j \ge 1\}$ be a set of positive numbers satisfying $\sum_{j \ge 1} \pi_j = 1$. They may be viewed as weights or prior probabilities of the strategies. Let
\[
q_0(x) = \sum_{j \ge 1} \pi_j \hat{f}_{j,0}(x),
\]
\[
q_1(x; x_1) = \frac{\sum_{j \ge 1} \pi_j \hat{f}_{j,0}(x_1) \hat{f}_{j,1}(x; x_1)}{\sum_{j \ge 1} \pi_j \hat{f}_{j,0}(x_1)},
\]
\[
q_2(x; x_1, x_2) = \frac{\sum_{j \ge 1} \pi_j \hat{f}_{j,0}(x_1) \hat{f}_{j,1}(x_2; x_1) \hat{f}_{j,2}(x; x_1, x_2)}{\sum_{j \ge 1} \pi_j \hat{f}_{j,0}(x_1) \hat{f}_{j,1}(x_2; x_1)},
\]
and, in general,
\[
q_{n-1}(x; x_1, \ldots, x_{n-1}) = \frac{\sum_{j \ge 1} \pi_j \hat{f}_{j,0}(x_1) \hat{f}_{j,1}(x_2; x_1) \cdots \hat{f}_{j,n-2}(x_{n-1}; x_1, \ldots, x_{n-2}) \hat{f}_{j,n-1}(x; x_1, \ldots, x_{n-1})}{\sum_{j \ge 1} \pi_j \hat{f}_{j,0}(x_1) \hat{f}_{j,1}(x_2; x_1) \cdots \hat{f}_{j,n-2}(x_{n-1}; x_1, \ldots, x_{n-2})},
\]
and so on. Define estimators for $i \ge 0$ based on $X_1, \ldots, X_i$ as follows:
\[
\hat{f}_{seq,i}(x) = q_i(x; X_1, \ldots, X_i). \tag{1}
\]
They are valid probability density estimators at each sample size. Call this estimation strategy $\hat{\delta}_{seq}$. This strategy will be shown to be adaptive in terms of the average cumulative risk.
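To make the recipe concrete, here is a minimal computational sketch (ours, not from the paper); the interface, in which each strategy is a callable returning a density estimate built from the past observations, is an assumption made purely for illustration. It uses the equivalent posterior-weight form of the recipe: $q_i$ is a mixture of the current estimators $\hat{f}_{j,i}$ with weights proportional to $\pi_j$ times the product of the past one-step-ahead predictive densities.

```python
import numpy as np

def mixed_density_sequence(strategies, priors, data, x_grid):
    """Sequential mixture q_0, q_1(.; x_1), ... of Section 2.1 (illustrative sketch).

    strategies : list of callables; strategies[j](past) returns a density
                 estimate (itself callable at a point), i.e. f_{j,i} when
                 len(past) == i.
    priors     : prior weights pi_j (positive, summing to one).
    data       : observed sample x_1, ..., x_n (1-d array for simplicity).
    x_grid     : points at which each q_i is evaluated.
    """
    log_w = np.log(np.asarray(priors, dtype=float))   # log(pi_j) + running sum of log predictive densities
    q_values = []
    for i in range(len(data) + 1):
        past = data[:i]
        current = [s(past) for s in strategies]        # estimators f_{j,i} based on x_1, ..., x_i
        w = np.exp(log_w - log_w.max())
        w /= w.sum()                                   # normalized mixing weights
        q_i = sum(wj * np.array([fj(x) for x in x_grid])
                  for wj, fj in zip(w, current))
        q_values.append(q_i)
        if i < len(data):                              # multiply each weight by f_{j,i}(x_{i+1})
            log_w += np.array([np.log(fj(data[i])) for fj in current])
    return q_values
```

Keeping the weights on the log scale avoids numerical underflow when the products of predictive densities become very small, which happens quickly as $n$ grows.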

For adaptation under the individual risk, let
\[
\bar{f}_n(x) = \frac{1}{n+1} \sum_{i=0}^{n} q_i(x; X^i).
\]
It is a valid density estimator of $f$ based on $X^n$. The strategy producing $\bar{f}_n$, $n \ge 1$, will be called a combined strategy, denoted $\hat{\delta}$.

Consider
\[
\inf_{j \ge 1} \left( \frac{1}{n+1} \log \frac{1}{\pi_j} + R_{seq}(f; n; \delta_j) \right).
\]
It is the best trade-off, over all the estimation strategies, between the average cumulative risk and the logarithm of the inverse weight (or prior probability) relative to the sample size.

Theorem 1: For any given countable collection of estimation strategies $\{\delta_j, j \ge 1\}$ and any $\pi$, we can construct a single estimation strategy $\hat{\delta}_{seq}$ as given in the above recipe such that for any underlying density $f$,
\[
R_{seq}(f; n; \hat{\delta}_{seq}) \le \inf_{j \ge 1} \left( \frac{1}{n+1} \log \frac{1}{\pi_j} + R_{seq}(f; n; \delta_j) \right). \tag{2}
\]
The combined strategy $\hat{\delta}$ has individual risk bounded by the same quantity:
\[
R(f; n; \hat{\delta}) \le \inf_{j \ge 1} \left( \frac{1}{n+1} \log \frac{1}{\pi_j} + R_{seq}(f; n; \delta_j) \right). \tag{3}
\]

Remarks:
1. A similar adaptation bound is given in Yang (1997) for nonparametric regression under Gaussian errors with known variance, using a connection between estimating the regression function and estimating the joint density of the observations. Adaptation risk bounds for regression by model selection are in Barron, Birgé and Massart (1999) and Yang (1999).
2. An individual risk bound is given in Catoni (1997) for a similarly defined strategy. His formulation has a computational advantage and can avoid an extra logarithmic term in the risk bound for parametric estimation.

From (2), up to an additive penalty of order $1/n$, the adaptive strategy $\hat{\delta}_{seq}$ performs as well as any strategy in the list in terms of the average cumulative risk. For a strategy $\delta$ with a regularly decreasing risk converging essentially more slowly than in the parametric case, $R_{seq}(f; n; \delta)$ and $R(f; n; \delta)$ are of the same order (see Section 3). When such strategies are combined, (3) ensures adaptation in terms of the individual risk.

For applications, we may assign smaller weights (or prior probabilities) $\pi_j$ to more complex estimation strategies. Then the risk bounds in the theorem are trade-offs between accuracy and complexity. For a complex strategy (with a small weight), its role in the risk bound becomes significant only when the sample size becomes large; a concrete instance of this trade-off is sketched at the end of this subsection.

A strategy is said to be consistent for $f$ under loss $l$ if $E\, l(f, \hat{f}_{\delta,n}) \to 0$ as $n \to \infty$. A simple consequence of Theorem 1 is that if any of the strategies in the list is consistent for the unknown density, so is the combined adaptive strategy $\hat{\delta}$.
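As a concrete illustration of the accuracy-complexity trade-off (our worked example; the weight choice and rate are assumptions made for illustration), take geometric weights $\pi_j = 2^{-j}$, which sum to one, and suppose some strategy $\delta_{j_0}$ in the list has average cumulative risk of a regular nonparametric order, say $R_{seq}(f; n; \delta_{j_0}) \asymp n^{-2\alpha/(2\alpha+r)}$. Then bound (3) gives
\[
R(f; n; \hat{\delta}) \;\le\; \frac{j_0 \log 2}{n+1} + R_{seq}(f; n; \delta_{j_0}) \;\asymp\; n^{-2\alpha/(2\alpha+r)},
\]
so the index $j_0$ of the best strategy enters only through the penalty $(j_0 \log 2)/(n+1)$, which is of the parametric order $1/n$ and is asymptotically negligible relative to the nonparametric rate.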

2.2. Adaptation under $L_2$ loss. For the adaptation results under the squared $L_2$ loss, unlike the bounds in Theorem 1, some mild technical conditions will be used (which arise from relating the K-L and $L_2$ distances). Throughout the paper, for squared $L_2$ adaptation, we assume that the dominating measure $\mu$ is finite and is normalized to be a probability measure, and that the unknown density is uniformly upper bounded, i.e., $\| f \|_\infty \le A < \infty$ for a known constant $A$. For each $f$, let $g = (f+1)/2$ be a mixture of $f$ and the uniform density 1. We have the following conclusion.

Theorem 2: For any given countable collection of strategies $\{\delta_j, j \ge 1\}$, we can construct a strategy $\hat{\delta}$ such that
\[
r(f; n; \hat{\delta}) \le C \inf_{j \ge 1} \left( \frac{1}{n+1} \log \frac{1}{\pi_j} + \frac{1}{n+1} \sum_{i=0}^{n} r(g; i; \delta_j) \right), \tag{4}
\]
where the constant $C$ depends only on $A$.

Note that the risk of the combined strategy at an unknown density $f$ is bounded in terms of the risks of the original strategies at $g = (f+1)/2$ instead of at $f$ itself. For usual nonparametric procedures, the risks at $f$ and $g$ are most likely to be bounded at the same rate. Formally, this does not cause trouble for applications in which minimax risks are considered for nonparametric classes containing both $f$ and $g$, as is the case for the classical convex classes. This technical difficulty is avoided entirely if one is willing to assume that the unknown density is bounded away from zero, in which case the K-L divergence and the squared $L_2$ distance are equivalent and $r(g; i; \delta_j)$ can be replaced by $r(f; i; \delta_j)$ directly in the theorem.

In light of Theorems 1 and 2, adaptive estimators can be obtained using the adaptation recipe for a countable collection of function classes, as will be shown in the next section. The results are also useful for combining estimation procedures with hyperparameters (e.g., the bandwidth of a kernel estimator). Various conclusions can be drawn for a combined strategy with a suitable discretization of the hyperparameters; a sketch of this use is given below.
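As an illustration of the hyperparameter point above (our sketch, not a construction from the paper): each bandwidth value $h$ in a discretized grid defines one strategy, namely the kernel density estimator with that fixed bandwidth, and the recipe of Section 2.1 mixes them. The code below only computes the mixing weights over the grid after the sample has been seen; the Gaussian kernel, the fixed initial guess, and the particular grid are illustrative assumptions.

```python
import numpy as np

def kde(past, h):
    """Gaussian kernel density estimate with bandwidth h; a standard normal
    density serves as the (arbitrary) initial guess before any data are seen."""
    if len(past) == 0:
        return lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
    past = np.asarray(past, dtype=float)
    return lambda x: np.mean(np.exp(-0.5 * ((x - past) / h) ** 2)
                             / (h * np.sqrt(2 * np.pi)))

def bandwidth_mixture_weights(data, bandwidths, priors):
    """Mixing weights over a bandwidth grid after observing the sample.

    Each bandwidth is one strategy; its weight is proportional to the prior
    times the product of its one-step-ahead predictive densities -- exactly
    the weights that appear in q_{n-1} of the recipe."""
    log_w = np.log(np.asarray(priors, dtype=float))
    for i, x_new in enumerate(data):
        preds = np.array([kde(data[:i], h)(x_new) for h in bandwidths])
        log_w += np.log(preds)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

# Example: the weights concentrate on bandwidths that predict the sample well.
rng = np.random.default_rng(0)
sample = rng.normal(size=200)
grid = [0.05, 0.1, 0.2, 0.4, 0.8]
print(bandwidth_mixture_weights(sample, grid, [1.0 / len(grid)] * len(grid)))
```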

3. Adaptation with respect to function classes. Let $\{\mathcal{F}_j, j \ge 1\}$ be a collection of density classes. Assume the true density is in one of the classes, i.e., $f \in \cup_{j \ge 1} \mathcal{F}_j$. The question we want to address is: without knowing which class contains $f$, can we have one estimator (not depending on $j$) that converges asymptotically at the minimax rate of the class containing $f$? If such an estimator exists, we call it a minimax-rate adaptive estimator with respect to the classes $\{\mathcal{F}_j, j \ge 1\}$. This concept of adaptation can obviously be extended to any given collection of classes, not necessarily countable.

We need a regularity condition for our results. The familiar rates of convergence for function estimation are $n^{-\alpha} (\log n)^{\beta}$ for some $0 \le \alpha < 1$ and $\beta \in \mathbb{R}$. When $0 < \alpha < 1$ (so that $R(\mathcal{F}; l; n)$ converges essentially more slowly than in the parametric case), $(1/n) \sum_{i=0}^{n} R(\mathcal{F}; l; i)$ is of the same order as $R(\mathcal{F}; l; n)$. In such a case, we say the class has a regular nonparametric risk rate.

Theorem 3: Let $\{\mathcal{F}_j, j \ge 1\}$ be any collection of density classes, each with a regular nonparametric risk rate under the loss being considered.
1. There exists a combined minimax-rate adaptive strategy under the K-L loss.
2. Assume that $\{\mathcal{F}_j, j \ge 1\}$ is uniformly bounded with $\sup_{j \ge 1} \sup_{f \in \mathcal{F}_j} \| f \|_\infty \le A < \infty$. If, in addition, each $\mathcal{F}_j$ is convex and contains the uniform density 1, or the classes are uniformly bounded away from zero, then there exists a minimax-rate adaptive estimator over the classes under the squared $L_2$ loss.

4. An illustration. Consider estimating a density function on $[0,1]^d$ with respect to Lebesgue measure $\mu$. We focus on the K-L loss.

1. Densities with finite entropy. Let $\mathcal{H}$ consist of all densities with finite entropy, i.e., $\mathcal{H} = \{f : \int f \log f \, d\mu < \infty\}$. Note that densities in $\mathcal{H}$ are not necessarily bounded above or away from zero. The class is very large and no uniform rate of convergence is possible under the K-L or $L_2$ loss.

2. Piecewise Besov classes with different interaction orders and smoothness. To allow discontinuities, consider the following modification of a function class. Let $\mathcal{G}$ be a class of functions on $[0,1]^d$ that are uniformly upper bounded and lower bounded away from zero. For an integer $k$ and positive constants $\gamma_1$ and $\gamma_2$, let
\[
\mathcal{G}_{k,\gamma_1,\gamma_2} = \Big\{ h(x) \sum_{i=1}^{k} b_i 1_{B_i} \big/ c \ : \ h \in \mathcal{G},\ B_i\text{'s are hypercubes partitioning } [0,1]^d \text{ with } \sum_{i=1}^{k} b_i \mu(B_i) \ge \gamma_1 \text{ and } 0 \le b_i \le \gamma_2,\ 1 \le i \le k \Big\}.
\]
Here $c$ is the normalizing constant making $\mathcal{G}_{k,\gamma_1,\gamma_2}$ a class of probability density functions. Note that with the constraints on the $b_i$'s and the volumes of the $B_i$'s, the densities in $\mathcal{G}_{k,\gamma_1,\gamma_2}$ are uniformly upper bounded. However, they are allowed to be 0 on some hypercubes and arbitrarily close to 0 on others. If the functions in $\mathcal{G}$ are smooth, then the densities in $\mathcal{G}_{k,\gamma_1,\gamma_2}$ are piecewise smooth. The modification provides somewhat more flexibility than the original class.

For $1 \le \sigma, q \le \infty$ and $\alpha > 0$, let $B^{\alpha,r}_{q,\sigma}$ be the Besov space consisting of all functions $g \in L_q[0,1]^r$ whose Besov norm satisfies $\| g \|_{B^{\alpha,r}_{q,\sigma}} < \infty$ (see, e.g., Triebel (1975) and DeVore and Lorentz (1993)). Let $B^{\alpha,r}_{q,\sigma}(C)$ denote the subset of positive functions $g$ in the Besov space with $\| g \|_{B^{\alpha,r}_{q,\sigma}} + \| \log g \|_\infty \le C$ (without this extra boundedness assumption, we could not identify the minimax rate of convergence under the K-L loss). Define
\[
S^{\alpha,1}_{q,\sigma}(C) = \Big\{ \sum_{i=1}^{d} g_i(x_i) : g_i \in B^{\alpha,1}_{q,\sigma}(C),\ 1 \le i \le d \Big\},
\]
\[
S^{\alpha,2}_{q,\sigma}(C) = \Big\{ \sum_{1 \le i < j \le d} g_{i,j}(x_i, x_j) : g_{i,j} \in B^{\alpha,2}_{q,\sigma}(C),\ 1 \le i < j \le d \Big\},
\]
\[
\cdots \qquad S^{\alpha,d}_{q,\sigma}(C) = B^{\alpha,d}_{q,\sigma}(C).
\]
The simplest function class $S^{\alpha,1}_{q,\sigma}(C)$ contains additive functions (no interaction), and the complexity of the classes increases as $r$ increases. To allow discontinuity, let $S^{\alpha,r}_{q,\sigma}(C)_{k,\gamma_1,\gamma_2}$ be the class modified (piecewise Besov, in some sense) from $S^{\alpha,r}_{q,\sigma}(C)$ as defined earlier. It is easy to show that the metric entropy of $S^{\alpha,r}_{q,\sigma}(C)$ under the $L_2$ distance and its covering entropy under the K-L divergence are of the same orders as those of $B^{\alpha,r}_{q,\sigma}(C)$. By Theorem 5 of Yang and Barron (1999), the minimax rate of convergence under the K-L (or squared $L_2$) loss for estimating a density in $S^{\alpha,r}_{q,\sigma}(C)_{k,\gamma_1,\gamma_2}$ is $n^{-2\alpha/(2\alpha+r)}$ for $1 \le r \le d$.
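To see numerically how much the interaction order matters (a hypothetical example, with $\alpha$ and $d$ chosen only for illustration), take $\alpha = 2$ and $d = 10$:
\[
n^{-2\alpha/(2\alpha+1)} = n^{-4/5} \quad (r = 1,\ \text{additive case}), \qquad n^{-2\alpha/(2\alpha+d)} = n^{-4/14} = n^{-2/7} \quad (r = d = 10).
\]
An estimator that adapts to a small true interaction order can thus converge dramatically faster than one that always works with the full $d$-dimensional class.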

4.1. Desired properties of estimation. Suppose we have the following wish list for the adaptive estimator $\hat{f}_n$, $n \ge 1$, to be constructed:
1. $\hat{f}_n$ is consistent for all $f \in \mathcal{H}$;
2. $\hat{f}_n$ converges automatically at the optimal rate $n^{-2\alpha/(2\alpha+r)}$ if $f \in S^{\alpha,r}_{q,\sigma}(C)_{k,\gamma_1,\gamma_2}$, without knowledge of any of the hyperparameters;
3. $\hat{f}_n$ behaves well if a projection pursuit density estimator happens to converge reasonably fast.

The rationale behind the wish list is as follows. Besov classes with different choices of the hyperparameters provide considerable flexibility in modeling a density (see, e.g., Donoho et al. (1996)). The piecewise modification allows discontinuity of the density. When $\alpha$ is small relative to $d$, the rate of convergence is rather slow (the well-known curse of dimensionality). The consideration of different interaction orders can lead to a substantial improvement if $f$ happens to be in $S^{\alpha,r}_{q,\sigma}(C)_{k,\gamma_1,\gamma_2}$ with $r$ much smaller than $d$. Projection pursuit (see, e.g., Huber (1985)) is another approach to high-dimensional density estimation via dimension reduction. Despite the lack of theory on its convergence rates, such procedures have practical merits; hence the third wish above. (Of course, one could go on adding more target classes or different strategies, such as neural nets, to the wish list, sacrificing simplicity and computational ease.) Finally, since the true density $f$ may well not belong to any of these classes, we want at least consistency for every $f \in \mathcal{H}$.

4.2. Method of adaptation. To use the adaptation recipe, it suffices to construct a consistent estimator for $\mathcal{H}$ and optimal-rate estimators for the classes $S^{\alpha,r}_{q,\sigma}(C)_{k,\gamma_1,\gamma_2}$, and then combine them and the projection pursuit estimator appropriately.

Barron (1988) (see also Barron, Györfi and van der Meulen (1992)) constructed a histogram estimator consistent for $\mathcal{H}$ under the K-L loss; i.e., there is a strategy $\delta_H$ such that $R(f; n; \delta_H) \to 0$ for each $f \in \mathcal{H}$.

For adaptation among the piecewise Besov classes, we may first obtain adaptivity over the smoothness parameters $1 \le \sigma \le \infty$, $1 \le q \le \infty$, $\alpha > d/q$ for fixed $r$, $C$, $k$, $\gamma_1$ and $\gamma_2$. To that end, a suitable discretization of the smoothness parameters leaves us with a countable collection of density classes to work with, for each of which a minimax-rate estimator can be constructed, e.g., by utilizing a covering set under the K-L divergence as in Yang and Barron (1999). The adaptation recipe, together with an appropriate assignment of weights on the discretized values, then yields adaptation for these classes following a "continuity" argument. Further adaptation is obtained by combining these estimators over integer values of $r$ (between 1 and $d$) and over $C$, $k$, $\gamma_1$ and $\gamma_2$. This strategy, say $\delta_B$, is minimax-rate adaptive over all the piecewise Besov classes; a schematic weight assignment of this kind is sketched below.
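The following schematic is ours; the particular grid, decay rates and truncation are arbitrary choices made for illustration, not the paper's discretization. It shows one way to assign summable prior weights over a discretized hyperparameter grid, with more complex classes receiving geometrically smaller weight.

```python
import itertools

def besov_strategy_weights(d, levels=8):
    """Illustrative prior weights over a truncated hyperparameter grid.

    Each tuple (r, i_alpha, i_q, i_sigma, i_C, i_k) indexes one candidate
    strategy (a minimax-rate procedure for the corresponding discretized
    piecewise Besov class); the integer indices would point into lists of
    discretized values of alpha, q, sigma, C and k.  Weights decay
    geometrically in every coordinate and are renormalized to sum to one,
    as required of the prior probabilities pi_j."""
    raw = {}
    for r in range(1, d + 1):                                # interaction order 1..d
        for idx in itertools.product(range(1, levels + 1), repeat=5):
            raw[(r,) + idx] = 2.0 ** (-(r + sum(idx)))       # geometric decay in every index
    total = sum(raw.values())
    return {key: w / total for key, w in raw.items()}

weights = besov_strategy_weights(d=3)
print(len(weights), min(weights.values()), sum(weights.values()))  # total prior mass is one
```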

Finally, we combine the strategies $\delta_H$ and $\delta_B$ above and a chosen projection pursuit strategy (e.g., with equal weights). The overall combined strategy then makes all three wishes come true. A similar result holds for the squared $L_2$ risk by applying Theorem 2, assuming the unknown density is bounded above by a known constant. Adaptation results over Besov classes by wavelets are in Donoho et al. (1996), Birgé and Massart (1996) and Juditsky (1997). Adaptation over density classes with different characteristics is discussed in Barron and Cover (1991) using the MDL criterion, and later in Barron, Birgé and Massart (1999) and Yang and Barron (1998) using other model selection criteria.

5. A generalization for prediction with dependent data. A similar adaptation result holds for prediction with dependent data.

Let $X_1, X_2, \ldots, X_n$ be a stochastic process. One is interested in "estimating" or predicting the conditional density $f_i(x_{i+1} \mid x^i)$ of $X_{i+1}$ given $X^i$. For a predictor $\hat{f}_i$, we measure the loss using the K-L divergence $D(f_i \,\|\, \hat{f}_i) = \int f_i(x_{i+1} \mid x^i) \log \big( f_i(x_{i+1} \mid x^i) / \hat{f}_i(x_{i+1}) \big) \, \mu(dx_{i+1})$. The average cumulative prediction risk is
\[
\frac{1}{n+1} \sum_{i=0}^{n} E D(f_i \,\|\, \hat{f}_i).
\]
Let $\delta$ denote a prediction strategy which produces predictors $\hat{f}_{\delta,0},\ \hat{f}_{\delta,1}(x; X_1),\ \ldots$ based on observations $X^0, X^1, \ldots$, respectively. Let $R_{pred}(\{f_i, i \ge 1\}; n; \delta) = (1/(n+1)) \sum_{i=0}^{n} E D(f_i \,\|\, \hat{f}_{\delta,i})$ denote the average cumulative risk when the true conditional densities are $f_i$, $i \ge 1$.

Let $\{\delta_j, j \ge 1\}$ be a collection of density prediction strategies. We have the following conclusion on adaptive prediction.

Proposition 1: For any given countable collection of prediction strategies $\{\delta_j, j \ge 1\}$ and any $\pi$, we can construct a single adaptive prediction strategy $\hat{\delta}$ such that
\[
R_{pred}(\{f_i, i \ge 1\}; n; \hat{\delta}) \le \inf_{j \ge 1} \left( \frac{1}{n+1} \log \frac{1}{\pi_j} + R_{pred}(\{f_i, i \ge 1\}; n; \delta_j) \right).
\]
The proof of the proposition is similar to that of Theorem 1 and is omitted here.

6. Proofs of the results.

Proof of Theorem 1: Let $f^n(x^n)$ denote the product density $f(x_1) f(x_2) \cdots f(x_n)$. Let
\[
q^{(n)}_j = \hat{f}_{j,0}(x_1)\, \hat{f}_{j,1}(x_2; x_1) \cdots \hat{f}_{j,n-1}(x_n; x_1, \ldots, x_{n-1}).
\]

It is a joint density function on the product space of $X_1, \ldots, X_n$. Let $q^{(n)} = \sum_{j \ge 1} \pi_j q^{(n)}_j$ be a mixture of the $q^{(n)}_j$'s. The cumulative risk of the constructed estimators satisfies
\[
\begin{aligned}
\sum_{i=0}^{n-1} E_f D(f \,\|\, \hat{f}_{seq,i})
&= \sum_{i=0}^{n-1} E_f \int f(x) \log \frac{f(x)}{\hat{f}_{seq,i}(x)} \, \mu(dx) \\
&= \sum_{i=0}^{n-1} E_f \int f(x_{i+1}) \log \frac{f(x_{i+1})}{\hat{f}_{seq,i}(x_{i+1})} \, \mu(dx_{i+1}) \\
&= \sum_{i=0}^{n-1} \int f(x_1) f(x_2) \cdots f(x_i) f(x_{i+1}) \log \frac{f(x_{i+1})}{q_i(x_{i+1}; x_1, \ldots, x_i)} \, \mu(dx_1) \cdots \mu(dx_{i+1}) \\
&= \sum_{i=0}^{n-1} \int f^n(x^n) \log \frac{f(x_{i+1})}{q_i(x_{i+1}; x_1, \ldots, x_i)} \, \mu(dx_1) \cdots \mu(dx_n) \\
&= \int f^n(x^n) \left( \sum_{i=0}^{n-1} \log \frac{f(x_{i+1})}{q_i(x_{i+1}; x_1, \ldots, x_i)} \right) \mu(dx_1) \cdots \mu(dx_n) \\
&= \int f^n(x^n) \log \frac{f^n(x^n)}{q^{(n)}(x^n)} \, \mu(dx_1) \cdots \mu(dx_n) \\
&= D(f^n \,\|\, q^{(n)}).
\end{aligned}
\]
Thus $n R_{seq}(f; n-1; \hat{\delta}_{seq}) = D(f^n \,\|\, q^{(n)})$. To prove Theorem 1, our task is to bound $D(f^n \,\|\, q^{(n)})$. For any $f$ and any $j \ge 1$, since $\log x$ is an increasing function, we have
\[
D(f^n \,\|\, q^{(n)}) \le \int f^n(x^n) \log \frac{f^n(x^n)}{\pi_j q^{(n)}_j(x^n)} \, d\mu(x^n)
= \log \frac{1}{\pi_j} + \int f^n(x^n) \log \frac{f^n(x^n)}{q^{(n)}_j(x^n)} \, d\mu(x^n).
\]
The term $\int f^n(x^n) \log \big( f^n(x^n)/q^{(n)}_j(x^n) \big) \, d\mu(x^n)$ can be bounded in terms of the risks of the estimators produced by strategy $\delta_j$. Indeed, as before,
\[
\int f^n(x^n) \log \frac{f^n(x^n)}{q^{(n)}_j(x^n)} \, d\mu(x^n) = \sum_{i=0}^{n-1} E_f D(f \,\|\, \hat{f}_{j,i}) = n R_{seq}(f; n-1; \delta_j).
\]
Thus $R_{seq}(f; n-1; \hat{\delta}_{seq}) \le (1/n) \log(1/\pi_j) + R_{seq}(f; n-1; \delta_j)$. Since the inequality holds for all $j$, minimizing over $j$ gives the first inequality in Theorem 1. Let $\bar{f}_{n-1} = (1/n) \sum_{i=0}^{n-1} \hat{f}_{seq,i}$; by convexity, as in Barron (1987), we have
\[
E_f D(f \,\|\, \bar{f}_{n-1}) \le \frac{1}{n} \sum_{i=0}^{n-1} E_f D(f \,\|\, \hat{f}_{seq,i}) \le \inf_{j \ge 1} \left( \frac{1}{n} \log \frac{1}{\pi_j} + R_{seq}(f; n-1; \delta_j) \right).
\]
This completes the proof of Theorem 1.

Proof of Theorem 2: We use an idea in ([33], Section 2) to change the problem into another one for which the application of Theorem 1 gives bounds under the squared $L_2$ loss. In addition to the observed i.i.d. sample $X_1, X_2, \ldots, X_n$ from $f$, let $W_1, W_2, \ldots, W_n$ be an independent sample generated i.i.d. from the uniform distribution on $\mathcal{X}$ with respect to $\mu$. Let $\tilde{X}_i$ be $X_i$ or $W_i$ with probabilities $(1/2, 1/2)$ according to independent coin flips. Then $\tilde{X}_i$ has density $g(x) = (f(x)+1)/2$. Clearly the new density $g$ is bounded below (away from 0), whereas the family of the original densities need not be. Since the unknown density $g$ is known to be bounded between $1/2$ and $1/2 + A/2$, we can project (if necessary) the estimators produced by the original strategies into this range without increasing the squared $L_2$ risk (see, e.g., ([33], Section 2)).

Then we apply the adaptation recipe with the generated sample $\tilde{X}_i$, $1 \le i \le n$, to get an adaptive strategy $\hat{\delta}$ with risk bound
\[
R(g; n-1; \hat{\delta}) \le \inf_{j \ge 1} \left( \frac{1}{n} \log \frac{1}{\pi_j} + \frac{1}{n} \sum_{i=0}^{n-1} R(g; i; \delta_j) \right).
\]
Note that from the construction recipe, the adaptive estimators are convex combinations of the original estimators (with random coefficients); thus they also stay between $1/2$ and $(A+1)/2$. With this boundedness property, the ratio of the K-L divergence and the squared $L_2$ distance is bounded above and below by constants depending only on $A$ (see, e.g., ([33], Section 2)). As a consequence, we have
\[
r(g; n-1; \hat{\delta}) \le C_A \inf_{j \ge 1} \left( \frac{1}{n} \log \frac{1}{\pi_j} + \frac{1}{n} \sum_{i=0}^{n-1} r(g; i; \delta_j) \right).
\]
Finally, observing that for any estimator $\hat{g}$ of $g$, the estimator $\hat{f} = 2\hat{g} - 1$ of $f$ has risk bounded by $E \| \hat{f} - f \|^2 \le 4 E \| \hat{g} - g \|^2$, we obtain from the above a combined strategy with the claimed risk bound in Theorem 2. Note that the combined strategy is randomized because of its dependence on the generated random variables. Due to the convexity of the loss, a nonrandomized strategy with no larger risk can be obtained by averaging out the randomness in the generated sample $W_i$, $1 \le i \le n$, and the coin flips. This completes the proof of Theorem 2.

Proof of Theorem 3: For each class $\mathcal{F}_j$, let $\delta_j$ be an asymptotically minimax strategy. The existence of a minimax-rate adaptive estimator under the K-L loss follows directly from Theorem 1, together with the assumption that the minimax risk is at a regular nonparametric rate. For the conclusion on the squared $L_2$ loss, observe that, under the assumptions on the density classes, for each $f$ in a class, $g = (f+1)/2$ is also in the class. Applying the risk bound given in Theorem 2, we have
\[
\sup_{f \in \mathcal{F}_j} r(f; n-1; \hat{\delta}) \le C \inf_{j \ge 1} \left( \frac{1}{n} \log \frac{1}{\pi_j} + \frac{1}{n} \sum_{i=0}^{n-1} \sup_{f \in \mathcal{F}_j} r(f; i; \delta_j) \right).
\]
Taking the strategies to be minimax-rate optimal for the respective classes, we obtain minimax-rate adaptation under the squared $L_2$ loss. This completes the proof of Theorem 3.

Acknowledgments. The author is grateful to Andrew Barron, from whom he learned the ideas on adaptive function estimation. The author thanks Mark Low for a helpful discussion. He also thanks the referees for their comments on earlier versions of this paper.

References

[1] A.R. Barron, "Are Bayes rules consistent in information?" in Open Problems in Communication and Computation (T.M. Cover and B. Gopinath, eds.), 85-91, Springer-Verlag, 1987.

[2] A.R. Barron, "The convergence in information of probability density estimators," presented at the IEEE Int. Symp. Inform. Theory, Kobe, Japan, June 19-24, 1988.
[3] A.R. Barron and T.M. Cover, "Minimum complexity density estimation," IEEE Trans. on Information Theory, 37 1034-1054, 1991.
[4] A.R. Barron, L. Birgé and P. Massart, "Risk bounds for model selection via penalization," Probability Theory and Related Fields, 113 301-413, 1999.
[5] A.R. Barron, L. Györfi and E.C. van der Meulen, "Distribution estimation consistent in total variation and in two types of information divergence," IEEE Transactions on Information Theory, 38 1437-1454, 1992.
[6] A.R. Barron and Q. Xie, "Asymptotic minimax regret for data compression, gambling, and prediction," preprint, 1996.
[7] J.O. Berger and L.R. Pericchi, "The intrinsic Bayes factor for model selection and prediction," J. Amer. Statist. Assoc., 91 109-122, 1996.
[8] L. Birgé and P. Massart, "From model selection to adaptive estimation," in Research Papers in Probability and Statistics: Festschrift for Lucien Le Cam (D. Pollard, E. Torgersen and G. Yang, eds.), Springer, New York, 1996.
[9] L.D. Brown and M.G. Low, "A constrained risk inequality with applications to nonparametric functional estimation," Ann. Statist., 24 2524-2535, 1996.
[10] O. Catoni, "The mixture approach to universal model selection," Technical Report LIENS-97-22, Ecole Normale Supérieure, Paris, France, 1997.
[11] B. Clarke and A.R. Barron, "Information-theoretic asymptotics of Bayes methods," IEEE Trans. Inform. Theory, 36 453-471, 1990.
[12] T.M. Cover and J.A. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[13] R.A. DeVore and G.G. Lorentz, Constructive Approximation, Springer-Verlag, New York, 1993.
[14] L. Devroye, L. Györfi and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer-Verlag, New York, 1996.
[15] D.L. Donoho, I.M. Johnstone, G. Kerkyacharian and D. Picard, "Density estimation by wavelet thresholding," Ann. Statist., 24 508-539, 1996.
[16] S.Yu. Efroimovich, "Nonparametric estimation of a density of unknown smoothness," Theory Probab. Appl., 30 557-568, 1985.
[17] S.Yu. Efroimovich and M.S. Pinsker, "A self-educating nonparametric filtration algorithm," Automation and Remote Control, 45 58-65, 1984.
[18] M. Feder and N. Merhav, "Hierarchical universal coding," IEEE Trans. Inform. Theory, 42 1354-1364, 1996.
[19] W. Härdle and J.S. Marron, "Optimal bandwidth selection in nonparametric regression function estimation," Ann. Statist., 13 1465-1481, 1985.
[20] D. Haussler and M. Opper, "Mutual information, metric entropy and cumulative relative entropy risk," Ann. Statist., 25 2451-2492, 1997.
[21] P.J. Huber, "Projection pursuit," Ann. Statist., 13 435-475, 1985.
[22] A. Juditsky, "Wavelet estimators: adapting to unknown smoothness," Math. Methods of Statist., 6 1-25, 1997.

[23] A. Juditsky and A. Nemirovski, "Functional aggregation for nonparametric estimation," Publication Interne, IRISA, No. 993, 1996.
[24] R.E. Kass and A.E. Raftery, "Bayes factors," J. Amer. Statist. Assoc., 90 773-795, 1995.
[25] O.V. Lepskii, "Asymptotically minimax adaptive estimation I: Upper bounds. Optimally adaptive estimates," Theory Probab. Appl., 36 682-697, 1991.
[26] G. Lugosi and A. Nobel, "Adaptive model selection using empirical complexities," unpublished manuscript, 1996.
[27] H. Triebel, "Interpolation properties of ε-entropy and diameters. Geometric characteristics of embedding for function spaces of Sobolev-Besov type," Mat. Sbornik, 98 27-41, 1975.
[28] F.M.J. Willems, Y.M. Shtarkov and T.J. Tjalkens, "The context-tree weighting method: basic properties," IEEE Transactions on Information Theory, 41 653-664, 1995.
[29] Y. Yang, Minimax Optimal Density Estimation, unpublished Ph.D. dissertation, Department of Statistics, Yale University, May 1996.
[30] Y. Yang, "On adaptive function estimation," Technical Report #30, Department of Statistics, Iowa State University, 1997.
[31] Y. Yang, "Model selection for nonparametric regression," Statistica Sinica, 9 475-499, 1999.
[32] Y. Yang and A.R. Barron, "An asymptotic property of model selection criteria," IEEE Transactions on Information Theory, 44 95-116, 1998.
[33] Y. Yang and A.R. Barron, "Information-theoretic determination of minimax rates of convergence," accepted by Ann. Statist., 1999.
