Trimmed Comparison of Distributions

Pedro César ÁLVAREZ-ESTEBAN, Eustasio DEL BARRIO, Juan Antonio CUESTA-ALBERTOS, and Carlos MATRÁN

This article introduces an analysis of similarity of distributions based on the L_2-Wasserstein distance between trimmed distributions. Our main innovation is the use of the impartial trimming methodology, already considered in robust statistics, which we adapt to this setup. Instead of simply removing data at the tails to provide some robustness to the similarity analysis, we develop a data-driven trimming method aimed at maximizing similarity between distributions. Dissimilarity is then measured in terms of the distance between the optimally trimmed distributions. We provide illustrative examples showing the improvements over previous approaches and give the relevant asymptotic results to justify the use of this methodology in applications.

KEY WORDS: Asymptotics; Impartial trimming; Similarity; Trimmed distributions; Wasserstein distance.

1. INTRODUCTION

An intrinsic consequence of randomness is variability. Samples obtained from a random experiment generally will differ, and even two ideal samples coming from the same random generator cannot be expected to be the same. A main challenge for the statistician is to be able to detect departures from this ideal equality that cannot reasonably be attributed to randomness.

Often the researcher is not really concerned about exact coincidence, but rather wants to guarantee that the random generators do not differ too much. The usual approach in the statistical literature to this "not differ too much" involves fixing a certain parameter related to the distribution of the random generators (possibly the distribution itself) and checking whether some distance between the parameters in the two samples lies below a given threshold. In this article we propose a different approach to the problem, with a motivation somehow influenced by robust statistics.
Imagine that we want to compare two univariate data samples. We observe that the associated histograms look different, but we realize that we can remove a certain fraction, say 5%, of the data in the first sample and another 5% of the data in the second sample, in such a way that the remaining data in both samples produce very similar histograms. We then would be tempted to say that the (95%) cores of the underlying distributions are similar. This could be the case when, for instance, trying to assess the similarity of two human populations with respect to a given feature. Both populations could be initially equal, but the presence of different immigration patterns might cause a difference in the overall distribution of that feature, whereas, on the other hand, the "cores" of both populations remain equal. Another example in which we could be interested in comparing the "cores" of two distributions is when we want to check equality in the distributions generating the two samples of a physical magnitude but find that the measuring devices are not perfect and introduce some distortions when the true values lie within a certain range, leaving other values unaffected. The distortions introduced by the two measuring devices could be of different types, but if they did not affect more than a small fraction of the observations, again the "core" of the distributions could be equal.

Let us formalize this idea of the core of a distribution. When trimming a fraction (of size at most α) of the data in the sample to allow a better comparison with the other sample, we replace the empirical measure (1/n) Σ_{i=1}^n δ_{x_i} with a new probability measure that gives weight 0 to the observations in the bad set and weight 1/(n − k) to every observation remaining in the sample. Here k is the number of trimmed observations; thus k ≤ nα and 1/(n − k) ≤ (1/n) · 1/(1 − α). Instead of simply keeping/removing data, we could increase the weight of data in good ranges (by a factor bounded by 1/(1 − α)) and downplay the importance of data in bad zones, not necessarily removing them. The new trimmed empirical measure can be written as (1/n) Σ_{i=1}^n b_i δ_{x_i}, where 0 ≤ b_i ≤ 1/(1 − α) and (1/n) Σ_{i=1}^n b_i = 1.

If the random generator of the sample were P, then the theoretical counterpart of the trimming procedure would be to replace the probability P(B) = ∫_B 1 dP by the new measure

    P̃(B) = ∫_B g dP,  where 0 ≤ g ≤ 1/(1 − α) and ∫ g dP = 1.  (1)

We call a probability measure like P̃ in (1) an α-trimming of P. We show in Section 2 that all α-trimmings of P can be expressed in terms of trimming functions. For a given trimming function, h, P_h denotes the corresponding α-trimming of P. The trimming function h determines which zones in the distribution P are downplayed or removed.

[Author footnote: Pedro César Álvarez-Esteban is Associate Professor (E-mail: pedroc@eio.uva.es), Eustasio del Barrio is Associate Professor, and Carlos Matrán is Professor, Department of Statistics and Operations Research, University of Valladolid, Valladolid, Spain. Juan Antonio Cuesta-Albertos is Professor, Department of Mathematics, Statistics, and Computation, University of Cantabria, Santander, Spain. This research was supported in part by the Spanish Ministry of Science and Technology and FEDER (grants BFM2005-04430-C02-01 and 02) and by the Consejería de Educación y Cultura de la Junta de Castilla y León (grant PAPIJCL VA102/06). The data sets corresponding to the multiclinical study were kindly provided by Axel Munk and Claudia Czado. The data used in Section 3 are available in the majors.dat file in the example data sets of many statistics packages. We obtained them from the textbook by Moore and McCabe (2003). The computational analyses were done using R statistical software. The R programs and functions used to analyze the examples considered in this work are available at http://www.eio.uva.es/~pedroc/R/. The authors thank the reviewing team for their careful reading of the manuscript and useful suggestions.]
Turning to the measurements-with-errors example, the underlying distributions of the samples, P and Q, could be different because of the distortions introduced by the measuring devices, but a suitable trimming function, h, could produce α-trimmings, P_h and Q_h, that are very similar or even equal. The right trimming function generally will be unknown, and

[© 2008 American Statistical Association. Journal of the American Statistical Association, June 2008, Vol. 103, No. 482, Theory and Methods. DOI 10.1198/016214508000000274]


we should look for the best possible one. This makes sense if we consider a metric d between probability measures and take

    h_0 = arg min_h d(P_h, Q_h).  (2)

If the α-trimmings P_{h_0} and Q_{h_0} are equal, then we can say that the cores of the distributions coincide. It also would be of interest to check whether these optimally trimmed probabilities are close to one another. Our goal is to introduce and analyze methods for testing the similarity/dissimilarity of trimmed distributions.

In this article we consider the L_2-Wasserstein (or Mallows) distance between distributions. Note that in a related work, Munk and Czado (1998) considered a trimmed version of the Wasserstein distance, consisting of trimming both distributions solely in their tails and in a symmetric way. In the next section we discuss this approach in our context. However, we want to emphasize that in this article we use impartial trimming not only as a way to robustify a statistical procedure, but also as a method to discard part of the data to achieve the best possible fit between two given samples, or between a sample and a theoretical distribution, thus searching for the maximum similarity between them. To the best of our knowledge, this point of view has not been previously considered in the literature and can lead to a new methodology in relation with the similarity concept. However, the fact that the data themselves decide the method of trimming is common to several statistical methodologies (see, e.g., Cuesta, Gordaliza, and Matrán 1997; García-Escudero, Gordaliza, Matrán, and Mayo-Iscar 2008; Gordaliza 1991; Maronna 2005; Rousseeuw 1985), described here by the term "impartial trimming."

The article is organized as follows. In Section 2 we formally introduce the trimming methodology to measure dissimilarities. We present the properties of trimming and a preliminary example describing the innovation of our methodology with respect to the naive approach of symmetrically trimming to gain robustness. Asymptotics for our dissimilarity measure complete the mathematical analysis considered in Section 2. In Section 3 we compare our methodology with that of Munk and Czado on a real data set, showing the flexibility that impartial trimming introduces in the similarity setup. We give proofs of our results in the Appendix.

2. MEASURING DISSIMILARITIES THROUGH IMPARTIAL TRIMMING

As discussed earlier, we could consider trimmings of a probability distribution on a Borel set simply by considering the conditional probability given that set. But here it is convenient to introduce a slightly more general concept. Trimmed probabilities can be defined in general probability spaces, although for practical purposes we restrict ourselves to probabilities on the real line.

Definition 1. Let P be a probability measure on ℝ and let 0 ≤ α < 1. We say that a probability measure P* on ℝ is an α-trimming of P if P* is absolutely continuous with respect to P (P* ≪ P) and dP*/dP ≤ 1/(1 − α).

We denote the set of α-trimmings of P by T_α(P); that is, if 𝒫 denotes the set of probability measures on ℝ, then

    T_α(P) = {P* ∈ 𝒫 : P* ≪ P, dP*/dP ≤ 1/(1 − α) P-a.s.}.  (3)

The limit case in which α = 1, T_1(P), is just the set of probability measures absolutely continuous with respect to P.

An equivalent characterization is that P* ∈ T_α(P) if and only if P* ≪ P and dP*/dP = f/(1 − α) with 0 ≤ f ≤ 1. If f takes only the values 0 and 1, then it is the indicator of a set, say A, such that P(A) = 1 − α, and trimming corresponds to considering the probability measure P(·|A). Definition (3) allows us to reduce the weight of some regions of the sample space without completely removing them from the feasible set.
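As a small numerical illustration of the 0/1 case above, the following sketch (in Python rather than the authors' R; the function name and setup are ours) builds the weights b_i of a trimmed empirical measure that discards a chosen bad set of at most a fraction α of the points and reweights the rest:

```python
import numpy as np

def trim_weights(n, bad_idx, alpha):
    """Weights b_i of a 0/1 trimmed empirical measure: b_i = 0 on the
    discarded set and b_i = n/(n - k) elsewhere, so each kept point
    carries mass 1/(n - k), (1/n) * sum(b_i) = 1, and every b_i is
    bounded by 1/(1 - alpha) whenever k <= n * alpha."""
    bad_idx = list(bad_idx)
    k = len(bad_idx)
    if k > n * alpha:
        raise ValueError("cannot trim more than a fraction alpha")
    b = np.full(n, n / (n - k))
    b[bad_idx] = 0.0
    return b
```

For n = 100 and α = .05, trimming 5 points gives every kept point weight 100/95 = 1/(1 − .05), the extreme allowed by (1).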

The following proposition gives a useful characterization of the trimmings of a probability distribution in terms of the trimmings of the U[0,1] distribution.

Proposition 1. Let C_α be the class of absolutely continuous functions h : [0,1] → [0,1] such that h(0) = 0 and h(1) = 1, with derivative h′ such that 0 ≤ h′ ≤ 1/(1 − α). For any real probability measure P, we have the following:

a. T_α(P) = {P* ∈ 𝒫 : P*(−∞, t] = h(P(−∞, t]), h ∈ C_α}.
b. T_α(U[0,1]) = {P* ∈ 𝒫 : P*(−∞, t] = h(t), 0 ≤ t ≤ 1, h ∈ C_α}.

It will be useful to write P_h for the probability measure with distribution function h(P(−∞, t]), leading to T_α(P) = {P_h : h ∈ C_α}.

To measure closeness between distributions, we resort to the L_2-Wasserstein distance, defined in the set 𝒫_2 of probabilities with finite second moment. For P and Q in 𝒫_2, W_2(P, Q) is defined as the lowest L_2-distance between random variables (rv's) defined on any probability space with distributions P and Q. The measure of closeness or matching between P and Q at a given level α, or, equivalently, between their distribution functions F and G, is now defined by

    τ_α(P, Q) ≡ τ_α(F, G) = inf_{h ∈ C_α} W_2^2(P_h, Q_h).  (4)

The following alternative expression for W_2(P, Q) is a key aspect of the usefulness of this distance in statistics on the line. If F and G are the distribution functions of P and Q, and F^{−1} and G^{−1} are the respective (left-continuous) quantile functions, then the L_2-Wasserstein distance between P and Q is given by (see, e.g., Bickel and Freedman 1981)

    W_2(P, Q) = [∫_0^1 (F^{−1}(t) − G^{−1}(t))^2 dt]^{1/2}.  (5)

Recall that F^{−1} is defined on (0,1) by F^{−1}(t) = inf{s : F(s) ≥ t}, which satisfies that its distribution function is F. From this it is obvious that, for the probability measures based on two samples (resp. one sample and a theoretical distribution), W_2 coincides with the L_2 distance to the diagonal in a Q–Q plot (resp. probability plot).
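Expression (5) makes the distance straightforward to approximate for two samples: evaluate the empirical quantile functions on a grid in (0,1) and integrate numerically. A minimal Python sketch (our own naming and discretization; the authors' software is in R):

```python
import numpy as np

def w2_distance(x, y, grid_size=1000):
    """Approximate the L2-Wasserstein distance (5) between the empirical
    distributions of samples x and y by integrating the squared difference
    of empirical quantile functions over a midpoint grid in (0,1)."""
    t = (np.arange(grid_size) + 0.5) / grid_size  # midpoints of (0,1)
    qx = np.quantile(x, t)                        # F_n^{-1}(t)
    qy = np.quantile(y, t)                        # G_m^{-1}(t)
    return np.sqrt(np.mean((qx - qy) ** 2))       # [ sum (.)^2 dt ]^{1/2}
```

Shifting one sample by a constant c shifts its quantile function by c everywhere, so the computed distance is |c|, consistent with the location-family interpretation of the Wasserstein distance.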

From (5) and Proposition 1, we obtain the equivalent expression of (4) as

    τ_α(F, G) = inf_{h ∈ C_α} ∫_0^1 (F^{−1}(t) − G^{−1}(t))^2 h′(t) dt.  (6)


The infimum in (6) is easily attained at the function h_0 below (see Gordaliza 1991), associated with a set with Lebesgue measure 1 − α. We call this minimizer h_0 an impartial α-trimming between P and Q. Obviously, after Proposition 1, h_0(F(x)) and h_0(G(x)) are the distribution functions of the impartially α-trimmed probabilities.

To analyze (6), let us consider the map t ↦ |F^{−1}(t) − G^{−1}(t)| as a random variable defined on (0,1) endowed with the Lebesgue measure, ℓ. Let

    L_{F,G}(x) = ℓ{t ∈ (0,1) : |F^{−1}(t) − G^{−1}(t)| ≤ x},  x ≥ 0,

denote its distribution function, and write L^{−1}_{F,G} for the corresponding quantile inverse. If L_{F,G} is continuous at L^{−1}_{F,G}(1 − α), then L_{F,G}(L^{−1}_{F,G}(1 − α)) = 1 − α and

    inf_{h ∈ C_α} ∫_0^1 (F^{−1}(t) − G^{−1}(t))^2 h′(t) dt = ∫_0^1 (F^{−1}(t) − G^{−1}(t))^2 h′_0(t) dt,  (7)

where

    h′_0(t) = (1/(1 − α)) I(|F^{−1}(t) − G^{−1}(t)| ≤ L^{−1}_{F,G}(1 − α)).  (8)

In this case h_0 is in fact the unique minimizer of the criterion functional.

Even if L_{F,G} is not continuous at L^{−1}_{F,G}(1 − α), we can ensure the existence of a set A_0 (not necessarily unique) such that ℓ(A_0) = 1 − α and

    {t ∈ (0,1) : |F^{−1}(t) − G^{−1}(t)| < L^{−1}_{F,G}(1 − α)} ⊂ A_0 ⊂ {t ∈ (0,1) : |F^{−1}(t) − G^{−1}(t)| ≤ L^{−1}_{F,G}(1 − α)}.

Obviously, if for any such A_0 we consider the function h_0 ∈ C_α with h′_0 = (1/(1 − α)) I_{A_0}, then the infimum in (6) is attained at h_0.

Therefore, problem (6) is equivalent to

    inf_A ((1/(1 − α)) ∫_A (F^{−1}(t) − G^{−1}(t))^2 dt)^{1/2},  (9)

where A varies on the Borel sets in (0,1) with Lebesgue measure equal to 1 − α.
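The set formulation (9) translates directly into an algorithm: on a grid in (0,1), keep the fraction 1 − α of t-values where the quantile curves are closest, and integrate there. A Python sketch under these discretization conventions (names are ours; ties at the cutoff are resolved by sorting, one valid choice of A_0):

```python
import numpy as np

def impartial_trimmed_w2(x, y, alpha, grid_size=2000):
    """Approximate the impartially trimmed distance tau_alpha^{1/2} of (9):
    keep the smallest (1 - alpha) fraction of |F^{-1}(t) - G^{-1}(t)| over a
    grid (a discrete version of the set A_0 in (8)) and integrate there,
    rescaled by 1/(1 - alpha)."""
    t = (np.arange(grid_size) + 0.5) / grid_size
    d = np.abs(np.quantile(x, t) - np.quantile(y, t))
    k = int(round((1 - alpha) * grid_size))  # grid size of the kept set A_0
    kept = np.sort(d)[:k]                    # least-discrepant zone of (0,1)
    tau = np.sum(kept ** 2) / grid_size / (1 - alpha)
    return np.sqrt(tau)
```

Because the trimming zone follows the data, a grossly contaminated stretch of one sample is discarded wherever it sits, not only at the tails.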

2.1 Comparison With Symmetric Trimming

Munk and Czado (1998) (see also Czado and Munk 1998; Freitag, Czado, and Munk 2007) considered a trimmed version of the Wasserstein distance for the assessment of similarity between the distribution functions F and G as

    Γ_α(F, G) = ((1/(1 − α)) ∫_{α/2}^{1−α/2} (F^{−1}(t) − G^{−1}(t))^2 dt)^{1/2}.  (10)

Note that the right side of the foregoing expression equals W_2(P_α, Q_α), where P_α is the probability measure with distribution function

    F_α(t) = (1/(1 − α))(F(t) − α/2),  F^{−1}(α/2) ≤ t < F^{−1}(1 − α/2),  (11)

Figure 1. Histograms of trimmed data (cholesterol levels) in two clinical centers. The white part in the bars shows the trimming proportion in the associated zone.

and similarly for Q_α. When comparing two samples, this corresponds to the distance between the sample distributions associated with the symmetrically trimmed samples. This naive way of trimming is widely used and confers protection against contamination by outliers. However, the arbitrariness in the choice of the trimming zones has been largely reported as a serious drawback of procedures based on this method (see, e.g., Cuesta et al. 1997; García-Escudero et al. 2008; Gordaliza 1991; Rousseeuw 1985). In our setting, the question is why two distributions that are very different in their tails are considered similar, but ones that differ in their central parts are considered nonsimilar.

To get an idea of the differences between our approach and the symmetrical trimming, let us recall example 1 of Munk and Czado (1998), which corresponds to a multiclinical study on cholesterol and fibrinogen levels in two sets of patients (of size 116 and 141) in two clinical centers. For the fibrinogen data, our impartial trimming proposal for α = .1 essentially coincides with the symmetrical trimming. But Figure 1 displays the effects of our trimming proposal for the cholesterol data, showing a significant trimming in the middle part of the histograms as well, corresponding to both centers. This even improves on the level of similarity shown by Munk and Czado (1998), strengthening their assessment of similarity on these data, but also provides a descriptive look at the way in which both populations (dis)agree. The posterior analysis of the trimmed data can be very useful in a global comparison of the populations.
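For contrast with the impartial approach of (9), the Munk–Czado distance (10) always drops the same outer α/2 tails of the integration range, whatever the data look like. A discretized sketch of (10) (Python, our own naming; not the authors' R code):

```python
import numpy as np

def symmetric_trimmed_w2(x, y, alpha, grid_size=2000):
    """Symmetrically trimmed distance of (10): integrate the squared
    quantile difference only over t in (alpha/2, 1 - alpha/2), i.e.,
    trim a fixed amount from each tail regardless of the data."""
    t = (np.arange(grid_size) + 0.5) / grid_size
    inner = (t > alpha / 2) & (t < 1 - alpha / 2)  # fixed trimming zone
    d2 = (np.quantile(x, t[inner]) - np.quantile(y, t[inner])) ** 2
    return np.sqrt(np.sum(d2) / grid_size / (1 - alpha))
```

Unlike the data-driven set A_0 of (8), the trimmed zone here is fixed in advance, so two samples that disagree only in the middle of their range are penalized in full, which is exactly the drawback discussed above.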

2.2 Nonparametric Test of Similarity

As is usual in many statistical analyses, the interest of statisticians when analyzing similarity of distributions relies on asserting the equivalence of the involved probability distributions. In hypothesis testing this is achieved by taking equivalence or similarity as the alternative hypothesis, whereas dissimilarity is the null hypothesis. In agreement with this point of view, Munk and Czado (1998) considered the testing problem with the null hypothesis that the trimmed distance (10) exceeds some value Δ_0, a threshold to be analyzed by the experimenters and statisticians in an ad hoc way. Graphics of p values for different Δ_0 values (see Fig. 4 in Sec. 3) play a key role in this analysis, and the fact that our measure of dissimilarity, (τ_α(F, G))^{1/2}, is measured on the same scale as the variable of interest favors this goal. We also note that it is the Wasserstein distance between trimmed versions of the original distributions. This allows us to


handle the very nice properties of this distance (see, e.g., Bickel and Freedman 1981) in a friendly way in connection with our problem.

Let X_1, ..., X_n (resp. Y_1, ..., Y_m) be iid observations with common distribution function F (resp. G), and let X_{(1)}, ..., X_{(n)} (resp. Y_{(1)}, ..., Y_{(m)}) be the corresponding ordered samples. We base our test of H_0 : τ_α(F, G) > Δ_0^2 against H_a : τ_α(F, G) ≤ Δ_0^2 on the empirical counterparts of τ_α(F, G), namely T_{n,α} = τ_α(F_n, G), where F_n denotes the empirical distribution function based on the data, in the one-sample problem, and T_{n,m,α} = τ_α(F_n, G_m) in the two-sample case. Our next results show that, under some mild assumptions on F and G, T_{n,α} and T_{n,m,α} are asymptotically normal, a fact that we use later to approximate the critical values of H_0 against H_a. For notational reasons, in the Appendix we give the proof only of the one-sample statement.

To obtain the asymptotic behavior of our statistics, we assume that

    F and G have absolute moments of order 4 + δ for some δ > 0.  (12)

A further regularity assumption is that F has a continuously differentiable density F′ = f such that

    sup_{x ∈ ℝ} |F(x)(1 − F(x)) f′(x) / f^2(x)| < ∞.  (13)

Additional notation includes h_0 as defined in (8) and

    l(t) = ∫_{F^{−1}(1/2)}^{F^{−1}(t)} (x − G^{−1}(F(x))) h′_0(F(x)) dx  (14)

and

    s^2_{n,α}(G) = (4/(1 − α)^2) (1/n) Σ_{i,j=1}^{n−1} (min{i, j} − ij/n) a_{n,i} a_{n,j},  (15)

where

    a_{n,i} = (X_{(i+1)} − X_{(i)}) ((X_{(i+1)} + X_{(i)})/2 − G^{−1}(i/n)) I(|X_{(i)} − G^{−1}(i/n)| ≤ L^{−1}_{F_n,G}(1 − α)).  (16)
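Estimator (15)–(16) is a quadratic form in the a_{n,i} with the Brownian-bridge-type kernel min{i, j} − ij/n. A direct Python transcription (ours, not the authors' R code; as a simplification, the cutoff L^{−1}_{F_n,G}(1 − α) is approximated by the empirical (1 − α)-quantile of the discrepancies |X_{(i)} − G^{−1}(i/n)|):

```python
import numpy as np

def s2_alpha(x, g_inv, alpha):
    """Variance estimator s^2_{n,alpha}(G) of (15)-(16) for the one-sample
    statistic; g_inv is the theoretical quantile function G^{-1}."""
    n = len(x)
    xs = np.sort(x)                         # X_(1) <= ... <= X_(n)
    i = np.arange(1, n)                     # indices i = 1, ..., n-1
    gi = g_inv(i / n)                       # G^{-1}(i/n)
    disc = np.abs(xs[:-1] - gi)             # |X_(i) - G^{-1}(i/n)|
    cutoff = np.quantile(disc, 1 - alpha)   # approximates L^{-1}(1 - alpha)
    a = ((xs[1:] - xs[:-1])                 # a_{n,i} of (16)
         * ((xs[1:] + xs[:-1]) / 2 - gi)
         * (disc <= cutoff))
    kernel = np.minimum.outer(i, i) - np.outer(i, i) / n  # min{i,j} - ij/n
    return 4 / (1 - alpha) ** 2 / n * (a @ kernel @ a)
```

The kernel is positive semidefinite (it is n times the covariance of a Brownian bridge at the grid points i/n), so the estimate is nonnegative.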

Theorem 2. Assume that F and G satisfy (12) and (13) and that L_{F,G} is continuous at L^{−1}_{F,G}(1 − α). Then √n (T_{n,α} − τ_α(F, G)) is asymptotically centered normal with variance

    σ^2_α(F, G) = 4(∫_0^1 l^2(t) dt − (∫_0^1 l(t) dt)^2).  (17)

This asymptotic variance can be consistently estimated by s^2_{n,α}(G), given by (15). If G also satisfies (13) and n/(n + m) → λ ∈ (0,1), then √(nm/(n + m)) (T_{n,m,α} − τ_α(F, G)) is asymptotically centered normal with variance (1 − λ)σ^2_α(F, G) + λσ^2_α(G, F). This variance can be consistently estimated by s^2_{n,m,α} = (m/(n + m)) s^2_{n,α}(G_m) + (n/(n + m)) s^2_{m,α}(F_n).

If τ_α(F, G) = 0, then Theorem 2 reduces to √n T_{n,α} → 0 in probability; note that τ_α(F, G) = 0 implies that (x − G^{−1}(F(x)))^2 h′_0(F(x)) = 0 for almost every x, and thus σ^2_α(F, G) = 0. This generally would suffice for our applications, but we also give the exact rate and the limiting distribution in the Appendix.

Figure 2. Histograms for variable GPA: (a) males; (b) females; (c) computer science students; (d) engineering students.

3. EXAMPLE AND SIMULATIONS

Our analysis is based on the variable college grade point average (GPA), collected from a group of 234 students. This variable takes values from 0 to 4. The students are classified by the variables gender and major (1 = computer science, 2 = engineering, 3 = other sciences). We are interested in studying the distributional similarity of the GPA between males (n = 117) and females (m = 117), and also between students with a major in computer sciences (n = 78) and students with a major in engineering (m = 78). Figure 2 shows the histogram for each sample.

Comparisons of these samples using classical procedures produce the results displayed in Table 1. Because the Shapiro–Wilks tests reject the normality of the four samples, we use nonparametric methods, like the Kolmogorov–Smirnov test (KS) or the Wilcoxon–Mann–Whitney test (WMW), to analyze the null hypothesis that both samples come from the same distribution in the comparisons of GPA by sex and GPA by major. The p values of these tests clearly reject the null hypotheses.
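The classical battery of Table 1 can be reproduced with standard routines; in Python with SciPy, for example (the authors used R, and the samples below are synthetic stand-ins for the GPA data in majors.dat, so the p values will not match Table 1):

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in samples of the same sizes as the gender comparison.
rng = np.random.default_rng(0)
gpa_males = np.clip(rng.normal(2.7, 0.6, 117), 0, 4)
gpa_females = np.clip(rng.normal(2.9, 0.6, 117), 0, 4)

_, p_sw1 = stats.shapiro(gpa_males)                    # Shapiro-Wilk, sample 1
_, p_sw2 = stats.shapiro(gpa_females)                  # Shapiro-Wilk, sample 2
_, p_ks = stats.ks_2samp(gpa_males, gpa_females)       # Kolmogorov-Smirnov
_, p_wmw = stats.mannwhitneyu(gpa_males, gpa_females)  # Wilcoxon-Mann-Whitney
print(p_sw1, p_sw2, p_ks, p_wmw)
```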

Under the possibility of impartially trimming both samples, as described in Section 2, we obtain the optimal trimming functions displayed in Figure 3. In this figure, and for each comparison, we plot the value of |F_n^{−1}(t) − G_m^{−1}(t)| and the cutting values L^{−1}_{F_n,G_m}(1 − α) for α = .05, .1, and .2. Figure 3(a) shows that the optimal trimming involves the lower tail, but not exactly from the lower end point. When the trimming level grows

Table 1. Two-sample p values for classical tests

                                          p value
    Test                        GPA by gender   GPA by major
    Shapiro–Wilks (sample 1)        .176            .360
    Shapiro–Wilks (sample 2)        .217            .001
    KS                              .028            .040
    WMW                             .004            .175


Figure 3. Trimming functions: (a) GPA by gender; (b) GPA by major (α = .05, α = .1, α = .2).

(α = .1 and .2), the trimmed zone is not an interval, and it includes points around percentiles 20, 40, 60, and 70. Figure 3(b) shows that the points that should be trimmed to make both samples more similar are between percentiles 10 and 30. This example illustrates a nonsymmetrical dissimilarity between samples; in fact, in the first comparison the less similar zone is close to the lower tail, but not to the upper tail, where the values are more similar.

3.1 p Value Curve

To gain some insight into the assessment of the similarity or dissimilarity of the underlying distributions, we can use the p value curve to test the null hypothesis H_0 : τ_α(F, G) > Δ_0^2 against H_a : τ_α(F, G) ≤ Δ_0^2. In the two-sample comparison case, we use the statistic

    Z_{n,m,α} = √(nm/(n + m)) (T_{n,m,α} − Δ_0^2) / s_{n,m,α}.  (18)

To obtain the values of T_{n,m,α}, we compute |F_n^{−1}(t) − G_m^{−1}(t)|^2 over a grid in [0,1], using the (1 − α)-quantile of these values to determine L^{−1}_{F_n,G_m}(1 − α). The integral is then calculated numerically. The computation of s_{n,m,α} is done similarly.

The asymptotic p value curve P(Δ_0) is defined as

    P(Δ_0) = sup_{(F,G) ∈ H_0} lim_{n,m→∞} P_{F,G}(Z_{n,m,α} ≤ z_0) = Φ(z_0),

where z_0 is the observed value of Z_{n,m,α}. (Note that the supremum is attained when the distance between both distributions is exactly Δ_0.) These asymptotic p value curves can be used in two ways. On one hand, given a fixed value of Δ_0 that controls the degree of dissimilarity, it is possible to find the p value associated with the corresponding null hypothesis to decide whether or not the distributions are similar. On the other hand, given a fixed test level (p value), we can find the value of Δ_0 such that for every Δ ≥ Δ_0 we should reject the hypothesis H_0 : τ_α(F, G) ≥ Δ^2. In this way we can get a sound idea of the degree of dissimilarity between the distributions. To handle the values of Δ_0, the experimenter should take into account how to interpret the Wasserstein distance, recalling that in the case where F and G belong to the same location family, their Wasserstein distance is the absolute difference of their locations.
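In practice, the curve is just Φ of the standardized statistic (18) evaluated along a grid of thresholds. A sketch assuming the statistic T_{n,m,α} and its standard error s_{n,m,α} have already been computed (names are ours):

```python
import numpy as np
from math import erf, sqrt

def pvalue_curve(T, s, n, m, delta0_grid):
    """Asymptotic p values Phi(z_0) for H_0: tau_alpha(F, G) > Delta_0^2,
    with z_0 the observed value of the statistic (18) at each Delta_0."""
    scale = sqrt(n * m / (n + m))
    phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal cdf
    return np.array([phi(scale * (T - d0 ** 2) / s) for d0 in delta0_grid])
```

The curve decreases in Δ_0; where it falls below the chosen reference level (.05 in Fig. 4), the hypothesis of a dissimilarity larger than Δ_0 is rejected.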

Figure 4 illustrates the improved assessment obtained by impartial trimming over the Munk and Czado methodology. It displays the p value curves using impartial trimming and symmetrical trimming, for both comparisons and for different trimming levels (α = .05, .1, and .2). For each plot, a horizontal line marks a reference level for the test (.05). The GPAs of males and females are similar up to Δ_0 ranging from .32 to .36 (depending on the trimming size) when impartial trimmings are used. These values represent between 100 × .32/2.815 = 11.4% and 12.8% of the average of the medians of the samples. But when using symmetrical trimmings, the horizontal line cuts the p value curves for Δ_0 ranging from .56 to .59, between 20% and 21% of the average of the medians. A similar analysis of the comparison of GPAs by major leads us to values of Δ_0 ranging from .29 to .36, which represent between 9.6% and 11.9% of the average of the medians when using impartial trimming. Instead, when using symmetrical trimming, these percentages range from 16.6% to 19.5%.

Figure 4. p value curves using impartial and symmetrical (MC) trimmings: (a) GPA by gender; (b) GPA by major (α = .05, .1, and .2, each with its symmetrical (MC) counterpart).

3.2 Simulations

We end this section by reporting a small simulation study to illustrate our procedure's performance for finite samples for testing H_0: τ_α(F,G) > Δ_0² against H_a: τ_α(F,G) ≤ Δ_0² in the two-sample problem. We considered two different contaminated normal models, two different trimming sizes, and several values of the threshold Δ_0. In each situation we generated 10,000 replicas of the trimmed score Z_{n,m,α}, as defined in (18), for several values of n = m. We compared these replicas with the .05 theoretical quantile of the standard normal distribution, rejecting H_0 for observed values smaller than this quantity. Table 2 shows the observed rejection frequencies. We find good agreement with our asymptotic results even for moderate sample sizes, with low rejection frequencies for thresholds Δ_0 smaller than the true distance and high rejection frequencies otherwise. When the threshold equals the true distance, we also can see how the observed frequency approximates the nominal level.
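The contaminated normal pairs of Table 2 are easy to regenerate. Below is a sketch of ours (not from the article; function and variable names are our own) of the data-generating step for one replica; a full study would plug such samples into the trimmed score (18):

```python
import numpy as np

rng = np.random.default_rng(2008)

def contaminated(n, eps, mu):
    """Draw n points from the mixture (1 - eps) N(0,1) + eps N(mu,1)."""
    is_contam = rng.random(n) < eps
    return np.where(is_contam, rng.normal(mu, 1.0, n), rng.normal(0.0, 1.0, n))

# One replica of the first model pair in Table 2:
x = contaminated(100, 0.05, 5.0)    # P = .95 N(0,1) + .05 N(5,1)
y = contaminated(100, 0.05, -5.0)   # Q = .95 N(0,1) + .05 N(-5,1)
```

Because the contamination lumps sit on opposite sides, the quantile functions disagree most where those lumps fall; impartial trimming reduces the distance by discarding precisely those quantile regions, wherever they occur.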

4. CONCLUSIONS AND POSSIBLE EXTENSIONS

We have introduced a procedure to compare two samples or probability distributions on the real line based on the impartial trimming methodology. The procedure is designed mainly to assess similarity of the core of two samples by discarding that part of the data that has a greater influence on the dissimilarity of the distributions. Our method is based on trimming the corresponding samples according to the same trimming function, but it allows nonsymmetrical trimming; thus it can greatly improve the previous methodology based on simply trimming the tails. We have evaluated the performance of our procedure through an analysis of some real data samples that emphasized the appealing possibilities in data analysis and the significance of the analysis of the p value curves for assessing similarities. A simulation study has also provided evidence about the behavior of the procedure for finite samples, in agreement with asymptotic results. Although we treated only dissimilarities based on the Wasserstein distance, other metrics or dissimilarities could be handled under the same scheme.

Representation of trimmings of any distribution in terms of those of the uniform distribution is no longer possible in the multivariate setting. However, under very general assumptions it has been proven (see Cuesta and Matrán 1989) that, given two probabilities P and Q on ℝ^k, there exists an "optimal transport map" T such that Q = P ∘ T^{-1} and, if X is any random vector with law P, then E‖X − T(X)‖² = W²_2(P,Q). Moreover, if P_α is an α-trimming of P, then P_α ∘ T^{-1} is an α-trimming of Q, and T is an optimal map between P_α and P_α ∘ T^{-1}, so the multivariate version of (4) would be the minimization, over the set of α-trimmings of P, of the expression W²_2(P_α, P_α ∘ T^{-1}). We also should mention that obtaining the optimal map T remains an open problem for k > 1. Although these are troubling facts, obtaining the optimal trimming between two samples is already possible through standard optimization procedures. A final difficulty concerns the asymptotic behavior of the involved statistics, to which the techniques used in our proofs do not extend.

Table 2. Simulated powers for the trimmed scores Z_{n,m,α}

                    P = .95 N(0,1) + .05 N(5,1)              P = .9 N(0,1) + .1 N(5,1)
                    Q = .95 N(0,1) + .05 N(−5,1)             Q = .9 N(0,1) + .1 N(−5,1)
                    α = .05                                  α = .1
                    [(1 − α)^{1/2} τ_α(P,Q)^{1/2} = .384]    [(1 − α)^{1/2} τ_α(P,Q)^{1/2} = 1.004]

  (1 − α)^{1/2} Δ_0      n     Frequency    (1 − α)^{1/2} Δ_0      n     Frequency
        .25             100      .0320            .25             100      .0028
                        200      .0268                            200      0
                        500      .0086                            500      0
                       1000      .0021                           1000      0
                       5000      0                               5000      0
        .5              100      .1412            .5              100      .0109
                        200      .1633                            200      .0031
                        500      .2264                            500      .0002
                       1000      .3134                           1000      0
                       5000      .7648                           5000      0
       1                100      .4912           1                100      .0850
                        200      .6957                            200      .0727
                        500      .9474                            500      .0657
                       1000      .9989                           1000      .0584
                       5000     1.0000                           5000      .0486

Álvarez-Esteban, del Barrio, Cuesta-Albertos, and Matrán: Trimmed Comparison of Distributions 703

Allowing trimming in both samples with different trimming functions would provide an interesting alternative to our present proposal. Through our research, still in progress, we have identified a radically different behavior to that presented in this article for identical trimming in both samples.

APPENDIX: PROOFS AND FURTHER RESULTS

In this appendix, ρ_n(t) = √n f(F^{-1}(t))(F_n^{-1}(t) − F^{-1}(t)) denotes the weighted quantile process, where f is the density function of F.

Proof of Proposition 1

Let A = {P* ∈ P : P*(−∞, t] = h(P(−∞, t]), h ∈ C_α}. For P* ∈ A, absolute continuity of h entails

P*(s, t] = h(P(−∞, t]) − h(P(−∞, s]) = ∫_{P(−∞,s]}^{P(−∞,t]} h′(x) dx ≤ (1/(1 − α)) P(s, t].

Thus P* ≪ P and dP*/dP ≤ 1/(1 − α), and therefore P* ∈ T_α(P). Conversely, given P* ∈ T_α(P), if F is the distribution function of P and we define h(t) = ∫_0^t (dP*/dP)(F^{-1}(s)) ds, then it is immediate that h ∈ C_α and

P*(−∞, t] = ∫_{−∞}^t (dP*/dP)(s) dF(s) = ∫_0^{F(t)} (dP*/dP)(F^{-1}(s)) ds = h(P(−∞, t]).

Therefore P* ∈ A, and part (a) is proven. The proof of part (b) is immediate from the proof of part (a).

The following lemmas collect some results that we use in our proofs of Theorems 2 and A.1. These results can be easily proven using Schwarz's inequality, standard arguments in empirical process theory, or the Arzelà–Ascoli theorem.

Lemma A.1. If F and G have finite absolute moment of order r > 4, then the following hold:

a. √n ∫_0^{1/n} (F^{-1}(t))² dt → 0 and √n ∫_{1−1/n}^1 (F^{-1}(t))² dt → 0.

b. √n ∫_0^{1/n} (F_n^{-1}(t))² dt → 0 and √n ∫_{1−1/n}^1 (F_n^{-1}(t))² dt → 0 in probability.

c. ∫_0^1 (√(t(1 − t))/g(G^{-1}(t))) |F^{-1}(t) − G^{-1}(t)| dt < ∞.

d. Furthermore, if G satisfies (13), then (1/√n) ∫_{1/n}^{1−1/n} t(1 − t)/g²(G^{-1}(t)) dt → 0.

Lemma A.2. Under the ‖·‖_∞ topology, the set C_α in Proposition 1 and the set C_α(F,G) = {h ∈ C_α : ∫_0^1 (F^{-1}(t) − G^{-1}(t))² h′(t) dt = 0}, for F, G ∈ P_2, are compact.

Proof of Theorem 2

From theorem 6.2.1 of Csörgő and Horváth (1993) and (13), we can assume that there exist Brownian bridges B_n satisfying

n^{1/2−ν} sup_{1/n ≤ t ≤ 1−1/n} |ρ_n(t) − B_n(t)|/(t(1 − t))^ν = O_P(log n) if ν = 0, O_P(1) if 0 < ν ≤ 1/2.   (A.1)

Now we set M_n(h) = √n ∫_0^1 (F_n^{-1}(t) − G^{-1}(t))² h′(t) dt and

N_n(h) = 2 ∫_{1/n}^{1−1/n} (B_n(t)/f(F^{-1}(t)))(G^{-1}(t) − F^{-1}(t)) h′(t) dt + √n ∫_{1/n}^{1−1/n} (G^{-1}(t) − F^{-1}(t))² h′(t) dt.

Observe that

sup_{h∈C_α} |M_n(h) − N_n(h)| ≤ √n ∫_0^{1/n} (F_n^{-1}(t) − G^{-1}(t))² dt + √n ∫_{1−1/n}^1 (F_n^{-1}(t) − G^{-1}(t))² dt + (1/√n) ∫_{1/n}^{1−1/n} |ρ_n(t) − B_n(t)|²/f²(F^{-1}(t)) dt + (1/√n) ∫_{1/n}^{1−1/n} B_n(t)²/f²(F^{-1}(t)) dt + 2 ∫_{1/n}^{1−1/n} (|ρ_n(t) − B_n(t)|/f(F^{-1}(t))) |G^{-1}(t) − F^{-1}(t)| dt = A_{n,1} + A_{n,2} + A_{n,3} + A_{n,4} + A_{n,5}.

Lemma A.1 implies that A_{n,1} → 0 and A_{n,2} → 0 in probability. From (A.1), we get A_{n,3} ≤ O_P(1) (1/√n) ∫_{1/n}^{1−1/n} t(1 − t)/f²(F^{-1}(t)) dt, and this last integral tends to 0 by Lemma A.1. Thus A_{n,3} → 0 in probability. Similarly, A_{n,4} → 0 in probability. Finally, (A.1) yields A_{n,5} ≤ O_P(1) n^{ν−1/2} ∫_{1/n}^{1−1/n} ((t(1 − t))^ν/f(F^{-1}(t))) |G^{-1}(t) − F^{-1}(t)| dt for some ν ∈ (0, 1/2). Lemma A.1 shows that ∫_0^1 ((t(1 − t))^{1/2}/f(F^{-1}(t))) |G^{-1}(t) − F^{-1}(t)| dt < ∞. Thus, by dominated convergence, we obtain that A_{n,5} → 0 in probability. Collecting the foregoing estimates, we obtain sup_{h∈C_α} |M_n(h) − N_n(h)| → 0 in probability, and thus √n(T_{n,α} − S_{n,α}) → 0 in probability, where √n S_{n,α} = inf_{h∈C_α} N_n(h). Therefore, we need only show that √n(S_{n,α} − τ_α(F,G)) →_w N(0, σ²_α(F,G)), where

√n S_{n,α} = inf_{h∈C_α} [ 2 ∫_0^1 B(t) ((G^{-1}(t) − F^{-1}(t))/f(F^{-1}(t))) h′(t) dt + √n ∫_0^1 (G^{-1}(t) − F^{-1}(t))² h′(t) dt ].

Let us denote

h_n = arg min_{h∈C_α} { ∫_0^1 (F^{-1}(t) − G^{-1}(t))² h′(t) dt + (2/√n) ∫_0^1 B(t) ((G^{-1}(t) − F^{-1}(t))/f(F^{-1}(t))) h′(t) dt }.

Clearly, h′_n(t) → h′_0(t) for almost every t. Furthermore, optimality of h_n shows that

B_n = √n S_{n,α} − ( 2 ∫_0^1 B(t) ((G^{-1}(t) − F^{-1}(t))/f(F^{-1}(t))) h′_0(t) dt + √n ∫_0^1 (G^{-1}(t) − F^{-1}(t))² h′_0(t) dt ) ≤ 0,

but, in contrast,

B_n = √n ( ∫_0^1 (F^{-1}(t) − G^{-1}(t))² h′_n(t) dt − ∫_0^1 (F^{-1}(t) − G^{-1}(t))² h′_0(t) dt ) + 2 ∫_0^1 B(t) ((G^{-1}(t) − F^{-1}(t))/f(F^{-1}(t))) (h′_n(t) − h′_0(t)) dt = B_{n,1} + B_{n,2},

and B_{n,1} ≥ 0 by optimality of h_0, whereas B_{n,2} = o_P(1) by the dominated convergence theorem. Therefore B_n → 0 in probability, which shows that

√n(T_{n,α} − τ_α(F,G)) →_w 2 ∫_0^1 B(t) ((G^{-1}(t) − F^{-1}(t))/f(F^{-1}(t))) h′_0(t) dt.   (A.2)

Integrating by parts, we obtain ∫_0^1 B(t) ((G^{-1}(t) − F^{-1}(t))/f(F^{-1}(t))) h′_0(t) dt = −∫_0^1 l(t) dB(t), which proves the asymptotic normality and the expression (17) for the variance. The claim about the variance estimator readily follows by noting that s²_{n,α}(G) = 4(∫_0^1 l²_n(t) dt − (∫_0^1 l_n(t) dt)²), where l_n(t) = ∫_{F_n^{-1}(1/2)}^{F_n^{-1}(t)} (x − G^{-1}(F_n(x))) h′_n(F_n(x)) dx and h_n = arg min_{h∈C_α} ∫_0^1 (F_n^{-1} − G^{-1})² h′. It can be shown that, with probability 1, l_n(t) → l(t) for almost every t ∈ (0, 1). A standard uniform integrability argument completes the proof.

The final result in this section establishes the asymptotic behavior of nT_{n,α} when F and G are equivalent at trimming level α. Recall the definition of C_α(F,G) in Lemma A.2, and note that C_α(F,F) = C_α, but also note that for F ≠ G we have that C_α(F,G) is a proper subset of C_α. Also note that C_α(F,G) ≠ ∅ if and only if τ_α(F,G) = 0. In fact, the size of C_α(F,G) depends on the Lebesgue measure of the set {t ∈ (0,1) : F^{-1}(t) ≠ G^{-1}(t)}; τ_α(F,G) = 0 if and only if the measure of this last set is at most α; if it equals α, then the only function in C_α(F,G) corresponds to h′(t) = (1/(1 − α)) I(F^{-1}(t) = G^{-1}(t)).

Theorem A.1. If τ_α(F,G) = 0, F satisfies (13), and

∫_0^1 t(1 − t)/f²(F^{-1}(t)) dt < ∞,   (A.3)

then

nT_{n,α} →_w min_{h∈C_α(F,G)} ∫_0^1 (B(t)²/f²(F^{-1}(t))) h′(t) dt,

where {B(t)}_{0<t<1} is a Brownian bridge. Because h ↦ ∫_0^1 (B²(t)/f²(F^{-1}(t))) h′(t) dt is ‖·‖_∞-continuous as a function of h, it attains its minimum value on C_α(F,G).

Proof. We define D_n(h) = n ∫_0^1 (F_n^{-1}(t) − G^{-1}(t))² h′(t) dt and D(h) = ∫_0^1 (B²(t)/f²(F^{-1}(t))) h′(t) dt for h ∈ C_α. Note that

D_n(h) = ∫_0^1 (ρ²_n(t)/f²(F^{-1}(t))) h′(t) dt + n ∫_0^1 (F^{-1}(t) − G^{-1}(t))² h′(t) dt + 2√n ∫_0^1 (ρ_n(t)/f(F^{-1}(t)))(F^{-1}(t) − G^{-1}(t)) h′(t) dt.

Also observe that nT_{n,α} = D_n(h_n) for some h_n ∈ C_α. If h ∈ C_α(F,G), then the second and third summands on the right side vanish and D_n(h) = ∫_0^1 (ρ²_n(t)/f²(F^{-1}(t))) h′(t) dt. By (13), (A.3), and a.s. representation of weak convergence, versions of ρ_n(·)/f(F^{-1}(·)) and B(·)/f(F^{-1}(·)) exist (for which we keep the same notation) such that

‖ρ_n(·)/f(F^{-1}(·)) − B(·)/f(F^{-1}(·))‖_2 → 0 a.s.

Now, for these versions, we have

sup_{h∈C_α(F,G)} |D_n(h) − D(h)| ≤ (1/(1 − α)) ∫_0^1 |ρ²_n(t)/f²(F^{-1}(t)) − B²(t)/f²(F^{-1}(t))| dt → 0 a.s.,

whereas for h_0 ∈ C_α − C_α(F,G), we have, a.s., that D_n(h) → ∞ uniformly in a sufficiently small neighborhood of h_0. Furthermore, if h_n → h ∈ C_α(F,G), then we can extract a subsequence such that n ∫_0^1 (F^{-1}(t) − G^{-1}(t))² h′_n(t) dt → 0. The result follows from the next technical lemma, the easy proof of which is omitted.

Lemma A.3. Let (X, d) be a compact metric space, let A be a compact subset of X, and let {f_n} and f be real-valued continuous functions on X such that the following hold:

a. sup_{x∈A} |f_n(x) − f(x)| → 0 as n → ∞.

b. For x ∈ X − A, there exists ε_x > 0 such that inf_{d(y,x)<ε_x} f_n(y) → ∞ as n → ∞.

c. If x_n → x ∈ A, there exists a subsequence {x_m} such that f_m(x_m) → f(x).

Then min_{x∈X} f_n(x) → min_{x∈A} f(x).

[Received October 2007. Revised January 2008.]

REFERENCES

Bickel, P., and Freedman, D. (1981), "Some Asymptotic Theory for the Bootstrap," The Annals of Statistics, 9, 1196–1217.

Csörgő, M., and Horváth, L. (1993), Weighted Approximations in Probability and Statistics, New York: Wiley.

Cuesta, J. A., and Matrán, C. (1989), "Notes on the Wasserstein Metric in Hilbert Spaces," The Annals of Probability, 17, 1264–1276.

Cuesta, J., Gordaliza, A., and Matrán, C. (1997), "Trimmed k-Means: An Attempt to Robustify Quantizers," The Annals of Statistics, 25, 553–576.

Czado, C., and Munk, A. (1998), "Assessing the Similarity of Distributions—Finite-Sample Performance of the Empirical Mallows Distance," Journal of Statistical Computation and Simulation, 60, 319–346.

Freitag, G., Czado, C., and Munk, A. (2007), "A Nonparametric Test for Similarity of Marginals—With Applications to the Assessment of Population Bioequivalence," Journal of Statistical Planning and Inference, 137, 697–711.

García-Escudero, L., Gordaliza, A., Matrán, C., and Mayo-Iscar, A. (2008), "A General Trimming Approach to Robust Cluster Analysis," The Annals of Statistics, to appear.

Gordaliza, A. (1991), "Best Approximations to Random Variables Based on Trimming Procedures," Journal of Approximation Theory, 64, 162–180.

Maronna, R. (2005), "Principal Components and Orthogonal Regression Based on Robust Scales," Technometrics, 47, 264–273.

Moore, D. S., and McCabe, G. P. (2003), Introduction to the Practice of Statistics (4th ed.), New York: W.H. Freeman.

Munk, A., and Czado, C. (1998), "Nonparametric Validation of Similar Distributions and Assessment of Goodness of Fit," Journal of the Royal Statistical Society, Ser. B, 60, 223–241.

Rousseeuw, P. (1985), "Multivariate Estimation With High Breakdown Point," in Mathematical Statistics and Applications, Vol. B, eds. W. Grossmann, G. Pflug, I. Vincze, and W. Werz, Dordrecht: Reidel, pp. 283–297.


we should look for the best possible one. This makes sense if we consider a metric d between probability measures and take

h_0 = arg min_h d(P_h, Q_h).   (2)

If the α-trimmings P_{h_0} and Q_{h_0} are equal, then we can say that the cores of the distributions coincide. It also would be of interest to check whether these optimally trimmed probabilities are close to one another. Our goal is to introduce and analyze methods for testing the similarity/dissimilarity of trimmed distributions.

In this article we consider the L_2-Wasserstein (or Mallows) distance between distributions. Note that, in a related work, Munk and Czado (1998) considered a trimmed version of the Wasserstein distance consisting of trimming both distributions solely in their tails and in a symmetric way. In the next section we discuss this approach in our context. However, we want to emphasize that in this article we use impartial trimming not only as a way to robustify a statistical procedure, but also as a method to discard part of the data to achieve the best possible fit between two given samples, or between a sample and a theoretical distribution, thus searching for the maximum similarity between them. To the best of our knowledge, this point of view has not been previously considered in the literature and can lead to a new methodology in relation with the similarity concept. However, the fact that the data themselves decide the method of trimming is common to several statistical methodologies (see, e.g., Cuesta, Gordaliza, and Matrán 1997; García-Escudero, Gordaliza, Matrán, and Mayo-Iscar 2008; Gordaliza 1991; Maronna 2005; Rousseeuw 1985), described here by the term "impartial trimming."

The article is organized as follows. In Section 2 we formally introduce the trimming methodology to measure dissimilarities. We present the properties of trimming and a preliminary example describing the innovation of our methodology with respect to the naive approach of symmetrically trimming to gain robustness. Asymptotics for our dissimilarity measure complete the mathematical analysis considered in Section 2. In Section 3 we compare our methodology with that of Munk and Czado on a real data set, showing the flexibility that impartial trimming introduces in the similarity setup. We give proofs of our results in the Appendix.

2. MEASURING DISSIMILARITIES THROUGH IMPARTIAL TRIMMING

As discussed earlier, we could consider trimmings of a probability distribution on a Borel set simply by considering the conditional probability given that set. But here it is convenient to introduce a slightly more general concept. Trimmed probabilities can be defined in general probability spaces, although for practical purposes we restrict ourselves to probabilities on the real line.

Definition 1. Let P be a probability measure on ℝ and let 0 ≤ α < 1. We say that a probability measure P* on ℝ is an α-trimming of P if P* is absolutely continuous with respect to P (P* ≪ P) and dP*/dP ≤ 1/(1 − α).

We denote the set of α-trimmings of P by T_α(P); that is, if P denotes the set of probability measures on ℝ, then

T_α(P) = {P* ∈ P : P* ≪ P, dP*/dP ≤ 1/(1 − α) P-a.s.}.   (3)

The limit case in which α = 1, T_1(P), is just the set of probability measures absolutely continuous with respect to P.

An equivalent characterization is that P* ∈ T_α(P) if and only if P* ≪ P and dP*/dP = (1/(1 − α)) f with 0 ≤ f ≤ 1. If f takes only the values 0 and 1, then it is the indicator of a set, say A, such that P(A) = 1 − α, and trimming corresponds to considering the probability measure P(·|A). Definition (3) allows us to reduce the weight of some regions of the sample space without completely removing them from the feasible set.
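On an empirical distribution with n atoms, the density-ratio bound dP*/dP ≤ 1/(1 − α) caps each data point's weight at 1/((1 − α)n); the 0/1 case of f above discards points outright. A minimal sketch of ours (names are hypothetical, not from the article):

```python
import numpy as np

def trim_weights(n, drop_idx, alpha):
    """An alpha-trimming of the empirical measure on n points: put the
    maximal allowed weight 1/((1 - alpha) n) on kept points and 0 on the
    trimmed ones. Requires exactly alpha*n dropped points so the weights
    sum to 1 (the 0/1 indicator case of f in the text)."""
    w = np.full(n, 1.0 / ((1.0 - alpha) * n))
    w[list(drop_idx)] = 0.0
    return w
```

For example, `trim_weights(10, [3, 7], alpha=0.2)` keeps eight points, each at the capped weight 1/8; fractional f values between 0 and 1 would down-weight regions without removing them entirely.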

The following proposition gives a useful characterization of trimmings of a probability distribution in terms of the trimmings of the U[0,1] distribution.

Proposition 1. Let C_α be the class of absolutely continuous functions h : [0,1] → [0,1] such that h(0) = 0 and h(1) = 1, with derivative h′ such that 0 ≤ h′ ≤ 1/(1 − α). For any real probability measure P, we have the following:

a. T_α(P) = {P* ∈ P : P*(−∞, t] = h(P(−∞, t]), h ∈ C_α}.

b. T_α(U[0,1]) = {P* ∈ P : P*(−∞, t] = h(t), 0 ≤ t ≤ 1, h ∈ C_α}.

It will be useful to write P_h for the probability measure with distribution function h(P(−∞, t]), leading to T_α(P) = {P_h : h ∈ C_α}.

To measure closeness between distributions, we resort to the L_2-Wasserstein distance, defined on the set P_2 of probabilities with finite second moment. For P and Q in P_2, W_2(P,Q) is defined as the lowest L_2-distance between random variables (rv's) defined on any probability space with distributions P and Q. The measure of closeness, or matching, between P and Q at a given level α, or, equivalently, between their distribution functions F and G, is now defined by

τ_α(P,Q) ≡ τ_α(F,G) = inf_{h∈C_α} W²_2(P_h, Q_h).   (4)

The following alternative expression for W_2(P,Q) is a key aspect of the usefulness of this distance in statistics on the line. If F and G are the distribution functions of P and Q, and F^{-1} and G^{-1} are the respective (left-continuous) quantile functions, then the L_2-Wasserstein distance between P and Q is given by (see, e.g., Bickel and Freedman 1981)

W_2(P,Q) = [∫_0^1 (F^{-1}(t) − G^{-1}(t))² dt]^{1/2}.   (5)

Recall that F^{-1} is defined on (0,1) by F^{-1}(t) = inf{s : F(s) ≥ t}, which, viewed as a random variable on (0,1), has distribution function F. From this, it is obvious that, for the probability measures based on two samples (resp., one sample and a theoretical distribution), W_2 coincides with the L_2 distance to the diagonal in a Q–Q plot (resp., probability plot).
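For empirical distributions, (5) can be evaluated directly from sample quantiles on a regular grid in (0,1). The sketch below is our own illustration (function names are ours, not from the article):

```python
import numpy as np

def wasserstein2(x, y, grid=2000):
    """Approximate the L2-Wasserstein distance of eq. (5) between the
    empirical distributions of samples x and y, using their quantile
    functions evaluated at grid midpoints of (0,1)."""
    t = (np.arange(grid) + 0.5) / grid       # midpoints of (0,1)
    qx = np.quantile(x, t)                   # empirical F^{-1}(t)
    qy = np.quantile(y, t)                   # empirical G^{-1}(t)
    return float(np.sqrt(np.mean((qx - qy) ** 2)))
```

For two samples of the same size this is essentially the L_2 distance between the sorted samples, that is, the distance to the diagonal in a Q–Q plot; for a pure location shift the distance reduces to the shift itself.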

From (5) and Proposition 1, we obtain the equivalent expression of (4) as

τ_α(F,G) = inf_{h∈C_α} ∫_0^1 (F^{-1}(t) − G^{-1}(t))² h′(t) dt.   (6)


The infimum in (6) is easily attained at the function h_0 below (see Gordaliza 1991), associated with a set with Lebesgue measure 1 − α. We call this minimizer h_0 an impartial α-trimming between P and Q. Obviously, after Proposition 1, h_0(F(x)) and h_0(G(x)) are the distribution functions of the impartially α-trimmed probabilities.

To analyze (6), let us consider the map t ↦ |F^{-1}(t) − G^{-1}(t)| as a random variable defined on (0,1) endowed with the Lebesgue measure ℓ. Let

L_{F,G}(x) = ℓ{t ∈ (0,1) : |F^{-1}(t) − G^{-1}(t)| ≤ x}, x ≥ 0,

denote its distribution function, and write L^{-1}_{F,G} for the corresponding quantile inverse. If L_{F,G} is continuous at L^{-1}_{F,G}(1 − α), then L_{F,G}(L^{-1}_{F,G}(1 − α)) = 1 − α and

inf_{h∈C_α} ∫_0^1 (F^{-1}(t) − G^{-1}(t))² h′(t) dt = ∫_0^1 (F^{-1}(t) − G^{-1}(t))² h′_0(t) dt,   (7)

where

h′_0(t) = (1/(1 − α)) I(|F^{-1}(t) − G^{-1}(t)| ≤ L^{-1}_{F,G}(1 − α)).   (8)

In this case h_0 is in fact the unique minimizer of the criterion functional.

Even if L_{F,G} is not continuous at L^{-1}_{F,G}(1 − α), we can ensure the existence of a set A_0 (not necessarily unique) such that ℓ(A_0) = 1 − α and

{t ∈ (0,1) : |F^{-1}(t) − G^{-1}(t)| < L^{-1}_{F,G}(1 − α)} ⊂ A_0 ⊂ {t ∈ (0,1) : |F^{-1}(t) − G^{-1}(t)| ≤ L^{-1}_{F,G}(1 − α)}.

Obviously, if for any such A_0 we consider the function h_0 ∈ C_α with h′_0 = (1/(1 − α)) I_{A_0}, then the infimum in (6) is attained at h_0. Therefore, problem (6) is equivalent to

inf_A ( (1/(1 − α)) ∫_A (F^{-1}(t) − G^{-1}(t))² dt )^{1/2},   (9)

where A varies on the Borel sets in (0,1) with Lebesgue measure equal to 1 − α.
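Numerically, (9) says the impartial trimming keeps the (1 − α)-fraction of t values where the quantile functions agree best. A grid sketch of ours (an illustration with hypothetical names, not the authors' code):

```python
import numpy as np

def impartial_tau(x, y, alpha=0.1, grid=2000):
    """Approximate tau_alpha(F, G) of eq. (6) for two samples via the
    minimizer (8)/(9): average the squared quantile differences over the
    set A0 that collects the smallest (1 - alpha)-fraction of them."""
    t = (np.arange(grid) + 0.5) / grid
    d2 = (np.quantile(x, t) - np.quantile(y, t)) ** 2
    k = int(round((1 - alpha) * grid))      # grid version of l(A0) = 1 - alpha
    # (1/(1-alpha)) * integral over A0  ~  mean of the k smallest grid values
    return float(np.sort(d2)[:k].mean())
```

Trimming wherever |F^{-1} − G^{-1}| is largest, be it in the tails or in the center of (0,1), is exactly what distinguishes this impartial scheme from symmetric tail trimming.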

2.1 Comparison With Symmetric Trimming

Munk and Czado (1998) (see also Czado and Munk 1998; Freitag, Czado, and Munk 2007) considered a trimmed version of the Wasserstein distance for the assessment of similarity between the distribution functions F and G as

Γ_α(F,G) = ( (1/(1 − α)) ∫_{α/2}^{1−α/2} (F^{-1}(t) − G^{-1}(t))² dt )^{1/2}.   (10)

Note that the right side of the foregoing expression equals W_2(P_α, Q_α), where P_α is the probability measure with distribution function

F_α(t) = (1/(1 − α))(F(t) − α/2), F^{-1}(α/2) ≤ t < F^{-1}(1 − α/2),   (11)

Figure 1. Histograms of trimmed data (cholesterol levels) in two clinical centers. The white part in the bars shows the trimming proportion in the associated zone.

and similarly for Q_α. When comparing two samples, this corresponds to the distance between the sample distributions associated with the symmetrically trimmed samples. This naive way of trimming is widely used and confers protection against contamination by outliers. However, the arbitrariness in the choice of the trimming zones has been largely reported as a serious drawback of procedures based on this method (see, e.g., Cuesta et al. 1997; García-Escudero et al. 2008; Gordaliza 1991; Rousseeuw 1985). In our setting the question is why two distributions that are very different in their tails are considered similar, but ones that differ in their central parts are considered nonsimilar.
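For contrast, the Munk–Czado distance (10) fixes the trimming region to the two tails. A grid evaluation, again as an illustrative sketch of ours rather than the original authors' code:

```python
import numpy as np

def symmetric_trimmed_gamma(x, y, alpha=0.1, grid=2000):
    """Symmetrically trimmed Wasserstein distance of eq. (10): integrate
    the squared quantile differences over (alpha/2, 1 - alpha/2) only."""
    u = (np.arange(grid) + 0.5) / grid
    t = alpha / 2 + (1 - alpha) * u     # grid on (alpha/2, 1 - alpha/2)
    d2 = (np.quantile(x, t) - np.quantile(y, t)) ** 2
    # mean over this grid equals (1/(1-alpha)) * integral over the interval
    return float(np.sqrt(d2.mean()))
```

Two samples differing only on, say, a central band of quantiles keep a positive Γ_α here no matter what, whereas the impartial trimming of (9) could remove exactly that band.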

To get an idea of the differences between our approach and the symmetrical trimming, let us recall example 1 of Munk and Czado (1998), which corresponds to a multiclinical study on cholesterol and fibrinogen levels in two sets of patients (of size 116 and 141) in two clinical centers. For the fibrinogen data, our impartial trimming proposal for α = .1 essentially coincides with the symmetrical trimming. But Figure 1 displays the effects of our trimming proposal for the cholesterol data, showing a significant trimming in the middle part of the histograms as well, corresponding to both centers. This even improves on the level of similarity shown by Munk and Czado (1998), strengthening their assessment of similarity on these data, but also provides a descriptive look at the way in which both populations (dis)agree. The subsequent analysis of the trimmed data can be very useful in a global comparison of the populations.

2.2 Nonparametric Test of Similarity

As is usual in many statistical analyses, the interest of statisticians when analyzing similarity of distributions relies on asserting the equivalence of the involved probability distributions. In hypothesis testing this is achieved by taking equivalence, or similarity, as the alternative hypothesis, whereas dissimilarity is the null hypothesis. In agreement with this point of view, Munk and Czado (1998) considered the testing problem with the null hypothesis that the trimmed distance (10) exceeds some value Δ_0, a threshold to be analyzed by the experimenters and statisticians in an ad hoc way. Graphics of p values for different values of Δ_0 (see Fig. 4 in Sec. 3) play a key role in this analysis, and the fact that our measure of dissimilarity, (τ_α(F,G))^{1/2}, is measured on the same scale as the variable of interest favors this goal. We also note that it is the Wasserstein distance between trimmed versions of the original distributions. This allows us to


handle the very nice properties of this distance (see, e.g., Bickel and Freedman 1981) in a friendly way in connection with our problem.

Let X_1, ..., X_n (resp., Y_1, ..., Y_m) be iid observations with common distribution function F (resp., G), and let X_(1), ..., X_(n) (resp., Y_(1), ..., Y_(m)) be the corresponding ordered samples. We base our test of H_0: τ_α(F,G) > Δ_0² against H_a: τ_α(F,G) ≤ Δ_0² on the empirical counterparts of τ_α(F,G), namely T_{n,α} = τ_α(F_n, G), where F_n denotes the empirical distribution function based on the data in the one-sample problem, and T_{n,m,α} = τ_α(F_n, G_m) in the two-sample case. Our next results show that, under some mild assumptions on F and G, T_{n,α} and T_{n,m,α} are asymptotically normal, a fact that we use later to approximate the critical values of H_0 against H_a. For notational reasons, in the Appendix we give the proof only of the one-sample statement.

To obtain the asymptotic behavior of our statistics, we assume that

F and G have absolute moments of order 4 + δ for some δ > 0.   (12)

A further regularity assumption is that F has a continuously differentiable density F′ = f such that

sup_{x∈ℝ} |F(x)(1 − F(x)) f′(x)/f²(x)| < ∞.   (13)

Additional notation includes h_0 as defined in (8) and

l(t) = ∫_{F^{-1}(1/2)}^{F^{-1}(t)} (x − G^{-1}(F(x))) h′_0(F(x)) dx   (14)

and

s²_{n,α}(G) = (4/(1 − α)²) (1/n) Σ_{i,j=1}^{n−1} (min{i, j} − ij/n) a_{n,i} a_{n,j},   (15)

where

a_{n,i} = (X_{(i+1)} − X_{(i)}) ((X_{(i+1)} + X_{(i)})/2 − G^{-1}(i/n)) I(|X_{(i)} − G^{-1}(i/n)| ≤ L^{-1}_{F_n,G}(1 − α)).   (16)

Theorem 2. Assume that F and G satisfy (12) and (13) and that L_{F,G} is continuous at L^{-1}_{F,G}(1 − α). Then √n(T_{n,α} − τ_α(F,G)) is asymptotically centered normal with variance

σ²_α(F,G) = 4( ∫_0^1 l²(t) dt − (∫_0^1 l(t) dt)² ).   (17)

This asymptotic variance can be consistently estimated by s²_{n,α}(G), given by (15). If G also satisfies (13) and n/(n + m) → λ ∈ (0,1), then √(nm/(n + m)) (T_{n,m,α} − τ_α(F,G)) is asymptotically centered normal with variance (1 − λ)σ²_α(F,G) + λσ²_α(G,F). This variance can be consistently estimated by s²_{n,m,α} = (m/(n + m)) s²_{n,α}(G_m) + (n/(n + m)) s²_{m,α}(F_n).

If τ_α(F,G) = 0, then Theorem 2 reduces to √n T_{n,α} → 0 in probability; note that τ_α(F,G) = 0 implies that (x − G^{-1}(F(x)))² h′_0(F(x)) = 0 for almost every x, and thus σ²_α(F,G) = 0. This generally would suffice for our applications, but we also give the exact rate and the limiting distribution in the Appendix.

Figure 2. Histograms for variable GPA: (a) males; (b) females; (c) computer science students; (d) engineering students.

3. EXAMPLE AND SIMULATIONS

Our analysis is based on the variable college grade point average (GPA), collected from a group of 234 students. This variable takes values between 0 and 4. The students are classified by the variables gender and major (1 = computer science, 2 = engineering, 3 = other sciences). We are interested in studying the distributional similarity of the GPA between males (n = 117) and females (m = 117), and also between students with a major in computer science (n = 78) and students with a major in engineering (m = 78). Figure 2 shows the histogram for each sample.

Comparisons of these samples using classical procedures produce the results displayed in Table 1. Because the Shapiro–Wilk tests reject the normality of the four samples, we use nonparametric methods like the Kolmogorov–Smirnov test (KS) or the Wilcoxon–Mann–Whitney test (WMW) to analyze the null hypothesis that both samples come from the same distribution in the comparisons of GPA by sex and GPA by major. The p values of these tests clearly reject the null hypotheses.

Under the possibility of impartially trimming both samples, as described in Section 2, we obtain the optimal trimming functions displayed in Figure 3. In this figure, and for each comparison, we plot the value of |F_n^{-1}(t) − G_m^{-1}(t)| and the cutting values L^{-1}_{F_n,G_m}(1 − α) for α = .05, .1, and .2. Figure 3(a) shows that the optimal trimming involves the lower tail, but not exactly from the lower end point. When the trimming level grows

Table 1. Two-sample p values for classical tests

                                    p value
Test                        GPA by gender    GPA by major
Shapiro–Wilk (sample 1)          .176            .360
Shapiro–Wilk (sample 2)          .217            .001
KS                               .028            .040
WMW                              .004            .175


Figure 3. Trimming functions: (a) GPA by gender; (b) GPA by major (α = .05, .1, .2).

(α = .1 and .2), the trimmed zone is not an interval, and it includes points around the 20th, 40th, 60th, and 70th percentiles. Figure 3(b) shows that the points that should be trimmed to make both samples more similar are between the 10th and 30th percentiles. This example illustrates a nonsymmetrical dissimilarity between samples; in fact, in the first comparison the less similar zone is close to the lower tail, but not to the upper tail, where the values are more similar.

3.1 p Value Curve

To gain some insight into the assessment of the similarity or dissimilarity of the underlying distributions, we can use the p value curve to test the null hypothesis H_0: τ_α(F,G) > Δ_0² against H_a: τ_α(F,G) ≤ Δ_0². In the two-sample comparison case, we use the statistic

Z_{n,m,α} = √(nm/(n + m)) (T_{n,m,α} − Δ_0²)/s_{n,m,α}.   (18)

To obtain the values of T_{n,m,α}, we compute |F_n^{-1}(t) − G_m^{-1}(t)|² over a grid in [0,1], using the (1 − α)-quantile of these values to determine L^{-1}_{F_n,G_m}(1 − α). The integral is then calculated numerically. The computation of s_{n,m,α} is done similarly.
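Given T_{n,m,α} and s_{n,m,α}, the asymptotic p value at a threshold Δ_0 is Φ(z_0), the standard normal cdf evaluated at the observed score from (18); sweeping Δ_0 traces a p value curve. A minimal sketch of ours:

```python
from math import erf, sqrt

def p_value(T, s, n, m, delta0):
    """Asymptotic p value for H0: tau_alpha(F,G) > delta0**2, using the
    standardized score Z_{n,m,alpha} of eq. (18). Small values support
    similarity (the alternative hypothesis)."""
    z = sqrt(n * m / (n + m)) * (T - delta0 ** 2) / s
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))    # Phi(z), standard normal cdf
```

Plotting `p_value(T, s, n, m, d)` over a range of thresholds d gives curves like those in Figure 4; the Δ_0 at which the curve crosses a chosen level (e.g., .05) quantifies the demonstrated degree of similarity on the scale of the data.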

The asymptotic p value curve P(0) is defined as

P(0) = sup(FG)isinH0

limnmrarrinfinPFG(Znmα le z0) = (z0)

where z0 is the observed value of Znmα (Note that the supre-mum is attained when the distance between both distributionsis exactly 0) These asymptotic p value curves can be used intwo ways On one hand given a fixed value of 0 that con-trols the degree of dissimilarity it is possible to find the p

value associated with the corresponding null hypothesis to de-cide whether or not the distributions are similar On the otherhand given a fixed test level (p value) we can find the valueof 0 such that for every ge 0 we should reject the hy-pothesis H0 τα(FG) ge 2 In this way we can get a soundidea of the degree of dissimilarity between the distributions Tohandle the values of 0 the experimenter should take into ac-count how to interpret the Wasserstein distance recalling thatin the case where F and G belong to the same location fam-ily their Wasserstein distance is the absolute difference of theirlocations

Figure 4 illustrates the improved assessment obtained by impartial trimming over the Munk and Czado methodology. It displays the p value curves using impartial trimming and symmetrical trimming for both comparisons, for different trimming levels (α = .05, .1, and .2). For each plot, a horizontal line marks a reference level for the test (.05). The GPAs of males and females are similar up to $\Delta_0$ ranging from .32 to .36 (depending on the trimming size) when impartial trimmings are used. These values represent between $100 \times .32/2.815 = 11.4\%$ and 12.8% of the average of the medians of the samples. But when using symmetrical trimmings, the horizontal line cuts the p value curves for $\Delta_0$ ranging from .56 to .59, between 20% and 21% of the average of the medians. A similar analysis of the comparison of GPAs by major leads us to values of $\Delta_0$ ranging from .29 to .36, which represent between 9.6% and 11.9% of the average of the medians when using impartial trimming. Instead, when using symmetrical trimming, these percentages range from 16.6% to 19.5%.

Figure 4. p value curves using impartial and symmetrical (MC) trimmings: (a) GPA by gender; (b) GPA by major [α = .05; α = .05 (MC); α = .1; α = .1 (MC); α = .2; α = .2 (MC)].

702 Journal of the American Statistical Association, June 2008

3.2 Simulations

We end this section by reporting a small simulation study to illustrate our procedure's performance for finite samples for testing $H_0: \tau_\alpha(F,G) > \Delta_0^2$ against $H_a: \tau_\alpha(F,G) \le \Delta_0^2$ in the two-sample problem. We considered two different contaminated normal models, two different trimming sizes, and several values of the threshold $\Delta_0$. In each situation we generated 10,000 replicas of the trimmed score $Z_{n,m,\alpha}$, as defined in (18), for several values of $n = m$. We compared these replicas with the .05 theoretical quantile of the standard normal distribution, rejecting $H_0$ for observed values smaller than this quantity. Table 2 shows the observed rejection frequencies. We find good agreement with our asymptotic results even for moderate sample sizes, with low rejection frequencies for thresholds $\Delta_0$ smaller than the true distance and high rejection frequencies otherwise. When the threshold equals the true distance, we also can see how the observed frequency approximates the nominal level.

4. CONCLUSIONS AND POSSIBLE EXTENSIONS

We have introduced a procedure to compare two samples or probability distributions on the real line based on the impartial trimming methodology. The procedure is designed mainly to assess similarity of the core of two samples by discarding that part of the data that has a greater influence on the dissimilarity of the distributions. Our method is based on trimming the corresponding samples according to the same trimming function, but it allows nonsymmetrical trimming; thus it can greatly improve on the previous methodology based on simply trimming the tails. We have evaluated the performance of our procedure through an analysis of some real data samples that emphasized the appealing possibilities in data analysis and the significance of the analysis of the p value curves for assessing similarities. A simulation study has also provided evidence about the behavior of the procedure for finite samples, in agreement with asymptotic results. Although we treated only dissimilarities based on the Wasserstein distance, other metrics or dissimilarities could be handled under the same scheme.

Representation of trimmings of any distribution in terms of those of the uniform distribution is no longer possible in the multivariate setting. However, under very general assumptions it has been proven (see Cuesta and Matrán 1989) that, given two probabilities $P$ and $Q$ on $\mathbb{R}^k$, there exists an "optimal transport map" $T$ such that $Q = P \circ T^{-1}$ and, if $X$ is any random vector with law $P$, then $E\|X - T(X)\|^2 = W_2^2(P,Q)$. Moreover, if $P_\alpha$ is an $\alpha$-trimming of $P$, then $P_\alpha \circ T^{-1}$ is an $\alpha$-trimming of $Q$, and $T$ is an optimal map between $P_\alpha$ and $P_\alpha \circ T^{-1}$, so the multivariate version of (4) would be the minimization, over the set of $\alpha$-trimmings of $P$, of the expression $W_2^2(P_\alpha, P_\alpha \circ T^{-1})$. We also should mention that obtaining the optimal map $T$ remains an open problem for $k > 1$. Although these are troubling facts, obtaining the optimal trimming between two samples is already possible through standard optimization procedures. A final difficulty concerns the asymptotic behavior of the involved statistics, to which the techniques used in our proofs do not extend.

Table 2. Simulated powers for the trimmed scores $Z_{n,m,\alpha}$

Model 1: $P = .95\,N(0,1) + .05\,N(5,1)$, $Q = .95\,N(0,1) + .05\,N(-5,1)$; $\alpha = .05$, with $(1-\alpha)\tau_\alpha(P,Q) = .384$.

(1−α)Δ₀²    n      Frequency
.25         100    .0320
.25         200    .0268
.25         500    .0086
.25         1000   .0021
.25         5000   0
.5          100    .1412
.5          200    .1633
.5          500    .2264
.5          1000   .3134
.5          5000   .7648
1           100    .4912
1           200    .6957
1           500    .9474
1           1000   .9989
1           5000   1.0000

Model 2: $P = .9\,N(0,1) + .1\,N(5,1)$, $Q = .9\,N(0,1) + .1\,N(-5,1)$; $\alpha = .1$, with $(1-\alpha)\tau_\alpha(P,Q) = 1.004$.

(1−α)Δ₀²    n      Frequency
.25         100    .0028
.25         200    0
.25         500    0
.25         1000   0
.25         5000   0
.5          100    .0109
.5          200    .0031
.5          500    .0002
.5          1000   0
.5          5000   0
1           100    .0850
1           200    .0727
1           500    .0657
1           1000   .0584
1           5000   .0486

Álvarez-Esteban, del Barrio, Cuesta-Albertos, and Matrán: Trimmed Comparison of Distributions 703

Allowing trimming in both samples with different trimming functions would provide an interesting alternative to our present proposal. Through our research, still in progress, we have identified a radically different behavior to that presented in this article for identical trimming in both samples.

APPENDIX: PROOFS AND FURTHER RESULTS

In this appendix, $\rho_n(t) = \sqrt{n}\,f(F^{-1}(t))\,(F_n^{-1}(t) - F^{-1}(t))$ denotes the weighted quantile process, where $f$ is the density function of $F$.

Proof of Proposition 1

Let $\mathcal{A} = \{P^* \in \mathcal{P} : P^*(-\infty, t] = h(P(-\infty, t]),\ h \in \mathcal{C}_\alpha\}$. For $P^* \in \mathcal{A}$, absolute continuity of $h$ entails

$$P^*(s, t] = h(P(-\infty, t]) - h(P(-\infty, s]) = \int_{P(-\infty, s]}^{P(-\infty, t]} h'(x)\,dx \le \frac{1}{1-\alpha}\,P(s, t].$$

Thus $P^* \ll P$ and $\frac{dP^*}{dP} \le \frac{1}{1-\alpha}$, and therefore $P^* \in \mathcal{T}_\alpha(P)$. Conversely, given $P^* \in \mathcal{T}_\alpha(P)$, if $F$ is the distribution function of $P$ and we define $h(t) = \int_0^t \frac{dP^*}{dP}(F^{-1}(s))\,ds$, then it is immediate that $h \in \mathcal{C}_\alpha$ and

$$P^*(-\infty, t] = \int_{-\infty}^{t} \frac{dP^*}{dP}(s)\,dF(s) = \int_0^{F(t)} \frac{dP^*}{dP}(F^{-1}(s))\,ds = h(P(-\infty, t]).$$

Therefore $P^* \in \mathcal{A}$, and part a is proven. The proof of part b is immediate from the proof of part a.

The following lemmas collect some results which we use in our proofs of Theorems 2 and A.1. These results can be easily proven using Schwarz's inequality, standard arguments in empirical process theory, or the Arzelà–Ascoli theorem.

Lemma A.1. If $F$ and $G$ have finite absolute moment of order $r > 4$, then the following hold:

a. $\sqrt{n}\int_0^{1/n} (F^{-1}(t))^2\,dt \to 0$ and $\sqrt{n}\int_{1-1/n}^{1} (F^{-1}(t))^2\,dt \to 0$.

b. $\sqrt{n}\int_0^{1/n} (F_n^{-1}(t))^2\,dt \to 0$ and $\sqrt{n}\int_{1-1/n}^{1} (F_n^{-1}(t))^2\,dt \to 0$ in probability.

c. $\int_0^1 \frac{\sqrt{t(1-t)}}{g(G^{-1}(t))}\,|F^{-1}(t) - G^{-1}(t)|\,dt < \infty$.

d. Furthermore, if $G$ satisfies (13), then $\frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n} \frac{t(1-t)}{g^2(G^{-1}(t))}\,dt \to 0$.

Lemma A.2. Under the $\|\cdot\|_\infty$ topology, the set $\mathcal{C}_\alpha$ in Proposition 1 and the set $\mathcal{C}_\alpha(F,G) = \{h \in \mathcal{C}_\alpha : \int_0^1 (F^{-1}(t) - G^{-1}(t))^2\,h'(t)\,dt = 0\}$, for $F,G \in \mathcal{P}_2$, are compact.

Proof of Theorem 2

From theorem 6.2.1 of Csörgő and Horváth (1993) and (13), we can assume that there exist Brownian bridges $B_n$ satisfying

$$n^{1/2-\nu} \sup_{1/n \le t \le 1-1/n} \frac{|\rho_n(t) - B_n(t)|}{(t(1-t))^{\nu}} = \begin{cases} O_P(\log n) & \text{if } \nu = 0, \\ O_P(1) & \text{if } 0 < \nu \le 1/2. \end{cases} \qquad (A.1)$$

Now we set $M_n(h) = \sqrt{n}\int_0^1 (F_n^{-1}(t) - G^{-1}(t))^2\,h'(t)\,dt$ and

$$N_n(h) = 2\int_{1/n}^{1-1/n} \frac{B_n(t)}{f(F^{-1}(t))}\,(G^{-1}(t) - F^{-1}(t))\,h'(t)\,dt + \sqrt{n}\int_{1/n}^{1-1/n} (G^{-1}(t) - F^{-1}(t))^2\,h'(t)\,dt.$$

Observe that

$$\sup_{h \in \mathcal{C}_\alpha} |M_n(h) - N_n(h)| \le \sqrt{n}\int_0^{1/n} (F_n^{-1}(t) - G^{-1}(t))^2\,dt + \sqrt{n}\int_{1-1/n}^{1} (F_n^{-1}(t) - G^{-1}(t))^2\,dt + \frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n} \frac{|\rho_n(t) - B_n(t)|^2}{f^2(F^{-1}(t))}\,dt + \frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n} \frac{B_n(t)^2}{f^2(F^{-1}(t))}\,dt + 2\int_{1/n}^{1-1/n} \frac{|\rho_n(t) - B_n(t)|}{f(F^{-1}(t))}\,|G^{-1}(t) - F^{-1}(t)|\,dt = A_{n,1} + A_{n,2} + A_{n,3} + A_{n,4} + A_{n,5}.$$

Lemma A.1 implies that $A_{n,1} \to 0$ and $A_{n,2} \to 0$ in probability. From (A.1) we get $A_{n,3} \le O_P(1)\,\frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n} \frac{t(1-t)}{f^2(F^{-1}(t))}\,dt$, and this last integral tends to 0 by Lemma A.1; thus $A_{n,3} \to 0$ in probability. Similarly, $A_{n,4} \to 0$ in probability. Finally, (A.1) yields $A_{n,5} \le O_P(1)\,n^{\nu-1/2}\int_{1/n}^{1-1/n} \frac{(t(1-t))^{\nu}}{f(F^{-1}(t))}\,|G^{-1}(t) - F^{-1}(t)|\,dt$ for some $\nu \in (0, 1/2)$. Lemma A.1 shows that $\int_0^1 \frac{(t(1-t))^{1/2}}{f(F^{-1}(t))}\,|G^{-1}(t) - F^{-1}(t)|\,dt < \infty$. Thus, by dominated convergence, we obtain that $A_{n,5} \to 0$ in probability. Collecting the foregoing estimates, we obtain $\sup_{h \in \mathcal{C}_\alpha} |M_n(h) - N_n(h)| \to 0$ in probability, and thus $\sqrt{n}(T_{n,\alpha} - S_{n,\alpha}) \to 0$ in probability, where $\sqrt{n}\,S_{n,\alpha} = \inf_{h \in \mathcal{C}_\alpha} N_n(h)$. Therefore, we need only show that $\sqrt{n}(S_{n,\alpha} - \tau_\alpha(F,G)) \to_w N(0, \sigma^2_\alpha(F,G))$, where

$$\sqrt{n}\,S_{n,\alpha} = \inf_{h \in \mathcal{C}_\alpha}\left[ 2\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,h'(t)\,dt + \sqrt{n}\int_0^1 (G^{-1}(t) - F^{-1}(t))^2\,h'(t)\,dt \right].$$

Let us denote

$$h_n = \arg\min_{h \in \mathcal{C}_\alpha}\left\{ \int_0^1 (F^{-1}(t) - G^{-1}(t))^2\,h'(t)\,dt + \frac{2}{\sqrt{n}}\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,h'(t)\,dt \right\}.$$

Clearly $h_n'(t) \to h_0'(t)$ for almost every $t$. Furthermore, optimality of $h_n$ shows that

$$B_n = \sqrt{n}\,S_{n,\alpha} - \left( 2\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,h_0'(t)\,dt + \sqrt{n}\int_0^1 (G^{-1}(t) - F^{-1}(t))^2\,h_0'(t)\,dt \right) \le 0,$$

but in contrast,

$$B_n = \sqrt{n}\left( \int_0^1 (F^{-1}(t) - G^{-1}(t))^2\,h_n'(t)\,dt - \int_0^1 (F^{-1}(t) - G^{-1}(t))^2\,h_0'(t)\,dt \right) + 2\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,(h_n'(t) - h_0'(t))\,dt = B_{n,1} + B_{n,2},$$


and $B_{n,1} \ge 0$ by optimality of $h_0$, whereas $B_{n,2} = o_P(1)$ by the dominated convergence theorem. Therefore $B_n \to 0$ in probability, which shows that

$$\sqrt{n}(T_{n,\alpha} - \tau_\alpha(F,G)) \to_w 2\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,h_0'(t)\,dt. \qquad (A.2)$$

Integrating by parts, we obtain $\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,h_0'(t)\,dt = -\int_0^1 l(t)\,dB(t)$, which proves the asymptotic normality and the expression (17) for the variance. The claim about the variance estimator readily follows by noting that $s^2_{n,\alpha}(G) = 4\left( \int_0^1 l_n^2(t)\,dt - \left( \int_0^1 l_n(t)\,dt \right)^2 \right)$, where $l_n(t) = \int_{F_n^{-1}(1/2)}^{F_n^{-1}(t)} \left( x - G^{-1}(F_n(x)) \right) h_n'(F_n(x))\,dx$ and $h_n = \arg\min_{h \in \mathcal{C}_\alpha} \int (F_n^{-1} - G^{-1})^2\,h'$. It can be shown that, with probability 1, $l_n(t) \to l(t)$ for almost every $t \in (0,1)$. A standard uniform integrability argument completes the proof.

The final result in this section establishes the asymptotic behavior of $nT_{n,\alpha}$ when $F$ and $G$ are equivalent at trimming level $\alpha$. Recall the definition of $\mathcal{C}_\alpha(F,G)$ in Lemma A.2, and note that $\mathcal{C}_\alpha(F,F) = \mathcal{C}_\alpha$, but also that for $F \neq G$, $\mathcal{C}_\alpha(F,G)$ is a proper subset of $\mathcal{C}_\alpha$. Also note that $\mathcal{C}_\alpha(F,G) = \emptyset$ if and only if $\tau_\alpha(F,G) \neq 0$. In fact, the size of $\mathcal{C}_\alpha(F,G)$ depends on the Lebesgue measure of the set $\{t \in (0,1) : F^{-1}(t) \neq G^{-1}(t)\}$: $\tau_\alpha(F,G) = 0$ if and only if the measure of this last set is at most $\alpha$; if it equals $\alpha$, then the only function in $\mathcal{C}_\alpha(F,G)$ corresponds to $h'(t) = \frac{1}{1-\alpha}\,I_{(F^{-1}(t) = G^{-1}(t))}$.

Theorem A.1. If $\tau_\alpha(F,G) = 0$, $F$ satisfies (13), and

$$\int_0^1 \frac{t(1-t)}{f^2(F^{-1}(t))}\,dt < \infty, \qquad (A.3)$$

then

$$nT_{n,\alpha} \to_w \min_{h \in \mathcal{C}_\alpha(F,G)} \int_0^1 \frac{B(t)^2}{f^2(F^{-1}(t))}\,h'(t)\,dt,$$

where $\{B(t)\}_{0<t<1}$ is a Brownian bridge. Because $h \mapsto \int_0^1 \frac{B^2(t)}{f^2(F^{-1}(t))}\,h'(t)\,dt$ is $\|\cdot\|_\infty$-continuous as a function of $h$, it attains its minimum value on $\mathcal{C}_\alpha(F,G)$.

Proof. We define $D_n(h) = n\int_0^1 (F_n^{-1}(t) - G^{-1}(t))^2\,h'(t)\,dt$ and $D(h) = \int_0^1 \frac{B^2(t)}{f^2(F^{-1}(t))}\,h'(t)\,dt$ for $h \in \mathcal{C}_\alpha$. Note that

$$D_n(h) = \int_0^1 \frac{\rho_n^2(t)}{f^2(F^{-1}(t))}\,h'(t)\,dt + n\int_0^1 (F^{-1}(t) - G^{-1}(t))^2\,h'(t)\,dt + 2\sqrt{n}\int_0^1 \frac{\rho_n(t)}{f(F^{-1}(t))}\,(F^{-1}(t) - G^{-1}(t))\,h'(t)\,dt.$$

Also observe that $nT_{n,\alpha} = D_n(h_n)$ for some $h_n \in \mathcal{C}_\alpha$. If $h \in \mathcal{C}_\alpha(F,G)$, then the second and third summands on the right side vanish and $D_n(h) = \int_0^1 \frac{\rho_n^2(t)}{f^2(F^{-1}(t))}\,h'(t)\,dt$. By (13), (A.3), and almost sure representation of weak convergence, versions of $\rho_n(\cdot)/f(F^{-1}(\cdot))$ and $B(\cdot)/f(F^{-1}(\cdot))$ exist (for which we keep the same notation) such that

$$\left\| \frac{\rho_n(\cdot)}{f(F^{-1}(\cdot))} - \frac{B(\cdot)}{f(F^{-1}(\cdot))} \right\|_2 \to 0 \quad \text{a.s.}$$

Now, for these versions, we have

$$\sup_{h \in \mathcal{C}_\alpha(F,G)} |D_n(h) - D(h)| \le \frac{1}{1-\alpha}\int_0^1 \left| \frac{\rho_n^2(t)}{f^2(F^{-1}(t))} - \frac{B^2(t)}{f^2(F^{-1}(t))} \right| dt \to 0 \quad \text{a.s.},$$

whereas for $h_0 \in \mathcal{C}_\alpha \setminus \mathcal{C}_\alpha(F,G)$ we have, a.s., that $D_n(h) \to \infty$ uniformly in a sufficiently small neighborhood of $h_0$. Furthermore, if $h_n \to h \in \mathcal{C}_\alpha(F,G)$, then we can extract a subsequence such that $n\int_0^1 (F^{-1}(t) - G^{-1}(t))^2\,h_n'(t)\,dt \to 0$. The result follows from the next technical lemma, the easy proof of which is omitted.

Lemma A.3. Let $(X, d)$ be a compact metric space, let $A$ be a compact subset of $X$, and let $f_n$ and $f$ be real-valued continuous functions on $X$ such that the following hold:

a. $\sup_{x \in A} |f_n(x) - f(x)| \to 0$ as $n \to \infty$.

b. For $x \in X \setminus A$, there exists $\varepsilon_x > 0$ such that $\inf_{d(y,x) < \varepsilon_x} f_n(y) \to \infty$ as $n \to \infty$.

c. If $x_n \to x \in A$, there exists a subsequence $\{x_m\}$ such that $f_m(x_m) \to f(x)$.

Then $\min_{x \in X} f_n(x) \to \min_{x \in A} f(x)$.

[Received October 2007. Revised January 2008.]

REFERENCES

Bickel, P., and Freedman, D. (1981), "Some Asymptotic Theory for the Bootstrap," The Annals of Statistics, 9, 1196–1217.

Csörgő, M., and Horváth, L. (1993), Weighted Approximations in Probability and Statistics, New York: Wiley.

Cuesta, J. A., and Matrán, C. (1989), "Notes on the Wasserstein Metric in Hilbert Spaces," The Annals of Probability, 17, 1264–1276.

Cuesta, J., Gordaliza, A., and Matrán, C. (1997), "Trimmed k-Means: An Attempt to Robustify Quantizers," The Annals of Statistics, 25, 553–576.

Czado, C., and Munk, A. (1998), "Assessing the Similarity of Distributions—Finite-Sample Performance of the Empirical Mallows Distance," Journal of Statistical Computation and Simulation, 60, 319–346.

Freitag, G., Czado, C., and Munk, A. (2007), "A Nonparametric Test for Similarity of Marginals With Applications to the Assessment of Population Bioequivalence," Journal of Statistical Planning and Inference, 137, 697–711.

García-Escudero, L., Gordaliza, A., Matrán, C., and Mayo-Iscar, A. (2008), "A General Trimming Approach to Robust Cluster Analysis," The Annals of Statistics, to appear.

Gordaliza, A. (1991), "Best Approximations to Random Variables Based on Trimming Procedures," Journal of Approximation Theory, 64, 162–180.

Maronna, R. (2005), "Principal Components and Orthogonal Regression Based on Robust Scales," Technometrics, 47, 264–273.

Moore, D. S., and McCabe, G. P. (2003), Introduction to the Practice of Statistics (4th ed.), New York: W.H. Freeman.

Munk, A., and Czado, C. (1998), "Nonparametric Validation of Similar Distributions and Assessment of Goodness of Fit," Journal of the Royal Statistical Society, Ser. B, 60, 223–241.

Rousseeuw, P. (1985), "Multivariate Estimation With High Breakdown Point," in Mathematical Statistics and Applications, Vol. B, eds. W. Grossmann, G. Pflug, I. Vincze, and W. Wertz, Dordrecht: Reidel, pp. 283–297.



n12minusν sup1nletle1minus1n

|ρn(t) minus Bn(t)|(t (1 minus t))ν

=

OP (logn) if ν = 0

OP (1) if 0 lt ν le 12(A1)

Now we set Mn(h) = radicn

int 10 (Fminus1

n (t) minus Gminus1(t))2hprime(t) dt and

Nn(h) = 2int 1minus1n

1n

Bn(t)

f (Fminus1(t))(Gminus1(t) minus Fminus1(t))hprime(t) dt

+ radicn

int 1minus1n

1n(Gminus1(t) minus Fminus1(t))2hprime(t) dt

Observe that

suphisinCα

|Mn(h) minus Nn(h)|

le radicn

int 1n

0(Fminus1

n (t) minus Gminus1(t))2 dt

+ radicn

int 1

1minus1n(Fminus1

n (t) minus Gminus1(t))2 dt

+ 1radicn

int 1minus1n

1n

|ρn(t) minus Bn(t)|2f 2(Fminus1(t))

dt

+ 1radicn

int 1minus1n

1n

Bn(t)2

f 2(Fminus1(t))dt

+ 2int 1minus1n

1n

|ρn(t) minus Bn(t)|f (Fminus1(t))

|Gminus1(t) minus Fminus1(t)|dt

= An1 + An2 + An3 + An4 + An5

Lemma A1 implies that An1 rarr 0 and An2 rarr 0 in probability

From (A1) we get An3 le OP (1) 1radicn

int 1minus1n1n

t (1minust)

f 2(Fminus1(t))dt and

this last integral tends to 0 by Lemma A1 Thus An3 rarr 0 inprobability Similarly An4 rarr 0 in probability Finally (A1) yields

An5 le OP (1)nνminus12 int 1minus1n1n

(t (1minust))ν

f (Fminus1(t))|Gminus1(t) minus Fminus1(t)|dt for

some ν isin (012) Lemma A1 shows thatint 1

0(t (1minust))12

f (Fminus1(t))|Gminus1(t) minus

Fminus1(t)|dt lt infin Thus by dominated convergence we obtain thatAn5 rarr 0 in probability Collecting the foregoing estimates we obtainsuphisinCα

|Mn(h) minus Nn(h)| rarr 0 in probability and thusradic

n(Tnα minusSnα) rarr 0 in probability where

radicnSnα = infhisinCα

Nn(h) Therefore

we need only show thatradic

n(Snα minus τα(FG)) rarrw N(0 σ 2α(FG))

where

radicnSnα = inf

hisinCα

[

2int 1

0B(t)

Gminus1(t) minus Fminus1(t)

f (Fminus1(t))hprime(t) dt

+ radicn

int 1

0(Gminus1(t) minus Fminus1(t))2hprime(t) dt

]

Let us denote

hn = arg minhisinCα

int 1

0(Fminus1(t) minus Gminus1(t))2hprime(t) dt

+ 2radicn

int 1

0B(t)

Gminus1(t) minus Fminus1(t)

f (Fminus1(t))hprime(t) dt

Clearly hprimen(t) rarr hprime

0(t) for almost every t Furthermore optimality ofhn shows that

Bn = radicnSnα minus

(

2int 1

0B(t)

Gminus1(t) minus Fminus1(t)

f (Fminus1(t))hprime

0(t) dt

+ radicn

int 1

0(Gminus1(t) minus Fminus1(t))2hprime

0(t) dt

)

le 0

but in contrast

Bn = radicn

(int 1

0(Fminus1(t) minus Gminus1(t))2hprime

n(t) dt

minusint 1

0(Fminus1(t) minus Gminus1(t))2hprime

0(t) dt

)

+ 2int 1

0B(t)

Gminus1(t) minus Fminus1(t)

f (Fminus1(t))(hprime

n(t) minus hprime0(t)) dt

= Bn1 + Bn2

704 Journal of the American Statistical Association June 2008

and Bn1 ge 0 by optimality of h0 whereas Bn2 = oP (1) by the dom-inated convergence theorem Therefore Bn rarr 0 in probability whichshows that

radicn(Tnα minus τα(FG))

wrarr2int 1

0B(t)

Gminus1(t) minus Fminus1(t)

f (Fminus1(t))hprime

0(t) dt

(A2)

Integrating by parts we obtainint 1

0 B(t)Gminus1(t)minusFminus1(t)

f (Fminus1(t))hprime

0(t) dt =minus int 1

0 l(t) dB(t) which proves the asymptotic normality and the ex-pression (17) for the variance The claim about the variance es-timator readily follows by noting that s2

nα(G) = 4(int 1

0 l2n(t) dt minus(int 1

0 ln(t) dt)2) where ln(t) = int Fminus1n (t)

Fminus1n (12)

(x minus Gminus1(Fn(x))) timeshprimen(Fn(x)) dx and hn(t) = arg minhisinCα

int(Fminus1

n minus Gminus1)hprime It can beshown that with probability 1 ln(t) rarr l(t) for almost every t isin (01)A standard uniform integrability argument completes the proof

The final result in this section establishes the asymptotic behaviorof nTnα when F and G are equivalent at trimming level α Recall thedefinition of Cα(FG) in Lemma A2 and note that Cα(FF ) = Cα but also note that for F = G we have that Cα(FG) is a proper subsetof Cα Also note that Cα(FG) = empty if and only if τα(FG) = 0 In factthe size of Cα(FG) depends on the Lebesgue measure of the set t isin(01) Fminus1(t) = Gminus1(t) τα(FG) = 0 if and only if the measureof this last set is at most α if it equals α then the only function inCα(FG) corresponds to hprime(t) = 1

1minusαI(Fminus1(t)=Gminus1(t))

Theorem A1 If τα(FG) = 0 F satisfies (13) andint 1

0

t (1 minus t)

f 2(Fminus1(t))dt lt infin (A3)

then

nTnαwrarr min

hisinCα(FG)

int 1

0

B(t)2

f 2(Fminus1(t))hprime(t) dt

where B(t)0lttlt1 is a Brownian bridge Because h rarr int 10 B2(t)

f 2(Fminus1(t))hprime(t) dt is middot infin-continuous as a function of h it attainsits minimum value on Cα(FG)

Proof We define Dn(h) = nint 1

0 (Fminus1n (t) minus Gminus1(t))2hprime(t) dt and

D(h) = int 10 B2(t)f 2(Fminus1(t))hprime(t) dt for h isin Cα Note that

Dn(h) =int 1

0

ρ2n(t)

f 2(Fminus1(t))hprime(t) dt

+ n

int 1

0(Fminus1(t) minus Gminus1(t))2hprime(t) dt

+ 2radic

n

int 1

0

ρn(t)

f (Fminus1(t))(Fminus1(t) minus Gminus1(t))hprime(t) dt

Also observe that nTnα = Dn(hn) for some hn isin Cα If h isin Cα(FG)then the second and third summands on the right side vanish and

Dn(h) = int 10

ρ2n(t)

f 2(Fminus1(t))hprime(t) dt By (13) (A3) and as representa-

tion of weak convergence versions of ρn(middot)f (Fminus1(middot)) and B(middot)

f (Fminus1(middot)) exist (for which we keep the same notation) such that∥∥ρn(middot)f (Fminus1(middot)) minus B(middot)f (Fminus1(middot))∥∥2 rarr 0 as

Now for these versions we have

suphisinCα(FG)

|Dn(h) minus D(h)|

le 1

1 minus α

int 1

0

∣∣∣∣

ρ2n(t)

f 2(Fminus1(t))minus B2(t)

f 2(Fminus1(t))

∣∣∣∣dt rarr 0 as

whereas for h0 isin Cα minus Cα(FG) we have as that Dn(h) rarr infin uni-formly in a sufficiently small neighborhood of h0 Furthermore ifhn rarr h isin Cα(FG) then we can extract a subsequence such thatn

int 10 (Fminus1(t) minus Gminus1(t))2hprime

n(t) dt rarr 0 The result follows from thenext technical lemma the easy proof of which is omitted

Lemma A3 Let (Xd) be a compact metric space let A be a com-pact subset of X and let fn and f be realvalued continuous func-tions on X such that the following hold

a supxisinA |fn(x) minus f (x)| rarr 0 as n rarr infinb For x isin X minus A there exists εx gt 0 such that

infd(yx)ltεxfn(y) rarr infin as n rarr infin

c If xn rarr x isin A there exists a subsequence xm such thatfm(xm) rarr f (x)

Then minxisinX fn(x) rarr minxisinA f (x)

[Received October 2007 Revised January 2008]

REFERENCES

Bickel P and Freedman D (1981) ldquoSome Asymptotic Theory for the Boot-straprdquo The Annals of Statistics 9 1196ndash1217

Csoumlrgouml M and Horvaacuteth L (1993) Weighted Approximations in Probabilityand Statistics New York Wiley

Cuesta J A and Matraacuten C (1989) ldquoNotes on the Wasserstein Metric inHilbert Spacesrdquo The Annals of Probability 17 1264ndash1276

Cuesta J Gordaliza A and Matraacuten C (1997) ldquoTrimmed k-Means An At-tempt to Robustify Quantizersrdquo The Annals of Statistics 25 553ndash576

Czado C and Munk A (1998) ldquoAssessing the Similarity of DistributionsmdashFinite-Sample Performance of the Empirical Mallows Distancerdquo Journal ofStatistical Computation and Simulation 60 319ndash346

Freitag G Czado C and Munk A (2007) ldquoA Nonparametric Test for Simi-larity of Marginals With Applications to the Assessment of Population Bioe-quivalencerdquo Journal of Statistical Planning and Inference 137 697ndash711

Garciacutea-Escudero L Gordaliza A Matraacuten C and Mayo-Iscar A (2008)ldquoA General Trimming Approach to Robust Cluster Analysisrdquo The Annals ofStatistics to appear

Gordaliza A (1991) ldquoBest Approximations to Random Variables Based onTrimming Proceduresrdquo Journal of Approximation Theory 64 162ndash180

Maronna R (2005) ldquoPrincipal Components and Orthogonal Regression Basedon Robust Scalesrdquo Technometrics 47 264ndash273

Moore D S and McCabe G P (2003) Introduction to the Practice of Statis-tics (4th ed) New York WH Freeman

Munk A and Czado C (1998) ldquoNonparametric Validation of Similar Dis-tributions and Assessment of Goodness of Fitrdquo Journal of Royal StatisticalSociety Ser B 60 223ndash241

Rousseeuw P (1985) ldquoMultivariate Estimation With High Breakdown Pointrdquoin Mathematical Statistics and Applications Vol B eds W GrossmannG Pflug I Vincze and W Werz Dordrecht Reidel pp 283ndash297

700 Journal of the American Statistical Association June 2008

handle the very nice properties of this distance (see, e.g., Bickel and Freedman 1981) in a friendly way in connection with our problem.

Let $X_1, \ldots, X_n$ (resp. $Y_1, \ldots, Y_m$) be i.i.d. observations with common distribution function $F$ (resp. $G$), and let $X_{(1)}, \ldots, X_{(n)}$ (resp. $Y_{(1)}, \ldots, Y_{(m)}$) be the corresponding ordered samples. We base our test of $H_0 : \tau_\alpha(F,G) > \Delta_0^2$ against $H_a : \tau_\alpha(F,G) \le \Delta_0^2$ on the empirical counterparts of $\tau_\alpha(F,G)$, namely $T_{n,\alpha} = \tau_\alpha(F_n, G)$, where $F_n$ denotes the empirical distribution function based on the data in the one-sample problem, and $T_{n,m,\alpha} = \tau_\alpha(F_n, G_m)$ in the two-sample case. Our next results show that, under some mild assumptions on $F$ and $G$, $T_{n,\alpha}$ and $T_{n,m,\alpha}$ are asymptotically normal, a fact that we use later to approximate the critical values of $H_0$ against $H_a$. For notational reasons, in the Appendix we give the proof only of the one-sample statement.

To obtain the asymptotic behavior of our statistics, we assume that

$F$ and $G$ have absolute moments of order $4 + \delta$ for some $\delta > 0$. (12)

A further regularity assumption is that $F$ has a continuously differentiable density $F' = f$ such that

$$\sup_{x \in \mathbb{R}} \left| \frac{F(x)(1 - F(x)) f'(x)}{f^2(x)} \right| < \infty. \quad (13)$$

Additional notation includes $h_0$ as defined in (8), and

$$l(t) = \int_{F^{-1}(1/2)}^{F^{-1}(t)} \bigl( x - G^{-1}(F(x)) \bigr)\, h_0'(F(x))\, dx \quad (14)$$

and

$$s_{n,\alpha}^2(G) = \frac{4}{(1-\alpha)^2}\, \frac{1}{n} \sum_{i,j=1}^{n-1} \left( \min\{i,j\} - \frac{ij}{n} \right) a_{n,i}\, a_{n,j}, \quad (15)$$

where

$$a_{n,i} = \bigl( X_{(i+1)} - X_{(i)} \bigr) \left( \frac{X_{(i+1)} + X_{(i)}}{2} - G^{-1}(i/n) \right) \times I_{\{|X_{(i)} - G^{-1}(i/n)| \le L^{-1}_{F_n,G}(1-\alpha)\}}. \quad (16)$$

Theorem 2. Assume that $F$ and $G$ satisfy (12) and (13), and that $L_{F,G}$ is continuous at $L^{-1}_{F,G}(1-\alpha)$. Then $\sqrt{n}(T_{n,\alpha} - \tau_\alpha(F,G))$ is asymptotically centered normal with variance

$$\sigma_\alpha^2(F,G) = 4 \left( \int_0^1 l^2(t)\, dt - \left( \int_0^1 l(t)\, dt \right)^2 \right). \quad (17)$$

This asymptotic variance can be consistently estimated by $s_{n,\alpha}^2(G)$, given by (15). If $G$ also satisfies (13) and $\frac{n}{n+m} \to \lambda \in (0,1)$, then $\sqrt{\frac{nm}{n+m}}\,(T_{n,m,\alpha} - \tau_\alpha(F,G))$ is asymptotically centered normal with variance $(1-\lambda)\sigma_\alpha^2(F,G) + \lambda \sigma_\alpha^2(G,F)$. This variance can be consistently estimated by $s_{n,m,\alpha}^2 = \frac{m}{n+m}\, s_{n,\alpha}^2(G_m) + \frac{n}{n+m}\, s_{m,\alpha}^2(F_n)$.
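To make (15)–(16) concrete, the following is a rough numerical sketch of how the variance estimator could be evaluated. The function name `var_estimate` and its arguments are our own illustrative choices, and estimating the trimming threshold from the data (when it is not supplied) is an assumption, not the authors' code.

```python
import numpy as np

def var_estimate(x, g_inv, alpha=0.1, cut=None):
    """Sketch of the variance estimator s^2_{n,alpha}(G) from (15)-(16).

    x:     sample from F
    g_inv: quantile function G^{-1} of the reference distribution
    cut:   trimming threshold L^{-1}_{Fn,G}(1-alpha); as an assumption,
           estimated by an empirical (1-alpha)-quantile if not supplied
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n)                 # i = 1, ..., n-1
    gi = g_inv(i / n)                   # G^{-1}(i/n)
    if cut is None:
        cut = np.quantile(np.abs(x[:-1] - gi), 1 - alpha)
    # a_{n,i} from (16), with x[:-1] = X_{(i)} and x[1:] = X_{(i+1)}
    a = (x[1:] - x[:-1]) * ((x[1:] + x[:-1]) / 2 - gi)
    a = a * (np.abs(x[:-1] - gi) <= cut)
    # Brownian-bridge-type kernel min{i,j} - ij/n from (15)
    K = np.minimum.outer(i, i) - np.outer(i, i) / n
    return 4 / (1 - alpha) ** 2 / n * (a @ K @ a)
```

Because the kernel $\min\{i,j\} - ij/n$ is positive semidefinite, the estimate is nonnegative up to floating-point error.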

If $\tau_\alpha(F,G) = 0$, then Theorem 2 reduces to $\sqrt{n}\, T_{n,\alpha} \to 0$ in probability; note that $\tau_\alpha(F,G) = 0$ implies that $(x - G^{-1}(F(x)))^2 h_0'(F(x)) = 0$ for almost every $x$, and thus $\sigma_\alpha^2(F,G) = 0$. This generally would suffice for our applications, but we also give the exact rate and the limiting distribution in the Appendix.

Figure 2. Histograms for variable GPA: (a) males; (b) females; (c) computer science students; (d) engineering students.

3. EXAMPLE AND SIMULATIONS

Our analysis is based on the variable college grade point average (GPA), collected from a group of 234 students. This variable takes values in 0–4. The students are classified by the variables gender and major (1 = computer science, 2 = engineering, 3 = other sciences). We are interested in studying the distributional similarity of the GPA between males ($n = 117$) and females ($m = 117$), and also between students with a major in computer science ($n = 78$) and students with a major in engineering ($m = 78$). Figure 2 shows the histogram for each sample.

Comparisons of these samples using classical procedures produce the results displayed in Table 1. Because the Shapiro–Wilk tests reject the normality of the four samples, we use nonparametric methods, like the Kolmogorov–Smirnov test (KS) or the Wilcoxon–Mann–Whitney test (WMW), to analyze the null hypothesis that both samples come from the same distribution in the comparisons of GPA by sex and GPA by major. The p values of these tests clearly reject the null hypotheses.

Under the possibility of impartially trimming both samples, as described in Section 2, we obtain the optimal trimming functions displayed in Figure 3. In this figure, and for each comparison, we plot the value of $|F_n^{-1}(t) - G_m^{-1}(t)|$ and the cutting values $L^{-1}_{F_n,G_m}(1-\alpha)$ for $\alpha = .05$, $.1$, and $.2$. Figure 3(a) shows that the optimal trimming involves the lower tail, but not exactly from the lower end point. When the trimming level grows

Table 1. Two-sample p values for classical tests

                               p value
Test                       GPA by gender   GPA by major
Shapiro–Wilk (sample 1)        .0176           .0360
Shapiro–Wilk (sample 2)        .0217           .0001
KS                             .0028           .0040
WMW                            .0004           .0175

Álvarez-Esteban, del Barrio, Cuesta-Albertos, and Matrán: Trimmed Comparison of Distributions, 701

Figure 3. Trimming functions: (a) GPA by gender; (b) GPA by major ($\alpha = .05$, $.1$, $.2$).

($\alpha = .1$ and $.2$), the trimmed zone is not an interval, and it includes points around percentiles 20, 40, 60, and 70. Figure 3(b) shows that the points that should be trimmed to make both samples more similar are between percentiles 10 and 30. This example illustrates a nonsymmetrical dissimilarity between samples; in fact, in the first comparison the less similar zone is close to the lower tail, but not to the upper tail, where the values are more similar.

3.1 p Value Curve

To gain some insight into the assessment of the similarity or dissimilarity of the underlying distributions, we can use the p value curve to test the null hypothesis $H_0 : \tau_\alpha(F,G) > \Delta_0^2$ against $H_a : \tau_\alpha(F,G) \le \Delta_0^2$. In the two-sample comparison case we use the statistic

$$Z_{n,m,\alpha} = \sqrt{\frac{nm}{n+m}}\; \frac{T_{n,m,\alpha} - \Delta_0^2}{s_{n,m,\alpha}}. \quad (18)$$

To obtain the values of $T_{n,m,\alpha}$, we compute $|F_n^{-1}(t) - G_m^{-1}(t)|^2$ over a grid in $[0,1]$, using the $(1-\alpha)$-quantile of these values to determine $L^{-1}_{F_n,G_m}(1-\alpha)$. The integral is then calculated numerically. The computation of $s_{n,m,\alpha}$ is done similarly.
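The grid-based computation just described might be sketched as follows. This is a minimal sketch, not the authors' code: the function name `trimmed_w2`, the uniform grid, and the Riemann-sum approximation of the integral are our own choices.

```python
import numpy as np

def trimmed_w2(x, y, alpha=0.1, grid_size=1000):
    """Approximate the trimmed statistic T_{n,m,alpha} between samples x and y.

    Evaluates |F_n^{-1}(t) - G_m^{-1}(t)| on a grid in (0, 1), keeps the
    (1 - alpha) fraction of grid points where the difference is smallest
    (impartial trimming), and integrates the squared difference over that
    set with the optimal density h0' = 1/(1 - alpha).
    """
    t = (np.arange(grid_size) + 0.5) / grid_size   # grid in (0, 1)
    fq = np.quantile(x, t)                         # F_n^{-1}(t)
    gq = np.quantile(y, t)                         # G_m^{-1}(t)
    diff = np.abs(fq - gq)
    cut = np.quantile(diff, 1 - alpha)             # L^{-1}_{Fn,Gm}(1 - alpha)
    keep = diff <= cut
    # Riemann sum of diff^2 * h0'(t), with h0' = 1/(1 - alpha) on the kept set
    return (diff[keep] ** 2).sum() / grid_size / (1 - alpha)
```

For identical samples the statistic is 0, and with $\alpha = 0$ and a pure location shift of size $c$ it returns approximately $c^2$, consistent with the location-family interpretation of the Wasserstein distance given below.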

The asymptotic p value curve $P(\Delta_0)$ is defined as

$$P(\Delta_0) = \sup_{(F,G) \in H_0} \lim_{n,m \to \infty} P_{F,G}(Z_{n,m,\alpha} \le z_0) = \Phi(z_0),$$

where $z_0$ is the observed value of $Z_{n,m,\alpha}$. (Note that the supremum is attained when the distance between both distributions is exactly $\Delta_0$.) These asymptotic p value curves can be used in two ways. On one hand, given a fixed value of $\Delta_0$ that controls the degree of dissimilarity, it is possible to find the p value associated with the corresponding null hypothesis to decide whether or not the distributions are similar. On the other hand, given a fixed test level (p value), we can find the value of $\Delta_0$ such that for every $\Delta \ge \Delta_0$ we should reject the hypothesis $H_0 : \tau_\alpha(F,G) \ge \Delta^2$. In this way we can get a sound idea of the degree of dissimilarity between the distributions. To handle the values of $\Delta_0$, the experimenter should take into account how to interpret the Wasserstein distance, recalling that in the case where $F$ and $G$ belong to the same location family, their Wasserstein distance is the absolute difference of their locations.
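Since the limit above is the standard normal distribution function, the p value curve is straightforward to evaluate once the statistic and its standard error are available. A small sketch (the helper name `p_value_curve` is ours; `T` and `s` stand for the trimmed statistic and its estimated standard error):

```python
import math

def p_value_curve(T, s, n, m, delta0_grid):
    """Asymptotic p value P(delta0) = Phi(z0) for each threshold delta0,
    where z0 = sqrt(nm/(n+m)) * (T - delta0**2) / s, as in statistic (18)."""
    scale = math.sqrt(n * m / (n + m))
    curve = []
    for d0 in delta0_grid:
        z0 = scale * (T - d0 ** 2) / s
        # standard normal CDF via the error function
        curve.append(0.5 * (1.0 + math.erf(z0 / math.sqrt(2.0))))
    return curve
```

As expected, the curve decreases in $\Delta_0$: large thresholds make the similarity hypothesis easier to accept.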

Figure 4. p value curves using impartial and symmetrical (MC) trimmings: (a) GPA by gender; (b) GPA by major ($\alpha = .05$, $\alpha = .05$ (MC); $\alpha = .1$, $\alpha = .1$ (MC); $\alpha = .2$, $\alpha = .2$ (MC)).

Figure 4 illustrates the improved assessment obtained by impartial trimming over the Munk and Czado methodology. It displays the p value curves using impartial trimming and symmetrical trimming for both comparisons, for different trimming levels ($\alpha = .05$, $.1$, and $.2$). For each plot, a horizontal line marks a reference level for the test (.05). The GPAs of males and females are similar up to $\Delta_0$ ranging from .32 to .36 (depending on the trimming size) when impartial trimmings are used. These values represent between $100 \times .32/2.815 = 11.4\%$ and $12.8\%$ of the average of the medians of the samples. But when using symmetrical trimmings, the horizontal line cuts the p value curves for $\Delta_0$ ranging from .56 to .59, between 20% and 21% of the average of the medians. A similar analysis of the comparison of GPAs by major leads us to values of $\Delta_0$ ranging from .29 to .36, which represent between 9.6% and 11.9% of the average of the medians when using impartial trimming. Instead, when using symmetrical trimming, these percentages range from 16.6% to 19.5%.

3.2 Simulations

We end this section by reporting a small simulation study to illustrate our procedure's performance for finite samples, for testing $H_0 : \tau_\alpha(F,G) > \Delta_0^2$ against $H_a : \tau_\alpha(F,G) \le \Delta_0^2$ in the two-sample problem. We considered two different contaminated normal models, two different trimming sizes, and several values of the threshold $\Delta_0$. In each situation we generated 10,000 replicas of the trimmed score $Z_{n,m,\alpha}$, as defined in (18), for several values of $n = m$. We compared these replicas with the .05 theoretical quantile of the standard normal distribution, rejecting $H_0$ for observed values smaller than this quantity. Table 2 shows the observed rejection frequencies. We find good agreement with our asymptotic results even for moderate sample sizes, with low rejection frequencies for thresholds $\Delta_0$ smaller than the true distance and high rejection frequencies otherwise. When the threshold equals the true distance, we also can see how the observed frequency approximates the nominal level.

4. CONCLUSIONS AND POSSIBLE EXTENSIONS

We have introduced a procedure to compare two samples or probability distributions on the real line based on the impartial trimming methodology. The procedure is designed mainly to assess similarity of the core of two samples by discarding that part of the data that has a greater influence on the dissimilarity of the distributions. Our method is based on trimming the corresponding samples according to the same trimming function, but it allows nonsymmetrical trimming; thus it can greatly improve the previous methodology based on simply trimming the tails. We have evaluated the performance of our procedure through an analysis of some real data samples that emphasized the appealing possibilities in data analysis and the significance of the analysis of the p value curves for assessing similarities. A simulation study also has provided evidence about the behavior of the procedure for finite samples, in agreement with asymptotic results. Although we treated only dissimilarities based on the Wasserstein distance, other metrics or dissimilarities could be handled under the same scheme.

Representation of trimmings of any distribution in terms of those of the uniform distribution is no longer possible in the multivariate setting. However, under very general assumptions it has been proven (see Cuesta and Matrán 1989) that, given two probabilities $P$ and $Q$ on $\mathbb{R}^k$, there exists an "optimal transport map" $T$ such that $Q = P \circ T^{-1}$, and if $X$ is any random vector with law $P$, then $E\|X - T(X)\|^2 = W_2^2(P,Q)$. Moreover, if $P_\alpha$ is an $\alpha$-trimming of $P$, then $P_\alpha \circ T^{-1}$ is an $\alpha$-trimming of $Q$, and $T$ is an optimal map between $P_\alpha$ and $P_\alpha \circ T^{-1}$, so the multivariate version of (4) would be the minimization, over the set of $\alpha$-trimmings of $P$, of the expression $W_2^2(P_\alpha, P_\alpha \circ T^{-1})$. We also should mention that obtaining the optimal map $T$ remains an open problem for $k > 1$. Although these are troubling facts, obtaining the optimal trimming between two samples is already possible through standard optimization procedures. A final difficulty concerns the asymptotic behavior of the involved statistics, to which the techniques used in our proofs do not extend.
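As an illustration of the "standard optimization procedures" mentioned above, the squared $W_2$ distance between two (untrimmed) empirical measures can be computed as a linear program over couplings. This is our own sketch, not the authors' method: `w2_discrete` handles only the plain transport problem, and extending the program with trimming variables is beyond this illustration.

```python
import numpy as np
from scipy.optimize import linprog

def w2_discrete(x, y):
    """Squared W2 between empirical measures on R^k via linear programming.

    x, y: arrays of shape (n, k) and (m, k); each point carries mass 1/n, 1/m.
    Minimizes sum_ij gamma_ij |x_i - y_j|^2 subject to marginal constraints.
    """
    n, m = len(x), len(y)
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1).ravel()  # c_ij = |x_i - y_j|^2
    A = np.zeros((n + m, n * m))
    for i in range(n):
        A[i, i * m:(i + 1) * m] = 1.0       # row i of gamma sums to 1/n
    for j in range(m):
        A[n + j, j::m] = 1.0                # column j of gamma sums to 1/m
    b = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])
    res = linprog(C, A_eq=A, b_eq=b, bounds=(0, None), method="highs")
    return res.fun
```

For one-dimensional samples this agrees with the quantile-coupling formula; in higher dimensions it gives the optimal discrete coupling even though a closed form for $T$ is unavailable.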

Table 2. Simulated powers for the trimmed scores $Z_{n,m,\alpha}$

Model 1: $P = .95\,N(0,1) + .05\,N(5,1)$, $Q = .95\,N(0,1) + .05\,N(-5,1)$;
$\alpha = .05$, $(1-\alpha)\tau_\alpha(P,Q) = .384$

$(1-\alpha)\Delta_0^2$      n       Frequency
.25                        100     .0320
                           200     .0268
                           500     .0086
                           1000    .0021
                           5000    0
.5                         100     .1412
                           200     .1633
                           500     .2264
                           1000    .3134
                           5000    .7648
1                          100     .4912
                           200     .6957
                           500     .9474
                           1000    .9989
                           5000    1.0000

Model 2: $P = .9\,N(0,1) + .1\,N(5,1)$, $Q = .9\,N(0,1) + .1\,N(-5,1)$;
$\alpha = .1$, $(1-\alpha)\tau_\alpha(P,Q) = 1.004$

$(1-\alpha)\Delta_0^2$      n       Frequency
.25                        100     .0028
                           200     0
                           500     0
                           1000    0
                           5000    0
.5                         100     .0109
                           200     .0031
                           500     .0002
                           1000    0
                           5000    0
1                          100     .0850
                           200     .0727
                           500     .0657
                           1000    .0584
                           5000    .0486


Allowing trimming in both samples with different trimming functions would provide an interesting alternative to our present proposal. Through our research, still in progress, we have identified a radically different behavior to that presented in this article for identical trimming in both samples.

APPENDIX: PROOFS AND FURTHER RESULTS

In this appendix $\rho_n(t) = \sqrt{n}\, f(F^{-1}(t)) (F_n^{-1}(t) - F^{-1}(t))$ denotes the weighted quantile process, where $f$ is the density function of $F$.

Proof of Proposition 1

Let $\mathcal{A} = \{ P^* \in \mathcal{P} : P^*(-\infty, t] = h(P(-\infty, t]),\ h \in C_\alpha \}$. For $P^* \in \mathcal{A}$, absolute continuity of $h$ entails

$$P^*(s,t] = h(P(-\infty,t]) - h(P(-\infty,s]) = \int_{P(-\infty,s]}^{P(-\infty,t]} h'(x)\, dx \le \frac{1}{1-\alpha}\, P(s,t].$$

Thus $P^* \ll P$ and $\frac{dP^*}{dP} \le \frac{1}{1-\alpha}$, and therefore $P^* \in \mathcal{T}_\alpha(P)$. Conversely, given $P^* \in \mathcal{T}_\alpha(P)$, if $F$ is the distribution function of $P$ and we define $h(t) = \int_0^t \frac{dP^*}{dP}(F^{-1}(s))\, ds$, then it is immediate that $h \in C_\alpha$ and

$$P^*(-\infty,t] = \int_{-\infty}^{t} \frac{dP^*}{dP}(s)\, dF(s) = \int_0^{F(t)} \frac{dP^*}{dP}(F^{-1}(s))\, ds = h(P(-\infty,t]).$$

Therefore $P^* \in \mathcal{A}$, and part a is proven. The proof of part b is immediate from the proof of part a.

The following lemmas collect some results that we use in our proofs of Theorems 2 and A.1. These results can be easily proven using Schwarz's inequality, standard arguments in empirical processes theory, or the Arzelà–Ascoli theorem.

Lemma A.1. If $F$ and $G$ have finite absolute moment of order $r > 4$, then the following hold:

a. $\sqrt{n} \int_0^{1/n} (F^{-1}(t))^2\, dt \to 0$ and $\sqrt{n} \int_{1-1/n}^{1} (F^{-1}(t))^2\, dt \to 0$.
b. $\sqrt{n} \int_0^{1/n} (F_n^{-1}(t))^2\, dt \to 0$ and $\sqrt{n} \int_{1-1/n}^{1} (F_n^{-1}(t))^2\, dt \to 0$ in probability.
c. $\int_0^1 \frac{\sqrt{t(1-t)}}{g(G^{-1}(t))}\, |F^{-1}(t) - G^{-1}(t)|\, dt < \infty$.
d. Furthermore, if $G$ satisfies (13), then $\frac{1}{\sqrt{n}} \int_{1/n}^{1-1/n} \frac{t(1-t)}{g^2(G^{-1}(t))}\, dt \to 0$.

Lemma A.2. Under the $\|\cdot\|_\infty$ topology, the set $C_\alpha$ in Proposition 1 and the set $C_\alpha(F,G) = \{ h \in C_\alpha : \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h'(t)\, dt = 0 \}$, for $F, G \in \mathcal{P}_2$, are compact.

Proof of Theorem 2

From theorem 6.2.1 of Csörgő and Horváth (1993) and (13), we can assume that there exist Brownian bridges $B_n$ satisfying

$$n^{1/2-\nu} \sup_{1/n \le t \le 1-1/n} \frac{|\rho_n(t) - B_n(t)|}{(t(1-t))^\nu} = \begin{cases} O_P(\log n) & \text{if } \nu = 0, \\ O_P(1) & \text{if } 0 < \nu \le 1/2. \end{cases} \quad (A.1)$$

Now we set $M_n(h) = \sqrt{n} \int_0^1 (F_n^{-1}(t) - G^{-1}(t))^2 h'(t)\, dt$ and

$$N_n(h) = 2 \int_{1/n}^{1-1/n} \frac{B_n(t)}{f(F^{-1}(t))}\, (G^{-1}(t) - F^{-1}(t))\, h'(t)\, dt + \sqrt{n} \int_{1/n}^{1-1/n} (G^{-1}(t) - F^{-1}(t))^2 h'(t)\, dt.$$

Observe that

$$
\begin{aligned}
\sup_{h \in C_\alpha} |M_n(h) - N_n(h)| &\le \sqrt{n} \int_0^{1/n} (F_n^{-1}(t) - G^{-1}(t))^2\, dt + \sqrt{n} \int_{1-1/n}^{1} (F_n^{-1}(t) - G^{-1}(t))^2\, dt \\
&\quad + \frac{1}{\sqrt{n}} \int_{1/n}^{1-1/n} \frac{|\rho_n(t) - B_n(t)|^2}{f^2(F^{-1}(t))}\, dt + \frac{1}{\sqrt{n}} \int_{1/n}^{1-1/n} \frac{B_n(t)^2}{f^2(F^{-1}(t))}\, dt \\
&\quad + 2 \int_{1/n}^{1-1/n} \frac{|\rho_n(t) - B_n(t)|}{f(F^{-1}(t))}\, |G^{-1}(t) - F^{-1}(t)|\, dt \\
&= A_{n,1} + A_{n,2} + A_{n,3} + A_{n,4} + A_{n,5}.
\end{aligned}
$$

Lemma A.1 implies that $A_{n,1} \to 0$ and $A_{n,2} \to 0$ in probability. From (A.1) we get

$$A_{n,3} \le O_P(1)\, \frac{1}{\sqrt{n}} \int_{1/n}^{1-1/n} \frac{t(1-t)}{f^2(F^{-1}(t))}\, dt,$$

and this last integral tends to 0 by Lemma A.1. Thus $A_{n,3} \to 0$ in probability. Similarly, $A_{n,4} \to 0$ in probability. Finally, (A.1) yields

$$A_{n,5} \le O_P(1)\, n^{\nu - 1/2} \int_{1/n}^{1-1/n} \frac{(t(1-t))^\nu}{f(F^{-1}(t))}\, |G^{-1}(t) - F^{-1}(t)|\, dt$$

for some $\nu \in (0, 1/2)$. Lemma A.1 shows that $\int_0^1 \frac{(t(1-t))^{1/2}}{f(F^{-1}(t))}\, |G^{-1}(t) - F^{-1}(t)|\, dt < \infty$. Thus, by dominated convergence, we obtain that $A_{n,5} \to 0$ in probability. Collecting the foregoing estimates, we obtain $\sup_{h \in C_\alpha} |M_n(h) - N_n(h)| \to 0$ in probability, and thus $\sqrt{n}(T_{n,\alpha} - S_{n,\alpha}) \to 0$ in probability, where $\sqrt{n}\, S_{n,\alpha} = \inf_{h \in C_\alpha} N_n(h)$. Therefore, we need only show that $\sqrt{n}(S_{n,\alpha} - \tau_\alpha(F,G)) \to_w N(0, \sigma_\alpha^2(F,G))$, where

$$\sqrt{n}\, S_{n,\alpha} = \inf_{h \in C_\alpha} \left[ 2 \int_0^1 B(t)\, \frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\, h'(t)\, dt + \sqrt{n} \int_0^1 (G^{-1}(t) - F^{-1}(t))^2 h'(t)\, dt \right].$$

Let us denote

$$h_n = \arg\min_{h \in C_\alpha} \left\{ \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h'(t)\, dt + \frac{2}{\sqrt{n}} \int_0^1 B(t)\, \frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\, h'(t)\, dt \right\}.$$

Clearly, $h_n'(t) \to h_0'(t)$ for almost every $t$. Furthermore, optimality of $h_n$ shows that

$$B_n = \sqrt{n}\, S_{n,\alpha} - \left( 2 \int_0^1 B(t)\, \frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\, h_0'(t)\, dt + \sqrt{n} \int_0^1 (G^{-1}(t) - F^{-1}(t))^2 h_0'(t)\, dt \right) \le 0,$$

but, in contrast,

$$B_n = \sqrt{n} \left( \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h_n'(t)\, dt - \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h_0'(t)\, dt \right) + 2 \int_0^1 B(t)\, \frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\, (h_n'(t) - h_0'(t))\, dt = B_{n,1} + B_{n,2},$$


and $B_{n,1} \ge 0$ by optimality of $h_0$, whereas $B_{n,2} = o_P(1)$ by the dominated convergence theorem. Therefore, $B_n \to 0$ in probability, which shows that

$$\sqrt{n}(T_{n,\alpha} - \tau_\alpha(F,G)) \stackrel{w}{\to} 2 \int_0^1 B(t)\, \frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\, h_0'(t)\, dt. \quad (A.2)$$

Integrating by parts, we obtain $\int_0^1 B(t)\, \frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\, h_0'(t)\, dt = -\int_0^1 l(t)\, dB(t)$, which proves the asymptotic normality and the expression (17) for the variance. The claim about the variance estimator readily follows by noting that $s_{n,\alpha}^2(G) = 4 (\int_0^1 l_n^2(t)\, dt - (\int_0^1 l_n(t)\, dt)^2)$, where $l_n(t) = \int_{F_n^{-1}(1/2)}^{F_n^{-1}(t)} (x - G^{-1}(F_n(x)))\, h_n'(F_n(x))\, dx$ and $h_n(t) = \arg\min_{h \in C_\alpha} \int (F_n^{-1} - G^{-1})^2 h'$. It can be shown that, with probability 1, $l_n(t) \to l(t)$ for almost every $t \in (0,1)$. A standard uniform integrability argument completes the proof.

The final result in this section establishes the asymptotic behavior of $n T_{n,\alpha}$ when $F$ and $G$ are equivalent at trimming level $\alpha$. Recall the definition of $C_\alpha(F,G)$ in Lemma A.2, and note that $C_\alpha(F,F) = C_\alpha$, but also note that for $F \ne G$ we have that $C_\alpha(F,G)$ is a proper subset of $C_\alpha$. Also note that $C_\alpha(F,G) \ne \emptyset$ if and only if $\tau_\alpha(F,G) = 0$. In fact, the size of $C_\alpha(F,G)$ depends on the Lebesgue measure of the set $\{ t \in (0,1) : F^{-1}(t) \ne G^{-1}(t) \}$; $\tau_\alpha(F,G) = 0$ if and only if the measure of this last set is at most $\alpha$; if it equals $\alpha$, then the only function in $C_\alpha(F,G)$ corresponds to $h'(t) = \frac{1}{1-\alpha}\, I_{\{F^{-1}(t) = G^{-1}(t)\}}$.

Theorem A.1. If $\tau_\alpha(F,G) = 0$, $F$ satisfies (13), and

$$\int_0^1 \frac{t(1-t)}{f^2(F^{-1}(t))}\, dt < \infty, \quad (A.3)$$

then

$$n T_{n,\alpha} \stackrel{w}{\to} \min_{h \in C_\alpha(F,G)} \int_0^1 \frac{B(t)^2}{f^2(F^{-1}(t))}\, h'(t)\, dt,$$

where $\{B(t)\}_{0<t<1}$ is a Brownian bridge. Because $h \mapsto \int_0^1 \frac{B^2(t)}{f^2(F^{-1}(t))}\, h'(t)\, dt$ is $\|\cdot\|_\infty$-continuous as a function of $h$, it attains its minimum value on $C_\alpha(F,G)$.

Proof. We define $D_n(h) = n \int_0^1 (F_n^{-1}(t) - G^{-1}(t))^2 h'(t)\, dt$ and $D(h) = \int_0^1 \frac{B^2(t)}{f^2(F^{-1}(t))}\, h'(t)\, dt$ for $h \in C_\alpha$. Note that

$$D_n(h) = \int_0^1 \frac{\rho_n^2(t)}{f^2(F^{-1}(t))}\, h'(t)\, dt + n \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h'(t)\, dt + 2\sqrt{n} \int_0^1 \frac{\rho_n(t)}{f(F^{-1}(t))}\, (F^{-1}(t) - G^{-1}(t))\, h'(t)\, dt.$$

Also observe that $n T_{n,\alpha} = D_n(h_n)$ for some $h_n \in C_\alpha$. If $h \in C_\alpha(F,G)$, then the second and third summands on the right side vanish, and $D_n(h) = \int_0^1 \frac{\rho_n^2(t)}{f^2(F^{-1}(t))}\, h'(t)\, dt$. By (13), (A.3), and a.s. representation of weak convergence, versions of $\rho_n(\cdot)/f(F^{-1}(\cdot))$ and $B(\cdot)/f(F^{-1}(\cdot))$ exist (for which we keep the same notation) such that $\| \rho_n(\cdot)/f(F^{-1}(\cdot)) - B(\cdot)/f(F^{-1}(\cdot)) \|_2 \to 0$ a.s.

Now, for these versions, we have

$$\sup_{h \in C_\alpha(F,G)} |D_n(h) - D(h)| \le \frac{1}{1-\alpha} \int_0^1 \left| \frac{\rho_n^2(t)}{f^2(F^{-1}(t))} - \frac{B^2(t)}{f^2(F^{-1}(t))} \right| dt \to 0 \text{ a.s.},$$

whereas for $h_0 \in C_\alpha - C_\alpha(F,G)$ we have, a.s., that $D_n(h) \to \infty$ uniformly in a sufficiently small neighborhood of $h_0$. Furthermore, if $h_n \to h \in C_\alpha(F,G)$, then we can extract a subsequence such that $n \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h_n'(t)\, dt \to 0$. The result follows from the next technical lemma, the easy proof of which is omitted.

Lemma A.3. Let $(X, d)$ be a compact metric space, let $A$ be a compact subset of $X$, and let $f_n$ and $f$ be real-valued continuous functions on $X$ such that the following hold:

a. $\sup_{x \in A} |f_n(x) - f(x)| \to 0$ as $n \to \infty$.
b. For $x \in X - A$, there exists $\varepsilon_x > 0$ such that $\inf_{d(y,x) < \varepsilon_x} f_n(y) \to \infty$ as $n \to \infty$.
c. If $x_n \to x \in A$, there exists a subsequence $\{x_m\}$ such that $f_m(x_m) \to f(x)$.

Then $\min_{x \in X} f_n(x) \to \min_{x \in A} f(x)$.

[Received October 2007. Revised January 2008.]

REFERENCES

Bickel, P., and Freedman, D. (1981), "Some Asymptotic Theory for the Bootstrap," The Annals of Statistics, 9, 1196–1217.

Csörgő, M., and Horváth, L. (1993), Weighted Approximations in Probability and Statistics, New York: Wiley.

Cuesta, J. A., and Matrán, C. (1989), "Notes on the Wasserstein Metric in Hilbert Spaces," The Annals of Probability, 17, 1264–1276.

Cuesta, J., Gordaliza, A., and Matrán, C. (1997), "Trimmed k-Means: An Attempt to Robustify Quantizers," The Annals of Statistics, 25, 553–576.

Czado, C., and Munk, A. (1998), "Assessing the Similarity of Distributions—Finite-Sample Performance of the Empirical Mallows Distance," Journal of Statistical Computation and Simulation, 60, 319–346.

Freitag, G., Czado, C., and Munk, A. (2007), "A Nonparametric Test for Similarity of Marginals, With Applications to the Assessment of Population Bioequivalence," Journal of Statistical Planning and Inference, 137, 697–711.

García-Escudero, L., Gordaliza, A., Matrán, C., and Mayo-Iscar, A. (2008), "A General Trimming Approach to Robust Cluster Analysis," The Annals of Statistics, to appear.

Gordaliza, A. (1991), "Best Approximations to Random Variables Based on Trimming Procedures," Journal of Approximation Theory, 64, 162–180.

Maronna, R. (2005), "Principal Components and Orthogonal Regression Based on Robust Scales," Technometrics, 47, 264–273.

Moore, D. S., and McCabe, G. P. (2003), Introduction to the Practice of Statistics (4th ed.), New York: W.H. Freeman.

Munk, A., and Czado, C. (1998), "Nonparametric Validation of Similar Distributions and Assessment of Goodness of Fit," Journal of the Royal Statistical Society, Ser. B, 60, 223–241.

Rousseeuw, P. (1985), "Multivariate Estimation With High Breakdown Point," in Mathematical Statistics and Applications, Vol. B, eds. W. Grossmann, G. Pflug, I. Vincze, and W. Wertz, Dordrecht: Reidel, pp. 283–297.

Álvarez-Esteban, del Barrio, Cuesta-Albertos, and Matrán: Trimmed Comparison of Distributions 701

Figure 3. Trimming functions: (a) GPA by gender; (b) GPA by major (α = .05, α = .1, α = .2).

For the larger trimming levels (α = .1 and .2) the trimmed zone is not an interval, and it includes points around percentiles 20, 40, 60, and 70. Figure 3(b) shows that the points that should be trimmed to make both samples more similar are between percentiles 10 and 30. This example illustrates a nonsymmetrical dissimilarity between the samples; in fact, in the first comparison the less similar zone is close to the lower tail but not to the upper tail, where the values are more similar.

3.1 p Value Curve

To gain some insight into the assessment of the similarity or dissimilarity of the underlying distributions, we can use the p value curve to test the null hypothesis $H_0\colon \tau_\alpha(F,G) > \Delta_0^2$ against $H_a\colon \tau_\alpha(F,G) \le \Delta_0^2$. In the two-sample comparison case we use the statistic

$$Z_{n,m,\alpha} = \sqrt{\frac{nm}{n+m}}\,\frac{T_{n,m,\alpha} - \Delta_0^2}{s_{n,m,\alpha}}. \qquad (18)$$

To obtain the values of $T_{n,m,\alpha}$, we compute $|F_n^{-1}(t) - G_m^{-1}(t)|^2$ over a grid in $[0,1]$, using the $(1-\alpha)$-quantile of these values to determine $L_{F_n,G_m}^{-1}(1-\alpha)$. The integral is then calculated numerically. The computation of $s_{n,m,\alpha}$ is done similarly.

The asymptotic p value curve $P(\Delta_0)$ is defined as

$$P(\Delta_0) = \sup_{(F,G)\in H_0} \lim_{n,m\to\infty} P_{F,G}(Z_{n,m,\alpha} \le z_0) = \Phi(z_0),$$

where $z_0$ is the observed value of $Z_{n,m,\alpha}$ and $\Phi$ denotes the standard normal distribution function. (Note that the supremum is attained when the distance between the two distributions is exactly $\Delta_0$.) These asymptotic p value curves can be used in two ways. On one hand, given a fixed value of $\Delta_0$ that controls the degree of dissimilarity, it is possible to find the p value associated with the corresponding null hypothesis to decide whether or not the distributions are similar. On the other hand, given a fixed test level (p value), we can find the value of $\Delta_0$ such that for every $\Delta \ge \Delta_0$ we should reject the hypothesis $H_0\colon \tau_\alpha(F,G) \ge \Delta^2$. In this way, we can get a sound idea of the degree of dissimilarity between the distributions. To handle the values of $\Delta_0$, the experimenter should take into account how to interpret the Wasserstein distance, recalling that when $F$ and $G$ belong to the same location family, their Wasserstein distance is the absolute difference of their locations.
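The grid computation of $T_{n,m,\alpha}$ described above can be sketched in a few lines. This is a minimal illustration under our own naming (`trimmed_w2_stat` and its arguments are not from the paper); it keeps the $(1-\alpha)$ fraction of grid points with the smallest squared quantile differences, which is the impartial-trimming choice:

```python
import numpy as np

def trimmed_w2_stat(x, y, alpha=0.1, grid=2000):
    """Trimmed squared Wasserstein statistic over a quantile grid.

    Keeps the (1 - alpha) fraction of grid points where the squared
    quantile difference is smallest (impartial trimming) and integrates
    them with trimming weight h'(t) = 1/(1 - alpha).
    """
    t = (np.arange(grid) + 0.5) / grid                  # grid in (0, 1)
    d2 = (np.quantile(x, t) - np.quantile(y, t)) ** 2   # |F_n^{-1}(t) - G_m^{-1}(t)|^2
    k = int(round((1.0 - alpha) * grid))                # number of kept grid points
    kept = np.partition(d2, k - 1)[:k]                  # the k smallest squared differences
    # numerical integral of d2(t) h'(t) dt over the untrimmed zone
    return kept.sum() / grid / (1.0 - alpha)
```

The remaining ingredient of (18), $s_{n,m,\alpha}$, is computed similarly from the trimmed integrand, following the variance estimator given in the proof of Theorem 2.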

Figure 4 illustrates the improved assessment obtained by impartial trimming over the Munk and Czado methodology.

Figure 4. p value curves using impartial and symmetrical (MC) trimmings: (a) GPA by gender; (b) GPA by major (α = .05, .1, .2; one impartial and one symmetrical (MC) curve for each level).

702 Journal of the American Statistical Association, June 2008

It displays the p value curves using impartial trimming and symmetrical trimming for both comparisons, for different trimming levels (α = .05, .1, and .2). For each plot, a horizontal line marks a reference level for the test (.05). The GPAs of males and females are similar up to $\Delta_0$ ranging from .32 to .36 (depending on the trimming size) when impartial trimmings are used. These values represent between 100 × .32/2.815 = 11.4% and 12.8% of the average of the medians of the samples. But when using symmetrical trimmings, the horizontal line cuts the p value curves for $\Delta_0$ ranging from .56 to .59, between 20% and 21% of the average of the medians. A similar analysis of the comparison of GPAs by major leads to values of $\Delta_0$ ranging from .29 to .36, which represent between 9.6% and 11.9% of the average of the medians when using impartial trimming. Instead, when using symmetrical trimming, these percentages range from 16.6% to 19.5%.

3.2 Simulations

We end this section by reporting a small simulation study to illustrate our procedure's performance for finite samples, for testing $H_0\colon \tau_\alpha(F,G) > \Delta_0^2$ against $H_a\colon \tau_\alpha(F,G) \le \Delta_0^2$ in the two-sample problem. We considered two different contaminated normal models, two different trimming sizes, and several values of the threshold $\Delta_0$. In each situation we generated 10,000 replicas of the trimmed score $Z_{n,m,\alpha}$ as defined in (18) for several values of $n = m$. We compared these replicas with the .05 theoretical quantile of the standard normal distribution, rejecting $H_0$ for observed values smaller than this quantity. Table 2 shows the observed rejection frequencies. We find good agreement with our asymptotic results even for moderate sample sizes, with low rejection frequencies for thresholds $\Delta_0$ smaller than the true distance and high rejection frequencies otherwise. When the threshold equals the true distance, we also can see how the observed frequency approximates the nominal level.
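The contaminated normal models of Table 2 are straightforward to sample from; a sketch (the function name `rmix` and the use of NumPy's generator are our illustration, not the paper's code):

```python
import numpy as np

def rmix(n, eps, shift, rng):
    """Draw n points from the contaminated normal (1 - eps) N(0,1) + eps N(shift,1)."""
    contaminated = rng.random(n) < eps          # mixture component indicator
    return rng.standard_normal(n) + np.where(contaminated, shift, 0.0)

rng = np.random.default_rng(0)
# One replica of the first model in Table 2:
p_sample = rmix(100, 0.05, 5.0, rng)    # P = .95 N(0,1) + .05 N(5,1)
q_sample = rmix(100, 0.05, -5.0, rng)   # Q = .95 N(0,1) + .05 N(-5,1)
```

Each replica of $Z_{n,m,\alpha}$ is then computed from such a pair, and $H_0$ is rejected when the observed value falls below the .05 quantile of the standard normal, $\Phi^{-1}(.05) \approx -1.645$.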

4 CONCLUSIONS AND POSSIBLE EXTENSIONS

We have introduced a procedure to compare two samples or probability distributions on the real line based on the impartial trimming methodology. The procedure is designed mainly to assess similarity of the core of two samples by discarding the part of the data that has the greatest influence on the dissimilarity of the distributions. Our method is based on trimming both samples according to the same trimming function, but it allows nonsymmetrical trimming; thus it can greatly improve on the previous methodology based on simply trimming the tails. We have evaluated the performance of our procedure through an analysis of some real data samples, which emphasized its appealing possibilities in data analysis and the usefulness of the p value curves for assessing similarity. A simulation study has also provided evidence about the behavior of the procedure for finite samples, in agreement with the asymptotic results. Although we treated only dissimilarities based on the Wasserstein distance, other metrics or dissimilarities could be handled under the same scheme.

Representation of trimmings of any distribution in terms of those of the uniform distribution is no longer possible in the multivariate setting. However, under very general assumptions, it has been proven (see Cuesta and Matrán 1989) that given two probabilities $P$ and $Q$ on $\mathbb{R}^k$, there exists an "optimal transport map" $T$ such that $Q = P \circ T^{-1}$ and, if $X$ is any random vector with law $P$, then $E\|X - T(X)\|^2 = W_2^2(P,Q)$. Moreover, if $P_\alpha$ is an $\alpha$-trimming of $P$, then $P_\alpha \circ T^{-1}$ is an $\alpha$-trimming of $Q$, and $T$ is an optimal map between $P_\alpha$ and $P_\alpha \circ T^{-1}$, so the multivariate version of (4) would be the minimization, over the set of $\alpha$-trimmings of $P$, of the expression $W_2^2(P_\alpha, P_\alpha \circ T^{-1})$. We also should mention that obtaining the optimal map $T$ remains an open problem for $k > 1$. Although these are troubling facts, obtaining the optimal trimming between two samples is already possible through standard optimization procedures. A final difficulty concerns the asymptotic behavior of the involved statistics, to which the techniques used in our proofs do not extend.

Table 2. Simulated powers for the trimmed scores $Z_{n,m,\alpha}$

Model 1: P = .95 N(0,1) + .05 N(5,1), Q = .95 N(0,1) + .05 N(−5,1), α = .05, (1 − α)τ_α(P,Q) = .384.
Model 2: P = .9 N(0,1) + .1 N(5,1), Q = .9 N(0,1) + .1 N(−5,1), α = .1, (1 − α)τ_α(P,Q) = 1.004.

(1 − α)Δ₀²   n      Model 1 frequency   Model 2 frequency
.25          100    .0320               .0028
             200    .0268               0
             500    .0086               0
             1000   .0021               0
             5000   0                   0
.5           100    .1412               .0109
             200    .1633               .0031
             500    .2264               .0002
             1000   .3134               0
             5000   .7648               0
1            100    .4912               .0850
             200    .6957               .0727
             500    .9474               .0657
             1000   .9989               .0584
             5000   1.0000              .0486


Allowing trimming in both samples with different trimming functions would provide an interesting alternative to our present proposal. Through our research, still in progress, we have identified a radically different behavior from that presented in this article for identical trimming in both samples.

APPENDIX: PROOFS AND FURTHER RESULTS

In this appendix, $\rho_n(t) = \sqrt{n}\,f(F^{-1}(t))\,(F_n^{-1}(t) - F^{-1}(t))$ denotes the weighted quantile process, where $f$ is the density function of $F$.

Proof of Proposition 1

Let $\mathcal{A} = \{P^* \in \mathcal{P}\colon P^*(-\infty,t] = h(P(-\infty,t]),\ h \in \mathcal{C}_\alpha\}$. For $P^* \in \mathcal{A}$, absolute continuity of $h$ entails

$$P^*(s,t] = h(P(-\infty,t]) - h(P(-\infty,s]) = \int_{P(-\infty,s]}^{P(-\infty,t]} h'(x)\,dx \le \frac{1}{1-\alpha}\,P(s,t].$$

Thus $P^* \ll P$ and $\frac{dP^*}{dP} \le \frac{1}{1-\alpha}$, and therefore $P^* \in \mathcal{T}_\alpha(P)$. Conversely, given $P^* \in \mathcal{T}_\alpha(P)$, if $F$ is the distribution function of $P$ and we define $h(t) = \int_0^t \frac{dP^*}{dP}(F^{-1}(s))\,ds$, then it is immediate that $h \in \mathcal{C}_\alpha$ and

$$P^*(-\infty,t] = \int_{-\infty}^t \frac{dP^*}{dP}(s)\,dF(s) = \int_0^{F(t)} \frac{dP^*}{dP}(F^{-1}(s))\,ds = h(P(-\infty,t]).$$

Therefore $P^* \in \mathcal{A}$, and part (a) is proven. The proof of part (b) is immediate from the proof of part (a).
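As a concrete instance of the correspondence just proved, the symmetric tail trimming used in the Munk and Czado style of comparison can be written as a trimming function in $\mathcal{C}_\alpha$ (the explicit piecewise form below is our illustration, not a formula from the paper):

```latex
h_{\mathrm{sym}}(t) =
\begin{cases}
0, & 0 \le t \le \alpha/2, \\[2pt]
\dfrac{t - \alpha/2}{1 - \alpha}, & \alpha/2 < t < 1 - \alpha/2, \\[2pt]
1, & 1 - \alpha/2 \le t \le 1,
\end{cases}
\qquad\text{so that}\qquad
h_{\mathrm{sym}}'(t) \le \frac{1}{1-\alpha}.
```

By part (a), $P^*(-\infty,t] = h_{\mathrm{sym}}(P(-\infty,t])$ is the trimmed distribution that removes mass $\alpha/2$ from each tail; impartial trimming instead searches over all of $\mathcal{C}_\alpha$ for the function minimizing the distance.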

The following lemmas collect some results that we use in our proofs of Theorems 2 and A.1. These results can be easily proven using Schwarz's inequality, standard arguments in empirical process theory, or the Arzelà–Ascoli theorem.

Lemma A.1. If $F$ and $G$ have finite absolute moment of order $r > 4$, then the following hold:

a. $\sqrt{n}\int_0^{1/n} (F^{-1}(t))^2\,dt \to 0$ and $\sqrt{n}\int_{1-1/n}^1 (F^{-1}(t))^2\,dt \to 0$.

b. $\sqrt{n}\int_0^{1/n} (F_n^{-1}(t))^2\,dt \to 0$ and $\sqrt{n}\int_{1-1/n}^1 (F_n^{-1}(t))^2\,dt \to 0$ in probability.

c. $\int_0^1 \frac{\sqrt{t(1-t)}}{g(G^{-1}(t))}\,|F^{-1}(t) - G^{-1}(t)|\,dt < \infty$.

d. Furthermore, if $G$ satisfies (13), then $\frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n} \frac{t(1-t)}{g^2(G^{-1}(t))}\,dt \to 0$.

Lemma A.2. Under the $\|\cdot\|_\infty$ topology, the set $\mathcal{C}_\alpha$ in Proposition 1 and the set $\mathcal{C}_\alpha(F,G) = \{h \in \mathcal{C}_\alpha\colon \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h'(t)\,dt = 0\}$, for $F,G \in \mathcal{P}_2$, are compact.

Proof of Theorem 2

From theorem 6.2.1 of Csörgő and Horváth (1993) and (13), we can assume that there exist Brownian bridges $B_n$ satisfying

$$n^{1/2-\nu}\sup_{1/n \le t \le 1-1/n} \frac{|\rho_n(t) - B_n(t)|}{(t(1-t))^\nu} =
\begin{cases}
O_P(\log n) & \text{if } \nu = 0,\\
O_P(1) & \text{if } 0 < \nu \le 1/2.
\end{cases} \qquad (\mathrm{A.1})$$

Now we set $M_n(h) = \sqrt{n}\int_0^1 (F_n^{-1}(t) - G^{-1}(t))^2 h'(t)\,dt$ and

$$N_n(h) = 2\int_{1/n}^{1-1/n} \frac{B_n(t)}{f(F^{-1}(t))}\,(G^{-1}(t) - F^{-1}(t))\,h'(t)\,dt + \sqrt{n}\int_{1/n}^{1-1/n} (G^{-1}(t) - F^{-1}(t))^2 h'(t)\,dt.$$

Observe that

$$\begin{aligned}
\sup_{h\in\mathcal{C}_\alpha} |M_n(h) - N_n(h)|
&\le \sqrt{n}\int_0^{1/n} (F_n^{-1}(t) - G^{-1}(t))^2\,dt + \sqrt{n}\int_{1-1/n}^1 (F_n^{-1}(t) - G^{-1}(t))^2\,dt\\
&\quad + \frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n} \frac{|\rho_n(t) - B_n(t)|^2}{f^2(F^{-1}(t))}\,dt + \frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n} \frac{B_n(t)^2}{f^2(F^{-1}(t))}\,dt\\
&\quad + 2\int_{1/n}^{1-1/n} \frac{|\rho_n(t) - B_n(t)|}{f(F^{-1}(t))}\,|G^{-1}(t) - F^{-1}(t)|\,dt\\
&= A_{n,1} + A_{n,2} + A_{n,3} + A_{n,4} + A_{n,5}.
\end{aligned}$$

Lemma A.1 implies that $A_{n,1} \to 0$ and $A_{n,2} \to 0$ in probability. From (A.1) we get $A_{n,3} \le O_P(1)\,\frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n} \frac{t(1-t)}{f^2(F^{-1}(t))}\,dt$, and this last integral tends to 0 by Lemma A.1. Thus $A_{n,3} \to 0$ in probability. Similarly, $A_{n,4} \to 0$ in probability. Finally, (A.1) yields $A_{n,5} \le O_P(1)\,n^{\nu-1/2}\int_{1/n}^{1-1/n} \frac{(t(1-t))^\nu}{f(F^{-1}(t))}\,|G^{-1}(t) - F^{-1}(t)|\,dt$ for some $\nu \in (0,1/2)$. Lemma A.1 shows that $\int_0^1 \frac{(t(1-t))^{1/2}}{f(F^{-1}(t))}\,|G^{-1}(t) - F^{-1}(t)|\,dt < \infty$. Thus, by dominated convergence, we obtain that $A_{n,5} \to 0$ in probability. Collecting the foregoing estimates, we obtain $\sup_{h\in\mathcal{C}_\alpha} |M_n(h) - N_n(h)| \to 0$ in probability, and thus $\sqrt{n}(T_{n,\alpha} - S_{n,\alpha}) \to 0$ in probability, where $\sqrt{n}\,S_{n,\alpha} = \inf_{h\in\mathcal{C}_\alpha} N_n(h)$. Therefore, we need only show that $\sqrt{n}(S_{n,\alpha} - \tau_\alpha(F,G)) \to_w N(0, \sigma_\alpha^2(F,G))$, where

$$\sqrt{n}\,S_{n,\alpha} = \inf_{h\in\mathcal{C}_\alpha} \left[\, 2\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,h'(t)\,dt + \sqrt{n}\int_0^1 (G^{-1}(t) - F^{-1}(t))^2 h'(t)\,dt \,\right].$$

Let us denote

$$h_n = \arg\min_{h\in\mathcal{C}_\alpha} \left\{ \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h'(t)\,dt + \frac{2}{\sqrt{n}}\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,h'(t)\,dt \right\}.$$

Clearly $h_n'(t) \to h_0'(t)$ for almost every $t$. Furthermore, optimality of $h_n$ shows that

$$B_n = \sqrt{n}\,S_{n,\alpha} - \left( 2\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,h_0'(t)\,dt + \sqrt{n}\int_0^1 (G^{-1}(t) - F^{-1}(t))^2 h_0'(t)\,dt \right) \le 0,$$

but, in contrast,

$$\begin{aligned}
B_n &= \sqrt{n}\left( \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h_n'(t)\,dt - \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h_0'(t)\,dt \right)\\
&\quad + 2\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,(h_n'(t) - h_0'(t))\,dt = B_{n,1} + B_{n,2},
\end{aligned}$$

and $B_{n,1} \ge 0$ by optimality of $h_0$, whereas $B_{n,2} = o_P(1)$ by the dominated convergence theorem. Therefore $B_n \to 0$ in probability, which shows that

$$\sqrt{n}(T_{n,\alpha} - \tau_\alpha(F,G)) \to_w 2\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,h_0'(t)\,dt. \qquad (\mathrm{A.2})$$

Integrating by parts, we obtain $\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,h_0'(t)\,dt = -\int_0^1 l(t)\,dB(t)$, which proves the asymptotic normality and expression (17) for the variance. The claim about the variance estimator readily follows by noting that $s_{n,\alpha}^2(G) = 4\left(\int_0^1 l_n^2(t)\,dt - \left(\int_0^1 l_n(t)\,dt\right)^2\right)$, where $l_n(t) = \int_{F_n^{-1}(1/2)}^{F_n^{-1}(t)} \left(x - G^{-1}(F_n(x))\right) h_n'(F_n(x))\,dx$ and $h_n = \arg\min_{h\in\mathcal{C}_\alpha} \int (F_n^{-1} - G^{-1})^2 h'$. It can be shown that, with probability 1, $l_n(t) \to l(t)$ for almost every $t \in (0,1)$. A standard uniform integrability argument completes the proof.

The final result in this section establishes the asymptotic behavior of $nT_{n,\alpha}$ when $F$ and $G$ are equivalent at trimming level $\alpha$. Recall the definition of $\mathcal{C}_\alpha(F,G)$ in Lemma A.2 and note that $\mathcal{C}_\alpha(F,F) = \mathcal{C}_\alpha$, but also note that for $F \ne G$ we have that $\mathcal{C}_\alpha(F,G)$ is a proper subset of $\mathcal{C}_\alpha$. Also note that $\mathcal{C}_\alpha(F,G) \ne \emptyset$ if and only if $\tau_\alpha(F,G) = 0$. In fact, the size of $\mathcal{C}_\alpha(F,G)$ depends on the Lebesgue measure of the set $\{t \in (0,1)\colon F^{-1}(t) \ne G^{-1}(t)\}$: $\tau_\alpha(F,G) = 0$ if and only if the measure of this last set is at most $\alpha$; if it equals $\alpha$, then the only function in $\mathcal{C}_\alpha(F,G)$ corresponds to $h'(t) = \frac{1}{1-\alpha}\,I(F^{-1}(t) = G^{-1}(t))$.

Theorem A.1. If $\tau_\alpha(F,G) = 0$, $F$ satisfies (13), and

$$\int_0^1 \frac{t(1-t)}{f^2(F^{-1}(t))}\,dt < \infty, \qquad (\mathrm{A.3})$$

then

$$nT_{n,\alpha} \to_w \min_{h\in\mathcal{C}_\alpha(F,G)} \int_0^1 \frac{B(t)^2}{f^2(F^{-1}(t))}\,h'(t)\,dt,$$

where $\{B(t)\}_{0<t<1}$ is a Brownian bridge. Because $h \mapsto \int_0^1 \frac{B^2(t)}{f^2(F^{-1}(t))}\,h'(t)\,dt$ is $\|\cdot\|_\infty$-continuous as a function of $h$, it attains its minimum value on $\mathcal{C}_\alpha(F,G)$.

Proof. We define $D_n(h) = n\int_0^1 (F_n^{-1}(t) - G^{-1}(t))^2 h'(t)\,dt$ and $D(h) = \int_0^1 \frac{B^2(t)}{f^2(F^{-1}(t))}\,h'(t)\,dt$ for $h \in \mathcal{C}_\alpha$. Note that

$$D_n(h) = \int_0^1 \frac{\rho_n^2(t)}{f^2(F^{-1}(t))}\,h'(t)\,dt + n\int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h'(t)\,dt + 2\sqrt{n}\int_0^1 \frac{\rho_n(t)}{f(F^{-1}(t))}\,(F^{-1}(t) - G^{-1}(t))\,h'(t)\,dt.$$

Also observe that $nT_{n,\alpha} = D_n(h_n)$ for some $h_n \in \mathcal{C}_\alpha$. If $h \in \mathcal{C}_\alpha(F,G)$, then the second and third summands on the right side vanish, and $D_n(h) = \int_0^1 \frac{\rho_n^2(t)}{f^2(F^{-1}(t))}\,h'(t)\,dt$. By (13), (A.3), and a.s. representation of weak convergence, versions of $\rho_n(\cdot)/f(F^{-1}(\cdot))$ and $B(\cdot)/f(F^{-1}(\cdot))$ exist (for which we keep the same notation) such that

$$\left\| \frac{\rho_n(\cdot)}{f(F^{-1}(\cdot))} - \frac{B(\cdot)}{f(F^{-1}(\cdot))} \right\|_2 \to 0 \quad \text{a.s.}$$

Now, for these versions, we have

$$\sup_{h\in\mathcal{C}_\alpha(F,G)} |D_n(h) - D(h)| \le \frac{1}{1-\alpha}\int_0^1 \left| \frac{\rho_n^2(t)}{f^2(F^{-1}(t))} - \frac{B^2(t)}{f^2(F^{-1}(t))} \right| dt \to 0 \quad \text{a.s.},$$

whereas for $h_0 \in \mathcal{C}_\alpha - \mathcal{C}_\alpha(F,G)$ we have, a.s., that $D_n(h) \to \infty$ uniformly in a sufficiently small neighborhood of $h_0$. Furthermore, if $h_n \to h \in \mathcal{C}_\alpha(F,G)$, then we can extract a subsequence such that $n\int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h_n'(t)\,dt \to 0$. The result follows from the next technical lemma, the easy proof of which is omitted.

Lemma A.3. Let $(X,d)$ be a compact metric space, let $A$ be a compact subset of $X$, and let $f_n$ and $f$ be real-valued continuous functions on $X$ such that the following hold:

a. $\sup_{x\in A} |f_n(x) - f(x)| \to 0$ as $n \to \infty$.

b. For $x \in X - A$, there exists $\varepsilon_x > 0$ such that $\inf_{d(y,x) < \varepsilon_x} f_n(y) \to \infty$ as $n \to \infty$.

c. If $x_n \to x \in A$, there exists a subsequence $\{x_m\}$ such that $f_m(x_m) \to f(x)$.

Then $\min_{x\in X} f_n(x) \to \min_{x\in A} f(x)$.

[Received October 2007. Revised January 2008.]

REFERENCES

Bickel, P., and Freedman, D. (1981), "Some Asymptotic Theory for the Bootstrap," The Annals of Statistics, 9, 1196–1217.

Csörgő, M., and Horváth, L. (1993), Weighted Approximations in Probability and Statistics, New York: Wiley.

Cuesta, J. A., and Matrán, C. (1989), "Notes on the Wasserstein Metric in Hilbert Spaces," The Annals of Probability, 17, 1264–1276.

Cuesta, J., Gordaliza, A., and Matrán, C. (1997), "Trimmed k-Means: An Attempt to Robustify Quantizers," The Annals of Statistics, 25, 553–576.

Czado, C., and Munk, A. (1998), "Assessing the Similarity of Distributions—Finite-Sample Performance of the Empirical Mallows Distance," Journal of Statistical Computation and Simulation, 60, 319–346.

Freitag, G., Czado, C., and Munk, A. (2007), "A Nonparametric Test for Similarity of Marginals With Applications to the Assessment of Population Bioequivalence," Journal of Statistical Planning and Inference, 137, 697–711.

García-Escudero, L., Gordaliza, A., Matrán, C., and Mayo-Iscar, A. (2008), "A General Trimming Approach to Robust Cluster Analysis," The Annals of Statistics, to appear.

Gordaliza, A. (1991), "Best Approximations to Random Variables Based on Trimming Procedures," Journal of Approximation Theory, 64, 162–180.

Maronna, R. (2005), "Principal Components and Orthogonal Regression Based on Robust Scales," Technometrics, 47, 264–273.

Moore, D. S., and McCabe, G. P. (2003), Introduction to the Practice of Statistics (4th ed.), New York: W.H. Freeman.

Munk, A., and Czado, C. (1998), "Nonparametric Validation of Similar Distributions and Assessment of Goodness of Fit," Journal of the Royal Statistical Society, Ser. B, 60, 223–241.

Rousseeuw, P. (1985), "Multivariate Estimation With High Breakdown Point," in Mathematical Statistics and Applications, Vol. B, eds. W. Grossmann, G. Pflug, I. Vincze, and W. Werz, Dordrecht: Reidel, pp. 283–297.

702 Journal of the American Statistical Association June 2008

plays the p value curves using impartial trimming and symmet-rical trimming for both comparisons for different trimming lev-els (α = 05 1 and 2) For each plot a horizontal line marksa reference level for the test (05) The GPAs of males and fe-males are similar up to 0 ranging from 32 to 36 (dependingon the trimming size) when impartial trimmings are used Thesevalues represent between 100 times 322815 = 114 and 128of the average of the medians of the samples But when us-ing symmetrical trimmings the horizontal line cuts the p valuecurves for 0 ranging from 56 to 59 between 20 and 21of the average of the medians A similar analysis of the com-parison of GPAs by major leads us to values of 0 rangingfrom 29 to 36 which represent between a 96 and a 119of the average of the medians when using impartial trimmingInstead when using symmetrical trimming these percentagesrange from 166 to 195

32 Simulations

We end this section by reporting a small simulation studyto illustrate our procedurersquos performance for finite samples fortesting H0 τα(FG) gt 2

0 against Ha τα(FG) le 20 in the

two-sample problem We considered two different contami-nated normal models two different trimming sizes and sev-eral values of the threshold 0 In each situation we gener-ated 10000 replicas of the trimmed score Znmα as defined in(18) for several values of n = m We compared these replicaswith the 05 theoretical quantile of the standard normal distrib-ution rejecting H0 for observed values smaller than this quan-tity Table 2 shows the observed rejection frequencies We findgood agreement with our asymptotic results even for moder-ate sample sizes with low rejection frequencies for thresholds0 smaller than the true distance and high rejection frequenciesotherwise When the threshold equals the true distance we alsocan see how the observed frequency approximates the nominallevel

4 CONCLUSIONS AND POSSIBLE EXTENSIONS

We have introduced a procedure to compare two samples orprobability distributions on the real line based on the impartialtrimming methodology The procedure is designed mainly toassess similarity of the core of two samples by discarding thatpart of the data that has a greater influence on the dissimilarityof the distributions Our method is based on trimming the corre-sponding samples according to the same trimming function butit allows nonsymmetrical trimming thus it can greatly improvethe previous methodology based on simply trimming the tailsWe have evaluated the performance of our procedure throughan analysis of some real data samples that emphasized the ap-pealing possibilities in data analysis and the significance of theanalysis of the p value curves for assessing similarities A sim-ulation study has provided also evidence about the behavior ofthe procedure for finite samples in agreement with asymptoticresults Although we treated only dissimilarities based on theWasserstein distance other metrics or dissimilarities could behandled under the same scheme

Representation of trimmings of any distribution in terms ofthose of the uniform distribution is no longer possible in themultivariate setting However under very general assumptionsit has been proven (see Cuesta and Matraacuten 1989) that given twoprobabilities P and Q on R

k there exists an ldquooptimal transportmaprdquo T such that Q = P T minus1 and if X is any random vectorwith law P then EXminusT (X)2 = W2

2 (PQ) Moreover if Pα

is an α-trimming of P then Pα T minus1 is an α-trimming of Q

and T is an optimal map between Pα and Pα T minus1 so the mul-tivariate version of (4) would be the minimization over the setof α-trimmings of P of the expression W2

2 (PαPα T minus1) Wealso should mention that obtaining the optimal map T remainsan open problem for k gt 1 Although these are troubling factsobtaining the optimal trimming between two samples is alreadypossible through standard optimization procedures A final dif-ficulty concerns the asymptotic behavior of the involved statis-tics to which the techniques used in our proofs do not extend

Table 2 Simulated powers for the trimmed scores Znmα

P = 95 N(01) + 05 N(51) P = 9 N(01) + 1 N(51)

Q = 95 N(01) + 05 N(minus51) Q = 09 N(01) + 01 N(minus51)

α = 05 (1 minus α)20 n Frequency α = 1 (1 minus α)2

0 n Frequency

[(1 minus α)τα(PQ) = 384] 25 100 0320 [(1 minus α)τα(PQ) = 1004] 25 100 0028200 0268 200 0500 0086 500 0

1000 0021 1000 05000 0 5000 0

5 100 1412 5 100 0109200 1633 200 0031500 2264 500 0002

1000 3134 1000 05000 7648 5000 0

1 100 4912 1 100 0850200 6957 200 0727500 9474 500 0657

1000 9989 1000 05845000 10000 5000 0486

Aacutelvarez-Esteban del Barrio Cuesta-Albertos and Matraacuten Trimmed Comparison of Distributions 703

Allowing trimming in both samples with different trimmingfunctions would provide an interesting alternative to our presentproposal Through our research still in progress we have iden-tified a radically different behavior to that presented in this ar-ticle for identical trimming in both samples

APPENDIX PROOFS AND FURTHER RESULTS

In this appendix ρn(t) = radicnf (Fminus1(t))(Fminus1

n (t) minus Fminus1(t)) de-notes the weighted quantile process where f is the density functionof F

Proof of Proposition 1

Let A= P lowast isinP P lowast(minusinfin t] = h(P (minusinfin t]) h isin Cα For P lowast isinA absolute continuity of h entails

P lowast(s t] = h(P (minusinfin t]) minus h(P (minusinfin s])

=int P(minusinfint]P(minusinfins]

hprime(x) dx le 1

1 minus αP (s t]

Thus P lowast P and dP lowastdP

le 11minusα

and therefore P lowast isin T α(P )Conversely given P lowast isin T α(P ) if F is the distribution function of

P and we define h(t) = int t0

dP lowastdP

(Fminus1(s)) ds then it is immediate thath isin Cα and

P lowast(minusinfin t] =int t

minusinfindP lowastdP

(s) dF (s)

=int F(t)

0

dP lowastdP

(Fminus1(s)) ds = h(P (minusinfin t])

Therefore P lowast isin A and part a is proven The proof of part b is imme-diate from the proof of part a

The following lemmas collect some results which we use in ourproofs of Theorems 2 and A1 These results can be easily proven us-ing Schwarzrsquos inequality standard arguments in empirical processestheory or the ArzelaacutendashAscoli theorem

Lemma A1 If F and G have finite absolute moment of order r gt 4then the following hold

aradic

nint 1n

0 (Fminus1(t))2 dt rarr 0 andradic

nint 1

1minus1n(Fminus1(t))2 dt rarr 0

bradic

nint 1n

0 (Fminus1n (t))2 dt rarr 0 and

radicn

int 11minus1n(Fminus1

n (t))2 dt rarr 0 inprobability

cint 1

0radic

t (1 minus t)g(Gminus1(t))|Fminus1(t) minus Gminus1(t)|dt lt infin

d Furthermore if G satisfies (13) then 1radicn

int 1minus1n1n

t (1 minus t)

g2(Gminus1(t)) dt rarr 0

Lemma A2 Under the middot infin topology the set Cα in Proposition 1and the set Cα(FG) = h isin Cα

int 10 (Fminus1(t)minusGminus1(t))2hprime(t) dt = 0

for FG isinP2 are compact

Proof of Theorem 2

From theorem 621 of Csoumlrgouml and Horvaacuteth (1993) and (13) we canassume that there exist Brownian bridges Bn satisfying

n12minusν sup1nletle1minus1n

|ρn(t) minus Bn(t)|(t (1 minus t))ν

=

OP (logn) if ν = 0

OP (1) if 0 lt ν le 12(A1)

Now we set Mn(h) = radicn

int 10 (Fminus1

n (t) minus Gminus1(t))2hprime(t) dt and

Nn(h) = 2int 1minus1n

1n

Bn(t)

f (Fminus1(t))(Gminus1(t) minus Fminus1(t))hprime(t) dt

+ radicn

int 1minus1n

1n(Gminus1(t) minus Fminus1(t))2hprime(t) dt

Observe that

suphisinCα

|Mn(h) minus Nn(h)|

le radicn

int 1n

0(Fminus1

n (t) minus Gminus1(t))2 dt

+ radicn

int 1

1minus1n(Fminus1

n (t) minus Gminus1(t))2 dt

+ 1radicn

int 1minus1n

1n

|ρn(t) minus Bn(t)|2f 2(Fminus1(t))

dt

+ 1radicn

int 1minus1n

1n

Bn(t)2

f 2(Fminus1(t))dt

+ 2int 1minus1n

1n

|ρn(t) minus Bn(t)|f (Fminus1(t))

|Gminus1(t) minus Fminus1(t)|dt

= An1 + An2 + An3 + An4 + An5

Lemma A1 implies that An1 rarr 0 and An2 rarr 0 in probability

From (A1) we get An3 le OP (1) 1radicn

int 1minus1n1n

t (1minust)

f 2(Fminus1(t))dt and

this last integral tends to 0 by Lemma A1 Thus An3 rarr 0 inprobability Similarly An4 rarr 0 in probability Finally (A1) yields

An5 le OP (1)nνminus12 int 1minus1n1n

(t (1minust))ν

f (Fminus1(t))|Gminus1(t) minus Fminus1(t)|dt for

some ν isin (012) Lemma A1 shows thatint 1

0(t (1minust))12

f (Fminus1(t))|Gminus1(t) minus

Fminus1(t)|dt lt infin Thus by dominated convergence we obtain thatAn5 rarr 0 in probability Collecting the foregoing estimates we obtainsuphisinCα

|Mn(h) minus Nn(h)| rarr 0 in probability and thusradic

n(Tnα minusSnα) rarr 0 in probability where

radicnSnα = infhisinCα

Nn(h) Therefore

we need only show thatradic

n(Snα minus τα(FG)) rarrw N(0 σ 2α(FG))

where

radicnSnα = inf

hisinCα

[

2int 1

0B(t)

Gminus1(t) minus Fminus1(t)

f (Fminus1(t))hprime(t) dt

+ radicn

int 1

0(Gminus1(t) minus Fminus1(t))2hprime(t) dt

]

Let us denote

hn = arg minhisinCα

int 1

0(Fminus1(t) minus Gminus1(t))2hprime(t) dt

+ 2radicn

int 1

0B(t)

Gminus1(t) minus Fminus1(t)

f (Fminus1(t))hprime(t) dt

Clearly hprimen(t) rarr hprime

0(t) for almost every t Furthermore optimality ofhn shows that

Bn = radicnSnα minus

(

2int 1

0B(t)

Gminus1(t) minus Fminus1(t)

f (Fminus1(t))hprime

0(t) dt

+ radicn

int 1

0(Gminus1(t) minus Fminus1(t))2hprime

0(t) dt

)

le 0

but in contrast

Bn = radicn

(int 1

0(Fminus1(t) minus Gminus1(t))2hprime

n(t) dt

minusint 1

0(Fminus1(t) minus Gminus1(t))2hprime

0(t) dt

)

+ 2int 1

0B(t)

Gminus1(t) minus Fminus1(t)

f (Fminus1(t))(hprime

n(t) minus hprime0(t)) dt

= Bn1 + Bn2

704 Journal of the American Statistical Association June 2008

and Bn1 ge 0 by optimality of h0 whereas Bn2 = oP (1) by the dom-inated convergence theorem Therefore Bn rarr 0 in probability whichshows that

radicn(Tnα minus τα(FG))

wrarr2int 1

0B(t)

Gminus1(t) minus Fminus1(t)

f (Fminus1(t))hprime

0(t) dt

(A2)

Integrating by parts we obtainint 1

0 B(t)Gminus1(t)minusFminus1(t)

f (Fminus1(t))hprime

0(t) dt =minus int 1

0 l(t) dB(t) which proves the asymptotic normality and the ex-pression (17) for the variance The claim about the variance es-timator readily follows by noting that s2

nα(G) = 4(int 1

0 l2n(t) dt minus(int 1

0 ln(t) dt)2) where ln(t) = int Fminus1n (t)

Fminus1n (12)

(x minus Gminus1(Fn(x))) timeshprimen(Fn(x)) dx and hn(t) = arg minhisinCα

int(Fminus1

n minus Gminus1)hprime It can beshown that with probability 1 ln(t) rarr l(t) for almost every t isin (01)A standard uniform integrability argument completes the proof

The final result in this section establishes the asymptotic behaviorof nTnα when F and G are equivalent at trimming level α Recall thedefinition of Cα(FG) in Lemma A2 and note that Cα(FF ) = Cα but also note that for F = G we have that Cα(FG) is a proper subsetof Cα Also note that Cα(FG) = empty if and only if τα(FG) = 0 In factthe size of Cα(FG) depends on the Lebesgue measure of the set t isin(01) Fminus1(t) = Gminus1(t) τα(FG) = 0 if and only if the measureof this last set is at most α if it equals α then the only function inCα(FG) corresponds to hprime(t) = 1

1minusαI(Fminus1(t)=Gminus1(t))

Theorem A1 If τα(FG) = 0 F satisfies (13) andint 1

0

t (1 minus t)

f 2(Fminus1(t))dt lt infin (A3)

then

nTnαwrarr min

hisinCα(FG)

int 1

0

B(t)2

f 2(Fminus1(t))hprime(t) dt

where B(t)0lttlt1 is a Brownian bridge Because h rarr int 10 B2(t)

f 2(Fminus1(t))hprime(t) dt is middot infin-continuous as a function of h it attainsits minimum value on Cα(FG)

Proof We define Dn(h) = nint 1

0 (Fminus1n (t) minus Gminus1(t))2hprime(t) dt and

D(h) = int 10 B2(t)f 2(Fminus1(t))hprime(t) dt for h isin Cα Note that

Dn(h) =int 1

0

ρ2n(t)

f 2(Fminus1(t))hprime(t) dt

+ n

int 1

0(Fminus1(t) minus Gminus1(t))2hprime(t) dt

+ 2radic

n

int 1

0

ρn(t)

f (Fminus1(t))(Fminus1(t) minus Gminus1(t))hprime(t) dt

Also observe that nTnα = Dn(hn) for some hn isin Cα If h isin Cα(FG)then the second and third summands on the right side vanish and

Dn(h) = int 10

ρ2n(t)

Aacutelvarez-Esteban del Barrio Cuesta-Albertos and Matraacuten Trimmed Comparison of Distributions 703

Allowing trimming in both samples with different trimming functions would provide an interesting alternative to our present proposal. Through our research, still in progress, we have identified a radically different behavior to that presented in this article for identical trimming in both samples.

APPENDIX: PROOFS AND FURTHER RESULTS

In this appendix, $\rho_n(t) = \sqrt{n}\,f(F^{-1}(t))\,(F_n^{-1}(t) - F^{-1}(t))$ denotes the weighted quantile process, where $f$ is the density function of $F$.
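As a concrete illustration, the weighted quantile process can be approximated numerically from a simulated sample. The following Python sketch (the function name, grid choice, and use of a standard normal reference are our own, not the authors') evaluates $\rho_n(t)$ on an interior grid using only the standard library:

```python
import math
import random
from statistics import NormalDist

def weighted_quantile_process(sample, dist, grid):
    """Evaluate rho_n(t) = sqrt(n) * f(F^{-1}(t)) * (F_n^{-1}(t) - F^{-1}(t))
    on a grid of t-values, for a known reference distribution `dist`."""
    n = len(sample)
    xs = sorted(sample)
    out = []
    for t in grid:
        # Empirical quantile: F_n^{-1}(t) = x_(k) with k = ceil(n t).
        k = min(n - 1, max(0, math.ceil(n * t) - 1))
        emp_q = xs[k]
        true_q = dist.inv_cdf(t)   # F^{-1}(t)
        weight = dist.pdf(true_q)  # f(F^{-1}(t))
        out.append(math.sqrt(n) * weight * (emp_q - true_q))
    return out

random.seed(0)
norm = NormalDist()
sample = [random.gauss(0.0, 1.0) for _ in range(2000)]
grid = [i / 100 for i in range(5, 96)]  # stay away from the tails
rho = weighted_quantile_process(sample, norm, grid)
```

For a sample actually drawn from $F$, the resulting values fluctuate on the scale of a Brownian bridge, in line with the strong approximations invoked below.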

Proof of Proposition 1

Let $\mathcal{A} = \{P^* \in \mathcal{P} : P^*(-\infty,t] = h(P(-\infty,t]),\ h \in \mathcal{C}_\alpha\}$. For $P^* \in \mathcal{A}$, absolute continuity of $h$ entails

$$P^*(s,t] = h(P(-\infty,t]) - h(P(-\infty,s]) = \int_{P(-\infty,s]}^{P(-\infty,t]} h'(x)\,dx \le \frac{1}{1-\alpha}\,P(s,t].$$

Thus $P^* \ll P$ and $\frac{dP^*}{dP} \le \frac{1}{1-\alpha}$, and therefore $P^* \in \mathcal{T}_\alpha(P)$. Conversely, given $P^* \in \mathcal{T}_\alpha(P)$, if $F$ is the distribution function of $P$ and we define $h(t) = \int_0^t \frac{dP^*}{dP}(F^{-1}(s))\,ds$, then it is immediate that $h \in \mathcal{C}_\alpha$ and

$$P^*(-\infty,t] = \int_{-\infty}^{t} \frac{dP^*}{dP}(s)\,dF(s) = \int_0^{F(t)} \frac{dP^*}{dP}(F^{-1}(s))\,ds = h(P(-\infty,t]).$$

Therefore $P^* \in \mathcal{A}$, and part (a) is proven. The proof of part (b) is immediate from the proof of part (a).
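Proposition 1 identifies the trimmings of $P$ with functions $h \in \mathcal{C}_\alpha$, whose derivatives are constrained to $[0, 1/(1-\alpha)]$. A practical consequence is that, on a discretized quantile grid, minimizing $\int_0^1 (F^{-1}(t)-G^{-1}(t))^2 h'(t)\,dt$ over $\mathcal{C}_\alpha$ amounts to setting $h' = 0$ on the fraction $\alpha$ of $(0,1)$ where the squared quantile discrepancy is largest and $h' = 1/(1-\alpha)$ elsewhere. A minimal Python sketch of this computation (the function name and grid size are our own choices, not the authors'):

```python
def trimmed_l2_distance_sq(f_quant, g_quant, alpha, m=1000):
    """Approximate the optimally trimmed squared distance
    min over h in C_alpha of int (F^{-1}(t) - G^{-1}(t))^2 h'(t) dt
    on a grid of m points. The minimizing h puts h' = 0 on the
    fraction alpha of (0,1) with the largest squared quantile
    discrepancy and h' = 1/(1-alpha) elsewhere."""
    ts = [(i + 0.5) / m for i in range(m)]
    d2 = sorted((f_quant(t) - g_quant(t)) ** 2 for t in ts)
    keep = round((1 - alpha) * m)  # grid points that survive trimming
    # Riemann sum of d2 * h' over the kept set, with h' = 1/(1-alpha).
    return sum(d2[:keep]) / (m * (1 - alpha))
```

For instance, two quantile functions that differ only on the top 5% of the unit interval yield a zero trimmed distance at level $\alpha = 0.1$, since the discrepancy set can be trimmed away entirely.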

The following lemmas collect some results which we use in our proofs of Theorems 2 and A.1. These results can be easily proven using Schwarz's inequality, standard arguments in empirical process theory, or the Arzelà–Ascoli theorem.

Lemma A.1. If $F$ and $G$ have finite absolute moment of order $r > 4$, then the following hold:

a. $\sqrt{n}\int_0^{1/n} (F^{-1}(t))^2\,dt \to 0$ and $\sqrt{n}\int_{1-1/n}^{1} (F^{-1}(t))^2\,dt \to 0$.

b. $\sqrt{n}\int_0^{1/n} (F_n^{-1}(t))^2\,dt \to 0$ and $\sqrt{n}\int_{1-1/n}^{1} (F_n^{-1}(t))^2\,dt \to 0$ in probability.

c. $\int_0^1 \frac{\sqrt{t(1-t)}}{g(G^{-1}(t))}\,|F^{-1}(t) - G^{-1}(t)|\,dt < \infty$.

d. Furthermore, if $G$ satisfies (13), then $\frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n} \frac{t(1-t)}{g^2(G^{-1}(t))}\,dt \to 0$.

Lemma A.2. Under the $\|\cdot\|_\infty$ topology, the set $\mathcal{C}_\alpha$ in Proposition 1 and the set $\mathcal{C}_\alpha(F,G) = \{h \in \mathcal{C}_\alpha : \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h'(t)\,dt = 0\}$, for $F, G \in \mathcal{P}_2$, are compact.

Proof of Theorem 2

From theorem 6.2.1 of Csörgő and Horváth (1993) and (13), we can assume that there exist Brownian bridges $B_n$ satisfying

$$n^{1/2-\nu} \sup_{1/n \le t \le 1-1/n} \frac{|\rho_n(t) - B_n(t)|}{(t(1-t))^{\nu}} = \begin{cases} O_P(\log n) & \text{if } \nu = 0, \\ O_P(1) & \text{if } 0 < \nu \le 1/2. \end{cases} \tag{A.1}$$

Now we set $M_n(h) = \sqrt{n}\int_0^1 (F_n^{-1}(t) - G^{-1}(t))^2 h'(t)\,dt$ and

$$N_n(h) = 2\int_{1/n}^{1-1/n} \frac{B_n(t)}{f(F^{-1}(t))}\,(G^{-1}(t) - F^{-1}(t))\,h'(t)\,dt + \sqrt{n}\int_{1/n}^{1-1/n} (G^{-1}(t) - F^{-1}(t))^2 h'(t)\,dt.$$

Observe that

$$\begin{aligned} \sup_{h\in\mathcal{C}_\alpha} |M_n(h) - N_n(h)| &\le \sqrt{n}\int_0^{1/n} (F_n^{-1}(t) - G^{-1}(t))^2\,dt + \sqrt{n}\int_{1-1/n}^{1} (F_n^{-1}(t) - G^{-1}(t))^2\,dt \\ &\quad + \frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n} \frac{|\rho_n(t) - B_n(t)|^2}{f^2(F^{-1}(t))}\,dt + \frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n} \frac{B_n(t)^2}{f^2(F^{-1}(t))}\,dt \\ &\quad + 2\int_{1/n}^{1-1/n} \frac{|\rho_n(t) - B_n(t)|}{f(F^{-1}(t))}\,|G^{-1}(t) - F^{-1}(t)|\,dt \\ &= A_{n1} + A_{n2} + A_{n3} + A_{n4} + A_{n5}. \end{aligned}$$

Lemma A.1 implies that $A_{n1} \to 0$ and $A_{n2} \to 0$ in probability. From (A.1) we get $A_{n3} \le O_P(1)\,\frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n} \frac{t(1-t)}{f^2(F^{-1}(t))}\,dt$, and this last integral tends to 0 by Lemma A.1. Thus $A_{n3} \to 0$ in probability. Similarly, $A_{n4} \to 0$ in probability. Finally, (A.1) yields $A_{n5} \le O_P(1)\, n^{\nu-1/2}\int_{1/n}^{1-1/n} \frac{(t(1-t))^{\nu}}{f(F^{-1}(t))}\,|G^{-1}(t) - F^{-1}(t)|\,dt$ for some $\nu\in(0,1/2)$. Lemma A.1 shows that $\int_0^1 \frac{(t(1-t))^{1/2}}{f(F^{-1}(t))}\,|G^{-1}(t) - F^{-1}(t)|\,dt < \infty$. Thus, by dominated convergence, we obtain that $A_{n5} \to 0$ in probability. Collecting the foregoing estimates, we obtain $\sup_{h\in\mathcal{C}_\alpha}|M_n(h) - N_n(h)| \to 0$ in probability, and thus $\sqrt{n}\,(T_{n,\alpha} - S_{n,\alpha}) \to 0$ in probability, where $\sqrt{n}\,S_{n,\alpha} = \inf_{h\in\mathcal{C}_\alpha} N_n(h)$. Therefore, we need only show that $\sqrt{n}\,(S_{n,\alpha} - \tau_\alpha(F,G)) \to_w N(0, \sigma^2_\alpha(F,G))$, where

$$\sqrt{n}\,S_{n,\alpha} = \inf_{h\in\mathcal{C}_\alpha}\left[ 2\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,h'(t)\,dt + \sqrt{n}\int_0^1 (G^{-1}(t) - F^{-1}(t))^2 h'(t)\,dt \right].$$

Let us denote

$$h_n = \arg\min_{h\in\mathcal{C}_\alpha} \left\{ \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h'(t)\,dt + \frac{2}{\sqrt{n}}\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,h'(t)\,dt \right\}.$$

Clearly, $h_n'(t) \to h_0'(t)$ for almost every $t$. Furthermore, optimality of $h_n$ shows that

$$B_n = \sqrt{n}\,S_{n,\alpha} - \left( 2\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,h_0'(t)\,dt + \sqrt{n}\int_0^1 (G^{-1}(t) - F^{-1}(t))^2 h_0'(t)\,dt \right) \le 0,$$

but, in contrast,

$$B_n = \sqrt{n}\left( \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h_n'(t)\,dt - \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h_0'(t)\,dt \right) + 2\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,(h_n'(t) - h_0'(t))\,dt = B_{n1} + B_{n2},$$

704 Journal of the American Statistical Association June 2008

and $B_{n1} \ge 0$ by optimality of $h_0$, whereas $B_{n2} = o_P(1)$ by the dominated convergence theorem. Therefore, $B_n \to 0$ in probability, which shows that

$$\sqrt{n}\,(T_{n,\alpha} - \tau_\alpha(F,G)) \to_w 2\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,h_0'(t)\,dt. \tag{A.2}$$

Integrating by parts, we obtain $\int_0^1 B(t)\,\frac{G^{-1}(t)-F^{-1}(t)}{f(F^{-1}(t))}\,h_0'(t)\,dt = -\int_0^1 l(t)\,dB(t)$, which proves the asymptotic normality and the expression (17) for the variance. The claim about the variance estimator readily follows by noting that $s^2_{n,\alpha}(G) = 4\left(\int_0^1 l_n^2(t)\,dt - \left(\int_0^1 l_n(t)\,dt\right)^2\right)$, where $l_n(t) = \int_{F_n^{-1}(1/2)}^{F_n^{-1}(t)} (x - G^{-1}(F_n(x)))\,h_n'(F_n(x))\,dx$ and $h_n = \arg\min_{h\in\mathcal{C}_\alpha}\int (F_n^{-1} - G^{-1})^2 h'$. It can be shown that, with probability 1, $l_n(t) \to l(t)$ for almost every $t\in(0,1)$. A standard uniform integrability argument completes the proof.

The final result in this section establishes the asymptotic behavior of $nT_{n,\alpha}$ when $F$ and $G$ are equivalent at trimming level $\alpha$. Recall the definition of $\mathcal{C}_\alpha(F,G)$ in Lemma A.2, and note that $\mathcal{C}_\alpha(F,F) = \mathcal{C}_\alpha$, but also note that for $F \neq G$ we have that $\mathcal{C}_\alpha(F,G)$ is a proper subset of $\mathcal{C}_\alpha$. Also note that $\mathcal{C}_\alpha(F,G) \neq \emptyset$ if and only if $\tau_\alpha(F,G) = 0$. In fact, the size of $\mathcal{C}_\alpha(F,G)$ depends on the Lebesgue measure of the set $\{t\in(0,1): F^{-1}(t) \neq G^{-1}(t)\}$: $\tau_\alpha(F,G) = 0$ if and only if the measure of this last set is at most $\alpha$; if it equals $\alpha$, then the only function in $\mathcal{C}_\alpha(F,G)$ corresponds to $h'(t) = \frac{1}{1-\alpha}\,I(F^{-1}(t) = G^{-1}(t))$.

Theorem A.1. If $\tau_\alpha(F,G) = 0$, $F$ satisfies (13), and

$$\int_0^1 \frac{t(1-t)}{f^2(F^{-1}(t))}\,dt < \infty, \tag{A.3}$$

then

$$nT_{n,\alpha} \to_w \min_{h\in\mathcal{C}_\alpha(F,G)} \int_0^1 \frac{B(t)^2}{f^2(F^{-1}(t))}\,h'(t)\,dt,$$

where $\{B(t)\}_{0<t<1}$ is a Brownian bridge. Because $h \mapsto \int_0^1 \frac{B^2(t)}{f^2(F^{-1}(t))}\,h'(t)\,dt$ is $\|\cdot\|_\infty$-continuous as a function of $h$, it attains its minimum value on $\mathcal{C}_\alpha(F,G)$.

Proof. We define $D_n(h) = n\int_0^1 (F_n^{-1}(t) - G^{-1}(t))^2 h'(t)\,dt$ and $D(h) = \int_0^1 \frac{B^2(t)}{f^2(F^{-1}(t))}\,h'(t)\,dt$ for $h \in \mathcal{C}_\alpha$. Note that

$$D_n(h) = \int_0^1 \frac{\rho_n^2(t)}{f^2(F^{-1}(t))}\,h'(t)\,dt + n\int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h'(t)\,dt + 2\sqrt{n}\int_0^1 \frac{\rho_n(t)}{f(F^{-1}(t))}\,(F^{-1}(t) - G^{-1}(t))\,h'(t)\,dt.$$

Also observe that $nT_{n,\alpha} = D_n(h_n)$ for some $h_n \in \mathcal{C}_\alpha$. If $h \in \mathcal{C}_\alpha(F,G)$, then the second and third summands on the right side vanish, and $D_n(h) = \int_0^1 \frac{\rho_n^2(t)}{f^2(F^{-1}(t))}\,h'(t)\,dt$. By (13), (A.3), and a.s. representation of weak convergence, versions of $\rho_n(\cdot)/f(F^{-1}(\cdot))$ and $B(\cdot)/f(F^{-1}(\cdot))$ exist (for which we keep the same notation) such that $\left\|\rho_n(\cdot)/f(F^{-1}(\cdot)) - B(\cdot)/f(F^{-1}(\cdot))\right\|_2 \to 0$ a.s. Now, for these versions, we have

$$\sup_{h\in\mathcal{C}_\alpha(F,G)} |D_n(h) - D(h)| \le \frac{1}{1-\alpha}\int_0^1 \left| \frac{\rho_n^2(t)}{f^2(F^{-1}(t))} - \frac{B^2(t)}{f^2(F^{-1}(t))} \right| dt \to 0 \quad \text{a.s.},$$

whereas for $h_0 \in \mathcal{C}_\alpha - \mathcal{C}_\alpha(F,G)$, we have, a.s., that $D_n(h) \to \infty$ uniformly in a sufficiently small neighborhood of $h_0$. Furthermore, if $h_n \to h \in \mathcal{C}_\alpha(F,G)$, then we can extract a subsequence such that $n\int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h_n'(t)\,dt \to 0$. The result follows from the next technical lemma, the easy proof of which is omitted.

Lemma A.3. Let $(X,d)$ be a compact metric space, let $A$ be a compact subset of $X$, and let $f_n$ and $f$ be real-valued continuous functions on $X$ such that the following hold:

a. $\sup_{x\in A} |f_n(x) - f(x)| \to 0$ as $n \to \infty$.

b. For $x \in X - A$, there exists $\varepsilon_x > 0$ such that $\inf_{d(y,x)<\varepsilon_x} f_n(y) \to \infty$ as $n \to \infty$.

c. If $x_n \to x \in A$, there exists a subsequence $\{x_m\}$ such that $f_m(x_m) \to f(x)$.

Then $\min_{x\in X} f_n(x) \to \min_{x\in A} f(x)$.
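The conclusion of Lemma A.3 can be checked on a toy example. With the hypothetical choices (ours, not the authors') $f(x) = (x - 0.8)^2$ on $X = [0,1]$, $A = [0, 0.5]$, and $f_n(x) = f(x) + n\max(0, x - 0.5)$, conditions (a)–(c) hold, and $\min_X f_n$ converges to $\min_A f = 0.09$ even though the unrestricted minimum of $f$ is 0, attained outside $A$. A short Python sketch on a discretized grid:

```python
def min_over(xs, fn):
    """Minimum of fn over a finite grid of points."""
    return min(fn(x) for x in xs)

xs = [i / 1000 for i in range(1001)]  # discretized X = [0, 1]
A = [x for x in xs if x <= 0.5]       # compact subset A

f = lambda x: (x - 0.8) ** 2          # min over X is 0, attained at x = 0.8 (off A)

def fn_factory(n):
    # f_n agrees with f on A and diverges uniformly off A as n grows,
    # so conditions (a) and (b) of the lemma hold; (c) is immediate.
    return lambda x: f(x) + n * max(0.0, x - 0.5)

target = min_over(A, f)               # min over A of f = (0.5 - 0.8)^2 = 0.09
mins = [min_over(xs, fn_factory(n)) for n in (1, 10, 100, 1000)]
```

The penalty forces the minimizer of $f_n$ back into $A$, so each entry of `mins` matches `target`, as the lemma predicts.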

[Received October 2007. Revised January 2008.]

REFERENCES

Bickel, P., and Freedman, D. (1981), "Some Asymptotic Theory for the Bootstrap," The Annals of Statistics, 9, 1196–1217.

Csörgő, M., and Horváth, L. (1993), Weighted Approximations in Probability and Statistics, New York: Wiley.

Cuesta, J. A., and Matrán, C. (1989), "Notes on the Wasserstein Metric in Hilbert Spaces," The Annals of Probability, 17, 1264–1276.

Cuesta, J., Gordaliza, A., and Matrán, C. (1997), "Trimmed k-Means: An Attempt to Robustify Quantizers," The Annals of Statistics, 25, 553–576.

Czado, C., and Munk, A. (1998), "Assessing the Similarity of Distributions—Finite-Sample Performance of the Empirical Mallows Distance," Journal of Statistical Computation and Simulation, 60, 319–346.

Freitag, G., Czado, C., and Munk, A. (2007), "A Nonparametric Test for Similarity of Marginals With Applications to the Assessment of Population Bioequivalence," Journal of Statistical Planning and Inference, 137, 697–711.

García-Escudero, L., Gordaliza, A., Matrán, C., and Mayo-Iscar, A. (2008), "A General Trimming Approach to Robust Cluster Analysis," The Annals of Statistics, to appear.

Gordaliza, A. (1991), "Best Approximations to Random Variables Based on Trimming Procedures," Journal of Approximation Theory, 64, 162–180.

Maronna, R. (2005), "Principal Components and Orthogonal Regression Based on Robust Scales," Technometrics, 47, 264–273.

Moore, D. S., and McCabe, G. P. (2003), Introduction to the Practice of Statistics (4th ed.), New York: W.H. Freeman.

Munk, A., and Czado, C. (1998), "Nonparametric Validation of Similar Distributions and Assessment of Goodness of Fit," Journal of the Royal Statistical Society, Ser. B, 60, 223–241.

Rousseeuw, P. (1985), "Multivariate Estimation With High Breakdown Point," in Mathematical Statistics and Applications, Vol. B, eds. W. Grossmann, G. Pflug, I. Vincze, and W. Wertz, Dordrecht: Reidel, pp. 283–297.
