
ISSN 1066-5307, Mathematical Methods of Statistics, 2008, Vol. 17, No. 3, pp. 261–277. © Allerton Press, Inc., 2008.

kn-Nearest Neighbor Estimators of Entropy

R. M. Mnatsakanov¹*, N. Misra², Sh. Li¹, and E. J. Harner³

¹ West Virginia Univ. and National Inst. for Occupational Safety and Health, Morgantown, USA
² West Virginia Univ., Morgantown, USA, and Indian Inst. of Technology, Kanpur, India
³ West Virginia Univ., Morgantown, USA

Received October 21, 2006; in final form, August 7, 2008

Abstract—For estimating the entropy of an absolutely continuous multivariate distribution, we propose nonparametric estimators based on the Euclidean distances between the n sample points and their $k_n$-nearest neighbors, where $\{k_n : n = 1, 2, \dots\}$ is a sequence of positive integers varying with n. The proposed estimators are shown to be asymptotically unbiased and consistent.

Key words: entropy estimator, kn-nearest neighbor.

2000 Mathematics Subject Classification: primary 62G05; secondary 94A17.

DOI: 10.3103/S106653070803006X

1. INTRODUCTION AND PRELIMINARIES

Problems in molecular science, fluid mechanics, chemistry, physics, etc., often require the estimation of the unknown differential entropy H(f) (see (1) below) of the probability density function (p.d.f.) f, also known as the Shannon entropy. Researchers at the National Institute for Occupational Safety and Health are modeling the random fluctuations of molecules by means of molecular dynamics simulation in order to study their properties and functions. We focus on estimating the configurational entropy, which measures the freedom of the molecular system to explore its available configuration space. Specifically, entropy evaluation is necessary to understand the stability of a conformation and the changes that take place in going from one conformation to another. The entropy of a molecular conformation depends on the following variables: bond lengths, bond angles, and torsional (or dihedral) angles. Entropy is mainly determined by the fluctuations in torsional angles, since bond lengths and bond angles are rather stable variables. The difficulty is that the number of torsional angles can be quite large, i.e., the dimension p of the conformation space can be high.

An attempt to construct models describing the fluctuations of torsional angles in molecules (different from the normal distribution) was presented in Demchuk and Singh [1]. It was shown that the von Mises distribution can be used when modeling the torsional angles in certain cases.

The form of f is complex and unknown, except for certain simple molecules such as methanol. Since the marginal distributions of the fluctuations of molecular dihedral angles generally exhibit multimodality and skewness, a nonparametric approach to entropy estimation seems appropriate. The construction of entropy estimates based on k-nearest neighbor (kNN) techniques will help us to overcome the dimensionality problem.

kNN estimation techniques have been studied extensively in both the computer science and statistical literature; the estimators have a simple construction and can be easily implemented computationally. However, algorithms implementing smoothing kernel density estimation techniques generally have fallen short in high-dimensional cases with massive data sets (see, for example, Scott [2]).

To describe the construction of the k-nearest neighbor estimator of entropy, let $X_1, \dots, X_n$ be n independent copies of a p-dimensional random variable (r.v.) X with a common distribution having a

*E-mail: [email protected]


p.d.f. f (with respect to the Lebesgue measure defined on a support of f). The entropy of f (or of the r.v. X) is defined by

$$H(f) = E_f\big(-\log f(X)\big) = -\int_{\mathbb{R}^p} f(x)\,\log f(x)\,dx, \qquad (1)$$

where $\mathbb{R}^p$ denotes the p-dimensional space. In the sequel, to avoid the uncertainty (when f(x) = 0), we consider the integration in (1) over the set $D = \{x \in \mathbb{R}^p : f(x) > 0\}$. Based on a random sample $X_1, \dots, X_n$, our goal is to estimate the entropy H(f), as defined in (1), and establish the $L_2$-consistency of the corresponding k-nearest neighbor estimators as $k = k_n \to \infty$ and $n \to \infty$.

At first, let us introduce the construction based on the k-nearest neighbor density estimators studied by Kozachenko and Leonenko [3] for k = 1, and by Singh et al. [4] and Goria et al. [5] for k ≥ 1. Namely, the following asymptotically unbiased and consistent estimator has been proposed:

$$H_{k,n} = \frac{p}{n}\sum_{i=1}^{n}\log d_{i,k,n} + \log\left(\frac{\pi^{p/2}}{\Gamma(p/2+1)}\right) + \gamma - L_{k-1} + \log n, \qquad (2)$$

where, for each k ≤ n − 1, $d_{i,k,n}$ is the Euclidean distance between $X_i$ and its kth nearest neighbor among $X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n$, $\gamma = 0.5772\ldots$ is the Euler constant, $L_0 = 0$, and $L_j = \sum_{i=1}^{j} i^{-1}$, $j = 1, 2, \dots$
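The estimator in (2) depends on the sample only through the kth nearest neighbor distances, so it is straightforward to compute. The following Python sketch (not part of the original paper; a brute-force distance computation is used for clarity) illustrates one way to evaluate $H_{k,n}$:

```python
import numpy as np
from math import lgamma, log, pi

def H_kn(X, k):
    """Sketch of the estimator H_{k,n} in Eq. (2); X is an (n, p) sample array."""
    n, p = X.shape
    # pairwise Euclidean distances; the diagonal is set to +inf so that
    # a point is never counted as its own neighbor
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    d_k = np.sort(d, axis=1)[:, k - 1]            # distance to the k-th nearest neighbor
    gamma_euler = 0.5772156649015329              # Euler's constant
    L_km1 = sum(1.0 / i for i in range(1, k))     # L_{k-1}, with L_0 = 0
    log_unit_ball = (p / 2) * log(pi) - lgamma(p / 2 + 1)   # log(pi^{p/2}/Gamma(p/2+1))
    return (p / n) * np.log(d_k).sum() + log_unit_ball + gamma_euler - L_km1 + log(n)
```

For example, `H_kn(np.random.rand(1000, 2), k=5)` should be close to 0, the entropy of the uniform distribution on the unit square.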

Note that the rate of convergence of $H_{k,n}$ has still not been derived when p ≥ 2. Another very important question related to this construction is the derivation of the optimal (for instance, in the sense of the Root Mean Squared Error (RMSE)) values of k as a function of the sample size n and the dimension p. Currently, we are working on these questions, which are addressed partially via the simulation studies in Section 3.

In the one-dimensional case (p = 1), several authors have proposed estimators of entropy in the context of goodness-of-fit testing (Vasicek [6], Dudewicz and van der Meulen [7]):

$$V_{m,n} = \frac{1}{n-m}\sum_{i=1}^{n-m}\log\big(X_{(i+m)} - X_{(i)}\big) - \psi(m) + \log n,$$

where m (≤ n − 1) is a fixed positive integer, $X_{(1)}, \dots, X_{(n)}$ are the order statistics, and $\psi(\cdot) = \Gamma'(\cdot)/\Gamma(\cdot)$ is the digamma function (see Abramowitz and Stegun [8] for the properties of the gamma $\Gamma(\cdot)$ and digamma functions). Under certain conditions the asymptotic normality of $V_{m,n}$ is proved. Namely, as n → ∞,

$$\sqrt{n}\,\big(V_{m,n} - H(f)\big) \xrightarrow{d} N\Big(0,\ (2m^2 - 2m + 1)\,\psi'(m) - 2m + 1 + \mathrm{Var}\{\log f(X_1)\}\Big). \qquad (3)$$
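As a point of comparison, the spacing-based estimator $V_{m,n}$ is easy to compute in the univariate case. The sketch below is illustrative only (not the authors' code); for integer m, $\psi(m)$ is evaluated exactly through the harmonic-sum identity:

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329

def V_mn(x, m):
    """Sketch of the m-spacing estimator V_{m,n} for a one-dimensional sample x."""
    x = np.sort(np.asarray(x, dtype=float))     # order statistics X_(1), ..., X_(n)
    n = x.size
    spacings = x[m:] - x[:n - m]                # X_(i+m) - X_(i), i = 1, ..., n - m
    psi_m = -EULER_GAMMA + sum(1.0 / i for i in range(1, m))   # psi(m) for integer m
    return np.mean(np.log(spacings)) - psi_m + np.log(n)
```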

Let us note that the asymptotic variance of the estimator $H_{k,n}$, when k is fixed and n → ∞, has the following form:

$$\mathrm{Var}\{H_{k,n}\} \sim \frac{\psi'(k) + \mathrm{Var}\{\log f(X_1)\}}{n} \qquad (4)$$

(see Singh et al. [4], formula (15)). Here the function $\psi'(k) = \sum_{j=k}^{\infty} j^{-2}$ is a decreasing function of k, so that the larger the value of k, the smaller the variance of $H_{k,n}$. That is why it is reasonable to assume that k should be an increasing function of n. Monte Carlo simulations can be done to obtain practical rules of thumb for the order of k. For example, in simulation studies (see Goria et al. [5]), a heuristic formula $k = [\sqrt{n} + 0.5]$ was suggested for small sample sizes in the range 10–10². They chose the beta, gamma, Laplace, Cauchy, and Student-t distributions for simulating the data. In Section 3, simulations from the uniform, beta, gamma, normal, von Mises, bivariate circular, and log-normal distributions with sample sizes n in the range 50–10⁴ are conducted. We approximated the optimal rates of $k_n$ (via minimization of the RMSE) for different distributions and derived $k_n = [c \log^2 n]$, where, for example, in the univariate uniform model, c = 0.44. For the choices of c in other models see the lines before Table 1 in Section 3.


Now compare the asymptotic variances of $V_{m,n}$ and $H_{k,n}$ defined in (3) and (4), respectively, as n → ∞, with k = m and p = 1. The comparison shows that an increasing k yields a better behavior for $H_{k,n}$ (compared with $V_{m,n}$). Indeed, in Singh et al. [4] it was proved that $\mathrm{Var}\{V_{m,n}\} < \mathrm{Var}\{H_{m,n}\}$, while $\mathrm{Var}\{H_{3m,n}\} < \mathrm{Var}\{V_{m,n}\}$ for sufficiently large n and for each fixed m ≤ n − 1. Hence, the estimate $V_{m,n}$ based on the m-spacings $X_{(i+m)} - X_{(i)}$ is better than the m-nearest neighbor estimate in one-dimensional models. Unfortunately, the construction of $V_{m,n}$ cannot be extended to the multidimensional case. Note also that the class of entropy estimates based on m-spacings of increasing order m = m(n) was proposed in van Es [9], where their almost sure consistency and asymptotic normality were derived under the condition that f is bounded. In Goria et al. [5] it is also assumed that f is bounded, while we do not require this condition (see Theorems 2.1 and 2.2 in Section 2).

Now let $R_{i,n}$ be the distance between $X_i$ and its $k_n$th nearest neighbor among $X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n$, $i = 1, \dots, n$, i.e., $R_{i,n} := d_{i,k_n,n}$. For a positive real number r and x ∈ D, let $N_x(r)$ denote the ball of radius r centered at x (with respect to the Euclidean distance, denoted as ‖x − y‖):

$$N_x(r) = \{y \in \mathbb{R}^p : \|x - y\| \le r\}, \qquad (5)$$

and let V(r) be the volume of $N_x(r)$, i.e.,

$$V(r) = \int_{N_x(r)} dy = c_p r^p. \qquad (6)$$

Here $c_p = \pi^{p/2}/\Gamma(p/2+1)$ is the volume of the unit ball in $\mathbb{R}^p$. Then

$$\lim_{r \downarrow 0} \frac{1}{V(r)} \int_{N_x(r)} f(y)\,dy = f(x) \qquad (7)$$

for almost all x (with respect to the probability measure induced by the p.d.f. f), where $N_x(r)$ and V(r) are defined in (5) and (6), respectively.

Equation (7) suggests (for small values of r) a histogram type approximation of the p.d.f. f(x) at a point x ∈ D:

$$f_n^{(1)}(x) = \frac{|N_x(r)|}{n\,V(r)}.$$

Here $|N_x(r)|$ denotes the cardinality of the set $\{i : X_i \in N_x(r)\}$. Namely, to construct a nonparametric estimate of $H(f) = E_f(-\log f(X))$ one can consider the following estimator:

$$H_n = -\frac{1}{n}\sum_{i=1}^{n} \log\big[f_n^{(1)}(X_i)\big].$$
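For concreteness, the following sketch (illustrative, not from the paper; the radius r is left as a user-chosen tuning parameter, and the count $|N_{X_i}(r)|$ includes the point $X_i$ itself, as the set definition suggests) shows how $H_n$ could be computed from a sample:

```python
import numpy as np
from math import lgamma, log, pi, exp

def H_hist(X, r):
    """Sketch of the histogram-type estimator H_n based on f_n^(1) with a fixed radius r."""
    n, p = X.shape
    V_r = exp((p / 2) * log(pi) - lgamma(p / 2 + 1)) * r**p   # V(r) = c_p r^p
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    counts = (d <= r).sum(axis=1)       # |N_{X_i}(r)|, the point itself included
    f_hat = counts / (n * V_r)          # f_n^(1)(X_i)
    return -np.mean(np.log(f_hat))
```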

For the case where the distance between the points x and y in $\mathbb{R}^p$ is defined as

$$\|x - y\| = \max_{1 \le j \le p} |x_j - y_j|,$$

the histogram type estimate $H_n$ with $V(r) = 2^p r^p$ was studied in Hall and Morton [10].

A common approach to estimating the entropy of a multivariate distribution is to replace the p.d.f. f in definition (1) by its nonparametric kernel or histogram density estimator (Beirlant et al. [11], Scott [2]). However, in many practical situations, the implementation of such estimates in high dimensions (which is the case with the dimension of dihedral angles of macromolecules) becomes difficult. The asymptotic properties of kernel-type entropy estimators of H(f) have been studied in Hall and Morton [10], Joe [12], and Ivanov and Rozhkova [13], among others. It was mentioned (see, for example, Hall and Morton [10]) that the rate of convergence is $n^{-1/2}$ when the dimension is p ≤ 3 for kernel-density entropy estimators, while for histogram-type constructions $H_n$ the rate of convergence is $n^{-1/2}$ only if p = 1, 2. In particular, $\sqrt{n}$-consistency of the histogram entropy estimator was proved under certain tail conditions and smoothness properties of f when p = 1.


It was mentioned above that the asymptotic behavior of k-nearest neighbor entropy estimators is related to the construction based on spacings. That is why one may expect to have $\sqrt{n}$-consistency for our estimator $G_n$ in (9) for univariate non-uniform models, and $(n k_n)^{-1/2}$-consistency when f is a uniform density (cf. van Es [9]) as n → ∞ and $k = k_n \to \infty$. Table 2 in Section 3 supports this conjecture. For distributions with unbounded support, Tsybakov and van der Meulen [14] studied the $\sqrt{n}$-consistency of nearest neighbor (k = 1) entropy estimators based on a truncation technique, when p = 1 and the density function f has exponentially decreasing tails. The $\sqrt{n}$-consistency of k-nearest neighbor entropy estimators with unbounded support in $\mathbb{R}^p$ is still an open question when p ≥ 2. The solution of this problem is connected with the derivation of the rate of convergence for $\mathrm{Cov}(T_{1,n}, T_{2,n}) \to 0$ (see (21)), which is not easy to show. The simulation studies below show that, when p = 2, it is still possible to have $\sqrt{n}$-consistency for k-nearest neighbor entropy estimators (see Table 3 in Section 3).

Finally, let us introduce the entropy estimators based on the $k_n$-nearest neighbor density estimator of $f(X_i)$ proposed by Loftsgaarden and Quesenberry [15]:

$$f_n^{(2)}(X_i) = \frac{k_n}{n\,D_{i,n}(\mathbf{X})}, \qquad i = 1, \dots, n,$$

where $\mathbf{X} = (X_1, \dots, X_n)'$ and

$$D_{i,n}(\mathbf{X}) = \int_{N_{X_i}(R_{i,n})} dy = \frac{\pi^{p/2} R_{i,n}^p}{\Gamma(p/2+1)} = c_p R_{i,n}^p.$$

Denote the corresponding estimator of H(f) by

$$G_n = -\frac{1}{n}\sum_{i=1}^{n}\log\big(f_n^{(2)}(X_i)\big) = \frac{1}{n}\sum_{i=1}^{n}\log\left(\frac{n\,D_{i,n}(\mathbf{X})}{k_n}\right) = \frac{1}{n}\sum_{i=1}^{n}\log\left(\frac{n\,c_p R_{i,n}^p}{k_n}\right). \qquad (8)$$

Combining (21), (23), and (25) (see Section 2), one can compare the asymptotic variance of $G_n$ with the asymptotic variance of $H_{k,n}$ given in (4). We can see that, asymptotically, $\mathrm{Var}\{G_n\}$ is smaller than $\mathrm{Var}\{H_{k,n}\}$ as n → ∞ and k is fixed.

Note that, when $k_n \equiv k$ is a fixed integer, the asymptotic bias of the estimator $G_n$ is equal to $\psi(k) - \log k$ (see Singh et al. [4], Goria et al. [5], and Leonenko et al. [16]), where $\psi(k) = L_{k-1} - \gamma$. The form of the bias term can be explained by observing that the limiting distribution of the summands in (8), given $X_i = x$, represents the gamma distribution with mean $-\log f(x) + \psi(k) - \log k$.

Note also that the estimators studied in Goria et al. [5] and $H_{k,n}$ defined in (2) can be rewritten in the form (8) with $b_n := \exp(\psi(k_n))$ instead of $k_n$ in the denominator:

$$G_n = \frac{1}{n}\sum_{i=1}^{n}\log\left(\frac{n\,c_p R_{i,n}^p}{b_n}\right). \qquad (9)$$
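The following Python sketch (illustrative only, not the authors' code) computes $G_n$ in the form (9); the identity $\psi(k) = L_{k-1} - \gamma$ noted above is used to evaluate $b_n = \exp(\psi(k_n))$ for integer $k_n$:

```python
import numpy as np
from math import lgamma, log, exp, pi

EULER_GAMMA = 0.5772156649015329

def G_n(X, k_n):
    """Sketch of the k_n-nearest neighbor entropy estimator G_n in Eq. (9)."""
    n, p = X.shape
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    R = np.sort(d, axis=1)[:, k_n - 1]                 # R_{i,n}: k_n-th NN distance
    c_p = exp((p / 2) * log(pi) - lgamma(p / 2 + 1))   # volume of the unit ball
    psi_k = -EULER_GAMMA + sum(1.0 / i for i in range(1, k_n))  # psi(k_n) = L_{k_n-1} - gamma
    b_n = exp(psi_k)
    return np.mean(np.log(n * c_p * R**p / b_n))
```

Replacing `b_n` by `k_n` in the last line gives the variant (8); by Remark 1 below, both forms behave the same asymptotically since $\psi(k) \sim \log k$ for large k.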

Remark 1. The proofs of Theorems 2.1 and 2.2 in Section 2 are valid for the constructions (8) and (9) as $k_n \to \infty$, but we will focus only on the $L_2$-consistency of the estimator (9). Observe also that $\psi(k) \sim \log k$ when k is large, and hence $b_n \sim k_n$ as n → ∞. Our proof is based on the property that the limiting conditional distributions of the summands in (9), given $X_i = x$, are degenerate at $-\log f(x)$ as $k_n \to \infty$.

After submitting the present paper, we discovered the work of Leonenko et al. [16], where the consistency of the kNN estimators of the Tsallis, Rényi, and Shannon entropies has been derived when k is a fixed integer and f is bounded (in the cases of estimating the Shannon entropy H(f) and the functional $I_q = \int_{\mathbb{R}^p} f^q(x)\,dx$, q > 1). The authors also made a conjecture concerning the almost sure consistency of $G_n$ when f belongs to the class of uniformly continuous functions $\mathcal{F}$ such that $0 < c_1 \le f(x) \le c_2 < \infty$ for some $c_1, c_2$, and $k_n \to \infty$, $k_n/n \to 0$, $k_n/\log n \to \infty$ as n → ∞. In Theorems 2.1 and 2.2 below we do not require f to be a bounded and uniformly continuous function when estimating the Shannon entropy H(f).


Remark 2. Using our approach from Section 2, the weak and $L_2$-consistency of the $k_n$-nearest neighbor estimators of the Rényi and Tsallis entropies, given by

$$H_q^* = \frac{1}{1-q}\log I_q \quad\text{and}\quad H_q = \frac{1}{q-1}(1 - I_q), \qquad q \ne 1,$$

respectively, can be derived as $k = k_n \to \infty$ without requiring that the function f is bounded. See, for example, Corollary 2.1, where the asymptotic unbiasedness of the Tsallis estimator $H_{n,q}$ is proved. The proof of the $L_2$-consistency of $H_{n,q}$ is long and will be presented in a forthcoming paper. Note also that the limits of both entropies, $H_q$ and $H_q^*$, as q → 1 are equal to H(f). So, to estimate the Shannon entropy H(f), one can apply the kNN estimators of the Tsallis or Rényi entropies with properly chosen $k = k_n \to \infty$, $q = q_n \to 1$ as n → ∞. This question is currently under investigation. When k is a fixed integer, the estimation of H(f) by means of $\lim_{q \to 1} H_{n,k,q}$ (where $H_{n,k,q}$ is the Tsallis entropy estimator from Leonenko et al. [16]) reduces to (9) with $k_n = k$.

2. ASYMPTOTIC UNBIASEDNESS AND CONSISTENCY

Standard results stated in the following lemma will be useful in the proofs of Theorems 2.1 and 2.2, which study the asymptotic properties of the estimators.

Lemma 2.1 (see Billingsley [17], p. 105). (i) Let $Y_m \sim \mathrm{Bin}(m, p_m)$, m = 1, 2, ..., where $\{p_m,\ m = 1, 2, \dots\}$ is a sequence of real numbers satisfying

$$0 < p_m < 1,\ m = 1, 2, \dots, \qquad p_m \to 0 \ \text{ and } \ m p_m \to \infty \ \text{ as } m \to \infty. \qquad (10)$$

Then the sequence of r.v.'s

$$Z_m = \frac{Y_m - m p_m}{\sqrt{m p_m (1 - p_m)}}, \qquad m = 1, 2, \dots,$$

has a limiting (m → ∞) normal distribution with mean 0 and variance 1.

(ii) Let $\mathbf{Y}_m = (Y_{1,m}, Y_{2,m}) \sim \mathrm{Mult}(m, p_{1,m}, p_{2,m}, p_{3,m})$, m = 1, 2, ..., where $p_{1,m}$ and $p_{2,m}$, m = 1, 2, ..., satisfy the conditions in (10) and $p_{3,m} = 1 - p_{1,m} - p_{2,m} > 0$. Define

$$Z_{i,m} = \frac{Y_{i,m} - m p_{i,m}}{\sqrt{m p_{i,m}(1 - p_{i,m})}}, \qquad i = 1, 2, \quad m = 1, 2, \dots$$

Then the sequence $\mathbf{Z}_m = (Z_{1,m}, Z_{2,m})'$, m = 1, 2, ..., of r.v.'s has a limiting bivariate normal distribution with mean vector (0, 0)' and covariance matrix $((\sigma_{i,j}))$, where $\sigma_{1,1} = \sigma_{2,2} = 1$ and $\sigma_{1,2} = \sigma_{2,1} = 0$.

In the following theorem, we establish that the estimator $G_n$ defined in (9) is asymptotically unbiased for estimating the entropy H(f).

Theorem 2.1. Suppose that there exists an ε > 0 such that

$$\int_{\mathbb{R}^p} |\log f(x)|^{1+\varepsilon} f(x)\,dx < \infty \qquad (11)$$

and

$$\int_{\mathbb{R}^p}\int_{\mathbb{R}^p} \big|\log(\|x - y\|)\big|^{1+\varepsilon} f(x) f(y)\,dx\,dy < \infty. \qquad (12)$$

Then

$$\lim_{n \to \infty} E_f(G_n) = H(f) \quad \text{as} \quad k_n \to \infty,\ k_n/\sqrt{n} \to 0,\ n \to \infty.$$


Proof. For $b_n = \exp(\psi(k_n))$ define

$$T_{i,n} = \log\left(\frac{n\,D_{i,n}(\mathbf{X})}{b_n}\right) = \log\left(\frac{n\,c_p R_{i,n}^p}{b_n}\right), \qquad i = 1, 2, \dots, n,$$

so that

$$G_n = \frac{1}{n}\sum_{i=1}^{n} T_{i,n}.$$

Clearly, $T_{1,n}, T_{2,n}, \dots, T_{n,n}$ are identically distributed. Therefore,

$$E_f(G_n) = E_f(T_{1,n}).$$

For a fixed x ∈ D, let $S_{x,n}$ be a r.v. whose distribution is the conditional distribution of $T_{1,n}$ given $X_1 = x$. Let $F_{x,n}(\cdot)$ be the corresponding d.f. Then

$$E_f(G_n) = E_f(T_{1,n}) = \int_{\mathbb{R}^p} E_f(S_{x,n}) f(x)\,dx.$$

Note that, for a fixed x ∈ D and u ∈ (−∞, ∞),

$$F_{x,n}(u) = P_f\big\{R_{1,n} \le \rho_n(u) \mid X_1 = x\big\},$$

where

$$\rho_n(u) = \left(\frac{b_n\,\Gamma(p/2+1)\,e^u}{n\,\pi^{p/2}}\right)^{1/p}. \qquad (13)$$

Therefore,

$$F_{x,n}(u) = P_f\{R_{1,n} \le \rho_n(u) \mid X_1 = x\} = P_f\{\text{at least } k_n \text{ of } X_2, \dots, X_n \in N_x(\rho_n(u))\} = P\{B_{n-1}(x,u) \ge k_n\}$$
$$= P\left\{Z_n \ge \sqrt{k_n}\,\frac{1 - (n-1)p_n(x,u)/k_n}{\sqrt{(n-1)p_n(x,u)(1 - p_n(x,u))/k_n}}\right\}, \qquad (14)$$

where $B_{n-1}(x,u) \sim \mathrm{Bin}(n-1, p_n(x,u))$ and

$$Z_n = \frac{B_{n-1}(x,u) - (n-1)p_n(x,u)}{\sqrt{(n-1)p_n(x,u)(1 - p_n(x,u))}} \quad\text{with}\quad p_n(x,u) = \int_{N_x(\rho_n(u))} f(y)\,dy. \qquad (15)$$

Note that, for every fixed u ∈ (−∞, ∞), $\lim_{n\to\infty} \rho_n(u) = 0$. On combining (7) and (13), we get

$$\lim_{n\to\infty}\left[\frac{n}{k_n}\,p_n(x,u)\right] = \lim_{n\to\infty}\left[\frac{b_n}{k_n}\,e^u\,\frac{1}{V(\rho_n(u))}\int_{N_x(\rho_n(u))} f(y)\,dy\right] = e^u f(x) \qquad (16)$$

for almost all x. Consequently, since $k_n \to \infty$, it follows from (16) that, for a fixed u ∈ (−∞, ∞),

$$p_n(x,u) \to 0 \quad\text{and}\quad n\,p_n(x,u) \to \infty \quad\text{as}\quad n \to \infty,$$

and, when n is sufficiently large and $u < -\log f(x)$, we have

$$1 - \frac{(n-1)p_n(x,u)}{k_n} > 0 \qquad (17)$$

for almost all x. Now, on using Lemma 2.1 (i) in combination with (17) and (14), where $k_n \to \infty$, it follows that, for almost all x, $F_{x,n}(u) \to 0$ as n → ∞. Similarly, for each fixed $u > -\log f(x)$, we have $F_{x,n}(u) \to 1$ as n → ∞ for almost all x. Hence, for almost all x, the limiting (n → ∞) distribution of $S_{x,n}$ is degenerate at $-\log f(x)$. Furthermore, an application of Lebesgue's dominated convergence theorem (see Billingsley [18], p. 209) to the right-hand side of

$$F_n(u) = P(T_{1,n} \le u) = \int F_{x,n}(u) f(x)\,dx$$

yields $T_{1,n} \xrightarrow{d} T$. Here T denotes a r.v. distributed according to

$$F_T(u) = \int I\{u > -\log f(x)\}\,f(x)\,dx := Q_f(e^{-u}).$$

The function $Q_f(\cdot) = \int I\{f(x) > \cdot\}\,f(x)\,dx$ on the right-hand side of the previous equation represents the so-called Q-structural function of f. Using the properties of $Q_f$ (see Khmaladze [19]), we derive

$$E_f(T) = \int u\,dF_T(u) = -\int \log\lambda\,dQ_f(\lambda) = -\int f(x)\log f(x)\,dx = H(f).$$

To prove Theorem 2.1, it is sufficient to show $E_f(T_{1,n}) \to E_f(T)$ as n → ∞. Under the conditions (11) and (12) it can be shown (for details see the Appendix) that, for almost all x, there exists a constant $C_1$ (not depending on n) such that for all sufficiently large n

$$E_f\big(|S_{x,n}|^{1+\varepsilon}\big) < C_1 \quad\text{and}\quad E_f\big(|T_{1,n}|^{1+\varepsilon}\big) < C_1. \qquad (18)$$

Consequently, it follows (see, for example, the Corollary to Theorem 25.12 in Billingsley [18], p. 338) that

$$\lim_{n\to\infty} E_f(T_{1,n}) = E_f(T) = H(f).$$

The following theorem establishes the consistency of the estimator Gn.

Theorem 2.2. Suppose that there exists an ε > 0 such that

$$\int_{\mathbb{R}^p} |\log f(x)|^{2+\varepsilon} f(x)\,dx < \infty \qquad (19)$$

and

$$\int_{\mathbb{R}^p}\int_{\mathbb{R}^p} |\log(\|x-y\|)|^{2+\varepsilon} f(x) f(y)\,dx\,dy < \infty. \qquad (20)$$

Then $G_n \xrightarrow{L_2} H(f)$, i.e., $G_n$ is a consistent estimator of H(f) as n → ∞ and $k_n \to \infty$, $k_n/\sqrt{n} \to 0$.

Proof. Using the notation in the proof of Theorem 2.1, we have $G_n = \frac{1}{n}\sum_{i=1}^{n} T_{i,n}$. Since the distribution of the random vector $(T_{1,n}, T_{2,n}, \dots, T_{n,n})'$ is the same under any permutation of its components, we have

$$\mathrm{Var}_f(G_n) = \frac{1}{n}\,\mathrm{Var}_f(T_{1,n}) + \frac{n(n-1)}{n^2}\,\mathrm{Cov}_f(T_{1,n}, T_{2,n}). \qquad (21)$$

Also,

$$E_f(T_{1,n}^2) = \int_{\mathbb{R}^p} E_f(S_{x,n}^2) f(x)\,dx,$$

where $S_{x,n}$ has the same distribution as the conditional distribution of $T_{1,n}$ given $X_1 = x$. On using arguments similar to those used in the Appendix, it can be shown that, for almost all x, there exists a constant $C_2$ (not depending on n) such that

$$E_f(|S_{x,n}|^{2+\varepsilon}) < C_2 \quad\text{and}\quad E_f(|T_{1,n}|^{2+\varepsilon}) < C_2 \qquad (22)$$

for all sufficiently large n.


In a way similar to that in the proof of Theorem 2.1, using condition (19), we get

$$\lim_{n\to\infty} E_f(T_{1,n}^2) = E_f(T^2) = \int u^2\,dQ_f(e^{-u}) = \int (\log\lambda)^2\,dQ_f(\lambda) = \int (\log f(x))^2 f(x)\,dx$$

and

$$\lim_{n\to\infty} \mathrm{Var}_f(T_{1,n}) = \lim_{n\to\infty} E_f(T_{1,n}^2) - \lim_{n\to\infty}\big(E_f(T_{1,n})\big)^2 = \int_{\mathbb{R}^p} (\log f(x))^2 f(x)\,dx - (H(f))^2 = \mathrm{Var}_f(\log f(X_1)). \qquad (23)$$

Also, we can write

$$E_f(T_{1,n}\,T_{2,n}) = \int_{\mathbb{R}^p}\int_{\mathbb{R}^p} E_f\big(S^{(1)}_{x,y,n} S^{(2)}_{x,y,n}\big) f(x) f(y)\,dx\,dy,$$

where $S_{x,y,n} = \big(S^{(1)}_{x,y,n}, S^{(2)}_{x,y,n}\big)'$ has the same distribution as the conditional distribution of $(T_{1,n}, T_{2,n})'$ given $(X_1, X_2) = (x, y)$. For fixed x, y ∈ D and $u_1, u_2 \in (-\infty, \infty)$,

$$P_f\big\{S^{(1)}_{x,y,n} > u_1,\ S^{(2)}_{x,y,n} > u_2\big\} = P_f\big\{R_{1,n} > \rho_n(u_1),\ R_{2,n} > \rho_n(u_2) \mid X_1 = x,\ X_2 = y\big\},$$

with $\rho_n(\cdot)$ defined in (13). Since $\rho_n(u_1)$ and $\rho_n(u_2)$ tend to zero as n → ∞, for x ≠ y we may assume that $N_x(\rho_n(u_1)) \cap N_y(\rho_n(u_2)) = \emptyset$, the empty set, for large n. Thus, for large n,

$$P_f\big\{S^{(1)}_{x,y,n} > u_1,\ S^{(2)}_{x,y,n} > u_2\big\} = P_f\big\{\text{at most } k_n - 1 \text{ of } X_3, \dots, X_n \in N_x(\rho_n(u_1)) \text{ and at most } k_n - 1 \text{ of } X_3, \dots, X_n \in N_y(\rho_n(u_2))\big\}$$
$$= P_f\big\{M_{1,n-2} \le k_n - 1,\ M_{2,n-2} \le k_n - 1\big\},$$

where $(M_{1,n-2}, M_{2,n-2}) \sim \mathrm{Mult}\big(n-2,\ p_n(x,u_1),\ p_n(y,u_2),\ 1 - p_n(x,u_1) - p_n(y,u_2)\big)$ with $p_n(\cdot,\cdot)$ defined in (15). Note that, for fixed $u_1, u_2 \in (-\infty, \infty)$, according to (7) and (13), $p_n(x,u_1)$ and $p_n(y,u_2)$ satisfy condition (16) with $u = u_1$ and $u = u_2$, for almost all x and y. Consequently, for fixed $u_1, u_2 \in (-\infty, \infty)$, it follows that for almost all x and y

$$p_n(x,u_1) \to 0,\quad p_n(y,u_2) \to 0 \quad\text{and}\quad n\,p_n(x,u_1) \to \infty,\quad n\,p_n(y,u_2) \to \infty$$

as n → ∞. Now, on using Lemma 2.1 (ii), it follows that, for almost all x and y, the limiting distribution of $\big(S^{(1)}_{x,y,n}, S^{(2)}_{x,y,n}\big)$ is degenerate at $\big(-\log f(x), -\log f(y)\big)$.

Under the assumptions of the theorem, using arguments similar to those used in the Appendix, it can be shown that for almost all x and y there exists a constant $C_3$ (not depending on n) such that

$$E_f\big(|S^{(1)}_{x,y,n} S^{(2)}_{x,y,n}|^{1+\varepsilon}\big) < C_3 \qquad (24)$$

for all sufficiently large n. Then, using the Corollary to Theorem 25.12 in Billingsley [18], it follows that

$$\lim_{n\to\infty} E_f\big(S^{(1)}_{x,y,n} S^{(2)}_{x,y,n}\big) = \log f(x)\,\log f(y).$$

Again, combining Fatou's Lemma (see Billingsley [18], p. 338) and condition (19), we obtain

$$\lim_{n\to\infty} E_f\big(T_{1,n} T_{2,n}\big) = \lim_{n\to\infty} \int_{\mathbb{R}^p}\int_{\mathbb{R}^p} E_f\big(S^{(1)}_{x,y,n} S^{(2)}_{x,y,n}\big) f(x) f(y)\,dx\,dy$$
$$= \int_{\mathbb{R}^p}\int_{\mathbb{R}^p} \lim_{n\to\infty} E_f\big(S^{(1)}_{x,y,n} S^{(2)}_{x,y,n}\big) f(x) f(y)\,dx\,dy = \int_{\mathbb{R}^p}\int_{\mathbb{R}^p} \log f(x)\log f(y)\,f(x) f(y)\,dx\,dy = \big(H(f)\big)^2$$


and

$$\lim_{n\to\infty} \mathrm{Cov}_f\big(T_{1,n}, T_{2,n}\big) = \lim_{n\to\infty} E_f\big(T_{1,n} T_{2,n}\big) - \lim_{n\to\infty}\big(E_f(T_{1,n})\,E_f(T_{2,n})\big) = \big(H(f)\big)^2 - \big(H(f)\big)^2 = 0. \qquad (25)$$

Finally, since $E_f(G_n - H(f))^2 = \mathrm{Var}_f(G_n) + (E_f(G_n) - H(f))^2$, the statement of Theorem 2.2 follows from Theorem 2.1, (21), (23), and (25).

Let $b_n^* = \{\Gamma(k_n + 1 - q)/\Gamma(k_n)\}^{1/(1-q)}$. In a similar way as in Leonenko et al. [16], define the estimator of the Tsallis entropy as follows:

$$H_{n,q} = \frac{1}{q-1}\{1 - I_{n,q}\} \quad\text{with}\quad I_{n,q} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{n\,c_p R_{i,n}^p}{b_n^*}\right)^{1-q}, \qquad q \ne 1.$$
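As an illustration (not part of the paper), the statistics $I_{n,q}$ and $H_{n,q}$ above could be computed as follows, with $b_n^*$ evaluated through log-gamma functions:

```python
import numpy as np
from math import lgamma, log, exp, pi

def tsallis_knn(X, k_n, q):
    """Sketch of the k_n-NN Tsallis entropy estimator H_{n,q} (q != 1)."""
    n, p = X.shape
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    R = np.sort(d, axis=1)[:, k_n - 1]                            # R_{i,n}
    c_p = exp((p / 2) * log(pi) - lgamma(p / 2 + 1))              # volume of the unit ball
    b_star = exp((lgamma(k_n + 1 - q) - lgamma(k_n)) / (1 - q))   # b_n^*
    I_nq = np.mean((n * c_p * R**p / b_star) ** (1 - q))          # estimate of I_q
    return (1.0 - I_nq) / (q - 1.0)                               # H_{n,q}
```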

Corollary 2.1. (i) If q < 1 and $I_q < \infty$, then

$$\lim_{n\to\infty} E_f(I_{n,q}) = I_q \quad\text{and}\quad \lim_{n\to\infty} E_f(H_{n,q}) = H_q \quad\text{as}\quad k_n \to \infty,\ k_n/n \to 0,\ n \to \infty. \qquad (26)$$

(ii) If $I_s < \infty$ for $1 < s \le 2$, then (26) holds for $1 < q < 2$.

Proof. To show that $I_{n,q}$ is asymptotically unbiased, let

$$T_{i,n} = \frac{n\,c_p R_{i,n}^p}{b_n^*} \quad\text{and}\quad \rho_n^*(u) = \left\{\frac{u\,b_n^*}{n\,c_p}\right\}^{1/p}.$$

Using an argument similar to the one in the proof of Theorem 2.1, we see that the conditional limiting distribution of $T_{1,n}$, given $X_1 = x$, is degenerate at $1/f(x)$ and the limiting distribution of $T_{1,n}$ is defined as follows: $F_T(u) = Q_f(u^{-1})$. To justify this, let us notice that $b_n^*/k_n \to 1$, and instead of (16) we have

$$\lim_{n\to\infty}\left[\frac{n}{k_n}\,p_n(x,u)\right] = \lim_{n\to\infty}\left[\frac{b_n^*\,u}{k_n}\,\frac{1}{V(\rho_n^*(u))}\int_{N_x(\rho_n^*(u))} f(y)\,dy\right] = u f(x).$$

Case (i): Since the function $u \mapsto u^{1-q}$ is bounded on any bounded interval from $(0, \infty)$, an application of the generalized Helly–Bray Lemma (see Loève [20], p. 187) gives

$$\lim_{n\to\infty} E_f(T_{1,n}^{1-q}) = E_f(T^{1-q}) = \int u^{1-q}\,dF_T(u) = -\int \lambda^{q-1}\,dQ_f(\lambda) = -\int f^{q-1}(x) f(x)\,dx = I_q.$$

Case (ii): Since for any 0 < u < 1 we have $\rho_n^*(u) \le \rho_n^*(1) \to 0$ as n → ∞, we conclude that

$$\lim_{n\to\infty}\left[\frac{n}{u\,k_n}\,p_n(x,u)\right] = f(x)$$

for almost all x, uniformly in 0 < u < 1. Hence, for any positive δ, we can take n sufficiently large so that $p_n(x,u) \le \frac{k_n}{n}\,u\,(f(x) + \delta)$ for any 0 < u < 1, and obtain

$$F_{x,n}(u) = P\{B_{k_n,\,n-k_n} \le p_n(x,u)\} \le P\left\{B_{k_n,\,n-k_n} \le \frac{k_n}{n}(f(x)+\delta)\,u\right\}$$
$$= P\big(Q_n \ge k_n\big) \le \frac{E(Q_n)}{k_n} = \frac{n-1}{n}\,(f(x)+\delta)\,u \le (f(x)+\delta)\,u$$

for all sufficiently large n. Here $Q_n \sim \mathrm{Bin}\big(n-1,\ k_n(f(x)+\delta)u/n\big)$ (cf. (40) in the Appendix). Now let $S_{x,n}$ be a r.v. whose distribution is the same as the conditional distribution of $T_{1,n}$ given $X_1 = x$. Using the previous inequality and integration by parts, as in Leonenko et al. [16], we derive

$$E_f(S_{x,n}^{\beta}) = \int_0^{\infty} u^{\beta}\,dF_{n,x}(u) \le 1 - \frac{\beta\,(f(x)+\delta)}{\beta+1} = K_{\varepsilon}(x)$$

for almost all x and for all sufficiently large n. Here $\beta = (1-q)(1+\varepsilon) < 0$ is such that $\beta + 1 > 0$ if $0 \le \varepsilon < (2-q)/(q-1)$, and $\int_{\mathbb{R}^p} K_{\varepsilon}(x) f(x)\,dx < \infty$. Hence conditions similar to (18) are satisfied. An application of the Corollary to Theorem 25.12 in Billingsley [18], p. 338, yields $\lim_{n\to\infty} E_f(T_{1,n}^{1-q}) = I_q$.

Remark 3. Denote the unit torus by $S_1^p$, where $S_1 = \{x \in \mathbb{R}^2 : \|x\| = 1\}$ is the unit circle. Each circular observation $X = (X_1, \dots, X_p)$ on $S_1^p$ can be specified by an angular observation $\theta = (\theta_1, \dots, \theta_p) \in [0, 2\pi)^p$ and vice versa, since $S_1 = \{(\cos\theta_j, \sin\theta_j) : \theta_j \in [0, 2\pi)\}$ for $j = 1, \dots, p$. For measuring distances between angular random variables, say $\theta = (\theta_1, \dots, \theta_p) \in [0, 2\pi)^p$ and $\mu = (\mu_1, \dots, \mu_p) \in [0, 2\pi)^p$, two distance functions of interest are

$$d_1(\theta, \mu) = \sqrt{\sum_{j=1}^{p}\big(\pi - \big|\pi - |\theta_j - \mu_j|\big|\big)^2}$$

and

$$d_2(\theta, \mu) = \sqrt{2\sum_{j=1}^{p}\big(1 - \cos(\theta_j - \mu_j)\big)}$$

(see also formulas (2.3.13) and (2.3.5) in Mardia and Jupp [21], when p = 1). For estimating the entropy of circular distributions, Misra et al. [22] used the distance functions $d_1(\cdot,\cdot)$ and $d_2(\cdot,\cdot)$ to construct estimators based on distances between the n sample points and their kth nearest neighbors, where k (≤ n − 1) is a fixed positive integer (not depending on n). The results of Section 2 can be extended, in a straightforward manner, to the circular distances $d_1(\cdot,\cdot)$ and $d_2(\cdot,\cdot)$ as well.
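For illustration (not taken from [22]), the two circular distances can be coded directly from the formulas above:

```python
import numpy as np

def d1(theta, mu):
    """Circular distance d_1 between angle vectors theta, mu in [0, 2*pi)^p."""
    diff = np.abs(np.asarray(theta) - np.asarray(mu))
    return np.sqrt(np.sum((np.pi - np.abs(np.pi - diff)) ** 2))

def d2(theta, mu):
    """Circular distance d_2 based on 1 - cos of the angular differences."""
    return np.sqrt(2.0 * np.sum(1.0 - np.cos(np.asarray(theta) - np.asarray(mu))))
```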

3. SIMULATION STUDY

In this section we study the RMSE of $G_n$ via simulations from different distributions, i.e., the uniform, beta, gamma, normal, von Mises, bivariate circular, and the univariate and bivariate log-normal distributions.

First, recall that the density function $f(\theta; \mu, \kappa)$ (with respect to the Lebesgue measure on $[0, 2\pi)$) of the von Mises circular distribution is

$$f(\theta; \mu, \kappa) = \frac{1}{2\pi I_0(\kappa)}\,e^{\kappa\cos(\theta - \mu)}, \qquad 0 \le \theta < 2\pi,$$

where μ and κ are the mean direction and concentration parameters (see, for example, Singh et al. [23]). The value of the entropy of $f(\theta; \mu, \kappa)$ is given by $H(f) = \log(2\pi I_0(\kappa)) - \kappa I_1(\kappa)/I_0(\kappa)$, where the $I_p(\cdot)$ represent the Bessel functions of order p = 0, 1. In a similar way, let us define on $[0, 2\pi)^2$ a bivariate circular distribution as introduced in Singh et al. [23]:

$$f(\theta_1, \theta_2; \mu_1, \mu_2, \kappa_1, \kappa_2, \lambda) = \frac{1}{4\pi^2 C}\,e^{\kappa_1\cos(\theta_1 - \mu_1) + \kappa_2\cos(\theta_2 - \mu_2) + 2\lambda\sin(\theta_1 - \mu_1)\sin(\theta_2 - \mu_2)}.$$

Here

$$C = \sum_{p=0}^{\infty}\binom{2p}{p}\left(\frac{\lambda^2}{\kappa_1\kappa_2}\right)^p I_p(\kappa_1) I_p(\kappa_2).$$

Its theoretical entropy is $H(f) = 2\log 2\pi + \log C - D/C$, where

$$D = \sum_{p=0}^{\infty}\binom{2p}{p}\left(\frac{\lambda^2}{\kappa_1\kappa_2}\right)^p\Big(\kappa_1 I_{p+1}(\kappa_1) I_p(\kappa_2) + \kappa_2 I_p(\kappa_1) I_{p+1}(\kappa_2) + 2p\,I_p(\kappa_1) I_p(\kappa_2)\Big)$$

and the $I_p(\cdot)$ represent the Bessel functions of order p = 0, 1, ...
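These theoretical entropy values can be evaluated numerically. The sketch below is illustrative only (not from the paper); it uses the modified Bessel functions from SciPy, and the series for C and D are truncated at an arbitrarily chosen number of terms:

```python
from math import log, pi, comb
from scipy.special import iv   # modified Bessel function I_p of the first kind

def von_mises_entropy(kappa):
    """H(f) = log(2*pi*I_0(kappa)) - kappa*I_1(kappa)/I_0(kappa)."""
    return log(2 * pi * iv(0, kappa)) - kappa * iv(1, kappa) / iv(0, kappa)

def bivariate_circular_entropy(kappa1, kappa2, lam, n_terms=50):
    """H(f) = 2*log(2*pi) + log(C) - D/C with the series truncated at n_terms."""
    C = D = 0.0
    for p in range(n_terms):
        w = comb(2 * p, p) * (lam**2 / (kappa1 * kappa2)) ** p
        C += w * iv(p, kappa1) * iv(p, kappa2)
        D += w * (kappa1 * iv(p + 1, kappa1) * iv(p, kappa2)
                  + kappa2 * iv(p, kappa1) * iv(p + 1, kappa2)
                  + 2 * p * iv(p, kappa1) * iv(p, kappa2))
    return 2 * log(2 * pi) + log(C) - D / C
```

For instance, `bivariate_circular_entropy(2, 2, 0.5)` gives the target entropy for the bivariate circular model with the parameters used in Table 1.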


The Monte Carlo simulations justify, for distributions like the circular and log-normal, that the optimal k (in the sense of RMSE) is a slowly increasing function of n for the dimensions p = 1, 2. See Figs. 1 and 2 and Table 1, where the optimal k's, the corresponding RMSE's, and the bias terms are recorded. The sample sizes for the simulations were n = 50, n = 500, and n = 1000, and the number of repetitions was N = 10,000. Actually, when the sample size is n = 50, there are two minima of the RMSE for the bivariate circular distribution model, with values of 0.223 and 0.212 at k = 3 and at k = 14, respectively (see the first curve in Fig. 1). In Table 1 we recorded only the first minimum value.

Table 1. Distribution models used for simulations: optimal k (RMSE, BIAS)

Model | n = 50 | n = 500 | n = 1000
von Mises with κ = 2, μ = 3π/2 | 7 (0.149, −0.0575) | 11 (0.0426, −0.00764) | 18 (0.0282, −0.00542)
Bivariate circular with κ = (2, 2), μ = (3π/2, 3π/2), λ = 0.5 | 3 (0.223, 0.0542) | 10 (0.0580, 0.0164) | 15 (0.0404, 0.0111)
Log-Normal with μ = 10, σ² = 1 | 7 (0.194, −0.0524) | 13 (0.0585, −0.0130) | 13 (0.0408, −0.00702)
Bivariate Log-Normal with μ₁ = μ₂ = 10, σ₁ = σ₂ = 1, ρ = 0.6 | 6 (0.296, 0.0274) | 12 (0.0899, 0.00453) | 15 (0.0629, 0.00167)

During the simulation studies we found that, for example in one-dimensional cases, the rate of convergence of $k_n$-nearest neighbor entropy estimators is $n^{-1/2}$ for non-uniform distributions and $(n k_n)^{-1/2}$ for the uniform one (cf. van Es [9]). In addition, in Table 2 we calculated the values of $\sqrt{n}\cdot$RMSE for all models except the Uniform [0, 1], and the values of $\sqrt{n k_n}\cdot$RMSE for the Uniform [0, 1]. We can see from Table 2 that these products are almost constant for each model. Also, we approximated the optimal $k_n$ as follows: $k_{n,\mathrm{opt}} = [c \log^2 n]$, where c = 0.44 for the Uniform [0, 1], c = 0.67 for the von Mises, c = 0.03 for the Normal (0, 1) and Log-Normal (10, 1), and c = 0.31 for the Gamma (6, 1) distributions, respectively.
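For reference, a hedged helper implementing this empirical rule (the constant c is model-dependent, as listed above) might look like:

```python
from math import log

def k_opt(n, c):
    """Empirical rule k_{n,opt} = [c * (log n)^2]; e.g., c = 0.44 for Uniform[0, 1]."""
    return max(1, int(c * log(n) ** 2))
```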

Table 2. $\sqrt{n k_n}\cdot$RMSE for the uniform distribution and $\sqrt{n}\cdot$RMSE for the other univariate distributions

n | 50 | 100 | 200 | 500 | 700 | 1000 | 2000 | 5000 | 7000
Uniform | 1.78 | 1.78 | 1.73 | 1.73 | 1.61 | 1.71 | 1.68 | 1.67 | 1.63
Normal | 0.91 | 0.89 | 0.88 | 0.84 | 0.82 | 0.83 | 0.80 | 0.78 | 0.77
Log-Normal | 1.38 | 1.35 | 1.34 | 1.31 | 1.30 | 1.29 | 1.29 | 1.29 | 1.26
Gamma (6, 1) | 0.92 | 0.90 | 0.89 | 0.85 | 0.84 | 0.83 | 0.83 | 0.81 | 0.79
Beta (3, 2) | 0.65 | 0.66 | 0.62 | 0.50 | 0.58 | 0.48 | 0.56 | 0.54 | 0.53
von Mises | 1.06 | 1.03 | 1.00 | 0.95 | 0.92 | 0.89 | 0.88 | 0.85 | 0.84

Finally, in Table 3 the values of $\sqrt{n}\cdot$RMSE for the two-dimensional uniform, circular, normal, and log-normal distributions are presented. We conclude that, for these distributions, one can expect to have $\sqrt{n}$-consistency as well.


Table 3. $\sqrt{n}\cdot$RMSE for bivariate distributions

n | 50 | 500 | 1000 | 10000
Uniform | 1.46 | 1.36 | 1.34 | 1.40
Circular | 1.50 | 1.30 | 1.28 | 1.49
Log-Normal | 2.09 | 2.01 | 1.99 | 2.07
Normal | 1.12 | 1.35 | 1.32 | 1.27

[Figure: RMSE plotted against k, one panel per sample size.]
Fig. 1. Bivariate circular distribution, n = 50, 500 and 1000.

[Figure: RMSE plotted against k, one panel per sample size.]
Fig. 2. Bivariate Log-Normal distribution, n = 50, 500 and 1000.

APPENDIX

Here we provide the proof of the first inequality in (18). The proofs of the first inequalities in (22) and (24), being cumbersome and virtually identical to the proof of (18), are omitted.

Let $R_1 = \|X_1 - X_2\|$,

$$V(R_1) = \int_{N_x(R_1)} dy = \frac{\pi^{p/2} R_1^p}{\Gamma(p/2+1)},$$

and

$$T_1 = \log\big(2 V(R_1)\big) = \log\left(\frac{2\pi^{p/2} R_1^p}{\Gamma(p/2+1)}\right).$$

Further, for a fixed x ∈ D, let $S_x$ be a r.v. having the same distribution as the conditional distribution of $T_1$ given $X_1 = x$. To prove the first inequality in (18), we will first establish that

$$E_f\big(|S_x|^{1+\varepsilon}\big) < \infty \qquad (27)$$

for almost all x. We have

$$E_f(|S_x|^{1+\varepsilon}) = E_f\big(|\log(2V(R_1))|^{1+\varepsilon} \mid X_1 = x\big)$$
$$\le 2^{\varepsilon}\left[\left|\log\left(\frac{2\pi^{p/2}}{\Gamma(p/2+1)}\right)\right|^{1+\varepsilon} + p^{1+\varepsilon}\,E_f\big(|\log R_1|^{1+\varepsilon} \mid X_1 = x\big)\right]$$
$$= 2^{\varepsilon}\left[\left|\log\left(\frac{2\pi^{p/2}}{\Gamma(p/2+1)}\right)\right|^{1+\varepsilon} + p^{1+\varepsilon}\int_{\mathbb{R}^p}\big|\log(\|x-y\|)\big|^{1+\varepsilon} f(y)\,dy\right].$$

In view of (12), it follows that

$$\int_{\mathbb{R}^p} |\log(\|x-y\|)|^{1+\varepsilon} f(y)\,dy < \infty$$

for almost all x. Therefore (27) is established. Now we will establish the first inequality in (18). We have

$$E_f\big(|S_{x,n}|^{1+\varepsilon}\big) = \int_{-\infty}^{0} |u|^{1+\varepsilon}\,dF_{x,n}(u) + \int_{0}^{\infty} |u|^{1+\varepsilon}\,dF_{x,n}(u). \qquad (28)$$

We can write

$$\int_{0}^{\infty} |u|^{1+\varepsilon}\,dF_{x,n}(u) = (1+\varepsilon)\int_{0}^{\infty} u^{\varepsilon}\big(1 - F_{x,n}(u)\big)\,du$$
$$= (1+\varepsilon)\left[\int_{0}^{\log\sqrt{n}} u^{\varepsilon}\big(1 - F_{x,n}(u)\big)\,du + \int_{\log\sqrt{n}}^{\infty} u^{\varepsilon}\big(1 - F_{x,n}(u)\big)\,du\right]$$
$$= (1+\varepsilon)\big[I_1(x,n) + I_2(x,n)\big], \quad\text{say}. \qquad (29)$$

We have

$$I_2(x,n) = \int_{\log\sqrt{n}}^{\infty} u^{\varepsilon}\big(1 - F_{x,n}(u)\big)\,du = \int_{\log\sqrt{n}}^{\infty} u^{\varepsilon}\,P\{B_{n-1}(x,u) \le k_n - 1\}\,du$$
$$= \int_{\log\sqrt{n}}^{\infty} u^{\varepsilon}\left[\sum_{j=0}^{k_n-1}\binom{n-1}{j}\big(p_n(x,u)\big)^j\big(1 - p_n(x,u)\big)^{n-1-j}\right]du.$$

For $j \in \{0, 1, \dots, k_n - 1\}$, we have

$$\binom{n-1}{j} \le \frac{n-1}{n-k_n}\binom{n-2}{j}.$$

Therefore,

$$I_2(x,n) \le \frac{n-1}{n-k_n}\int_{\log\sqrt{n}}^{\infty} u^{\varepsilon}\big(1 - p_n(x,u)\big)\left[\sum_{j=0}^{k_n-1}\binom{n-2}{j}\big(p_n(x,u)\big)^j\big(1 - p_n(x,u)\big)^{n-2-j}\right]du$$
$$= \frac{n-1}{n-k_n}\int_{\log\sqrt{n}}^{\infty} u^{\varepsilon}\big(1 - p_n(x,u)\big)\,P\{B_{k_n,\,n-k_n-1} \ge p_n(x,u)\}\,du,$$

where $B_{a,b}$ denotes the beta r.v. with parameters (a, b), a > 0, b > 0. For $u > \log\sqrt{n}$, we have $p_n(x,u) > p_n(x, \log\sqrt{n})$. Therefore,

$$I_2(x,n) \le \frac{n-1}{n-k_n}\,P\{B_{k_n,\,n-k_n-1} \ge p_n(x, \log\sqrt{n})\}\int_{\log\sqrt{n}}^{\infty} u^{\varepsilon}\big(1 - p_n(x,u)\big)\,du.$$

Note that

$$\rho_n(\log\sqrt{n}) = \left(\frac{b_n\,\Gamma(p/2+1)}{\sqrt{n}\,\pi^{p/2}}\right)^{1/p} \to 0 \quad\text{as}\quad \frac{b_n}{\sqrt{n}} \to 0,\ n \to \infty.$$

Therefore,

$$\lim_{n\to\infty}\left[\frac{\sqrt{n}}{k_n}\,p_n(x, \log\sqrt{n})\right] = \lim_{n\to\infty}\left[\frac{b_n}{k_n}\,\frac{1}{V(\rho_n(\log\sqrt{n}))}\int_{N_x(\rho_n(\log\sqrt{n}))} f(y)\,dy\right] = f(x)$$

for almost all x. Let us now, for x ∈ D, choose a δ ∈ (0, f(x)). Then, for all sufficiently large n,

$$p_n(x, \log\sqrt{n}) > \frac{k_n}{\sqrt{n}}\,(f(x) - \delta)$$

and therefore

$$P\big\{B_{k_n,\,n-k_n-1} \ge p_n(x, \log\sqrt{n})\big\} \le P\left\{B_{k_n,\,n-k_n-1} \ge \frac{k_n}{\sqrt{n}}(f(x) - \delta)\right\}$$
$$\le \frac{E\big(B_{k_n,\,n-k_n-1}^2\big)}{\big(\frac{k_n}{\sqrt{n}}(f(x)-\delta)\big)^2} = \frac{k_n + 1}{(n-1)\,k_n\,(f(x)-\delta)^2}.$$

Thus, for almost all x,

$$I_2(x,n) \le \frac{k_n + 1}{(n - k_n)\,k_n\,(f(x)-\delta)^2}\int_{\log\sqrt{n}}^{\infty} u^{\varepsilon}\big(1 - p_n(x,u)\big)\,du \qquad (30)$$

for all sufficiently large n. On making the change of variable $z = \log(2b_n/n) + u$ in the integral in (30), we get

$$\int_{\log\sqrt{n}}^{\infty} u^{\varepsilon}\big(1 - p_n(x,u)\big)\,du = \int_{\log(2b_n/\sqrt{n})}^{\infty}\left(u + \log\Big(\frac{n}{2b_n}\Big)\right)^{\varepsilon}\left(1 - p_n\Big(x,\ u + \log\Big(\frac{n}{2b_n}\Big)\Big)\right)du.$$

Note that

$$\rho_n\left(u + \log\Big(\frac{n}{2b_n}\Big)\right) = \left(\frac{\Gamma(p/2+1)\,e^u}{2\pi^{p/2}}\right)^{1/p}$$

and thus

$$p_n\left(x,\ u + \log\Big(\frac{n}{2b_n}\Big)\right) = \int_{N_x(\rho_n(u + \log(n/(2b_n))))} f(y)\,dy = q(x,u), \quad\text{say},$$

does not depend on n. Therefore,

$$\int_{\log\sqrt{n}}^{\infty} u^{\varepsilon}\big(1 - p_n(x,u)\big)\,du = \int_{\log(2b_n/\sqrt{n})}^{\infty}\left(u + \log\Big(\frac{n}{2b_n}\Big)\right)^{\varepsilon}\big(1 - q(x,u)\big)\,du$$
$$= \int_{\log(2b_n/\sqrt{n})}^{0}\left(u + \log\Big(\frac{n}{2b_n}\Big)\right)^{\varepsilon}\big(1 - q(x,u)\big)\,du + \int_{0}^{\infty}\left(u + \log\Big(\frac{n}{2b_n}\Big)\right)^{\varepsilon}\big(1 - q(x,u)\big)\,du. \qquad (31)$$

For the first term in the last equation we have from (31)

$$\int_{\log(2b_n/\sqrt{n})}^{0}\left(u + \log\Big(\frac{n}{2b_n}\Big)\right)^{\varepsilon}\big(1 - q(x,u)\big)\,du \le \left(\log\Big(\frac{n}{2b_n}\Big)\right)^{\varepsilon}\int_{\log(2b_n/\sqrt{n})}^{0}\big(1 - q(x,u)\big)\,du$$
$$= \left(\log\Big(\frac{n}{2b_n}\Big)\right)^{\varepsilon}\int_{2b_n/\sqrt{n}}^{1}\big(1 - q(x, \log u)\big)\,\frac{du}{u} \le \frac{\sqrt{n}}{2b_n}\left(\log\Big(\frac{n}{2b_n}\Big)\right)^{\varepsilon}\int_{2b_n/\sqrt{n}}^{1}\big(1 - q(x, \log u)\big)\,du$$
$$\le \frac{\sqrt{n}}{2b_n}\left(\log\Big(\frac{n}{2b_n}\Big)\right)^{\varepsilon}. \qquad (32)$$

To estimate the second term, note that

$$\int_{0}^{\infty}\left(u + \log\Big(\frac{n}{2b_n}\Big)\right)^{\varepsilon}\big(1 - q(x,u)\big)\,du \le F_{\varepsilon}\left[\left(\log\Big(\frac{n}{2b_n}\Big)\right)^{\varepsilon}\int_{0}^{\infty}\big(1 - q(x,u)\big)\,du + \int_{0}^{\infty} u^{\varepsilon}\big(1 - q(x,u)\big)\,du\right], \qquad (33)$$

where $F_{\varepsilon} = \max(1, 2^{\varepsilon-1})$. Note now that, for u ∈ (−∞, ∞) and for $R_1$ and $S_x$ defined at the beginning of this Appendix,

$$F_x(u) = P_f(S_x \le u) = \int_{N_x(\rho_n(u + \log(n/(2b_n))))} f(y)\,dy = q(x,u).$$

Therefore (33) yields

$$\int_{0}^{\infty}\left(u + \log\Big(\frac{n}{2b_n}\Big)\right)^{\varepsilon}\big(1 - F_x(u)\big)\,du \le F_{\varepsilon}\left[\left(\log\Big(\frac{n}{2b_n}\Big)\right)^{\varepsilon}\int_{0}^{\infty}\big(1 - F_x(u)\big)\,du + \int_{0}^{\infty} u^{\varepsilon}\big(1 - F_x(u)\big)\,du\right]. \qquad (34)$$

In view of (27),

$$\int_{0}^{\infty}\big(1 - F_x(u)\big)\,du < \infty \quad\text{and}\quad \int_{0}^{\infty} u^{\varepsilon}\big(1 - F_x(u)\big)\,du < \infty. \qquad (35)$$

On using (30)–(35), we conclude that

$$\lim_{n\to\infty} I_2(x,n) = 0 \qquad (36)$$

for almost all x. Now let us consider

$$I_1(x,n) = \int_{0}^{\log\sqrt{n}} u^{\varepsilon}\big(1 - F_{x,n}(u)\big)\,du = \int_{0}^{\log\sqrt{n}} u^{\varepsilon}\left[\sum_{j=0}^{k_n-1}\binom{n-1}{j}\big(p_n(x,u)\big)^j\big(1 - p_n(x,u)\big)^{n-1-j}\right]du$$
$$= \int_{0}^{\log\sqrt{n}} u^{\varepsilon}\,P\big(B_{k_n,\,n-k_n} \ge p_n(x,u)\big)\,du. \qquad (37)$$

For $u < \log\sqrt{n}$,

$$0 \le \rho_n(u) \le \rho_n(\log\sqrt{n}) = \left(\frac{b_n\,\Gamma(p/2+1)}{\sqrt{n}\,\pi^{p/2}}\right)^{1/p} \to 0 \quad\text{as}\quad \frac{b_n}{\sqrt{n}} \to 0,\ n \to \infty.$$

Therefore, for almost all x,

$$\lim_{n\to\infty}\left(\frac{n}{k_n e^u}\,p_n(x,u)\right) = \lim_{n\to\infty}\left[\frac{b_n}{k_n}\,\frac{1}{V(\rho_n(u))}\int_{N_x(\rho_n(u))} f(y)\,dy\right] = f(x),$$

uniformly in u. For x ∈ D, let δ ∈ (0, f(x)). Then, for all sufficiently large n, $u < \log\sqrt{n}$, and for almost all x, we have $p_n(x,u) > \frac{k_n}{n}(f(x) - \delta)e^u$ and therefore

$$P\{B_{k_n,\,n-k_n} \ge p_n(x,u)\} \le P\left\{B_{k_n,\,n-k_n} \ge \frac{k_n}{n}(f(x)-\delta)e^u\right\} \le \frac{E(B_{k_n,\,n-k_n})}{\frac{k_n}{n}(f(x)-\delta)e^u} = \frac{e^{-u}}{f(x)-\delta}.$$

On using the above inequality in (37), for almost all x and for all sufficiently large n, we get

$$I_1(x,n) \le \frac{1}{f(x)-\delta}\int_{0}^{\log\sqrt{n}} u^{\varepsilon} e^{-u}\,du \le \frac{1}{f(x)-\delta}\int_{0}^{\infty} u^{\varepsilon} e^{-u}\,du < \infty. \qquad (38)$$

From (29), (36), and (38) we conclude that there exists a constant $D_1$ such that, for almost all x and for all sufficiently large n,

$$\int_{0}^{\infty} |u|^{1+\varepsilon}\,dF_{x,n}(u) < D_1. \qquad (39)$$

Now consider

$$\int_{-\infty}^{0} |u|^{1+\varepsilon}\,dF_{x,n}(u) = \int_{-\infty}^{0}(-u)^{1+\varepsilon}\,dF_{x,n}(u) = (1+\varepsilon)\int_{-\infty}^{0}(-u)^{\varepsilon} F_{x,n}(u)\,du.$$

For u < 0,

$$0 \le \rho_n(u) \le \left(\frac{b_n\,\Gamma(p/2+1)}{n\,\pi^{p/2}}\right)^{1/p} \to 0 \quad\text{as}\quad n \to \infty.$$

Thus, for u < 0, the convergence $\rho_n(u) \to 0$ as n → ∞ is uniform. Therefore, for almost all x,

$$\lim_{n\to\infty}\left(\frac{n}{k_n e^u}\,p_n(x,u)\right) = f(x)$$

uniformly in u < 0, i.e., for almost all x, u ∈ (−∞, 0), and every δ > 0 we have $p_n(x,u) < \frac{k_n}{n}(f(x)+\delta)e^u$ for all sufficiently large n. Let $Q_n \sim \mathrm{Bin}\big(n-1,\ k_n(f(x)+\delta)e^u/n\big)$. Then

$$F_{x,n}(u) = P\{B_{k_n,\,n-k_n} \le p_n(x,u)\} \le P\left\{B_{k_n,\,n-k_n} \le \frac{k_n}{n}(f(x)+\delta)e^u\right\}$$
$$= P\big(Q_n \ge k_n\big) \le \frac{E(Q_n)}{k_n} = \frac{n-1}{n}(f(x)+\delta)e^u \le (f(x)+\delta)e^u. \qquad (40)$$

Thus, for almost all x,

$$\int_{-\infty}^{0} |u|^{1+\varepsilon}\,dF_{x,n}(u) = (1+\varepsilon)\int_{-\infty}^{0}(-u)^{\varepsilon} F_{x,n}(u)\,du \le (1+\varepsilon)(f(x)+\delta)\int_{-\infty}^{0}(-u)^{\varepsilon} e^u\,du < \infty \qquad (41)$$

for all sufficiently large n. Now on using (39) and (41) in (28), we conclude the first inequality in (18).


ACKNOWLEDGMENTS

The authors thank the referee for his helpful comments and suggestions. The authors are also thankful to Vladimir Hnizdo, Dan Sharp, and Cecil Burchfiel for helpful discussions. The work was supported by the National Institute for Occupational Safety and Health, contract no. 212-2005-M-12857. The findings and conclusions in this paper are those of the authors and do not necessarily represent the views of the National Institute for Occupational Safety and Health.

REFERENCES

1. E. Demchuk and H. Singh, "Statistical Thermodynamics of Hindered Rotation from Computer Simulations", Molecular Physics 99, 627–636 (2001).
2. D. Scott, Multivariate Density Estimation: Theory, Practice and Visualization (Wiley, New York, 1992).
3. L. F. Kozachenko and N. N. Leonenko, "Sample Estimates of Entropy of a Random Vector", Probl. Inform. Transmission 23, 95–101 (1987).
4. H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, "Nearest Neighbor Estimates of Entropy", Amer. J. Math. and Management Sci. 23, 301–321 (2003).
5. M. N. Goria, N. N. Leonenko, V. V. Mergel, and P. L. Novi Inverardi, "A New Class of Random Vector Entropy Estimators and Its Applications in Testing Statistical Hypotheses", J. Nonparam. Statist. 17, 277–297 (2005).
6. O. Vasicek, "A Test of Normality Based on Sample Entropy", J. Roy. Statist. Soc. Ser. B 38, 254–256 (1976).
7. E. J. Dudewicz and E. C. van der Meulen, "Entropy-Based Tests of Uniformity", J. Amer. Statist. Assoc. 76, 967–974 (1981).
8. M. Abramowitz and A. Stegun, Handbook of Mathematical Functions (Dover, New York, 1965).
9. B. van Es, "Estimating Functionals Related to a Density by a Class of Statistics Based on Spacings", Scand. J. Statist. 19, 61–72 (1992).
10. P. Hall and S. C. Morton, "On the Estimation of Entropy", Ann. Inst. Statist. Math. 45, 69–88 (1993).
11. J. Beirlant, E. J. Dudewicz, L. Gyorfi, and E. C. van der Meulen, "Nonparametric Entropy Estimation: An Overview", Internat. J. Math. Statist. Sci. 6, 17–39 (1997).
12. H. Joe, "Estimation of Entropy and Other Functionals of a Multivariate Density", Ann. Inst. Statist. Math. 41, 683–697 (1989).
13. A. V. Ivanov and M. N. Rozhkova, "Properties of the Statistical Estimate of the Entropy of a Random Vector with a Probability Density", Probl. Inform. Transmission 17, 171–178 (1981).
14. A. B. Tsybakov and E. C. van der Meulen, "Root-n Consistent Estimators of Entropy for Densities with Unbounded Support", Scand. J. Statist. 23, 75–83 (1996).
15. D. O. Loftsgaarden and C. P. Quesenberry, "A Non-Parametric Estimate of a Multivariate Density Function", Ann. Math. Statist. 36, 1049–1051 (1965).
16. N. Leonenko, L. Pronzato, and V. Savani, "A Class of Rényi Information Estimators for Multidimensional Densities", Ann. Statist. (2007) (in press).
17. P. Billingsley, Convergence of Probability Measures (Wiley, New York, 1968).
18. P. Billingsley, Probability and Measure (Wiley, New York, 1995).
19. E. V. Khmaladze, The Statistical Analysis of a Large Number of Rare Events, Technical Report MS-R8804 (Centre for Mathematics and Computer Science, Amsterdam, 1988).
20. M. Loeve, Probability Theory (Springer, New York, 1977), Vol. I.
21. K. V. Mardia and P. E. Jupp, Directional Statistics (Wiley, New York, 2000).
22. N. Misra, H. Singh, and V. Hnizdo, Nearest Neighbor Estimates of Entropy for Multivariate Circular Distributions, Preprint (2006).
23. H. Singh, V. Hnizdo, and E. Demchuk, "Probabilistic Model for Two Dependent Circular Variables", Biometrika 89, 719–723 (2002).
