Local likelihood method: a bridge over parametric and nonparametric regression


Nonparametric Statistics, Vol. 15(6), December 2003, pp. 665–683

LOCAL LIKELIHOOD METHOD: A BRIDGE OVER PARAMETRIC AND NONPARAMETRIC REGRESSION

SHINTO EGUCHI (a), TAE YOON KIM (b) and BYEONG U. PARK (c,∗)

(a) Institute of Statistical Mathematics, Tokyo, Japan; (b) Department of Statistics, Keimyung University, Taegu, South Korea; (c) Department of Statistics, Seoul National University, Seoul, South Korea

(Received June 2002; In final form July 2003)

This paper discusses the local likelihood method for estimating a regression function in a setting which includes generalized linear models. The local likelihood function is constructed by first considering a parametric model for the regression function. It is defined as a locally weighted log-likelihood with weights determined by a kernel function and a bandwidth. When a large bandwidth is chosen, the resulting estimator would be close to the fully parametric maximum likelihood estimator, so that a large bandwidth would be a relevant choice in the case where the true regression function is near the parametric family. On the other hand, when a small bandwidth is chosen, the performance of the resulting estimator would not depend much on the assumed parametric model, thus a small bandwidth would be desirable if the parametric model is largely misspecified. In this paper, we detail the way in which the risk of the local likelihood estimator is affected by bandwidth selection and model misspecification. We derive explicit formulas for the bias and variance of the local likelihood estimator for both large and small bandwidths. We look into higher order asymptotic expansions for the risk of the local likelihood estimator in the case where the bandwidth is large, which enables us to determine the optimal size of the bandwidth depending on the degree of model misspecification.

Keywords: Local likelihood; Near-parametric model; Near-nonparametric model; Deviance function; Generalized linear model

AMS 2000 subject classifications: Primary 62G08; Secondary 62G20

1 INTRODUCTION

This paper is concerned with statistical inference on regression models. The parametric approach is still one of the main streams in statistics, where the regression function is usually modeled by a finite number of parameters. In the case of parametric generalized linear models, asymptotic theory for statistical inference has been established in a standard framework. Indeed, the maximum likelihood estimator enjoys asymptotic normality under some regularity conditions, while the maximum log-likelihood ratio, or the deviance function, has an asymptotic chi-square approximation. See Lindsey (1996) for a broad range of perspectives. Alternatively, the nonparametric approach has also been extensively studied, where the regression function is assumed to have no specific parametric form except that it satisfies some smoothness conditions. The kernel method is a general class of techniques for nonparametric estimation of functions. See Wand and Jones (1995) or Fan and Gijbels (1996) for a comprehensive understanding of the kernel method in regression function estimation.

∗ Corresponding author. E-mail: [email protected]

ISSN 1048-5252 print; ISSN 1029-0311 online © 2003 Taylor & Francis Ltd. DOI: 10.1080/10485250310001624756

The local likelihood method has been introduced to get a flexible estimator by locally fitting a parametric model (Tibshirani and Hastie, 1987). The procedure for estimating a regression function at a point x in the space of the covariate vector is based on a localized version of the log-likelihood which gives more weight to the observations near x and less to those far away from it. The weights are determined by a kernel function centered at x with a bandwidth h. The bandwidth h controls the shape of the resulting estimator, which changes from the interpolation of observed values to the parametric maximum likelihood estimator as h ranges from zero to infinity. For nonparametric consistency one is advised to choose the bandwidth h tending to zero as the sample size tends to infinity. However, there are two contradictory aspects to the selection of the bandwidth. When h is small, the resulting estimator is insensitive to the choice of the parametric model, but it pays a loss of efficiency in return if the parametric model is correctly specified. When h is large, the estimator shares asymptotic efficiency, to a certain extent, with the parametric maximum likelihood estimator if the parametric model is correctly specified, but it suffers from model misspecification otherwise. Let us consider a simple example which relates bandwidth selection to model misspecification.

Suppose that the true regression function is given by µ(x) = sin x. Consider the two parametric models:

$$m_1(x, \beta) = \beta_0 + \beta_1 x, \quad \beta \in \mathbb{R}^2; \qquad m_2(x, \beta) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3, \quad \beta \in \mathbb{R}^4.$$

The degree of model misspecification may be quantified by the minimal integrated squared distance. We found that

$$\min_\beta \int_{-3}^{3} [\mu(x) - m_1(x, \beta)]^2 \, dx = 0.989; \qquad \min_\beta \int_{-3}^{3} [\mu(x) - m_2(x, \beta)]^2 \, dx = 0.018.$$
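These two minima are easy to check numerically. The sketch below is only an illustrative verification, not part of the paper; it assumes NumPy is available and approximates the continuous least-squares problem by ordinary least squares on a fine grid.

```python
import numpy as np

# Approximate min_beta \int_{-3}^{3} [sin(x) - m_k(x, beta)]^2 dx for the linear (m1)
# and cubic (m2) models by least squares on a fine grid; the grid solution
# approximates the L2(-3, 3) projection of sin onto the polynomial family.
x = np.linspace(-3, 3, 2001)
dx = x[1] - x[0]
mu = np.sin(x)

for degree in (1, 3):
    X = np.vander(x, degree + 1, increasing=True)      # columns 1, x, ..., x^degree
    beta, *_ = np.linalg.lstsq(X, mu, rcond=None)
    print(degree, np.sum((mu - X @ beta) ** 2) * dx)    # ~0.989 for m1, ~0.018 for m2
```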

Thus, with the cubic model m2 one may get a much better approximation to µ than with the linear model m1. The two panels (a) and (b) in Figure 1 depict the best global fits from the two models.

FIGURE 1 Best global fits (solid curves) with the true function (dotted curve) for the parametric (a) linear m1 and (b) cubic model m2.

We generated 100 pseudo data points (x1, y1), . . . , (x100, y100) from yi = µ(xi) + εi, where {xi} and {εi} are independent with a uniform distribution on (−3, 3) and N(0, 0.2²), respectively. We applied the local likelihood method described in Section 2 to this data set. Figure 2 shows how the average squared error

$$\mathrm{ASE}(h) = \frac{1}{100} \sum_{i=1}^{100} [\hat m_h(x_i) - \mu(x_i)]^2$$

for each model changes as h varies, where $\hat m_h$ is the local likelihood estimator of µ. The ASE(h) is a natural approximation to $\int [\hat m_h(x) - \mu(x)]^2 \, dF(x)$, where F is the covariate distribution.

FIGURE 2 The average squared error curves against the bandwidth h for the local (a) linear and (b) cubic estimates.
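For a Gaussian response with the identity link the deviance in Eq. (3) reduces to (1/2)(µ − m)², so the local likelihood fit in this example is a kernel-weighted polynomial least-squares fit. The following sketch is an illustration only, not the authors' code: it assumes NumPy, uses a Gaussian weight function for convenience even though the theory assumes a compactly supported kernel, and its ASE values depend on the random seed, so the minimizing bandwidths will differ somewhat from those reported in the next paragraph.

```python
import numpy as np

rng = np.random.default_rng(0)                 # the seed is arbitrary
x = rng.uniform(-3, 3, 100)
y = np.sin(x) + rng.normal(0, 0.2, 100)

def local_poly_fit(x0, deg, h):
    """Local likelihood fit at x0 for the Gaussian family: minimize the
    kernel-weighted squared error, i.e. a weighted polynomial regression."""
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)     # Gaussian weights
    X = np.vander(x, deg + 1, increasing=True)
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return sum(b * x0 ** k for k, b in enumerate(beta))

def ase(deg, h):
    fits = np.array([local_poly_fit(xi, deg, h) for xi in x])
    return np.mean((fits - np.sin(x)) ** 2)

for h in (0.2, 0.5, 1.0, 2.0, 5.0):
    print(f"h = {h}: ASE linear = {ase(1, h):.4f}, ASE cubic = {ase(3, h):.4f}")
```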

The bandwidths which minimize ASE(h) are found to be 0.19 and 1.3 for the models m1 and m2, respectively. This is a typical situation which one may encounter with other data sets too: the case where the true regression function is remote from the parametric family leads to a choice of small h to overcome the model misspecification, while a large h is preferred in the case where the true function is near the parametric family. The two panels (a) and (b) in Figure 3 depict the local likelihood estimates where the optimal bandwidths were used.

FIGURE 3 Local (a) linear and (b) cubic estimates (solid curves) with the true function (dotted curve).

The central question to be examined in this paper is how the degree of model misspecification, when it changes, alters the way in which bandwidth selection determines the performance of the local likelihood estimator. For this we introduce a basic notion of “α-neighborhood” as defined in Eq. (5), which tells how far the true regression function µ is from the assumed parametric model {m(·, β): β ∈ R^q}. The index α specifies the degree of departure. It ranges from −1 to ∞, with α = −1 implying the fully nonparametric model and α = ∞ indicating the pure parametric case.

The main results are presented in the contexts of the bandwidth, h, tending to zero, and to infinity, as the sample size tends to infinity. If α is large, we stand on a “near-parametric” model and the local likelihood procedure is supported by tuning the bandwidth h to grow to infinity as the sample size increases. On the other hand, if α is small, then we move on a “near-nonparametric” model and it is argued that the bandwidth h should decrease to zero as the sample size tends to infinity. We present the large and the small h asymptotic risk analysis for any α ≥ −1 in a setting where the conditional distribution of the response given the covariate belongs to a one-parameter exponential family. We find the value of α where a transition from a large to a small h is desirable. It turns out that it depends on the dimension of the covariate vector and the smoothness of the true µ and the working parametric regression function m(·, β). We derive higher order asymptotic expansions for the risk of the local likelihood estimator to determine the optimal size of the bandwidth at a given degree of model misspecification.

The small h asymptotic analysis presented in this paper includes the work done by Fan et al. (1995) as a special case, where they considered only the case where α = −1 and m(x, β) = g^{-1}(β0 + β1^T x) for some link function g. In the closely related density estimation setting, the local likelihood method has been extensively studied. See, among others, Copas (1995) and Eguchi and Copas (1998) for the large h asymptotic properties, and Loader (1996), Hjort and Jones (1996) and Park et al. (2002) for the small h asymptotics. In fact, Eguchi and Copas (1998) and Park et al. (2002) gave unified formulations for a general class of local likelihood density estimators. While Park et al. (2002) provide a nice recursive formula which may be used to calculate the bias for each q (the number of parameters in the working parametric model) one by one, they fail to give an explicit bias expression for a general q. This paper presents, in the regression setting, an explicit bias formula not only for a general q but also for an arbitrary dimension of the covariate vector.

This paper is organized as follows. In Section 2, we introduce the local likelihood estimators and the α-neighborhood. A stochastic expansion of the local likelihood estimator which is valid for any sequence of h is contained in Section 3. Section 4 concentrates on the large h asymptotics. In Section 5, we discuss the small h asymptotic properties. In Section 6, we give a theoretical answer to the question of how to tune the bandwidth depending on the degree of model misspecification. Concluding remarks are also given there.

2 LOCAL LIKELIHOOD ESTIMATION AND α-NEIGHBORHOOD

We explore the local likelihood method for estimating a regression function. Let Y be a response variable and X be a covariate vector of dimension d, thus the regression function is given by µ(x) = E(Y | X = x). We assume that the conditional density function of Y given X = x belongs to a one-parameter exponential family, i.e.

$$p(y \mid x) = \exp\left\{ \frac{y\theta(x) - \kappa[\theta(x)]}{a(\phi)} - b(y, \phi) \right\}, \qquad (1)$$

for a known function κ, where φ is the dispersion parameter. The canonical parameter θ(·) is connected to the regression function µ(·) by the relation µ(x) = κ^{(1)}(θ(x)), where κ^{(j)}(θ) is the jth derivative of κ with respect to θ. If one models a transformation of the regression function µ(x) as linear, i.e. g[µ(x)] = β0 + β1^T x, then the resulting model incorporating (1) is called a generalized linear model and g is the link function. With various choices of the link function, this yields many important regression models as special cases. See McCullagh and Nelder (1989) for an extensive account of the analysis of this model. The conditional variance of Y given X = x is given by var(Y | X = x) = a(φ)κ^{(2)}[θ(x)].

The local likelihood procedure begins by specifying a working parametric model which we write

$$P_0 = \{ m(\cdot, \beta) : \beta \in \mathbb{R}^q \}, \qquad (2)$$

where m has a given functional form. The expected deviance E{D[µ(X), m(X)]} is used for measuring the distance of m from the true regression function µ, where

$$D(\mu, m) = [\vartheta(\mu) - \vartheta(m)]\mu - \kappa[\vartheta(\mu)] + \kappa[\vartheta(m)] \qquad (3)$$

and ϑ = (κ^{(1)})^{−1}. In the case of the normal family of distributions, the deviance defined in Eq. (3) is simply (1/2)(µ − m)² with ϑ(µ) = µ.
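As a further concrete instance (our illustration, not one worked out in the paper), take the Poisson family: κ(θ) = e^θ and ϑ(µ) = log µ, so Eq. (3) becomes D(µ, m) = µ log(µ/m) − (µ − m), which is half the usual Poisson deviance. The snippet below, assuming NumPy, evaluates Eq. (3) directly for these two families.

```python
import numpy as np

def deviance(mu, m, kappa, theta):
    """Eq. (3): D(mu, m) = [theta(mu) - theta(m)] * mu - kappa(theta(mu)) + kappa(theta(m))."""
    return (theta(mu) - theta(m)) * mu - kappa(theta(mu)) + kappa(theta(m))

# Normal family: kappa(t) = t^2 / 2, theta(mu) = mu, so D(mu, m) = (mu - m)^2 / 2.
print(deviance(1.3, 0.7, lambda t: t ** 2 / 2, lambda u: u))   # 0.18
# Poisson family: kappa(t) = exp(t), theta(mu) = log(mu),
# so D(mu, m) = mu * log(mu / m) - (mu - m).
print(deviance(3.0, 2.0, np.exp, np.log))                      # approx. 0.2164
```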

Recall that the global maximum likelihood estimator under the model P0 is given by $\hat m(x) = m(x, \hat\beta)$, where $\hat\beta$ maximizes the log-likelihood function $\sum_{i=1}^n \{Y_i s(X_i, \beta) - \kappa[s(X_i, \beta)]\}$ with respect to β, or equivalently minimizes $\sum_{i=1}^n D[Y_i, m(X_i, \beta)]$. Here and below s(x, β) = ϑ[m(x, β)]. The Fisher information matrix for estimating β is given by

$$I(\beta) = \frac{1}{a(\phi)} E[m^{(1)}(X, \beta)\, s^{(1)}(X, \beta)^{\mathrm T}] = \frac{1}{a(\phi)} E\left[ \frac{m^{(1)}(X, \beta)\, m^{(1)}(X, \beta)^{\mathrm T}}{\sigma^2(X, \beta)} \right],$$

where g^{(1)}(x, β) denotes the vector of the partial derivatives of g with respect to the βj's and σ²(x, β) = κ^{(2)}[s(x, β)]. In this way, m(x, β) and s(x, β) play dualistic roles in the expressions for the log-likelihood, the score functions and the information matrix. Under the model P0 it is well known that $\sqrt n(\hat\beta - \beta)$ converges to a normal distribution with mean zero and variance I(β)^{−1}.

The local likelihood estimator is obtained by maximizing the locally weighted log-likelihood, or equivalently by minimizing the locally weighted deviance function. It is defined by

$$\hat\beta_h(x) = \arg\min_\beta \left\{ \sum_{i=1}^{n} K\!\left( \frac{x - X_i}{h} \right) D[Y_i, m(X_i, \beta)] \right\}, \qquad (4)$$

where K is a multivariate kernel function, usually a symmetric probability density function defined on R^d, and h is the scalar bandwidth which controls the degree of localisation. The local likelihood estimator of µ(x) is then given by

$$\hat m_h(x) = m[x, \hat\beta_h(x)].$$

It is clear that the properties of $\hat\beta_h(x)$ and $\hat m_h(x)$ depend on how far the true function µ is away from P0 as well as on the choice of the bandwidth h.
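Since m(·, β) need not be linear in β, $\hat\beta_h(x)$ in Eq. (4) generally has no closed form and is obtained by numerical optimization. The sketch below is a minimal illustration of Eq. (4), not the authors' implementation: it assumes NumPy/SciPy, uses the Epanechnikov kernel, the Poisson deviance from the example after Eq. (3), and a hypothetical working model m(x, β) = exp(β0 + β1 x) with a one-dimensional covariate.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy

def epanechnikov(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def poisson_deviance(y, m):
    # Eq. (3) for the Poisson family; xlogy returns 0 when y = 0.
    return xlogy(y, y / m) - (y - m)

def local_likelihood_fit(x0, x, y, h):
    """m_h(x0) = m[x0, beta_h(x0)] with beta_h(x0) minimizing the weighted deviance in Eq. (4)."""
    w = epanechnikov((x0 - x) / h)
    def weighted_deviance(beta):
        m = np.exp(beta[0] + beta[1] * x)
        return np.sum(w * poisson_deviance(y, m))
    beta_h = minimize(weighted_deviance, x0=np.zeros(2), method="Nelder-Mead").x
    return np.exp(beta_h[0] + beta_h[1] * x0)

# Example: Poisson counts whose log-mean is not globally linear.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 4.0, 200)
y = rng.poisson(np.exp(np.sin(x) + 1.0))
print(local_likelihood_fit(2.0, x, y, h=0.8))   # local estimate of mu(2.0) = exp(sin(2) + 1)
```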

We introduce the α-neighborhood of the working parametric model P0 to quantify the degree of discrepancy between P0 and the true µ. Roughly speaking, it is a tubular neighborhood of P0 at distance of order n^{−(1+α)/2}, where n is the sample size. It is formally defined by

$$\bigcup_{\beta} \{ \mu : E[D(\mu(X), m(X, \beta))] = O(n^{-(1+\alpha)}) \}. \qquad (5)$$


In the density estimation setting, the notion of α-neighborhood has been introduced by Eguchi and Copas (1998).

The α-neighborhood traverses from the fully nonparametric case to the fully parametric model as α moves from −1 to ∞. One should forsake or exploit, fully or partly, the information provided by the model P0 according to the value of α. This could be done by tuning the bandwidth in the local likelihood procedure. The subsequent sections are designed to provide a theoretical solution to the problem of tuning the bandwidth in an optimal way. Before closing this section, we would like to give a brief discussion on the statistical meaning of the index α in relation to the bandwidth.

Suppose it is desired to test the hypothesis

H : µ ∈ P0. (6)

Then, the standard theory of testing hypotheses suggests that one cannot asymptotically detect any departure from the null hypothesis if the true µ belongs to the α-neighborhood with α > 0. For example, the deviance

$$\bar D = \frac{1}{n} \sum_{i=1}^{n} D[Y_i, m(X_i, \hat\beta)] \qquad (7)$$

as a test statistic has zero power asymptotically, where $\hat\beta$ is the maximum likelihood estimator. This means that when α > 0 the asymptotic distribution of $\hat\beta$ under H is the same as that under any alternative in the α-neighborhood. Thus, when α > 0 one is suggested to choose the bandwidth tending to infinity as the sample size grows in order to take full advantage of the global maximum likelihood estimator.

Next, let us take as an alternative hypothesis the α-neighborhood with α < 0. Then the testing problem changes drastically. The power of the test statistic $\bar D$ tends to one as the sample size increases under any alternative. In other words, one can always distinguish, in the limit, an alternative in the α-neighborhood from the null. This distinguishable discrepancy between the true µ and the parametric model P0 may be overcome by fitting the parametric model locally. This would appear to be supported by the fact that any smooth function may be approximated well by a parametric function. When the smoothing bias resulting from locally fitting the parametric model in an optimal way is smaller than the model discrepancy, which is of order n^{−1−α}, it is suggested to take the bandwidth approaching zero. We shall return to this point in Section 6 below.

3 STOCHASTIC EXPANSIONS

In this section we derive stochastic expansions of $\hat\beta_h(x)$ and $\hat m_h(x)$, which are valid for all α ≥ −1 and all sequences of h. For doing this we consider a population version of $\hat\beta_h(x)$ defined by

$$\bar\beta_h(x) = \arg\min_\beta \left( E\left\{ K\!\left( \frac{x - X_1}{h} \right) D[\mu(X_1), m(X_1, \beta)] \right\} \right), \qquad (8)$$

and also that of $\hat m_h(x)$ given by

$$\bar m_h(x) = m[x, \bar\beta_h(x)].$$


These two quantities are those to which $\hat\beta_h(x)$ and $\hat m_h(x)$, respectively, get closer as the sample size n grows. To state the theorem on the stochastic expansions, we assume

(C1) the minimizer $\bar\beta_h(x)$ defined in Eq. (8) is unique;
(C2) the marginal density of X1, denoted by f, has a compact support, say D;
(C3) m(x, β) is three times partially differentiable with respect to β and all the partial derivatives are continuous in x and β;
(C4) σ²(x, β) is bounded away from zero for x ∈ D and β ∈ C for some compact C, it is twice partially differentiable with respect to β, and all the partial derivatives are continuous in x and β;
(C5) the kernel K is bounded and compactly supported.

Let (σ²)^{(1)}(x, β) denote the vector of the partial derivatives of σ²(x, β) with respect to the βj's, and write m^{(2)}(x, β) for the matrix of the second order partial derivatives of m(x, β) with respect to the βj's. Define

$$W_h(x, \beta) = \frac{1}{n} \sum_{i=1}^{n} K\!\left( \frac{x - X_i}{h} \right) \frac{m^{(1)}(X_i, \beta)}{\sigma^2(X_i, \beta)} [Y_i - m(X_i, \beta)],$$

$$V(x, \beta) = -\left[ m^{(2)}(x, \beta) - \frac{m^{(1)}(x, \beta)\, (\sigma^2)^{(1)}(x, \beta)^{\mathrm T}}{\sigma^2(x, \beta)} \right] \frac{\mu(x) - m(x, \beta)}{\sigma^2(x, \beta)} + \frac{m^{(1)}(x, \beta)\, m^{(1)}(x, \beta)^{\mathrm T}}{\sigma^2(x, \beta)},$$

$$D_h(x, \beta) = E\left[ K\!\left( \frac{x - X_1}{h} \right) V(X_1, \beta) \right].$$

THEOREM 1 Let h be any sequence whose values range over (0, ∞). Under the conditions (C1)–(C5), it follows that

$$\hat\beta_h(x) = \bar\beta_h(x) + D_h[x, \bar\beta_h(x)]^{-1} W_h[x, \bar\beta_h(x)] + O_p\!\left( \frac{\log n}{n \min\{1, h^d\}} \right),$$

$$\hat m_h(x) = \bar m_h(x) + m^{(1)}[x, \bar\beta_h(x)]^{\mathrm T} D_h[x, \bar\beta_h(x)]^{-1} W_h[x, \bar\beta_h(x)] + O_p\!\left( \frac{\log n}{n \min\{1, h^d\}} \right)$$

uniformly in x ∈ D as n tends to infinity.

Proof Consider first the case where h tends to infinity or stays bounded away from zero and infinity. We note that $\hat\beta_h(x)$ is a solution of the equation $W_h(x, \beta) = 0$, and that $\bar\beta_h(x)$ is the unique solution of the equation $\bar W_h(x, \beta) = 0$, where

$$\bar W_h(x, \beta) = E\left\{ K\!\left( \frac{x - X_1}{h} \right) \frac{m^{(1)}(X_1, \beta)}{\sigma^2(X_1, \beta)} [\mu(X_1) - m(X_1, \beta)] \right\}.$$

Now, the uniqueness of $\bar\beta_h(x)$ and the uniform convergence

$$\sup_{\beta \in C,\, x \in D} |W_h(x, \beta) - \bar W_h(x, \beta)| = o_p(1) \qquad (9)$$

imply $\hat\beta_h(x) - \bar\beta_h(x) = o_p(1)$ uniformly in x ∈ D. The property (9) may be proved by a standard technique. In fact, one may show that the left-hand side of Eq. (9) equals $O_p(\sqrt{(\log n)/n})$


by making use of the Markov inequality as in the proof of Proposition 1 of Hardle and Mammen (1993). The consistency of $\hat\beta_h(x)$ justifies the following expansion of $W_h(x, \hat\beta_h(x))$:

$$0 = W_h(x, \hat\beta_h(x)) = W_h(x, \bar\beta_h(x)) + \left\{ E W_h^{(1)}[x, \bar\beta_h(x)] + O_p\!\left( \sqrt{\frac{\log n}{n}} \right) \right\} [\hat\beta_h(x) - \bar\beta_h(x)] + O_p[\,|\hat\beta_h(x) - \bar\beta_h(x)|^2\,],$$

where the $O_p$'s are uniform in x ∈ D, and $W_h^{(1)}(x, \beta)$ denotes the matrix which has as its (i, j)th entry the partial derivative of the ith element of $W_h(x, \beta)$ with respect to βj. Note that $E\{W_h^{(1)}(x, \beta)\} = -D_h(x, \beta)$. The theorem now follows immediately from this expansion and the fact that $\sup_{x \in D} |W_h(x, \bar\beta_h(x))| = O_p(\sqrt{(\log n)/n})$.

The proof for the case where h tends to zero is similar. Note that in this case Eq. (9) still holds if the left-hand side is multiplied by $1/h^d$. In fact, all the arguments in the above proof are valid if we multiply $W_h$ and $\bar W_h$ by $1/h^d$, and replace $O_p(\sqrt{(\log n)/n})$ by $O_p(\sqrt{(\log n)/(nh^d)})$. □

4 LARGE h ASYMPTOTICS

In this section we concentrate on the case where h tends to infinity as n grows. We assume that the true regression function belongs to an α-neighborhood with α ≥ −1. When we let h increase to infinity, the local likelihood estimator targets at $m_\infty(x) = m(x, \beta_\infty)$, where

$$\beta_\infty = \arg\min_\beta E\{ D[\mu(X_1), m(X_1, \beta)] \}. \qquad (10)$$

For the risk of an estimator $\hat m$ of µ we consider

$$R_\infty(\hat m) = \frac{1}{2} E \int \frac{[\hat m(x) - \mu(x)]^2}{\sigma^2(x, \beta_\infty)} f(x) \, dx. \qquad (11)$$

Note that $(\partial/\partial m) D(\mu, m)|_{m=\mu} = (m - \mu)/[\kappa''(\vartheta(m))]|_{m=\mu} = 0$, and that the risk in Eq. (11) is an approximation of the mean integrated deviance

$$\mathrm{MID}(\hat m) = E \int D[\mu(x), \hat m(x)] f(x) \, dx. \qquad (12)$$

One may expect that when h → ∞ the risk of the local likelihood estimator is the same, to the first order, as that of the global maximum likelihood estimator discussed at the beginning of the previous section. We detail the expansion of the risk to include the higher order terms, which enables us to connect the bandwidth h to the sample size n.

First, we derive relevant expansions for $\bar\beta_h(x)$ and $\bar m_h(x)$ in the following theorem, which is analogous to Theorem 1 for $\hat\beta_h(x)$ and $\hat m_h(x)$. To do this, we assume in addition to (C1)–(C5):

(L1) the minimizer $\beta_\infty$ defined in Eq. (10) is unique;
(L2) $K(t) = 1 - \kappa_2 \|t\|^2 + \kappa_4 \|t\|^4 + o(\|t\|^4)$ as $\|t\| \to 0$ for some positive κ2 and κ4.


Note that under the condition (L1), $\beta_\infty$ is the unique solution of the equation $W_\infty(\beta) = 0$, where

$$W_\infty(\beta) = E\left\{ \frac{m^{(1)}(X_1, \beta)}{\sigma^2(X_1, \beta)} [\mu(X_1) - m(X_1, \beta)] \right\}.$$

Define

$$J_\infty(\beta) = E\left\{ \frac{m^{(1)}(X_1, \beta)\, m^{(1)}(X_1, \beta)^{\mathrm T}}{\sigma^2(X_1, \beta)} \right\}.$$

THEOREM 2 Let h → ∞ as n → ∞. Assume the true regression function belongs to an α-neighborhood with α ≥ −1. Assume that (C1)–(C5), (L1) and (L2) hold. Then, we have

$$\bar\beta_h(x) = \beta_\infty + J_\infty(\beta_\infty)^{-1} \bar W_h(x, \beta_\infty) + O\!\left( \frac{1}{n^{(1+\alpha)/2} h^4} + \frac{1}{n^{1+\alpha} h^2} \right),$$

$$\bar m_h(x) = m_\infty(x) + m^{(1)}(x, \beta_\infty)^{\mathrm T} J_\infty(\beta_\infty)^{-1} \bar W_h(x, \beta_\infty) + O\!\left( \frac{1}{n^{(1+\alpha)/2} h^4} + \frac{1}{n^{1+\alpha} h^2} \right)$$

uniformly in x ∈ D.

Proof One may prove, in a similar way as in the proof of Theorem 1, that

$$\bar\beta_h(x) = \beta_\infty + D_h(x, \beta_\infty)^{-1} \bar W_h(x, \beta_\infty) + O\!\left( \frac{1}{n^{1+\alpha} h^4} \right) \qquad (13)$$

uniformly in x ∈ D. To establish Eq. (13) one may use the facts

$$E[\mu(X_1) - m_\infty(X_1)]^2 = O\!\left( \frac{1}{n^{1+\alpha}} \right), \qquad (14)$$

$$\sup_{x \in D} |\bar W_h(x, \beta_\infty)| = O\!\left( \frac{1}{n^{(1+\alpha)/2} h^2} \right). \qquad (15)$$

Note that Eq. (14) comes from the fact that there exist absolute constants 0 < C1 < C2 such that for any µ and m

$$C_1 (\mu - m)^2 \le D(\mu, m) \le C_2 (\mu - m)^2.$$

Also, Eq. (15) follows from Eq. (14) and the definition of $\beta_\infty$, i.e. the fact that $W_\infty(\beta_\infty) = 0$. In fact, writing $\delta(x, \beta) = m^{(1)}(x, \beta)[\mu(x) - m(x, \beta)]/\sigma^2(x, \beta)$, we may verify

$$\bar W_h(x, \beta_\infty) = -\frac{\kappa_2}{h^2} E[\|x - X_1\|^2 \delta(X_1, \beta_\infty)] + O\!\left[ \frac{1}{n^{(1+\alpha)/2} h^4} \right]. \qquad (16)$$

Now, using Eq. (14) and the condition (L2) on the kernel K, we may obtain an expansion of $D_h(x, \beta_\infty)$ as

$$D_h(x, \beta_\infty) = J_\infty(\beta_\infty) + O\!\left[ \frac{1}{h^2} + \frac{1}{n^{(1+\alpha)/2}} \right]. \qquad (17)$$

Plugging the expression (17) into Eq. (13) yields the expansion for $\bar\beta_h(x)$ in the theorem, and putting that expansion into $\bar m_h(x) = m[x, \bar\beta_h(x)]$ entails the expansion for $\bar m_h(x)$. □


The stochastic term in the expansion of $\hat m_h(x)$ given in Theorem 1 may be approximated further using the expansion of $\bar\beta_h(x)$ given in Theorem 2, as described in the following theorem. For this, we let

$$J_h(x, \beta) = E\left\{ K\!\left( \frac{x - X_1}{h} \right) \frac{m^{(1)}(X_1, \beta)\, m^{(1)}(X_1, \beta)^{\mathrm T}}{\sigma^2(X_1, \beta)} \right\},$$

$$U_h(x, \beta) = \frac{1}{n} \sum_{i=1}^{n} K\!\left( \frac{x - X_i}{h} \right) \frac{m^{(1)}(X_i, \beta)}{\sigma^2(X_i, \beta)} [Y_i - \mu(X_i)].$$

THEOREM 3 Let h → ∞ as n → ∞. Under the conditions of Theorem 2, we obtain

$$\hat m_h(x) = \bar m_h(x) + m^{(1)}(x, \beta_\infty)^{\mathrm T} J_h(x, \beta_\infty)^{-1} U_h(x, \beta_\infty) + O_p\!\left\{ \frac{(\log n)^{1/2}}{n^{1+(\alpha/2)}} + \frac{\log n}{n} \right\}$$

uniformly in x ∈ D as n goes to infinity.

Proof We note first that

$$\frac{1}{n} \sum_{i=1}^{n} K\!\left( \frac{x - X_i}{h} \right) \frac{m^{(1)}[X_i, \bar\beta_h(x)]}{\sigma^2[X_i, \bar\beta_h(x)]} \{\mu(X_i) - m[X_i, \bar\beta_h(x)]\} = O_p\!\left[ \frac{(\log n)^{1/2}}{n^{1+(\alpha/2)}} \right] \qquad (18)$$

uniformly in x ∈ D. This follows from Eqs. (14), (15) and the fact that the left-hand side of Eq. (18) has mean zero by the definition of $\bar\beta_h(x)$. Thus, we can write

$$\hat m_h(x) = \bar m_h(x) + m^{(1)}(x, \bar\beta_h(x))^{\mathrm T} D_h[x, \bar\beta_h(x)]^{-1} U_h[x, \bar\beta_h(x)] + O_p\!\left[ \frac{(\log n)^{1/2}}{n^{1+(\alpha/2)}} + \frac{\log n}{n} \right]$$

uniformly in x ∈ D. Now, we may approximate $m^{(1)}[x, \bar\beta_h(x)]$ and $D_h[x, \bar\beta_h(x)]$ by $m^{(1)}(x, \beta_\infty)$ and $J_h(x, \beta_\infty)$, respectively, with remainders for both being of magnitude $O(n^{-(1+\alpha)/2} h^{-2})$ uniformly in x ∈ D. Also, we may verify

$$U_h(x, \bar\beta_h(x)) = U_h(x, \beta_\infty) + O_p\!\left[ \frac{(\log n)^{1/2}}{n^{1+(\alpha/2)} h^2} \right]$$

uniformly in x ∈ D, and show that $\sup_{x \in D} |U_h(x, \bar\beta_h(x))| = O_p[n^{-1/2} (\log n)^{1/2}]$. This completes the proof of the theorem. □

Direct risk assessment of $\hat m_h$ is out of reach since the estimator does not have a closed form. Instead, we consider its approximation given in Theorem 3. Write

$$m_h^L(x) = \bar m_h(x) + m^{(1)}(x, \beta_\infty)^{\mathrm T} J_h(x, \beta_\infty)^{-1} U_h(x, \beta_\infty). \qquad (19)$$

We can decompose the risk of $m_h^L$ as

$$R_\infty(m_h^L) = b_\infty(n, h) + v_\infty(n, h), \qquad (20)$$

where $b_\infty(n, h)$ and $v_\infty(n, h)$ are defined by

$$b_\infty(n, h) = \frac{1}{2} \int \frac{[\bar m_h(x) - \mu(x)]^2}{\sigma^2(x, \beta_\infty)} f(x) \, dx,$$

$$v_\infty(n, h) = \frac{1}{2} E \int \frac{[m_h^L(x) - \bar m_h(x)]^2}{\sigma^2(x, \beta_\infty)} f(x) \, dx.$$

The first term $b_\infty(n, h)$ in the decomposition (20) represents the bias of $m_h^L$ due to model misspecification, while $v_\infty(n, h)$ measures the sampling variability of $m_h^L$. In the following theorem, we present more tractable approximations for these components. To state the theorem, we define

$$\Delta_n = E\left\{ \frac{m^{(1)}(X_1, \beta_\infty)\, X_1^{\mathrm T}}{\sigma^2(X_1, \beta_\infty)} [\mu(X_1) - m(X_1, \beta_\infty)] \right\},$$

$$v(x) = J_\infty(\beta_\infty)^{-1/2} m^{(1)}(x, \beta_\infty)/\sigma(x, \beta_\infty),$$

$$Q(x) = E[\|x - X_1\|^4 v(X_1) v(X_1)^{\mathrm T}] - \{ E[\|x - X_1\|^2 v(X_1) v(X_1)^{\mathrm T}] \}^2,$$

$$\tau = E[v(X_1)^{\mathrm T} Q(X_1) v(X_1)].$$

We would like to note here that $\Delta_n = O[n^{-(1+\alpha)/2}]$ from Eq. (14). Furthermore, τ > 0 since Q(x) is positive definite for all x. The latter property follows from the fact that $E[v(X_1) v(X_1)^{\mathrm T}]$ equals the identity matrix, and the following generalised Cauchy–Schwartz inequality:

$$(E\delta\psi^{\mathrm T})(E\psi\psi^{\mathrm T})^{-1}(E\psi\delta^{\mathrm T}) \le E\delta\delta^{\mathrm T} \qquad (21)$$

for any random vectors δ ∈ R^p and ψ ∈ R^q, with equality holding if and only if δ = Aψ with probability one for some constant matrix A. In Eq. (21), D1 ≤ D2 for two matrices D1 and D2 means that D2 − D1 is non-negative definite.

THEOREM 4 Let h → ∞ as n → ∞. Under the conditions of Theorem 2, we get

$$b_\infty(n, h) = \frac{1}{2} \int \frac{[m_\infty(x) - \mu(x)]^2}{\sigma^2(x, \beta_\infty)} f(x) \, dx - \frac{2\kappa_2}{h^2} \operatorname{tr}\{ \Delta_n^{\mathrm T} J_\infty(\beta_\infty)^{-1} \Delta_n \} + o\!\left( \frac{1}{n^{1+\alpha} h^2} \right),$$

$$v_\infty(n, h) = \frac{q\, a(\phi)}{2n} (1 + c_n) + \frac{\kappa_2^2\, a(\phi)\, \tau}{2 n h^4} + o\!\left( \frac{1}{n h^4} \right),$$

where q is the dimension of the parameter β, and $c_n$ is a sequence of constants that does not depend on h and goes to zero as n tends to infinity.

Proof Put the expansion of $\bar m_h(x)$ as given in Theorem 2, with the expression (16) for $\bar W_h(x, \beta_\infty)$, into

$$b_\infty(n, h) = \frac{1}{2} \int \frac{[m_\infty(x) - \mu(x)]^2}{\sigma^2(x, \beta_\infty)} f(x) \, dx + \int \frac{[\bar m_h(x) - m_\infty(x)][m_\infty(x) - \mu(x)]}{\sigma^2(x, \beta_\infty)} f(x) \, dx + O\!\left( \frac{1}{n^{1+\alpha} h^4} \right).$$


Then, we may derive the formula for $b_\infty(n, h)$ after some algebraic manipulation with the identity

$$E[\|X_1 - X_2\|^2 \delta(X_1, \beta_\infty)^{\mathrm T} J_\infty(\beta_\infty)^{-1} \delta(X_2, \beta_\infty)] = -2 E[X_1^{\mathrm T} X_2\, \delta(X_1, \beta_\infty)^{\mathrm T} J_\infty(\beta_\infty)^{-1} \delta(X_2, \beta_\infty)],$$

where δ(x, β) is as defined in the sentence containing Eq. (16). We note that the above identity follows from the fact that $W_\infty(\beta_\infty) = 0$.

Next, we prove the second formula, for $v_\infty(n, h)$. We start by noting that

$$E[m^{(1)}(x, \beta_\infty)^{\mathrm T} J_h(x, \beta_\infty)^{-1} U_h(x, \beta_\infty)]^2 = m^{(1)}(x, \beta_\infty)^{\mathrm T} J_h(x, \beta_\infty)^{-1} \operatorname{var}[U_h(x, \beta_\infty)]\, J_h(x, \beta_\infty)^{-1} m^{(1)}(x, \beta_\infty). \qquad (22)$$

For notational convenience, we write $J_\infty \equiv J_\infty(\beta_\infty)$ and define

$$J_{k,\infty}(x) = E\left\{ \|x - X_1\|^k \, \frac{m^{(1)}(X_1, \beta_\infty)\, m^{(1)}(X_1, \beta_\infty)^{\mathrm T}}{\sigma^2(X_1, \beta_\infty)} \right\}$$

for k = 2 and 4. Then, we may write

$$J_h(x, \beta_\infty) = J_\infty - \frac{\kappa_2}{h^2} J_{2,\infty}(x) + \frac{\kappa_4}{h^4} J_{4,\infty}(x) + o\!\left( \frac{1}{h^4} \right)$$

uniformly in x ∈ D. This implies that

$$J_h(x, \beta_\infty)^{-1} = J_\infty^{-1} + \frac{\kappa_2}{h^2} J_\infty^{-1} J_{2,\infty}(x) J_\infty^{-1} + \frac{1}{h^4} \left[ \kappa_2^2 J_\infty^{-1} J_{2,\infty}(x) J_\infty^{-1} J_{2,\infty}(x) J_\infty^{-1} - \kappa_4 J_\infty^{-1} J_{4,\infty}(x) J_\infty^{-1} \right] + o\!\left( \frac{1}{h^4} \right) \qquad (23)$$

uniformly in x ∈ D. Furthermore, we can calculate the variance of $U_h(x, \beta_\infty)$ as

$$\operatorname{var}\{U_h(x, \beta_\infty)\} = \frac{a(\phi)}{n} \left[ J_\infty - \frac{2\kappa_2}{h^2} J_{2,\infty}(x) + \frac{\kappa_2^2 + 2\kappa_4}{h^4} J_{4,\infty}(x) + o\!\left( \frac{1}{h^4} \right) \right] (1 + c_n) \qquad (24)$$

uniformly in x ∈ D, where $c_n = O(n^{-(1+\alpha)/2})$ does not depend on h. Now, plugging Eqs. (23) and (24) into the right hand side of Eq. (22), we may find that the $h^{-2}$ terms cancel, and that the variance part $v_\infty(n, h)$ reduces to

$$v_\infty(n, h) = \frac{a(\phi)}{2n} (1 + c_n) \int \left[ v(x)^{\mathrm T} v(x) + \frac{\kappa_2^2}{h^4} \left\{ v(x)^{\mathrm T} E[\|x - X_1\|^4 v(X_1) v(X_1)^{\mathrm T}] v(x) - v(x)^{\mathrm T} \{ E[\|x - X_1\|^2 v(X_1) v(X_1)^{\mathrm T}] \}^2 v(x) \right\} + o\!\left( \frac{1}{h^4} \right) \right] f(x) \, dx. \qquad (25)$$

This completes the proof of the formula for $v_\infty(n, h)$ since $E\{v(X_1) v(X_1)^{\mathrm T}\}$ equals the identity matrix. □


One may go through a similar asymptotic risk analysis for the global maximum likelihood estimator $m(\cdot, \hat\beta)$, which was introduced in the third paragraph of Section 2. In fact, the first terms of $b_\infty(n, h)$ and $v_\infty(n, h)$ in Theorem 4 are the bias and the variance contributions, respectively, of $m(\cdot, \hat\beta)$. Put $\nu_n = \operatorname{tr}[\Delta_n^{\mathrm T} J_\infty(\beta_\infty)^{-1} \Delta_n]$. Then, the risk improvement of the local likelihood method upon the global maximum likelihood estimator equals $2\kappa_2 \nu_n / h^2 - \kappa_2^2 \tau a(\phi)/(2nh^4)$. Recall that $\Delta_n = O[n^{-(1+\alpha)/2}]$, so that $\nu_n = O(n^{-1-\alpha})$. Also, recall that τ > 0. Thus, when the model P0 is correctly specified, i.e. α = ∞, the global maximum likelihood estimator has less asymptotic risk than the local likelihood method.

However, when −1 ≤ α < ∞, the local likelihood estimator may have smaller risk with a proper choice of the bandwidth. In fact, in the case where 0 < α < ∞, the risk improvement upon the global maximum likelihood estimator is maximised by the optimal bandwidth

$$h_{\mathrm{opt}} = \sqrt{ \frac{\kappa_2\, a(\phi)\, \tau}{2 n \nu_n} }, \qquad (26)$$

which is $O(n^{\alpha/2})$. With this optimal bandwidth, the risk improvement equals $4 n \nu_n^2/[\tau a(\phi)] = O(n^{-1-2\alpha})$. When −1 ≤ α ≤ 0, it is easy to see that $b_\infty(n, h) + v_\infty(n, h)$ becomes smaller as h tends to infinity more slowly.

5 SMALL h ASYMPTOTICS

When we let h tend to zero, the local likelihood estimator at a point x targets at $m_0(x) = m[x, \beta_0(x)]$, where

$$\beta_0(x) = \arg\min_\beta D[\mu(x), m(x, \beta)].$$

The small h asymptotic results presented below are valid for all α ≥ −1. First, we list the following assumptions in addition to (C1)–(C5):

(S1) the vector $m^{(1)}(x, \beta)$ is not zero for all x ∈ D and β ∈ C;
(S2) q, the dimension of β in the model m(·, β), equals $\sum_{i=0}^{p} \binom{i+d-1}{d-1}$ for some non-negative integer p;
(S3) for each x in the interior of D, µ and $m[\cdot, \beta_0(x)]$ are p + 1 times partially continuously differentiable in a neighborhood of x;
(S4) for each x in the interior of D, $m^{(1)}(\cdot, \beta_0(x))$ is p times partially continuously differentiable in a neighborhood of x, and M(x) as defined at Eq. (32) is invertible.

Under the condition (S1), $\beta_0(x)$, the solution of the equation

$$m^{(1)}(x, \beta)[\mu(x) - m(x, \beta)] f(x) = 0,$$

is unique and satisfies $m[x, \beta_0(x)] = \mu(x)$ for any x in the interior of D. The condition (S2) tells that the number of parameters in the model $P_0 = \{m(\cdot, \beta): \beta \in \mathbb{R}^q\}$ equals the total number of partial derivatives of m(·, β) up to order p. It allows the case where m(x, β) is a pth order polynomial in $x \in \mathbb{R}^d$.

For the risk of an estimator $\hat m$ of µ we take

$$R_0(\hat m) = \frac{1}{2} E \int \frac{[\hat m(x) - \mu(x)]^2}{\sigma^2[x, \beta_0(x)]} f(x) \, dx. \qquad (27)$$

The risk in Eq. (27) is a good approximation of the mean integrated deviance $\mathrm{MID}(\hat m)$ as defined in Eq. (12) when $\hat m$ is close to µ.

An analogue of Theorem 2 for the expansions of $\bar\beta_h(x)$ and $\bar m_h(x)$ when h → 0 is provided in the following theorem. We note that the expansions do not depend on the index α.

THEOREM 5 Let h → 0 as n → ∞. Assume that (C1)–(C5) and (S1) hold. Then, we have

$$\bar\beta_h(x) = \beta_0(x) + J_h[x, \beta_0(x)]^{-1} \bar W_h[x, \beta_0(x)][1 + o(1)],$$

$$\bar m_h(x) = \mu(x) + m^{(1)}[x, \beta_0(x)]^{\mathrm T} J_h[x, \beta_0(x)]^{-1} \bar W_h[x, \beta_0(x)][1 + o(1)]$$

uniformly in x ∈ D.

Proof From (S1) and the fact that, as h tends to zero,

$$\sup_{\beta \in C,\, x \in D} \left| \frac{1}{h^d} \bar W_h(x, \beta) - \frac{m^{(1)}(x, \beta)[\mu(x) - m(x, \beta)]}{\sigma^2(x, \beta)} f(x) \right| = o(1),$$

we may conclude that $\bar\beta_h(x) - \beta_0(x) = o(1)$ uniformly in x ∈ D. The theorem now follows from a standard expansion of $h^{-d} \bar W_h[x, \bar\beta_h(x)]$ for $\bar\beta_h(x)$ close to $\beta_0(x)$. □

As in the large h case, the risk of $\hat m_h$ cannot be evaluated directly. Here we consider the following stochastic approximation of $\hat m_h$ given in Theorem 1:

$$m_h^S(x) = \bar m_h(x) + m^{(1)}[x, \bar\beta_h(x)]^{\mathrm T} D_h[x, \bar\beta_h(x)]^{-1} W_h[x, \bar\beta_h(x)]. \qquad (28)$$

We can decompose the risk of $m_h^S$ as

$$R_0(m_h^S) = b_0(n, h) + v_0(n, h),$$

where $b_0(n, h)$ and $v_0(n, h)$ are defined by

$$b_0(n, h) = \frac{1}{2} \int \frac{[\bar m_h(x) - \mu(x)]^2}{\sigma^2[x, \beta_0(x)]} f(x) \, dx,$$

$$v_0(n, h) = \frac{1}{2} E \int \frac{[m_h^S(x) - \bar m_h(x)]^2}{\sigma^2[x, \beta_0(x)]} f(x) \, dx.$$

Below we present more tractable approximations of $b_0(n, h)$ and $v_0(n, h)$. The presentation needs some careful notation to treat the multi-dimensional covariates. First, for a d-tuple $j \equiv (j_1, \ldots, j_d)$ and a d-dimensional vector x, write

$$j! = j_1! \times \cdots \times j_d!, \qquad |j| = \sum_{i=1}^{d} j_i, \qquad x^j = x_1^{j_1} \times \cdots \times x_d^{j_d}.$$

For a function g defined on $\mathbb{R}^d$, write

$$(D^j g)(x) = \frac{\partial^{|j|} g(x)}{\partial x_1^{j_1} \cdots \partial x_d^{j_d}}.$$


Let $N_i = \binom{i+d-1}{d-1}$. Arrange the $N_i$ d-tuples with |j| = i in lexicographical order: put (0, 0, . . . , i) first and (i, 0, . . . , 0) last. Let $\pi_i$ denote the function which maps an integer k, 1 ≤ k ≤ $N_i$, to the d-tuple located at the kth position in this arrangement. For example, $\pi_i(1) = (0, 0, \ldots, i)$. The map $\pi_i$ has been considered in Masry (1996). Let $\xi_j = \int u^j K(u) \, du$ for a d-tuple j. For k, l = 0, 1, . . . write $\Gamma_1^{(k,l)}$ to denote the $N_k \times N_l$ matrix whose (r, c)th entry is $\xi_{\pi_k(r) + \pi_l(c)}$. Define

$$\Gamma_1 = \begin{pmatrix} \Gamma_1^{(0,0)} & \Gamma_1^{(0,1)} & \cdots & \Gamma_1^{(0,p)} \\ \Gamma_1^{(1,0)} & \Gamma_1^{(1,1)} & \cdots & \Gamma_1^{(1,p)} \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma_1^{(p,0)} & \Gamma_1^{(p,1)} & \cdots & \Gamma_1^{(p,p)} \end{pmatrix}, \qquad \Gamma_{1,p+1} = \begin{pmatrix} \Gamma_1^{(0,p+1)} \\ \Gamma_1^{(1,p+1)} \\ \vdots \\ \Gamma_1^{(p,p+1)} \end{pmatrix}.$$

Note that $\Gamma_1$ is a q × q matrix and $\Gamma_{1,p+1}$ is a $q \times N_{p+1}$ matrix. Likewise, define a q × q matrix $\Gamma_2$ by replacing $\xi_j$ in the definition of $\Gamma_1$ by $\gamma_j = \int u^j K^2(u) \, du$.

The following theorem gives the bias and variance approximations for $m_h^S(x)$. Let $A_i(x)$, for i = 0, . . . , p + 1, be an $N_i \times 1$ matrix defined by

$$A_i(x) = (-1)^i \begin{pmatrix} D^{\pi_i(1)}\{\mu(\cdot) - m(\cdot, \beta_0(x))\}(x)/\pi_i(1)! \\ \vdots \\ D^{\pi_i(N_i)}\{\mu(\cdot) - m(\cdot, \beta_0(x))\}(x)/\pi_i(N_i)! \end{pmatrix}. \qquad (29)$$

For an arbitrarily small ε > 0, let $D_\varepsilon$ be the largest compact set such that for each $x \in D_\varepsilon$ any ball with radius ε centered at x is contained in D.

THEOREM 6 Let h → 0 and $nh^d \to \infty$ as n → ∞. Assume that (C1)–(C5) and (S1)–(S4) hold. Then, we have

$$\bar m_h(x) - \mu(x) = e_1^{\mathrm T} \Gamma_1^{-1} \Gamma_{1,p+1} A_{p+1}(x)\, h^{p+1} + o(h^{p+1}),$$

$$\operatorname{var}\{m_h^S(x)\} = \frac{1}{nh^d} \left\{ \frac{a(\phi)\, \sigma^2[x, \beta_0(x)]}{f(x)} \right\} e_1^{\mathrm T} \Gamma_1^{-1} \Gamma_2 \Gamma_1^{-1} e_1 + o\!\left( \frac{1}{nh^d} \right)$$

uniformly in $x \in D_\varepsilon$ for any ε > 0, where $e_1$ is the q × 1 unit vector given by $e_1 = (1, 0, \ldots, 0)^{\mathrm T}$.

A direct consequence of Theorem 6 is given in the following corollary. Let λ denote the Lebesgue measure of D in $\mathbb{R}^d$.

COROLLARY 1 Under the assumptions of Theorem 6,

$$b_0(n, h) = \frac{1}{2} h^{2(p+1)} e_1^{\mathrm T} \Gamma_1^{-1} \Gamma_{1,p+1} \left\{ \int \frac{A_{p+1}(x) A_{p+1}(x)^{\mathrm T}}{\sigma^2[x, \beta_0(x)]} f(x) \, dx \right\} \Gamma_{1,p+1}^{\mathrm T} \Gamma_1^{-1} e_1 + o(h^{2(p+1)}),$$

$$v_0(n, h) = \frac{\lambda\, a(\phi)}{2 n h^d} e_1^{\mathrm T} \Gamma_1^{-1} \Gamma_2 \Gamma_1^{-1} e_1 + o\!\left( \frac{1}{nh^d} \right).$$


Remark 1 Contrary to the case where h → ∞, the bias expansions in Theorem 6 and Corollary 1 do not depend on the global measure of discrepancy α, because of the localness of the fitting.

Remark 2 When p is even and the kernel K is symmetric about the origin, it may be proved that $e_1^{\mathrm T} \Gamma_1^{-1} \Gamma_{1,p+1} = 0$, so that $\bar m_h(x) - \mu(x) = o(h^{p+1})$ and $b_0(n, h) = o(h^{2(p+1)})$. This property of the bias has been observed by Fan et al. (1995), but in the special case where d = 1 and $m(x, \beta) = g^{-1}(\beta_0 + \beta_1 x + \cdots + \beta_p x^p)$ for some link function g.

Remark 3 The optimal size of the bandwidth may be determined by trading off the two terms $b_0(n, h)$ and $v_0(n, h)$. Denote the constant factors of $h^{2(p+1)}$ and $n^{-1}h^{-d}$, respectively, by $c_{\mathrm{bias}}$ and $c_{\mathrm{var}}$. Then, we find

$$h_{\mathrm{opt}} = \left( \frac{c_{\mathrm{var}}}{n\, c_{\mathrm{bias}}} \right)^{1/(d+2p+2)}. \qquad (30)$$

With this optimal bandwidth, the risk has magnitude of order $O(n^{-(2p+2)/(d+2p+2)})$.
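For instance (our illustration, not a computation from the paper), with a one-dimensional covariate (d = 1) and a local linear type fit (p = 1), Eq. (30) gives a bandwidth of order $n^{-1/5}$ and a risk of order $n^{-4/5}$, the familiar nonparametric rates:

```python
def h_opt_small(c_var, c_bias, n, d, p):
    """Optimal small-h bandwidth of Eq. (30); c_bias and c_var are the constants in Corollary 1."""
    return (c_var / (n * c_bias)) ** (1.0 / (d + 2 * p + 2))

def risk_order(n, d, p):
    """Order of the optimal risk, n^{-(2p+2)/(d+2p+2)}."""
    return n ** (-(2.0 * p + 2.0) / (d + 2.0 * p + 2.0))

# With c_bias = c_var = 1 (placeholder constants) and n = 1000:
print(h_opt_small(1.0, 1.0, 1000, d=1, p=1))   # ~0.251, i.e. order n**(-1/5)
print(risk_order(1000, d=1, p=1))              # ~0.004, i.e. order n**(-4/5)
```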

Proof of Theorem 6 First, we consider the bias part. For a given d-dimensional vector u, define an $N_i \times 1$ vector, denoted by $z_i(u, h)$, in such a way that the kth entry is given by $h^i u^{\pi_i(k)}$. Next, define a q × 1 vector z(u, h) by

$$z(u, h)^{\mathrm T} = (z_0(u, h)^{\mathrm T}, \ldots, z_p(u, h)^{\mathrm T}).$$

Recall that we assume $q = \sum_{i=0}^{p} N_i$ in (S2). Let, for i = 0, . . . , p,

$$M_i(x) = (-1)^i \left( \frac{D^{\pi_i(1)} m^{(1)}[x, \beta_0(x)]}{\pi_i(1)!}, \ldots, \frac{D^{\pi_i(N_i)} m^{(1)}[x, \beta_0(x)]}{\pi_i(N_i)!} \right). \qquad (31)$$

In a slight abuse of notation, we write in Eq. (31) $D^j m^{(1)}[x, \beta_0(x)] = \{D^j m^{(1)}[\cdot, \beta_0(x)]\}(x)$ for a d-tuple j. Form a q × q matrix M(x) by

$$M(x) = (M_0(x), \ldots, M_p(x)). \qquad (32)$$

We note that $M_0(x) = m^{(1)}(x, \beta_0(x))$, and thus

$$m^{(1)}(x, \beta_0(x)) = M(x) e_1. \qquad (33)$$

Then we may write

$$m^{(1)}[x - hu, \beta_0(x)] = M(x) z(u, h) + o(h^p) \qquad (34)$$

uniformly in $x \in D_\varepsilon$ and in any compact set of u. Let $\Lambda_h$ denote a diagonal matrix of dimension q which has $h^i I_i$ as its ith diagonal block of size $N_i \times N_i$, where $I_i$ is the $N_i \times N_i$ identity matrix. From Eq. (34) and the fact that $\int K(u)\, z(u, h) z(u, h)^{\mathrm T} \, du = \Lambda_h \Gamma_1 \Lambda_h$, we obtain

$$\frac{1}{h^d} J_h[x, \beta_0(x)] = \frac{f(x)}{\sigma^2[x, \beta_0(x)]} M(x) \Lambda_h \Gamma_1 \Lambda_h M(x)^{\mathrm T} [1 + o(1)] \qquad (35)$$

uniformly in $x \in D_\varepsilon$. Now, with $A_i(x)$ as defined in Eq. (29), form a q × 1 vector A(x) by

$$A(x)^{\mathrm T} = (A_0(x)^{\mathrm T}, \ldots, A_p(x)^{\mathrm T}).$$


Then we may write, as in deriving Eq. (35),

$$\frac{1}{h^d} \bar W_h(x, \beta_0(x)) = \frac{f(x)}{\sigma^2[x, \beta_0(x)]} [M(x) \Lambda_h \Gamma_1 \Lambda_h A(x) + M(x) \Lambda_h \Gamma_{1,p+1} A_{p+1}(x)\, h^{p+1}] [1 + o(1)] \qquad (36)$$

uniformly in $x \in D_\varepsilon$. Combining Eqs. (33), (35) and (36) yields that $\bar m_h(x) - \mu(x)$ equals

$$[e_1^{\mathrm T} A(x) + e_1^{\mathrm T} \Lambda_h^{-1} \Gamma_1^{-1} \Gamma_{1,p+1} A_{p+1}(x)\, h^{p+1}] [1 + o(1)]. \qquad (37)$$

Note that $A_0(x)$ is scalar and is the first element of A(x). Furthermore, $A_0(x) = \mu(x) - m[x, \beta_0(x)] = 0$. Thus we conclude that $e_1^{\mathrm T} A(x) = 0$. Since $e_1^{\mathrm T} \Lambda_h^{-1} = e_1^{\mathrm T}$, we complete the proof of the bias part.

Next, we consider the variance part. Note that

$$\frac{n}{h^d} \operatorname{var}\{W_h[x, \bar\beta_h(x)]\} = \frac{n}{h^d} E(\operatorname{var}\{W_h[x, \bar\beta_h(x)] \mid X_1, \ldots, X_n\}) + \frac{n}{h^d} \operatorname{var}(E\{W_h[x, \bar\beta_h(x)] \mid X_1, \ldots, X_n\}). \qquad (38)$$

The second term of Eq. (38) is negligible since the conditional expectation equals

$$\frac{1}{n} \sum_{i=1}^{n} K\!\left( \frac{x - X_i}{h} \right) \frac{m^{(1)}[X_i, \bar\beta_h(x)]}{\sigma^2[X_i, \bar\beta_h(x)]} \{\mu(X_i) - m[X_i, \bar\beta_h(x)]\},$$

which has mean zero and variance of order $o(n^{-1} h^d)$. The first term may be calculated as in deriving Eqs. (35) and (36) to obtain

$$\frac{n}{h^d} \operatorname{var}\{W_h(x, \bar\beta_h(x))\} = \frac{a(\phi) f(x)}{\sigma^2(x, \beta_0(x))} M(x) \Lambda_h \Gamma_2 \Lambda_h M(x)^{\mathrm T} + o(1) \qquad (39)$$

uniformly in $x \in D_\varepsilon$. For Eq. (39) we used the relation $\sigma^2[x, \beta_0(x)] = \operatorname{var}(Y_1 \mid X_1 = x)/a(\phi)$. The variance part now follows from Eqs. (33), (35), (39) and the facts $m^{(1)}[x, \bar\beta_h(x)] = m^{(1)}[x, \beta_0(x)][1 + o(1)]$ and $h^{-d} D_h[x, \bar\beta_h(x)] = h^{-d} J_h[x, \beta_0(x)][1 + o(1)]$ uniformly in $x \in D_\varepsilon$. □

6 CONCLUDING REMARKS AND FUTURE RESEARCH

In this section, we combine the large and the small h asymptotics to give a theoretical answer to the question of how to tune the bandwidth h depending on the degree of model misspecification. In the discussion below, we include the case where the bandwidth h is held fixed. A relevant asymptotic treatment for a fixed h may be possible from Theorem 1. The bias of the local likelihood estimator, contributed by the deterministic term $\bar m_h(x)$ in Theorem 1, has magnitude of order $O(n^{-(1+\alpha)/2})$, and the variance from the stochastic term including $W_h(x, \bar\beta_h(x))$ is of order $O(n^{-1})$. Thus, the risk of the local likelihood estimator is asymptotic to $n^{-1-\alpha} + n^{-1}$.

First, consider the case where α > 0. In this case, the large h asymptotics shows that the risk is of order $O(n^{-1})$, while in the small h case the risk with the optimal choice of bandwidth is $O(n^{-(2p+2)/(d+2p+2)})$, as discussed in Remark 3. Thus, a large h is preferable in this case, and the optimal theoretical choice, denoted by $h_{\mathrm{opt}}(\alpha)$, is as given in Eq. (26).


Next, consider the case where $-1 \le \alpha < -d/(d+2p+2)$. In this case, $n^{-1-\alpha}$ dominates $n^{-1}$ and it is the leading order of the risk when h tends to infinity. Comparing $n^{-1-\alpha}$ with the optimal risk of order $n^{-(2p+2)/(d+2p+2)}$ in the small h case, we may conclude that a small h with the optimal choice $h_{\mathrm{opt}}(\alpha)$ given in Eq. (30) is desirable in this range of α values.

In the remaining case where $-d/(d+2p+2) \le \alpha \le 0$, the leading term of the risk when h tends to infinity is still of order $n^{-1-\alpha}$ and it is dominated by the optimal $n^{-(2p+2)/(d+2p+2)}$ in the small h case. Recall that the large h asymptotics reveals that in this case a bandwidth h increasing to infinity at an arbitrarily slow rate is preferable. This suggests that a fixed bandwidth is desirable in this α range. The fact that the risk for a fixed h is of order $n^{-1-\alpha}$ too, as discussed in the first paragraph of this section, supports this speculation.
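The three regimes described above can be collected into a small helper function; this is merely a restatement of the rules in this section (with the boundary cases assigned to the fixed-h regime, as in the text), not a data-driven selection rule.

```python
def bandwidth_regime(alpha, d, p):
    """Bandwidth regime suggested in Section 6 for covariate dimension d and the order p in (S2)."""
    boundary = -d / (d + 2 * p + 2)
    if alpha > 0:
        return "large h: let h -> infinity, h_opt(alpha) as in Eq. (26)"
    if alpha < boundary:
        return "small h: let h -> 0, h_opt(alpha) as in Eq. (30)"
    return "fixed h"

print(bandwidth_regime(0.5, d=1, p=1))    # large h
print(bandwidth_regime(-0.8, d=1, p=1))   # small h (the boundary is -0.2 when d = 1, p = 1)
print(bandwidth_regime(-0.1, d=1, p=1))   # fixed h
```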

A challenging problem is to find a data-driven bandwidth selection rule $\hat h$. In particular, it is desired to find a selection rule $\hat h$ that satisfies, under the α-neighborhood,

$$\frac{\hat h - h_{\mathrm{opt}}(\alpha)}{h_{\mathrm{opt}}(\alpha)} \longrightarrow 0 \qquad (40)$$

as n increases to infinity, in some probabilistic sense. One may employ some cross-validation criteria, bootstrap ideas, or other sophisticated methods such as the plug-in rules of Park and Marron (1990) or Sheather and Jones (1991). The plug-in ideas rely on the asymptotic formula for $h_{\mathrm{opt}}(\alpha)$, and thus their implementation needs knowledge of the size of α as well as preliminary estimation of several unknown constants in $h_{\mathrm{opt}}(\alpha)$. However, cross-validation and bootstrap bandwidth selection are not tied to asymptotics and are readily applicable. We conjecture that the latter two methods may have the property (40) for the whole range of α. A proper treatment of this issue is left for future work.

To what extent is it possible to determine the value or the range of the parameter α? This is also an interesting and as yet unexplored question. It is closely related to the bandwidth selection problem. We have seen how optimal bandwidth tuning is influenced by the degree of model misspecification. This indicates, perhaps indirectly, that a bandwidth selected by a reasonable method may be able to say something about α.

It should be noted that the asymptotics presented in this paper are for deterministic h. Exploring the sampling properties for stochastic h is also a very important problem. The basic question is whether the theory for deterministic h still holds for stochastic h. This would require working on a uniform stochastic expansion of the local likelihood estimator, analogous to Theorem 1, but now over a suitable bandwidth range as well as over the design space. We leave this as a challenging open problem, too.

One may construct goodness-of-fit tests for a given parametric model m(·, β) based on the local likelihood estimator. One possible way of doing this is first to compute the average deviance $n^{-1} \sum_{i=1}^{n} D[\hat m_h(X_i), m(X_i, \hat\beta)]$, or its cross-validatory version, for each fixed h, and then to take the minimum of these averages over a range of h, where $\hat\beta$ is the fully parametric maximum likelihood estimator of β. Alternatively, one may work on the average deviance with a data-driven bandwidth selector. Another approach is to construct a version of the lack-of-fit tests considered in Aerts et al. (1999, 2000) by using the local likelihood estimator. A comparison of these and the one mentioned in Eq. (7) would be of interest.

Acknowledgements

Research of Shinto Eguchi and Tae Yoon Kim was supported by KOSEF and JSPS through the Korea–Japan Joint Research Program. Research of Byeong U. Park was supported by KOSEF through the Statistical Research Center for Complex Systems at Seoul National University. The authors thank the referees for valuable comments.


References

Aerts, M., Claeskens, G. and Hart, J. D. (1999). Testing the fit of a parametric function. J. Amer. Statist. Assoc., 94, 869–879.
Aerts, M., Claeskens, G. and Hart, J. D. (2000). Testing lack of fit in multiple regression. Biometrika, 87, 405–424.
Copas, J. B. (1995). Local likelihood based on kernel censoring. J. R. Statist. Soc. B, 57, 221–235.
Eguchi, S. and Copas, J. B. (1998). A class of local likelihood methods and near-parametric asymptotics. J. R. Statist. Soc. B, 60, 709–724.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman & Hall, London.
Fan, J., Heckman, N. and Wand, M. P. (1995). Local polynomial kernel regression for generalized linear models and quasi-likelihood functions. J. Amer. Statist. Assoc., 90, 141–150.
Hardle, W. and Mammen, E. (1993). Comparing nonparametric versus parametric regression fits. Ann. Statist., 21, 1926–1947.
Hjort, N. L. and Jones, M. C. (1996). Locally parametric nonparametric density estimation. Ann. Statist., 24, 1619–1647.
Lindsey, J. K. (1996). Parametric Statistical Inference. Oxford University Press, Oxford.
Loader, C. R. (1996). Local likelihood density estimation. Ann. Statist., 24, 1602–1618.
Masry, E. (1996). Multivariate regression estimation: local polynomial fitting for time series. Stochastic Process. Appl., 65, 85–101.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman & Hall, London.
Park, B. U., Kim, W. C. and Jones, M. C. (2002). On local likelihood density estimation. Ann. Statist., 30, 1480–1495.
Park, B. U. and Marron, J. S. (1990). Comparison of data-driven bandwidth selectors. J. Amer. Statist. Assoc., 85, 66–72.
Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. J. R. Statist. Soc. B, 53, 683–690.
Tibshirani, R. and Hastie, T. (1987). Local likelihood estimation. J. Amer. Statist. Assoc., 82, 559–567.
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman & Hall, London.