Stat Comput (2010) 20: 499–511 · DOI 10.1007/s11222-009-9140-0

P-splines regression smoothing and difference type of penalty

I. Gijbels · A. Verhasselt

Received: 30 May 2008 / Accepted: 18 June 2009 / Published online: 1 July 2009
© Springer Science+Business Media, LLC 2009

Abstract P-splines regression provides a flexible smoothing tool. In this paper we consider difference type penalties in a context of nonparametric generalized linear models, and investigate the impact of the order of the differencing operator. Minimizing Akaike’s information criterion we search for a possible best data-driven value of the differencing order. Theoretical derivations are established for the normal model and provide insights into a possible ‘optimal’ choice of the differencing order and its interrelation with other parameters. Applications of the selection procedure to non-normal models, such as Poisson models, are given. Simulation studies investigate the performance of the selection procedure and we illustrate its use on real data examples.

Keywords Akaike’s information criterion · B-splines · Difference penalty · Generalized linear modelling · Penalized regression · Smoothing

1 Introduction

Ordinary least squares regression with B-splines has been studied in Dierckx (1993). Since regular regression with B-splines tends to overfit, Eilers and Marx (1996) proposed to add a difference penalty on the coefficients of adjacent B-splines, in the same sense as smoothing splines. This leads to the regression P-splines technique. P-splines have been used as a tool in many different areas, see for example Ruppert et al. (2003).

I. Gijbels (✉) · A. Verhasselt
Department of Mathematics and Leuven Statistics Research Center (LStat), Katholieke Universiteit Leuven, Celestijnenlaan 200 B, KULeuven Box 2400, 3001 Leuven (Heverlee), Belgium
e-mail: [email protected]

In penalized regression techniques many different forms of penalty functions are used. Often the form of the penalty function is inspired by assumptions inherent to the function to be estimated. For example, if the target function is assumed to be twice differentiable then a commonly chosen penalty is a second order difference type of penalty. When the target function may show irregularities, such as discontinuities or spikes, then among the appropriate penalty functions are penalties that depend on a potential function that is non-smooth at zero. The particular choice of penalty function has impact on several aspects of the problem: the particular nature of the optimization problem ((non)-convexity or (non)-concavity of the objective function); the existence of a unique solution; the possible algorithms to find a solution.

Deciding on an appropriate form of the penalty function is one of the important questions in penalized regression. However, in practice the degree of smoothness of the target function is often not known. Consequently, the choice of the order of the difference penalty, when restricting to the class of difference penalties, is not an obvious one. In this paper we study the impact of this differencing order and explore whether a ‘best’ choice for this order given the data at hand exists. As a criterion for selecting the differencing order as well as the smoothing parameter we use Akaike’s information criterion. We also establish a theoretical approximation of the selection criterion which sheds light on the existence of an ‘optimal’ choice for the differencing order.

In this paper the focus is on an adaptive choice of the order in a difference type of penalty, together with a data-driven global (constant) choice of the smoothing parameter. Other work deals with functional type of smoothing parameters, that is a smoothing parameter that may take different values in different regions of the domain of estimation. For example, Pintore et al. (2006) discuss a spatially adaptive choice of the smoothing parameter for a piecewise constant smoothing parameter function, whereas a mixed modelling framework allowing for a varying smoothing parameter is used in Krivobokova et al. (2008). Another approach to an adaptive choice of a penalty function can be found in, for example, Heckman and Ramsay (2000), where the focus is on a class of penalty functions that are defined in terms of linear differential operators, and an adaptive choice for these operators is made.

The paper is organized as follows. In Sect. 2, we give the estimation context for the normal regression model. Section 3 briefly discusses the proposed algorithm and the Akaike information type of criterion. Theoretical background and derivations for the normal model are provided in Sect. 4; the proofs of all theoretical results are deferred to the Appendix. An extensive numerical study in Sect. 5 illustrates the finite-sample properties of the estimation procedure with the proposed differencing order selection for the normal model. Illustrations of the use of the method on real data examples are also given. In Sect. 6 we discuss the selection procedure for generalized linear models and provide illustrations for it. We conclude with some brief discussion in Sect. 7.

2 Flexible estimation in normal regression models

Suppose we have data (x_i, Y_i), for i = 1, ..., n, from

$$ Y_i = \mu(x_i) + \varepsilon_i, \qquad (2.1) $$

with μ(·) a certain smooth unknown function, and where the ε_i's are independent and identically distributed (i.i.d.) zero mean normal random variables with finite variance σ². For simplicity we assume that the design points are fixed in an interval [a, b].

To estimate μ(·) we use a regression spline model $\mu(x) = \sum_{j=1}^{m} \alpha_j B_j(x; q)$, where {B_j(·; q) : j = 1, ..., K + q = m} is the q-th degree B-spline basis, using normalized B-splines such that Σ_j B_j(x; q) = 1, with K + 1 equidistant knot points t_0 = a, t_1 = a + (b − a)/K, ..., t_K = b in [a, b], and α = (α_1, ..., α_m)′ is the unknown column vector of regression coefficients. The penalized least squares estimator α̂ is then the minimizer of the penalized log-likelihood

$$ S = \frac{1}{\sigma^2} \sum_{i=1}^{n} \Big( Y_i - \sum_{j=1}^{m} \alpha_j B_j(x_i; q) \Big)^2 + \lambda \sum_{j=k+1}^{m} (\Delta^k \alpha_j)^2, \qquad (2.2) $$

where λ > 0 is the smoothing parameter and Δ the differencing operator, that is $\Delta^k \alpha_j = \sum_{t=0}^{k} (-1)^t \binom{k}{t} \alpha_{j-t}$, with k ∈ ℕ. In particular, for k = 1 and k = 2 this is Δ¹α_j = α_j − α_{j−1} and Δ²α_j = α_j − 2α_{j−1} + α_{j−2}, respectively. The parameter λ influences the smoothness of the fitted curve; if λ → 0, (2.2) reduces to least squares regression with B-splines. With k = 0, the optimization problem in (2.2) corresponds to ridge regression. See for example Hastie et al. (2001). This penalized least squares approach can also be used for flexible modelling in generalized regression models. See Sect. 6.

The selection of the smoothing parameter λ > 0 and the number and locations of the knot points t_0, ..., t_K have received considerable attention. In contrast, the choice of the order k of the differencing operator in relation to the degree q of the B-splines is much less explored. The impact of the choice of k on the estimation has been investigated via examples in Welham et al. (2007). Currie and Durban (2002) attempted to make appropriate choices of q, k and K in practice. Algorithms for adaptive choices of parameters in a smoothing spline context are discussed in for example Izarry (2004) and Eubank et al. (2003). To the best of our knowledge there are no studies available which also provide theoretical foundations. In the present paper we discuss the choice of the order k of the differencing operator and use Akaike’s information criterion to find an ‘optimal’ value for k, the differencing order, and at the same time for the smoothing parameter λ (the latter as in Eilers and Marx 1996). Our studies, theoretically as well as empirically, reveal that the choices of k, λ, q and K are connected, and that it would likely be redundant to choose all of them in a data-driven way. In this paper we start from prespecified values of K and q and focus on best data-driven choices of k and λ.

In the next sections we present the selection method and provide theoretical derivations for the AIC criterion for the normal model (2.1).

3 Akaike’s information criterion and selection procedure

Rewriting expression (2.2) in matrix notation we obtain:

$$ S = \frac{1}{\sigma^2}(Y - B\alpha)'(Y - B\alpha) + \lambda\, \alpha' D_k' D_k \alpha, \qquad (3.1) $$

where the elements B_ij of B (∈ ℝ^{n×m}) are B_j(x_i; q), D_k (∈ ℝ^{(m−k)×m}) is the matrix representation of the k-th order differencing operator Δ^k, and Y = (Y_1, ..., Y_n)′. We note that B′B is a positive semidefinite, symmetric band matrix with 2q + 1 (sub)diagonals different from zero, since only a few B-splines have overlapping support. Furthermore the matrix D_k′D_k is a positive semidefinite, symmetric band matrix too, with 2k + 1 (sub)diagonals different from zero.

We now determine the estimate of the vector of regression coefficients α, denoted by α̂ = (α̂_1, ..., α̂_m)′, and the estimate of the vector of the underlying function μ(·) evaluated at the data points, i.e. μ = (μ(x_1), ..., μ(x_n))′. Minimizing (3.1) leads to the penalized least squares estimator:

$$ \hat{\alpha} = (B'B + \sigma^2\lambda D_k'D_k)^{-1} B'Y \quad\text{and}\quad \hat{\mu} = B\hat{\alpha} = B(B'B + \sigma^2\lambda D_k'D_k)^{-1} B'Y = HY, \qquad (3.2) $$

since μ = Bα, and where H is a matrix of dimension n × n, called the hat matrix.
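For the normal model, (3.2) can be computed directly once the B-spline design matrix and the difference matrix are available. Below is a minimal, hedged sketch: the helper names are ours, the equidistant-knot construction follows Sect. 2, and the design matrix is built with SciPy’s BSpline.design_matrix (available in SciPy ≥ 1.8); for large n and m one would of course exploit the band structure noted above rather than dense linear algebra.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, a, b, K, q):
    """n x m design matrix B with entries B_j(x_i; q), where m = K + q and
    the K + 1 inner knots are equidistant on [a, b] (boundary knots repeated)."""
    inner = np.linspace(a, b, K + 1)
    knots = np.concatenate([np.full(q, a), inner, np.full(q, b)])
    return BSpline.design_matrix(x, knots, q).toarray()

def pspline_normal_fit(x, y, a, b, K, q, k, lam, sigma2):
    """Penalized least squares estimator (3.2) and the hat matrix H."""
    B = bspline_design(x, a, b, K, q)
    Dk = np.diff(np.eye(B.shape[1]), n=k, axis=0)    # k-th order difference matrix
    M = B.T @ B + sigma2 * lam * (Dk.T @ Dk)         # B'B + sigma^2 * lambda * D_k'D_k
    alpha_hat = np.linalg.solve(M, B.T @ y)
    H = B @ np.linalg.solve(M, B.T)                  # mu_hat = H @ y
    return alpha_hat, B @ alpha_hat, H
```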

Akaike’s information criterion (AIC) is then defined as −2 times the log-likelihood plus 2 times the trace of the hat matrix:

$$ \mathrm{AIC}(\lambda, k) = \frac{1}{\sigma^2}(Y - \hat{\mu})'(Y - \hat{\mu}) + 2\,\mathrm{trace}(H) = \frac{1}{\sigma^2} Y'Y - \frac{2}{\sigma^2} Y'B\hat{\alpha} + \frac{1}{\sigma^2} \hat{\alpha}'B'B\hat{\alpha} + 2\,\mathrm{trace}(H). \qquad (3.3) $$

In this criterion trace(H) is used as an approximation for the effective dimension of the vector of parameters α. See for example Ruppert et al. (2003).

We now use Akaike’s information criterion to select a ‘best’ value for the smoothing parameter as well as for the differencing order k, as follows.

Selection method:

1. For k fixed, find an ‘optimal’ value for λ: $\hat{\lambda}(k) = \arg\min_{\lambda \in \mathbb{R}^+_0} \mathrm{AIC}(\lambda, k)$.
2. Determine $\hat{k} = \arg\min_{k \in \mathbb{N}} \mathrm{AIC}(\hat{\lambda}(k), k)$.
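This two-step search is straightforward to implement. The sketch below is our own illustration (not code from the paper): it assumes the design matrix B has already been built as in Sect. 2 and that σ² is known or pre-estimated, and it replaces the exact minimization over λ in step 1 by a search over a log-spaced grid.

```python
import numpy as np

def aic_normal(B, y, sigma2, lam, k):
    """AIC(lambda, k) of (3.3) for the normal model."""
    Dk = np.diff(np.eye(B.shape[1]), n=k, axis=0)
    M = B.T @ B + sigma2 * lam * (Dk.T @ Dk)
    Minv_Bt = np.linalg.solve(M, B.T)
    mu_hat = B @ (Minv_Bt @ y)
    return np.sum((y - mu_hat) ** 2) / sigma2 + 2.0 * np.trace(B @ Minv_Bt)

def select_k_and_lambda(B, y, sigma2, k_values=range(1, 9), n_grid=121):
    """Step 1: lambda_hat(k) by grid search; step 2: k_hat minimizing AIC(lambda_hat(k), k)."""
    lam_grid = np.logspace(-6, 6, n_grid)
    best = None
    for k in k_values:
        aics = [aic_normal(B, y, sigma2, lam, k) for lam in lam_grid]
        j = int(np.argmin(aics))
        if best is None or aics[j] < best[0]:
            best = (aics[j], k, lam_grid[j])
    return {"k_hat": best[1], "lambda_hat": best[2], "aic": best[0]}
```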

4 Theoretical approximation of the ‘best’ differencing order under the normal model

In this section we work towards finding a non-stochastic approximation of the selection criterion, with the aim to gain an insight into the existence and impact of a ‘best’ value of the differencing order k and interrelations with other parameters. We start by searching for a first order approximation of AIC(λ, k), in (3.3), and subsequently study the minimizers of this approximation.

4.1 Approximation of the AIC criterion

Under the normal model we have the explicit form (3.2) for the solution. Since this solution needs to be substituted into (3.3), we first approximate the matrix (B′B + σ²λD_k′D_k)^{−1} in Theorem 1.

In the asymptotic study that follows, we let the number of knot points increase with n, i.e. K = K_n, and consider a sequence of smoothing parameters λ = λ_n. As will be seen from the conditions, the latter sequence needs to tend to 0 with n.

Theorem 1 If inf_{i≠j} |x_i − x_j| > 0 and sup_{i,j} (x_i − x_j) < δ with δ ≤ (b − a)/(b − a + q) < 1, and K = K_n and λ = λ_n are such that K/n → constant and λK/n → 0 when K, n → ∞, then

$$ (B'B + \sigma^2\lambda D_k'D_k)^{-1} = (B'B)^{-1} - \sigma^2\lambda (B'B)^{-1} D_k'D_k (B'B)^{-1} + o\Big(\frac{\lambda K^2}{n^2}\Big)\, 1_{m\times m}, $$

where 1_{m×m} is the m × m matrix of ones.

Using this theorem, we find that $\tilde{\alpha} = (B'B)^{-1}B'Y - \sigma^2\lambda (B'B)^{-1}D_k'D_k(B'B)^{-1}B'Y$ is a good first order approximation of α̂ when n is sufficiently large. Consequently, from (3.3), a first order approximation of AIC(λ, k) is

$$ \mathrm{AIC}(\lambda, k) = \frac{1}{\sigma^2} Y'Y - \frac{2}{\sigma^2} Y'B\tilde{\alpha} + \frac{1}{\sigma^2} \tilde{\alpha}'B'B\tilde{\alpha} + 2\,\mathrm{trace}\big(B(B'B)^{-1}B' - \sigma^2\lambda B(B'B)^{-1}D_k'D_k(B'B)^{-1}B'\big) $$
$$ = \frac{1}{\sigma^2} Y'Y - \frac{1}{\sigma^2} Y'B(B'B)^{-1}B'Y + \sigma^2\lambda^2\, Y'B(B'B)^{-1}D_k'D_k(B'B)^{-1}D_k'D_k(B'B)^{-1}B'Y + 2m - 2\sigma^2\lambda\,\mathrm{trace}\big((B'B)^{-1}D_k'D_k\big), \qquad (4.1) $$

where the second equality uses that the least squares residual vector Y − B(B′B)^{−1}B′Y is orthogonal to the columns of B, so that the term linear in λ coming from the sum of squares vanishes.

4.2 Minimization of the approximated AIC criterion

The first step in our method is to minimize the AIC with the intention of reaching an ‘optimal’ value for λ. Therefore we differentiate AIC(λ, k), in (4.1), for a fixed k with respect to λ and search for the zero of the derivative:

$$ \frac{\partial\,\mathrm{AIC}(\lambda, k)}{\partial\lambda} = 2\sigma^2\lambda\, Y'B(B'B)^{-1}D_k'D_k(B'B)^{-1}D_k'D_k(B'B)^{-1}B'Y - 2\sigma^2\,\mathrm{trace}\big((B'B)^{-1}D_k'D_k\big), $$

resulting in the optimal value for λ:

$$ \hat{\lambda}(k) = \frac{\mathrm{trace}\big((B'B)^{-1}D_k'D_k\big)}{Y'B(B'B)^{-1}D_k'D_k(B'B)^{-1}D_k'D_k(B'B)^{-1}B'Y}. $$

Using the notation

$$ A_k = B(B'B)^{-1}D_k'D_k(B'B)^{-1}D_k'D_k(B'B)^{-1}B', \qquad (4.2) $$

for this square matrix of dimension n × n, we find that

$$ \frac{\partial^2\,\mathrm{AIC}(\lambda, k)}{\partial\lambda^2} = 2\sigma^2\, Y'A_kY. $$
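For a fixed k, the matrix A_k of (4.2) and the resulting closed-form λ̂(k) above can be evaluated with a few lines of dense linear algebra. A minimal sketch (our own helper, for illustration only; it is based on the first-order approximation, not on the exact minimization of (3.3)):

```python
import numpy as np

def lambda_hat_first_order(B, y, k):
    """First-order optimal lambda for fixed k:
       trace((B'B)^{-1} D_k'D_k) / (y' A_k y), with A_k as in (4.2)."""
    Dk = np.diff(np.eye(B.shape[1]), n=k, axis=0)
    BtB_inv = np.linalg.inv(B.T @ B)
    P = BtB_inv @ Dk.T @ Dk                 # (B'B)^{-1} D_k'D_k
    A_k = B @ P @ P @ BtB_inv @ B.T         # n x n matrix of (4.2)
    return np.trace(P) / float(y @ A_k @ y), A_k
```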


Then λ̂(k) minimizes AIC(λ, k) if the quadratic form Y′A_kY is positive. We therefore assume that

Condition A The matrix Ak is positive definite.

Notably, B′B is positive definite, since it is positive semidefinite and invertible. Further, Marlow (1993, p. 151) gives that A_k is then a positive semidefinite matrix, since (B′B)^{−1} is a positive definite matrix. Hence Condition A is guaranteed when D_k′D_k(B′B)^{−1}B′ has full column rank n.

We next work further with AIC(λ̂(k), k) and ignore the terms which do not depend on k, since we only want an ‘optimal’ value for k. This leads to considering

$$ \mathrm{AIC}(\hat{\lambda}(k)) \equiv -\sigma^2\, \frac{\big(\mathrm{trace}((B'B)^{-1}D_k'D_k)\big)^2}{Y'A_kY} = -\sigma^2 \big(\mathrm{trace}((B'B)^{-1}D_k'D_k)\big)^2\, \frac{1}{V}, \qquad (4.3) $$

where V = Y′A_kY. Lastly an approximation of 1/V leads to an approximation of AIC(λ̂(k)) which is no longer stochastic. Such a non-stochastic approximation is very helpful since it will allow us to find some ‘optimal’ (non-stochastic) value for k, and will provide further insights in the (stochastic) choice of k via the proposed algorithm.

4.3 Approximating the minimized approximated AIC criterion

The following theorem presents an approximation of the term 1/V in (4.3). For convenience of the reader we provide the definitions of the norms in Appendix A.1.

Theorem 2 Suppose Condition A holds and assume that

1. ‖μ′‖∞ ‖μ‖∞ < ∞ and ‖μ‖₂ < ∞
2. K³/n → 0 when K, n → ∞
3. ‖A_k‖_sp / ‖A_k‖₂ → 0 when n → ∞.

Then 1/V = 1/E(V) + o_P(1).

Applying Theorem 2 we obtain, from (4.3), a non-stochastic approximation of AIC(λ̂(k)):

$$ \mathrm{AIC}(k; q, K) = -\sigma^2 \big(\mathrm{trace}((B'B)^{-1}D_k'D_k)\big)^2\, \frac{1}{E(V)} = -\sigma^2\, \frac{\big(\mathrm{trace}((B'B)^{-1}D_k'D_k)\big)^2}{\mu'A_k\mu + \sigma^2\,\mathrm{trace}(A_k)}, \qquad (4.4) $$

where the notation also explicitly shows the dependence on the degree q of the B-splines and the number of knot points K. Obviously, there is a strong link between q, K and k, and ideally one should look at all these quantities together when minimizing Akaike’s information criterion. This is however not a feasible task. Moreover an appropriate choice of k for given q and K gives a good performance, as can be seen from the examples. A non-stochastic ‘optimal’ value for k is found by minimizing (4.4) with respect to k, for fixed q and K.

5 Simulation study and application

5.1 Simulation study

We consider two examples in which Y_i = μ(x_i) + ε_i, where ε_i ∼ N(0, σ²) and the x_i's are equidistant. For both examples a graphical representation of AIC(λ̂(k), k) for the 100 samples is provided, as well as a graph with, for each k, the mean of AIC(λ̂(k), k) over the 100 samples. We also present AIC(k; q, K), defined in (4.4), for a fixed number of knot points but different degrees q of the B-splines, as well as AIC(k; q, K) as a function of the number of knot points K and the order of the differencing operator k, for a fixed degree q of the B-splines. These presentations allow us to reveal some links between k, q and K. We furthermore present a frequency histogram of the selected optimal k-value in the 100 samples.

We consider two models. In Model 1, μ(·) is a polynomial of degree 5 on the interval [−1.7, 1.7], depicted as a solid line in Fig. 3(b). In Model 2, the function μ(·) is defined on [−2, 2], as a parabola on [−1, 1] and a constant outside that subinterval. The function μ(·) in Model 2 is not differentiable at the points −1 and 1. See the solid line in Fig. 6(b).

For Model 1, the non-stochastic approximation AIC(k; q, K) for a fixed K but different q’s is presented in Fig. 1(a). It is seen that the higher the degree of the B-spline, the higher the ‘optimal’ k-value, as could be expected. In Fig. 2(a) the sample Akaike information criterion AIC(λ̂(k), k) as a function of k is given for the 100 samples, using B-splines of degree 3 and K = 6n^{1/4} = 30 for the P-spline regression. Figure 2(b) shows the non-stochastic AIC(k; q, K) as a function of k and K for q = 3. Both graphs indicate that k = 5 is, on average, the ‘optimal’ value for the order of the differencing operator. This is confirmed by the histogram of the optimal k-values given in Fig. 3(a) and the graph of the mean AIC in Fig. 1(b), calculated by averaging the curves AIC(λ̂(k), k) over all 100 samples. For comparison we also plot in Fig. 1(b) (as a dashed line) the non-stochastic AIC(k; q, K). As can be seen, both functions, the averaged finite sample AIC(λ̂(k), k) and its non-stochastic approximation AIC(k; q, K), achieve their minimal value at k = 5. Finally, in Fig. 3(b) the fit using B-splines of degree 3, k = 5 and K = 30 is plotted, and shows an almost perfect fit.

Fig. 1 Model 1. (a): AIC(k; q, K) for a fixed K and different q’s; (b): mean of AIC(λ̂(k), k) (over all samples) and AIC(k; q, K) for q = 3

Fig. 2 Model 1. Polynomial of degree 5 using B-splines of degree 3: (a) AIC(λ̂(k), k); and (b) AIC(k; q, K)

We now turn to simulations for Model 2. Since the main part of the underlying function in Model 2 is a parabola, we use B-splines of degree 2 and take K = 6n^{1/4} = 30 for the P-spline regression. The non-stochastic approximation AIC(k; q, K) for a fixed K but different q’s is presented in Fig. 4(a). In Fig. 5(a) the sample Akaike information criterion AIC(λ̂(k), k) as a function of k is shown for the 100 samples. From this graph and the histogram of the optimal k-values, provided in Fig. 6(a), we can conclude that k = 1 is the ‘optimal’ k-value for this example, when using B-splines of degree 2. The same value of k is seen as the minimizer of the mean AIC (over the 100 samples) depicted as a solid line in Fig. 4(b). Note that here the non-stochastic approximation AIC(k; q, K) for q = 2 (see Figs. 4(a) and (b)) reveals that k = 2 is best. This seeming discrepancy is due to the fact that AIC(k; q, K) is an asymptotic approximation of AIC(λ̂(k), k). Indeed, enlarging the sample size n led to a histogram with mode k = 2 (not presented here).

Figure 5(b) presents the non-stochastic AIC(k; q, K). The P-spline fit, using q = 2, k = 1 and K = 30, is depicted in Fig. 6(b). The fit has the same overall shape as the real underlying function, but tends to wiggle at the constant parts of the function and the top of the parabola. The different behaviour in the constant and parabolic parts of the curve is not surprising.

Fig. 3 Model 1. (a): histogram of optimal k-values in 100 samples; (b): P-spline fit using q = 3 and k = 5 (dashed line)

Fig. 4 Model 2. (a): AIC(k; q, K) for a fixed K and different q’s; (b): mean of AIC(λ̂(k), k) (over all samples) and AIC(k; q, K) for q = 2

5.2 Application: yacht race data

The data concern the Sydney-Hobart yacht race, a 630 nautical mile ocean race, which starts from Sydney Harbour on December 26 and finishes several days later in Hobart. The variable of interest Y is the winning time in minutes, and the x-variable is the year in which the race took place (from 1945 to 1997). The data are publicly available at http://www.statsci.org/data/oz/sydhob.html. Figure 7(a) presents the plots of AIC(λ̂(k), k) for q = 1, 3 and 7, revealing that for each degree the value k = 3 is the best choice. Figure 7(b) depicts the fits for each degree with the selected best k.

Fig. 5 Model 2. Tunnel using B-splines of degree 2: (a) AIC(λ̂(k), k); and (b) AIC(k; q, K)

Fig. 6 Model 2. (a): histogram of optimal k-values in 100 samples; (b): P-spline fit using q = 2 and k = 1 (dashed line)

6 Selection procedure for generalized linear models

The procedure for selecting the differencing order k can also be used in non-normal regression models.

6.1 Akaike’s information criterion

The penalized least squares approach is also applicable for flexible modelling in generalized regression models, extending as such the class of generalized linear models (see McCullagh and Nelder 1985). In generalized regression models, the predictor function takes the form

$$ \eta(x) = \sum_{j=1}^{m} \alpha_j B_j(x; q) \qquad (6.1) $$

and a link function g(·) links μ(x) with η(x) through η(x) = g(μ(x)), where μ_i = μ(x_i) = E(Y_i). In the context of a generalized linear model, the probability density function of the random variable Y is

$$ f(y; \theta, \phi) = \exp\Big( \frac{y\theta - h(\theta)}{s(\phi)} + c(y, \phi) \Big), \qquad (6.2) $$

where s(·), h(·) and c(·) are specific functions, φ is a scale parameter, and θ is the canonical parameter; furthermore E(Y) = μ = h′(θ) and Var(Y) = h′′(θ)s(φ). In the considered generalized regression models the parameter θ depends on x and model (6.2) involves the unknown function θ(x).
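For instance, the Poisson model used in Sect. 6.2 fits into (6.2) with θ = log μ, h(θ) = exp(θ), s(φ) = 1 and c(y, φ) = −log(y!), so that indeed E(Y) = h′(θ) = exp(θ) = μ and Var(Y) = h′′(θ)s(φ) = μ.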

The penalized regression estimator α̂ is then the minimizer of

$$ S = -2 \log L\big(\alpha_1, \ldots, \alpha_m; (x_1, Y_1), \ldots, (x_n, Y_n)\big) + \lambda \sum_{j=k+1}^{m} (\Delta^k \alpha_j)^2, \qquad (6.3) $$

where L(α_1, ..., α_m; (x_1, Y_1), ..., (x_n, Y_n)) is the likelihood function derived from model (6.2), extended to allow for a function θ(·).

Fig. 7 (a): AIC(λ̂(k), k); (b): P-spline fits; for three values of q

Denote the minimizer of (6.3) by α̂ = (α̂_1, ..., α̂_m)′. This leads to the estimated predictor function η̂(·) via (6.1). Let μ̂ = (μ̂(x_1), ..., μ̂(x_n))′, obtained from μ̂(x_i) = g^{−1}(η̂(x_i)), and denote η̂ = (η̂(x_1), ..., η̂(x_n))′.

Akaike’s information criterion (AIC) is then given by

$$ \mathrm{AIC}(\lambda, k) = -2 \log L\big(\hat{\alpha}_1, \ldots, \hat{\alpha}_m; (x_1, Y_1), \ldots, (x_n, Y_n)\big) + 2\,\mathrm{trace}(H), \qquad (6.4) $$

where H is defined below.

For generalized linear models the penalized log-likelihood (6.3) can be written as

$$ S = -2\Big( \frac{Y'B\alpha - 1_n' h(B\alpha)}{s(\phi)} + c(Y, \phi) \Big) + \lambda\, \alpha' D_k' D_k \alpha, $$

where 1_n = (1, ..., 1)′ is the column vector of dimension n with all elements equal to one. Minimizing this with respect to α leads to the following system of equations

$$ B'(Y - \mu) = \lambda s(\phi) D_k' D_k \alpha. \qquad (6.5) $$

As proposed in Eilers and Marx (1996), this system of equations can be solved by iterative weighted linear regression with the system

$$ B'W(Y - \mu) + B'WB\alpha = (B'WB + \lambda s(\phi) D_k' D_k)\alpha, $$

where α and μ are current approximations and W is a diagonal matrix with $W_{ii} = \frac{1}{\mathrm{Var}(Y_i)} \big( \frac{\partial \mu_i}{\partial \eta_i} \big)^2$. Akaike’s information criterion is then

$$ \mathrm{AIC}(\lambda, k) = -2\Big( \frac{Y'B\hat{\alpha} - 1_n' h(B\hat{\alpha})}{s(\phi)} + c(Y, \phi) \Big) + 2\,\mathrm{trace}(H), \qquad (6.6) $$

with H = B(B′WB + λs(φ)D_k′D_k)^{−1}B′W the hat matrix, where W is based on α̂ and μ̂ from the last step in the iteration procedure.

Application of the selection procedure for k and λ can then be performed as before. Establishing supporting theoretical results for this more general case is however a very tedious task, and is pursued in Gijbels and Verhasselt (2009). As an illustration we show the use of the selection procedure in the case of Poisson data.

6.2 Simulation study: Poisson model

We explore two examples where Y_i ∼ Poisson(μ(x_i)) with x_i equidistant. A histogram of the number of optimal k-values, found by minimizing the Akaike information criterion, in 100 samples is given for both examples. Again, we consider two different choices for the function μ(·) leading to two different models.

In Model 3 the underlying function μ(·) is a polynomial of degree 6 on the interval [−5, 5], depicted as a solid line in Fig. 8(b). Figure 8(a) represents the number of times a k-value is selected in the 100 samples, using B-splines of degree 3 and K = 6n^{1/4} = 30 in the AIC. The P-spline fit (dashed curve) is presented in Fig. 8(b), using B-splines of degree 3, K = 30 and k = 3.

In Model 4 the function μ(·) is a straight line with a dip. Figure 9(a) represents the number of times a k-value is selected in the 100 samples, by using B-splines of degree 3 and K = 6n^{1/4} = 30 in the calculation of the Akaike’s information criterion. The P-spline fit, using B-splines of degree 3, K = 30 and k = 2, is shown in Fig. 9(b) as a dashed line.

Fig. 8 Model 3. (a): histogram of the number of optimal k-values in 100 samples; (b): P-spline fit using q = 3 and k = 3 (dashed curve)

Fig. 9 Model 4. (a): histogram of the number of optimal k-values in 100 samples; (b): P-spline fit using q = 3 and k = 2 (dashed curve)

6.3 Application: coal mines data (Poisson model)

The number of severe disasters in British coal mines from year 1850 to 1962 is represented in the coal mines data set, which has been considered by Jarrett (1979) and Diggle and Marron (1988), among others. We use AIC(λ̂(k), k) to find an optimal value for k, as shown in Fig. 10(a), by taking the knot points equidistant in the domain of the data and using B-splines of degree 2 and 3. Note that in this example the AIC(λ̂(k), k) curve has a quite different form compared to previous examples. For both degrees (2 and 3) the optimal value for k is 3. The fitted curve, which is almost the same for both degrees of the B-splines, is presented in Fig. 10(b). We also added a fit using B-splines of degree 3 and k = 1, which differs considerably from the two other fits at the local maxima and minima.

Fig. 10 (a): AIC(λ̂(k), k); (b): P-spline fits

7 Conclusion and discussion

In this paper we consider P-spline regression smoothing in a context of nonparametric generalized linear models. Selection of the order of the difference type penalty is done via Akaike’s information criterion, simultaneously with the choice of the smoothing parameter λ. As is evidenced by the examples, for normal and non-normal models, adaptive choices of k and λ together lead to a very good performance, for any given fixed degree q and fixed K. In addition, for the normal model we study the theoretical behaviour of the selection criterion, and provide a first-order non-stochastic approximation of the criterion to be minimized in k. This non-stochastic approximation allowed us to study in more detail the interrelations between q, K and k, and to visualize them in plots. Moreover, such a non-stochastic optimal k is a first-order approximation to the stochastic choice of k obtained via the selection algorithm.

The study of the theoretical behaviour, provided in the appendix for the normal model, is rather involved. Establishing the theoretical support for the full set of generalized regression models requires additional technical tools, and will be part of a forthcoming work by Gijbels and Verhasselt (2009).

Acknowledgements The authors thank the Editor, an Associate Editor and two reviewers for their very valuable comments which led to a considerable improvement of the manuscript. Support from the GOA/07/04-project of the Research Fund KULeuven is gratefully acknowledged, as well as support from the IAP research network nr. P6/03 of the Federal Science Policy, Belgium.

Appendix

A.1 Definitions of matrix norms

Definition 1 Let A = (A_ij) be an m × n real valued matrix, then

• the 1-norm of A is ‖A‖₁ = max_{1≤j≤n} Σ_{i=1}^{m} |A_ij|
• the ∞-norm of A is ‖A‖∞ = max_{1≤i≤m} Σ_{j=1}^{n} |A_ij|
• for 1 < ν < ∞, the ν-norm of A is ‖A‖_ν = (Σ_{i=1}^{m} Σ_{j=1}^{n} |A_ij|^ν)^{1/ν}
• the spectral norm of A is ‖A‖_sp = √(ρ_max(A′A)), where ρ_max(A) is the largest eigenvalue of A
• A is bounded invertible if its inverse exists and is bounded.

Note that the norm ‖A‖_ν with ν = 2 is called the Frobenius norm of a matrix A, for which ‖A‖₂² = trace(A′A). Furthermore, it is worthwhile noting that when m = n, the p-norm of a matrix A ∈ ℝ^{n×n}, with p ≥ 1, is defined as ‖A‖_{p,MN} = max_{x≠0}(‖Ax‖_p/‖x‖_p), where x is any non-null vector of dimension n × 1 and ‖v‖_p = (Σ_{i=1}^{n} |v_i|^p)^{1/p} is the usual p-norm of a vector v of dimension n × 1. In the special case that p = 2 (the Euclidean norm), the 2-norm of the square matrix A, ‖A‖_{2,MN}, is equal to the spectral norm ‖A‖_sp.
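All of these norms are available through standard linear algebra routines; the small NumPy sketch below evaluates them for an arbitrary matrix and checks the identity ‖A‖₂² = trace(A′A) mentioned above.

```python
import numpy as np

A = np.array([[1.0, -2.0, 0.0],
              [3.0,  1.0, 4.0]])

norm_1 = np.linalg.norm(A, 1)            # max column sum of |A_ij|
norm_inf = np.linalg.norm(A, np.inf)     # max row sum of |A_ij|
norm_fro = np.linalg.norm(A, 'fro')      # Frobenius norm, i.e. the nu-norm with nu = 2
norm_sp = np.linalg.norm(A, 2)           # spectral norm sqrt(rho_max(A'A))

assert np.isclose(norm_fro ** 2, np.trace(A.T @ A))
assert norm_sp <= norm_fro + 1e-12       # spectral norm never exceeds the Frobenius norm
```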

A.2 Proof of Theorem 1

The proof of Theorem 1 relies on the generalization of a Taylor expansion of 1/(1 + x) for matrices. From Horn and Johnson (1993) (p. 301) we get the following proposition.

Proposition 1 If B′B is bounded invertible, then so is B′B + σ²λD_k′D_k if a matrix norm ‖·‖ exists such that

$$ \|\sigma^2\lambda D_k'D_k\| < \frac{1}{\|(B'B)^{-1}\|}. $$

Moreover, under this condition, it holds that

$$ (B'B + \sigma^2\lambda D_k'D_k)^{-1} = \sum_{j=0}^{\infty} (-1)^j (\sigma^2\lambda)^j (B'B)^{-1} \big(D_k'D_k(B'B)^{-1}\big)^j. $$

We prove Theorem 1 by applying Proposition 1.

Proof of Theorem 1 With a view to applying Proposition 1, we verify its conditions using the ∞-norm. We know from Gröchenig and Schwab (2003) that B′B is invertible if inf_{i≠j} |x_i − x_j| > 0 and sup_{i,j} (x_i − x_j) < δ with δ ≤ (b − a)/(b − a + q) < 1. Moreover, we know from Shen et al. (1998) (Lemma 6.3) that ‖(B′B)^{−1}‖∞ = O(K/n). Thus assuming inf_{i≠j} |x_i − x_j| > 0, sup_{i,j} (x_i − x_j) < δ with δ ≤ (b − a)/(b − a + q) < 1, and K/n → constant, implies that B′B is bounded invertible.

We verify the second condition of Proposition 1 using $\|D_k'D_k\|_\infty = \sum_{j=0}^{2k} \big| (-1)^j \binom{2k}{j} \big| = 4^k$. As a result we find ‖σ²λD_k′D_k‖∞ ‖(B′B)^{−1}‖∞ = σ²λ 4^k O(K/n) = O(λK/n), which converges to 0 if λK/n → 0, when K, n → ∞.

Furthermore, if λK/n → 0 the higher order terms in the Taylor series are negligible compared to the first two terms since

$$ \frac{(\sigma^2\lambda)^2 \|(B'B)^{-1}(D_k'D_k(B'B)^{-1})^2\|_\infty}{\sigma^2\lambda \|(B'B)^{-1}D_k'D_k(B'B)^{-1}\|_\infty} \le \sigma^2\lambda\, \frac{\|(B'B)^{-1}D_k'D_k(B'B)^{-1}\|_\infty \|D_k'D_k(B'B)^{-1}\|_\infty}{\|(B'B)^{-1}D_k'D_k(B'B)^{-1}\|_\infty} = O\Big(\frac{\lambda K}{n}\Big). $$

Therefore

$$ (B'B + \sigma^2\lambda D_k'D_k)^{-1} = (B'B)^{-1} - \sigma^2\lambda(B'B)^{-1}D_k'D_k(B'B)^{-1} + o\big(\lambda\|(B'B)^{-1}D_k'D_k(B'B)^{-1}\|_\infty\big)\, 1_{m\times m} $$
$$ = (B'B)^{-1} - \sigma^2\lambda(B'B)^{-1}D_k'D_k(B'B)^{-1} + o\Big(\frac{\lambda K^2}{n^2}\Big)\, 1_{m\times m}. \qquad \square $$

A.3 Proof of Theorem 2

We prove Theorem 2 by using a Taylor series and show that only the constant term differs from zero. We prove that the higher order terms are o_P(1).

The following propositions are needed to prove Theorem 2. Proposition 2 presents some conditions under which E(V) is bounded.

Proposition 2 If K³/n → constant when K, n → ∞ and ‖μ′‖∞ ‖μ‖∞ < ∞, then E(V) is bounded.

Proof Since μ = E(Y), it follows

$$ E(V) = \mu'A_k\mu + \sigma^2\,\mathrm{trace}(A_k). \qquad (A.1) $$

With a view to bound E(V ), upperbounds for μ′Akμ andtrace(Ak) are sought. Since Ak is positive semidefinite wehave that

μ′Akμ = ‖μ′Akμ‖∞ ≤ ‖Ak‖∞‖μ′‖∞‖μ‖∞.

The proof of Theorem 1 establishes that ‖(B ′B)−1‖∞ =O(K/n) and ‖D′

kDk‖∞ = 4k . Since ‖B‖∞ =maxi

∑mj=1 |Bj (xi;q)| = 1 and ‖B ′‖∞ = ‖B‖1 =

maxj

∑ni=1 |Bj (xi;q)| ≤ maxj

∑ni=1 1 = n we receive that

‖Ak‖∞ = ‖B(B ′B)−1D′kDk(B

′B)−1D′kDk(B

′B)−1B ′‖∞≤ ‖B‖∞‖(B ′B)−1‖3∞‖D′

kDk‖2∞‖B ′‖∞

≤ nO

(K3

n3

)

.

Therefore μ′Akμ is bounded when K3/n2 → constant and‖μ′‖∞‖μ‖∞ < ∞. A rough upperbound for trace(Ak) isn‖Ak‖∞. Consequently E(V ) is bounded when K3/n →constant and ‖μ′‖∞‖μ‖∞ < ∞. �

Note that E(V) > 0, since A_k is assumed a positive definite matrix.

We rewrite V = Y′A_kY using the symmetry of A_k and the fact that Z_i = (Y_i − μ_i)/σ ∼ N(0, 1):

$$ V = \sum_{i=1}^{n}\sum_{j=1}^{n} (A_k)_{ij} Y_i Y_j = \sigma^2 \sum_{i=1}^{n}\sum_{j=1}^{n} (A_k)_{ij} Z_i Z_j + 2\sigma \sum_{i=1}^{n} \mu_i \sum_{j=1}^{n} (A_k)_{ij} Z_j + \sum_{i=1}^{n}\sum_{j=1}^{n} (A_k)_{ij} \mu_i \mu_j \equiv \sigma^2 S_1 + 2\sigma S_2 + S_3. $$

Therefore, using (A.1),

$$ V - E(V) = V - \mu'A_k\mu - \sigma^2\,\mathrm{trace}(A_k) = \sigma^2 S_1 + 2\sigma S_2 + S_3 - S_3 - \sigma^2\,\mathrm{trace}(A_k) = \sigma^2\big(S_1 - \mathrm{trace}(A_k)\big) + 2\sigma S_2, \qquad (A.2) $$

and we will show the asymptotic behaviour of V − E(V) by showing the asymptotic normality of S_1 − trace(A_k) and S_2.

The asymptotic normality of S_1 − trace(A_k) follows from an asymptotic normality result for quadratic forms in Bhansali et al. (2006) (see Theorem 2.1 in that paper) presented in the following proposition.

Proposition 3 If ‖A_k‖_sp / ‖A_k‖₂ → 0 when n → ∞, then

$$ \frac{S_1 - E(S_1)}{\sqrt{\mathrm{Var}(S_1)}} \xrightarrow{D} N(0, 1). $$

Note that the condition in Proposition 3 involves the norms ‖A_k‖_sp = √(ρ_max(A_k′A_k)) and ‖A_k‖₂ = √(trace(A_k′A_k)) = √(Σ_j ρ_j²), where {ρ_j} are the eigenvalues of the matrix A_k, since A_k′A_k is a square matrix. Hence the condition ‖A_k‖_sp/‖A_k‖₂ → 0 is nothing but assuming that the largest eigenvalue is dominated by all other eigenvalues together.

On the other hand, the normality of

$$ S_2 = \sum_{i=1}^{n}\sum_{j=1}^{n} (A_k)_{ij} \mu_i Z_j = \sum_{j=1}^{n} Z_j \Big( \sum_{i=1}^{n} (A_k)_{ij} \mu_i \Big) = \sum_{j=1}^{n} \alpha_{n,j} Z_j $$

is obvious since the Z_j are standard normally distributed. The elements above are essential to prove Theorem 2.

Proof of Theorem 2 The Taylor series of 1/V is

$$ \frac{1}{V} = \frac{1}{E(V)} \sum_{j=0}^{\infty} (-1)^j \Big( \frac{V - E(V)}{E(V)} \Big)^j. $$

Since we proved (in Proposition 2) that under Condition A, and the first and second condition of Theorem 2, 0 < E(V) < ∞, the terms ((V − E(V))/E(V))^j (with j > 0) converge to 0 in probability if V − E(V) converges to 0 in probability. Consequently 1/V = 1/E(V) + o_P(1).

From (A.2) we further write

$$ V - E(V) = \sigma^2\, \frac{S_1 - \mathrm{trace}(A_k)}{\sqrt{\mathrm{Var}(S_1)}} \cdot \sqrt{\mathrm{Var}(S_1)} + 2\sigma\, \frac{S_2}{\|A_k\mu\|_2} \cdot \|A_k\mu\|_2. \qquad (A.3) $$

For the first term on the right hand side, we know that E(S_1) = trace(A_k) and Var(S_1) = 2 trace(A_k²), because in general E(Z′ΛZ) = trace(ΛΣ) + γ′Λγ and Var(Z′ΛZ) = 2 trace(ΛΣΛΣ) + 4γ′ΛΣΛγ, where γ is the expected value and Σ the variance–covariance matrix of Z (and Λ a symmetric matrix). As a result of Proposition 3 we have that

$$ \frac{S_1 - \mathrm{trace}(A_k)}{\sqrt{2\,\mathrm{trace}(A_k^2)}} \xrightarrow{D} N(0, 1). $$

Since

$$ \mathrm{trace}(A_k^2) \le \big(\mathrm{trace}(A_k)\big)^2 \le \big(n\|A_k\|_\infty\big)^2 = O\Big(\frac{K^6}{n^2}\Big), $$

we obtain by condition 2 in Theorem 2 that the first term on the right hand side of (A.3) is o_P(1).

For the second term in (A.3) we use the fact that the standardized S_2 quantity is normally distributed. Indeed, for a fixed n, the α_{n,j}Z_j (j = 1, ..., n) are independent random variables which are N(0, α²_{n,j}) distributed. Consequently, Σ_{j=1}^{n} α_{n,j}Z_j is N(0, Σ_{j=1}^{n} α²_{n,j}) distributed. Note that Σ_{j=1}^{n} α²_{n,j} = μ′A_k²μ = ‖A_kμ‖₂² and thus

$$ \frac{1}{\|A_k\mu\|_2} \sum_{i=1}^{n}\sum_{j=1}^{n} (A_k)_{ij}\mu_i Z_j = \frac{S_2}{\|A_k\mu\|_2} \sim N(0, 1). $$

Since

$$ \|A_k\mu\|_2 \le \|A_k\|_2 \|\mu\|_2 \le n\|A_k\|_\infty \|\mu\|_2 = O\Big(\frac{K^3}{n}\Big), $$

and by the conditions of the theorem, we conclude that the second term in (A.3) is o_P(1) and consequently V − E(V) = o_P(1). □

References

Bhansali, R.J., Giraitis, L., Kokoskza, P.S.: Convergence of quadratic forms with nonvanishing diagonal. Stat. Probab. Lett. 77, 726–734 (2006)
Currie, I.D., Durban, M.: Flexible smoothing with P-splines: a unified approach. Stat. Model. 2, 333–349 (2002)
Dierckx, P.: Curve and Surface Fitting with Splines. Clarendon, Oxford (1993)
Diggle, P., Marron, J.S.: Equivalence of smoothing parameter selectors in density and intensity estimation. J. Am. Stat. Assoc. 86, 793–800 (1988)
Eilers, P., Marx, B.: Flexible smoothing with B-splines and penalties. Stat. Sci. 11, 89–102 (1996)
Eubank, R.L., Huang, C., Wang, S.: Adaptive order selection for spline smoothing. J. Comput. Graph. Stat. 12, 382–397 (2003)
Gijbels, I., Verhasselt, A.: Regularization and P-splines in generalized linear models. Manuscript (2009)
Gröchenig, K., Schwab, H.: Fast local reconstruction methods for nonuniform sampling in shift-invariant spaces. SIAM J. Matrix Anal. Appl. 24, 899–913 (2003)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2001)
Heckman, N.E., Ramsay, O.J.: Penalized regression with model-based penalties. Can. J. Stat. 28, 241–258 (2000)
Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1993)
Izarry, R.A.: Choosing smoothness parameters for smoothing splines by minimizing an estimate of risk. Technical Report, Dept. of Biostatistics Working Papers, Working Paper 30, Johns Hopkins University (2004)
Jarrett, R.G.: A note on the intervals between coal-mining disasters. Biometrika 66, 191–193 (1979)
Krivobokova, T., Crainiceanu, C.M., Kauermann, G.: Fast adaptive penalized splines. J. Comput. Graph. Stat. 17, 1–20 (2008)
Marlow, W.H.: Mathematics for Operations Research. Dover, New York (1993)
McCullagh, P., Nelder, J.A.: Generalized Linear Models. Chapman and Hall, London (1985)
Pintore, A., Speckman, P., Holmes, C.C.: Spatially adaptive smoothing splines. Biometrika 93, 113–125 (2006)
Ruppert, D., Wand, M.P., Carroll, R.J.: Semiparametric Regression. Cambridge University Press, Cambridge (2003)
Shen, X., Wolfe, D.A., Zhou, S.: Local asymptotics for regression splines and confidence regions. Ann. Stat. 26, 1760–1782 (1998)
Welham, S.J., Cullis, B.R., Kenward, M.G., Thompson, R.: A comparison of mixed model splines for curve fitting. Aust. N. Z. J. Stat. 49, 1–23 (2007)