
Computational Statistics & Data Analysis 40 (2002) 603–620
www.elsevier.com/locate/csda

The shared frailty model and the power for heterogeneity tests in multicenter trials

Luc Duchateau a,b,∗, Paul Janssen c, Patrick Lindsey c, Catherine Legrand b, Rosemary Nguti c,d, Richard Sylvester b

a Faculty of Veterinary Medicine, Department of Physiology, Biochemistry and Biometrics, Ghent University, Salisburylaan 133, 9820 Merelbeke, Belgium
b European Organization for Research and Treatment of Cancer, Av. E. Mounier 83, Box 11, 1200 Brussels, Belgium
c Center for Statistics, Limburgs Universitair Centrum, Universitaire Campus, Diepenbeek, Belgium
d Department of Mathematics, University of Nairobi and Biometrics Unit, International Livestock Research Institute, Nairobi, Kenya

Abstract

Heterogeneity between centers in multicenter trials with time to event outcome can be modeled by the frailty proportional hazards model. The majority of the different approaches that have been used to fit frailty models assume either the gamma or the lognormal frailty density and are based on similar log likelihood expressions. These approaches are briefly reviewed and their specific features described; simulations further demonstrate that the different techniques lead to virtually the same estimates for the heterogeneity parameter. An important issue is the relationship between the size of a multicenter trial, in terms of number of centers and number of patients per center, and the bias and the spread of estimates of the heterogeneity parameter around its true value. Based on simulation results (restricted to constant hazard rate and the gamma frailty density), it becomes clear how the number of centers and the number of patients per center influence the quality of the estimates in the particular setting of breast cancer clinical trials. This insight is important in treatment outcome research, where one tries to relate differences with respect to clinical practice to differences in outcome in the various centers. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Frailty model; Treatment outcome; Multicenter trial; EM algorithm; Penalized likelihood; REML

∗ Corresponding author. Faculty of Veterinary Medicine, Department of Physiology, Biochemistry and Biometrics, Ghent University, Salisburylaan 133, 9820 Merelbeke, Belgium.

E-mail address: [email protected] (L. Duchateau).

0167-9473/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0167-9473(02)00057-9


1. Introduction

For large international multicenter trials, such as cancer clinical trials, AIDS trials and cardiovascular disease trials, outcomes are typically more heterogeneous than outcomes from other clinical trials. Heterogeneity decreases the power to detect clinically important treatment differences, but on the other hand, more heterogeneous trials lead to more general conclusions as they are based on a wider patient population. Additionally, the differences between centers can be studied to determine whether differences in clinical practice at the center level have an influence on the outcome. This research field is termed 'treatment outcome' and can lead to changes in standard patient care (Harding et al., 1993; Hillner et al., 2000).

Treatment outcome research, however, has to be done on multicenter trials of adequate size in order to draw valid conclusions. The main aim of this paper is to study the relationship between the size of a multicenter trial, in terms of number of centers and number of patients per center, and the bias and spread of the estimates of the heterogeneity parameter around its true value. As discussed in Chapter 9 of Therneau and Grambsch (2000), this important issue of the amount of information and the patients/center distribution needed to produce stable estimates of the heterogeneity parameter requires further consideration. Based on our results, a decision can be taken with regard to the minimum size of a multicenter trial needed to do treatment outcome research in the particular field of breast cancer. We believe that such an analysis is required before undertaking treatment outcome research in a particular field.

We have chosen to base this analysis on simulation because the number of asymptotic results for the heterogeneity parameter available in the context of the proportional hazards model with covariates (treatment effect in our simulations) is rather limited. Relevant references for the asymptotic theory are Murphy (1994, 1995) and Parner (1998). Furthermore, it is not easy to obtain the standard error of heterogeneity parameter estimates. Although Klein (1992), see also Klein and Moeschberger (1997), proposes such an estimate based on the information matrix, this matrix has rank equal to the number of distinct event times plus the number of covariates plus one for the heterogeneity parameter. In the case of a multicenter trial, with typically a large number of distinct event times, this approach is not feasible because of the high dimensionality. Simulation, however, allows us to study the spread of the heterogeneity parameter estimates. Additionally, the simulations enable us to study the likelihood ratio test and validate it in the context of the frailty model.

In order to model heterogeneity between centers for time to event outcomes, the frailty model is used (Klein, 1992; Therneau and Grambsch, 2000; Yamaguchi and Ohashi, 1999). The frailty model with proportional hazards assumption and unspecified baseline hazard is described in Section 2. A number of recent papers describe a variety of likelihood approaches to fit the frailty model. Taking the observable likelihood (Section 3) as cornerstone, we explain (in Sections 3 and 4) the usefulness of partial likelihood and penalized partial likelihood to fit frailty models. The main purpose of this review is to explore the common ground for the available methods. For reasons of implementation we also add in the appendix the corresponding algorithms (the S-Plus code is available from the authors). The effect of the size of the multicenter trial on the spread of estimates of the heterogeneity parameter θ around its true value is studied by simulation in Section 5. In Section 6, a cancer clinical trial example is discussed in some detail.

2. The shared frailty model

Assume we have a total of n individuals that come from G different centers, center i having n_i individuals (n = \sum_{i=1}^{G} n_i). For each patient a time to event/censoring and a censoring indicator are observed, given by y_{ij} = (t_{ij}, \delta_{ij}) for the jth patient from center i, with i = 1, \ldots, G and j = 1, \ldots, n_i. Associated risk variables for the jth patient from center i are collected in the (p \times 1) vector x_{ij}. The number of events in center i is given by D_i = \sum_{j=1}^{n_i} \delta_{ij}. The matrix y contains all the y_{ij}'s.

The frailty model is given by

h_{ij}(t) = h_0(t)\exp(\eta_{ij}) = h_0(t)\exp(x_{ij}^t \beta + w_i),

where h_{ij}(t) is the hazard rate at time t for individual j from center i, h_0(t) is the unspecified baseline hazard at time t, \beta is the fixed effects vector, x_{ij} is the incidence vector for the fixed effects, and finally w_i is the random effect for center i. The w_i, i = 1, \ldots, G, are a sample (independent and identically distributed) from a density f_W(\cdot). With \eta = (\eta_{11}, \ldots, \eta_{1n_1}, \ldots, \eta_{G1}, \ldots, \eta_{Gn_G})^t, an (n \times 1) vector, we have

\eta = X\beta + Zw,

where X and Z are the (n \times p) and (n \times G) incidence matrices for the fixed effects \beta and the random effects w, \beta is a (p \times 1) and w a (G \times 1) vector. The frailty model can be rewritten as follows:

h_{ij}(t) = h_0(t)\exp(w_i)\exp(x_{ij}^t \beta) = h_0(t)\, u_i \exp(x_{ij}^t \beta).

Now u_i = \exp(w_i) is termed the frailty for the ith center. The independent u_i, i = 1, \ldots, G, have common density f_U(\cdot).

Furthermore, distributional assumptions have to be made for the frailties. Two classical choices for the density of the frailties are:

(a) The zero-mean normal density for W; then the density of U is lognormal, i.e.,

f_U(u) = \frac{1}{u\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(\log u)^2}{2\sigma^2}\right). \quad (1)

(b) The one-parameter gamma density for U, i.e.,

f_U(u) = \frac{u^{(1/\theta)-1}\exp(-u/\theta)}{\theta^{1/\theta}\Gamma(1/\theta)}. \quad (2)

Then the corresponding density for W is

f_W(w) = \frac{(\exp(w))^{1/\theta}\exp(-\exp(w)/\theta)}{\theta^{1/\theta}\Gamma(1/\theta)}.

Note that for the lognormal density E(W_i) = 0 but E(U_i) \neq 1, whereas for the gamma density E(U_i) = 1 but E(W_i) \neq 0. Typically Var(W) = \sigma^2 is used to describe heterogeneity in the lognormal density case, whereas the parameter Var(U) = \theta is most often used in the gamma density case. Below we use \theta as generic notation for heterogeneity (meaning \sigma^2 for the lognormal density and \theta for the gamma density).
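As a concrete check of the parameterization in (2) — a minimal S-Plus/R-style sketch, assuming an rgamma function with shape and scale arguments; all names are illustrative — frailties drawn with shape 1/θ and scale θ have mean close to 1 and variance close to θ:

    # Frailties u_i from the one-parameter gamma density (2):
    # shape = 1/theta and scale = theta, so that E(U) = 1 and Var(U) = theta.
    theta <- 0.1
    u <- rgamma(10000, shape = 1/theta, scale = theta)
    mean(u)      # approximately 1
    var(u)       # approximately 0.1
    w <- log(u)  # corresponding random center effects w_i (note E(W) != 0)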

3. The EM algorithm for the gamma frailty

Typical for the frailty model is that we have observed information y and unobserved (latent) information w. To estimate \xi = (\beta, \theta) we would like to base the likelihood maximization and statistical inference on the observed data log likelihood (Klein, 1992):

\ell_{obs}(\xi) = \log f(y; \xi) = \log\left(\int_0^{\infty} \cdots \int_0^{\infty} f(y, w; \xi) \prod_{i=1}^{G} dw_i\right)

= \log\left(\int_0^{\infty} \cdots \int_0^{\infty} f(y \mid w; \xi)\, f(w \mid \theta) \prod_{i=1}^{G} dw_i\right)

= \sum_{i=1}^{G} \left( D_i \log\theta - \log\Gamma(1/\theta) + \log\Gamma(1/\theta + D_i) - (1/\theta + D_i)\log\left[1 + \theta \sum_{j=1}^{n_i} H_0(t_{ij})\exp(x_{ij}^t \beta)\right] + \sum_{j=1}^{n_i} \delta_{ij}\left[x_{ij}^t \beta + \log h_0(t_{ij})\right] \right). \quad (3)

However, this likelihood is difficult to maximize as it contains, apart from \xi, also the unspecified baseline hazard. We therefore rely on the EM algorithm to estimate \xi. With \ell_{full}(\xi) = \log f(y, w; \xi) and \ell_{pred}(\xi) = \log f(w \mid y; \xi) we write, in the EM algorithm framework,

\ell_{obs}(\xi) = \ell_{full}(\xi) - \ell_{pred}(\xi),

where \ell_{full}(\xi) is the complete data log likelihood and \ell_{pred}(\xi) the predictive log likelihood. Taking the conditional expectation with respect to y, and with \xi^{(k-1)} a provisional value of \xi at stage k-1 in the EM algorithm, we obtain

\ell_{obs}(\xi) = Q(\xi \mid \xi^{(k-1)}) - H(\xi \mid \xi^{(k-1)}),

where

Q(\xi \mid \xi^{(k-1)}) = E_{\xi^{(k-1)}}[\log f(y, w; \xi) \mid y],
H(\xi \mid \xi^{(k-1)}) = E_{\xi^{(k-1)}}[\log f(w \mid y; \xi) \mid y].

Instead of maximizing \ell_{obs}(\xi) for \xi, rather Q(\xi \mid \xi^{(k-1)}) is maximized. It is a general result from EM methodology that if \xi^{(k)} maximizes Q(\xi \mid \xi^{(k-1)}) then \ell_{obs}(\xi^{(k)}) \geq \ell_{obs}(\xi^{(k-1)}), i.e., \xi^{(k)} is 'better' than \xi^{(k-1)}. This property is central in Dempster et al. (1977). In order to maximize Q(\xi \mid \xi^{(k-1)}), we alternate between an expectation and a maximization step. The details are as follows.

3.1. Expectation step

The expected value Q(\xi \mid \xi^{(k-1)}) can be obtained by plugging in the conditional expectations for w_i and \exp(w_i) in \log f(y, w; \xi). With \psi(\cdot) the digamma function, these are given by

E_k(w_i) = E_{\xi^{(k-1)}}[w_i \mid y] = \psi(D_i + 1/\theta^{(k-1)}) - \log(1/\theta^{(k-1)} + H_i^{(k-1)})

and

E_k(\exp(w_i)) = E_{\xi^{(k-1)}}[\exp(w_i) \mid y] = \frac{1/\theta^{(k-1)} + D_i}{1/\theta^{(k-1)} + H_i^{(k-1)}},

where

H_i^{(k-1)} = \sum_{j=1}^{n_i} H_0^{(k-1)}(t_{ij})\exp(x_{ij}^t \beta^{(k-1)}) \quad (4)

with

H_0^{(k-1)}(t) = \sum_{t_{(\ell)} \leq t} h_{\ell 0}^{(k-1)}

and

h_{\ell 0}^{(k-1)} = \frac{N_{(\ell)}}{\sum_{t_{qs} \geq t_{(\ell)}} \exp(x_{qs}^t \beta^{(k-1)}) E_{k-1}(\exp(w_s))}.

With t_{(1)} < \cdots < t_{(r)} the ordered event times (r denotes the number of different event times), N_{(\ell)} is the number of events at time t_{(\ell)}, \ell = 1, \ldots, r (see Sections 8.2 (theoretical notes) and 8.3 in Klein and Moeschberger (1997) to get good insight). As starting values we use an initial guess \xi^{(0)} = (\beta^{(0)}, \theta^{(0)}).

These expected values need to be inserted in

\log f(y, w; \xi) = \sum_{i=1}^{G} \log f_W(w_i) + \sum_{i=1}^{G} \sum_{j=1}^{n_i} \left[\delta_{ij}[\log h_0(t_{ij}) + x_{ij}^t \beta + w_i] - H_0(t_{ij})\exp(x_{ij}^t \beta + w_i)\right]. \quad (5)

Plugging in the density f_W(\cdot) corresponding to the gamma frailty, and replacing in (5) w_i and \exp(w_i) by their conditional expectations, we obtain after rearranging some of the terms

Q(\xi \mid \xi^{(k-1)}) = Q_1(\xi \mid \xi^{(k-1)}) + Q_2(\xi \mid \xi^{(k-1)})

with

Q_1(\xi \mid \xi^{(k-1)}) = \sum_{i=1}^{G} \left[(1/\theta + D_i) E_k(w_i) - E_k(\exp(w_i))/\theta - D_i \log E_k(\exp(w_i)) - \log(\theta)/\theta - \log\Gamma(1/\theta)\right]

and

Q_2(\xi \mid \xi^{(k-1)}) = \sum_{i=1}^{G} \sum_{j=1}^{n_i} \left[\delta_{ij}[\log h_0(t_{ij}) + x_{ij}^t \beta + \log E_k(\exp(w_i))] - H_0(t_{ij})\exp(x_{ij}^t \beta) E_k(\exp(w_i))\right].
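Numerically, the E-step therefore reduces to two vectorized assignments over the centers. A minimal sketch (Di holds the event counts D_i, Hi the totals in (4), and theta the provisional θ^{(k-1)}; the names are illustrative, not the authors' code):

    # E-step for the gamma frailty: conditional expectations per center.
    Eu <- (1/theta + Di) / (1/theta + Hi)            # E_k(exp(w_i))
    Ew <- digamma(1/theta + Di) - log(1/theta + Hi)  # E_k(w_i)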

3.2. Maximization step

The reason why we artificially include the term \sum_{j=1}^{n_i} \delta_{ij} \log E_k(\exp(w_i)) in the definitions of Q_1(\xi \mid \xi^{(k-1)}) and Q_2(\xi \mid \xi^{(k-1)}), with, respectively, a minus sign and a plus sign, is that we now can interpret Q_2(\xi \mid \xi^{(k-1)}) as a full censored data log likelihood with (\beta^t, 1)^t as regression coefficients and (x_{ij1}, \ldots, x_{ijp}, \log E_k(\exp(w_i))) as risk variables. For the case of an unspecified baseline (cumulative) hazard function, we use the profile likelihood idea (fix \beta and maximize the full censored data log likelihood as a function of the baseline hazard only), in combination with Breslow's method to handle ties, to replace this term by

\ell_{part}(\xi \mid \xi^{(k-1)}) = \sum_{\ell=1}^{r} \left[\sum_{t_{ij} = t_{(\ell)}} \eta_{ij}^{(k-1)} - N_{(\ell)} \log\left(\sum_{t_{qs} \geq t_{(\ell)}} \exp(\eta_{qs}^{(k-1)})\right)\right],

where

\eta_{ij}^{(k-1)} = x_{ij}^t \beta + \log E_k(\exp(w_i)).

So we maximize, with respect to \xi = (\beta, \theta),

Q_1(\xi \mid \xi^{(k-1)}) + \ell_{part}(\xi \mid \xi^{(k-1)}). \quad (6)

The new estimate \xi^{(k)} for the kth iteration is then used to obtain Q(\xi \mid \xi^{(k)}), the update of the conditional expectation, and so on. This process continues until the difference between the two consecutive values E_{\xi^{(k-1)}}[\log f(y, w; \xi) \mid y] and E_{\xi^{(k)}}[\log f(y, w; \xi) \mid y] becomes smaller than some prespecified value \varepsilon (see Flow Chart 1 in Appendix A).
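The β-part of this maximization step is an ordinary Cox fit in which log E_k(exp(w_i)) enters as a fixed, known offset. One way this could be coded — a sketch, not the authors' implementation; a data frame dat with columns time, status, x and center is assumed, with Eu the E-step vector from above:

    # M-step for beta: Cox partial likelihood with the logarithm of the
    # expected frailties as offset, Breslow's method for ties.
    library(survival)
    fit <- coxph(Surv(time, status) ~ x + offset(log(Eu[center])),
                 data = dat, method = "breslow")
    beta <- coef(fit)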

Remark 1. Nielsen et al. (1992) propose a modified profile likelihood EM algorithm that leads to the same results but converges faster.

4. The penalized partial likelihood for shared frailty models

From (5) and (6) it is clear that a logical proposal for the likelihood to use for the estimation of \xi = (\beta, \theta) is the penalized partial likelihood

\ell_{ppl}(\xi, w) = \ell_{part}(\xi, w) - \ell_{pen}(\xi, w), \quad (7)

where, with \eta_{ij} = x_{ij}^t \beta + w_i,

\ell_{part}(\xi, w) = \sum_{\ell=1}^{r} \left[\sum_{t_{ij} = t_{(\ell)}} \eta_{ij} - N_{(\ell)} \log\left(\sum_{t_{qs} \geq t_{(\ell)}} \exp(\eta_{qs})\right)\right]

and

\ell_{pen}(\xi, w) = -\sum_{i=1}^{G} \log f_W(w_i).

(a) For random effects w_i, i = 1, \ldots, G, having a zero-mean normal density, we have (now think and write \theta for \sigma^2)

\ell_{pen}(\xi, w) = \frac{1}{2} \sum_{i=1}^{G} \left[\frac{w_i^2}{\theta} + \log(2\pi\theta)\right]

and (7) becomes the penalized partial likelihood studied in McGilchrist (1993).

The frailties are thus in both parts of the penalized partial likelihood. The second term penalizes random effects that are far away from the mean value zero by reducing the penalized partial likelihood. This corresponds to shrinking the random effects towards the zero mean.

The maximization of the penalized partial (log)likelihood consists of an inner and an outer loop. In the inner loop the Newton–Raphson procedure is used to maximize, for a provisional value of \theta, \ell_{ppl}(\xi, w) for \beta and w (best linear unbiased predictors, BLUPs). In the outer loop, the restricted maximum likelihood estimator for \theta is obtained using the BLUPs. The process is iterated until convergence.

The details are as follows. Let \ell denote the outer loop index and k the inner loop index.

Let \theta^{(\ell)} be the estimate for \theta at the \ellth iteration in the outer loop. Given \theta^{(\ell)}, \beta^{(\ell,k)} and w^{(\ell,k)} are the estimates and predictions for \beta and w at the kth iterative step in the inner loop.

Starting from initial values \beta^{(1,0)}, w^{(1,0)} and \theta^{(1)}, the kth iterative step of Newton–Raphson, given \theta^{(\ell)}, is

\begin{bmatrix} \beta^{(\ell,k)} \\ w^{(\ell,k)} \end{bmatrix} = \begin{bmatrix} \beta^{(\ell,k-1)} \\ w^{(\ell,k-1)} \end{bmatrix} - V^{-1}\begin{bmatrix} 0 \\ [\theta^{(\ell)}]^{-1} w^{(\ell,k-1)} \end{bmatrix} + V^{-1}[X\ Z]^t \frac{d\ell_{part}(\xi, w)}{d\eta},

where

V = \begin{bmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{bmatrix} = \begin{bmatrix} X^t \\ Z^t \end{bmatrix}\left(-\frac{d^2\ell_{part}(\xi, w)}{d\eta\, d\eta^t}\right)[X\ Z] + \begin{bmatrix} 0 & 0 \\ 0 & [\theta^{(\ell)}]^{-1} I_G \end{bmatrix}.

Once the Newton–Raphson procedure has converged for the current value of \theta^{(\ell)}, a REML estimate for \theta is given by

\theta^{(\ell+1)} = \frac{\sum_{i=1}^{G} (w_i^{(\ell,k)})^2}{G - r},

where r = \mathrm{trace}[(V^{-1})_{22}]/\theta^{(\ell)}. This outer loop is iterated until convergence of the sequential values of \theta is obtained.
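The REML update itself is a one-liner; a sketch under the same notation (w holding the current BLUPs w^{(ℓ,k)} and Vinv22 the lower-right G × G block of V^{-1}; names are illustrative):

    # Outer-loop REML update for theta (normal frailty density):
    # r = trace[(V^{-1})_{22}]/theta and theta_new = sum(w_i^2)/(G - r).
    r.df  <- sum(diag(Vinv22)) / theta
    theta <- sum(w^2) / (G - r.df)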

(b) For random effects w_i, i = 1, \ldots, G, with the corresponding one-parameter gamma density for the frailties, we have (the generic heterogeneity parameter now being \theta itself)

\ell_{pen}(\xi, w) = -\frac{1}{\theta}\sum_{i=1}^{G} (w_i - \exp(w_i)) + G\left(\frac{\log\theta}{\theta} + \log\Gamma(1/\theta)\right).

To maximize the penalized partial (log)likelihood we can still use an inner and an outer loop. The inner loop is identical to the one described in (a). No REML estimate, however, is available for \theta as in the normal density case. Therefore, in the outer loop, a likelihood similar to \ell_{obs}(\cdot) is maximized for \theta, as in the case of the EM algorithm. This likelihood, for a fixed value \theta^{(\ell)}, also corresponds to Q(\xi \mid \xi^{(k-1)}) - H(\xi \mid \xi^{(k-1)}) evaluated at (\theta^{(\ell)}, \beta_{\theta^{(\ell)}}, w_{\theta^{(\ell)}}), where \beta_{\theta^{(\ell)}} and w_{\theta^{(\ell)}} are the estimates obtained by maximizing the penalized partial likelihood for the fixed value \theta^{(\ell)}. This expression is in terms of the partial likelihood, which contains the expectations of the frailties as fixed offset terms.

We therefore apply, in the outer loop, the golden section method to the following modified version of the observable likelihood:

\ell_{obs}^{part}(\theta, w) = \ell_{part}(\xi, w) + \sum_{i=1}^{G} \left[\log\frac{\Gamma(D_i + 1/\theta)}{\Gamma(1/\theta)} + \frac{1}{\theta}\log\left(\frac{1/\theta}{H_i + 1/\theta}\right) - D_i \log(D_i + 1/\theta) + D_i\right]. \quad (8)

That \ell_{obs}^{part}(\theta, w) is a modified version of \ell_{obs}(\xi) can be seen as follows. With H_i = \sum_{j=1}^{n_i} H_0(t_{ij})\exp(x_{ij}^t \beta), the observable likelihood \ell_{obs}(\xi) given by (3) can be written as

\ell_{obs}(\xi) = \sum_{i=1}^{G} \sum_{j=1}^{n_i} \left[\delta_{ij}[\log h_0(t_{ij}) + x_{ij}^t \beta + \log E(\exp(w_i))] - H_0(t_{ij})\exp(x_{ij}^t \beta) E(\exp(w_i))\right]

+ \sum_{i=1}^{G} \left[D_i \log\theta + \log\frac{\Gamma(D_i + 1/\theta)}{\Gamma(1/\theta)} - (D_i + 1/\theta)\log[\theta(H_i + 1/\theta)] - D_i \log E(\exp(w_i)) + H_i E(\exp(w_i))\right]. \quad (9)

Now modify \ell_{obs}(\xi) by replacing the first (double) sum on the r.h.s. of (9) by \ell_{part}(\xi, w). Using the relations E(\exp(w_i)) = (D_i + 1/\theta)/(H_i + 1/\theta) and \sum_{i=1}^{G} D_i = \sum_{i=1}^{G} H_i E(\exp(w_i)) (martingale residuals sum to zero), it easily follows that the second sum on the r.h.s. of (9) equals the sum on the r.h.s. of (8). Flow Chart 2 in the appendix gives the details.
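Only the θ-dependent terms of (8) have to be evaluated in the outer loop. A sketch of that objective, with Di and Hi as above (illustrative code, not the authors' exact implementation):

    # theta-dependent terms of l_obs^part(theta, w) in (8).
    lpartobs.terms <- function(theta, Di, Hi) {
      sum(lgamma(Di + 1/theta) - lgamma(1/theta)
          + (1/theta) * log((1/theta) / (Hi + 1/theta))
          - Di * log(Di + 1/theta) + Di)
    }
    # One-dimensional search over theta, e.g. by golden section:
    # optimize(lpartobs.terms, c(1e-6, 5), Di = Di, Hi = Hi, maximum = TRUE)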

Remark 2. For the gamma case an important property, stated in Therneau and Grambsch (2000, p. 254), is that, for any fixed value of \theta, the solution of the penalized partial likelihood maximization coincides with the solution based on the EM algorithm (discussed in Section 3). Since S-Plus contains a fast algorithm for the penalized partial likelihood procedure, this property is very important from a practical point of view. Also note that in this case there is no need to estimate the cumulative baseline hazard, since we use \ell_{obs}^{part}(\cdot, \cdot) in the outer loop.

Remark 3. The penalized partial likelihood approach has been implemented in the coxph function of S-Plus; a frailty term can be added to the Cox model. Using the default parameters for the convergence criterion \varepsilon and the maximum number of outer loops, however, the iteration procedure will terminate before reaching the maximum likelihood parameters. Therefore, \varepsilon (parameter 'eps' in the frailty function) needs to be reset to a lower value, and the maximum number of outer loops (parameter 'outer.max') needs to be increased. We obtained the maximum likelihood estimates in the simulations by setting \varepsilon = 10^{-10} and outer.max = 50.
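A call along the following lines would reproduce these settings (a sketch; the variable names time, status, treat, center and trial are illustrative):

    # Gamma frailty model via the penalized partial likelihood,
    # with the tightened convergence settings described above.
    fit <- coxph(Surv(time, status) ~ treat +
                   frailty(center, distribution = "gamma",
                           eps = 1e-10, outer.max = 50),
                 data = trial)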

Remark 4. A very direct approach is to maximize \ell_{obs}^{part}(\cdot, \cdot) simultaneously for (\beta, \theta, w). This leads to exactly the same solution as the penalized partial likelihood approach. For fixed \theta, the difference between \ell_{obs}^{part}(\cdot, \cdot) and \ell_{ppl}(\cdot, \cdot) only contains terms not depending on \beta or w, so that maximization of the likelihood based on one or the other for fixed \theta necessarily gives the same answer. On the other hand, the penalized likelihood approach also uses \ell_{obs}^{part}(\cdot, \cdot) in its outer loop, so that again this leads to the same solution. The main advantage of this approach is that a general nonlinear optimizer can be used to find estimates for \beta and \theta and, if needed, w. Flow Chart 3 in the appendix gives the details.

5. Simulations

5.1. Description of simulations

In this section, the relationship between the size of a multicenter trial and the bias and spread of estimates of \theta around its true value is studied by simulation. The size of a multicenter trial is a function of the number of centers, G, and the number of patients per center, n_i, i = 1, \ldots, G. These two parameters are studied separately in the simulation. Parameters that might influence this relationship are the event rate h_0(t) (assumed constant over time: h_0(t) = h_0), the size of the true heterogeneity parameter \theta, and the treatment effect \gamma, expressed in terms of the hazard ratio HR = \exp(\gamma). Therefore, the effect of these parameters is also studied in this simulation.

The choice of the parameter values has been determined by the clinical setting in which we want to study treatment outcome, namely breast cancer clinical trials. The number of centers typically varies from 15 to 30, and for the number of patients per center we take 20, 40 and 60. We study two different types of breast cancer trials: early breast cancer clinical trials, with a typically low event rate, set here at a yearly constant event rate h_0 = 0.07 (median time to event 9.9 years), and metastatic breast cancer clinical trials, with a typically high event rate, set here at a yearly constant event rate h_0 = 0.22 (median time to event 3.15 years). We assume an accrual period of 5 years (with constant accrual rate) and a further follow-up period of 3 years, which corresponds to a typical early breast cancer setting. The time at risk for a particular patient, rt_{ij}, thus consists of the time at risk before the end of the accrual period (ranging from 0 to 5 years) plus the follow-up time. This results in approximately 30% and 70% of the patients having the event by the end of the study in the early breast cancer and the metastatic breast cancer clinical trial, respectively, with the remaining patients censored.

As true values for the heterogeneity parameter, 0, 0.1 and 0.2 are used, as this is the most likely range of values to be observed in breast cancer clinical trials. In the example presented in Section 6, for instance, the heterogeneity parameter is estimated to be 0.092, and for all the breast cancer trials that we studied, values in this range are found. The effect of the heterogeneity parameter \theta on the spread of the median time to event from center to center is demonstrated by the density function of the median time to event over the centers (Fig. 1).

[Fig. 1. The spread of the median time to event from center to center for \theta = 0.1 and 0.2 and h_0 = 0.07 and 0.22.]

It can easily be shown that this density function f_{T_M}(t) is given by

f_{T_M}(t) = \left(\frac{\log 2}{\theta h_0 HR}\right)^{1/\theta} \frac{1}{\Gamma(1/\theta)} \left(\frac{1}{t}\right)^{1+1/\theta} \exp\left(-\frac{\log 2}{\theta t h_0 HR}\right).

For instance, for \theta = 0.1 and HR = 1, 90% of the centers would have a median time to event between 2 and 5.6 years for h_0 = 0.22, and between 6.4 and 17.6 years for h_0 = 0.07. Finally, for the treatment effect, we use hazard ratios of 1 and 1.3, as a value of 1.3 is often taken to be the expected effect of the standard as compared to the new treatment in cancer clinical trials.
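These center-to-center ranges can be read off from the gamma quantiles, since the median time to event in a center with frailty u is log(2)/(h_0 HR u). A sketch of the check (same rgamma/qgamma parameterization as assumed earlier):

    # 90% central range of the median time to event over centers
    # (theta = 0.1, HR = 1).
    theta <- 0.1
    u <- qgamma(c(0.95, 0.05), shape = 1/theta, scale = theta)
    log(2) / (0.22 * u)   # range for h0 = 0.22 (compare the values quoted above)
    log(2) / (0.07 * u)   # range for h0 = 0.07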

For each parameter setting, 6500 data sets are generated, and for each of them the heterogeneity parameter \theta of the frailty model with the treatment effect as covariate is estimated by the relevant technique; the P-value of the log likelihood ratio test for H_0: \theta = 0 is assessed as well, based on a chi-square distribution with one degree of freedom (Andersen et al., 1993). We have chosen 6500 iterations in order to have a probability of 90% that the simulated mean does not differ by more than 0.001 from the value that would be obtained if the number of simulations went to infinity. The simulations are quite computer intensive and we believe that this precision is adequate for our purposes. From this list of 6500 values of \theta, the 5%, 50% and 95% quantiles and the mean are determined to describe the spread and the bias of the estimate around its true value. Additionally, the percentage of data sets with a significant likelihood ratio test at the 5% significance level is determined. For \theta = 0 this percentage corresponds to the size of the test, whereas for \theta > 0 it corresponds to the power of the test at a nominal significance level of 5%.

Given a particular parameter setting (G, n, h_0, \theta, HR), the observations of a data set are generated in the following way. First, G random center effects w_1, \ldots, w_G are generated from the appropriate frailty density. The time to event for each patient, et_{ij}, is randomly generated from an exponential distribution (rexp function in S-Plus) with parameter h_{ij} for the jth patient from center i given by

h_{ij} = h_0 \exp(\gamma x_{ij} + w_i).

A patient for whom the time to event is longer than the time at risk is censored, with time to censoring equal to the time at risk, so that t_{ij} = \min(rt_{ij}, et_{ij}) and \delta_{ij} = I(rt_{ij} \geq et_{ij}) is the censoring indicator.
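A sketch of this generating step for one data set, assuming 1:1 randomization to the two treatment arms and \theta > 0 (names G, npc, h0, theta and gam = log(HR) are illustrative):

    # One simulated data set: gamma frailties, exponential event times,
    # administrative censoring at rt = accrual remainder + 3 years follow-up.
    w <- log(rgamma(G, shape = 1/theta, scale = theta))   # center effects w_i
    center <- rep(1:G, each = npc)
    x  <- rbinom(G * npc, 1, 0.5)                         # treatment indicator
    et <- rexp(G * npc, rate = h0 * exp(gam * x + w[center]))
    rt <- runif(G * npc, 0, 5) + 3                        # time at risk rt_ij
    time   <- pmin(rt, et)
    status <- as.numeric(et <= rt)                        # censoring indicator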

5.2. Results of simulations

We only simulated data from the gamma frailty density (2) and fitted the corresponding frailty model. All techniques described in the previous sections lead to virtually the same results for the parameter settings studied here, which follows as well from theoretical considerations: Therneau and Grambsch (2000, p. 254) have shown that the EM algorithm leads to the same results as the penalized likelihood approach, and in Remark 4 we show that the same results are obtained for the simultaneous approach. Therefore, the simulations are based on the penalized likelihood approach, as this is available as a standard S-Plus function (see Remark 3). As an extension to this, the likelihood ratio test is based on \ell_{obs}^{part}(\cdot, \cdot).

5.2.1. Bias

In all cases studied for \theta = 0.1 and 0.2, the estimate of \theta is biased downward, as the mean of the heterogeneity parameter estimates from the simulated data sets is smaller than the true parameter value. For a fixed number of centers, either 15 or 30, there is little effect on bias when increasing the patient numbers (Table 1). Increasing the number of centers from 15 to 30 reduces the bias, especially when \theta = 0.2 and less so for \theta = 0.1. Part of this downward bias can be explained by the fact that heterogeneity is estimated in a model with the treatment effect included as a fixed effect, and no corrections are made for this effect similar to the REML estimation used for the normal frailty density. Additional simulations (data not shown) seem to confirm this, but it needs to be studied in more detail.


5.2.2. 90% interquantile range

The 90% interquantile range decreases substantially with increasing number of patients per center and with increasing number of centers, whatever the parameter setting. A pronounced effect is observed when increasing the number of patients per center up to 40. When increasing the number of patients per center further, the gain in terms of a smaller interquantile range is only marginal in most cases (Table 1). The interquantile range shrinks considerably when increasing the number of centers from 15 to 30.

For a fixed total sample size of 600 patients, distributed either over 15 centers or over 30 centers, the results depend on the event rate and the true heterogeneity parameter \theta. With a low event rate and \theta = 0.1, it is preferable to have only 15 centers with 40 patients each, whereas with a high event rate, a trial with 30 centers and only 20 patients per center leads to a smaller interquantile range. For \theta = 0.2 it is better for both event rates to have 30 centers and only 20 patients per center (Table 1). The hazard ratio has little effect on the 90% interquantile range. A higher event rate, however, reduces the interquantile range substantially, especially for \theta = 0.1 and for the smaller trials with fewer patients (Table 1).

5.2.3. Size and power

In all cases, the size of the likelihood ratio test stays well below 5% and comes somewhat closer to 5% for larger trials (Table 1).

With respect to power based on the likelihood ratio test, it is obvious that a trial with 15 centers and only 20 patients per center is inadequate when the event rate is low, or when \theta = 0.1 with a high event rate, as in these cases the power stays well below 0.8. When increasing the sample size to 40 patients per center, the trial also has power below 0.8 when \theta = 0.1 and h_0 = 0.07, regardless of the value of HR; further increasing the number of patients to 60 with \theta = 0.1, h_0 = 0.07 and HR = 1 will only result in a power of 0.74. The power also stays well below 0.8 with 20 patients per center and 30 centers if the event rate is low and \theta = 0.1. In all other cases the power is above 0.8 (Table 1).

The power can also be derived from the simulated results alone by determining the percentage of data sets for which the estimate of the heterogeneity parameter is above the 95% quantile for \theta = 0. For instance, 42.5% of the data sets with 15 centers and 20 patients per center generated with parameters h_0 = 0.07, \theta = 0.1 and HR = 1.3 have an estimated heterogeneity parameter \theta above 0.086, the 95% quantile for the same setting but with \theta = 0. The simulated power, however, is in most instances much higher than the likelihood ratio power obtained from an asymptotic \chi^2(1) distribution. Thus, the likelihood ratio test is often too conservative. This is due to the fact that the value of \theta in the null hypothesis is at the boundary of the parameter space, a fact well known in mixed models (Stram and Lee, 1994; Self and Liang, 1987). Therefore, the approximation of the log likelihood ratio statistic by a chi-square distribution with one degree of freedom is inadequate (see Fig. 2). Rather, a mixture of chi-square distributions with one and zero degrees of freedom should be used. This will result in a decrease of the critical value of the log likelihood ratio statistic.


Table 1. Simulation results for HR = 1 and 1.3 and event rate = 0.07 and 0.22 (columns: number of patients per center/number of centers)

θ      Quant.     20/15   40/15   60/15   20/30   40/30   60/30

HR = 1, event rate = 0.07
0      5%         0.000   0.000   0.000   0.000   0.000   0.000
       50%        0.000   0.000   0.000   0.000   0.000   0.000
       95%        0.098   0.047   0.032   0.068   0.031   0.021
       Mean       0.018   0.009   0.006   0.014   0.006   0.004
       Size       0.009   0.013   0.017   0.017   0.019   0.019
0.1    5%         0.000   0.000   0.012   0.000   0.025   0.035
       50%        0.066   0.078   0.082   0.085   0.089   0.092
       95%        0.265   0.208   0.191   0.217   0.177   0.165
       Mean       0.088   0.088   0.089   0.094   0.094   0.095
       PowerLR    0.197   0.527   0.740   0.424   0.836   0.963
       PowerSim   0.371   0.692   0.866   0.607   0.929   0.985
0.2    5%         0.000   0.040   0.058   0.047   0.084   0.096
       50%        0.163   0.167   0.176   0.178   0.181   0.185
       95%        0.425   0.368   0.356   0.357   0.317   0.312
       Mean       0.181   0.180   0.187   0.187   0.189   0.192
       PowerLR    0.502   0.870   0.949   0.800   0.988   1.000
       PowerSim   0.708   0.938   0.985   0.911   0.999   1.000

HR = 1, event rate = 0.22
0      5%         0.000   0.000   0.000   0.000   0.000   0.000
       50%        0.000   0.000   0.000   0.000   0.000   0.000
       95%        0.044   0.021   0.015   0.030   0.015   0.010
       Mean       0.008   0.004   0.003   0.006   0.003   0.002
       Size       0.012   0.015   0.015   0.014   0.019   0.019
0.1    5%         0.000   0.020   0.030   0.028   0.043   0.048
       50%        0.081   0.084   0.086   0.089   0.092   0.094
       95%        0.210   0.182   0.174   0.171   0.163   0.153
       Mean       0.090   0.090   0.092   0.094   0.096   0.096
       PowerLR    0.568   0.890   0.967   0.867   0.998   1.000
       PowerSim   0.743   0.947   0.987   0.942   1.000   1.000
0.2    5%         0.045   0.066   0.074   0.088   0.105   0.108
       50%        0.169   0.175   0.175   0.185   0.186   0.188
       95%        0.353   0.337   0.326   0.318   0.293   0.290
       Mean       0.180   0.185   0.185   0.192   0.192   0.193
       PowerLR    0.889   0.996   0.999   0.995   1.000   1.000
       PowerSim   0.953   0.998   0.999   0.998   1.000   1.000

HR = 1.3, event rate = 0.07
0      5%         0.000   0.000   0.000   0.000   0.000   0.000
       50%        0.000   0.000   0.000   0.000   0.000   0.000
       95%        0.086   0.042   0.027   0.061   0.030   0.020
       Mean       0.016   0.008   0.005   0.012   0.006   0.004
       Size       0.012   0.016   0.017   0.014   0.022   0.021
0.1    5%         0.000   0.000   0.015   0.000   0.027   0.038
       50%        0.070   0.078   0.082   0.086   0.089   0.091
       95%        0.245   0.200   0.188   0.207   0.175   0.162
       Mean       0.088   0.087   0.090   0.093   0.093   0.095
       PowerLR    0.254   0.568   0.802   0.478   0.881   0.983
       PowerSim   0.425   0.751   0.904   0.654   0.941   0.991
0.2    5%         0.000   0.041   0.056   0.054   0.087   0.098
       50%        0.156   0.169   0.172   0.180   0.186   0.186
       95%        0.424   0.360   0.344   0.352   0.317   0.303
       Mean       0.179   0.181   0.183   0.188   0.192   0.191
       PowerLR    0.556   0.870   0.969   0.879   0.995   1.000
       PowerSim   0.737   0.945   0.986   0.937   0.999   1.000

HR = 1.3, event rate = 0.22
0      5%         0.000   0.000   0.000   0.000   0.000   0.000
       50%        0.000   0.000   0.000   0.000   0.000   0.000
       95%        0.041   0.020   0.014   0.029   0.015   0.009
       Mean       0.007   0.004   0.003   0.006   0.003   0.002
       Size       0.011   0.013   0.014   0.018   0.022   0.022
0.1    5%         0.002   0.023   0.030   0.029   0.045   0.050
       50%        0.080   0.084   0.086   0.090   0.093   0.093
       95%        0.203   0.180   0.171   0.176   0.157   0.151
       Mean       0.089   0.091   0.092   0.095   0.095   0.096
       PowerLR    0.614   0.893   0.975   0.892   0.999   1.000
       PowerSim   0.770   0.961   0.990   0.953   0.999   1.000
0.2    5%         0.048   0.072   0.075   0.089   0.107   0.110
       50%        0.171   0.176   0.176   0.186   0.188   0.190
       95%        0.361   0.333   0.333   0.317   0.298   0.295
       Mean       0.184   0.186   0.186   0.192   0.193   0.194
       PowerLR    0.899   0.992   0.998   0.995   1.000   1.000
       PowerSim   0.959   0.998   1.000   0.998   1.000   1.000

[Fig. 2. The estimated density function of the log likelihood ratio statistic from the simulated data of the two different settings (G = 30, n = 60, h_0 = 0.22 and G = 15, n = 20, h_0 = 0.07). The other two lines correspond to the density functions of \chi^2_0 and \chi^2_1.]

6. A case study: the peri-operative trial

This research was undertaken because we wanted to study heterogeneity in an early breast cancer clinical trial comparing peri-operative chemotherapy with post-operative chemotherapy (Clahsen et al., 1996). In the previous section, the effect of the number of centers and the number of patients per center was studied using an equal number of patients in each center. In practice, however, a multicenter trial typically contains a few large centers and many rather small centers. For example, the peri-operative trial consists of 15 centers, with the following numbers of patients per center, sorted according to size: 6, 19, 25, 39, 48, 53, 54, 60, 78, 184, 185, 206, 311, 622, 902. In this section, we therefore use the actual distribution of the patients over the centers observed in the peri-operative trial. The heterogeneity parameter \theta, the treatment effect HR and the baseline hazard h_0 were estimated to be 0.092, 1.16 and 0.07, respectively. Furthermore, the accrual time and follow-up time equal 5 and 3 years (these values are used in the simulation).

Assuming that \theta = 0, we simulated again 6500 data sets and studied the spread of the estimates of \theta. The 95% quantile is given by 0.007. Therefore, the observed value \theta = 0.092 contradicts the hypothesis that there is no heterogeneity between the centers. The power based on the likelihood ratio test, resp. the simulated power, equals 0.94, resp. 0.98, for \theta = 0.1, and 1.00, resp. 1.00, for \theta = 0.2. Therefore, this trial is adequate for treatment outcome research and for further investigating sources of heterogeneity.
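The decision rule used here fits in two lines (theta.null holding the 6500 estimates simulated under \theta = 0 and theta.obs the estimate from the trial; illustrative names):

    # Reject 'no heterogeneity' when the observed estimate exceeds the
    # 95% quantile of the simulated null distribution.
    crit <- quantile(theta.null, 0.95)   # 0.007 for the peri-operative design
    theta.obs > crit                     # 0.092 > 0.007: heterogeneity present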

7. Conclusions

It is important, before embarking on studying the heterogeneity between centers in multicenter trials and drawing clinical conclusions, to ensure that the multicenter trial is, in terms of number of centers and number of patients per center, sufficiently large, so that the estimated heterogeneity parameter actually describes true heterogeneity between centers and not just random variability. The type of simulations described here can be used to that end. It is found that the number of centers and the event rate play a crucial role in determining whether a multicenter trial is adequately sized. On the other hand, once a certain number of patients per center has been reached, adding more patients to a center does not substantially increase the power of the trial anymore. For the cases simulated here, this value is close to 40 patients per center. For each particular situation, a simulation of the type described in Section 5 can be done to determine whether the estimate of the heterogeneity parameter has a probability lower than 5% of coming from a multicenter trial with \theta = 0. Furthermore, the interquantile range and the bias can be studied for different values of \theta, and an investigator can then take a decision whether or not to use this multicenter trial for treatment outcome research.

Note that we used constant hazard rates in our simulations. This is an acceptable assumption for the example of Section 6, where an exponential hazard rate is not rejected when compared to a Weibull hazard rate (based on a likelihood ratio test). This assumption might not hold in other situations. Furthermore, only the gamma frailty density has been used. It might be of interest to consider other situations and to use frailty densities different from the gamma density. Although the findings from the simulations are necessarily limited to the parameter settings studied, the conclusions are highly relevant for breast cancer clinical trials. For other tumor types or diseases different parameter settings might be relevant, which can lead to different design requirements.

Acknowledgements

Paul Janssen received support by grant BIL00/28 (International Scientific and Technological Cooperation) from the Flemish community. Rosemary Nguti received partial support from the Belgian Technological Cooperation (BTC).

Appendix A

Flow Chart 1. The EM algorithm approach. \beta are the estimates for the regression coefficients in the classical Cox regression model (without frailties) and H = (H_1, \ldots, H_G), where H_i is obtained from (4) with \beta^{(0)} and E_0(\exp(w_i)) = 1. E_k(w) = (E_k(w_1), \ldots, E_k(w_G)), etc. See Section 3 for further details.


Flow Chart 2. The penalized likelihood approach. For details, see Section 4.

Flow Chart 3. The simultaneous likelihood maximization approach. For details, seeRemark 4.

References

Andersen, P.K., Borgan, O., Gill, R.D., Keiding, N., 1993. Statistical Models Based on Counting Processes. Springer, New York, p. 663.
Clahsen, P.C., van de Velde, C.J., Julien, J.P., Floiras, J.L., Delozier, T., Mignolet, F.Y., Sahmoud, T.M., 1996. Improved local control and disease-free survival after preoperative chemotherapy for early-stage breast cancer. A European Organization for Research and Treatment of Cancer Breast Cancer Cooperative Group study. J. Clin. Oncol. 14, 745–753.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. B 39, 1–38.
Harding, M.J., Paul, J., Gilles, C.R., Kaye, S.B., 1993. Management of malignant teratoma: does referral to a specialist unit matter? Lancet 341, 999–1002.
Hillner, B.E., Smith, T.J., Desch, C.E., 2000. Hospital and physician volume or specialization and outcomes in cancer treatment: importance in quality of cancer care. J. Clin. Oncol. 18, 2327–2340.
Klein, J.P., 1992. Semiparametric estimation of random effects using the Cox model based on the EM algorithm. Biometrics 48, 795–806.
Klein, J.P., Moeschberger, M.L., 1997. Survival Analysis: Techniques for Censored and Truncated Data. Springer, New York.
McGilchrist, C.A., 1993. REML estimation for survival models with frailty. Biometrics 49, 221–225.
Murphy, S.A., 1994. Consistency in a proportional hazards model incorporating a random effect. Ann. Statist. 22, 712–731.
Murphy, S.A., 1995. Asymptotic theory for the frailty model. Ann. Statist. 23, 182–198.
Nielsen, G.G., Gill, R.D., Andersen, P.K., Sorensen, T.I.A., 1992. A counting process approach to maximum likelihood estimation in frailty models. Scand. J. Statist. 19, 25–43.
Parner, E., 1998. Asymptotic theory for the correlated gamma-frailty model. Ann. Statist. 26, 183–214.
Self, S.G., Liang, K.-Y., 1987. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Amer. Statist. Assoc. 82, 605–610.
Stram, D.O., Lee, J.W., 1994. Variance components testing in longitudinal mixed effects models. Biometrics 50, 1171–1177.
Therneau, T.M., Grambsch, P.M., 2000. Modeling Survival Data: Extending the Cox Model. Springer, New York.
Yamaguchi, T., Ohashi, Y., 1999. Investigating centre effects in a multi-centre clinical trial of superficial bladder cancer. Statist. Med. 18, 1961–1971.