Small Sample Asymptotics for Type I Censored Data


Gabriela Damilano¹ and Pedro Puig²*

¹ Unitat d’Estadística, Departament de Matemàtiques, Universitat Autònoma de Barcelona, 08193 Cerdanyola del Valles (Barcelona), Spain.

² Servei d’Estadística de la UAB, Edifici D, Universitat Autònoma de Barcelona, 08193 Cerdanyola del Valles (Barcelona), Spain.

Summary

For Location and Scale models with Type I Censored Data the estimation of the parameters based on likelihood is analyzed. When the sample size is very small the usual procedures for inference based on the asymptotic distribution of the statistics do not function properly. We develop higher-order asymptotic methods and their performance is investigated by Monte Carlo experiments.

Key words: Location and Scale Models; Type I Censored Data; Saddlepoint Approximation; Normal distribution; Extreme value distribution.

1. Introduction

Nowadays, modern society demands that the use of animals in scientific experiments be adequately controlled and reduced. This position is enforced by ethical committees, which authorize or deny permission for this kind of experiment. As a consequence the statistician sometimes has to work with very small sample sizes.

The problem becomes serious when the data are also censored. Inference with censored data has been studied before by many authors (see Cohen, 1991) but essentially they provide methods valid only for large samples. Many of the procedures used in statistical inference are based on the knowledge of the asymptotic distribution of the statistics involved. For instance, two widely used properties are that the maximum likelihood estimator is asymptotically normally distributed and that the likelihood ratio test statistic, under the null hypothesis, follows a chi-squared distribution. However, this information is not very useful when the sample sizes are very small. Furthermore, the exact distribution of these statistics is intractable in most real problems.

Thus classical asymptotic approximations need to be improved in order to adapt them to smaller samples. One approach is to use higher-order asymptotic methodology. In this paper we present Saddlepoint approximations in order to make inference with one and two samples for several location and scale models which have censored data of type I.

Biometrical Journal 44 (2002) 7, 867–876

© 2002 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim 0323-3847/02/0710-0867 $ 17.50+.50/0

* Corresponding author: [email protected]

2. Location and Scale models and Censoring

Let $Y_1, Y_2, \ldots, Y_N$ be independent and identically distributed continuous random variables belonging to a location and scale family, that is, whose density function has the form

$$f(x; \mu, \sigma) = \frac{1}{\sigma}\, f_0\!\left(\frac{x-\mu}{\sigma}\right), \qquad -\infty < x < \infty, \qquad (1)$$

where $\mu \in \mathbb{R}$ is the location and $\sigma \in \mathbb{R}^{+}$ is the scale parameter. Its distribution function can be expressed as $F(x; \mu, \sigma) = F_0((x-\mu)/\sigma)$, where $F_0(x) = \int_{-\infty}^{x} f_0(t)\,dt$.

Models of this kind are particularly important in theory and practice, because they allow us to define accurate and simple procedures for inference (see Pace and Salvan, 1997). Important members of this family are the Normal, Logistic, Exponential and Extreme Value distributions, among others.

In addition, experiments that are conducted for a fixed period of time, like the study of the lifetime of individuals, or measurements made with instruments that only detect values up to a fixed maximum or down to a fixed minimum, are common situations where data present type I censoring (see Hougaard, 1999).

We restrict our discussion to the situation of one sample coming from a location-scale family with a single type I censoring point $y_0$ to the right. The extension to multiple or left censoring is trivial in almost all cases. This censoring pattern implies that only $n$ of the $N$ observations are actually measured, while for the other $r = N - n$ we only know that they are greater than the censoring point, $y_i > y_0$. For this situation the log-likelihood function is

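$$l(y; \mu, \sigma) = -n \log \sigma + \sum_{i=1}^{n} \log\left[ f_0\!\left(\frac{y_i-\mu}{\sigma}\right)\right] + r \log\left[1 - F_0\!\left(\frac{y_0-\mu}{\sigma}\right)\right] \qquad (2)$$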

and the maximum likelihood estimator (MLE) can be calculated by solving the likelihood equations, that is, $\partial l/\partial \mu = 0$ and $\partial l/\partial \sigma = 0$. Technical details can be found in Lawless (1982).

In what follows, we restrict our discussion to the Extreme Value (Log-Weibull) and Normal distributions, which are very important in survival analysis and reliability theory. Under the special censoring pattern that we are considering, the log-likelihood functions can be expressed as follows:

1. For the Extreme Value Distribution, with density:

$$f(y; \mu, \sigma) = \frac{1}{\sigma} \exp\left(\frac{y-\mu}{\sigma}\right) \exp\left[-\exp\left(\frac{y-\mu}{\sigma}\right)\right]$$

the log-likelihood function is

$$l(y; \mu, \sigma) = -n \log \sigma + \sum_{i=1}^{n} \left(\frac{y_i-\mu}{\sigma}\right) - \sum_{i=1}^{n} \exp\left(\frac{y_i-\mu}{\sigma}\right) - r \exp\left(\frac{y_0-\mu}{\sigma}\right) \qquad (3)$$

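2. For the Normal Distribution the log-likelihood function takes the form

$$l(y; \mu, \sigma) = -n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i-\mu)^2 + r \log\left[1 - \Phi\!\left(\frac{y_0-\mu}{\sigma}\right)\right] - \frac{n}{2} \log(2\pi) \qquad (4)$$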

In these specific models, using an argument similar to those in Burridge (1981), it can be shown that the MLE always exists and is unique.
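The log-likelihoods (2), (3) and (4) translate almost line by line into code. The following is a minimal sketch, not the authors' program: the function names (`censored_loglik`, `ev_loglik`, `normal_loglik` and the four helper functions) are hypothetical, and the numbers in the usage lines are illustrative values rather than data from the paper. It shows that the generic form (2), supplied with the standard extreme value or standard normal $\log f_0$ and $\log(1 - F_0)$, agrees with the closed forms (3) and (4).

```python
import math

SQRT2 = math.sqrt(2.0)

def normal_log_pdf(z):
    """log of the standard normal density."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def normal_log_sf(z):
    """log(1 - Phi(z)) for the standard normal distribution."""
    return math.log(0.5 * math.erfc(z / SQRT2))

def ev_log_pdf(z):
    """log f0 for the standard extreme value density exp(z - exp(z))."""
    return z - math.exp(z)

def ev_log_sf(z):
    """log(1 - F0(z)) = -exp(z) for the standard extreme value distribution."""
    return -math.exp(z)

def censored_loglik(obs, r, y0, mu, sigma, log_f0, log_sf0):
    """Generic form (2): obs holds the n observed values, r values are censored at y0."""
    n = len(obs)
    observed_part = sum(log_f0((y - mu) / sigma) for y in obs)
    censored_part = r * log_sf0((y0 - mu) / sigma)
    return -n * math.log(sigma) + observed_part + censored_part

def ev_loglik(obs, r, y0, mu, sigma):
    """Closed form (3) for the extreme value model."""
    z = [(y - mu) / sigma for y in obs]
    z0 = (y0 - mu) / sigma
    return (-len(obs) * math.log(sigma) + sum(z)
            - sum(math.exp(v) for v in z) - r * math.exp(z0))

def normal_loglik(obs, r, y0, mu, sigma):
    """Closed form (4) for the normal model."""
    n = len(obs)
    ss = sum((y - mu) ** 2 for y in obs)
    return (-n * math.log(sigma) - ss / (2.0 * sigma ** 2)
            + r * normal_log_sf((y0 - mu) / sigma)
            - 0.5 * n * math.log(2.0 * math.pi))

# Illustrative values only (not data from the paper).
obs, r, y0 = [0.3, -1.2, 0.5], 2, 0.8
print(censored_loglik(obs, r, y0, 0.1, 1.3, ev_log_pdf, ev_log_sf))
print(ev_loglik(obs, r, y0, 0.1, 1.3))
```

Keeping the generic form separate from the closed forms is only a design convenience: any other member of the location and scale family could be plugged in by passing its own $\log f_0$ and $\log(1-F_0)$.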

3. Confidence Intervals

Let $\theta = (\psi, \lambda)$ be the unknown parameters, where $\psi$ is the parameter of interest and $\lambda$ is a nuisance parameter. For location and scale models like (1), $\psi$ can be $\mu$ or $\sigma$. We are interested in determining a confidence interval of level $(1-\alpha)$ for the parameter $\psi$.

For non-censored data or type II censoring, an exact interval can be calculated by using a pivotal quantity. However, for type I censored observations a pivotal quantity is unknown and approximate methods have to be used.

One approach is to use the fact that the maximum likelihood estimator $\hat\theta$ is asymptotically Normal with mean $\theta$ and covariance matrix $i(\theta)^{-1}$, the inverse of the Fisher information matrix. So an approximate confidence interval of level $(1-\alpha)$ for $\psi$ is $\hat\psi \pm z_{1-\alpha/2} \sqrt{V(\hat\psi, \hat\lambda)}$, where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ percentage point of the standard normal distribution and $V(\hat\psi, \hat\lambda)$ is the estimator of the variance of $\hat\psi$ obtained from the asymptotic covariance matrix and the MLE. This method will be denoted as AN.

Another alternative is to consider the set of values that are consistent with the null hypothesis of the test $H_0: \psi = \psi_0$. Usually the likelihood ratio test is used and the test statistic is

$$W(\psi) = 2\left( l(X; \hat\psi, \hat\lambda) - l(X; \psi, \hat\lambda_\psi) \right), \qquad (5)$$

where $l(\cdot)$ is the log-likelihood function given in (2) and $\hat\lambda_\psi$ is the constrained MLE of the nuisance parameter when $\psi$ is known. Under the null hypothesis $W$ has an asymptotic $\chi^2_1$ distribution, so the procedure to get a confidence interval of level $(1-\alpha)$ for $\psi$ consists in calculating the region determined by the inequality $W(\psi) \le \chi^2_{1,\alpha}$, where the right-hand term indicates the $(1-\alpha)$ percentile of the $\chi^2_1$ distribution.

However, Monte Carlo studies show (see Section 4) that the preceding procedure does not perform well when the sample size is small, although it is better than AN.

We can try to improve the convergence of the log-likelihood ratio test statistic expressed in (5) to its asymptotic distribution, that is, a $\chi^2_1$. For this purpose it is re-expressed as $Z(\psi) = \mathrm{sign}(\hat\psi - \psi) \sqrt{W(\psi)}$, which has an asymptotic standard normal distribution. A way to do this is to use higher-order asymptotic methodology.

3.1 Higher-Order Asymptotics

In the last two decades, higher-order procedures in likelihood analysis have been an active area of research in statistics, giving rise to many theoretical developments. Their application to problems with type I censored data has recently been extended by Wong and Wu (2000) using a transformation based on the Saddlepoint approximation, the $Z^*$ of Barndorff-Nielsen (1991). The distribution of $Z^*$ is closer to the normal distribution than that of $Z$. We have also used a transformation of $Z$ similar to that employed by Wong and Wu (2000), but easier to express for some distributions such as the extreme value. Our Monte Carlo simulations show that its performance is quite good, even for a very small sample size. This transformation is of the form:

$$Z^*(\psi) = Z(\psi) + \frac{1}{Z(\psi)} \log\left(\frac{C(\psi)\, U_p(\psi)}{Z(\psi)}\right), \qquad (6)$$

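where $C(\psi)$ and $U_p(\psi)$ must be computed for each distributional model as follows:

$$C(\psi) = \frac{|\tilde l_{\lambda;\hat\lambda}|}{\{|\tilde j_{\lambda\lambda}|\, |\hat j_{\lambda\lambda}|\}^{1/2}},$$

$$U_p(\psi) = -j_p(\hat\psi)^{-1/2} \left\{ \tilde l_{;\hat\psi} - \hat l_{;\hat\psi} - \tilde l_{\lambda;\hat\psi}\, (\tilde l_{\lambda;\hat\lambda})^{-1} (\tilde l_{;\hat\lambda} - \hat l_{;\hat\lambda}) \right\} \qquad (7)$$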

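Here $\hat l_{..}$ and $\tilde l_{..}$ are the derivatives of (2), expressed as a function of the parameters $\theta$ and the MLE $\hat\theta$, that is $l(\psi, \lambda; \hat\psi, \hat\lambda)$, evaluated respectively at $\hat\theta = (\hat\psi, \hat\lambda)$ and $\tilde\theta = (\psi, \tilde\lambda)$, with $\tilde\lambda = \hat\lambda_\psi$ being the constrained MLE of $\lambda$ for a fixed value $\psi$. In particular, for location and scale models the log-likelihood can be written as follows:

$$l(\mu, \sigma; \hat\mu, \hat\sigma, a) = -n \log \sigma + \sum_{i=1}^{n} \log f_0\!\left(\frac{\hat\sigma}{\sigma} a_i + \frac{\hat\mu-\mu}{\sigma}\right) + r \log\left[1 - F_0\!\left(\frac{\hat\sigma}{\sigma} a_0 + \frac{\hat\mu-\mu}{\sigma}\right)\right] \qquad (8)$$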

where $a_i = (y_i - \hat\mu)/\hat\sigma$ and $a_0 = (y_0 - \hat\mu)/\hat\sigma$. In (7), $|\tilde l_{\lambda;\hat\lambda}|$ denotes the determinant of the matrix of mixed second derivatives $\partial^2 l(\psi, \lambda; \hat\psi, \hat\lambda)/\partial\lambda\, \partial\hat\lambda$. This matrix and $\partial^2 l(\psi, \lambda; \hat\psi, \hat\lambda)/\partial\lambda^2$ are matrices only if $\lambda$ is multidimensional, which is not the case with a likelihood function of the form under consideration. Hence, the determinants involved can simply be replaced by single elements of the Hessian matrix of $l(\cdot)$ evaluated at the appropriate point. In the expression for $U_p(\psi)$, $j_p(\hat\psi)^{-1}$ represents the estimated variance of $\hat\psi$, that is, the corresponding element of the inverse of the observed information matrix $j(\hat\theta)$.

The statistic $Z^*(\psi)$ is asymptotically distributed as a $N(0, 1)$ with a relative error of order $o_p(n^{-3/2})$. For theoretical details see, for example, Pace and Salvan (1997), Jensen (1993, 1994) or Reid (1996).

The necessary steps in order to obtain a confidence interval for $\psi$ can be summarized as follows:

i) Express the log-likelihood as a function of the parameter $\theta$ and of the maximum likelihood estimators as well, $l(\psi, \lambda; \hat\psi, \hat\lambda)$.

ii) Compute the corrective factors $C(\psi)$ and $U_p(\psi)$ given in (7).

iii) Obtain the statistic $Z^*(\psi) = Z(\psi) + Z(\psi)^{-1} \log\left(C(\psi)\, U_p(\psi)\, Z(\psi)^{-1}\right)$.

iv) Calculate the interval for $\psi$ determined by $Z^*(\psi)^2 \le \chi^2_{1,(1-\alpha)}$.
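Steps iii) and iv) are model independent once $W$, $C$ and $U_p$ are available, so they can be sketched in a few lines. The helper names below (`signed_root`, `z_star`, `in_interval`) and the numerical inputs are hypothetical and serve only to illustrate the transformation (6); computing $C(\psi)$ and $U_p(\psi)$ themselves remains model specific.

```python
import math

def signed_root(w, psi_hat, psi):
    """Z(psi) = sign(psi_hat - psi) * sqrt(W(psi)); tiny negative W from rounding is clipped."""
    return math.copysign(math.sqrt(max(w, 0.0)), psi_hat - psi)

def z_star(z, c, u_p):
    """Transformation (6): Z* = Z + log(C * U_p / Z) / Z (undefined when Z = 0)."""
    ratio = c * u_p / z
    if ratio <= 0.0:
        raise ValueError("C * U_p and Z must have the same sign")
    return z + math.log(ratio) / z

def in_interval(z, c, u_p, chi2_quantile):
    """Step iv): psi belongs to the interval when Z*(psi)^2 <= chi-squared quantile."""
    return z_star(z, c, u_p) ** 2 <= chi2_quantile

# Illustrative numbers only; 3.841459 approximates the 95% point of a chi-squared with 1 d.f.
z = signed_root(1.44, psi_hat=2.0, psi=1.5)
print(z, z_star(z, c=0.9, u_p=1.5), in_interval(z, 0.9, 1.5, 3.841459))
```

Scanning a grid of candidate values of $\psi$ with `in_interval`, or solving $Z^*(\psi)^2 = \chi^2_{1,(1-\alpha)}$ by root finding on each side of $\hat\psi$, then gives the interval.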

3.2 Confidence Interval for the Scale Parameter σ of an Extreme Value Distribution

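Now we are going to follow the steps described above.

i) For this distribution the log-likelihood can be expressed as a function of the parameters and the MLE $(\hat\mu, \hat\sigma)$, as follows:

$$l(\mu, \sigma; \hat\mu, \hat\sigma) = -n \log \sigma + \sum_{i=1}^{n} \left(\frac{\hat\sigma}{\sigma} a_i + \frac{\hat\mu-\mu}{\sigma}\right) - \sum_{i=1}^{n} e^{\frac{\hat\sigma}{\sigma} a_i + \frac{\hat\mu-\mu}{\sigma}} - r\, e^{\frac{\hat\sigma}{\sigma} a_0 + \frac{\hat\mu-\mu}{\sigma}} \qquad (9)$$

where $a_i = (y_i - \hat\mu)/\hat\sigma$ and $a_0 = (y_0 - \hat\mu)/\hat\sigma$.

From the likelihood equations, the fact that $\sum_{i=1}^{n} \exp(a_i) + r \exp(a_0) = n$ and some algebra we obtain:

$$\hat l = l(\hat\mu, \hat\sigma; \hat\mu, \hat\sigma) = -n \log \hat\sigma + \sum_{i=1}^{n} a_i - \sum_{i=1}^{n} e^{a_i} - r\, e^{a_0} = -n \log \hat\sigma + \sum_{i=1}^{n} a_i - n \qquad (10)$$

$$\tilde l = l(\tilde\mu, \sigma; \hat\mu, \hat\sigma) = -n \log \sigma + \frac{\hat\sigma}{\sigma} \sum_{i=1}^{n} a_i - n \log\left[ n^{-1} \left( \sum_{i=1}^{n} e^{\frac{\hat\sigma}{\sigma} a_i} + r\, e^{\frac{\hat\sigma}{\sigma} a_0} \right) \right] - n \qquad (11)$$

where $\tilde\mu = \hat\mu_\sigma$ is the constrained MLE for a given $\sigma$. Then, the likelihood ratio test statistic $W$ can be written as:

$$W(\sigma) = 2 \left\{ -n \log\left(\frac{\hat\sigma}{\sigma}\right) + \left(1 - \frac{\hat\sigma}{\sigma}\right) \sum_{i=1}^{n} a_i + n \log\left[ n^{-1} \left( \sum_{i=1}^{n} e^{\frac{\hat\sigma}{\sigma} a_i} + r\, e^{\frac{\hat\sigma}{\sigma} a_0} \right) \right] \right\} \qquad (12)$$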

ii) The second step is to compute the correction factor $C(\sigma) U_p(\sigma)$ (denoted as $CU_p(\sigma)$), which in this case takes the form:

$$CU_p(\sigma) = \frac{\hat\sigma}{\sigma} \cdot \frac{ \dfrac{\hat\sigma}{\sigma} \left[ n \left( \sum a_i\, e^{\frac{\hat\sigma}{\sigma} a_i} + r a_0\, e^{\frac{\hat\sigma}{\sigma} a_0} \right) \left( \sum e^{\frac{\hat\sigma}{\sigma} a_i} + r\, e^{\frac{\hat\sigma}{\sigma} a_0} \right)^{-1} - \sum a_i \right] - n }{ \sqrt{ \sum a_i^2\, e^{a_i} + r a_0^2\, e^{a_0} - 2 \sum a_i - \left( \sum a_i \right)^2 / n } } \qquad (13)$$

iii) With this explicit expression for $CU_p(\sigma)$ compute the statistic $Z^*(\sigma)$ as in (6).

iv) Then, a confidence interval of level $(1-\alpha)$ for the scale parameter is determined by the set of values of $\sigma > 0$ such that:

$$Z^*(\sigma)^2 = Z^2(\sigma) + 2 \log\left(\frac{CU_p(\sigma)}{Z(\sigma)}\right) + \frac{1}{Z^2(\sigma)} \left[ \log\left(\frac{CU_p(\sigma)}{Z(\sigma)}\right) \right]^2 \le \chi^2_{1,(1-\alpha)} \qquad (14)$$

Furthermore, the confidence intervals for $\sigma$ based on the likelihood ratio test statistic $W(\sigma)$ and on AN can also be calculated. Nowadays, there are statistical packages such as SAS, Stata and EVIEWS, among others, that can be used to calculate the MLE for censored data for several distributional patterns. They also provide the AN confidence intervals. However, we have developed a program in Mathematica, available from the authors, that calculates the confidence intervals for the scale parameter based on $W(\sigma)$ and $Z^*(\sigma)$. The results of the following application were obtained by means of it.

Example 1. This data set corresponds to the time (in seconds) a mouse can continue running upright in a Rotorod “Treadmill for mice 7600, Ugo Basile, Italy” for 8 mice:

82, 120+, 110, 120+, 38, 92, 120+, 120+

This experiment is performed in order to study the neurotoxicity of acrylamide. The animals are given a dose of 20 mg/kg of body weight of the substance, producing a performance decrement. The value 120+ indicates that the value is censored because the mouse is removed from the rotorod after 120 seconds if it has not fallen before.

The distribution of the data is usually assumed to be Weibull. Then, in order to use the procedure described for the extreme value distribution, we first have to transform the data by taking logarithms.

In this case, we have $n = 4$ observed and $r = 4$ censored data, the censoring point is $y_0 = 120$, and the MLE are $\hat\mu = 138.74$ and $\hat\sigma = 2.51$. By using our program we obtain the following 90% confidence intervals for $\sigma$:

AN: (1.42; 10.53)   W: (1.03; 4.93)   Z*: (0.60; 3.95)

Notice how the interval based on $Z^*$ lies to the left of the others. Moreover, if a test of exponentiality is performed, that is $H_0: \sigma = 1$, $H_0$ would not be rejected using $Z^*$.

3.3 Confidence Intervals for the Location Parameter μ of a Normal Distribution

A first order asymptotic interval for $\mu$ of level $(1-\alpha)$ can be calculated from the region determined by the inequality $W(\mu) \le \chi^2_{1,(1-\alpha)}$, where $W(\mu)$ is expressed as:

$$W(\mu) = -n \log\left(\frac{\hat\sigma^2}{\tilde\sigma^2}\right) + 2r \log\left(\frac{1-\Phi(\hat\xi)}{1-\Phi(\xi)}\right) - \sum_{i=1}^{n} \left(\frac{y_i-\hat\mu}{\hat\sigma}\right)^2 + \sum_{i=1}^{n} \left(\frac{y_i-\mu}{\tilde\sigma}\right)^2 \qquad (15)$$

Here $\Phi(z)$ denotes the distribution function of a standard normal, $\tilde\sigma$ is the constrained MLE of the scale parameter for a fixed $\mu$, and $\xi = (y_0 - \mu)/\sigma$ is the standardized censoring point, evaluated at $(\hat\mu, \hat\sigma)$ for $\hat\xi$ and at $(\mu, \tilde\sigma)$ for $\xi$ in (15). The log-likelihood function can be rewritten as:

$$l(\mu, \sigma; \hat\mu, \hat\sigma) = -\frac{n}{2} \log(\sigma^2) + r \log(1-\Phi(\xi)) - \frac{n}{2\sigma^2} \left\{ (\hat\mu-\mu)^2 + \hat\sigma^2 - \hat\sigma\, W(\hat\xi)\, (\hat\mu - 2\mu + y_0) \right\} \qquad (16)$$

where $W(\xi) = \dfrac{r}{n} \dfrac{\phi(\xi)}{[1-\Phi(\xi)]}$ and $\phi(z)$ denotes the density function of a standard normal.

Unfortunately the derivatives involved in the expression of the components of the correction factor $CU_p$ do not have a closed-form expression. In any case we have developed a program in Mathematica, available from the authors, that takes the MLE $(\hat\mu, \hat\sigma)$, the number of censored ($r$) and observed ($n$) data, and the censoring point ($y_0$) as inputs, and gives the confidence intervals based on the statistics $W$ and $Z^*$. The results for the application presented below were obtained by running this program.

Example 2. The following values correspond to determinations of the level of glucose in the blood of 9 mice with experimentally induced diabetes:

592, 544, 466, 600+, 600+, 600+, 443, 524, 600+

The measurements have been taken with the “Glucometer Elite” (Bayer), which only registers values below 600. Observe that there are 4 censored values denoted by 600+, that is, concentrations of glucose above 600. The distribution of the data is usually assumed to be normal.

In this case we have $n = 5$, $r = 4$, $y_0 = 600$ and the MLE are $\hat\mu = 582.815$ and $\hat\sigma = 93.994$. With this information the program gives us the following 95% intervals for the mean based on $W$ and $Z^*$:

W: (513.897; 698.748)   Z*: (503.315; 758.613)

The interval based on the Saddlepoint Approximation is longer than that obtained by using $W$. This is because the latter does not have the required level of coverage, as can be concluded from our Monte Carlo studies in the next section.
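The estimates and the $W$ interval of Example 2 can be checked numerically. The sketch below is not the authors' Mathematica program. It is a minimal illustration that maximizes (4) with an EM iteration (a swapped-in technique, chosen because each update has a closed form for the censored normal model) and solves $W(\mu) = \chi^2_{1,0.95} \approx 3.8415$ by bisection. The names `hazard`, `loglik`, `em_fit`, `find_root` and `w_interval`, the starting values and the search brackets of five estimated standard deviations are assumptions made for illustration.

```python
import math

def hazard(xi):
    """phi(xi) / (1 - Phi(xi)) for the standard normal distribution."""
    pdf = math.exp(-0.5 * xi * xi) / math.sqrt(2.0 * math.pi)
    return pdf / (0.5 * math.erfc(xi / math.sqrt(2.0)))

def loglik(obs, r, y0, mu, sigma):
    """Log-likelihood (4) without the constant -(n/2) log(2 pi)."""
    ss = sum((y - mu) ** 2 for y in obs)
    sf = 0.5 * math.erfc((y0 - mu) / (sigma * math.sqrt(2.0)))
    return -len(obs) * math.log(sigma) - ss / (2.0 * sigma ** 2) + r * math.log(sf)

def em_fit(obs, r, y0, mu=None, sigma=None, fixed_mu=False, tol=1e-10, max_iter=100000):
    """EM for right-censored normal data; with fixed_mu=True only sigma is updated."""
    n, total = len(obs), len(obs) + r
    if mu is None:
        mu = sum(obs) / n
    if sigma is None:
        sigma = math.sqrt(sum((y - mu) ** 2 for y in obs) / n) or 1.0
    for _ in range(max_iter):
        lam = hazard((y0 - mu) / sigma)
        m1 = mu + sigma * lam                             # E[Y | Y > y0]
        m2 = mu * mu + sigma * sigma + sigma * lam * (y0 + mu)  # E[Y^2 | Y > y0]
        new_mu = mu if fixed_mu else (sum(obs) + r * m1) / total
        second = (sum((y - new_mu) ** 2 for y in obs)
                  + r * (m2 - 2.0 * new_mu * m1 + new_mu ** 2))
        new_sigma = math.sqrt(second / total)
        step = abs(new_mu - mu) + abs(new_sigma - sigma)
        mu, sigma = new_mu, new_sigma
        if step < tol * (1.0 + abs(mu) + sigma):
            break
    return mu, sigma

def find_root(f, lo, hi, iters=60):
    """Plain bisection; f(lo) and f(hi) are assumed to have opposite signs."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def w_interval(obs, r, y0, crit=3.841459):
    """MLE and the interval {mu : W(mu) <= crit}, with W(mu) = 2 (l_hat - l_tilde)."""
    mu_hat, s_hat = em_fit(obs, r, y0)
    l_hat = loglik(obs, r, y0, mu_hat, s_hat)

    def excess(mu):
        _, s_tilde = em_fit(obs, r, y0, mu=mu, sigma=s_hat, fixed_mu=True)
        return 2.0 * (l_hat - loglik(obs, r, y0, mu, s_tilde)) - crit

    lower = find_root(excess, mu_hat - 5.0 * s_hat, mu_hat)
    upper = find_root(excess, mu_hat, mu_hat + 5.0 * s_hat)
    return mu_hat, s_hat, lower, upper

# Observed values of Example 2; four further values are censored at 600.
print(w_interval([592.0, 544.0, 466.0, 443.0, 524.0], 4, 600.0))
```

Under these assumptions the output should agree, up to rounding, with the MLE and the $W$ interval reported in Example 2. The $Z^*$ interval is not reproduced here, because the components of $CU_p$ for the normal model require derivatives that are not available in closed form.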

4. Monte Carlo Studies

In this section we present the results of Monte Carlo experiments in order to compare the estimated coverage level obtained using the three methods considered here (W, Z*, AN).

4.1 Extreme Value Distribution

For the standard extreme value distribution $EV(0, 1)$, we have simulated 7000 samples of different small sizes (10, 8, 6) with type I censoring at $y_0 = 0.8$. The rather unusual choice of this number of replications is due to the long execution time for each simulation. For each value generated we have determined whether it was observed or censored, and then we have calculated the MLE for each sample. With this information we computed the confidence intervals at different levels (90%, 95%, 99%) and registered whether the real value of the scale parameter ($\sigma = 1$) belonged to that interval or not.

Observe in Table 1 how the procedure based on $Z^*$ presents better approximations to the nominal levels of coverage than those obtained with the statistics $W$ or AN. Moreover the interval based on AN is clearly the worst.

Table 1

Estimated Coverage Level (in %) of Confidence Intervals for the Scale Parameter σ

| Sample size N | 90%: AN | 90%: W | 90%: Z* | 95%: AN | 95%: W | 95%: Z* | 99%: AN | 99%: W | 99%: Z* |
|---|---|---|---|---|---|---|---|---|---|
| 10 | 83.0 | 87.6 | 89.7 | 87.0 | 93.4 | 94.0 | 92.4 | 98.3 | 98.9 |
| 8 | 81.0 | 86.9 | 89.0 | 85.1 | 93.1 | 93.6 | 90.4 | 98.3 | 98.7 |
| 6 | 80.0 | 87.6 | 89.1 | 93.7 | 93.4 | 94.6 | 89.7 | 98.2 | 99.2 |

Values based on 7000 simulations for EV(0,1) with $y_0 = 0.8$

4.2 Normal Distribution

Here we have simulated 5000 samples of different small sizes (10, 8, 6) with type I censoring at $y_0 = 1.5$ for the standard normal distribution $N(0, 1)$. As in the case of the extreme value, for each sample the MLE are calculated, as well as the 90%, 95% and 99% confidence intervals for the location parameter $\mu$.

The results shown in Table 2 indicate that the procedure based on $W$ does not achieve the nominal level required, although once more the results are better than those obtained from AN. On the other hand, the results based on $Z^*$ are evidently much better, even when the sample size is very small. Indeed, the observed coverage level obtained with this higher-order asymptotic approximation almost matches the nominal level proposed.
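The simulation design described in this section can be sketched as follows. This is a hedged illustration, not the authors' code: the names `draw_ev`, `draw_normal`, `simulate_censored` and `coverage` are hypothetical, the interval procedure is left as a user-supplied function, and the rule of skipping samples for which no interval can be computed is an assumption, since the text does not say how such samples were handled.

```python
import math
import random

def draw_ev(rng):
    """Standard extreme value variate with F0(y) = 1 - exp(-exp(y)): log of an Exp(1) draw."""
    e = 0.0
    while e <= 0.0:
        e = rng.expovariate(1.0)
    return math.log(e)

def draw_normal(rng):
    """Standard normal variate."""
    return rng.gauss(0.0, 1.0)

def simulate_censored(draw, y0, size, rng):
    """Draw `size` values and apply type I right censoring at y0.
    Returns the observed values and the number r of censored ones."""
    values = [draw(rng) for _ in range(size)]
    observed = [v for v in values if v <= y0]
    return observed, size - len(observed)

def coverage(draw, y0, size, reps, interval, truth, rng):
    """Share of simulated samples whose interval contains `truth`.
    `interval(observed, r, y0)` returns (lower, upper), or None if it cannot be computed."""
    hits = used = 0
    for _ in range(reps):
        observed, r = simulate_censored(draw, y0, size, rng)
        result = interval(observed, r, y0)
        if result is None:
            continue  # assumption: unusable samples are skipped
        used += 1
        lower, upper = result
        hits += lower <= truth <= upper
    return hits / used if used else float("nan")

# Probability that a single value is censored under each design of this section.
p_ev = math.exp(-math.exp(0.8))                      # EV(0,1) censored at y0 = 0.8
p_normal = 0.5 * math.erfc(1.5 / math.sqrt(2.0))     # N(0,1) censored at y0 = 1.5
print(round(p_ev, 3), round(p_normal, 4))

rng = random.Random(2002)
print(simulate_censored(draw_ev, 0.8, 10, rng))
```

Under these designs roughly 11% of the extreme value observations and 7% of the normal observations are expected to be censored, so with $N = 6$ many samples contain no censored value at all.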

Table 2

Estimated Coverage Level (in %) of Confidence Intervals for the Location Parameter μ

| Sample size N | 90%: AN | 90%: W | 90%: Z* | 95%: AN | 95%: W | 95%: Z* | 99%: AN | 99%: W | 99%: Z* |
|---|---|---|---|---|---|---|---|---|---|
| 10 | 85.9 | 87.4 | 90.3 | 91.7 | 92.9 | 95.0 | 96.8 | 98.4 | 99.1 |
| 8 | 84.7 | 86.7 | 90.1 | 90.1 | 92.2 | 95.3 | 95.9 | 98.4 | 99.3 |
| 6 | 82.5 | 85.0 | 90.5 | 87.8 | 91.3 | 95.8 | 94.4 | 97.7 | 99.2 |

Values based on 5000 simulations for N(0,1) with $y_0 = 1.5$

Acknowledgements:

We thank Drs. F. Bosch and P. Otaegui of the Biochemistry and Molecular Biology Dep. of the UAB, as well as Dr. J. Guerrero of the Animal Physiology Dep. of the UAB, for providing us with their data sets. This research has been partially supported by the grant BFM2000-02 of the Ministerio de Educación y Cultura of Spain.

References

Barndorff-Nielsen, O. (1991): Modified signed log likelihood ratio. Biometrika 78, 557–563.

Burridge, J. (1981): A note on maximum likelihood estimation for regression models using grouped data. Journal of the Royal Statistical Society B 43, 41–45.

Cohen, C. A. (1991): Truncated and Censored Samples. Theory and Applications. Marcel Dekker, New York.

Hougaard, P. (1999): Fundamentals of Survival Data. Biometrics 55, 13–22.

Jensen, J. L. (1993): A Historical Sketch and Some New Results on the Improved Log-Likelihood Ratio Statistic. Scandinavian Journal of Statistics 20, 1–15.

Jensen, J. L. (1994): Saddlepoint Approximations. Oxford University Press, Oxford.

Lawless, J. F. (1982): Statistical Models and Methods for Lifetime Data. John Wiley & Sons, New York.

Pace, L. and Salvan, A. (1997): Principles of Statistical Inference. World Scientific Press, Singapore.

Reid, N. (1996): Likelihood and Higher-Order Approximations to Tail Areas: A Review and Annotated Bibliography. The Canadian Journal of Statistics 24, 141–166.

Wong, A. and Wu, J. (2000): Practical Small-Sample Asymptotics for Distributions Used in Life Data Analysis. Technometrics 42, 149–155.

Received, April 2001
Revised, April 2002
Accepted, July 2002