
STATISTICS IN MEDICINE, Statist. Med. (in press). Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/sim.2444

Non-parametric assessment of non-inferiority with censored data

G. Freitag1,∗,†, S. Lange2 and A. Munk1

1 Institut für Mathematische Stochastik, Georg-August-Universität Göttingen, Germany
2 Institut für Qualität und Wirtschaftlichkeit im Gesundheitswesen, Köln, Germany

SUMMARY

We suggest non-parametric tests for showing non-inferiority of a new treatment compared to a standard therapy when data are censored. To this end the difference and the odds ratio curves of the entire survivor functions over a certain time period are considered. Two asymptotic approaches for solving these testing problems are investigated, which are based on bootstrap approximations. The performance of the test procedures is investigated in a simulation study, and some guidance on which test to use in specific situations is derived. The proposed methods are applied to a trial in which two thrombolytic agents for the treatment of acute myocardial infarction were compared, and to a study on irradiation therapies for advanced non-small-cell lung cancer. Non-inferiority over a large time period of the study can be shown in both cases. Copyright © 2005 John Wiley & Sons, Ltd.

KEY WORDS: bootstrap; confidence interval; non-inferiority; non-parametrics; odds ratio; survival distribution; therapeutic equivalence

1. INTRODUCTION

1.1. Background

Recently, an increasing number of clinical studies have been conducted with the aim of showing the non-inferiority of a new (test) treatment compared to a standard (reference) one. With regard to survival data, this amounts to proving that the new treatment results in survival probabilities which are not relevantly smaller than those of the standard treatment, or that the hazard rate is not relevantly increased under the test treatment as compared to the reference. The term therapeutic equivalence is sometimes used synonymously for non-inferiority.

∗Correspondence to: G. Freitag, Institut für Mathematische Stochastik, Georg-August-Universität Göttingen, Maschmühlenweg 8-10, 37073 Göttingen, Germany.

†E-mail: [email protected]

Contract/grant sponsor: Deutsche Forschungsgemeinschaft; contract/grant number: TR 471/1

Received 15 June 2005; Accepted 25 September 2005. Copyright © 2005 John Wiley & Sons, Ltd.


Regarding medical reasoning and specific methodological issues for conducting non-inferiority trials, we refer to the special issue of Statistics in Medicine on this topic in 2003 [1]. See also a number of guidelines of the Committee for Proprietary Medicinal Products (CPMP) or the International Conference on Harmonisation (ICH), such as References [2–6].

For assessing non-inferiority with censored data, several parametric and semi-parametric approaches have been suggested in the literature (cf. also the survey in Freitag [7]). Under the assumption of exponential failure times, Bristol [8] proposes comparing the medians, or the two (constant) hazard rates, respectively. In Stallard and Whitehead [9] it is assumed that the underlying failure time distributions are from the Weibull family with a common shape parameter, which allows one to examine therapeutic equivalence in terms of the constant hazard ratio. Fleming [10] and Com-Nougue et al. [11] consider the (semi-parametric) case of proportional hazards without specifying the underlying distributions, and again define equivalence in terms of the hazard ratio, assumed to be constant over the time range (cf. also Wellek [12]). The hazard ratio under the assumption of proportional hazards is also used in Rothmann et al. [13]. Under the assumption of an accelerated failure times model, the scale factor can be used for the definition of non-inferiority (cf., e.g. Wei and Gail [14]). In Broët et al. [15] a cure rate model with identical long-term survivor rates is assumed, and non-inferiority is defined in terms of the short-term effects.

Often, however, a parametric or semi-parametric modelling of the treatment effects is difficult (cf., e.g. Example 1.2), and a non-parametric approach becomes necessary. Thus, the aim of this paper is to investigate non-inferiority with censored data within a completely non-parametric framework.

Note that there are already some non-parametric approaches for the given problem. Com-Nougue et al. [11] suggest using the difference of the estimated survivor functions in the two groups at a pre-specified time τ > 0. Another option is the difference or the ratio of the median lifetimes for defining non-inferiority (cf. Su and Wei [16]). Finally, the Mann–Whitney functional P(X > Y) can be considered, which is related to the hazard ratio in the case of proportional hazards (cf. References [17, 18], and the application to non-inferiority problems in the statistical software package TESTIMATE [19]).

The above-mentioned approaches consider specific functionals of the survivor functions. In contrast, in this paper we will focus on the comparison of the entire survival functions for the two treatment groups. The comparison is based on the difference or the odds ratio of the survival probabilities. Note, however, that the risk ratio or the cumulative hazard ratio can be treated in essentially the same way. The aim is to show non-inferiority uniformly over a certain time period instead of only at a single time point. This will be motivated by the following two examples.

1.2. Two motivating examples

Example 1.1 (Lung cancer data)
Nestle et al. [20] report on a study on irradiation therapies for advanced non-small-cell lung cancer. The aim was to exclude a 20 per cent or greater advantage in 1-year survival of conventional (reference treatment R) over palliative irradiation (test treatment T). A total of 152 patients were included in the study, with 79 patients in the conventional treatment group. The observed difference between the 1-year survival rates was 0.019 in favour of the reference treatment, with a 95 per cent confidence interval of (−0.135, 0.173).


Figure 1. Estimated survival functions for conventional (R: - -) and palliative (T: —) irradiation from the lung cancer study, with numbers of patients at risk.

Based on this (two-sided approach, as used in the analysis of the study), non-inferiority could be shown, because the upper confidence bound did not exceed 0.2. In Figure 1 the estimated Kaplan–Meier survival curves are displayed for the two treatment groups. Here, it can be seen that the estimated difference between the two survival functions is rather small in the period of 6–12 months, whereas it is considerably bigger (even to the disadvantage of the reference treatment) especially in the first half of the second year. It would be of interest to investigate non-inferiority not only at the single time point τ = 12, but also over a whole time interval, e.g. the period [0, 18] of the first 1.5 years.

Example 1.2 (COMPASS data)
Tebbe et al. [21] present a non-inferiority trial of saruplase (test treatment T) versus streptokinase (standard treatment R) for thrombolytic therapy after acute myocardial infarction. The study included 3089 patients, of whom 1542 received the test treatment T. The aim of the study was to show that the odds ratio ψ := [pT/(1 − pT)]/[pR/(1 − pR)] of the 30-day mortality rates (pT and pR, respectively) was less than 1.5, and Fisher's exact test of ψ ≥ 1.5 versus ψ < 1.5 gave a significant result (p = 0.00007). Note that there was (almost) no censoring in the first 30 days, such that methods for uncensored data could be used in the analysis. However, after this period censoring played a role: about 1 per cent of the observations were censored within half a year, and about 10 per cent within 1 year. In Figure 2(a), the Kaplan–Meier estimates of the survival functions are displayed over the follow-up period of 1 year for the two treatments. From the figure it seems that non-inferiority is maintained throughout the entire study period and not only at the single time point τ = 30. Indeed, in the following we will present a method which allows showing this at a controlled error rate. Note that parametric or semi-parametric modelling is difficult in this example. In particular, the hazard ratio does not seem to be constant over time, as can be seen from the plot of the cumulative hazard ratio estimator ĥr(t) = Λ̂T(t)/Λ̂R(t) in Figure 2(b).


Figure 2. Non-parametric estimators from the COMPASS data: (a) estimated survival functions for streptokinase (R: - -) and saruplase (T: —), with numbers of patients at risk; and (b) estimator of the cumulative hazard ratio function hr(t) := ΛT(t)/ΛR(t).

Figure 3. Estimated cumulative odds ratio curve from COMPASS.

Hence, for assessing non-inferiority, we will consider the (cumulative) odds ratio as a function of time (not assumed to be constant), which can be estimated from the two survival curves, as displayed in Figure 3. Note that the plots of the estimated cumulative odds ratio and the estimated cumulative hazard ratio are very similar for this example, since the estimators of the two survival functions are close to 1 over the follow-up period of 1 year, as is the ratio of the estimates (1 − F̂R(t))/(1 − F̂T(t)).
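The similarity just noted can be made explicit by a short first-order expansion (our illustration, not part of the original analysis; ψ(t) denotes the cumulative odds ratio formally defined in Section 2). Since Λk(t) = −ln(1 − Fk(t)), and since both x/(1 − x) and −ln(1 − x) equal x + O(x²) as x → 0,

ψ(t) = [FT(t)/(1 − FT(t))] / [FR(t)/(1 − FR(t))] ≈ FT(t)/FR(t) ≈ ΛT(t)/ΛR(t) = hr(t)

whenever both cdfs FT(t) and FR(t) are small, i.e. whenever both survival functions stay close to 1, as they do here.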


The paper is organized as follows. In Section 2 the notation and the hypotheses for testing non-inferiority in a non-parametric setting are presented. For the sake of brevity, all investigations are restricted to the case of simple random right censoring (cf. also Remark 4.1 for more general censoring schemes). In Section 3 we describe the construction of tests for the given problems. Two different approaches are considered: one is based on asymptotic pointwise confidence intervals for the difference or odds ratio function, respectively (using the intersection–union principle), and the other is based on asymptotic confidence intervals for the supremum of the respective functions. In both cases, the required confidence intervals are obtained via bootstrapping. The corresponding C++ code is available from the authors upon request. Some remarks and extensions are provided in Section 4. The performance of the proposed tests is examined in a simulation study in Section 5, and their application to the examples introduced above is presented in Section 6. It can be seen that both the pointwise confidence interval method and the supremum-based approach keep their nominal level, but the latter is clearly superior to the former with respect to power. On the other hand, the pointwise confidence interval method provides a nice graphical tool for analysing the data, and it can be used for finding the maximal set of time points within a given time window for which non-inferiority can be shown. A short discussion of the results is given in Section 7. Additional mathematical details and a sketch of the proofs are given in the Appendix.

2. TESTING PROBLEMS

We denote the cumulative distribution function (cdf) of a failure time under the reference (test) treatment by FR (FT). In the following we assume that the Fk are continuous cdfs with corresponding survivor functions Sk = 1 − Fk and continuous hazard functions λk, k = R, T. Let {Tki}, i = 1, …, nk, be independent and identically distributed (i.i.d.) failure times (Tki > 0) according to Fk, k = R, T. Further, let the corresponding censoring times {Uki} (Uki ≥ 0) be distributed according to Gk, where Uki is independent of Tki, i = 1, …, nk, k = R, T. The observations consist of pairs {Xki, δki}, where Xki = Tki ∧ Uki := min{Tki, Uki} are the observed failure times and δki = I{Tki ≤ Uki}, i = 1, …, nk, k = R, T, are the corresponding censoring indicators. For estimating the underlying distribution functions Fk, we will use the standard non-parametric Kaplan–Meier estimators, which shall be denoted by F̂k, k = R, T.

Instead of comparing the two cdfs Fk, k = R, T, only at a single time point τ, we propose considering a whole time interval [t0, t1] for some suitably chosen boundaries (0 ≤) t0 < t1 defined in advance. Thus, if one is interested in the difference of the cdfs, as was the case in Example 1.1, we suggest hypotheses of the form

Hd: d(t) ≥ Δd for some t ∈ [t0, t1]   versus   Kd: d(t) < Δd for all t ∈ [t0, t1]      (1)

where d(t) := FT(t) − FR(t). Note that rejection of Hd implies that non-inferiority (in terms of the difference of the cdfs) is declared valid over the entire time interval [t0, t1] at a controlled error rate. If the main interest is in the odds ratio of the survival probabilities, as in Example 1.2, we propose the testing problem

Ho: ψ(t) ≥ Δo for some t ∈ [t0, t1]   versus   Ko: ψ(t) < Δo for all t ∈ [t0, t1]      (2)


where ψ(t) := [FT(t)/(1 − FT(t))] / [FR(t)/(1 − FR(t))]. Here, Δd > 0 and Δo > 1 are fixed irrelevance bounds which have to be defined in advance. The choice of Δ is important and should be based on historical estimates of, or assumptions about, the efficacy of the reference treatment, but we will not address this issue further in this article. Instead, we refer to the extensive discussions in Reference [1] and in the special guideline of the CPMP [5]. Note that for the methods proposed here the same choice of Δ is feasible as in the case when only a single time point is considered.
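To fix ideas, the following minimal C++ sketch shows how the ingredients of (1) and (2) can be computed from right-censored data. This is our own illustration, not the authors' code; the names Obs, km_cdf, d_hat and psi_hat are ours, ties and efficiency are ignored, and the plug-in estimators are formally introduced in Section 3.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Obs { double x; int delta; };   // x = min(T, U); delta = I{T <= U}

// Kaplan-Meier estimate of F(t) = 1 - S(t); ties are ignored for brevity.
double km_cdf(std::vector<Obs> data, double t) {
    std::sort(data.begin(), data.end(),
              [](const Obs& a, const Obs& b) { return a.x < b.x; });
    double surv = 1.0;
    const std::size_t n = data.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i].x > t) break;        // factors only for observed times <= t
        if (data[i].delta == 1)          // event: multiply by (1 - 1/riskset)
            surv *= 1.0 - 1.0 / static_cast<double>(n - i);
    }
    return 1.0 - surv;
}

// d(t) = F_T(t) - F_R(t), cf. (1).
double d_hat(const std::vector<Obs>& T, const std::vector<Obs>& R, double t) {
    return km_cdf(T, t) - km_cdf(R, t);
}

// psi(t) = [F_T/(1 - F_T)] / [F_R/(1 - F_R)], cf. (2); only meaningful where
// both cdf estimates are bounded away from 0 and 1.
double psi_hat(const std::vector<Obs>& T, const std::vector<Obs>& R, double t) {
    const double fT = km_cdf(T, t), fR = km_cdf(R, t);
    return (fT / (1.0 - fT)) / (fR / (1.0 - fR));
}

Note that psi_hat is only sensible at times t where both cdf estimates are bounded away from 0 and 1; the restriction of the time window in Section 5 addresses exactly this point.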

3. NON-PARAMETRIC BOOTSTRAP TESTS

For testing the above hypotheses we will use estimators of the difference and the odds ratio functions, respectively, i.e. d̂(t) := F̂T(t) − F̂R(t) and ψ̂(t) := [F̂T(t)/(1 − F̂T(t))] / [F̂R(t)/(1 − F̂R(t))].

An appealing and simple possibility for constructing level-α tests of hypotheses (1) and (2) is to construct pointwise upper (1 − α)-confidence intervals for d(t) and ψ(t) at every t ∈ [t0, t1], and to reject the null hypothesis if all confidence intervals fall below the bound Δd or Δo, respectively. Note that it is not required here to calculate simultaneous confidence bands for the two discrepancy functions, which follows from the intersection–union principle (IUT principle; cf. Berger [22]). We call this the pointwise approach in the sequel.

As follows from the results reviewed in the Appendix, if

nR, nT → ∞,   nT/(nR + nT) → κ ∈ (0, 1)      (3)

the limiting distribution of √(nRnT/(nR + nT)) (d̂(t) − d(t)) and √(nRnT/(nR + nT)) (ψ̂(t) − ψ(t)) is normal with mean 0 and variance σd²(t) and σo²(t), respectively, which can be computed explicitly and depend on FT, FR, GT, GR. For the construction of an asymptotic upper confidence interval for d(t) or ψ(t) at fixed t, the respective variance has to be estimated. For the difference curve this can be done by using the Greenwood variance estimators of the single survival curves. However, it turns out to be more complicated in the case of the odds ratio curve. Alternatively, one can use bootstrapping for the construction of the confidence intervals. For the case of simple right censoring, the 'simple method' of resampling suggested by Efron (cf. Reference [23]) can be used. For group k ∈ {R, T}, a bootstrap sample (X∗kj, δ∗kj), j = 1, …, mk, is drawn from the pairs (Xki, δki), i = 1, …, nk. The corresponding Kaplan–Meier estimators F̂∗k(·), k = R, T, are then calculated, giving the bootstrap estimators d̂∗(t) = F̂∗T(t) − F̂∗R(t) and ψ̂∗(t) = [F̂∗T(t)/(1 − F̂∗T(t))] / [F̂∗R(t)/(1 − F̂∗R(t))] for d(t) and ψ(t), respectively. A proof of the weak consistency of the bootstrap method (cf. Shao and Tu [24]) in this context will be sketched in the Appendix, whereas details can be found in Freitag and Munk [25]. This assures, given the data, the (asymptotic) closeness of the sample distribution of the test statistics and the distribution of their bootstrapped versions. Hence, empirical quantiles of the bootstrap samples can be used to approximate those of the asymptotic distribution of each test statistic, which leads to the percentile (PC) method. Alternatively, the bias-corrected accelerated (BCa) bootstrap method can be used for constructing the required bootstrap confidence intervals, which yields an improvement of the accuracy of the approximation in many cases (cf., e.g. Reference [24]). The bootstrap confidence intervals are calculated

Copyright ? 2005 John Wiley & Sons, Ltd. Statist. Med. (in press)

NON-PARAMETRIC ASSESSMENT OF NON-INFERIORITY WITH CENSORED DATA

as follows (for the sake of brevity, we describe this for the difference d(·) only; for ψ(·) just replace d with ψ):

Step 1: Draw B bootstrap samples from each group k ∈ {R, T} of sizes mk = nk. This yields bootstrap estimators d̂∗b(t), t ∈ [t0, t1], b = 1, …, B.

Step 2a: For t ∈ [t0, t1], the upper (1 − α)-PC-confidence bound for d(t) is given by the empirical (1 − α)-quantile d̂∗(1−α)(t) of the d̂∗b(t), b = 1, …, B.

Step 2b: For t ∈ [t0, t1], the upper (1 − α)-BCa-confidence bound for d(t) is given by the empirical β-quantile d̂∗β(t) of the sample d̂∗1(t), …, d̂∗B(t), where

β := Φ( z0 + (z0 + u1−α) / (1 − a (z0 + u1−α)) )

z0 := Φ⁻¹( (1/B) ∑b=1..B I{d̂∗b(t) < d̂(t)} )

a := [ ∑i=1..n (d̂(·)(t) − d̂(i)(t))³ ] / [ 6 ( ∑i=1..n (d̂(·)(t) − d̂(i)(t))² )^(3/2) ]

Here, d̂(i)(t) is calculated as d̂(t), but leaving out the i-th observation from the pooled sample, i = 1, …, nR + nT ≡ n. Accordingly, d̂(·)(t) := (1/n) ∑i=1..n d̂(i)(t). Further, Φ and uα denote the cdf and the α-quantile of the standard normal distribution, respectively. Note that z0 is an estimator of the bias correction term, whereas a is an estimator of the acceleration constant; both are used for the BCa method (cf. Reference [24]).
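For concreteness, here is a sketch of Steps 1–2b at a single time point t. Again this is our own illustration, reusing Obs, km_cdf and d_hat from the sketch in Section 2; Phi, PhiInv and resample are simplistic stand-ins, and degenerate cases (e.g. all jackknife values equal) are not guarded.

#include <algorithm>
#include <cmath>
#include <random>
#include <utility>
#include <vector>

static double Phi(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

static double PhiInv(double p) {              // crude inverse normal cdf (bisection)
    double lo = -10.0, hi = 10.0;
    for (int k = 0; k < 100; ++k) {
        const double mid = 0.5 * (lo + hi);
        (Phi(mid) < p ? lo : hi) = mid;
    }
    return 0.5 * (lo + hi);
}

// Efron's 'simple method': resample (x, delta) pairs with replacement.
static std::vector<Obs> resample(const std::vector<Obs>& data, std::size_t m,
                                 std::mt19937& g) {
    std::uniform_int_distribution<std::size_t> pick(0, data.size() - 1);
    std::vector<Obs> out(m);
    for (Obs& o : out) o = data[pick(g)];
    return out;
}

// Upper (1-alpha) PC and BCa bounds for d(t) at a fixed t (Steps 1, 2a, 2b).
std::pair<double, double> pointwise_bounds(const std::vector<Obs>& T,
                                           const std::vector<Obs>& R,
                                           double t, double alpha, int B,
                                           std::mt19937& g) {
    const double d0 = d_hat(T, R, t);
    std::vector<double> boot(B);
    int below = 0;                            // #{ d*_b(t) < d(t) }, enters z0
    for (int b = 0; b < B; ++b) {
        boot[b] = d_hat(resample(T, T.size(), g), resample(R, R.size(), g), t);
        if (boot[b] < d0) ++below;
    }
    std::sort(boot.begin(), boot.end());
    auto quantile = [&](double q) {
        const std::size_t i = static_cast<std::size_t>(q * B);
        return boot[std::min<std::size_t>(i, B - 1)];
    };
    const double pc = quantile(1.0 - alpha);  // Step 2a
    // Step 2b: jackknife over the pooled sample for the acceleration constant.
    std::vector<double> jack;
    for (std::size_t i = 0; i < T.size() + R.size(); ++i) {
        std::vector<Obs> Tj = T, Rj = R;
        if (i < T.size()) Tj.erase(Tj.begin() + i);
        else              Rj.erase(Rj.begin() + (i - T.size()));
        jack.push_back(d_hat(Tj, Rj, t));
    }
    double mean = 0.0;
    for (double v : jack) mean += v;
    mean /= jack.size();
    double num = 0.0, den = 0.0;
    for (double v : jack) { num += std::pow(mean - v, 3.0); den += (mean - v) * (mean - v); }
    const double a  = num / (6.0 * std::pow(den, 1.5));
    const double z0 = PhiInv(static_cast<double>(below) / B);
    const double u  = PhiInv(1.0 - alpha);
    const double beta = Phi(z0 + (z0 + u) / (1.0 - a * (z0 + u)));
    return {pc, quantile(beta)};              // {PC bound, BCa bound}
}

In practice one would replace the bisection-based PhiInv by a proper inverse-normal routine and add guards for degenerate jackknife or bootstrap samples.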

Note that the random Gaussian multipliers technique (a sort of Gaussian bootstrap) used in Parzen et al. [26] for constructing confidence intervals for the difference of two survival functions can also be applied for the test of Hd. This was investigated in a simulation study (not displayed), but did not yield better results than the non-parametric bootstrap described above. Observe that this method could also be used for the odds ratio function.

Naturally, the pointwise approach will be over-conservative. This could be improved by choosing only a small number of time points for which non-inferiority has to be shown. Alternatively, a second approach to testing (1) and (2) is based on the construction of an upper confidence interval directly for the supremum of the discrepancy under consideration, i.e. for sup_{t∈[t0,t1]} d(t) and sup_{t∈[t0,t1]} ψ(t), respectively. This is motivated by the fact that the testing problem in (1) is equivalent to

Hd: sup_{t∈[t0,t1]} d(t) ≥ Δd   versus   Kd: sup_{t∈[t0,t1]} d(t) < Δd

and similarly for (2). For constructing a corresponding test, it would be necessary to estimate the variance of the supremum of the limiting process of the chosen discrepancy measure, and it is not evident how to do this. Again, however, a test can be constructed via bootstrapping, as discussed in the Appendix. Note that the above-mentioned resampling schemes are no longer valid here. However, in Reference [25] it is shown that the hybrid bootstrap method (cf., e.g. Reference [24]) is weakly consistent if the bootstrap sample sizes are chosen smaller


than the original sample sizes. More precisely, mk has to be chosen such that mk = o(nk) and mT/(mR + mT) → κ (with κ from (3)), as mk, nk → ∞, k = R, T. (For further applications of the so-called 'm out of n bootstrap' cf., e.g. References [24, 27–30].) Then, after having performed the above Step 1 with suitably chosen mk < nk, k = R, T, the upper (1 − α)-confidence bound for sup_{t∈[t0,t1]} d(t) is calculated as

( 1 + √( mRmT(nR + nT) / (nRnT(mR + mT)) ) ) sup_{t∈[t0,t1]} d̂(t) − √( mRmT(nR + nT) / (nRnT(mR + mT)) ) ( sup_{t∈[t0,t1]} d̂∗(t) )_α      (4)

where (sup_{t∈[t0,t1]} d̂∗(t))_α denotes the empirical α-quantile of the sup_{t∈[t0,t1]} d̂∗b(t), b = 1, …, B. The use of the normalizing factors in (4) is motivated in the Appendix. Thus, the null hypothesis Hd can be rejected if the calculated confidence bound lies below Δd. (Likewise for Ho.) We will call this the supremum approach in the following. A numerical comparison of the pointwise and the supremum approaches is performed in Section 5. The proper choice of the bootstrap sample sizes mk will be investigated there, too.
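A sketch of the resulting test follows (our own illustration, reusing the helpers from the previous sketches; the grid argument stands in for the jump points of d̂ within [t0, t1]):

#include <algorithm>
#include <cmath>
#include <limits>
#include <random>
#include <vector>

// Supremum of the estimated difference curve over a grid of times in [t0,t1].
double sup_d(const std::vector<Obs>& T, const std::vector<Obs>& R,
             const std::vector<double>& grid) {
    double s = -std::numeric_limits<double>::infinity();
    for (double t : grid) s = std::max(s, d_hat(T, R, t));
    return s;
}

// Upper (1-alpha) hybrid bound (4) for sup d(t), with m-out-of-n resampling.
double supremum_bound(const std::vector<Obs>& T, const std::vector<Obs>& R,
                      std::size_t mT, std::size_t mR,
                      const std::vector<double>& grid,
                      double alpha, int B, std::mt19937& g) {
    const double nT = static_cast<double>(T.size());
    const double nR = static_cast<double>(R.size());
    const double sup0 = sup_d(T, R, grid);
    std::vector<double> boot(B);
    for (int b = 0; b < B; ++b)
        boot[b] = sup_d(resample(T, mT, g), resample(R, mR, g), grid);
    std::sort(boot.begin(), boot.end());
    const double q_alpha = boot[static_cast<std::size_t>(alpha * B)];
    // normalizing factor c = sqrt( mR mT (nR+nT) / ( nR nT (mR+mT) ) )
    const double c = std::sqrt(static_cast<double>(mR) * mT * (nR + nT)
                               / (nR * nT * (mR + mT)));
    return (1.0 + c) * sup0 - c * q_alpha;    // reject Hd if this lies below Delta_d
}

With mk = nk the factor c equals 1, which is the conservative default examined in Section 5.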

4. REMARKS AND EXTENSIONS

Remark 4.1
The above methods can be extended to situations with more general censoring mechanisms. For this it has to be assured that the weak convergence of the underlying empirical processes (product-limit processes) holds, and a strongly consistent method for bootstrapping these processes has to be used. Then both the pointwise and the supremum approach can be applied as described above. The weak consistency of the resulting procedures follows along the lines of the discussion in Appendix A. To give some examples, note that strongly consistent methods for bootstrapping the underlying empirical processes (and cumulative hazard processes) were proposed for the case of progressive type I censoring in Reference [31], and for the case of random left truncation and right censoring in Reference [32]. See also Reference [33, p. 220] for a method of bootstrapping general counting processes.

Remark 4.2
For the pointwise approach, it is possible to specify a variable non-inferiority boundary, for example Δo(t), t ∈ [t0, t1], for the testing problem (2). This might be of interest when the underlying event rates change over time, which changes the interpretation of the magnitude of the boundary at different time points. See also References [34, 35] for the use of variable non-inferiority margins in the case of binary data, where the boundary is a function of the event probability of one group.

Remark 4.3
Observe that the following procedure is valid with the pointwise approach. Starting from the initial time window, one can select the biggest time interval within this window for which the null hypothesis can be rejected, without increasing the size of the test (cf. the discussion of this issue in Reference [7]); a sketch of this selection is given after this remark. Hence, the pointwise confidence bands can always be used as an additional analysis to the supremum test. Observe that, even if a study is planned for the investigation of the mortality rates only at a single time point, it is still a valid procedure to


accompany the data analysis a posteriori by reporting the largest time interval within the follow-up period for which Hd (Ho) can be rejected.
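A sketch of this a posteriori selection on a time grid (our own illustration; upper_bounds[i] is the pointwise upper confidence bound at the i-th grid point and margin[i] the, possibly time-varying, boundary of Remark 4.2):

#include <cstddef>
#include <utility>
#include <vector>

// Largest contiguous run of grid indices on which the pointwise upper bound
// stays below the margin, i.e. on which the null hypothesis is rejected.
std::pair<std::size_t, std::size_t>           // half-open index range [begin, end)
largest_rejection_run(const std::vector<double>& upper_bounds,
                      const std::vector<double>& margin) {
    std::size_t best_b = 0, best_e = 0, run_b = 0;
    bool in_run = false;
    for (std::size_t i = 0; i < upper_bounds.size(); ++i) {
        if (upper_bounds[i] < margin[i]) {
            if (!in_run) { run_b = i; in_run = true; }
            if (i + 1 - run_b > best_e - best_b) { best_b = run_b; best_e = i + 1; }
        } else {
            in_run = false;
        }
    }
    return {best_b, best_e};                  // empty range if never rejected
}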

5. SIMULATION RESULTS

In the following we present a summary of an extensive simulation study of the finite-sample behaviour of the proposed tests.

For the investigation of the test of Hd in (1), the cdfs FR and FT were chosen as exponential with scale parameters 0.5 and λ, respectively. The time interval [t0, t1] = [0, 12] was considered. For the test of Ho in (2), two settings (a) and (b) were investigated. In case (a), FR was chosen log-logistic, FR(t) = 1 − 1/[1 + (θ1 t)^θ2], with parameters θ1 = 0.8 and θ2 = 1.7 or θ2 = 1.8, whereas FT was chosen log-normal, FT(t) = Φ((ln(t) − μ1)/μ2). Here, the time interval [t0, t1] = [0.4, 6.0] was taken. In case (b), a proportional odds setting, logit(P(TRi ≤ t)) = log(0.08 t), logit(P(TTi ≤ t)) = log(0.08 t) + θ2, was chosen in the time interval [t0, t1] = [1.5, 12.0]. In particular, for the odds ratio the time intervals were defined such that the underlying cdfs attain values between 0.1 and 0.95 (because of the large variability of estimators of the odds ratio for very small or very large probabilities).

For the simulations referring to the significance level, the remaining parameters of FT were chosen such that the boundary Δd or Δo, respectively, was attained at least at one time point. For the exponential cdf, λ = 0.6567 was used to achieve Δd = 0.1 at a single time point, and λ = 0.86707 to achieve Δd = 0.2. For the odds ratio, in setting (a) the log-normal cdf with (μ1, μ2) = (0.15, 1.038) was used to attain Δo = 1.25 at a single time point, and with parameters (0.071, 1.0) to achieve Δo = 1.5. For setting (b) the parameter θ2 was varied to obtain a constant odds ratio of 1.25 and 1.5, respectively. For the power simulations the true discrepancy was set to Δd = 0 or Δo = 1, respectively, i.e. here FT := FR.

For all settings selected, exponentially distributed censoring times (with the same distribution in each group) were used, such that the censoring proportion in the reference group was equal to z = 0, 0.2, 0.4. The sample sizes were chosen as nR = nT ≡ n = 100, 200, 500, 1000, 1500, and the nominal level of the tests was set to α = 0.05, 0.10. For each setting, 1000 simulations with B = 2000 bootstrap replications were conducted. The simulations were performed using C++ code, which is available from the authors upon request.

The results for the tests of Hd are displayed in Table I, where 'type I' stands for the simulated type I error probabilities. On the left-hand side, the test based on pointwise BCa confidence intervals is presented. We mention that the results from the PC method (cf. Section 3) were slightly worse for smaller sample sizes and comparable to those from the BCa method for larger sample sizes. Note that for the pointwise approach there are two possibilities for drawing the bootstrap samples: either only one bootstrap sample is taken for calculating the confidence intervals for each t ∈ [t0, t1], or a new bootstrap sample is drawn for each point of time corresponding to a jump of the discrepancy function. Our simulations revealed that there was no remarkable difference in the results for these resampling schemes; hence only the computationally much simpler first method will be displayed in the following. For the sake of brevity, for the supremum test (with the hybrid method) we only display the results for the case m = n (m := mR = mT), since our simulations showed that the rejection probability of the test is decreasing with increasing m (cf. also Figures 4 and 5).


Table I. Simulated type I error probabilities and power of tests of Hd.

                       Pointwise approach (BCa)              Supremum approach (hybrid; m = n)
                       α = 0.05         α = 0.10             α = 0.05         α = 0.10
   n    z    Δd        type I   power   type I   power       type I   power   type I   power
  100   0    0.1       0.001    0.083   0.004    0.174       0.041    0.386   0.062    0.461
             0.2       0.005    0.665   0.012    0.789       0.040    0.857   0.070    0.913
  200   0    0.1       0.000    0.237   0.006    0.393       0.037    0.636   0.057    0.715
             0.2       0.005    0.949   0.011    0.977       0.046    0.983   0.072    0.992
  200   0.2  0.1       0.001    0.138   0.004    0.255       0.036    0.545   0.054    0.638
             0.2       0.002    0.906   0.013    0.960       0.034    0.975   0.065    0.985
  200   0.4  0.1       0.000    0.055   0.001    0.124       0.020    0.386   0.036    0.479
             0.2       0.000    0.735   0.010    0.846       0.042    0.920   0.079    0.962
  500   0    0.1       0.003    0.774   0.007    0.877       0.042    0.924   0.072    0.947
             0.2       0.004    1       0.021    1           0.041    1       0.064    1
 1000   0    0.1       0.005    0.988   0.017    0.997       0.034    0.990   0.067    0.996
             0.2       0.009    1       0.025    1           0.038    1       0.064    1
 1500   0    0.1       0.008    0.998   0.017    1           0.050    1       0.088    1
             0.2       0.009    1       0.024    1           0.045    1       0.078    1

Figure 4. Simulated type I error probabilities of the (hybrid) supremum test for Ho in dependence on m, for n = 500 and n = 1000; setting (a), Δo = 1.5, α = 0.05.


Figure 5. Simulated type I error probabilities of the (hybrid) supremum test for Ho in dependence on m, for n = 500 and n = 1000; setting (b), Δo = 1.5, α = 0.05.

Numerically, we found in all settings that the choice m = n ensures that the supremum test keeps its nominal level. The pointwise test was found to be very conservative as compared to the supremum test. For sample sizes n = 200 and the boundary Δd = 0.2, the supremum test already achieves power values larger than 0.9, even for the most conservative choice m = n. Overall, censoring tends to decrease the power of the tests. We mention that the results for the pointwise confidence intervals based on the approach of Parzen et al. [26] were very similar to those of the pointwise non-parametric bootstrap test; hence they are not displayed here.

Table II shows the results for the supremum test of Ho for settings (a) and (b), again for the case of maximal bootstrap sample sizes. The results for the pointwise BCa test (and also for the pointwise PC test) were again much more conservative, and hence are not displayed. It can be seen that the properties of the test depend strongly on the underlying cdfs. The two settings (a) and (b) are extreme cases in the sense that in case (a) the supremum odds ratio is attained only at a single time point, whereas in case (b) it is attained at every time point. This might be the reason for the lower rejection probabilities under the null hypothesis obtained for setting (b) as compared to setting (a) in the case of large sample sizes. We mention that the rejection probability for setting (b) increases as the time window [t0, t1] is shortened. For both settings, the nominal level is maintained. Higher power values were achieved for setting (b). This may be due to the fact that the right trimming (by t1) is 'stronger' in setting (b), in the sense that the values of the underlying cdfs are only about 0.6 at t1, in contrast to values of about 0.95 for setting (a). Again, censoring tends to decrease the power of the test. For large sample sizes n = 1500, no censoring, and Δo = 1.5 (cf. the setting of the COMPASS trial in Example 1.2), the attained power values are satisfactory. The dependence of the type I error probabilities on the bootstrap sample sizes is displayed in Figures 4 and 5, respectively. It can be seen that the test slightly exceeds its nominal level for setting (a) in the case of small values of m, whereas it is conservative throughout for setting (b). Overall, for practical applications we recommend using rather large values of m.


Table II. Simulated type I error probabilities and power of the supremum test of Ho (hybrid bootstrap; m = n).

                       Setting (a)                           Setting (b)
                       α = 0.05         α = 0.10             α = 0.05         α = 0.10
   n    z    Δo        type I   power   type I   power       type I   power   type I   power
  100   0    1.25      0.017    0.045   0.030    0.074       0.023    0.129   0.033    0.176
             1.50      0.026    0.132   0.039    0.189       0.025    0.275   0.041    0.335
  200   0    1.25      0.027    0.082   0.046    0.118       0.027    0.165   0.038    0.225
             1.50      0.026    0.253   0.047    0.325       0.015    0.417   0.027    0.493
  200   0.2  1.25      0.014    0.064   0.028    0.109       0.017    0.150   0.034    0.219
             1.50      0.021    0.209   0.037    0.272       0.018    0.399   0.033    0.449
  200   0.4  1.25      0.011    0.049   0.020    0.077       0.017    0.126   0.020    0.171
             1.50      0.022    0.135   0.033    0.194       0.023    0.349   0.036    0.439
  500   0    1.25      0.023    0.199   0.043    0.273       0.024    0.272   0.039    0.366
             1.50      0.037    0.562   0.059    0.655       0.019    0.761   0.036    0.823
 1000   0    1.25      0.027    0.390   0.052    0.481       0.018    0.534   0.034    0.619
             1.50      0.046    0.802   0.078    0.847       0.017    0.930   0.029    0.954
 1500   0    1.25      0.041    0.547   0.071    0.632       0.013    0.671   0.025    0.746
             1.50      0.043    0.917   0.080    0.943       0.017    0.971   0.023    0.981

In general, a proper choice of m can be determined by simulations in the planning stage of a study. Nevertheless, the choice m = n was always found to be a valid option (in the sense of maintaining the nominal level), albeit theory formally requires m = o(n).

In summary, the simulation results reflect the conservatism of the pointwise approach. On the other hand, reasonable simulated power values were obtained for the supremum approach. The choice m = n of the bootstrap sample size for the supremum approach yields tests which keep the nominal level and still show good power properties.

6. ANALYSIS OF THE EXAMPLES

We illustrate the application of the proposed methods for the data examples discussed in Section 1.

Example 6.1 (Lung cancer data; cf. Example 1.1)
In Figure 6 the estimated difference of the survival functions for the conventional and the palliative irradiation therapies from the lung cancer study is displayed for the first 18 months. Moreover, upper 95 per cent BCa confidence bounds are shown for each time point. It can be seen that these remain below the boundary of 0.2 (used in Reference [20]) throughout the whole time interval. Thus, the pointwise BCa confidence interval test of hypotheses (1) rejects Hd for the entire time period [t0, t1] = [0, 18] and Δd = 0.2.


Figure 6. Estimator (—) and pointwise upper 95 per cent BCa confidence bounds (- -) for the difference of the survival functions for R and T from the lung cancer study.

Figure 7. Estimator (—) and pointwise upper 95 per cent BCa confidence bounds (- -) for the odds ratio of the cdfs for T and R from the COMPASS trial.

For the supremum approach, the upper 95 per cent hybrid confidence bound for the supremum of d(·) was calculated as 0.074 for mk = nk, k = R, T (and as 0.073 for mk = 30, 0.068 for mk = 20); hence, the corresponding test also rejects the null hypothesis Hd.

Example 6.2 (COMPASS data; cf. Example 1.2)
Figure 7 shows the estimated odds ratio of the cdfs for saruplase and streptokinase treatment from the COMPASS trial over the period of [5, 365] days. In addition, upper 95 per cent BCa confidence bounds for the odds ratio at each time point are displayed in the figure. Since all of these are below the boundary of 1.5, the null hypothesis of (2) can be rejected for [t0, t1] = [5, 365] and Δo = 1.5. For the supremum test, the upper 95 per cent hybrid confidence bound for the supremum of ψ(·) was obtained as 1.17 for mk = nk, k = R, T (and as 1.15 for mk = 750, 1.14 for mk = 500); hence, this test also leads to a rejection of Ho at the 5 per cent


level, and thus non-inferiority of saruplase can be shown over the entire follow-up period of 12 months.

7. DISCUSSION

In summary, the presented methods allow for the non-parametric assessment of non-inferiority over an entire period of survival times instead of only a single time point. We mention that the same techniques as for the difference and the odds ratio function could also be applied to the relative risk function r(t) := FT(t)/FR(t), and to other smooth functionals T(FT, FR) of the survivor functions or of the cumulative hazard functions (such as the ratio of the cumulative hazard functions; cf. Reference [25]).

For the application of the proposed testing procedures, there are several parameters to be chosen. First, an appropriate time window [t0, t1] has to be specified for which non-inferiority is to be shown. Usually, this will be shorter than the follow-up period, but we believe that it is often more appropriate than just a single time point. Moreover, it seems preferable to do a thorough non-parametric analysis on a restricted time interval rather than an analysis of the whole follow-up period that uses the wrong (parametric or semi-parametric) model.

The choice of the parameter Δd or Δo, respectively, can be made essentially in the same way as for simple pointwise hypotheses. Here we refer also to the special issue of Statistics in Medicine [1] (in particular to Reference [15]), and to the special guideline of the CPMP [5]. Note that for the pointwise approach it would also be possible to specify a non-constant boundary function Δd(t) or Δo(t), respectively (cf. Remark 4.2).

In addition, for the supremum approach an appropriate bootstrap sample size has to be chosen. We recommend using rather large values of m in order to ensure that the level of the test is maintained. In planning a trial, several realistic settings should be considered via simulations in order to choose m. A 'safe' choice is given by m = n, although this yields rather conservative tests.

If the focus of an application is on testing, the supremum approach should be preferred, since it yields less conservative and more powerful tests. However, the relative merit of the pointwise approach is to provide a useful graphical tool for the analysis of the data. This helps to get a clearer impression of the behaviour of the discrepancy measure over the pre-specified time interval, beyond the aggregated information obtained from the supremum approach. Moreover, it is a valid procedure to start with the initial time window and then select the biggest time interval within this window for which the null hypothesis can be rejected, without increasing the size of the test (cf. Remark 4.3).

Finally, if concern is raised with respect to the equality or 'sufficient similarity' of the censoring distributions, it is possible to accompany our test by a non-parametric test of the equivalence of the two censoring distributions. For this, the equivalence test suggested in Munk and Czado [36] can be used, based on the non-parametric estimators of the censoring distributions Gk, k = R, T. Note that the bootstrapping scheme has to be adjusted accordingly.

APPENDIX A

In the following we sketch the main ideas underlying the tests presented in Section 3. The detailed proofs are provided in a technical report [25]. There, a generalized setting in terms of smooth functionals of the underlying cumulative hazard functions is considered, with the difference and odds ratio functionals as special cases. These can also be viewed as functionals of the underlying distribution functions, as indicated below.

Both testing approaches are based on the weak convergence of the underlying empirical processes (product-limit processes; cf., e.g. Reference [33]) and on the resulting weak convergence of the difference and odds ratio processes, respectively. The latter is discussed in Reference [25] and essentially uses the functional delta method as presented, e.g., in Gill [37]. Here, the Hadamard differentiability of the difference functional and of the odds ratio functional is exploited.

In addition, the consistency of the bootstrap of the empirical processes is required for the application of the proposed methods. For the precise definition of (weak or strong) consistency of the bootstrap method we refer to Reference [24]. Basically, it means that the distributions of the limiting random elements of the processes or functionals under consideration can be mimicked 'sufficiently well' asymptotically by their bootstrap analogues. According to Akritas [38], the simple method for bootstrapping randomly right-censored data (cf. Section 3) is strongly consistent for the reproduction of the asymptotic distribution of the product-limit processes.

Further, the delta method for the bootstrap (cf., e.g. Reference [37]) yields the weak consistency of the bootstrapped versions d̂∗(t) and ψ̂∗(t) of d̂(t) and ψ̂(t), respectively, which can be used immediately to construct the pointwise confidence bounds as described in Section 3. Note that here the bootstrap sample sizes can be chosen as mk = nk, k = R, T. The weak consistency of the supremum approach was shown in Reference [25] as follows.

First, the convergence in distribution of the supremum statistics sup d̂ and sup ψ̂ was derived, based on results of Raghavachari [39]. Then the weak consistency of the bootstrapped supremum statistics was shown, using almost sure constructions as in the proof of Theorem 5 in Reference [37] and the assumption mk = o(nk), k = R, T. Thus, we get that the bootstrap versions

√(mRmT/(mR + mT)) ( sup_{t∈[t0,t1]} d̂∗(t) − sup_{t∈[t0,t1]} d̂(t) )   and   √(mRmT/(mR + mT)) ( sup_{t∈[t0,t1]} ψ̂∗(t) − sup_{t∈[t0,t1]} ψ̂(t) )

are weakly consistent for the asymptotic distributions of

√(nRnT/(nR + nT)) ( sup_{t∈[t0,t1]} d̂(t) − sup_{t∈[t0,t1]} d(t) )   and   √(nRnT/(nR + nT)) ( sup_{t∈[t0,t1]} ψ̂(t) − sup_{t∈[t0,t1]} ψ(t) )

respectively. From this the corresponding tests can be derived immediately, by Monte Carlo simulation of the distributions of √(mRmT/(mR + mT)) (sup_{t∈[t0,t1]} d̂∗(t) − sup_{t∈[t0,t1]} d̂(t)) and √(mRmT/(mR + mT)) (sup_{t∈[t0,t1]} ψ̂∗(t) − sup_{t∈[t0,t1]} ψ̂(t)), respectively. This yields the normalizing factors in the calculations of the upper confidence bounds as given in (4).

ACKNOWLEDGEMENTS

This research was supported by grant TR 471/1 of the Deutsche Forschungsgemeinschaft. Further, the authors wish to thank U. Nestle for supplying the data from the lung cancer study, and H.-J. Trampisch for helpful discussions and for the data from the COMPASS trial. Valuable comments by R. O'Neill are gratefully acknowledged. The authors would like to thank two referees for helpful suggestions which led to an improvement of the manuscript.

REFERENCES

1. Non-Inferiority Trials: Advances in Concepts and Methodology. In D'Agostino RB et al. (eds). Statistics in Medicine 2003; 22(2):165–336.


2. CPMP. Biostatistical methodology in clinical trials in applications for marketing authorizations for medicinal products. CPMP Working Party on Efficacy of Medicinal Products Note for Guidance III/3630/92-EN. Statistics in Medicine 1995; 14:1659–1682.
3. CPMP. Points to consider on switching between superiority and non-inferiority. CPMP/EWP/482/99, 2000.
4. ICH. ICH harmonised tripartite guideline. Statistical principles for clinical trials (E9). Statistics in Medicine 1999; 18:1905–1942.
5. CPMP. Guideline on the choice of the non-inferiority margin. EMEA/CPMP/EWP/2158/99, 2005.
6. ICH. ICH harmonised tripartite guideline. Choice of control group and related issues in clinical trials (E10). http://www.fda.gov/cder/guidance/iche10.pdf, 2001.
7. Freitag G. Methods of assessing noninferiority with censored data. Biometrical Journal 2005; 47:88–98.
8. Bristol DR. Planning survival studies to compare a treatment to an active control. Journal of Biopharmaceutical Statistics 1993; 3:153–158.
9. Stallard N, Whitehead A. An alternative approach to the analysis of animal carcinogenicity studies. Regulatory Toxicology and Pharmacology 1996; 23:244–248.
10. Fleming TR. Evaluation of active control trials in AIDS. Journal of Acquired Immune Deficiency Syndromes 1990; 2:S82–S87.
11. Com-Nougue C, Rodary C, Patte C. How to establish equivalence when data are censored: a randomized trial of treatments for B non-Hodgkin lymphoma. Statistics in Medicine 1993; 12:1353–1364.
12. Wellek S. A log-rank test for equivalence of two survivor functions. Biometrics 1993; 49:877–881.
13. Rothmann M, Li N, Chen G, Chi GYH, Temple R, Tsou H-H. Design and analysis of non-inferiority mortality trials in oncology. Statistics in Medicine 2003; 22:239–264.
14. Wei LJ, Gail MH. Nonparametric estimation for a scale-change with censored observations. Journal of the American Statistical Association 1983; 78:382–388.
15. Broët P, Tubert-Bitter P, De Rycke Y, Moreau T. A score test for establishing non-inferiority with respect to short-term survival in two-sample comparisons with identical proportions of long-term survivors. Statistics in Medicine 2003; 22:931–940.
16. Su JQ, Wei LJ. Nonparametric estimation for the difference or ratio of median failure times. Biometrics 1993; 49:603–607.
17. Efron B. The two-sample problem with censored data. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. IV. University of California Press: Berkeley, CA, 1967; 831–853.
18. Kalbfleisch JD, Prentice RL. Estimation of the average hazard ratio. Biometrika 1981; 68:105–112.
19. TESTIMATE. Statistics package for exact tests, effect size measures and confidence intervals, test procedures for difference/equivalence. idv-Data Analysis and Study Planning, Gauting/Munich, Germany.
20. Nestle U, Nieder C, Walter K, Abel U, Ukena D, Sybrecht GW, Schnabel K. A palliative accelerated irradiation regimen for advanced non-small-cell lung cancer versus conventionally fractionated 60 Gy: results of a randomized equivalence study. International Journal of Radiation Oncology Biology Physics 2000; 48:95–103.
21. Tebbe U, Michels R, Adgey J, Boland J, Caspi A, Charbonnier B, Windeler J et al. Randomized, double-blind study comparing saruplase with streptokinase therapy in acute myocardial infarction: the COMPASS equivalence trial. Journal of the American College of Cardiology 1998; 31:487–493.
22. Berger RL. Multiparameter hypothesis testing and acceptance sampling. Technometrics 1982; 24:295–300.
23. Efron B. Censored data and the bootstrap. Journal of the American Statistical Association 1981; 76:312–319.
24. Shao J, Tu D. The Jackknife and Bootstrap. Springer Series in Statistics. Springer: New York, 1995.
25. Freitag G, Munk A. Bootstrap procedures for the nonparametric assessment of noninferiority with censored data. Technical Report, Georg-August-Universität Göttingen, http://www.stochastik.math.uni-goettingen.de/research/, 2005.
26. Parzen MI, Wei LJ, Ying Z. Simultaneous confidence intervals for the difference of two survival functions. Scandinavian Journal of Statistics 1997; 24:309–314.
27. Shao J. Bootstrap sample size in nonregular cases. Proceedings of the American Mathematical Society 1994; 122:1251–1262.
28. Bickel PJ, Ren J-J. The m out of n bootstrap and goodness of fit tests with doubly censored data. In Robust Statistics, Data Analysis, and Computer-Intensive Methods, Rieder H (ed.). Lecture Notes in Statistics, vol. 109. Springer: Berlin, 1996; 35–47 (in honour of Peter Huber's 60th birthday).
29. Bickel PJ, Götze F, van Zwet WR. Resampling fewer than n observations: gains, losses, and remedies for losses. Statistica Sinica 1997; 7:1–31.
30. Lee SMS. On a class of m out of n bootstrap confidence intervals. Journal of the Royal Statistical Society, Series B 1999; 61:901–911.
31. Doss H, Gill RD. An elementary approach to weak convergence for quantile processes, with applications to censored survival data. Journal of the American Statistical Association 1992; 87:869–877.
32. Gross ST, Lai TL. Bootstrap methods for truncated and censored data. Statistica Sinica 1996; 6:509–530.
33. Andersen PK, Borgan Ø, Gill RD, Keiding N. Statistical Methods Based on Counting Processes. Springer: New York, 1993.


34. Röhmel J, Mansmann U. Unconditional non-asymptotic one-sided tests for independent binomial proportions when the interest lies in showing non-inferiority and/or superiority. Biometrical Journal 1999; 41:149–170.
35. Munk A, Skipka G, Stratmann B. Testing general hypotheses under binomial sampling: the two sample case—asymptotic theory and exact procedures. Computational Statistics and Data Analysis 2005; 49:723–739.
36. Munk A, Czado C. Nonparametric validation of similar distributions and assessment of goodness of fit. Journal of the Royal Statistical Society, Series B 1998; 60(1):223–241.
37. Gill RD. Non- and semi-parametric maximum likelihood estimators and the von Mises method—I. Scandinavian Journal of Statistics 1989; 16:97–128.
38. Akritas MG. Bootstrapping the Kaplan–Meier estimator. Journal of the American Statistical Association 1986; 81:1032–1038.
39. Raghavachari M. Limiting distributions of Kolmogorov–Smirnov-type statistics under the alternative. Annals of Statistics 1973; 1:67–73.
