Power analysis in randomized clinical trials based on item response theory

Controlled Clinical Trials 24 (2003) 390–410

Rebecca Holman, M.Math. a,*, Cees A.W. Glas, Ph.D. b, Rob J. de Haan, Ph.D. a

a Department of Clinical Epidemiology and Biostatistics, Academic Medical Center, Amsterdam, The Netherlands
b Department of Research Methodology, Measurement and Data Analysis, University of Twente, Enschede, The Netherlands

Manuscript received May 20, 2002; manuscript accepted March 17, 2003

Abstract

Patient relevant outcomes, measured using questionnaires, are becoming increasingly popular endpoints in randomized clinical trials (RCTs). Recently, interest in the use of item response theory (IRT) to analyze the responses to such questionnaires has increased. In this paper, we used a simulation study to examine the small sample behavior of a test statistic designed to examine the difference in average latent trait level between two groups when the two-parameter logistic IRT model for binary data is used. The simulation study was extended to examine the relationship between the number of patients required in each arm of an RCT, the number of items used to assess them, and the power to detect minimal, moderate, and substantial treatment effects. The results show that the number of patients required in each arm of an RCT varies with the number of items used to assess the patients. However, as long as at least 20 items are used, the number of items barely affects the number of patients required in each arm of an RCT to detect effect sizes of 0.5 and 0.8 with a power of 80%. In addition, the number of items used has more effect on the number of patients required to detect an effect size of 0.2 with a power of 80%. For instance, if only five randomly selected items are used, it is necessary to include 950 patients in each arm, but if 50 items are used, only 450 are required in each arm. These results indicate that if an RCT is to be designed to detect small effects, it is inadvisable to use very short instruments analyzed using IRT. Finally, the SF-36, SF-12, and SF-8 instruments were considered in the same framework. Since these instruments consist of items scored in more than two categories, slightly different results were obtained. © 2003 Elsevier Inc. All rights reserved.

Keywords: Item response theory; Sample size; Power; Latent trait; IRT; Two-parameter model; SF-36; SF-12; SF-8

* Corresponding author: Rebecca Holman, M.Math, Department of Clinical Epidemiology and Biostatistics, Academic Medical Center, PO Box 22700, 1100 DE Amsterdam, The Netherlands. Tel.: +31-20-566-6947; fax: +31-20-691-2683.
E-mail address: [email protected]

0197-2456/03/$ - see front matter © 2003 Elsevier Inc. All rights reserved. doi:10.1016/S0197-2456(03)00061-8



Introduction

In recent years, there has been an enormous increase in the use of patient relevant outcomes, such as functional status and quality of life, as endpoints in medical research, including controlled randomized clinical trials (RCTs). Many patient relevant outcomes are measured using questionnaires designed to quantify a theoretical construct, often modeled as a latent variable. When a questionnaire is administered to a patient, responses to individual items are recorded. Often the scores on each item are added together to obtain a single score for each patient. The reliability and validity of sum scores are usually examined in the framework of classical test theory. This framework is widely accepted and applied in many areas of medical assessment [1]. However, following dissatisfaction with these methods, interest in the use of an alternative paradigm, known as item response theory (IRT), has grown [2]. IRT was developed as an alternative to the use of classic test theory when analyzing data resulting from school examinations. An overview of IRT methods is given in this paper, while in-depth descriptions in general [3,4] and in medical applications [5] are given elsewhere.

Advantages of using IRT to analyze an RCT include proper modeling of ceiling and floor effects, solutions to the problem of missing data, and straightforward ways of dealing with heteroscedasticity between groups. However, the main advantage is that it is not essential to assess all patients with exactly the same items. For instance, if sufficient information on the measurement characteristics of the Medical Outcomes Study Short-Form Health Survey with 36 items (SF-36) [6] in a particular patient population were available, it would be possible to obtain completely comparable estimates of health status using only the items most appropriate to each individual patient. An extension of this is computerized adaptive testing, in which each patient potentially receives a different computer-administered questionnaire in which the questions offered to each patient depend on the responses given to previous questions [7,8]. A prerequisite of this type of testing is access to a large item bank that has been calibrated using responses from comparable patients [9,10].

Ethical considerations require that as few patients as possible are exposed to the “risk” of a novel treatment during an RCT, but it is important to ensure that enough patients are included to have a reasonable power of detecting the effect of interest. For this reason, calculation of the minimal sample size required to demonstrate a clinically relevant effect has become integral to the RCT literature [11]. However, since IRT was developed as a tool for analyzing data resulting from examinations, most technical work has concentrated on the statistical challenges found in this field. For example, when assessing the effects of an educational intervention, in some ways similar to an RCT, thousands of pupils, even whole cohorts, are often included. This means that minimal sample size and power calculations in relation to questionnaires analyzed with IRT have received very little attention. In addition, IRT offers a framework in which the number of items used to assess patients can be easily varied. Thus, sample size calculations need to consider not only the number of patients, but also the number of items used. Some work has touched on these issues [12] or considered it as a sideline of another aspect [13], but no guidance on sample size calculations for RCTs in the context of IRT has been published.

In this paper, the relationship between the number of patients in each treatment arm, the number of items used to assess the patients, and the power to detect given effect sizes will be examined using a simulation study. The results will be used to develop guidelines for the number of patients to be used in each arm of an RCT when a questionnaire analyzed using IRT is used as the primary outcome. The methods are illustrated in a population of patients with end-stage renal disease 12 months after starting dialysis [14,15] using the SF-36 [6], the SF-12 [16], and the SF-8 [17] health-status surveys. In addition, to provide a framework in which data from RCTs can be analyzed using IRT, the behavior of asymptotic methods developed to compare the mean level of the latent trait in two groups will be considered in the relatively small samples encountered in RCTs.

IRT in an RCT

IRT is used to model the probability that a patient will respond to a number of items related to a latent trait in a certain way. The two-parameter logistic model [18] is for data resulting from items with two response categories, “0” and “1,” and is one of many IRT models developed [19]. In this model, the probability, $p_{ik}(\theta)$, that patient $k$, with latent trait equal to $\theta_k$, will respond to item $i$ in category 1 is given by

$$p_{ik}(\theta) = \frac{\exp(\alpha_i(\theta_k - \beta_i))}{1 + \exp(\alpha_i(\theta_k - \beta_i))} \tag{1}$$

where $\alpha_i$ and $\beta_i$ are known as item parameters. The more widely known Rasch model [20] is similar to the two-parameter logistic model, but with the parameters $\alpha_i$ assumed equal to 1. An important assumption of IRT models is that of local independence, meaning that the probability of a patient scoring 1 on a given item is independent of them scoring 1 on another item, given their value of $\theta$. This means that the correlations between items and over patients are fully explained by $\theta$. Models have also been developed for data resulting from items with more than two response categories [4,21]. The generalized partial credit model [22] is a fairly well-known example in which the probability $p_{ijk}(\theta)$ that a patient with latent trait equal to $\theta_k$ will respond to an item $i$, with $(J_i + 1)$ response categories, in category $j$, $j > 0$, is given by

$$p_{ijk}(\theta) = \frac{\exp\left(\sum_{u=0}^{j} \alpha_i(\theta_k - \beta_{iu})\right)}{1 + \sum_{j=0}^{J_i} \exp\left(\sum_{u=0}^{j} \alpha_i(\theta_k - \beta_{iu})\right)} \tag{2}$$

where $\alpha_i$ is the discrimination parameter of item $i$ and $\beta_{iu}$ indicates the point at which the probability of choosing category $j$ or category $j - 1$ is equal.
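As a small illustration of Eq. (1) and of the local independence assumption, the following sketch evaluates the two-parameter logistic probability and the probability of a whole response pattern. The function names are hypothetical and are not taken from the paper or from any IRT package.

```python
import math

def p_2pl(theta, alpha, beta):
    """Probability of a response in category 1 under Eq. (1)."""
    z = alpha * (theta - beta)
    # Numerically stable form of exp(z) / (1 + exp(z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def p_rasch(theta, beta):
    """Rasch model: the two-parameter model with alpha fixed at 1."""
    return p_2pl(theta, 1.0, beta)

def pattern_probability(theta, responses, alphas, betas):
    """Local independence: given theta, item responses multiply."""
    prob = 1.0
    for x, a, b in zip(responses, alphas, betas):
        p = p_2pl(theta, a, b)
        prob *= p if x == 1 else 1.0 - p
    return prob
```

When the latent trait equals the item parameter $\beta_i$, the probability is 0.5 whatever the value of $\alpha_i$; a larger $\alpha_i$ only makes the curve steeper around that point.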

In IRT, the item and patient parameters are usually estimated in a two-stage procedure. First, the item parameters are estimated, often by assuming that the $\theta_k$ follow a normal distribution and integrating them out of the likelihood. Second, maximum likelihood estimates of $\theta_k$ are obtained using the previously estimated item parameters. In this study, it will be assumed that the items under consideration form part of a calibrated item bank [9], meaning that the item parameters have been previously estimated [23] from responses given by comparable patients to the items and are assumed known for all items. It is theoretically possible to estimate the item parameters from the responses given to the items by patients included in an RCT, but accurate estimates can only be obtained from large samples of patients, say over 500. Since this figure is rarely attained in RCTs, it will often not be practical to estimate the item parameters in an RCT.

In a straightforward RCT, the patient sample is randomly divided into two groups, say A and B. Each group receives a different treatment regimen, and primary and secondary outcomes are assessed once at the end of the study. The main interest is in the null hypothesis, $H_0$, that the mean level of the primary outcome, say $\theta$, is equal in both groups. This can be written as

$$H_0 : \mu_A - \mu_B = 0 \tag{3}$$

where $\mu_A$ and $\mu_B$ denote the mean of the distribution of $\theta$ in groups A and B, respectively. Clinically, the two groups are said to differ if the ratio of the difference $\mu_A - \mu_B$ and the standard deviation of $\theta$ is larger than a given effect size. A lot of work has been carried out examining the clinical relevance of particular effect sizes in given situations. However, in practice, interest is often in examining the arbitrarily defined minimal, moderate, and substantial effect sizes of 0.2, 0.5, and 0.8 on continuous variables [11]. The number of patients required to detect a given effect size with a particular power depends on the values of the effect size and the standard errors of $\mu_A$ and $\mu_B$.
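For orientation, the dependence of sample size on effect size can be sketched for the idealized case in which $\theta$ is observed without measurement error. The sketch below uses the textbook normal approximation for a two-sided, two-sample comparison; it is an assumption made for illustration, not the procedure used in this paper, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def patients_per_arm(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for a two-sided test.

    Assumes the outcome is observed directly (no measurement error)
    and has the same standard deviation in both arms.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)

for d in (0.2, 0.5, 0.8):
    print(d, patients_per_arm(d))
```

Calculations based on the t distribution give slightly larger numbers than this approximation, and any measurement error in $\theta$ raises them further.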

Now consider an RCT in which the primary outcome is $\theta$, measured at the end of the study using a questionnaire with $n$ items, each with two response categories, and analyzed using IRT. Let us assume that patients $1, 2, \ldots, K$ are in group A and patients $K + 1, K + 2, \ldots, 2K$ are in group B, meaning that the total sample size is $2K$. For group A, we can rewrite Eq. (1) as

$$p_{ik}(\theta) = \frac{\exp(\alpha_i(\mu_A + \varepsilon_k - \beta_i))}{1 + \exp(\alpha_i(\mu_A + \varepsilon_k - \beta_i))} \tag{4}$$

where $\theta_k = \mu_A + \varepsilon_k$ and $\mu_A$ is the mean of $\theta_k$ in group A. Hence, the $\varepsilon_k$ have mean 0 and standard deviation $\sigma_A$. For group B, Eq. (1) can be rewritten in a similar way, with $\theta_k = \mu_B + \varepsilon_k$ and the standard deviation of $\varepsilon_k$ equal to $\sigma_B$. For RCTs carried out using a questionnaire consisting of items with more than two response categories, Eq. (2) can be rewritten in a similar way. The main interest is in examining the null hypothesis in Eq. (3) by obtaining estimates of $\mu_A$ and $\mu_B$ and testing whether they are significantly different from each other. When using IRT-based techniques, it is inadvisable to estimate the values of $\theta_k$ for all patients and then perform standard analysis, such as t tests, on these estimates [13], since this ignores the measurement error inherent to $\theta_k$. This means that using standard methods for calculating sample size may lead to inaccurate conclusions. The estimation equations for $\mu_A$ and $\mu_B$ are complex, and the values of S.E.($\mu_A$) and S.E.($\mu_B$) depend not only on the number of patients in each arm of the RCT, but also on the number of items used to assess the patients and the relationship between $\alpha_i$ and $\beta_i$ and the distribution of $\theta$. This means that it is not possible to write down a straightforward equation for either $\mu_A$ and $\mu_B$ or the number of patients required to detect a given effect size with a particular power. Consistent estimates of $\mu_A$ and $\mu_B$ are obtained by maximizing a likelihood function that has been marginalized with respect to $\theta$ [24,25]. These estimation methods are described in more detail in the appendix.
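The idea of a likelihood marginalized with respect to $\theta$ can be sketched numerically. The code below is only an illustration under several assumptions that are not taken from the paper: the standard deviation of $\theta$ is treated as known and equal to 1, the integral is approximated on a fixed grid, and the maximum is found by a coarse grid search rather than by the estimation equations in the appendix. All names are hypothetical.

```python
import math
import random

def logistic(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Fixed grid of theta values used for the numerical integration (assumption).
NODES = [-6.0 + 0.2 * q for q in range(61)]

def pattern_likelihoods(responses, alphas, betas):
    """P(response pattern | theta) at each grid node, via local independence."""
    out = []
    for theta in NODES:
        prob = 1.0
        for x, a, b in zip(responses, alphas, betas):
            p = logistic(a * (theta - b))
            prob *= p if x == 1 else 1.0 - p
        out.append(prob)
    return out

def marginal_loglik(mu, sigma, liks):
    """Log-likelihood of one group with theta ~ N(mu, sigma^2) integrated out."""
    weights = [math.exp(-0.5 * ((t - mu) / sigma) ** 2) for t in NODES]
    norm = sum(weights)
    total = 0.0
    for lik in liks:
        marginal = sum(w * l for w, l in zip(weights, lik)) / norm
        total += math.log(max(marginal, 1e-300))  # guard against log(0)
    return total

def estimate_mu(data, alphas, betas, sigma=1.0):
    """Grid-search maximizer of the marginal likelihood over mu."""
    liks = [pattern_likelihoods(x, alphas, betas) for x in data]
    candidates = [-2.0 + 0.02 * c for c in range(201)]
    return max(candidates, key=lambda mu: marginal_loglik(mu, sigma, liks))

# Small demonstration on simulated data with known item parameters.
random.seed(0)
betas = [random.gauss(0.0, 1.0) for _ in range(15)]
alphas = [random.lognormvariate(0.2, 1.0) for _ in range(15)]
thetas = [random.gauss(0.5, 1.0) for _ in range(150)]
data = [[1 if random.random() < logistic(a * (t - b)) else 0
         for a, b in zip(alphas, betas)] for t in thetas]
mu_hat = estimate_mu(data, alphas, betas)
print(mu_hat)
```

Because individual $\theta_k$ are never estimated, the measurement error in each patient's responses is carried through to the estimate of the group mean instead of being ignored.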

The marginal maximum likelihood estimates of $\mu_A$ and $\mu_B$ can be combined with their standard errors to obtain the test statistic

$$Z_L = \frac{\hat{\mu}_A - \hat{\mu}_B}{\mathrm{S.E.}(\hat{\mu}_A - \hat{\mu}_B)} \tag{5}$$

where $\hat{\mu}_A$ and $\hat{\mu}_B$ denote the estimates described above. It has been proven that under the hypothesis $\mu_A - \mu_B = 0$ and for large sample sizes, say $2K > 500$, $Z_L$ follows an asymptotic standard Normal distribution [24]. However, since these methods were developed in the field of educational measurement where interventions are assessed using very large samples, the behavior of $Z_L$ for the smaller samples common in RCTs is as yet unknown.
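The test statistic in Eq. (5) can be sketched as follows. The sketch assumes that the two arms are independent, so that the standard error of the difference is the root of the summed squared standard errors; that step and the function names are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def z_statistic(mu_a_hat, se_a, mu_b_hat, se_b):
    """Eq. (5), with S.E. of the difference computed for independent arms."""
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return (mu_a_hat - mu_b_hat) / se_diff

def two_sided_p(z, sd=1.0):
    """Two-sided p-value when Z_L is referred to a N(0, sd^2) distribution."""
    return 2.0 * (1.0 - NormalDist(0.0, sd).cdf(abs(z)))

z = z_statistic(0.0, 0.1, -0.3, 0.1)
print(z, two_sided_p(z))
```

The `sd` argument is left adjustable so that a reference distribution other than the standard Normal can be used when samples are small.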

A simulation study

In this study, we simulated data from RCTs to examine the behavior of $Z_L$ and the power to detect a number of effect sizes with a given number of patients and items. We were particularly interested in RCTs where there were 30, 40, 50, 100, 200, 300, 500, or 1000, denoted $K$, patients in each arm and where these patients were assessed using 5, 10, 15, 20, 30, 50, 70, or 100, denoted $n$, items each with two response categories. These values were chosen to reflect the range of sample sizes often encountered in clinical research and the number of items with which it is acceptable to assess patients in a variety of situations. In total, there were 64 (= 8 × 8) different combinations of sample size and number of items in the study.

The study was carried out by simulating 1000 RCTs at each of the 64 different combinations. In each RCT, a group of $K$ values of $\theta$ were sampled from each of $N(0,1)$ and $N(\mu_B,1)$ distributions to represent the latent trait levels of patients in groups A and B, respectively. Since the standard deviation of $\theta$ is equal to one in both groups, the effect size in an RCT is equal to $\mu_A - \mu_B$. In addition, for each RCT $n$ values of $\beta$ and of $\alpha$ were generated from $N(0,1)$ and $\log N(0.2,1)$ distributions, respectively, to represent the items. These values were chosen as they give a reasonably high level of statistical information on the values of $\theta$ used in this study and thus represent a carefully chosen questionnaire in terms of item characteristics. Each RCT was “conducted” by calculating the probability, $p_{ik}$, that patient $k$ would respond to item $i$ in category 1, given their value of the latent trait, $\theta_k$, and the item parameters $\alpha_i$ and $\beta_i$ using the formula in Eq. (1). The response, $x_{ik}$, made by patient $k$ on item $i$ was obtained by taking an observation on a $\mathrm{Bi}(1, p_{ik})$ distribution. This was repeated for all $2K$ patients in an RCT and resulted in a data matrix with $2K$ rows and $n$ columns. The responses, $x_{ik}$, were used, together with the two-parameter logistic model, the “known” values of $\alpha$ and $\beta$, and marginal maximum likelihood estimation methods to obtain estimates of $\mu_A$, $\mu_B$, and the associated standard errors. These estimates were combined to obtain a value of $Z_L$ for each RCT. The simulations were carried out in a program adapted by the authors from OPLM, a commercially available program for estimating parameters in IRT models [26].
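The data-generating step described above can be sketched as follows. This is not the OPLM-based program used by the authors. The sketch assumes that log N(0.2, 1) means that the logarithm of $\alpha$ has mean 0.2 and standard deviation 1, which is the parameterization of Python's `random.lognormvariate`; the function name is hypothetical.

```python
import math
import random

def simulate_rct(K, n, mu_b, rng):
    """Simulate one RCT: a 2K x n matrix of binary item responses.

    Group A (first K rows) has theta ~ N(0, 1); group B has theta ~ N(mu_b, 1).
    """
    betas = [rng.gauss(0.0, 1.0) for _ in range(n)]
    alphas = [rng.lognormvariate(0.2, 1.0) for _ in range(n)]  # assumed parameterization
    thetas = ([rng.gauss(0.0, 1.0) for _ in range(K)]
              + [rng.gauss(mu_b, 1.0) for _ in range(K)])
    data = []
    for theta in thetas:
        row = []
        for a, b in zip(alphas, betas):
            z = a * (theta - b)
            # Eq. (1), evaluated in a numerically stable way.
            p = 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))
            # One observation from a Bi(1, p) distribution.
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data, alphas, betas

data, alphas, betas = simulate_rct(K=30, n=5, mu_b=0.5, rng=random.Random(42))
print(len(data), len(data[0]))
```

Repeating this 1000 times per design and estimating $\mu_A$ and $\mu_B$ from each data matrix yields the replicated values of $Z_L$ used in the rest of the study.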

The distribution of ZL under the null hypothesis

In order to provide a framework for examining the small sample behavior of $Z_L$ under the null hypothesis $H_0 : \mu_A - \mu_B = 0$, the values of $\mu_A$ and $\mu_B$ were set equal to 0 and 1000 RCTs carried out at each of the 64 combinations of $K$ and $n$. The distribution of the values of $Z_L$ obtained for each combination of $K$ and $n$ was examined by calculating the mean, $\bar{Z}_L$, and standard deviation, S.D.($Z_L$), of the values of $Z_L$ obtained from the 1000 RCTs conducted with each combination of $K$ and $n$. In addition, the values of $Z_L$ were tested to see whether there was evidence that they did not form a sample from a normal distribution, with mean $\bar{Z}_L$ and standard deviation S.D.($Z_L$), using the Kolmogorov-Smirnov statistic.
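The summary of the replicated $Z_L$ values can be sketched as below. The sketch computes the mean, the standard deviation, and the Kolmogorov-Smirnov distance from a normal distribution with that mean and standard deviation; it stops short of the p-value reported in the paper, and the function name is hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev

def summarize_z(values):
    """Return (mean, S.D., KS distance to the fitted normal distribution)."""
    xs = sorted(values)
    m, s = mean(xs), stdev(xs)
    ref = NormalDist(m, s)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = ref.cdf(x)
        # Largest gap between the empirical and the reference distribution function.
        d = max(d, (i + 1) / n - f, f - i / n)
    return m, s, d

# Stand-in for 1000 replicated Z_L values under the null hypothesis.
rng = random.Random(7)
z_values = [rng.gauss(0.0, 1.0) for _ in range(1000)]
m, s, d = summarize_z(z_values)
print(m, s, d)
```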

Sample size, number of items, and power

The main objective of the simulation study was to examine the power of an RCT using a questionnaire with $n$ items as a primary endpoint and $K$ patients in each arm of the RCT to detect minimal (0.2), moderate (0.5), or substantial (0.8) effect sizes. Hence, the whole simulation process with 64 combinations of $K$ and $n$ was repeated three times. The value of $\mu_A$ was set to 0 and $\sigma_A = \sigma_B = 1$. Hence, the values of $\mu_B$ were set at 0.2, 0.5, and 0.8 for the first, second, and third repetitions, respectively.

The 1000 values of $Z_L$ obtained for each of the 192 combinations of sample size, number of items, and effect size were compared to the critical values of the appropriate normal distribution to determine how many of the test statistics were significant at the (two-sided) 90%, 95%, and 99% levels. For RCTs with more than 100 patients in each arm, a standard normal distribution was used, meaning that the critical values were ±1.64, ±1.96, and ±2.58 for the 90%, 95%, and 99% levels, respectively. The critical values for RCTs with up to 100 patients in each arm were obtained in the first part of this simulation study and are given in the following section.

Results of the simulation study

The distribution of ZL under the null hypothesis

The mean, $\bar{Z}_L$, the standard deviation, S.D.($Z_L$), and the p-value of the Kolmogorov-Smirnov test for Normality, P(KS), of the 1000 values of $Z_L$ produced at each combination of $n$ and $K$ for $\mu_A = \mu_B = 0$ are given in Table 1. The mean value of $Z_L$ over all 64,000 replications is 0.0004 and only two of the 64 values of P(KS) are less than 0.10. This indicates that there is no reason to suspect that $Z_L$ does not attain its asymptotic normal distribution, with mean 0, for samples consisting of two groups, each of 30 to 1000 patients.


Table 1. The mean, standard deviation, and Kolmogorov-Smirnov p-value, P(KS), for $Z_L$ when $\mu_A = \mu_B = 0$

Number of patients in each of group A and group B (K)

Number of   K = 30                     K = 40                     K = 50                     K = 100
items (n)   Mean     S.D.    P(KS)     Mean     S.D.    P(KS)     Mean     S.D.    P(KS)     Mean     S.D.    P(KS)
5            0.0963  1.0330  0.027      0.0006  1.0432  0.465      0.0269  1.0886  0.570      0.0158  1.0267  0.263
10          -0.0031  1.3092  0.012      0.0252  1.1260  0.230      0.0152  1.0405  0.639      0.0024  1.0015  0.826
15          -0.0058  1.2144  0.318      0.0202  1.1482  0.327     -0.0050  1.1383  0.807     -0.0228  1.0515  0.119
20           0.1366  1.2138  0.429      0.0246  1.2059  0.182     -0.0230  1.1729  0.407      0.0557  1.0325  0.826
30           0.0764  1.1905  0.872     -0.0063  1.0984  0.296     -0.0115  1.1075  0.320     -0.0255  1.0175  0.853
50           0.0794  1.2374  0.289      0.0037  1.1648  0.339     -0.0222  1.1356  0.571      0.0336  1.0484  0.336
70          -0.1169  1.2262  0.292     -0.0576  1.1728  0.294     -0.0252  1.0612  0.194      0.0178  0.9801  0.776
100          0.0151  1.2146  0.374     -0.0005  1.1425  0.879     -0.0070  1.1250  0.728      0.0097  1.0267  0.615

Table 1. Continued

Number of   K = 200                    K = 300                    K = 500                    K = 1000
items (n)   Mean     S.D.    P(KS)     Mean     S.D.    P(KS)     Mean     S.D.    P(KS)     Mean     S.D.    P(KS)
5           -0.0197  1.0406  0.970     -0.0068  0.9882  0.322      0.0086  1.0181  0.793     -0.0420  1.0183  0.959
10           0.0342  1.0428  0.860     -0.0634  0.9778  0.610     -0.0287  1.0516  0.994     -0.0713  0.9775  0.302
15          -0.0561  1.0167  0.860     -0.0142  1.0064  0.652     -0.0223  0.9574  0.782      0.0069  1.0059  0.219
20          -0.0051  0.9980  0.472     -0.0319  0.9914  0.515     -0.0146  0.9898  0.446      0.0896  0.9679  0.942
30           0.0112  1.0263  0.567      0.0156  1.0141  0.966      0.0114  0.9612  0.782     -0.0235  0.9465  0.317
50          -0.0114  1.0102  0.980      0.0144  0.9661  0.471     -0.0201  0.9817  0.362     -0.0330  1.0285  0.894
70           0.0142  0.9972  0.687     -0.0364  0.9680  0.289     -0.0227  0.9950  0.153     -0.0286  1.0209  0.279
100          0.0126  0.9889  0.541      0.0382  1.0377  0.561     -0.0114  0.9898  0.697      0.0094  0.9488  0.661


Fig. 1. Standard deviation of ZL against the number of patients in each arm of a randomized clinical trial.

The variation of the standard deviation of the 1000 values of $Z_L$ produced at each combination of conditions is illustrated with respect to $K$ and $n$ in Figs. 1 and 2, respectively. It does not appear that $n$ has a substantial effect on the standard deviation of $Z_L$. However, S.D.($Z_L$) increases as the sample size decreases. We modeled the relationship between S.D.($Z_L$) and $K$, under the null hypothesis $H_0 : \mu_A = \mu_B$, using

$$\mathrm{S.D.}(Z_L) = 0.97 + 6.67\,\frac{1}{K}. \tag{6}$$

Eq. (6) was obtained using regression analysis, since the variance of a statistic is often related to the reciprocal of the number of patients used to calculate the statistic. The usual assumptions were tested and the data did not violate them. The asymptote was not forced down to 1, since Eq. (6) gave the best fit to data obtained from 30, 40, 50, and 100 patients, and primary interest was in studies with these numbers of patients, rather than obtaining an accurate representation of smaller deviations from a standard Normal distribution. The correlation between S.D.($Z_L$) and $(1/K)$ was 0.8843, meaning that $R^2 = 0.7820$. The modeled distribution for $Z_L$ along with critical values and type I error rates based on the 8000 “trials” with an effect size of 0.0 carried out using the eight test lengths for the 90%, 95%, and 99% levels for RCTs using 30, 40, 50, and 100 patients per arm are given in Table 2.
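The step from Eq. (6) to small-sample critical values can be sketched as follows. The sketch assumes that the critical value is the standard Normal quantile multiplied by the modeled standard deviation, and that the standard deviation is rounded to two decimals first, which appears to match the entries in Table 2. Both points are inferences rather than statements from the paper, and the function names are hypothetical.

```python
from statistics import NormalDist

def modeled_sd(K):
    """Eq. (6): modeled standard deviation of Z_L under the null hypothesis."""
    return 0.97 + 6.67 / K

def critical_values(K, levels=(0.90, 0.95, 0.99)):
    """Two-sided critical values for Z_L referred to N(0, sd^2)."""
    sd = round(modeled_sd(K), 2)  # assumed rounding, see lead-in
    z = NormalDist()
    return {level: round(z.inv_cdf(0.5 + level / 2.0) * sd, 2) for level in levels}

for K in (30, 40, 50, 100):
    print(K, round(modeled_sd(K), 2), critical_values(K))
```

For $K = 30$ the modeled standard deviation is about 1.19, so the two-sided 95% critical value rises from 1.96 to roughly 2.33.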


Fig. 2. Standard deviation of ZL against the number of items used to assess the patients.

Sample size, number of items, and power

Table 3 contains the number of the 1000 RCTs carried out under each of the 192 combinations of $K$, $n$, and effect size, which resulted in a value of the $Z_L$ statistic beyond the appropriate critical values given in Table 2. In order to facilitate comparisons with the situation where $\theta$ were known, similar to obtaining an estimate using an infinite number of items, the theoretical number of 1000 RCTs that would result in a value of the T statistic beyond the appropriate critical values has been calculated using standard procedures [27]. These are presented in rows labeled ∞. If the numbers in Table 3 are divided by 1000, then an estimate of the power under the given combination of factors is obtained. It can be seen that the power to detect the given effect size increases with the effect size and the number of items used to assess the patients. It is also apparent that while increasing the number of items used to assess the patients from five to ten and from ten to twenty results in substantial increases in the power, increasing the number of items beyond 30 increases this figure only minimally. In addition, the power obtained using 100 items does not even approach the theoretical maximum, using an infinite number of items, for $K \le 40$, indicating that the correction introduced in Table 2 may not be sufficient for very small RCTs. Using the results in Table 3 and linear interpolation, the values of $K$ required to detect the effect size at the two-sided 5% level and with a power of 80% using a given $n$ were calculated and are displayed in Table 4. This significance level and power were chosen as these values are regularly used when analyzing RCTs. It can be seen that as long as at least 20 items are used, the number of items barely affects the number of patients required in an RCT to detect effect sizes of 0.5 and 0.8. However, the number of items used to assess patients has more effect on the number of patients required to detect an effect size of 0.2. For instance, if only five randomly selected items are used, it is necessary to include 950 patients in each arm, but if 50 items are used, only 450 are required in each arm.

Table 2. Modeled distribution for $Z_L$ and critical values for randomized clinical trials with small sample sizes

Number of patients   Modeled distribution   Critical values          Type I error rate
per arm (K)          of Z_L                 90%     95%     99%      90%      95%      99%
30                   N(0, 1.19^2)           ±1.96   ±2.33   ±3.07    0.1016   0.0565   0.0139
40                   N(0, 1.14^2)           ±1.88   ±2.23   ±2.94    0.1005   0.0545   0.0139
50                   N(0, 1.10^2)           ±1.81   ±2.16   ±2.83    0.1025   0.0558   0.0158
100                  N(0, 1.04^2)           ±1.71   ±2.04   ±2.68    0.0933   0.0470   0.0086
≥200                 N(0, 1)                ±1.64   ±1.96   ±2.58

Table 3. The number of 1000 randomized clinical trials in which $Z_L$ was greater than the appropriate critical value for the two-sided significance level given

Number of patients in each of group A and group B (K)

Effect  Number of   K = 30             K = 40             K = 50             K = 100
size    items (n)   0.10  0.05  0.01   0.10  0.05  0.01   0.10  0.05  0.01   0.10  0.05  0.01
0.2     5           126   64    24     136   76    18     149   77    26     216   139   42
0.2     10          145   74    15     139   72    31     188   108   28     276   174   63
0.2     15          141   87    24     133   81    30     208   117   27     305   223   74
0.2     20          155   91    28     163   99    29     180   126   37     339   215   66
0.2     30          158   86    24     186   109   36     190   130   33     344   236   87
0.2     50          170   109   25     158   85    28     210   126   41     336   227   87
0.2     70          157   78    33     181   117   44     215   120   36     345   247   98
0.2     100         139   96    37     165   100   33     211   138   48     365   258   100
0.2     ∞           190   110   30     220   140   40     260   160   50     400   290   120
0.5     5           174   104   30     344   247   90     433   314   126    724   610   361
0.5     10          267   181   69     438   297   132    575   445   208    837   742   542
0.5     15          280   170   91     492   364   145    579   463   225    876   800   611
0.5     20          303   185   70     467   349   140    606   488   267    905   854   702
0.5     30          294   193   88     550   416   180    650   510   283    927   862   704
0.5     50          291   208   89     559   426   208    622   492   274    939   885   706
0.5     70          305   208   67     549   428   213    663   536   287    947   900   735
0.5     100         317   208   88     557   415   197    674   550   296    935   889   727
0.5     ∞           600   470   240    710   590   340    790   690   450    960   940   820
0.8     5           389   262   98     655   538   297    805   689   422    973   949   815
0.8     10          497   354   163    774   669   425    900   852   639    995   991   956
0.8     15          489   384   179    846   747   482    911   817   649    1000  997   984
0.8     20          475   371   194    872   774   497    931   892   690    997   994   985
0.8     30          538   416   205    856   789   581    950   897   750    999   999   993
0.8     50          573   461   217    875   804   574    933   903   771    1000  1000  992
0.8     70          535   428   229    862   773   549    952   919   799    998   998   994
0.8     100         571   427   214    891   822   618    940   899   775    1000  1000  995
0.8     ∞           920   860   660    970   940   820    1000  970   910    1000  1000  1000

Table 3. Continued

Effect  Number of   K = 200            K = 300            K = 500            K = 1000
size    items (n)   0.10  0.05  0.01   0.10  0.05  0.01   0.10  0.05  0.01   0.10  0.05  0.01
0.2     5           372   268   112    515   375   164    649   523   328    900   822   610
0.2     10          492   350   170    625   508   266    810   712   474    970   949   825
0.2     15          536   395   189    675   556   317    849   763   534    982   963   896
0.2     20          539   424   209    688   565   334    880   802   564    992   980   923
0.2     30          561   414   211    736   619   394    901   827   625    995   990   929
0.2     50          598   462   267    773   657   409    904   853   673    994   989   943
0.2     70          585   473   246    750   659   401    926   871   718    998   995   962
0.2     100         612   489   262    765   659   414    920   879   697    994   991   966
0.2     ∞           630   510   270    780   680   440    930   880   710    1000  1000  970
0.5     5           960   924   802    985   976   931    -     -     -      -     -     -
0.5     10          994   985   923    1000  998   986    -     -     -      -     -     -
0.5     15          998   992   957    1000  1000  997    -     -     -      -     -     -
0.5     20          995   990   965    1000  1000  999    -     -     -      -     -     -
0.5     30          997   991   967    1000  1000  1000   -     -     -      -     -     -
0.5     50          995   994   982    1000  1000  1000   -     -     -      -     -     -
0.5     70          1000  997   991    1000  1000  1000   -     -     -      -     -     -
0.5     100         996   995   986    1000  1000  1000   -     -     -      -     -     -
0.5     ∞           1000  1000  1000   1000  1000  1000   -     -     -      -     -     -
0.8     5           1000  999   995    -     -     -      -     -     -      -     -     -
0.8     10          1000  1000  1000   -     -     -      -     -     -      -     -     -
0.8     15          1000  1000  1000   -     -     -      -     -     -      -     -     -
0.8     20          1000  1000  1000   -     -     -      -     -     -      -     -     -
0.8     30          1000  1000  1000   -     -     -      -     -     -      -     -     -
0.8     50          1000  1000  1000   -     -     -      -     -     -      -     -     -
0.8     70          1000  1000  1000   -     -     -      -     -     -      -     -     -
0.8     100         1000  1000  1000   -     -     -      -     -     -      -     -     -
0.8     ∞           1000  1000  1000   -     -     -      -     -     -      -     -     -
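The linear interpolation used to move from the simulated power values to an approximate number of patients per arm can be sketched as below. The counts are those read from Table 3 for an effect size of 0.2, five items, and the two-sided 5% level; the function name and the handling of designs outside the simulated range are assumptions made for illustration.

```python
def patients_for_power(k_values, counts, target=800):
    """Interpolate linearly in K to reach `target` significant trials per 1000.

    Returns None when the target is not bracketed by the simulated designs.
    """
    points = list(zip(k_values, counts))
    for (k_lo, c_lo), (k_hi, c_hi) in zip(points, points[1:]):
        if c_lo < target <= c_hi:
            return k_lo + (target - c_lo) * (k_hi - k_lo) / (c_hi - c_lo)
    return None

# Effect size 0.2, n = 5 items, 5% level (counts out of 1000 trials, from Table 3).
k_values = [30, 40, 50, 100, 200, 300, 500, 1000]
counts = [64, 76, 77, 139, 268, 375, 523, 822]
needed = patients_for_power(k_values, counts)
print(round(needed))
```

With these counts the interpolation gives about 963 patients per arm, in line with the approximate figure of 950 reported for five items.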

Table 4. Approximate number of patients required in each arm (K) of a randomized clinical trial to demonstrate a given effect size at the 5% level with 80% power

              Number of items used
Effect size     5    10    20    50   100     ∞
0.2           950   680   500   450   440   394
0.5           160   125    95    90    87    64
0.8            70    50    45    40    39    26

An illustration using the short form instruments from the Medical Outcomes Study

In order to illustrate the methods described in this paper, data were used from the second phase of the Netherlands Co-operative Study on the Adequacy of Dialysis (NECOSAD). NECOSAD is a multicenter prospective cohort study that includes all new end-stage renal disease patients who started chronic hemodialysis or peritoneal dialysis. Treatment is not randomized but chosen after discussion with the patient and taking clinical, psychological, and social factors into account. The SF-36 was developed from the original Medical Outcomes Study survey [28] to measure health status in a variety of research situations [6]. The SF-36 is a reliable measure of health status in a wide range of patient groups [29] and has been used with classical [30] and IRT-based [31] psychometric models. In addition, the SF-12, with 12 items [16], and a preliminary version of the SF-8, with 9 items [17], have been developed. In this article, the Dutch language version of the SF-36 [32] was used. The SF-36 uses between two and six response categories per item and, on the majority of items, a higher score denotes a better health state. The items that are scored so that a higher score denotes a worse health state have been rescored so that a higher score denotes a better health state. A brief description of each item is given in Table 5, together with an indication of whether an item is used in the SF-12 or SF-8, while full details are available elsewhere [1]. The measurement properties of the SF-36 in a sample of patients with chronic kidney failure 12 months after the start of dialysis [14,15] have been examined using the generalized partial credit model described in Eq. (2) and are summarized in Table 5. Questionnaires were sent to 1046 patients, of whom 978 completed at least one question on the SF-36. Of the 978 patients, 583 were male and 395 were female, while 615 were on hemodialysis and 363 were on peritoneal dialysis. The patients were between 18 and 90 years of age, with a median of 61 years. When IRT techniques were used to examine the quality of life, the mean, μ, was 0 and the standard deviation was σ = 0.692. It should be emphasized that the model parameters given in this article are for illustration purposes only and should not be regarded as a definitive calibration of these items, as the fit of the model has not been extensively tested, particularly with respect to differential item characteristics between subgroups of patients.

In order to examine the power with which studies using the short form instruments could detect treatment effects in a population of patients with chronic kidney failure 12 months after starting dialysis, a simulation study was carried out. The simulations were carried out in a similar way to those described earlier in the section “A simulation study.” We were interested in detecting effect sizes of 0.2, 0.5, and 0.8, denoted x, in RCTs with 30, 40, 50, 100, 200, 300, 500, or 1000 patients in each arm (K) using the SF-36, the SF-12, or the SF-8 as the primary endpoint, meaning that there were 24 different combinations of x and K. In each run, K values of θ were sampled from each of N(μ, σ²) and N(μ + xσ, σ²), where x represents the effect size expected in the study and σ the standard deviation of quality of life, to denote the control and treatment groups, respectively, meaning that μA = 0 and μB = 0.692x. These values were combined with the item parameters in Table 5 and Eq. (2) to generate “responses” by “patients” to the SF-36. These data were used to obtain the statistic ZL using the items in the SF-36, in the SF-12, and in the SF-8. One thousand runs, or RCTs, were carried out at each of the 24 combinations of K and x. It was assumed that the item parameters were known and equal to those in Table 5.
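The data-generating step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' simulation code. The category probabilities use one common parameterization of the generalized partial credit model, P(X = j | θ) proportional to exp(jαθ − βj) with β0 = 0, which may differ from the form of Eq. (2). The function names are hypothetical, and only three of the Table 5 items (1, 3b, and 4b) are included for brevity.

```python
import math
import random

SIGMA = 0.692  # standard deviation of the latent trait reported for the NECOSAD sample

# (alpha, [beta_1, ..., beta_m]) for items 1, 3b and 4b, copied from Table 5
ITEMS = [
    (2.641, [-3.890, -3.810, 0.086, 4.825]),
    (2.767, [-1.031, 1.019]),
    (3.094, [1.077]),
]


def category_probs(theta, alpha, betas):
    """Category probabilities under the ASSUMED form exp(j*alpha*theta - beta_j), beta_0 = 0."""
    logits = [0.0] + [j * alpha * theta - b for j, b in enumerate(betas, start=1)]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_arm(k, mean, sd, items, rng):
    """Draw k latent trait values from N(mean, sd^2) and one response per item for each."""
    arm = []
    for _ in range(k):
        theta = rng.gauss(mean, sd)
        arm.append([
            rng.choices(range(len(betas) + 1), weights=category_probs(theta, alpha, betas))[0]
            for alpha, betas in items
        ])
    return arm


def simulate_trial(k, effect_size, rng, mu=0.0, sigma=SIGMA, items=ITEMS):
    """Control arm ~ N(mu, sigma^2); treatment arm ~ N(mu + effect_size * sigma, sigma^2)."""
    control = simulate_arm(k, mu, sigma, items, rng)
    treatment = simulate_arm(k, mu + effect_size * sigma, sigma, items, rng)
    return control, treatment


def mean_sum_score(arm):
    """Average sum score over the patients in one arm."""
    return sum(sum(row) for row in arm) / len(arm)


rng = random.Random(2003)  # fixed seed so the sketch is repeatable
control, treatment = simulate_trial(k=500, effect_size=0.8, rng=rng)
print(round(mean_sum_score(control), 2), round(mean_sum_score(treatment), 2))
```

Each simulated trial would then be passed to whatever routine computes ZL. That step is omitted here because it depends on the definitions given earlier in the paper.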

The number of runs at each combination of K, x, and short form instrument resulting in a value of ZL more extreme than the appropriate critical value, in Table 2, for known item parameters is recorded in Table 6. In general, the SF-36 detected a given treatment effect more often than the SF-12, which in turn detected a treatment effect more often than the SF-8. The numbers of patients needed in each arm of an RCT using the SF-36, SF-12, and SF-8 to detect standard treatment effects in the underlying latent trait with a power of 80% were obtained using linear interpolation and are given in Table 7. It appears that the number of patients required when using the SF-36 to detect an effect of 0.2 is less than if θ were known, as would be the case if an infinite number of items were used. This is not the case; it is a result of the inherent error involved in a simulation study with only 1000 trials at each combination of K and x. It can also be seen that the differences between the numbers of patients required in each arm are less marked than in Table 4. This is because the items on the SF-36 have up to six response categories leading to a total of 149, 47, and 40 different


Table 5. The characteristics of the items in the SF-36 questionnaire

Item  Item content                     Used in      Response    Scoring        α       β1       β2       β3       β4       β5
                                                    categories  reversed
1     General health status            SF-12, SF-8  5           yes        2.641   −3.890   −3.810    0.086    4.825
2     Health compared with a year ago  SF-8         5           yes        0.624   −1.759   −2.857   −2.006   −0.931
      Does your health limit
3a    vigorous activities                           3           no         1.729    1.300    4.496
3b    moderate activities              SF-12        3           no         2.767   −1.031    1.019
3c    lifting or carrying groceries                 3           no         2.247   −1.370   −0.025
3d    climbing 2+ flights of stairs    SF-12, SF-8  3           no         2.353   −0.890    0.612
3e    climbing one flight of stairs                 3           no         2.276   −1.996   −2.132
3f    bending, kneeling, or stooping                3           no         1.924   −1.268   −0.384
3g    walking more than a mile                      3           no         2.155   −0.204    0.809
3h    walking several blocks                        3           no         2.245   −1.206   −1.302
3i    walking one block                             3           no         2.080   −1.865   −2.524
3j    bathing or dressing yourself                  3           no         2.412   −3.777   −5.497
      Problems with work as a result of physical health
4a    reduced amount of time working                2           no         2.770    0.441
4b    accomplished less than hoped     SF-12        2           no         3.094    1.077
4c    limited in kind of work          SF-12        2           no         3.495    1.290
4d    difficulty in working            SF-8         2           no         2.998    1.388
      Problems with work as a result of emotional health
5a    reduced amount of time working                2           no         2.587   −0.435
5b    accomplished less than hoped     SF-12, SF-8  2           no         2.730    0.046
5c    worked less carefully            SF-12        2           no         2.302   −0.681
6     Reduced social activities        SF-12, SF-8  5           yes        1.834   −2.616   −4.434   −5.872   −5.587
7     Amount of bodily pain            SF-8         6           yes        0.837   −1.318   −3.196   −3.193   −3.131   −3.241
8     Pain interfere with work         SF-12        5           yes        1.664   −2.286   −4.055   −4.927   −4.638
      Emotions felt
9a    full of pep                                   6           yes        0.632   −1.475   −2.638   −1.702   −2.598   −1.242
9b    very nervous                                  6           no         0.623   −1.056   −2.257   −3.755   −3.780   −3.530
9c    down in the dumps                             6           no         1.219   −2.452   −4.755   −6.607   −7.134   −7.291
9d    calm and peaceful                SF-12        6           yes        0.985   −1.928   −3.267   −2.922   −3.583   −2.028
9e    lots of energy                   SF-12, SF-8  6           yes        1.347   −1.740   −2.167   −1.103   −0.581    1.884
9f    downhearted and blue             SF-12, SF-8  6           no         1.090   −1.633   −3.583   −5.401   −5.359   −4.948
9g    worn out                                      6           no         1.452   −2.010   −3.834   −4.637   −3.523   −1.925
9h    a happy person                                6           yes        0.591   −1.395   −2.823   −2.472   −3.435   −2.612
9i    tired                                         6           no         1.506   −1.229   −2.107   −2.238    0.016    2.458
10    Reduced social activities                     5           no         1.703   −2.333   −3.955   −3.622   −2.908
      Rate relevance of statements
11a   sick easier than other people                 5           no         0.670   −0.772   −1.279   −0.663   −0.345
11b   as happy as anybody I know                    5           yes        0.758   −0.558   −0.012    0.115    1.950
11c   I expect my health to get worse               5           no         0.637   −1.213   −1.887   −0.121    0.519
11d   my health is excellent                        5           yes        1.083   −0.529    0.696    0.002    2.442


Table 6. The number of 1000 randomized clinical trials using the SF-36, SF-12, and SF-8 in which ZL was greater than the appropriate critical value for a test at the two-sided significance level given

                         Number of patients per arm (K), with significance levels 0.10, 0.05, 0.01
                         K = 30              K = 40              K = 50              K = 100
Effect size  Instrument    0.10  0.05  0.01    0.10  0.05  0.01    0.10  0.05  0.01    0.10  0.05  0.01
0.2          SF-36          155    92    29     210   137    38     252   168    57     430   312   156
0.2          SF-12          146    80    25     197   110    34     240   160    70     415   295   141
0.2          SF-8           152    84    23     184   114    32     221   138    49     385   265   116
0.5          SF-36          493   352   147     690   571   332     769   695   452     976   949   838
0.5          SF-12          495   363   166     618   513   284     765   659   392     960   933   838
0.5          SF-8           469   314   142     611   490   260     750   627   391     944   895   726
0.8          SF-36          858   755   538     961   914   785     980   955   897    1000   999   997
0.8          SF-12          834   724   500     945   893   716     986   964   856     999   999   997
0.8          SF-8           840   714   451     925   860   663     977   959   858     999   998   989

Table 6. Continued

                         K = 200             K = 300             K = 500             K = 1000
Effect size  Instrument    0.10  0.05  0.01    0.10  0.05  0.01    0.10  0.05  0.01    0.10  0.05  0.01
0.2          SF-36          666   561   353     841   782   549     952   917   786     998   995   982
0.2          SF-12          636   515   300     808   739   525     928   891   751     997   997   972
0.2          SF-8           637   516   288     772   687   489     936   891   742     995   990   959
0.5          SF-36         1000   998   988       —     —     —       —     —     —       —     —     —
0.5          SF-12          998   997   995       —     —     —       —     —     —       —     —     —
0.5          SF-8           999   999   993       —     —     —       —     —     —       —     —     —


Table 7. Approximate number of patients required in each arm (K) of a randomized clinical trial to demonstrate a given effect size at the 5% level with 80% power

              Instrument used
Effect size   SF-8   SF-12   SF-36     ∞
0.2            410     380     325   394
0.5             82      75      71    64
0.8             36      35      33    26

possible sum scores on the SF-36, SF-12, and SF-8, respectively. The items used in the first simulation study described in this paper had only two response categories, meaning that 149, 47, and 40 items would have been necessary to obtain the same range of possible sum scores.
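The linear interpolation used to move from the counts in Table 6 to the sample sizes in Table 7 can be sketched as follows. This is an illustrative reading of "linear interpolation" rather than the authors' own code, and the function name is hypothetical. The counts are the SF-36 entries of Table 6 for an effect size of 0.5 at the two-sided 5% level.

```python
def patients_for_power(ks, hits, runs=1000, target=0.80):
    """Return the K at which linearly interpolated power first reaches the target."""
    powers = [h / runs for h in hits]
    for i in range(len(ks) - 1):
        p0, p1 = powers[i], powers[i + 1]
        if p0 < target <= p1:
            return ks[i] + (target - p0) / (p1 - p0) * (ks[i + 1] - ks[i])
    return None  # target not bracketed by the simulated values


# Table 6, SF-36, effect size 0.5, two-sided 5% level
ks = [30, 40, 50, 100, 200]
hits = [352, 571, 695, 949, 998]
print(round(patients_for_power(ks, hits), 1))  # 70.7
```

The interpolated value of about 70.7 agrees with the entry of 71 for the SF-36 in Table 7.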

Discussion

This paper has described simulation-based studies into the use of IRT to analyze RCTs using a questionnaire consisting of items with two response categories and designed to quantify a theoretical variable as the primary outcome. The study into the small sample behavior of asymptotic results provides a framework in which data from RCTs can be analyzed using IRT. It had previously been shown that the ZL statistic closely approximated an N(0,1) distribution under the null hypothesis for studies in which each arm contained at least 500 patients [24]. The results in this paper show that ZL follows an N(0,σ) distribution for small RCTs, but that σ ≠ 1. The variance appears to be a function of the reciprocal of the number of patients in each arm of an RCT. This means that, as in many situations in which asymptotic results are used, the procedure for testing whether an effect is significant needs to be adjusted for the relatively small sample sizes often found in RCTs when IRT is used.

The main simulation study examined the relationship between the number of patients in each arm of an RCT, the number of items used to assess the patients, and the power to detect given effect sizes when using IRT. As ever, the smaller the effect size, the larger the number of patients needed in each arm to detect the effect with a given power. In addition, increasing the number of items used to assess the patients can mean that given effects can be detected using fewer patients. The number of patients required to detect a minimal effect using five items is more than twice the number required when 50 items are used. The results suggest that reductions in the number of patients required are minimal for more than 20 carefully chosen items, indicating that a maximum of 20 items with good measurement qualities is sufficient to assess patients. However, if the items have poor measurement properties, such as very low values of αi or values of βi more extreme than μθ ± 2σθ, indicating a lack of discrimination and floor or ceiling effects in a questionnaire, respectively, then many more items may be required. These results also indicate that if an RCT is to be designed to detect small effects, it is inadvisable to use very short instruments consisting of items with two response categories analyzed using IRT. Again, it should be noted that these results are based on the assumption that the items used do not have poor measurement properties, as discussed above.
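The screening rule mentioned above, low discrimination or a location outside μθ ± 2σθ, can be written as a small check. This is only a sketch for two-category items. The function name and the item values are hypothetical, and the cut-off `min_alpha=0.5` is an assumed placeholder, since the text gives no numerical threshold for "very low" discrimination.

```python
def flag_poor_items(items, mu=0.0, sigma=1.0, min_alpha=0.5):
    """Flag binary items with low discrimination or extreme location.

    items maps an item name to (alpha, beta); min_alpha is an assumed cut-off.
    """
    flags = {}
    for name, (alpha, beta) in items.items():
        reasons = []
        if alpha < min_alpha:
            reasons.append("low discrimination")
        if abs(beta - mu) > 2 * sigma:
            reasons.append("location outside mu ± 2 sigma")
        if reasons:
            flags[name] = reasons
    return flags


# Hypothetical item parameters, for illustration only
example = {"item_a": (1.2, 0.3), "item_b": (0.2, 0.0), "item_c": (1.5, 2.6)}
print(flag_poor_items(example))
```

Items flagged in this way are the ones for which the text warns that many more items may be needed.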


It is well known that to detect effect sizes of 0.2, 0.5, or 0.8 using a t test with a significance level of 0.05 and a power of 80%, it is necessary to include 394, 64, or 26 patients, respectively, in each arm of an RCT. The results presented in this paper show that 450, 90, and 40 patients are required in each arm to detect the same effects using 50 items and IRT. The main reason for the differences between these figures is that when a t test is used, it is assumed that the variable of interest is measured without error, whereas IRT takes into account that a latent variable cannot be measured without error. In addition, the sample sizes obtained using IRT are conservative as linear interpolation has been used between observations, whereas it is reasonable to assume that the true function would curve toward the top left-hand corner, indicating that the true numbers of patients required are marginally lower than these results suggest. It should also be emphasized that if researchers require very accurate estimates of the number of patients required in each arm of an RCT using a particular set of items, they should carry out their own simulation study. There are a number of advantages of the use of IRT in the analysis of RCTs. First, the ZL test described in this article can, in contrast to the t test, always be used in the same format, regardless of whether the variance of the latent trait is the same in both arms of the RCT or not. Second, it is not essential that all patients are assessed using exactly the same questionnaire. The questionnaire can be shortened for particular groups of patients, while the estimates of the latent trait remain comparable over all patients.
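The t test figures quoted above (394, 64, and 26 patients per arm) can be checked with a short calculation. The sketch below uses the normal approximation with Guenther's small-sample adjustment, which is an assumption on our part rather than the method stated in the paper. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def patients_per_arm(effect_size, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided, two-sample t test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    normal_n = 2 * (z_alpha + z_power) ** 2 / effect_size ** 2
    # Guenther's adjustment for using t rather than normal quantiles
    return math.ceil(normal_n + z_alpha ** 2 / 4)


print([patients_per_arm(d) for d in (0.2, 0.5, 0.8)])  # [394, 64, 26]
```

These values correspond to the column for an infinite number of items in Tables 4 and 7, where the latent trait is treated as measured without error.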

The values of the standard errors of μA and μB depend on a complex relationship between the values of the latent trait of the patients included in an RCT and the parameters of the items used to assess them. This means that the results obtained in this study are, theoretically, local to the combination of patients and items used. However, we reselected all parameters for each of the 1000 individual “RCTs” carried out at each combination of effect size, number of patients per arm, and number of items in order to give a more general picture of the number of patients required. In addition, item parameters for the SF-36, SF-12, and SF-8 quality of life instruments were estimated from a dataset collected from patients undergoing dialysis for chronic kidney failure and used to illustrate the methods described.

This article has examined the sample sizes required in an RCT when the primary outcome is a latent trait measured by a questionnaire analyzed with IRT. It has been shown that it is possible, following a minor transformation, to use the statistics developed for analyzing large-scale educational interventions in the much smaller sample sizes encountered in RCTs in clinical medicine. In addition, the relationship between the number of items used and the number of patients required to detect particular effects was examined. It is hoped that this article will contribute to the understanding of IRT, particularly in relation to RCTs.

Acknowledgments

This research was partly supported by a grant from the Anton Meelmeijerfonds, a charity supporting innovative research in the Academic Medical Center, Amsterdam, The Netherlands. The authors would like to thank the researchers involved in The Netherlands Co-operative Study on the Adequacy of Dialysis (NECOSAD) for allowing their data to be used in this paper.


Appendix: Obtaining μA, μB, and their standard errors

The values of μA and μB and their standard errors can be estimated using marginal maximum likelihood methods. The likelihood L can be written as

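\[
L = \prod_{g=A,B} \prod_{k=1}^{K_g} \int_{-\infty}^{\infty} \prod_{i,j} p_{ijk}(\theta)^{x_{ijk}} \, g(\theta \mid \mu_g, \sigma_g) \, d\theta \tag{A.1}
\]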

where pijk is as defined in Eq. (1) and (2), and g(θ|μg,σg) is a Normal density function with mean μg and standard deviation σg (variance σg²), for g = A, B. Furthermore, xijk is an indicator variable taking the value 1 if patient k responds in category j of item i and the value 0 otherwise. In the study described in this paper, we have assumed that the item parameters αi and βi are known. In practice, this would mean that the items came from a calibrated item bank. It has been shown that μA can be estimated using

\[
\mu_A = \frac{1}{K} \sum_{k=1}^{K} E(\theta_k \mid X_k) \tag{A.2}
\]

where Xk is a vector with n elements containing the responses, xik, of patient k to the items, and E(θk|Xk) is the posterior expected value of θk for patient k given their pattern of responses, Xk [23–25]. It has also been shown that if the item parameters are considered known, as in this study, the asymptotic standard error S.E.(μA − μB) is equal to S.E.(μA) + S.E.(μB), where

\[
\mathrm{S.E.}(\mu_A) = \frac{1}{\sigma_A^{-4} \sum_{k=1}^{K} \left( E(\theta_k \mid X_k) - \mu_A \right)^2} \tag{A.3}
\]

and σA is the estimated standard deviation of θ in group A. The definition of S.E.(μB) is analogous to that of S.E.(μA). The expectation E(θk|Xk) is as defined in Eq. (A.2) and given by

\[
E(\theta_k \mid X_k) = \int_{-\infty}^{\infty} \theta_k \, f(\theta_k \mid X_k) \, d\theta_k \tag{A.4}
\]

where the posterior distribution of θ is

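\[
f(\theta_k \mid X_k) = \frac{P(X_k \mid \theta_k) \, g(\theta_k \mid \mu_A, \sigma_A)}{\int_{-\infty}^{\infty} P(X_k \mid \theta_k) \, g(\theta_k \mid \mu_A, \sigma_A) \, d\theta_k} \tag{A.5}
\]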

where P(Xk|θk) is the probability of the response pattern made by patient k given θk.
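The posterior expectation in Eqs. (A.4) and (A.5) and the group mean in Eq. (A.2) can be approximated numerically. The sketch below is illustrative only and makes several assumptions: it uses two-category items with the two-parameter logistic form P(x = 1 | θ) = 1/(1 + exp(−α(θ − β))), which may be parameterized differently in Eq. (1); it replaces the integrals with sums over an equally spaced grid; and it treats μ and σ as given, whereas in marginal maximum likelihood estimation they would themselves be updated iteratively. The function names and item values are hypothetical.

```python
import math


def normal_pdf(theta, mu, sigma):
    """Normal density g(theta | mu, sigma)."""
    z = (theta - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def likelihood_2pl(theta, responses, items):
    """P(X_k | theta) for binary responses under the ASSUMED 2PL form."""
    value = 1.0
    for x, (alpha, beta) in zip(responses, items):
        p = 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))
        value *= p if x == 1 else 1.0 - p
    return value


def eap(responses, items, mu=0.0, sigma=1.0, points=201, width=6.0):
    """Posterior mean E(theta_k | X_k), Eqs. (A.4)-(A.5), on a grid of mu ± width*sigma."""
    lo = mu - width * sigma
    step = 2.0 * width * sigma / (points - 1)
    grid = [lo + i * step for i in range(points)]
    weights = [likelihood_2pl(t, responses, items) * normal_pdf(t, mu, sigma) for t in grid]
    # The grid spacing cancels between numerator and denominator.
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)


def group_mean(patterns, items, mu=0.0, sigma=1.0):
    """Average of the posterior means over the patients in one arm, Eq. (A.2)."""
    return sum(eap(x, items, mu, sigma) for x in patterns) / len(patterns)


# Hypothetical (alpha, beta) pairs, for illustration only
items = [(1.0, -0.5), (1.5, 0.0), (0.8, 0.7)]
print(eap([1, 1, 1], items), eap([0, 0, 0], items))
print(group_mean([[1, 1, 1], [1, 0, 0], [0, 0, 0]], items))
```

A finer grid or Gauss-Hermite quadrature would be the natural refinement if more numerical accuracy were needed.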

References

[1] McDowell I, Newell C. Measuring health: a guide to rating scales and questionnaires. Oxford: Oxford University Press, 1996.


[2] Cella D, Chang CH. A discussion of item response theory and its applications in health status assessment. Med Care 2000;38(Suppl):II66–II72.

[3] Fischer GH, Molenaar IW, editors. Rasch models: foundations, recent developments and applications. New York: Springer-Verlag, 1995.

[4] van der Linden W, Hambleton RK, editors. Handbook of modern item response theory. New York: Springer, 1997.

[5] Teresi JA, Kleinman M, Ocepek-Welikson K. Modern psychometric methods for detection of differential item functioning: application to cognitive assessment measures. Stat Med 2000;19:1651–1683.

[6] Ware JE Jr, Sherbourne CD. The MOS 36-item short-form health survey (SF-36). I. Conceptual framework and item selection. Med Care 1992;30:473–483.

[7] Hays RD, Morales LS, Reise SP. Item response theory and health outcomes measurement in the 21st century. Med Care 2000;38(Suppl):II28–II42.

[8] van der Linden WJ, Glas CAW. Computerized adaptive testing. Theory and practice. Dordrecht, the Netherlands: Kluwer Academic Publishers, 2000.

[9] Holman R, Lindeboom R, Vermeulen R, Glas CAW, de Haan RJ. The Amsterdam Linear Disability Score (ALDS) project. The calibration of an item bank to measure functional status using item response theory. Qual Life Newsletter 2001;27:4–5.

[10] Lindeboom R, Vermeulen M, Holman R, de Haan RJ. Activities of daily living instruments in clinical neurology. Optimizing scales for neurologic assessments. Neurology 2003;60:738–742.

[11] Cohen J. Statistical power analysis for the behavioral sciences. Hillsdale, New Jersey: Lawrence Erlbaum Associates, 1988.

[12] Formann AK. Measuring change in latent subgroups using dichotomous data: unconditional, conditional and semiparametric maximum likelihood estimation. J Am Stat Assoc 1994;89:1027–1034.

[13] May K, Nicewander WA. Measuring change conventionally and adaptively. Educ Psychol Meas 1998;58:882–897.

[14] Korevaar JC, Merkus MP, Jansen MA, et al. Validation of the KDQOL-SF: a dialysis-targeted health measure. Qual Life Res 2002;11:437–447.

[15] van Manen JG, Korevaar JC, Dekker FW, et al. How to adjust for comorbidity in survival studies in ESRD patients: a comparison of different indices. Am J Kidney Dis 2002;40:82–89.

[16] Ware J Jr, Kosinski M, Keller SD. A 12-Item Short-Form Health Survey: construction of scales and preliminary tests of reliability and validity. Med Care 1996;34:220–233.

[17] QualityMetric Incorporated. The SF-8 Health survey. http://www.sf36.com/tools/sf8.shtml. Accessed January 2, 2003.

[18] Birnbaum A. Some latent trait models and their use in inferring an examinee’s ability. In: Lord FM, Novick MR, editors. Statistical theories of mental test scores. Reading, Massachusetts: Addison-Wesley, 1968.

[19] Thissen D, Steinberg L. A taxonomy of item response models. Psychometrika 1986;51:567–577.

[20] Rasch G. Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research, 1960.

[21] Holman R, Berger MPF. Optimal calibration designs for tests of polytomously scored items described by item response theory models. J Educ Behav Stat 2001;26:361–380.

[22] Muraki E. A generalized partial credit model. In: van der Linden WJ, Hambleton RK, editors. Handbook of modern item response theory. New York: Springer, 1997.

[23] Bock RD, Aitkin M. Marginal maximum likelihood estimation of item parameters: an application of an EM-algorithm. Psychometrika 1981;46:443–459.

[24] Glas CAW, Verhelst ND. Extensions of the partial credit model. Psychometrika 1989;54:635–659.

[25] Glas CAW. Modification indices for the 2-PL and the nominal response model. Psychometrika 1999;64:273–294.

[26] Verhelst ND, Glas CAW, Verstralen HHFM. OPLM computer program and manual. Arnhem, the Netherlands: Centraal Instituut voor Toets Ontwikkeling (CITO), 1994.

[27] Cochran WG. Sampling techniques. New York: Wiley, 1977.

[28] McHorney CA, Ware JE Jr, Rogers W, Raczek AE, Lu JF. The validity and relative precision of MOS short- and long-form health status scales and Dartmouth COOP charts. Results from the Medical Outcomes Study. Med Care 1992;30(Suppl):MS253–MS265.


[29] McHorney CA, Ware JE Jr, Lu JF, Sherbourne CD. The MOS 36-item Short-Form Health Survey (SF-36): III. Tests of data quality, scaling assumptions, and reliability across diverse patient groups. Med Care 1994;32:40–66.

[30] McHorney CA, Ware JE Jr, Raczek AE. The MOS 36-Item Short-Form Health Survey (SF-36): II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care 1993;31:247–263.

[31] Raczek AE, Ware JE, Bjorner JB, et al. Comparison of Rasch and summated rating scales constructed from SF-36 physical functioning items in seven countries: results from the IQOLA Project. International Quality of Life Assessment. J Clin Epidemiol 1998;51:1203–1214.

[32] Aaronson NK, Muller M, Cohen PD, et al. Translation, validation, and norming of the Dutch language version of the SF-36 Health Survey in community and chronic disease populations. J Clin Epidemiol 1998;51:1055–1068.