Controlled Clinical Trials 24 (2003) 390–410
Power analysis in randomized clinical trials based on item response theory
Rebecca Holman, M.Math.a,*, Cees A.W. Glas, Ph.D.b, Rob J. de Haan, Ph.D.a
aDepartment of Clinical Epidemiology and Biostatistics, Academic Medical Center, Amsterdam, The Netherlands
bDepartment of Research Methodology, Measurement and Data Analysis, University of Twente, Enschede, The Netherlands
Manuscript received May 20, 2002; manuscript accepted March 17, 2003
Abstract
Patient relevant outcomes, measured using questionnaires, are becoming increasingly popular endpoints in randomized clinical trials (RCTs). Recently, interest in the use of item response theory (IRT) to analyze the responses to such questionnaires has increased. In this paper, we used a simulation study to examine the small sample behavior of a test statistic designed to examine the difference in average latent trait level between two groups when the two-parameter logistic IRT model for binary data is used. The simulation study was extended to examine the relationship between the number of patients required in each arm of an RCT, the number of items used to assess them, and the power to detect minimal, moderate, and substantial treatment effects. The results show that the number of patients required in each arm of an RCT varies with the number of items used to assess the patients. However, as long as at least 20 items are used, the number of items barely affects the number of patients required in each arm of an RCT to detect effect sizes of 0.5 and 0.8 with a power of 80%. By contrast, the number of items used has more effect on the number of patients required to detect an effect size of 0.2 with a power of 80%. For instance, if only five randomly selected items are used, it is necessary to include 950 patients in each arm, but if 50 items are used, only 450 are required in each arm. These results indicate that if an RCT is to be designed to detect small effects, it is inadvisable to use very short instruments analyzed using IRT. Finally, the SF-36, SF-12, and SF-8 instruments were considered in the same framework. Since these instruments consist of items scored in more than two categories, slightly different results were obtained. © 2003 Elsevier Inc. All rights reserved.
Keywords: Item response theory; Sample size; Power; Latent trait; IRT; Two-parameter model; SF-36; SF-12; SF-8
* Corresponding author: Rebecca Holman, M.Math, Department of Clinical Epidemiology and Biostatistics, Academic Medical Center, PO Box 22700, 1100 DE Amsterdam, The Netherlands. Tel.: +31-20-566-6947; fax: +31-20-691-2683.
E-mail address: [email protected]
0197-2456/03/$ - see front matter © 2003 Elsevier Inc. All rights reserved. doi:10.1016/S0197-2456(03)00061-8
R. Holman et al./Controlled Clinical Trials 24 (2003) 390–410 391
Introduction
In recent years, there has been an enormous increase in the use of patient relevant outcomes, such as functional status and quality of life, as endpoints in medical research, including controlled randomized clinical trials (RCTs). Many patient relevant outcomes are measured using questionnaires designed to quantify a theoretical construct, often modeled as a latent variable. When a questionnaire is administered to a patient, responses to individual items are recorded. Often the scores on each item are added together to obtain a single score for each patient. The reliability and validity of sum scores are usually examined in the framework of classical test theory. This framework is widely accepted and applied in many areas of medical assessment [1]. However, following dissatisfaction with these methods, interest in the use of an alternative paradigm, known as item response theory (IRT), has grown [2]. IRT was developed as an alternative to the use of classical test theory when analyzing data resulting from school examinations. An overview of IRT methods is given in this paper, while in-depth descriptions in general [3,4] and in medical applications [5] are given elsewhere.
Advantages of using IRT to analyze an RCT include proper modeling of ceiling and floor effects, solutions to the problem of missing data, and straightforward ways of dealing with heteroscedasticity between groups. However, the main advantage is that it is not essential to assess all patients with exactly the same items. For instance, if sufficient information on the measurement characteristics of the Medical Outcomes Study Short-Form Health Survey with 36 items (SF-36) [6] in a particular patient population were available, it would be possible to obtain completely comparable estimates of health status using only the items most appropriate to each individual patient. An extension of this is computerized adaptive testing, in which each patient potentially receives a different computer-administered questionnaire and the questions offered to each patient depend on the responses given to previous questions [7,8]. A prerequisite of this type of testing is access to a large item bank that has been calibrated using responses from comparable patients [9,10].
Ethical considerations require that as few patients as possible are exposed to the “risk” of a novel treatment during an RCT, but it is important to ensure that enough patients are included to have a reasonable power of detecting the effect of interest. For this reason, calculation of the minimal sample size required to demonstrate a clinically relevant effect has become integral to the RCT literature [11]. However, since IRT was developed as a tool for analyzing data resulting from examinations, most technical work has concentrated on the statistical challenges found in this field. For example, when assessing the effects of an educational intervention, in some ways similar to an RCT, thousands of pupils, even whole cohorts, are often included. This means that minimal sample size and power calculations in relation to questionnaires analyzed with IRT have received very little attention. In addition, IRT offers a framework in which the number of items used to assess patients can be easily varied. Thus, sample size calculations need to consider not only the number of patients, but also the number of items used. Some work has touched on these issues [12] or considered them as a sideline of another aspect [13], but no guidance on sample size calculations for RCTs in the context of IRT has been published.
In this paper, the relationship between the number of patients in each treatment arm, the number of items used to assess the patients, and the power to detect given effect sizes will be examined using a simulation study. The results will be used to develop guidelines for the number of patients to be used in each arm of an RCT when a questionnaire analyzed using IRT is used as the primary outcome. The methods are illustrated in a population of patients with end-stage renal disease 12 months after starting dialysis [14,15] using the SF-36 [6], the SF-12 [16], and the SF-8 [17] health-status surveys. In addition, to provide a framework in which data from RCTs can be analyzed using IRT, the behavior of asymptotic methods developed to compare the mean level of the latent trait in two groups will be considered in the relatively small samples encountered in RCTs.
IRT in an RCT
IRT is used to model the probability that a patient will respond to a number of items related to a latent trait in a certain way. The two-parameter logistic model [18] is for data resulting from items with two response categories, “0” and “1,” and is one of many IRT models developed [19]. In this model, the probability, pik(θ), that patient k, with latent trait equal to θk, will respond to item i in category 1 is given by
$$p_{ik}(\theta) = \frac{\exp(\alpha_i(\theta_k - \beta_i))}{1 + \exp(\alpha_i(\theta_k - \beta_i))} \tag{1}$$
where αi and βi are known as item parameters. The more widely known Rasch model [20] is similar to the two-parameter logistic model, but with the parameters αi assumed equal to 1. An important assumption of IRT models is that of local independence, meaning that the probability of a patient scoring 1 on a given item is independent of them scoring 1 on another item, given their value of θ. This means that the correlations between items and over patients are fully explained by θ. Models have also been developed for data resulting from items with more than two response categories [4,21]. The generalized partial credit model [22] is a fairly well-known example in which the probability pijk(θ) that a patient with latent trait equal to θk will respond to an item i, with (Ji + 1) response categories, in category j, j > 0, is given by
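$$p_{ijk}(\theta) = \frac{\exp\left(\sum_{u=0}^{j} \alpha_i(\theta_k - \beta_{iu})\right)}{1 + \sum_{j=0}^{J_i} \exp\left(\sum_{u=0}^{j} \alpha_i(\theta_k - \beta_{iu})\right)} \tag{2}$$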
where αi is the discrimination parameter of item i and βiu indicates the point at which the probabilities of choosing category j and category j − 1 are equal.
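The two models above can be sketched in a few lines of Python. This is only an illustration, not the authors' software: the function names are hypothetical, and the partial credit function assumes the conventional parameterization in which thresholds are indexed from 1 and category 0 carries an empty sum in the exponent.

```python
import math

def p_2pl(theta, alpha, beta):
    """Eq. (1): probability of a response in category 1 under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def p_gpcm(theta, alpha, betas):
    """Category probabilities 0..len(betas) for one polytomous item.

    Assumption: category 0 has an empty sum in the exponent, so its
    unnormalized weight is exp(0) = 1 (the usual partial credit convention).
    """
    exponents = [0.0]
    for beta_u in betas:
        # Cumulative sum of alpha * (theta - beta_u) up to each category.
        exponents.append(exponents[-1] + alpha * (theta - beta_u))
    top = max(exponents)                       # guard against overflow in exp
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

print(p_2pl(0.0, 1.0, 0.0))                    # trait level equal to item location
print(p_gpcm(0.3, 1.2, [-1.0, 0.5]))           # three-category item
```

With a single threshold the partial credit function reduces to the two-parameter logistic model, which is a convenient sanity check on any implementation.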
In IRT, the item and patient parameters are usually estimated in a two-stage procedure. First, the item parameters are estimated, often by assuming that the θk follow a normal distribution and integrating them out of the likelihood. Second, maximum likelihood estimates of θk are obtained using the previously estimated item parameters. In this study, it will be assumed that the items under consideration form part of a calibrated item bank [9], meaning that the item parameters have been previously estimated [23] from responses given by comparable patients to the items and are assumed known for all items. It is theoretically possible to estimate the item parameters from the responses given to the items by patients included in an RCT, but accurate estimates can only be obtained from large samples of patients, say over 500. Since this figure is rarely attained in RCTs, it will often not be practical to estimate the item parameters in an RCT.
In a straightforward RCT, the patient sample is randomly divided into two groups, say A and B. Each group receives a different treatment regimen and primary and secondary outcomes are assessed once at the end of the study. The main interest is in the null hypothesis, H0, that the mean level of the primary outcome, say θ, is equal in both groups. This can be written as
$$H_0 : \mu_A - \mu_B = 0 \tag{3}$$
where μA and μB denote the mean of the distribution of θ in groups A and B, respectively. Clinically, the two groups are said to differ if the ratio of the difference μA − μB and the standard deviation of θ is larger than a given effect size. Much work has been carried out examining the clinical relevance of particular effect sizes in given situations. However, in practice, interest is often in examining the arbitrarily defined minimal, moderate, and substantial effect sizes of 0.2, 0.5, and 0.8 on continuous variables [11]. The number of patients required to detect a given effect size with a particular power depends on the values of the effect size and the standard errors of the estimates of μA and μB.
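For orientation, the dependence of sample size on effect size can be illustrated for the idealized case in which θ is observed without measurement error. The sketch below uses the standard normal-approximation formula for a two-sided, two-sample comparison; it is a textbook approximation and not the IRT-based procedure studied in this paper, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def patients_per_arm(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm when theta is observed directly."""
    z = NormalDist()                            # standard normal distribution
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)      # two-sided critical value
    z_power = z.inv_cdf(power)
    return 2.0 * (z_alpha + z_power) ** 2 / effect_size ** 2

for d in (0.2, 0.5, 0.8):
    print(d, math.ceil(patients_per_arm(d)))    # 393, 63, and 25 per arm
```

These figures sit within a patient or two of the ∞ column of Table 4, which describes the same idealized situation; the small gap is plausibly due to the use of a t distribution there.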
Now consider an RCT in which the primary outcome is θ, measured at the end of the study using a questionnaire with n items, each with two response categories, and analyzed using IRT. Let us assume that patients 1, 2, …, K are in group A and patients K + 1, K + 2, …, 2K are in group B, meaning that the total sample size is 2K. For group A, we can rewrite Eq. (1) as
$$p_{ik}(\theta) = \frac{\exp(\alpha_i(\mu_A + \varepsilon_k - \beta_i))}{1 + \exp(\alpha_i(\mu_A + \varepsilon_k - \beta_i))} \tag{4}$$
where θk = μA + εk and μA is the mean of θk in group A. Hence, the εk have mean 0 and standard deviation σA. For group B, Eq. (1) can be rewritten in a similar way, with θk = μB + εk and the standard deviation of εk equal to σB. For RCTs carried out using a questionnaire consisting of items with more than two response categories, Eq. (2) can be rewritten in a similar way. The main interest is in examining the null hypothesis in Eq. (3) by obtaining estimates of μA and μB and testing whether they are significantly different from each other. When using IRT-based techniques, it is inadvisable to estimate the values of θk for all patients and then perform standard analysis, such as t tests, on these estimates [13], since this ignores the measurement error inherent to the estimates of θk. This means that using standard methods for calculating sample size may lead to inaccurate conclusions. The estimation equations for μA and μB are complex and the values of S.E.(μ̂A) and S.E.(μ̂B) depend not only on the number of patients in each arm of the RCT, but also on the number of items used to assess the patients and the relationship between αi and βi and the distribution of θ. This means that it is not possible to write down a straightforward equation for either the estimates of μA and μB or the number of patients required to detect a given effect size with a particular power. Consistent estimates of μA and μB are obtained by maximizing a likelihood function that has been marginalized with respect to θ [24,25]. These estimation methods are described in more detail in the appendix.
The marginal maximum likelihood estimates of μA and μB can be combined with their standard errors to obtain the test statistic
$$Z_L = \frac{\hat{\mu}_A - \hat{\mu}_B}{\mathrm{S.E.}(\hat{\mu}_A - \hat{\mu}_B)} \tag{5}$$
where μ̂A and μ̂B denote the estimates described above. It has been proven that under the hypothesis μA − μB = 0 and for large sample sizes, say 2K > 500, ZL follows an asymptotic standard normal distribution [24]. However, since these methods were developed in the field of educational measurement where interventions are assessed using very large samples, the behavior of ZL for the smaller samples common in RCTs is as yet unknown.
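The chain from item responses to ZL can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the paper: the standard deviation of θ is fixed at 1 rather than estimated, the integral over θ uses a fixed equally spaced grid, the maximum is found by golden-section search, the standard error comes from a numerical second derivative, and the two arms are treated as independent so that their variances add. The item parameters and all function names are hypothetical.

```python
import math
import random

def p_2pl(theta, alpha, beta):
    """Eq. (1): probability of a response in category 1."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def marginal_loglik(mu, data, alphas, betas, sigma=1.0, nodes=31, width=4.5):
    """Log-likelihood of mu with theta integrated out on a fixed grid."""
    step = 2.0 * width / (nodes - 1)
    grid = [-width + q * step for q in range(nodes)]   # standardized nodes
    weights = [math.exp(-0.5 * z * z) for z in grid]   # unnormalized N(0,1) density
    total_w = sum(weights)
    # Item probabilities at each node theta = mu + sigma * z, computed once.
    probs = [[p_2pl(mu + sigma * z, a, b) for a, b in zip(alphas, betas)]
             for z in grid]
    loglik = 0.0
    for row in data:
        marginal = 0.0
        for w, p_node in zip(weights, probs):
            like = w
            for x, p in zip(row, p_node):
                like *= p if x == 1 else 1.0 - p
            marginal += like
        loglik += math.log(marginal / total_w)
    return loglik

def estimate_mu(data, alphas, betas, lo=-3.0, hi=3.0, tol=1e-3):
    """Golden-section search for mu, with a curvature-based standard error."""
    def f(m):
        return marginal_loglik(m, data, alphas, betas)
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:                      # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = f(c)
        else:                            # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = f(d)
    mu_hat = (a + b) / 2.0
    h = 0.05                             # step for the numerical second derivative
    curvature = (f(mu_hat + h) - 2.0 * f(mu_hat) + f(mu_hat - h)) / (h * h)
    return mu_hat, math.sqrt(-1.0 / curvature)

def simulate_group(k, mu, alphas, betas, rng):
    """Binary responses of k patients with theta ~ N(mu, 1)."""
    data = []
    for _ in range(k):
        theta = rng.gauss(mu, 1.0)
        data.append([1 if rng.random() < p_2pl(theta, a, b) else 0
                     for a, b in zip(alphas, betas)])
    return data

# Hypothetical item bank of 10 items (illustrative values only).
alphas = [0.8, 1.0, 1.2, 1.5, 1.0, 0.9, 1.3, 1.1, 1.4, 1.0]
betas = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, -0.25, 0.25, 0.75]
rng = random.Random(2003)
group_a = simulate_group(200, 0.0, alphas, betas, rng)
group_b = simulate_group(200, 0.5, alphas, betas, rng)
mu_a, se_a = estimate_mu(group_a, alphas, betas)
mu_b, se_b = estimate_mu(group_b, alphas, betas)
# Eq. (5), taking the two arms as independent so the variances add.
z_l = (mu_a - mu_b) / math.sqrt(se_a ** 2 + se_b ** 2)
print(round(mu_a, 3), round(mu_b, 3), round(z_l, 2))
```

Because θ is integrated out rather than estimated per patient, the standard errors reflect both sampling variation and measurement error, which is the point made above about not running t tests on estimated trait levels.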
A simulation study
In this study, we simulated data from RCTs to examine the behavior of ZL and the power to detect a number of effect sizes with a given number of patients and items. We were particularly interested in RCTs where there were 30, 40, 50, 100, 200, 300, 500, or 1000 patients in each arm, denoted K, and where these patients were assessed using 5, 10, 15, 20, 30, 50, 70, or 100 items, denoted n, each with two response categories. These values were chosen to reflect the range of sample sizes often encountered in clinical research and the number of items with which it is acceptable to assess patients in a variety of situations. In total, there were 64 (= 8 × 8) different combinations of sample size and number of items in the study.
The study was carried out by simulating 1000 RCTs at each of the 64 different combinations. In each RCT, K values of θ were sampled from each of the N(0,1) and N(μB,1) distributions to represent the latent trait levels of patients in groups A and B, respectively. Since the standard deviation of θ is equal to one in both groups, the effect size in an RCT is equal to the difference between μA and μB. In addition, for each RCT n values of β and of α were generated from N(0,1) and log N(0.2,1) distributions, respectively, to represent the items. These values were chosen as they give a reasonably high level of statistical information on the values of θ used in this study and thus represent a carefully chosen questionnaire in terms of item characteristics. Each RCT was “conducted” by calculating the probability, pik, that patient k would respond to item i in category 1, given their value of the latent trait, θk, and the item parameters αi and βi using the formula in Eq. (1). The response, xik, made by patient k on item i was obtained by taking an observation on a Bi(1,pik) distribution. This was repeated for all 2K patients in an RCT and resulted in a data matrix with 2K rows and n columns. The responses, xik, were used, together with the two-parameter logistic model, the “known” values of α and β, and marginal maximum likelihood estimation methods to obtain estimates of μA, μB, and the associated standard errors. These estimates were combined to obtain a value of ZL for each RCT. The simulations were carried out in a program adapted by the authors from OPLM, a commercially available program for estimating parameters in IRT models [26].
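The data-generating step described above can be sketched as follows. This is not the adapted OPLM program; it is a minimal stand-in with hypothetical function names, and it assumes that “log N(0.2,1)” denotes a lognormal distribution whose logarithm has mean 0.2 and standard deviation 1.

```python
import math
import random

def logistic(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def simulate_rct(k, n, mu_b, rng):
    """Generate item parameters and a 2K x n binary response matrix for one RCT."""
    betas = [rng.gauss(0.0, 1.0) for _ in range(n)]           # item locations
    alphas = [rng.lognormvariate(0.2, 1.0) for _ in range(n)]  # discriminations (assumed reading)
    thetas = ([rng.gauss(0.0, 1.0) for _ in range(k)] +        # group A
              [rng.gauss(mu_b, 1.0) for _ in range(k)])        # group B
    data = []
    for theta in thetas:
        row = []
        for a, b in zip(alphas, betas):
            p = logistic(a * (theta - b))                      # Eq. (1)
            row.append(1 if rng.random() < p else 0)           # Bi(1, p) draw
        data.append(row)
    return alphas, betas, data

alphas, betas, data = simulate_rct(k=30, n=5, mu_b=0.5, rng=random.Random(0))
print(len(data), len(data[0]))   # rows = 2K patients, columns = n items
```

Repeating this 1000 times per combination of K and n, and passing each matrix to an estimation routine, gives the replicated values of ZL that the rest of the study summarizes.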
The distribution of ZL under the null hypothesis
In order to provide a framework for examining the small sample behavior of ZL under the null hypothesis H0 : μA − μB = 0, the values of μA and μB were set equal to 0 and 1000 RCTs were carried out at each of the 64 combinations of K and n. The distribution of the values of ZL obtained for each combination of K and n was examined by calculating their mean, Z̄L, and standard deviation, S.D.(ZL). In addition, the values of ZL were tested to see whether there was evidence that they did not form a sample from a normal distribution with mean Z̄L and standard deviation S.D.(ZL), using the Kolmogorov-Smirnov statistic.
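The normality check described above can be sketched as follows. The functions are hypothetical and use the asymptotic Kolmogorov distribution for the p-value, which ignores the fact that the mean and standard deviation are estimated from the same sample; the paper does not say which implementation was used. Random normal draws stand in for the 1000 simulated values of ZL.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def ks_statistic(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    dist = NormalDist(mean(values), stdev(values))
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = dist.cdf(x)
        # Compare against the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def ks_pvalue(d, n, terms=100):
    """Asymptotic Kolmogorov p-value (an approximation, see the caveat above)."""
    lam = math.sqrt(n) * d
    s = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
            for j in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * s))

rng = random.Random(0)
z_values = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # stand-in for simulated Z_L
d = ks_statistic(z_values)
print(round(d, 4), round(ks_pvalue(d, len(z_values)), 3))
```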
Sample size, number of items, and power
The main objective of the simulation study was to examine the power of an RCT using a questionnaire with n items as a primary endpoint and K patients in each arm of the RCT to detect minimal (0.2), moderate (0.5), or substantial (0.8) effect sizes. Hence, the whole simulation process with 64 combinations of K and n was repeated three times. The value of μA was set to 0 and σA = σB = 1. Hence, the values of μB were set at 0.2, 0.5, and 0.8 for the first, second, and third repetitions, respectively.
The 1000 values of ZL obtained for each of the 192 combinations of sample size, number of items, and effect size were compared to the critical values of the appropriate normal distribution to determine how many of the test statistics were significant at the (two-sided) 90%, 95%, and 99% levels. For RCTs with more than 100 patients in each arm, a standard normal distribution was used, meaning that the critical values were ±1.64, ±1.96, and ±2.58 for the 90%, 95%, and 99% levels, respectively. The critical values for RCTs with up to 100 patients in each arm were obtained in the first part of this simulation study and are given in the following section.
Results of the simulation study
The distribution of ZL under the null hypothesis
The mean, Z̄L, S.D.(ZL), and the p-value of the Kolmogorov-Smirnov test for normality, P(KS), of the 1000 values of ZL produced at each combination of n and K for μA = μB = 0 are given in Table 1. The mean value of ZL over all 64,000 replications is 0.0004 and only two of the 64 values of P(KS) are less than 0.10. This indicates that there is no reason to suspect that ZL does not attain its asymptotic normal distribution, with mean 0, for samples consisting of two groups, each of 30 to 1000 patients.
Table 1. The mean, standard deviation, and Kolmogorov-Smirnov p-value, P(KS), for ZL when μA = μB = 0

Number of patients in each of group A and group B (K): 30, 40, 50, 100

Number of   K = 30                       K = 40                       K = 50                       K = 100
items (n)   Mean ZL  S.D.(ZL)  P(KS)     Mean ZL  S.D.(ZL)  P(KS)     Mean ZL  S.D.(ZL)  P(KS)     Mean ZL  S.D.(ZL)  P(KS)
5            0.0963  1.0330    0.027      0.0006  1.0432    0.465      0.0269  1.0886    0.570      0.0158  1.0267    0.263
10          −0.0031  1.3092    0.012      0.0252  1.1260    0.230      0.0152  1.0405    0.639      0.0024  1.0015    0.826
15          −0.0058  1.2144    0.318      0.0202  1.1482    0.327     −0.0050  1.1383    0.807     −0.0228  1.0515    0.119
20           0.1366  1.2138    0.429      0.0246  1.2059    0.182     −0.0230  1.1729    0.407      0.0557  1.0325    0.826
30           0.0764  1.1905    0.872     −0.0063  1.0984    0.296     −0.0115  1.1075    0.320     −0.0255  1.0175    0.853
50           0.0794  1.2374    0.289      0.0037  1.1648    0.339     −0.0222  1.1356    0.571      0.0336  1.0484    0.336
70          −0.1169  1.2262    0.292     −0.0576  1.1728    0.294     −0.0252  1.0612    0.194      0.0178  0.9801    0.776
100          0.0151  1.2146    0.374     −0.0005  1.1425    0.879     −0.0070  1.1250    0.728      0.0097  1.0267    0.615

Table 1. Continued

Number of patients in each of group A and group B (K): 200, 300, 500, 1000

Number of   K = 200                      K = 300                      K = 500                      K = 1000
items (n)   Mean ZL  S.D.(ZL)  P(KS)     Mean ZL  S.D.(ZL)  P(KS)     Mean ZL  S.D.(ZL)  P(KS)     Mean ZL  S.D.(ZL)  P(KS)
5           −0.0197  1.0406    0.970     −0.0068  0.9882    0.322      0.0086  1.0181    0.793     −0.0420  1.0183    0.959
10           0.0342  1.0428    0.860     −0.0634  0.9778    0.610     −0.0287  1.0516    0.994     −0.0713  0.9775    0.302
15          −0.0561  1.0167    0.860     −0.0142  1.0064    0.652     −0.0223  0.9574    0.782      0.0069  1.0059    0.219
20          −0.0051  0.9980    0.472     −0.0319  0.9914    0.515     −0.0146  0.9898    0.446      0.0896  0.9679    0.942
30           0.0112  1.0263    0.567      0.0156  1.0141    0.966      0.0114  0.9612    0.782     −0.0235  0.9465    0.317
50          −0.0114  1.0102    0.980      0.0144  0.9661    0.471     −0.0201  0.9817    0.362     −0.0330  1.0285    0.894
70           0.0142  0.9972    0.687     −0.0364  0.9680    0.289     −0.0227  0.9950    0.153     −0.0286  1.0209    0.279
100          0.0126  0.9889    0.541      0.0382  1.0377    0.561     −0.0114  0.9898    0.697      0.0094  0.9488    0.661
Fig. 1. Standard deviation of ZL against the number of patients in each arm of a randomized clinical trial.
The variation of the standard deviation of the 1000 values of ZL produced at each combination of conditions is illustrated with respect to K and n in Figs. 1 and 2, respectively. It does not appear that n has a substantial effect on the standard deviation of ZL. However, S.D.(ZL) increases as the sample size decreases. We modeled the relationship between S.D.(ZL) and K, under the null hypothesis H0 : μA = μB, using
$$\mathrm{S.D.}(Z_L) = 0.97 + 6.67\,\frac{1}{K}. \tag{6}$$
Eq. (6) was obtained using regression analysis, since the variance of a statistic is often related to the reciprocal of the number of patients used to calculate the statistic. The usual assumptions were tested and the data did not violate them. The asymptote was not forced down to 1, since Eq. (6) gave the best fit to data obtained from 30, 40, 50, and 100 patients, and primary interest was in studies with these numbers of patients, rather than obtaining an accurate representation of smaller deviations from a standard normal distribution. The correlation between S.D.(ZL) and (1/K) was 0.8843, meaning that R² = 0.7820. The modeled distribution for ZL along with critical values and type I error rates based on the 8000 “trials” with an effect size of 0.0 carried out using the eight test lengths for the 90%, 95%, and 99% levels for RCTs using 30, 40, 50, and 100 patients per arm are given in Table 2.
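The small-sample critical values follow directly from Eq. (6): the modeled standard deviation is multiplied by the usual standard normal quantile. The sketch below uses hypothetical function names and assumes that the standard deviation is rounded to two decimals before multiplying, which is what appears to reproduce the printed critical values.

```python
from statistics import NormalDist

def modeled_sd(k):
    """Eq. (6): modeled standard deviation of Z_L under the null hypothesis."""
    return 0.97 + 6.67 / k

def critical_value(k, alpha):
    """Two-sided critical value for Z_L with K patients per arm."""
    sd = round(modeled_sd(k), 2)   # assumption: S.D. rounded to two decimals first
    return round(sd * NormalDist().inv_cdf(1.0 - alpha / 2.0), 2)

for k in (30, 40, 50, 100):
    print(k, round(modeled_sd(k), 2),
          [critical_value(k, a) for a in (0.10, 0.05, 0.01)])
# 30  1.19 [1.96, 2.33, 3.07]
# 40  1.14 [1.88, 2.23, 2.94]
# 50  1.1  [1.81, 2.16, 2.83]
# 100 1.04 [1.71, 2.04, 2.68]
```

The values agree with the small-sample rows of Table 2; for 200 or more patients per arm the standard normal quantiles are used unchanged.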
Fig. 2. Standard deviation of ZL against the number of items used to assess the patients.
Table 2. Modeled distribution for ZL and critical values for randomized clinical trials with small sample sizes

Number of patients   Modeled distribution   Critical values            Type I error rate
per arm (K)          of ZL                  90%     95%     99%        90%      95%      99%
30                   N(0, 1.19²)            ±1.96   ±2.33   ±3.07      0.1016   0.0565   0.0139
40                   N(0, 1.14²)            ±1.88   ±2.23   ±2.94      0.1005   0.0545   0.0139
50                   N(0, 1.10²)            ±1.81   ±2.16   ±2.83      0.1025   0.0558   0.0158
100                  N(0, 1.04²)            ±1.71   ±2.04   ±2.68      0.0933   0.0470   0.0086
≥200                 N(0, 1)                ±1.64   ±1.96   ±2.58

Sample size, number of items, and power
Table 3 contains the number of the 1000 RCTs carried out under each of the 192 combinations of K, n, and effect size that resulted in a value of the ZL statistic beyond the appropriate critical values given in Table 2. In order to facilitate comparisons with the situation where θ were known, similar to obtaining an estimate using an infinite number of items, the theoretical number of 1000 RCTs that would result in a value of the T statistic beyond the appropriate critical values has been calculated using standard procedures [27]. These are presented in rows labeled ∞. If the numbers in Table 3 are divided by 1000, then an estimate of the power under the given combination of factors is obtained. It can be seen that the power to detect the given effect size increases with the effect size and the number of items used to assess the patients. It is also apparent that while increasing the number of items used to assess the patients from five to ten and from ten to twenty results in substantial increases in the power, increasing the number of items beyond 30 increases this figure only minimally. In addition, the power obtained using 100 items does not even approach the theoretical maximum, using an infinite number of items, for K ≤ 40, indicating that the correction introduced in Table 2 may not be sufficient for very small RCTs. Using the results in Table 3 and linear interpolation, the values of K required to detect the effect size at the two-sided 5% level and with a power of 80% using a given n were calculated and are displayed in Table 4. This significance level and power were chosen as these values are regularly used when analyzing RCTs. It can be seen that as long as at least 20 items are used, the number of items barely affects the number of patients required in an RCT to detect effect sizes of 0.5 and 0.8. However, the number of items used to assess patients has more effect on the number of patients required to detect an effect size of 0.2. For instance, if only five randomly selected items are used, it is necessary to include 950 patients in each arm, but if 50 items are used, only 450 are required in each arm.

Table 3. The number of 1000 randomized clinical trials in which ZL was greater than the appropriate critical value for the two-sided significance level given

Number of patients in each of group A and group B (K): 30, 40, 50, 100

Effect   Number of    K = 30              K = 40              K = 50              K = 100
size     items (n)    0.10  0.05  0.01    0.10  0.05  0.01    0.10  0.05  0.01    0.10  0.05  0.01
0.2      5            126   64    24      136   76    18      149   77    26      216   139   42
0.2      10           145   74    15      139   72    31      188   108   28      276   174   63
0.2      15           141   87    24      133   81    30      208   117   27      305   223   74
0.2      20           155   91    28      163   99    29      180   126   37      339   215   66
0.2      30           158   86    24      186   109   36      190   130   33      344   236   87
0.2      50           170   109   25      158   85    28      210   126   41      336   227   87
0.2      70           157   78    33      181   117   44      215   120   36      345   247   98
0.2      100          139   96    37      165   100   33      211   138   48      365   258   100
0.2      ∞            190   110   30      220   140   40      260   160   50      400   290   120
0.5      5            174   104   30      344   247   90      433   314   126     724   610   361
0.5      10           267   181   69      438   297   132     575   445   208     837   742   542
0.5      15           280   170   91      492   364   145     579   463   225     876   800   611
0.5      20           303   185   70      467   349   140     606   488   267     905   854   702
0.5      30           294   193   88      550   416   180     650   510   283     927   862   704
0.5      50           291   208   89      559   426   208     622   492   274     939   885   706
0.5      70           305   208   67      549   428   213     663   536   287     947   900   735
0.5      100          317   208   88      557   415   197     674   550   296     935   889   727
0.5      ∞            600   470   240     710   590   340     790   690   450     960   940   820
0.8      5            389   262   98      655   538   297     805   689   422     973   949   815
0.8      10           497   354   163     774   669   425     900   852   639     995   991   956
0.8      15           489   384   179     846   747   482     911   817   649     1000  997   984
0.8      20           475   371   194     872   774   497     931   892   690     997   994   985
0.8      30           538   416   205     856   789   581     950   897   750     999   999   993
0.8      50           573   461   217     875   804   574     933   903   771     1000  1000  992
0.8      70           535   428   229     862   773   549     952   919   799     998   998   994
0.8      100          571   427   214     891   822   618     940   899   775     1000  1000  995
0.8      ∞            920   860   660     970   940   820     1000  970   910     1000  1000  1000

Table 3. Continued

Number of patients in each of group A and group B (K): 200, 300, 500, 1000

Effect   Number of    K = 200             K = 300             K = 500             K = 1000
size     items (n)    0.10  0.05  0.01    0.10  0.05  0.01    0.10  0.05  0.01    0.10  0.05  0.01
0.2      5            372   268   112     515   375   164     649   523   328     900   822   610
0.2      10           492   350   170     625   508   266     810   712   474     970   949   825
0.2      15           536   395   189     675   556   317     849   763   534     982   963   896
0.2      20           539   424   209     688   565   334     880   802   564     992   980   923
0.2      30           561   414   211     736   619   394     901   827   625     995   990   929
0.2      50           598   462   267     773   657   409     904   853   673     994   989   943
0.2      70           585   473   246     750   659   401     926   871   718     998   995   962
0.2      100          612   489   262     765   659   414     920   879   697     994   991   966
0.2      ∞            630   510   270     780   680   440     930   880   710     1000  1000  970
0.5      5            960   924   802     985   976   931     -     -     -       -     -     -
0.5      10           994   985   923     1000  998   986     -     -     -       -     -     -
0.5      15           998   992   957     1000  1000  997     -     -     -       -     -     -
0.5      20           995   990   965     1000  1000  999     -     -     -       -     -     -
0.5      30           997   991   967     1000  1000  1000    -     -     -       -     -     -
0.5      50           995   994   982     1000  1000  1000    -     -     -       -     -     -
0.5      70           1000  997   991     1000  1000  1000    -     -     -       -     -     -
0.5      100          996   995   986     1000  1000  1000    -     -     -       -     -     -
0.5      ∞            1000  1000  1000    1000  1000  1000    -     -     -       -     -     -
0.8      5            1000  999   995     -     -     -       -     -     -       -     -     -
0.8      10           1000  1000  1000    -     -     -       -     -     -       -     -     -
0.8      15           1000  1000  1000    -     -     -       -     -     -       -     -     -
0.8      20           1000  1000  1000    -     -     -       -     -     -       -     -     -
0.8      30           1000  1000  1000    -     -     -       -     -     -       -     -     -
0.8      50           1000  1000  1000    -     -     -       -     -     -       -     -     -
0.8      70           1000  1000  1000    -     -     -       -     -     -       -     -     -
0.8      100          1000  1000  1000    -     -     -       -     -     -       -     -     -
0.8      ∞            1000  1000  1000    -     -     -       -     -     -       -     -     -
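The linear interpolation used to move from rejection counts to an approximate number of patients per arm can be sketched as follows. The function name is hypothetical; the counts are the Table 3 entries for an effect size of 0.2 at the two-sided 5% level, and the target of 800 rejections out of 1000 corresponds to 80% power.

```python
def patients_for_power(ks, rejections, target=800):
    """Interpolate linearly for the K at which the rejection count reaches the target."""
    points = list(zip(ks, rejections))
    for (k0, r0), (k1, r1) in zip(points, points[1:]):
        if r0 < target <= r1:
            return k0 + (target - r0) * (k1 - k0) / (r1 - r0)
    return None                                   # target not bracketed by the data

ks = [30, 40, 50, 100, 200, 300, 500, 1000]
# Rejections out of 1000 at the 5% level for effect size 0.2, read from Table 3.
five_items = [64, 76, 77, 139, 268, 375, 523, 822]
fifty_items = [109, 85, 126, 227, 462, 657, 853, 989]

print(round(patients_for_power(ks, five_items)))    # 963
print(round(patients_for_power(ks, fifty_items)))   # 446
```

These interpolated values are in line with the approximate figures of 950 and 450 patients per arm quoted in the text; the remaining gap presumably reflects rounding in the published table.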
An illustration using the short form instruments from the Medical Outcomes Study
In order to illustrate the methods described in this paper, data were used from the second phase of the Netherlands Co-operative Study on Dialysis (NECOSAD). NECOSAD is a multicenter prospective cohort study that includes all new end-stage renal disease patients who started chronic hemodialysis or peritoneal dialysis. Treatment is not randomized but chosen after discussion with the patient and taking clinical, psychological, and social factors into account. The SF-36 was developed from the original Medical Outcomes Study survey [28] to measure health status in a variety of research situations [6]. The SF-36 is a reliable measure of health status in a wide range of patient groups [29] and has been used with classical [30] and IRT-based [31] psychometric models. In addition, the SF-12, with 12 items [16], and a preliminary version of the SF-8, with 9 items [17], have been developed. In this article, the Dutch language version of the SF-36 [32] was used. The SF-36 uses between two and six response categories per item and, on the majority of items, a higher score denotes a better health state. The items that are scored so that a higher score denotes a worse health state have been rescored so that a higher score denotes a better health state. A brief description of each item is given in Table 5, together with an indication of whether an item is used in the SF-12 or SF-8, while full details are available elsewhere [1]. The measurement properties of the SF-36 in a sample of patients with chronic kidney failure 12 months after the start of dialysis [14,15] have been examined using the generalized partial credit model described in Eq. (2) and are summarized in Table 5. Questionnaires were sent to 1046 patients, of whom 978 completed at least one question on the SF-36. Of the 978 patients, 583 were male and 395 were female, while 615 were on hemodialysis and 363 were on peritoneal dialysis. The patients were between 18 and 90 years of age, with a median of 61 years. When IRT techniques were used to examine the quality of life, the mean, μ, was 0 and the standard deviation was σ = 0.692. It should be emphasized that the model parameters given in this article are for illustration purposes only and should not be regarded as a definitive calibration of these items as the fit of the model has not been extensively tested, particularly with respect to differential item characteristics between subgroups of patients.

Table 4. Approximate number of patients required in each arm (K) of a randomized clinical trial to demonstrate a given effect size at the 5% level with 80% power

              Number of items used
Effect size   5     10    20    50    100   ∞
0.2           950   680   500   450   440   394
0.5           160   125   95    90    87    64
0.8           70    50    45    40    39    26
In order to examine the power with which studies using the short form instruments could detect treatment effects in a population of patients with chronic kidney failure 12 months after starting dialysis, a simulation study was carried out. The simulations were carried out in a similar way to those described earlier in the section “A simulation study.” We were interested in detecting effect sizes of 0.2, 0.5, and 0.8, denoted x, in RCTs with 30, 40, 50, 100, 200, 300, 500, or 1000 patients in each arm (K) using the SF-36, the SF-12, or the SF-8 as the primary endpoint, meaning that there were 24 different combinations of x and K. In each run, K values of θ were sampled from each of N(μ, σ²) and N(μ + xσ, σ²), where x represents the effect size expected in the study and σ the standard deviation of quality of life, to denote the control and treatment groups, respectively, meaning that μA = 0 and μB = 0.692x. These values were combined with the item parameters in Table 5 and Eq. (2) to generate “responses” by “patients” to the SF-36. These data were used to obtain the statistic ZL using the items in the SF-36, in the SF-12, and in the SF-8. One thousand runs, or RCTs, were carried out at each of the 24 combinations of K and x. It was assumed that the item parameters were known and equal to those in Table 5.
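Generating polytomous “responses” for one run can be sketched as follows. The code is illustrative only: the function names are hypothetical, the partial credit probabilities assume the conventional parameterization in which category 0 has an empty sum in the exponent, and only item 1 of Table 5 is simulated.

```python
import math
import random

def gpcm_probs(theta, alpha, betas):
    """Category probabilities 0..len(betas) under a generalized partial credit model."""
    exponents = [0.0]                       # category 0: empty sum (assumed convention)
    for beta_u in betas:
        exponents.append(exponents[-1] + alpha * (theta - beta_u))
    top = max(exponents)                    # guard against overflow in exp
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def sample_category(theta, alpha, betas, rng):
    """Draw one response by inverting the cumulative category probabilities."""
    u = rng.random()
    cumulative = 0.0
    probs = gpcm_probs(theta, alpha, betas)
    for category, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return category
    return len(probs) - 1                   # fallback for rounding at the upper edge

sigma = 0.692            # standard deviation of the latent trait reported in the text
x = 0.5                  # effect size for this run
k = 100                  # patients per arm
rng = random.Random(36)
control = [rng.gauss(0.0, sigma) for _ in range(k)]
treated = [rng.gauss(x * sigma, sigma) for _ in range(k)]
# Parameters of item 1 (general health status) as listed in Table 5.
alpha_1, betas_1 = 2.641, [-3.890, -3.810, 0.086, 4.825]
responses = [sample_category(t, alpha_1, betas_1, rng) for t in control + treated]
print(sorted(set(responses)))
```

Looping the same draw over all items of an instrument gives one simulated data matrix, which is then analyzed in the same way as in the binary case.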
The number of runs at each combination of K, x, and short form instrument resulting in a value of $Z_L$ more extreme than the appropriate critical value, in Table 2, for known item parameters is recorded in Table 6. In general, the SF-36 detected a given treatment effect more often than the SF-12, which in turn detected a treatment effect more often than the SF-8. The numbers of patients needed in each arm of an RCT using the SF-36, SF-12, and SF-8 to detect standard treatment effects in the underlying latent trait with a power of 80% were obtained using linear interpolation and are given in Table 7. It appears that the number of patients required when using the SF-36 to detect an effect of 0.2 is less than if $\theta$ were known, as would be the case if an infinite number of items were used. This is not true but is a result of the inherent error involved in a simulation study with only 1000 trials at each combination of K and x. It can also be seen that the differences between the numbers of patients required in each arm are less marked than in Table 4. This is because the items on the SF-36 have up to six response categories leading to a total of 149, 47, and 40 different
Table 5. The characteristics of the items in the SF-36 questionnaire

Item    Item content                      Short forms  Response    Scoring   Item parameters
number                                                 categories  reversed         α      β1      β2      β3      β4      β5
1       General health status             SF-12, SF-8  5           yes          2.641  −3.890  −3.810   0.086   4.825
2       Health compared with a year ago   SF-8         5           yes          0.624  −1.759  −2.857  −2.006  −0.931
        Does your health limit
3a      vigorous activities                            3           no           1.729   1.300   4.496
3b      moderate activities               SF-12        3           no           2.767  −1.031   1.019
3c      lifting or carrying groceries                  3           no           2.247  −1.370  −0.025
3d      climbing 2+ flights of stairs     SF-12, SF-8  3           no           2.353  −0.890   0.612
3e      climbing one flight of stairs                  3           no           2.276  −1.996  −2.132
3f      bending, kneeling, or stooping                 3           no           1.924  −1.268  −0.384
3g      walking more than a mile                       3           no           2.155  −0.204   0.809
3h      walking several blocks                         3           no           2.245  −1.206  −1.302
3i      walking one block                              3           no           2.080  −1.865  −2.524
3j      bathing or dressing yourself                   3           no           2.412  −3.777  −5.497
        Problems with work as a result of physical health
4a      reduced amount of time working                 2           no           2.770   0.441
4b      accomplished less than hoped      SF-12        2           no           3.094   1.077
4c      limited in kind of work           SF-12        2           no           3.495   1.290
4d      difficulty in working             SF-8         2           no           2.998   1.388
        Problems with work as a result of emotional health
5a      reduced amount of time working                 2           no           2.587  −0.435
5b      accomplished less than hoped      SF-12, SF-8  2           no           2.730   0.046
5c      worked less carefully             SF-12        2           no           2.302  −0.681
6       Reduced social activities         SF-12, SF-8  5           yes          1.834  −2.616  −4.434  −5.872  −5.587
7       Amount of bodily pain             SF-8         6           yes          0.837  −1.318  −3.196  −3.193  −3.131  −3.241
8       Pain interfere with work          SF-12        5           yes          1.664  −2.286  −4.055  −4.927  −4.638
        Emotions felt
9a      full of pep                                    6           yes          0.632  −1.475  −2.638  −1.702  −2.598  −1.242
9b      very nervous                                   6           no           0.623  −1.056  −2.257  −3.755  −3.780  −3.530
9c      down in the dumps                              6           no           1.219  −2.452  −4.755  −6.607  −7.134  −7.291
9d      calm and peaceful                 SF-12        6           yes          0.985  −1.928  −3.267  −2.922  −3.583  −2.028
9e      lots of energy                    SF-12, SF-8  6           yes          1.347  −1.740  −2.167  −1.103  −0.581   1.884
9f      downhearted and blue              SF-12, SF-8  6           no           1.090  −1.633  −3.583  −5.401  −5.359  −4.948
9g      worn out                                       6           no           1.452  −2.010  −3.834  −4.637  −3.523  −1.925
9h      a happy person                                 6           yes          0.591  −1.395  −2.823  −2.472  −3.435  −2.612
9i      tired                                          6           no           1.506  −1.229  −2.107  −2.238   0.016   2.458
10      Reduced social activities                      5           no           1.703  −2.333  −3.955  −3.622  −2.908
        Rate relevance of statements
11a     sick easier than other people                  5           no           0.670  −0.772  −1.279  −0.663  −0.345
11b     as happy as anybody I know                     5           yes          0.758  −0.558  −0.012   0.115   1.950
11c     I expect my health to get worse                5           no           0.637  −1.213  −1.887  −0.121   0.519
11d     my health is excellent                         5           yes          1.083  −0.529   0.696   0.002   2.442
Table 6. The number of 1000 randomized clinical trials using the SF-36, SF-12, and SF-8, in which $Z_L$ was greater than the appropriate critical value for a test at the two-sided significance level given

                     Number of patients per arm (K)
                     30                    40                    50                    100
Effect               Significance level
size    Instrument     0.10  0.05  0.01      0.10  0.05  0.01      0.10  0.05  0.01      0.10  0.05  0.01
0.2     SF-36           155    92    29       210   137    38       252   168    57       430   312   156
0.2     SF-12           146    80    25       197   110    34       240   160    70       415   295   141
0.2     SF-8            152    84    23       184   114    32       221   138    49       385   265   116
0.5     SF-36           493   352   147       690   571   332       769   695   452       976   949   838
0.5     SF-12           495   363   166       618   513   284       765   659   392       960   933   838
0.5     SF-8            469   314   142       611   490   260       750   627   391       944   895   726
0.8     SF-36           858   755   538       961   914   785       980   955   897      1000   999   997
0.8     SF-12           834   724   500       945   893   716       986   964   856       999   999   997
0.8     SF-8            840   714   451       925   860   663       977   959   858       999   998   989

Table 6. Continued

                     Number of patients per arm (K)
                     200                   300                   500                   1000
Effect               Significance level
size    Instrument     0.10  0.05  0.01      0.10  0.05  0.01      0.10  0.05  0.01      0.10  0.05  0.01
0.2     SF-36           666   561   353       841   782   549       952   917   786       998   995   982
0.2     SF-12           636   515   300       808   739   525       928   891   751       997   997   972
0.2     SF-8            637   516   288       772   687   489       936   891   742       995   990   959
0.5     SF-36          1000   998   988         -     -     -         -     -     -         -     -     -
0.5     SF-12           998   997   995         -     -     -         -     -     -         -     -     -
0.5     SF-8            999   999   993         -     -     -         -     -     -         -     -     -
Table 7. Approximate number of patients required in each arm (K) of a randomized clinical trial to demonstrate a given effect size at the 5% level with 80% power

              Instrument used
Effect size   SF-8    SF-12   SF-36   ∞
0.2           410     380     325     394
0.5            82      75      71      64
0.8            36      35      33      26
possible sum scores on the SF-36, SF-12, and SF-8, respectively. The items used in the first simulation study described in this paper had only two response categories, meaning that 149, 47, and 40 items would have been necessary to obtain the same range of possible sum scores.
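The linear interpolation used to move from the counts in Table 6 to the sample sizes in Table 7 can be illustrated with a short sketch. The counts below are the entries of Table 6 at the 0.05 significance level; the function name `patients_for_power` is a hypothetical helper, and the authors' exact rounding rule is not stated, so the results should only be expected to agree with Table 7 approximately.

```python
def patients_for_power(ks, hits, runs=1000, power=0.80):
    """Interpolate linearly between simulated sample sizes to find the number
    of patients per arm at which `power` of the runs are significant."""
    target = power * runs
    points = list(zip(ks, hits))
    for (k_lo, h_lo), (k_hi, h_hi) in zip(points, points[1:]):
        if h_lo < target <= h_hi:
            return k_lo + (target - h_lo) * (k_hi - k_lo) / (h_hi - h_lo)
    return None  # the target power is not bracketed by the simulated values


# Number of significant runs out of 1000 at the 0.05 level, from Table 6.
sf36_effect_05 = patients_for_power([30, 40, 50, 100, 200],
                                    [352, 571, 695, 949, 998])
sf36_effect_08 = patients_for_power([30, 40, 50, 100],
                                    [755, 914, 955, 999])
sf8_effect_02 = patients_for_power([30, 40, 50, 100, 200, 300, 500, 1000],
                                   [84, 114, 138, 265, 516, 687, 891, 990])

print(round(sf36_effect_05, 1))  # 70.7, compared with 71 in Table 7
print(round(sf36_effect_08, 1))  # 32.8, compared with 33 in Table 7
print(round(sf8_effect_02, 1))   # 410.8, compared with 410 in Table 7
```

Because the power curve is concave in this region, a straight line between two simulated points lies below the curve, which is why interpolated sample sizes of this kind tend to be slightly conservative.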
Discussion
This paper has described simulation-based studies into the use of IRT to analyze RCTs using a questionnaire consisting of items with two response categories and designed to quantify a theoretical variable as the primary outcome. The study into the small sample behavior of asymptotic results provides a framework in which data from RCTs can be analyzed using IRT. It had been proven that the $Z_L$ statistic closely approximated an $N(0,1)$ distribution under the null hypothesis for studies in which each arm contained at least 500 patients [24]. The results in this paper show that $Z_L$ follows an $N(0,\sigma)$ distribution for small RCTs, but that $\sigma \neq 1$. The variance appears to be a function of the reciprocal of the number of patients in each arm of an RCT. This means that, as in many situations in which asymptotic results are used, the procedure for testing whether an effect is significant needs to be adjusted for the relatively small sample sizes often found in RCTs when IRT is used.
The main simulation study examined the relationship between the number of patients in each arm of an RCT, the number of items used to assess the patients, and the power to detect given effect sizes when using IRT. As ever, the smaller the effect size, the larger the number of patients needed in each arm to detect the effect with a given power. In addition, increasing the number of items used to assess the patients can mean that given effects can be detected using fewer patients. The number of patients required to detect a minimal effect using five items is more than twice the number required when 50 items are used. The results suggest that reductions in the number of patients required are minimal for more than 20 carefully chosen items, indicating that a maximum of 20 items with good measurement qualities is sufficient to assess patients. However, if the items have poor measurement properties, such as very low values of $\alpha_i$ or values of $\beta_i$ more extreme than $\mu_\theta \pm 2\sigma_\theta$, indicating a lack of discrimination and floor or ceiling effects in a questionnaire, respectively, then many more items may be required. These results also indicate that if an RCT is to be designed to detect small effects, it is inadvisable to use very short instruments consisting of items with two response categories analyzed using IRT. Again, it should be noted that these results are based on the assumption that the items used do not have poor measurement properties, as discussed above.
It is common wisdom that to detect effect sizes of 0.2, 0.5, or 0.8 using a t test with a significance level of 0.05 and a power of 80%, it is necessary to include 394, 64, or 26 patients, respectively, in each arm of an RCT. The results presented in this paper show that 450, 90, and 40 patients are required in each arm to detect the same effects using 50 items and IRT. The main reason for the differences between these figures is that when a t test is used, it is assumed that the variable of interest is measured without error, whereas IRT takes into account that a latent variable cannot be measured without error. In addition, the sample sizes obtained using IRT are conservative as linear interpolation has been used between observations, whereas it is reasonable to assume that the true function would curve toward the top left-hand corner, indicating that the true numbers of patients required are marginally lower than these results suggest. It should also be emphasized that if researchers require very accurate estimates of the number of patients required in each arm of an RCT using a particular set of items, they should carry out their own simulation study. There are a number of advantages of the use of IRT in the analysis of RCTs. First, the $Z_L$ test described in this article can, in contrast to the t test, always be used in the same format, regardless of whether the variance of the latent trait is the same in both arms of the RCT or not. Second, it is not essential that all patients are assessed using exactly the same questionnaire. The questionnaire can be shortened for particular groups of patients, while the estimates of the latent trait remain comparable over all patients.
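For comparison, the conventional figures quoted above can be roughly reproduced with the standard normal approximation for a two-sided, two-sample comparison of means, n = 2((z_{1−α/2} + z_{power}) / d)² per arm. The sketch below is an illustration under that assumption, not the calculation used by the authors; the function name `n_per_arm` is hypothetical. The approximation gives values one patient smaller than the 394, 64, and 26 quoted in the text, presumably because those figures are based on the t distribution rather than the normal distribution.

```python
import math
from statistics import NormalDist


def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Patients per arm under the normal approximation for a two-sided,
    two-sample test with a standardized effect size (assumed formula)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_power = z.inv_cdf(power)          # quantile corresponding to the power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for d in (0.2, 0.5, 0.8):
    print(d, n_per_arm(d))  # 393, 63, and 25 patients per arm
```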
The values of the standard errors of $\mu_A$ and $\mu_B$ depend on a complex relationship between the values of the latent trait of the patients included in an RCT and the parameters of the items used to assess them. This means that the results obtained in this study are, theoretically, local to the combination of patients and items used. However, we reselected all parameters for each of the 1000 individual "RCTs" carried out at each combination of effect size, number of patients per arm, and number of items in order to give a more general picture of the number of patients required. In addition, item parameters for the SF-36, SF-12, and SF-8 quality of life instruments were estimated from a dataset collected from patients undergoing dialysis for chronic kidney failure and used to illustrate the methods described.
This article has examined the sample sizes required in an RCT when the primary outcome is a latent trait measured by a questionnaire analyzed with IRT. It has been shown that it is possible, following a minor transformation, to use the statistics developed for analyzing large-scale educational interventions in the much smaller sample sizes encountered in RCTs in clinical medicine. In addition, the relationship between the number of items used and the number of patients required to detect particular effects was examined. It is hoped that this article will contribute to the understanding of IRT, particularly in relation to RCTs.
Acknowledgments
This research was partly supported by a grant from the Anton Meelmeijerfonds, a charity supporting innovative research in the Academic Medical Center, Amsterdam, The Netherlands. The authors would like to thank the researchers involved in The Netherlands Cooperative Study on the Adequacy of Dialysis (NECOSAD) for allowing their data to be used in this paper.
Appendix: Obtaining $\mu_A$, $\mu_B$, and their standard errors
The values of $\mu_A$ and $\mu_B$ and their standard errors can be estimated using marginal maximum likelihood methods. The likelihood $L$ can be written as

$$L = \prod_{g=A,B} \prod_{k=1}^{K_g} \int_{-\infty}^{\infty} \prod_{i,j} p_{ijk}(\theta)^{x_{ijk}} \, g(\theta \mid \mu_g, \sigma_g) \, d\theta \tag{A.1}$$

where $p_{ijk}$ is as defined in Eqs. (1) and (2), and $g(\theta \mid \mu_g, \sigma_g)$ is a normal density function with mean $\mu_g$ and variance $\sigma_g^2$ for $g = A, B$. Furthermore, $x_{ijk}$ is an indicator variable taking the value 1 if patient k responds in category j of item i and the value 0 otherwise. In the study described in this paper, we have assumed that the item parameters $\alpha_i$ and $\beta_i$ are known. In practice, this would mean that the items came from a calibrated item bank. It has been shown that $\mu_A$ can be estimated using
$$\mu_A = \frac{1}{K} \sum_{k=1}^{K} E(\theta_k \mid X_k) \tag{A.2}$$

where $X_k$ is a vector with n elements containing the responses, $x_{ik}$, of patient k to the items, and $E(\theta_k \mid X_k)$ is the posterior expected value of $\theta_k$ for patient k given their pattern of responses, $X_k$ [23–25]. It has also been shown that if the item parameters are considered known, as in this study, the asymptotic standard error $S.E.(\mu_A - \mu_B)$ is equal to $S.E.(\mu_A) + S.E.(\mu_B)$, where
$$S.E.(\mu_A) = \frac{1}{\sigma_A^{-4} \sum_{k=1}^{K} \left( E(\theta_k \mid X_k) - \mu_A \right)^2} \tag{A.3}$$

and $\sigma_A$ is the estimated standard deviation of $\theta$ in group A. The definition of $S.E.(\mu_B)$ is analogous to that of $S.E.(\mu_A)$. The expectation $E(\theta_k \mid X_k)$ is as defined in Eq. (A.2) and given by
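$$E(\theta_k \mid X_k) = \int_{-\infty}^{\infty} \theta_k \, f(\theta_k \mid X_k) \, d\theta_k \tag{A.4}$$

where the posterior distribution of $\theta_k$ is

$$f(\theta_k \mid X_k) = \frac{P(X_k \mid \theta_k) \, g(\theta_k \mid \mu_A, \sigma_A)}{\int_{-\infty}^{\infty} P(X_k \mid \theta_k) \, g(\theta_k \mid \mu_A, \sigma_A) \, d\theta_k} \tag{A.5}$$

where $P(X_k \mid \theta_k)$ is the probability of the response pattern made by patient k given $\theta_k$.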
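The posterior expectation in Eqs. (A.4) and (A.5), and the estimate of the group mean in Eq. (A.2), can be approximated numerically. The following is a minimal sketch under stated assumptions, not the authors' software: the integrals are replaced by sums over an equally spaced grid, the standard deviation of the latent trait is treated as known, the category probabilities use the same assumed generalized partial credit form as Eq. (2) is taken to have (probability of category j proportional to exp(j·α·θ − β_j), β_0 = 0), and Eq. (A.2) is applied repeatedly because its right-hand side depends on the mean itself. All function names are hypothetical.

```python
import math


def category_probs(theta, alpha, betas):
    """ASSUMED item response function: P(X = j | theta) is proportional to
    exp(j * alpha * theta - beta_j), with beta_0 = 0."""
    logits = [0.0] + [(j + 1) * alpha * theta - b for j, b in enumerate(betas)]
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_mean(responses, items, mu, sigma, n_points=201):
    """Grid approximation of E(theta | X) as in Eqs. (A.4) and (A.5)."""
    lo = mu - 6.0 * sigma
    step = 12.0 * sigma / (n_points - 1)
    numerator = denominator = 0.0
    for t in range(n_points):
        theta = lo + t * step
        # Unnormalized normal prior density g(theta | mu, sigma).
        weight = math.exp(-0.5 * ((theta - mu) / sigma) ** 2)
        # Multiply by the probability of each observed response, P(X | theta).
        for (alpha, betas), x in zip(items, responses):
            weight *= category_probs(theta, alpha, betas)[x]
        numerator += theta * weight
        denominator += weight
    return numerator / denominator


def estimate_mu(data, items, sigma, n_iter=50):
    """Repeatedly apply Eq. (A.2): the mean of the posterior expectations."""
    mu = 0.0
    for _ in range(n_iter):
        mu = sum(posterior_mean(r, items, mu, sigma) for r in data) / len(data)
    return mu


# Two binary items (4a and 5a) with parameters taken from Table 5.
items = [(2.770, [0.441]), (2.587, [-0.435])]
data = [[1, 1], [0, 1], [0, 0], [1, 1], [0, 1]]
mu_hat = estimate_mu(data, items, sigma=0.692)
```

A finer grid or Gauss-Hermite quadrature would give a more accurate approximation; the equally spaced grid is used here only to keep the example short.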
References
[1] McDowell I, Newell C. Measuring health: a guide to rating scales and questionnaires. Oxford: Oxford University Press, 1996.
[2] Cella D, Chang CH. A discussion of item response theory and its applications in health status assessment. Med Care 2000;38(Suppl):II66–II72.
[3] Fischer GH, Molenaar IW, editors. Rasch models: foundations, recent developments and applications. New York: Springer-Verlag, 1995.
[4] van der Linden W, Hambleton RK, editors. Handbook of modern item response theory. New York: Springer, 1997.
[5] Teresi JA, Kleinman M, Ocepek-Welikson K. Modern psychometric methods for detection of differential item functioning: application to cognitive assessment measures. Stat Med 2000;19:1651–1683.
[6] Ware JE Jr, Sherbourne CD. The MOS 36-item short-form health survey (SF-36). I. Conceptual framework and item selection. Med Care 1992;30:473–483.
[7] Hays RD, Morales LS, Reise SP. Item response theory and health outcomes measurement in the 21st century. Med Care 2000;38(Suppl):II28–II42.
[8] van der Linden WJ, Glas CAW. Computerized adaptive testing. Theory and practice. Dordrecht, the Netherlands: Kluwer Academic Publishers, 2000.
[9] Holman R, Lindeboom R, Vermeulen R, Glas CAW, de Haan RJ. The Amsterdam Linear Disability Score (ALDS) project. The calibration of an item bank to measure functional status using item response theory. Qual Life Newsletter 2001;27:4–5.
[10] Lindeboom R, Vermeulen M, Holman R, de Haan RJ. Activities of daily living instruments in clinical neurology. Optimizing scales for neurologic assessments. Neurology 2003;60:738–742.
[11] Cohen J. Statistical power analysis for the behavioral sciences. Hillsdale, New Jersey: Lawrence Erlbaum Associates, 1988.
[12] Formann AK. Measuring change in latent subgroups using dichotomous data: unconditional, conditional and semiparametric maximum-likelihood-estimation. J Am Stat Assoc 1994;89:1027–1034.
[13] May K, Nicewander WA. Measuring change conventionally and adaptively. Educ Psychol Meas 1998;58:882–897.
[14] Korevaar JC, Merkus MP, Jansen MA, et al. Validation of the KDQOL-SF: a dialysis-targeted health measure. Qual Life Res 2002;11:437–447.
[15] van Manen JG, Korevaar JC, Dekker FW, et al. How to adjust for comorbidity in survival studies in ESRD patients: a comparison of different indices. Am J Kidney Dis 2002;40:82–89.
[16] Ware J Jr, Kosinski M, Keller SD. A 12-Item Short-Form Health Survey: construction of scales and preliminary tests of reliability and validity. Med Care 1996;34:220–233.
[17] QualityMetric Incorporated. The SF-8 Health Survey. http://www.sf36.com/tools/sf8.shtml. Accessed January 2, 2003.
[18] Birnbaum A. Some latent trait models and their use in inferring an examinee's ability. In: Lord FM, Novick MR, editors. Statistical theories of mental test scores. Reading, Massachusetts: Addison-Wesley, 1968.
[19] Thissen D, Steinberg L. A taxonomy of item response models. Psychometrika 1986;51:567–577.
[20] Rasch G. Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research, 1960.
[21] Holman R, Berger MPF. Optimal calibration designs for tests of polytomously scored items described by item response theory models. J Educ Behav Stat 2001;26:361–380.
[22] Muraki E. A generalized partial credit model. In: van der Linden WJ, Hambleton RK, editors. Handbook of modern item response theory. New York: Springer, 1997.
[23] Bock RD, Aitkin M. Marginal maximum likelihood estimation of item parameters: an application of an EM-algorithm. Psychometrika 1981;46:443–459.
[24] Glas CAW, Verhelst ND. Extensions of the partial credit model. Psychometrika 1989;54:635–659.
[25] Glas CAW. Modification indices for the 2-PL and the nominal response model. Psychometrika 1999;64:273–294.
[26] Verhelst ND, Glas CAW, Verstralen HHFM. OPLM computer program and manual. Arnhem, the Netherlands: Centraal Instituut voor Toets Ontwikkeling (CITO), 1994.
[27] Cochran WG. Sampling techniques. New York: Wiley, 1977.
[28] McHorney CA, Ware JE Jr, Rogers W, Raczek AE, Lu JF. The validity and relative precision of MOS short- and long-form health status scales and Dartmouth COOP charts. Results from the Medical Outcomes Study. Med Care 1992;30(Suppl):MS253–MS265.
[29] McHorney CA, Ware JE Jr, Lu JF, Sherbourne CD. The MOS 36-item Short-Form Health Survey (SF-36): III. Tests of data quality, scaling assumptions, and reliability across diverse patient groups. Med Care 1994;32:40–66.
[30] McHorney CA, Ware JE Jr, Raczek AE. The MOS 36-Item Short-Form Health Survey (SF-36): II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care 1993;31:247–263.
[31] Raczek AE, Ware JE, Bjorner JB, et al. Comparison of Rasch and summated rating scales constructed from SF-36 physical functioning items in seven countries: results from the IQOLA Project. International Quality of Life Assessment. J Clin Epidemiol 1998;51:1203–1214.
[32] Aaronson NK, Muller M, Cohen PD, et al. Translation, validation, and norming of the Dutch language version of the SF-36 Health Survey in community and chronic disease populations. J Clin Epidemiol 1998;51:1055–1068.