Exact evaluation of bias in Rasch model residuals
Svend Kreiner Karl Bang Christensen
Research Report 07/2
Department of Biostatistics University of Copenhagen
Svend Kreiner*
Department of Biostatistics, University of Copenhagen
Karl Bang Christensen
National Institute of Occupational Health, Denmark
Abstract
This paper compares conventional Rasch model residuals and mean squared outfit statistics to
residuals and outfit statistics based either on conditional probabilities of item responses given the
total person score, or on exact conditional probabilities given both item margins and person scores.
Exact residuals are asymptotically unbiased, but residuals and outfit statistics based on
unconditional probabilities are biased to a degree that may influence analysis of fit. The bias
appears to depend mainly on the dispersion of item parameters; sample size and targeting of items
have little effect on the bias. Very little bias is found for residuals based on conditional probabilities
of item responses given person scores.
Keywords: Rasch models, residuals, mean squared residuals, outfit statistics, Markov Chain Monte
Carlo methods.
1 Introduction
Let pvi = P(Yvi = 1) be the probability that a person, v, responds positively to an item, i, according to
the Rasch model for dichotomous items. Response residuals, Yvi − p̂vi, compare observed responses
to estimates of the probabilities. Outfit statistics (Wright, 1977 and Smith, 2004) summarize
squared standardized response residuals over persons and/or items. Outfit statistics defined in
this way are useful for item analysis where they are used in different ways in several programs to
test the fit of item responses to Rasch models (Linacre and Wright, 2000, Andrich et al., 2000 and
Wu et al., 1998). The inherent possibilities in statistical analyses of response residuals are, however,
offset by the fact that the results concerning the distributions of these statistics are based on
heuristic arguments that are easily shown to be wrong. The degree to which this is a problem in
practical applications and what to do about it will be addressed in this paper.
* Corresponding author Svend Kreiner, Department of Biostatistics, University of Copenhagen, Øster Farimagsgade 5,B, P.O.B. 2099, DK-1014 Copenhagen K, email [email protected]
Outline of the paper
The Rasch model is briefly described in Section 2. Response residuals and outfit statistics are also
defined in this section. The distribution of residuals depends on the frame of inference (FOI)
adopted for the statistical analysis. Three FOIs for item analysis by Rasch models are defined in
Section 3: the unconditional FOI, the conditional FOI and the exact FOI. It is argued that the exact
FOI is the proper one for analysis of residuals. Markov Chain Monte Carlo methods for analysis in
the exact FOI are also described in Section 3. The bias of unconditional and conditional residuals
and outfit statistics is discussed in Section 4 and investigated in Sections 5 and 6 by means of
analyses of real and simulated data.
2 The Rasch model
Let Y = (Yvi) be the matrix of responses of n subjects to k dichotomous items with responses coded
0 and 1. The vectors of person scores, S = (S1,..,Sn), and item margins, M = (M1,..,Mk), are given by
$$S_v = \sum_i Y_{vi} \quad \text{and} \quad M_i = \sum_v Y_{vi}.$$
A = (Asi) is the (k+1)×k matrix with elements Asi equal to the number
of persons with Sv = s and Yvi = 1. Under the Rasch model (Rasch, 1960) the probabilities
$$P(Y_{vi} = 1) = \varphi(\theta_v, \beta_i) = \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)} \tag{1}$$
of positive responses on items depend on item parameters, β = (β1,…, βk) and person parameters, θ
= (θ1,…, θn).
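The notation above can be made concrete with a small simulation. The following is a minimal sketch, not the authors' software: the function names, the three item parameters and the 20 simulated persons are illustrative assumptions. It draws a response matrix from (1) and computes the person scores S, the item margins M and the table A.

```python
import math
import random

def rasch_prob(theta, beta):
    """Probability (1) of a positive response."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def simulate(thetas, betas, rng):
    """Draw an n x k matrix of 0/1 responses under the Rasch model."""
    return [[1 if rng.random() < rasch_prob(t, b) else 0 for b in betas]
            for t in thetas]

def margins(Y):
    """Person scores S, item margins M and the (k+1) x k table A = (A_si)."""
    n, k = len(Y), len(Y[0])
    S = [sum(row) for row in Y]
    M = [sum(Y[v][i] for v in range(n)) for i in range(k)]
    A = [[0] * k for _ in range(k + 1)]
    for v in range(n):
        for i in range(k):
            A[S[v]][i] += Y[v][i]   # positive responses to item i in score group S_v
    return S, M, A

# Hypothetical parameters, chosen only for illustration.
rng = random.Random(1)
betas = [-1.0, 0.0, 1.0]
thetas = [rng.gauss(0.0, 1.0) for _ in range(20)]
Y = simulate(thetas, betas, rng)
S, M, A = margins(Y)
```

Summing column i of A over all score groups returns the item margin Mi, which is a quick consistency check on the bookkeeping.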
The vectors of person scores and item margins, (S,M), are minimal sufficient statistics for (θ,β).
The conditional probability that the response of person v on item i is positive given that Sv = s is
$$P(Y_{vi} = 1 \mid S_v = s) = \phi_i(s, \beta) = \frac{\exp(-\beta_i)\,\gamma_{s-1}(\beta_1,\ldots,\beta_{i-1},\beta_{i+1},\ldots,\beta_k)}{\gamma_s(\beta)} \tag{2}$$
where γs(β) denotes the elementary symmetric function of order s in exp(−β1),…,exp(−βk). The
conditional probability that person v responds positively to item i given the complete vectors of
person scores and item margins is equal to
$$P(Y_{vi} = 1 \mid S, M) = \pi_{vi}(S, M) \tag{3}$$
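The conditional probability (2) can be evaluated directly once the elementary symmetric functions are available. The sketch below is one simple way to do this, assuming the usual polynomial recursion for the symmetric functions; the function names and the five item parameters are hypothetical and only serve as an illustration.

```python
import math

def esf(eps):
    """Elementary symmetric functions gamma_0..gamma_k of the values in eps."""
    gamma = [1.0]
    for e in eps:
        # Multiply the generating polynomial by (1 + e*x).
        gamma = [(gamma[s] if s < len(gamma) else 0.0)
                 + (e * gamma[s - 1] if s >= 1 else 0.0)
                 for s in range(len(gamma) + 1)]
    return gamma

def conditional_prob(i, s, betas):
    """Conditional probability (2) that item i is answered positively given score s."""
    eps = [math.exp(-b) for b in betas]
    gamma_all = esf(eps)
    gamma_rest = esf(eps[:i] + eps[i + 1:])   # symmetric functions with item i left out
    return eps[i] * gamma_rest[s - 1] / gamma_all[s]

# Hypothetical item parameters, for illustration only.
betas = [-1.5, -0.5, 0.0, 0.5, 1.5]
probs = [conditional_prob(i, 2, betas) for i in range(len(betas))]
```

For a fixed score s the conditional probabilities sum to s over the items, which gives a convenient check of any implementation.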
Since S and M are sufficient statistics, the conditional distribution, P(Y|S,M), does not depend on
any unknown parameters. Rasch (1960, Chapter X) shows that the conditional distribution of the
item response matrix, Y, given S and M is uniform over all matrices of responses with the same
margins,
$$P(Y \mid S, M) = \frac{1}{K(S, M)},$$
where K(S,M) is the number of 0/1 matrices fitting the margins. From this it follows that
$$\pi_{vi}(S, M) = \frac{K_{vi}(S, M)}{K(S, M)} \tag{4}$$
where Kvi(S,M) is the number of matrices fitting the same margins with Yvi = 1.
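For very small tables, the ratio in (4) can be obtained by listing every 0/1 matrix that fits the margins. The following brute-force sketch is only feasible for toy data and is not the method used in the paper; the margins below are hypothetical.

```python
from itertools import product

def matrices_with_margins(S, M):
    """All 0/1 matrices with row sums S and column sums M (brute force)."""
    k = len(M)
    rows_by_score = {s: [r for r in product((0, 1), repeat=k) if sum(r) == s]
                     for s in set(S)}
    found = []
    for rows in product(*(rows_by_score[s] for s in S)):
        if all(sum(r[i] for r in rows) == M[i] for i in range(k)):
            found.append(rows)
    return found

def exact_prob(v, i, S, M):
    """Exact probability (4): share of fitting matrices with a 1 in cell (v, i)."""
    mats = matrices_with_margins(S, M)
    return sum(Y[v][i] for Y in mats) / len(mats)

# Hypothetical toy margins: four persons, three items.
S = [1, 1, 2, 2]
M = [3, 2, 1]
pi = [[exact_prob(v, i, S, M) for i in range(len(M))] for v in range(len(S))]
```

Rows of `pi` sum to the person scores and columns sum to the item margins, and persons with equal scores receive identical probabilities.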
Switching rows v and w in the response matrix will leave the vectors of person scores and item
margins unchanged if Sv = Sw. From this it follows that πvi(S,M) = πwi(S,M). Rather than referring
to the conditional probability of item responses for a specific person, it will often be convenient to
refer to the conditional distribution of the response to item i by persons with a specific total score s,
$$\pi_{i|s}(S, M) = P(Y_{vi} = 1 \mid S_1,\ldots,S_{v-1}, S_v = s, S_{v+1},\ldots,S_n, M) \tag{5}$$
Thus πi|s(S,M) like πvi(S,M) is a conditional probability given the total score and item margin
vectors with the additional condition that the total score of the person is equal to s.
Response residuals
Standardized response residuals are functions of the mean E(Yvi) = pvi and variance VAR(Yvi) =
pvi(1- pvi), and are defined if 0< E(Yvi)<1. We write
$$Z_{vi}(Y_{vi}, p_{vi}) = \frac{Y_{vi} - p_{vi}}{\sqrt{p_{vi}(1 - p_{vi})}} \tag{6}$$
All three probabilities (1) – (3) can be used in (6). Residuals are called unconditional, conditional,
and exact when expected values are based on (1), (2), and (3) respectively:
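$$Z^{(u)}_{vi} = Z_{vi}(Y_{vi}, \varphi(\theta_v, \beta_i)) = \frac{Y_{vi} - \varphi(\theta_v, \beta_i)}{\sqrt{\varphi(\theta_v, \beta_i)\,(1 - \varphi(\theta_v, \beta_i))}} \tag{7}$$
$$Z^{(c)}_{vi} = Z_{vi}(Y_{vi}, \phi_i(S_v, \beta)) = \frac{Y_{vi} - \phi_i(S_v, \beta)}{\sqrt{\phi_i(S_v, \beta)\,(1 - \phi_i(S_v, \beta))}} \tag{8}$$
$$Z^{(e)}_{vi} = Z_{vi}(Y_{vi}, \pi_{vi}(S, M)) = \frac{Y_{vi} - \pi_{vi}(S, M)}{\sqrt{\pi_{vi}(S, M)\,(1 - \pi_{vi}(S, M))}} \tag{9}$$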
Dimitrov and Smith (2006) refer to (8) as adjusted residuals in the context of Person-Fit statistics.
Unconditional and conditional residuals, (7) and (8), are theoretical residuals that are functions of
unknown parameters. Residuals where expected values are calculated using estimated parameters
are called estimated residuals. Conventional statistical analyses of estimated residuals assume that
parameter estimates are consistent; therefore, they use the distribution of the theoretical residuals as
approximations to the distributions of estimated residuals. One of the problems with Rasch model
response residuals is that this assumption is not justified, since some item parameter estimates and
all conventional person parameter estimates are inconsistent.
Outfit statistics
The means of squared residuals (MSR) summarized over persons and/or items, are sometimes
referred to as outfit statistics (Smith, 2004). Item outfit statistics are given by
$$MSR_i(Y, P) = \frac{1}{n}\sum_v Z^2_{vi}(Y_{vi}, p_{vi}) \tag{10}$$
and the total outfit statistic by
$$MSR(Y, P) = \frac{1}{k}\sum_i MSR_i(Y, P) \tag{11}$$
Person outfit statistics defined by summarizing residuals over items can also be considered. These
are, however, not discussed in this paper.
Zvi(pvi) is a conventional standardized residual, but the squared residual has somewhat atypical
properties, because it relates to a Bernoulli variable. Assume that outcomes on Yvi are coded 0 and
1. From this it follows that $Y_{vi}^2 = Y_{vi}$ and therefore that $Z^2_{vi}(Y_{vi}, p_{vi})$ is a linear function of Yvi,
$$Z^2_{vi}(Y_{vi}, p_{vi}) = \frac{Y_{vi} - 2p_{vi}Y_{vi} + p_{vi}^2}{p_{vi}(1 - p_{vi})} = \frac{p_{vi}}{1 - p_{vi}} + \frac{1 - 2p_{vi}}{p_{vi}(1 - p_{vi})}\,Y_{vi} \tag{12}$$
It follows from (12) that
$$E(Z^2_{vi}(Y_{vi}, p_{vi})) = 1, \qquad \mathrm{VAR}(Z^2_{vi}(Y_{vi}, p_{vi})) = \frac{(1 - 2p_{vi})^2}{p_{vi}(1 - p_{vi})},$$
that $Z^2_{vi}(Y_{vi}, 0.5) = 1$, and that $\mathrm{VAR}(Z^2_{vi}(Y_{vi}, 0.5)) = 0$. The variance is very large when
E(Yvi) is close to 0 or 1.
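These properties of the squared residual are easy to check numerically. The sketch below (function names are illustrative) compares the squared residual with the linear form in (12) and computes its mean and variance directly from the two possible outcomes of a Bernoulli variable.

```python
def z2(y, p):
    """Squared standardized residual for outcome y and probability p."""
    return (y - p) ** 2 / (p * (1 - p))

def z2_linear(y, p):
    """Right-hand side of (12): intercept plus slope times y."""
    return p / (1 - p) + (1 - 2 * p) / (p * (1 - p)) * y

def z2_moments(p):
    """Mean and variance of the squared residual when Y is Bernoulli(p)."""
    mean = p * z2(1, p) + (1 - p) * z2(0, p)
    var = p * (z2(1, p) - mean) ** 2 + (1 - p) * (z2(0, p) - mean) ** 2
    return mean, var

for p in (0.05, 0.3, 0.5, 0.95):
    mean, var = z2_moments(p)
    closed_form = (1 - 2 * p) ** 2 / (p * (1 - p))
    print(p, round(mean, 6), round(var, 4), round(closed_form, 4))
```

The printed variance grows quickly as p moves towards 0 or 1, which is the reason outfit statistics are sensitive to unexpected responses in extreme cells.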
Estimated outfit statistics
In practice, outfit statistics are always estimated, summarizing estimated rather than theoretical
residuals. There is a one-to-one relationship between all conventional person parameter estimates
and the sufficient person scores. Estimates of the unconditional probabilities (1) therefore share the
property that all persons with the same score have the same probability, written pi|s, with the
conditional and exact response probabilities. For this reason, we may regard estimated outfit
statistics as functions of data summarized in the matrix A = (Asi), where Asi is the number of
persons with positive response on item i and total score s, and the matrix P(c) = (pi|s) of conditional
probabilities. Expected item responses in conditional and exact residuals are the same for all
persons with score s and (10) can thus be written
$$MSR_i(A, \hat{P}^{(c)}) = \frac{1}{n}\sum_{s=1}^{k-1}\left(n_s\,\frac{\hat{p}_{i|s}}{1 - \hat{p}_{i|s}} + \frac{1 - 2\hat{p}_{i|s}}{\hat{p}_{i|s}(1 - \hat{p}_{i|s})}\,A_{si}\right) \tag{13}$$
where ns is the number of persons with total score s.
Persons with extreme scores are not included, since standardized residuals are not defined for these
persons. From (13) it follows not only that the expected value $E(MSR_i(A, \hat{P}^{(c)}) \mid S, M) = 1$ when
$\hat{p}_{i|s}$ is a consistent estimate, but also that $MSR_i(A, \hat{P}^{(c)}) = 1$ when there is a perfect model fit in the
sense that $A_{si} = n_s\hat{p}_{i|s}$. Depending on the choice of residual (7), (8), or (9), we define unconditional,
conditional and exact outfit statistics
$$MSR^{(u)}_i = MSR_i(A, \hat{\varphi}) \tag{14}$$
$$MSR^{(c)}_i = MSR_i(A, \hat{\phi}) \tag{15}$$
$$MSR^{(e)}_i = MSR_i(A, \hat{\pi}) \tag{16}$$
where ϕ̂ and π̂ are the (k-1)×k matrices with estimates of the probabilities given by (2) and (5),
and φ̂ is the (k-1)×k matrix with elements
$$\hat{\varphi}_{i|s} = \frac{\exp(\hat{\theta}(s) - \hat{\beta}_i)}{1 + \exp(\hat{\theta}(s) - \hat{\beta}_i)},$$
where θ̂(s) is the estimate of
the parameter for persons with a total score equal to s.
Finally, it must be noted that (13) applies to unconditional residuals, if they assign infinite person
parameter estimates to persons with extreme scores. If finite person parameter estimates are
assigned to extreme scores, we have to exclude the extreme score groups from (13) in order to avoid
systematic bias. This problem does not occur for conditional and exact outfit statistics.
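The equivalence between the person-level definition (10) and the score-group form (13) can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the response matrix and the probabilities for item 0 are hypothetical, and persons with extreme scores are dropped in both versions.

```python
def squared_residual(y, p):
    return (y - p) ** 2 / (p * (1 - p))

def msr_from_persons(Y, i, p_by_score):
    """Item outfit as in (10), averaged over persons with non-extreme scores."""
    k = len(Y[0])
    z2 = [squared_residual(row[i], p_by_score[sum(row)])
          for row in Y if 0 < sum(row) < k]
    return sum(z2) / len(z2)

def msr_from_table(n_s, A_i, p_by_score):
    """Item outfit as in (13), using only score-group counts.
    n_s[s]: persons with score s; A_i[s]: positive responses to item i in group s."""
    n = sum(n_s.values())
    total = 0.0
    for s, ns in n_s.items():
        p = p_by_score[s]
        total += ns * p / (1 - p) + (1 - 2 * p) / (p * (1 - p)) * A_i[s]
    return total / n

# Hypothetical data: seven persons, three items; the last two rows are extreme scores.
Y = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 1]]
p_by_score = {1: 0.2, 2: 0.7}          # assumed probabilities for item 0
n_s = {s: sum(1 for row in Y if sum(row) == s) for s in (1, 2)}
A_0 = {s: sum(row[0] for row in Y if sum(row) == s) for s in (1, 2)}
```

Both functions return the same value for these data, and `msr_from_table` returns 1 whenever the observed counts equal ns times the probability in every score group.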
3 Frames of inference for analysis of response residuals
The three probabilities (1) – (3) define different frames of inference with different probability
estimates and different outfit distributions.
The unconditional frame of inference
The unconditional FOI is given by the unconditional distribution P(Y;θ,β). The probabilities in this
inference frame are given by (1). Maximum likelihood estimates in this framework are referred to
as joint parameter estimates. Conventional wisdom has it that standardized unconditional residuals
“are distributed as approximate unit normals”, that summarizing squared standardized residuals
over items and/or persons “results in a chi-square” (Smith, 2004), and that a Wilson-Hilferty cube
root transformation,
$$t = \left(\text{outfit}^{1/3} - 1\right)\frac{3}{\sqrt{\mathrm{VAR}(\text{outfit})}} + \frac{\sqrt{\mathrm{VAR}(\text{outfit})}}{3} \tag{17}$$
(Wilson & Hilferty, 1931) produces a transformed statistic with a distribution that is close to a
standardized normal distribution (Smith, 1991).
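For reference, the cube root transformation can be written as a one-line function. The sketch below assumes the form of (17) that is commonly quoted for mean-square fit statistics, with the square root of the outfit variance as the scaling constant; the function name is hypothetical.

```python
import math

def wilson_hilferty(outfit, var_outfit):
    """Cube root transformation (17) of a mean-square outfit statistic.
    Assumes var_outfit is the (model-based) variance of the outfit statistic."""
    sd = math.sqrt(var_outfit)
    return (outfit ** (1.0 / 3.0) - 1.0) * (3.0 / sd) + sd / 3.0

# An outfit of exactly 1 is mapped to sd/3 rather than 0.
t_at_one = wilson_hilferty(1.0, 0.09)
```

Note that the transformation only rescales the outfit statistic monotonically, so it cannot by itself remove the discreteness or bias discussed in this section.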
Standardizing response residuals does not, however, change the fact that response residuals are
dichotomous variables. All of the above statements are therefore obviously erroneous, and whether
the distribution of the Wilson-Hilferty transformed outfit statistic may be approximated by the
standardized normal distribution is open for discussion.
The situation becomes considerably more complicated when unconditional outfit statistics are calculated with
estimated rather than known parameters, because estimates from the unconditional FOI are
asymptotically biased (Neyman & Scott, 1948; Andersen, 1980, p.244). In addition to the bias
imposed by the joint estimates, we also notice that the standard error of person parameter estimates
cannot approach zero as n → ∞ if the number of items is fixed. In order to see this, one must notice
that there is one person parameter estimate, θ̂s, for every possible score, such that the probabilities of the
distribution of the person parameter estimate given the true person and item parameters are equal to
the probabilities of the total score,
$$P(\hat{\theta}_s; \theta_v, \beta) = P\Big(\sum_i Y_{vi} = s; \theta_v, \beta\Big).$$
These probabilities do not
depend on the number of persons in the sample. The standard errors of the person parameter
estimates and the standard errors of the estimates of the unconditional probabilities will therefore
never approach zero. Standardized response residuals based on these estimates may therefore both
be biased and have standard deviations larger than 1.
The conditional frame of inference
The conditional FOI is defined by the conditional distribution of the item responses given the
person scores, P(Y|S;β), with response probabilities given by (2). Formula (12) shows that squared
residuals, and hence outfit statistics, are linear functions of the item responses. The central limit
theorem therefore suggests that the distribution
of the conditional item outfit may be approximated by a normal distribution with mean equal to 1
and variance equal to
$$\mathrm{VAR}(MSR_i(Y, \phi)) = \frac{1}{n^2}\sum_{s=1}^{k-1} n_s\,\frac{(1 - 2\phi_i(s, \beta))^2}{\phi_i(s, \beta)\,(1 - \phi_i(s, \beta))} \tag{18}$$
since all persons within a specific score group have the same conditional probability of a positive
response. In the conditional framework, we therefore evaluate the significance of the item outfit by
comparison of
$$t_i = \frac{n\,(MSR_i(Y, \phi) - 1)}{\sqrt{\sum_{s=1}^{k-1} n_s\,\dfrac{(1 - 2\phi_i(s, \beta))^2}{\phi_i(s, \beta)\,(1 - \phi_i(s, \beta))}}} \tag{19}$$
to a standardized normal distribution.
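The variance (18) and the standardized statistic (19) only require the score distribution and the conditional probabilities of the item. The following sketch is an illustration under stated assumptions: the function name, the score-group sizes and the probabilities are hypothetical, and the upper-tail normal p-value is one possible way to report the comparison.

```python
import math

def conditional_outfit_test(msr_i, n_s, phi_i):
    """Standardized conditional outfit (19) with an upper-tail normal p-value.
    n_s[s]: persons with score s; phi_i[s]: conditional probability (2) for the item."""
    n = sum(n_s.values())
    spread = sum(ns * (1 - 2 * phi_i[s]) ** 2 / (phi_i[s] * (1 - phi_i[s]))
                 for s, ns in n_s.items())
    variance = spread / n ** 2                      # formula (18)
    t = (msr_i - 1.0) / math.sqrt(variance)         # equals n * (MSR - 1) / sqrt(spread)
    p_upper = 0.5 * math.erfc(t / math.sqrt(2.0))   # P(N(0,1) >= t)
    return t, variance, p_upper

# Hypothetical score groups and conditional probabilities for one item.
n_s = {1: 10, 2: 20, 3: 10}
phi_i = {1: 0.2, 2: 0.5, 3: 0.8}
t, variance, p_upper = conditional_outfit_test(1.3, n_s, phi_i)
```

Score groups with a conditional probability of 0.5 contribute nothing to the variance, in line with the remark following (12).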
Analysis of response residuals in the conditional FOI requires item parameter estimates. The
maximum likelihood item parameter estimates in this framework are the conditional maximum
likelihood (CML) estimates of Andersen (1970). These are known to be both consistent and
asymptotically unbiased. Probability estimates using CML item parameter estimates instead of the
unknown item parameters in (2) will therefore also be consistent and asymptotically unbiased.
The exact frame of inference
The exact FOI is given by the conditional distribution, P(Y|S,M). There are no unknown parameters
in this framework, where residuals are calculated relative to (3) and where significance is always
assessed relative to the exact distribution of fit statistics. The Markov Chain Monte Carlo procedure
first described by Besag and Clifford (1989) and later extended to cover item response matrices
with missing responses by Holst (1994) provides consistent estimates of both response probabilities
and p-values of fit statistics.
To estimate the probability πi|s(S,M) that a person with a total score equal to s answers positively on
an item in the exact conditional frame of inference, we generate NSIM random item response
matrices as described above and calculate the numbers of positive responses on items in different
score groups, $A^{(j)} = (A^{(j)}_{si})_{s=0,..,k;\ i=1,..,k}$ for j=1,..,NSIM. Given these tables, unbiased estimates of
πi|s(S,M) are given by
$$\hat{\pi}_{i|s} = \frac{1}{NSIM \cdot n_s}\sum_{j=1}^{NSIM} A^{(j)}_{si} \tag{17}$$
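The estimator above can be sketched with a simple Markov chain on 0/1 matrices with fixed margins. The code below uses checkerboard swaps of 2×2 submatrices, a common way to move between such matrices; it is an assumption that this resembles the cited procedure, and the function names, the toy data and the tuning values (`nsim`, `thin`, `burn_in`) are hypothetical.

```python
import random

def swap_step(Y, rng):
    """Try one checkerboard swap; row sums and column sums never change."""
    n, k = len(Y), len(Y[0])
    v, w = rng.sample(range(n), 2)
    i, j = rng.sample(range(k), 2)
    # Swappable only if the 2x2 submatrix is [[a, b], [b, a]] with a != b.
    if Y[v][i] == Y[w][j] and Y[v][j] == Y[w][i] and Y[v][i] != Y[v][j]:
        Y[v][i], Y[v][j] = Y[v][j], Y[v][i]
        Y[w][i], Y[w][j] = Y[w][j], Y[w][i]

def estimate_exact_probs(Y_obs, nsim, thin, burn_in, seed=0):
    """Average A_si^(j) / n_s over nsim thinned matrices, as in the estimator above."""
    rng = random.Random(seed)
    Y = [row[:] for row in Y_obs]
    n, k = len(Y), len(Y[0])
    scores = [sum(row) for row in Y]               # unchanged by the swaps
    n_s = {s: scores.count(s) for s in set(scores)}
    totals = {s: [0] * k for s in n_s}
    for j in range(burn_in + nsim):
        for _ in range(thin):
            swap_step(Y, rng)
        if j >= burn_in:                           # discard the first burn_in matrices
            for v in range(n):
                for i in range(k):
                    totals[scores[v]][i] += Y[v][i]
    return {s: [totals[s][i] / (nsim * n_s[s]) for i in range(k)] for s in n_s}

# Hypothetical observed responses: six persons, four items.
Y_obs = [[1, 0, 0, 0], [1, 1, 0, 0], [0, 1, 0, 0],
         [1, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0]]
est = estimate_exact_probs(Y_obs, nsim=200, thin=20, burn_in=10, seed=1)
```

Because every sampled matrix has the observed margins, the estimates for score group s sum to s over items, whatever the length of the chain.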
These estimates are used in (9) and (16) to calculate exact residuals and outfit statistics. The
evaluation of the significance of the exact outfit then requires a second MCMC sample similar to
the one used to estimate response probabilities. Let MSR be the observed outfit statistic, while
(MSR1,..,MSRNSIM) is the MCMC sample of outfit statistics. Li = 1 if MSRi ≤ MSR and 0 otherwise.
Hi = 1 if MSRi ≥ MSR and 0 otherwise. The one-sided p-values are then
estimated by
$$\hat{p}_{low} = \frac{1}{NSIM}\sum_{i=1}^{NSIM} L_i \tag{18}$$
and
$$\hat{p}_{high} = \frac{1}{NSIM}\sum_{i=1}^{NSIM} H_i \tag{19}$$
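The two one-sided estimates are simple proportions of the sampled outfit statistics. A minimal sketch, with a hypothetical function name and made-up sample values:

```python
def mcmc_p_values(observed, sampled):
    """One-sided p-value estimates from a sample of outfit statistics.
    p_low counts sampled values <= observed, p_high counts values >= observed."""
    nsim = len(sampled)
    p_low = sum(1 for m in sampled if m <= observed) / nsim
    p_high = sum(1 for m in sampled if m >= observed) / nsim
    return p_low, p_high

# Made-up sample of four outfit values, for illustration only.
p_low, p_high = mcmc_p_values(1.0, [0.5, 1.0, 1.5, 2.0])
```

Ties with the observed value are counted in both tails, so the two estimates can add up to more than 1.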
The precision of MCMC estimates depends on NSIM and on the degree of dependence between the
sampled item response matrices. For each of the analyses presented in this paper 5,100,000,000
item response matrices were generated, but only every 100,000th of these were actually sampled.
This reduced the correlation between consecutive values of the outfit statistic (all correlations were
between −0.02 and +0.02). The MCMC samples thus function as conventional Monte Carlo samples. In order
to reduce dependence between the sampled matrices and the observed item responses, the first 100 sampled
matrices were discarded. Each MCMC sample thus consisted of 5000 random item response
matrices.
4 Exact evaluation of bias
The frame of inference of estimated residuals and outfit statistics
It follows from the one-to-one relationship between the sufficient margins and the person and item
parameter estimates that estimates of unconditional and conditional response probabilities are
functions of (S,M). The unconditional and conditional inference frames, P(Y; θ̂, β̂) and P(Y|S; β̂),
with estimates instead of known parameters are consequently equal to the exact FOI. The
exact FOI is therefore the proper frame of inference for an analysis of the bias of estimated
residuals and outfit statistics. Discussing the use of residuals for tests of fit of the Rasch model,
Smith (2004) appears to take the same stance arguing that the “issue of fit … can be reduced to a
single question. How well do the marginals, translated through the estimated item and person
parameters, reproduce the actual data?”.
Standardized residuals should have expectation equal to zero. The bias of unconditional and
conditional residuals is $E(Z_{vi}(\hat{\varphi}_{vi}) \mid S, M)$ and $E(Z_{vi}(\hat{\phi}_{vi}) \mid S, M)$ respectively, both of which may be
estimated by the MCMC procedure discussed above. The bias of the unconditional and conditional
outfit statistics is evaluated by comparison to the expected value of 1 for the exact outfit statistics.
The outfit statistics are linear functions of the item-by-score matrix (13). The bias of the
unconditional and conditional outfit statistics is therefore equal to $MSR_i(E, \hat{\varphi}) - 1$ and
$MSR_i(E, \hat{\phi}) - 1$, where E = (esi) is the matrix of expected numbers of positive item responses in
different score groups in the exact FOI, esi = E(Asi|S,M). These may also be estimated by the
MCMC procedure.
In the following, two examples are used to illustrate the analysis of residuals by MCMC methods:
(i) a well-known small-sample example with 35 persons and 18 items, and (ii) simulated data
generated to get a better idea about the behaviour of residuals and outfit statistics under different
conditions.
Both examples use conditional maximum likelihood estimates of item parameters and maximum
likelihood estimates of person parameters. Unconditional and conditional outfit test statistics are
then compared to MCMC estimates of expected outfit statistics in the exact frame of inference.
5 Exact evaluation of bias for Knox Cube test data
Analysis of data from the Knox Cube Test is described in Wright and Stone (1979). The data are also
included as one of the standard examples for the WINSTEPS program (Linacre and Wright, 2000).
A total of 35 students responded to 18 items. All students gave the same response to items 1, 2, 3,
and 18. These items are not included in the analysis. In what follows, data from the 34 persons with
non-trivial scores on the remaining 14 items are considered.
Table 1 shows the score distribution and the item margins together with estimates of person and
item parameters. The interested reader may compare these estimates to the joint estimates presented
in Wright and Stone (1979).
Fit of items to the Rasch model is usually examined in two different ways. First, observed
frequencies of positive responses to items in score groups are plotted against person parameter
estimates corresponding to score groups together with plots of ICC curves. When observed
frequencies depart in a systematic way from the ICC curves it is taken as evidence against the
validity of the item. Second, item fit statistics are calculated. The evidence provided by item outfit
statistics is either assessed in terms of significance or relative to predetermined fixed limits.
WINSTEPS (Linacre and Wright, 2000), for instance, flags items if outfit statistics are larger than
1.5 or smaller than 0.6. Comparison of observed frequencies to estimated ICC curves only makes
sense if probability estimates are unbiased. Flagging items when outfit statistics lie below or above
specific limits also requires that outfit statistics are unbiased. Bias may therefore be a problem for
both procedures. Assessment of significance of an outfit statistic requires a good approximation to
the exact distribution of the statistic under the Rasch model. The purpose of this section is to
examine the degree to which these requirements are met for the Knox Cube Test data. We only
present detailed results pertaining to Item 11. The results for this item correspond, however, closely
to the results for all other items in this example.
Table 2 shows the estimates of unconditional, conditional and exact probabilities of responses to
Item 11. The unconditional estimates correspond to points on the ICC curve for θ̂(2),...,θ̂(11). The
ICC curve for Item 11 together with the estimates of the exact probabilities is shown in Figure 1.
Table 1. Score distribution, maximum likelihood estimates of person parameters, item margins and conditional maximum likelihood estimates of item parameters

Score   Score distribution   Person parameter      Item   Item margin   Item parameter
s       ns                   θ̂(s)                  i      Mi            β̂i
2        1                   -3.909                 4     32            -3.778
3        2                   -3.215                 5     31            -3.365
4        2                   -2.574                 6     30            -2.956
5        2                   -1.900                 7     31            -3.365
6        3                   -1.124                 8     27            -2.004
7       12                   -0.202                 9     30            -2.956
8        5                    0.770                10     24            -1.281
9        4                    1.665                11     12             0.638
10       1                    2.496                12      6             1.913
11       2                    3.304                13      7             1.658
                                                   14      3             2.946
Mean     6.7                                       15      1             4.216
s.d.     2.4                                       16      1             4.216
                                                   17      1             4.216
Table 2. Estimated probabilities and observed frequencies of positive responses on Item 11 in different score groups. The CML estimate of the item parameter is equal to β̂11 = 0.638

Column definitions:
Unconditional probability: exp(θ̂(s) − β̂11) / (1 + exp(θ̂(s) − β̂11))
Conditional probability: exp(−β̂11) γs−1(β̂1,..,β̂10, β̂12,..,β̂14) / γs(β̂)
Exact probability: π̂11|s

Score s   Unconditional probability   Conditional probability   Exact probability
2         0.0105                      0.0084                    0.0112
3         0.0208                      0.0161                    0.0177
4         0.0387                      0.0287                    0.0314
5         0.0732                      0.0515                    0.0533
6         0.1466                      0.0990                    0.1019
7         0.3016                      0.2164                    0.2148
8         0.5330                      0.5864                    0.5901
9         0.7364                      0.7912                    0.7913
10        0.8650                      0.9013                    0.8960
11        0.9350                      0.9541                    0.9448
Figure 1. The item characteristic curve of Item 11 together with the exact probabilities, π11|s, plotted against the person parameter estimates in each score group.
[Figure: horizontal axis, person parameter estimate, from -6.00 to 6.00; vertical axis, probability, from 0.0 to 1.0.]
In each score group, the observed frequency of positive responses should be compared to estimates
of the conditional probabilities πi|s, rather than to unconditional probabilities, because the estimated
person parameters are the same for all persons with the same score. In this case, the difference
between unconditional and exact probabilities results in evidence of too high item discrimination
when frequencies of correct response in each score group are compared to the unconditional ICC
curve.
The difference between unconditional and exact probabilities will also influence standardized
residuals and outfit statistics. Table 3 shows the bias of the unconditional and conditional residuals
for Item 11. Unconditional residuals are more biased than conditional residuals. The bias in this
example is particularly pronounced in score groups with large numbers of cases.
Table 3. Expected unconditional and conditional standardized response residuals for Item 11.

Column definitions:
Unconditional standardized residuals: E(Z(u)v,11 − Z(e)v,11 | S,M)
Conditional standardized residuals: E(Z(c)v,11 − Z(e)v,11 | S,M)

Score Sv = s   Unconditional standardized residuals   Conditional standardized residuals
2               0.007                                  0.031
3              -0.022                                  0.013
4              -0.038                                  0.016
5              -0.077                                  0.008
6              -0.126                                  0.010
7              -0.189                                 -0.004
8               0.115                                  0.008
9               0.124                                  0.000
10              0.091                                 -0.018
11              0.040                                 -0.044
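The entries of Table 3 can be approximated from the probabilities in Table 2. The sketch below assumes that the expected exact residual is zero, so that the expected residual reduces to the difference between the exact and the used probability divided by the standard deviation implied by the used probability; the function name is hypothetical and the inputs are the rounded values printed in Table 2.

```python
import math

# Item 11 probabilities copied from Table 2 (scores 2..11).
scores = list(range(2, 12))
uncond = [0.0105, 0.0208, 0.0387, 0.0732, 0.1466, 0.3016, 0.5330, 0.7364, 0.8650, 0.9350]
cond = [0.0084, 0.0161, 0.0287, 0.0515, 0.0990, 0.2164, 0.5864, 0.7912, 0.9013, 0.9541]
exact = [0.0112, 0.0177, 0.0314, 0.0533, 0.1019, 0.2148, 0.5901, 0.7913, 0.8960, 0.9448]

def expected_residual(p_used, p_exact):
    """E(Z | S, M) when the residual uses p_used but E(Y | S, M) equals p_exact."""
    return (p_exact - p_used) / math.sqrt(p_used * (1 - p_used))

bias_u = [expected_residual(p, pi) for p, pi in zip(uncond, exact)]
bias_c = [expected_residual(p, pi) for p, pi in zip(cond, exact)]
for s, bu, bc in zip(scores, bias_u, bias_c):
    print(s, round(bu, 3), round(bc, 3))
```

Small differences from the tabulated values are to be expected, because the inputs are rounded to four decimals.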
Expected values of unconditional and conditional outfit statistics are shown in Table 4 for all items
(exact outfit statistics always have expectation equal to 1). Unconditional outfit statistics are
systematically biased and diminish evidence of ill-fitting items, whereas the bias of the conditional
outfit statistics is unsystematic and less pronounced. The significance of the departure of outfit
statistics from what one should expect under the Rasch model may be assessed by MCMC estimates
of exact conditional p-values. If these methods are unavailable, one will have to rely on
approximations by chi squared or normal distributions based on tentative and less than convincing
arguments.
Table 4. Expected values of the item outfit statistics.

Item i   Expected unconditional outfit   Expected conditional outfit
         E(MSR(u)i | S,M)                E(MSR(c)i | S,M)
4        0.686                           0.990
5        0.715                           1.008
6        0.724                           0.998
7        0.704                           0.987
8        0.779                           1.025
9        0.732                           1.014
10       0.822                           1.029
11       0.856                           1.036
12       0.893                           1.040
13       0.793                           1.023
14       0.699                           0.984
15       0.696                           1.026
16       0.709                           1.052
17       0.705                           1.044
Total    0.744                           1.018
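The Item 11 row of Table 4 can be approximated from Tables 1 and 2 by inserting the expected counts, esi = ns·π̂11|s, in place of the observed counts in (13). This is a minimal sketch using the rounded tabulated values; the function name is hypothetical.

```python
# Score-group sizes from Table 1 and Item 11 probabilities from Table 2 (scores 2..11).
n_s = [1, 2, 2, 2, 3, 12, 5, 4, 1, 2]
uncond = [0.0105, 0.0208, 0.0387, 0.0732, 0.1466, 0.3016, 0.5330, 0.7364, 0.8650, 0.9350]
cond = [0.0084, 0.0161, 0.0287, 0.0515, 0.0990, 0.2164, 0.5864, 0.7912, 0.9013, 0.9541]
exact = [0.0112, 0.0177, 0.0314, 0.0533, 0.1019, 0.2148, 0.5901, 0.7913, 0.8960, 0.9448]

def expected_outfit(n_s, p_used, p_exact):
    """Formula (13) evaluated at the expected counts e_si = n_s * p_exact."""
    n = sum(n_s)
    total = 0.0
    for ns, p, pi in zip(n_s, p_used, p_exact):
        total += ns * p / (1 - p) + (1 - 2 * p) / (p * (1 - p)) * ns * pi
    return total / n

expected_u = expected_outfit(n_s, uncond, exact)   # compare with 0.856 in Table 4
expected_c = expected_outfit(n_s, cond, exact)     # compare with 1.036 in Table 4
print(round(expected_u, 3), round(expected_c, 3))
```

Using the exact probabilities for both arguments returns 1, which is the reference value against which the bias is measured.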
Figure 2 shows the distribution of the Wilson-Hilferty transformed unconditional item outfit for
Item 11. Being skewed with expected value equal to -0.25 and standard deviation equal to 0.80, the
distribution is far from a standardized normal distribution. The 95 % confidence region is equal to
[-1.30, 1.82].
Figure 2. Distribution of the Wilson-Hilferty transformed unconditional outfit statistics for Item 11
[Figure: histogram; horizontal axis, transformed outfit, from -2.00 to 5.00; vertical axis, frequency, from 0 to 700.]
Table 5 shows observed conditional and unconditional outfit statistics for all items together with
MCMC estimates of 95 % confidence regions. Eight unconditional and seven conditional outfit
values are below 0.6, while one unconditional and two conditional outfit statistics are larger than
1.5. The outfit statistics are, however, only marginally significant for three items.
Table 5. Outfit statistics with exact 95% confidence regions. The last column in each block gives the 95 % confidence region of the W-H transformed outfit statistics.

        Unconditional outfit                             Conditional outfit
Item    Observed   95% conf.      95% conf. region,      Observed   95% conf.      95% conf. region,
        value      region         W-H transformed        value      region         W-H transformed
4       0.308      0.14 – 3.21     0.04 – 1.38           0.351      0.14 – 5.27     0.34 – 1.72
5       0.460      0.15 – 2.75    -0.30 – 1.28           0.540      0.14 – 4.46    -0.02 – 1.63
6       0.758      0.18 – 3.24    -0.53 – 1.52           1.078      0.16 – 4.84    -0.27 – 1.80
7       1.523      0.15 – 2.67    -0.30 – 1.25           2.490      0.14 – 4.36    -0.02 – 1.61
8       0.395      0.28 – 1.94    -1.00 – 1.18           0.470      0.25 – 2.83    -0.75 – 1.56
9       0.220      0.18 – 3.24    -0.53 – 1.52           0.206      0.16 – 4.84    -0.27 – 1.80
10      0.766      0.38 – 2.05    -1.28 – 1.62           0.914      0.36 – 2.89    -1.00 – 2.04
11      0.736      0.42 – 2.11    -1.30 – 1.82           0.897      0.37 – 2.69    -1.17 – 2.13
12      0.840      0.26 – 3.13    -0.80 – 1.76           1.205      0.23 – 4.24    -0.65 – 2.01
13      0.385      0.32 – 2.56    -0.82 – 1.59           0.382      0.30 – 3.41    -0.67 – 1.85
14      1.049      0.15 – 2.12    -0.30 – 1.07           1.636      0.14 – 3.32    -0.11 – 1.40
15      0.109      0.11 – 2.49     0.54 – 1.38           0.119      0.12 – 4.41     0.81 – 1.74
16      0.109      0.11 – 2.49     0.54 – 1.38           0.119      0.12 – 4.41     0.81 – 1.74
17      0.109      0.11 – 2.49     0.54 – 1.39           0.119      0.12 – 4.41     0.81 – 1.74
6 Exact evaluation of bias for simulated data
There are at least three reasons why unconditional and conditional outfit may be biased: (i) the
probabilities (1) and (2) are poor approximations of exact probabilities, (ii) person parameter
estimates are inconsistent, and (iii) person and item parameter estimates are subject to error in
small sample studies. This section presents the results of a simulation study with sample sizes 50,
100, 200, 500, and 1000. In the largest of these, the error of the item parameter estimates should be
considerably reduced. We simulated responses to 21 items from a Rasch model where item
parameter values were equidistant ranging from -2.5 to +2.5 and person parameters came from
a standard normal distribution. In order to investigate the effect of scale length and of the targeting of
items to the population, analyses were performed for nine subsets of items (cf. Table 6). The last
subset was added after analysis of the first eight subsets in order to best illustrate the conclusions.
Table 6 shows the expected value of total unconditional outfit statistics, MSR(u), for all sample sizes
and of the total conditional outfit statistics, MSR(c), for n = 50 and 1000. Conditional outfit statistics
are virtually unbiased, while unconditional outfit statistics are systematically biased with
E(MSR(u)|S,M) < 1. The only factor that appears to have a profound effect on the bias of the
unconditional outfit is the dispersion of the items. Item set 4, and in particular item set 8, are item
sets with highly dispersed items and a strong degree of bias. The analysis of item set 9 (added after
analysis of the first eight sets) underscores this finding. Compared to the effect of item dispersion,
the number of items has less of an effect. Targeting of items and sample size have virtually no effect
at all.
Item outfit statistics for selected items from Item Set 1 are shown in Table 7. The bias of the
unconditional outfit statistics is easily seen, as is the fact that conditional outfit statistics provide
close approximations of exact outfit statistics. Despite the bias of the unconditional outfit statistics,
there is virtually no difference between conditional and unconditional outfit statistics if exact
p-values are used. Finally, the asymptotic p-values of the conditional outfit statistics, based on the
approximate normal distribution of the standardized outfit defined for the conditional frame of
inference in Section 3, appear to be reasonably close to the exact p-values when sample sizes are
moderate. In small sample studies with n = 50 and 100, the asymptotic p-values are conservative.
Table 6. Item parameters and item sets together with MCMC estimates of expected total outfit statistics

Item no.   Item parameter   Item sets 1–9 containing the item (+ marks; original column positions not preserved)
1          -2.50            + + + + +
2          -2.25            + + + +
3          -2.00            + + + + + + +
4          -1.75            + + +
5          -1.50            + + + + +
6          -1.25            + + +
7          -1.00            + + + + + +
8          -0.75            + + +
9          -0.50            + + + + + +
10         -0.25            + + + +
11          0.00            + + + + + + +
12          0.25            + + +
13          0.50            + + + +
14          0.75            + +
15          1.00            + + + +
16          1.25            + +
17          1.50            + +
18          1.75            +
19          2.00            + + + +
20          2.25            + +
21          2.50            + + +

Expected outfit statistics

Outfit      n      Item    Item    Item    Item    Item    Item    Item    Item    Item
statistic          set 1   set 2   set 3   set 4   set 5   set 6   set 7   set 8   set 9
MSR(u)      50     0.945   0.968   0.958   0.863   0.934   0.973   0.967   0.757   0.599
            100    0.946   0.965   0.952   0.881   0.953   0.946   0.967   0.776   0.659
            200    0.948   0.959   0.965   0.889   0.964   0.935   0.976   0.804   0.614
            500    0.949   0.963   0.965   0.891   0.971   0.933   0.979   0.784   0.613
            1000   0.947   0.962   0.965   0.889   0.977   0.929   0.978   0.784   0.602
MSR(c)      50     1.004   1.001   1.002   1.002   1.000   0.999   1.000   1.003   0.999
            1000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   0.999
Table 7. Outfit statistics (note 1) and 1-sided p-values (note 2) for items 1, 6 and 11 in Item set 1.

               Exact      Conditional                          Unconditional
n      Item    MSR(e)i    MSR(c)i   pexact   pasymptotic       MSR(u)i   pexact
50     1       1.047      1.021     0.344    0.492             0.873     0.344
       6       1.056      1.054     0.348    0.432             0.956     0.346
       11      1.109      1.115     0.261    0.299             1.075     0.261
100    1       1.258      1.239     0.233    0.380             1.094     0.233
       6       0.842      0.841     0.184    0.220             0.815     0.188
       11      0.988      1.003     0.471    0.488             0.967     0.483
200    1       0.790      0.791     0.320    0.335             0.706     0.319
       6       0.919      0.919     0.277    0.289             0.891     0.289
       11      0.964      0.964     0.411    0.376             0.942     0.403
500    1       1.536      1.511     0.031    0.030             1.355     0.031
       6       0.987      0.986     0.477    0.448             0.947     0.488
       11      0.927      0.927     0.139    0.147             0.909     0.139
1000   1       1.102      1.097     0.211    0.285             0.993     0.213
       6       0.984      0.985     0.440    0.426             0.942     0.448
       11      1.018      1.019     0.336    0.357             0.990     0.337

Note 1: Outfit statistics are defined in (14), (15) and (16).
Note 2: Exact p-values are the MCMC estimates of the one-sided p-values defined in Section 3. The asymptotic p-values of the conditional outfit are based on the standardized outfit statistic of the conditional frame of inference in Section 3.
7 Discussion
Unconditional response residuals are used in several widely used computer programs for evaluation
of item fit in the Rasch model. Items are flagged as potentially flawed if outfit statistics are either
larger than 1.5 or smaller than 0.6 and/or if they depart significantly from 1. Two recent
publications assessing fit of items to the Rasch model report nothing but outfit statistics and/or infit
statistics (Chen et al., 2006; Conrad et al., 2006). Both papers appear to disregard outfit and infit
statistics that are smaller than 1, thereby ignoring evidence that might suggest that a 2-parameter
IRT model might provide a better description of the data than the Rasch model.
The results of this paper suggest that unconditional outfit statistics are systematically biased, with
expected values smaller than 1. Unconditional outfit statistics will therefore tend both to overlook
misfitting items and to suggest that well-fitting items are too good to be true, that is, that they have
stronger item discrimination than assumed by the Rasch model. Evaluation of significance based on the
assumption that the distribution of Wilson-Hilferty transformed outfit statistics can be approximated
by a standard normal distribution is also shown to be erroneous. Based on these results, we
suspect that many published results relying exclusively on outfit and infit statistics are mistaken
concerning the fit of the Rasch model.
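For reference, the generic Wilson-Hilferty (1931) approximation states that if X is chi-squared distributed with k degrees of freedom, then (X/k)^(1/3) is approximately normal with mean 1 - 2/(9k) and variance 2/(9k). The sketch below applies it to a mean square that is assumed to behave like X/k. The function names and the df argument are illustrative assumptions, and the exact standardization used by individual programs may differ.

```python
import math

def wilson_hilferty_z(mean_square, df):
    """Cube-root transform of a mean square assumed to be chi-squared(df) / df."""
    var = 2.0 / (9.0 * df)
    return (mean_square ** (1.0 / 3.0) - (1.0 - var)) / math.sqrt(var)

def upper_tail_p(z):
    """One-sided p-value from the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Example with a hypothetical mean square of 1.3 on 50 degrees of freedom.
z = wilson_hilferty_z(1.3, 50)
print(z, upper_tail_p(z))
```

The approximation is only as good as the chi-squared assumption behind it, which is the assumption that the results above call into question for outfit statistics.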
This paper points out that these problems can be avoided if inference is performed in the exact
frame of inference defined by the conditional distribution of item responses given both person
scores and item margins. According to Rasch (1960) this is the natural frame of inference for
analysis of fit of item responses to Rasch models. In this framework, it is possible to assess the
significance of unconditional, conditional and exact outfit statistics in a way that is not influenced
by the bias of the outfit statistics. Because of this bias, flagging items due to the size of
unconditional outfit statistics appears to be unwise and should be avoided, but conditional outfit
statistics are unbiased and may be used instead.
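The exact frame of inference conditions on both person scores and item margins. Under the Rasch model the person and item parameters cancel from this conditional distribution, so every 0/1 response matrix with the observed margins is equally likely, and p-values can be approximated by sampling such matrices. The Python sketch below shows one common way to do this, a Markov chain of 2x2 "checkerboard" swaps; it is not necessarily the algorithm used in this paper, and the names swap_step, monte_carlo_p_value and statistic are hypothetical.

```python
import random

def swap_step(matrix, rng):
    """Attempt one checkerboard swap; row and column sums are never changed."""
    r1, r2 = rng.sample(range(len(matrix)), 2)
    c1, c2 = rng.sample(range(len(matrix[0])), 2)
    a, b = matrix[r1][c1], matrix[r1][c2]
    c, d = matrix[r2][c1], matrix[r2][c2]
    # Only [[1, 0], [0, 1]] and [[0, 1], [1, 0]] can be flipped into each other.
    if a == d and b == c and a != b:
        matrix[r1][c1], matrix[r1][c2] = b, a
        matrix[r2][c1], matrix[r2][c2] = d, c
    # Otherwise the chain stays where it is, which keeps the proposal symmetric.

def monte_carlo_p_value(observed, statistic, n_samples=200, thin=50, seed=0):
    """One-sided Monte Carlo p-value for statistic(matrix), margins held fixed."""
    rng = random.Random(seed)
    current = [row[:] for row in observed]  # work on a copy of the data
    t_obs = statistic(observed)
    exceed = 0
    for _ in range(n_samples):
        for _ in range(thin):
            swap_step(current, rng)
        if statistic(current) >= t_obs:
            exceed += 1
    return (exceed + 1) / (n_samples + 1)

# Example: persons in rows, items in columns; the statistic is hypothetical.
data = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
stat = lambda m: sum(row[0] * sum(row) for row in m)
print(monte_carlo_p_value(data, stat, n_samples=100, thin=20))
```

Successive draws from such a chain are dependent, so this simple estimate is only approximate; Besag and Clifford (1989) describe sampling schemes that give valid Monte Carlo tests.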
Weighted means of squared standardized residuals, referred to as infit statistics, are also used. These
statistics were also examined in this study, with similar results: unconditional infit statistics are
biased, and conditional infit statistics are unbiased. The Wilson-Hilferty transformation does not
work, because infit statistics are not approximately chi-squared distributed, but significance may be
assessed in the exact frame of inference.
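As a minimal sketch, assuming the usual information-weighted definition in which each squared standardized residual is weighted by its variance p(1 - p), the infit mean square reduces to the sum of squared raw residuals divided by the sum of the variances. The function name and inputs are illustrative; the definition used in this study is not restated in this section.

```python
def infit_mean_square(responses, probs):
    """Variance-weighted mean of squared standardized residuals.

    With weights p (1 - p), the weighted mean of (y - p)^2 / (p (1 - p))
    simplifies to sum((y - p)^2) / sum(p (1 - p)).
    """
    numerator = sum((y - p) ** 2 for y, p in zip(responses, probs))
    denominator = sum(p * (1.0 - p) for p in probs)
    return numerator / denominator

# Example: one unexpected response (y = 0 where p = 0.9) among two persons.
print(infit_mean_square([0, 1], [0.9, 0.5]))
```

Because unexpected responses at extreme probabilities receive small weights, the weighted statistic is less dominated by single outlying responses than the unweighted mean.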
The significance of conditional outfit statistics may also be assessed in the conditional frame of
inference, using the transformation (15). The transformation is based on the central limit theorem
and should be applied with care. We did not use it in the analysis of the Knox
Cube Test example, where it would have been completely inappropriate because some score
groups contained too few persons.
References
Andersen, E. B. (1970). Asymptotic properties of conditional maximum likelihood estimators. Journal of the Royal Statistical Society B, 32, 283-301.
Andersen, E. B. (1980). Discrete Statistical Models with Social Science Applications. Amsterdam:
North-Holland.
Andrich, D., Lyne, A., Sheridan, B. & Luo, G. (2000). RUMM2010 Computer Program. Perth:
RUMM Laboratory Pty Ltd.
Besag, J. & Clifford, P. (1989). Generalized Monte Carlo Significance Tests. Biometrika, 76,
633-642.
Chen, S-P. C., Bezruczko, N. and Ryan-Henry, S. (2006). Rasch Analysis of a New Construct:
Functional Caregiving for Adult Children with Intellectual Disabilities. Journal of
Applied Measurement, 7, 141-159.
Conrad, K. J., Matters, M. D., Luchins, D. J., Hanrahan, P., Quasius, D. L. and Lutz, G. (2006).
Development of a Money Mismanagement Measure and Cross-Validation Due to
Suspected Range Restriction. Journal of Applied Measurement, 7, 206-224.
Dimitrov, D. M. and Smith, R. M. (2006). Adjusted Rasch Person-Fit Statistics. Journal of Applied
Measurement, 7, 170-183.
Holst, C. (1994). Item Response Theory. Copenhagen: The Danish National Institute for Educational
Research.
Linacre, J. M. and Wright, B. D. (2000). A User's Guide to WINSTEPS. Chicago: MESA Press.
Neyman, J. & Scott, E. L. (1948). Consistent estimates based on partially consistent observations.
Econometrica, 16, 1-32.
Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen:
Nielsen & Lydiche.
Smith, R. M. (1991). The Distribution Properties of Rasch Item Fit Statistics. Educational and
Psychological Measurement, 51, 541-565.
Smith, R. M. (2004). Fit Analysis in Latent Trait Measurement Models. In Smith, E. V. and Smith,
R. M. (eds.) Introduction to Rasch Measurement. Maple Grove, Minnesota: JAM
Press, 73-92.
Wright, B. D. and Stone, M. H. (1979). Best Test Design. Chicago: MESA Press.
Wilson, E. B. and Hilferty, M. M. (1931). The distribution of chi-square. Proceedings of the
National Academy of Sciences of the United States of America - Physical
Sciences, 17, 684-688.
Wu, M., Adams, R. J. & Wilson, M. R. (1998). ACER Conquest: Generalised Item Response
Modelling Software. Australian Council for Educational Research.
Research Reports available from Department of Biostatistics
http://www.pubhealth.ku.dk/bs/publikationer

Department of Biostatistics
University of Copenhagen
Øster Farimagsgade 5
P.O. Box 2099
1014 Copenhagen K
Denmark

06/1 Carstensen, B. Demography and epidemiology: Age-Period-Cohort models in the computer age.
06/2 Carstensen, B. Demography and epidemiology: Practical use of the Lexis diagram in the computer age or: who needs the Cox-model anyway?
06/3 Christensen, K.B., Andersen, P.K., Smith-Hansen, L., Nielsen, M.L., Kristensen, T.S. Analyzing sickness absence using statistical models for survival data.
06/4 Christensen, K.B. & Kreiner, S. A Monte Carlo approach to unidimensionality testing in polytomous Rasch models.
06/5 Keiding, N. Event history analysis and the cross-section.
06/6 Ditlevsen, S.D. & Lansky, P. Estimation of the input parameters in the Feller neuronal model.
06/7 Kvist, K., Andersen, P.K. & Kessing, L.V. Repeated events and total time on test.
06/8 Ditlevsen, S.D. & Ditlevsen, O. Parameter estimation from observations of first-passage times of the Ornstein-Uhlenbeck process and the Feller Process.
06/9 Budtz-Jørgensen, E. Estimation of the benchmark dose by structural equation models.
06/10 Christensen, K.B., Feveille, H., Kreiner, S. & Bjorner, J.B. Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data.
06/11 Dalgaard, P. New R functions for multivariate analysis.
06/12 Picchini, U., Gaetano, A.D. & Ditlevsen, S. Parameter Estimation in Stochastic Differential Mixed-Effects Models.
06/13 Kvist, K., Harhoff, M.G., Andersen, P.K. & Kessing, L.V. Non-parametric estimation and model checking procedures for marginal gap time distributions for recurrent events.
07/1 Meira-Machado, L., de Uña-Álvarez, Cardarso-Suárez, C. & Andersen, P.K. Multi-state models for the analysis of time to event data.
07/2 Kreiner, S. & Christensen, K.B. Exact evaluation of bias in Rasch model residuals.