Conditional Item-Exposure Control in Adaptive Testing Using Item-Ineligibility Probabilities
Wim J. van der Linden
Bernard P. Veldkamp
University of Twente
Two conditional versions of the exposure-control method with item-ineligibility constraints for adaptive testing in van der Linden and Veldkamp (2004) are presented. The first version is for unconstrained item selection, the second for item selection with content constraints imposed by the shadow-test approach. In both versions, the exposure rates of the items are controlled using probabilities of item ineligibility given y that adapt the exposure rates automatically to a goal value for the items in the pool. In an extensive empirical study with an adaptive version of the Law School Admission Test, the authors show how the method can be used to drive conditional exposure rates below goal values as low as 0.025. Obviously, the price to be paid for minimal exposure rates is a decrease in the accuracy of the ability estimates. This trend is illustrated with empirical data.
Keywords: adaptive testing; conditional item-exposure control; item eligibility method;
uniform exposure rates
Items in adaptive tests are selected as a solution to an optimization problem in
which an objective function is maximized over the item pool. If the test has to
meet a set of content specifications, the optimization becomes an instance of a
more complicated constrained combinatorial optimization problem. A popular
choice for the objective function in these problems is the value of the information function at the current ability estimate for the items in the pool.
Suppose the items have been calibrated using the three-parameter logistic
(3-PL) model
pi(y) ≡ Pr{Uij = 1} ≡ ci + (1 − ci) exp[ai(y − bi)] / {1 + exp[ai(y − bi)]},  (1)
where y ∈ (−∞, ∞) is a parameter representing the ability of the test taker and
bi ∈ (−∞, ∞), ai > 0, and ci ∈ [0, 1] are the difficulty, discriminating power,
The authors are grateful to Wim M. M. Tielen for his computational assistance.
Journal of Educational and Behavioral Statistics
December 2007, Vol. 32, No. 4, pp. 398–418
DOI: 10.3102/1076998606298044
� 2007 AERA and ASA. http://jebs.aera.net
and guessing parameter for item i = 1, ..., N in the pool, respectively (Birnbaum, 1968). For this model, the item information function is

Ii(y) = [p′i(y)]² / {pi(y)[1 − pi(y)]}.  (2)
Let g = 1, ..., n denote the items in the adaptive test and ŷ(g−1) the estimate of y after the first g − 1 items. If the gth item is selected, the objective function that is maximized over the items in the pool is the information in Equation 2 at y = ŷ(g−1).

Because we optimize, the items in the test tend to be picked only from a small
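As an illustration of this selection criterion, the sketch below computes Equations 1 and 2 for a pool of 305 items and picks the most informative item at the current ability estimate. The item parameters here are randomly generated for illustration, not the actual LSAT pool used later in the article.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Response probability under the 3-PL model (Equation 1)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """Item information (Equation 2): [p'(theta)]^2 / {p(theta)[1 - p(theta)]}."""
    p = p_3pl(theta, a, b, c)
    dp = a * (1.0 - c) * np.exp(-a * (theta - b)) / (1.0 + np.exp(-a * (theta - b))) ** 2
    return dp ** 2 / (p * (1.0 - p))

# simulated pool of N = 305 items (illustrative values only)
rng = np.random.default_rng(1)
a = rng.uniform(0.5, 2.0, 305)    # discrimination a_i
b = rng.normal(0.0, 1.0, 305)     # difficulty b_i
c = rng.uniform(0.0, 0.25, 305)   # guessing c_i

theta_hat = 0.0                                      # current ability estimate
best = int(np.argmax(info_3pl(theta_hat, a, b, c)))  # maximum-information selection
```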
subset of the pool. This point is illustrated in the topmost curve in Figure 1,
which is composed of the segments of the information functions in Equation 2
that are locally best among the items in a pool for a section from the Law School
Admission Test (LSAT). We refer to this curve as the Level 1 dominance curve
for the item pool. The small subset of items on which this curve is based (in this
case, only 6 items from a pool of 305) dominates the other items in the pool
everywhere over the interval; consequently, they are always chosen. The basic
message from this curve is that unless special precaution is taken, the majority
of the items in the pool are bound to remain inactive.
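The dominance curves themselves are easy to compute: at each point on the y scale, the Level-k curve is the k-th largest item information in the pool. A sketch, again for a randomly generated pool rather than the actual LSAT pool:

```python
import numpy as np

def dominance_curve(level, a, b, c, grid):
    """Level-k dominance curve: the k-th largest item information at each y."""
    curve = []
    for t in grid:
        p = c + (1 - c) / (1 + np.exp(-a * (t - b)))
        dp = a * (1 - c) * np.exp(-a * (t - b)) / (1 + np.exp(-a * (t - b))) ** 2
        curve.append(np.sort(dp ** 2 / (p * (1 - p)))[-level])
    return np.array(curve)

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, 305)
b = rng.normal(0.0, 1.0, 305)
c = rng.uniform(0.0, 0.25, 305)
grid = np.linspace(-3.0, 3.0, 61)

level1 = dominance_curve(1, a, b, c, grid)   # topmost curve in Figure 1
level2 = dominance_curve(2, a, b, c, grid)   # locally second-best items
```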
If the adaptive test is administered in a continuous testing program, a set of
dominant items is easily memorized. If the stakes for the test takers are high, it
thus is possible for a few of them to plot and identify a critical portion of the item pool. Subsequent test takers are then able to familiarize themselves with these items and increase their test scores.

FIGURE 1. Dominance curves for Levels 1, 2, 3, 4, 5, 10, 20, . . ., 100 for an item pool from the Law School Admission Test.
Note: Each curve is composed of the segments of the information functions of the items that dominate the other items at the y values.
Fortunately, selecting items below the Level 1 dominance set does not neces-
sarily involve a large loss. The second curve in Figure 1 shows the Level 2 domi-
nance curve for the same item pool, that is, the curve consisting of the segments
of the information functions for the locally second-best items in the pool. Because the
curves for the two levels hardly differ in height (except at the upper end of the
scale where the pool had a few strongly discriminating items), not much accuracy
of scoring would be lost if we relaxed the criterion for item selection somewhat
and selected items from both subsets. This fact has led to the idea of probabilistic
control of item exposure in adaptive testing.
The first probabilistic method was proposed by McBride and Martin (1983).
Their method simply consisted of picking the items randomly from the first five
levels of dominance along the y scale (i.e., the first five curves in Figure 1). The
fact that the method is probabilistic is important because it treats all test takers
at a given ability level equally, in the sense that each of them has the same
probability of getting each item.
A more advanced probabilistic method was proposed by Sympson and Hetter
(1985; see also Hetter & Sympson, 1997). This method, which will be discussed
in more detail in the next section, is based on a probabilistic experiment that
is conducted each time an item is selected. The outcome of this experiment is
either the decision to go on and administer the item or to pass and rerun the
experiment for the next best item in the pool. The conditional probabilities of
administration given the selection of an item are the control parameters used
to restrict the item-exposure rates. The values of these parameters are to be set
through an iterative process of simulated adaptive test administrations.
An alternative method of probabilistic exposure control was presented in van
der Linden and Veldkamp (2004). The probability experiment in this method
differs from that in the Sympson-Hetter method in the following three aspects.
First, the experiment is not conducted each time after an item is selected but only
once before a test taker begins the test. Second, the critical parameters in the
experiment are not the conditional probabilities of administering an item given it
has been selected but the probabilities of the items being eligible for the test taker.
If an item is eligible, it is available for administration to the test taker. If the item
is ineligible, it is removed from the pool for the test taker. Third, the probabilities
of ineligibility are used in a recurrence relation that allows them to adapt automa-
tically to appropriate levels during testing. The differences between the two methods will be made precise when we discuss them in more detail in the following.
The present article serves three different goals. Our first goal is to generalize
the item-ineligibility method, which was developed for use with constrained item
selection through the shadow-test method, to any type of item selection in adap-
tive testing. The generalization makes the basic structure of the method trans-
parent and allows us to highlight some of its features. Our second goal is to
formulate a version of the method for conditional item-exposure control given y.
Conditional control is generally accepted as more effective because it reduces the
likelihood that test takers of approximately the same ability level detect the subset
of items in the pool specific to them (Stocking & Lewis, 1998). As will become
clear in the following, to formulate a conditional version of the method, we have
to reconceptualize the adaptive test as one conducted from multiple item pools.
Our final goal is to explore the behavior of the new versions of these methods
when the exposure rates of the items are driven to their minimum. This appears
to be possible provided we are willing to pay the price of a decrease in the accu-
racy of the ability estimates. The Level 10 through Level 100 curves in Figure 1
explain the decrease in accuracy: If the exposure rates are set lower and lower,
we are required to select items from the lower dominance levels in the pool and
eventually have to accept considerable loss of accuracy in testing. In addition, if
content constraints are imposed on the test, lowering the exposure rates may
eventually lead to overconstraining of the item selection, namely, the case where
no feasible test is left in the pool. Thus, if the conditional item-exposure rates are
chosen to be too low, the test becomes ineffective at some of the ability levels.
Sympson-Hetter Method
To highlight the differences with the ineligibility method in the next sections,
we briefly discuss the Sympson-Hetter (hereafter SH) method, which is based
on the following two events for each item in the pool:
Si: item i is selected;
Ai: item i is administered.
Because an item can never be administered without being selected, it always
holds that
Ai ⊂ Si,  (3)
for all i. Hence, for the conditional exposure rate of item i given y (i.e., probability P(Ai | y)), it follows that

P(Ai | y) = P(Ai, Si | y) = P(Ai | Si, y) P(Si | y)  (4)

for all possible values of y.
The SH method is used to force the item-exposure rates of all items in the
pool below an upper bound rmax. Typically, rmax is chosen to be in the range of
.20 to .30. From Equations 3 and 4 it follows that the bound is realized when
P(Ai | Si, y) P(Si | y) ≤ rmax  (5)
for i ¼ 1; . . . ;N.
The probability of item selection, P(Si | y), depends on a variety of factors, including the distribution of the item parameters in the pool, the objective function that is optimized, and the initial item that is chosen. Because each of these factors is fixed by design, the only parameters left in Equation 5 to manipulate the exposure rates are the conditional probabilities P(Ai | Si, y).

It is always possible to meet the bound in Equation 5 by setting the probabilities P(Ai | Si, y) at low values for all items. However, implicit in Equation 4 is the idea that the exposure rates for the best items in the pool should not be much lower than rmax because we do not want to lose them entirely. In other words, rmax should be viewed as a goal value that has to be approached from below rather than just an upper bound.
Values for PðAi|Si; yÞ that approach the goal value cannot be found by an
analytic method. Sympson and Hetter therefore proposed to find them using an
iterative process of computer simulations that has to be continued until admissi-
ble exposure rates are found. At each step in this process, a large set of simu-
lated test administrations is replicated at the y values for which the exposure
rates have to be controlled. At the end of the step, the new conditional exposure
rates of the items given these values are estimated and the control parameters
are adjusted. The fact that the control parameters are set at the true y values,
whereas in operational testing we control the exposure rates at estimated y values, is not much of a problem provided the set of y values is well chosen
(Stocking & Lewis, 2000).
Let t = 1, 2, ... denote the sets of simulations. The adjustment rule used in the SH method is

P(t+1)(Ai | Si, y) := 1 if P(t)(Si | y) ≤ rmax;
P(t+1)(Ai | Si, y) := rmax / P(t)(Si | y) if P(t)(Si | y) > rmax,  (6)

where i = 1, ..., N. The rule is based on the idea that if at step t an item was selected with a probability smaller than rmax, no control is needed. However, if an item was selected with a conditional probability larger than rmax, its control parameter should have been set such that P(Ai | y) = rmax. From Equation 4 it follows that this requirement would have been realized if P(t)(Ai | Si, y) had been equal to rmax / P(t)(Si | y). Hence, the new value of P(t+1)(Ai | Si, y) is set at this level. For a more formal treatment of the SH method and some alternative methods based on variations of this adjustment rule, the reader should consult van der Linden (2003).
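The adjustment rule in Equation 6 is a one-line computation per item; a minimal sketch (function and variable names are ours):

```python
def sh_adjust(p_select, r_max):
    """One iteration of the Sympson-Hetter adjustment rule (Equation 6).
    p_select: estimated selection probabilities P(S_i | y) from the last set of
    simulations; returns the control parameters P(A_i | S_i, y) for the next set."""
    return [1.0 if p <= r_max else r_max / p for p in p_select]

# items selected more often than r_max get a control parameter below 1
controls = sh_adjust([0.05, 0.30, 0.60], r_max=0.25)
```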
In practical settings, the use of the SH method has been found to be time con-
suming. The number of y values at which the exposure rates are controlled is
usually in the range of 10 to 12. The number of iterations of the adaptive test
simulations for one y value is generally of the same order. It is therefore not
uncommon to have to run 100 to 200 simulations before an admissible set of
exposure rates is found. In addition, if some of the items have to be replaced
during operational testing because they are compromised or appear to be flawed,
the values of the control parameters become invalid and the procedure has to be
repeated (Chang & Harris, 2002).
A more fundamental problem with the SH method is the fact that the item-
exposure rates do not necessarily converge to values below rmax during the
adjustment process. It can be regularly observed that the rates of some of the
overexposed items increase rather than decrease after adjustment. Also, rates
that were below rmax for some steps may suddenly jump back to a value larger
than this target. Because of this behavior, it is necessary to inspect the exposure
rates of all items after each step and use personal judgment to decide when to
stop.
Conditional Item-Ineligibility Methods
We first discuss the case of adaptive testing without content constraints. The
modifications of the method necessary to deal with adaptive testing with content
constraints are introduced in a later section.
Testing Without Content Constraints
To formulate the item-ineligibility method we consider the following two
events:
Ei: item i is eligible;
Ai: item i is administered.
If the item is eligible, it remains in the pool during the entire test for the test
taker; otherwise it is removed. Unlike the SH method, it is not necessary to
allow for an event Si of selecting item i: An item is always administered if it is
selected. More formally, it holds that Ai ¼ Si, and we need not consider the
latter.
Analogous to Equation 3,

Ai ⊂ Ei  (7)

for all i. We are therefore able to write

P(Ai | y) = P(Ai, Ei | y) = P(Ai | Ei, y) P(Ei | y)  (8)
for all possible values of y.
If we impose rmax as a goal value for the exposure rates P(Ai | y), we obtain

P(Ai | y) = P(Ai | Ei, y) P(Ei | y) ≤ rmax,  (9)

or

P(Ei | y) ≤ rmax / P(Ai | Ei, y),  (10)

with P(Ai | Ei, y) > 0. From Equation 7,

P(Ai | Ei, y) = P(Ai | y) / P(Ei | y),  (11)

with P(Ei | y) > 0. Hence, Equation 10 can be rewritten as

P(Ei | y) ≤ rmax P(Ei | y) / P(Ai | y),  (12)

still with P(Ai | y) > 0.
The basic idea is to conceive of Equation 12 as a recurrence relation. Suppose j test takers have already taken the test, and we want to establish the probabilities of eligibility for test taker j + 1. If rmax is our goal value for the exposure rates at selected points yk, k = 1, ..., K, the probabilities of eligibility P(j+1)(Ei | yk) can be calculated as

P(j+1)(Ei | yk) = min{rmax P(j)(Ei | yk) / P(j)(Ai | yk), 1},  (13)

with P(j)(Ai | yk) > 0.
The rationale for the recurrence relation in Equation 13 is that it automatically
maintains rmax as a goal value for the exposure rates. It is easy to show that
if P(j)(Ai | yk) > rmax, then P(j+1)(Ei | yk) < P(j)(Ei | yk);
if P(j)(Ai | yk) = rmax, then P(j+1)(Ei | yk) = P(j)(Ei | yk);
if P(j)(Ai | yk) < rmax, then P(j+1)(Ei | yk) > P(j)(Ei | yk).  (14)
Thus, if an exposure rate is larger than the goal value, the probability of eligibil-
ity of the item always goes down. As a result, because of Equation 7, the
expected exposure rate also goes down. On the other hand, if an exposure rate is
below rmax, the probability of eligibility of the item goes up.
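A minimal sketch of the recurrence relation in Equation 13, illustrating the three cases in Equation 14 at a single point yk (the probabilities are made up for illustration):

```python
def update_eligibility(p_elig, p_admin, r_max):
    """Equation 13: the eligibility probability for the next test taker,
    given the current eligibility and exposure probabilities at one y value."""
    return min(r_max * p_elig / p_admin, 1.0)

# Equation 14: the update pulls the exposure rate toward the goal value r_max
down = update_eligibility(p_elig=0.8, p_admin=0.30, r_max=0.20)  # rate above goal
same = update_eligibility(p_elig=0.8, p_admin=0.20, r_max=0.20)  # rate at goal
up = update_eligibility(p_elig=0.8, p_admin=0.10, r_max=0.20)    # rate below goal
```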
Practical Implementation
To implement the method for exposure control conditional on a set of values
yk, k ¼ 1; . . . ;K, we have to abandon the idea of an adaptive test from a single
item pool. Before a person takes the test, the current probabilities of eligibility
are used to decide which items are eligible for each of the values yk. The result
of these experiments is K different versions of the item pool, one at each yk. During the test, the person visits the version of the item pool that is closest to
his or her current ability estimate.
To estimate the probabilities of eligibility, we have to record the following
two counts:
aijk: number of test takers through j who visited item pool k and took item i;
eijk: number of test takers through j who visited item pool k when item i was eligible.
P(j)(Ai | yk) and P(j)(Ei | yk) can then be estimated as aijk/j and eijk/j, and the estimated probability of eligibility for test taker j + 1 in Equation 13 is obtained as

P̂(j+1)(Ei | yk) = min{rmax eijk / aijk, 1},  (15)

with aijk > 0. The estimates in Equation 15 ignore the differences between the
test taker’s true and estimated ability. Just as for the SH method, we expect the
impact of estimation error on the actual exposure rates to be negligible for all
practical purposes (Stocking & Lewis, 2000). The simulation results presented
later in this article confirm this expectation.
Two different implementations of the method are possible:
1. The method can be used to control the exposure rates on the fly, that is, without
any prior adjustment of the eligibility probabilities.
2. The probabilities can be adjusted prior to operational testing through a computer
simulation of administrations of the test.
In the previous study with the unconditional version of the method (van der
Linden & Veldkamp, 2004), we were able to report that for a typical goal value of rmax = .25, both the probabilities of eligibility and the exposure rates were
already stable after 1,000 test administrations. In the empirical study reported
later in this article, we minimized the goal values and found that for values close
to their minimum (see Equation 25 in the following), the method should be used
in combination with the technique of fading to get stability after the same number
of administrations. (The technique of fading is explained in a following section.)
Whatever implementation is used, it is always possible to deal on the fly with the replacement of a few items in the operational pool, for instance, because they appear to be compromised.
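The first implementation can be sketched in a few lines. The simulation below runs the control on the fly for a single item pool (one yk value); the selection rule is a crude stand-in for maximum-information selection, and the pool size, test length, and goal value are arbitrary illustrative choices:

```python
import random

random.seed(7)
r_max, n_items, n_takers, test_len = 0.25, 30, 2000, 5
elig_prob = [1.0] * n_items   # current eligibility probabilities, initialized at 1
e_count = [0] * n_items       # e_ijk: visits for which item i was eligible
a_count = [0] * n_items       # a_ijk: administrations of item i

for j in range(n_takers):
    # probability experiment before the test: decide eligibility per item
    eligible = [i for i in range(n_items) if random.random() < elig_prob[i]]
    for i in eligible:
        e_count[i] += 1
    # stand-in selection rule: the first test_len eligible items
    for i in eligible[:test_len]:
        a_count[i] += 1
    # Equation 15: update the eligibility probabilities for the next test taker
    for i in range(n_items):
        if a_count[i] > 0:
            elig_prob[i] = min(r_max * e_count[i] / a_count[i], 1.0)

rates = [a / n_takers for a in a_count]   # realized exposure rates
```

Without the control, this stand-in rule would administer the same five items to every test taker; with it, the eligibility probabilities of those items are driven down and their realized rates settle near the goal value.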
Testing With Content Constraints
Usually, content constraints are to be imposed on the adaptive test. If so, the shadow-test approach offers an effective implementation. Shadow tests are full-size tests calculated prior to the selection of the items that (a) are optimal at the last ability estimate, (b) meet all content constraints, and (c) include all items already administered to the current person. The next item to be administered is the best item among the free items in the current shadow test. Shadow tests can be easily assembled using the technique of 0-1 integer programming (van der Linden, 2000, 2005; van der Linden & Reese, 1998).
A natural way to implement the control in adaptive testing with content con-
straints is through the inclusion of ineligibility constraints in the models for the
shadow tests. If the decision is that item i is ineligible for the test taker, the following constraint is added to the model:

xi = 0,  (16)

where xi is the 0-1 decision variable for item i in the model (that is, item i is selected if xi = 1 and not selected if xi = 0). If item i remains eligible, no constraint is added.
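To make the mechanics concrete, the sketch below assembles a shadow test for a toy pool by brute force instead of a 0-1 integer programming solver; the pool, constraints, and ineligibility decisions are all made up for illustration:

```python
from itertools import combinations

# toy pool: (information at the current ability estimate, content category)
pool = [(2.0, "A"), (1.8, "A"), (1.6, "B"), (1.4, "B"), (1.2, "A"), (1.0, "B")]
test_len = 3
ineligible = {0}     # Equation 16: add x_0 = 0 for this test taker
administered = {2}   # items already given must be included in the shadow test

def feasible(items):
    cats = [pool[i][1] for i in items]
    return (administered <= set(items)          # (c) include administered items
            and not set(items) & ineligible     # ineligibility constraints
            and cats.count("A") >= 1            # (b) content constraints
            and cats.count("B") >= 1)

# maximize total information over all feasible full-size tests; a real system
# would solve this 0-1 program with a MIP solver rather than by enumeration
shadow = max((s for s in combinations(range(len(pool)), test_len) if feasible(s)),
             key=lambda s: sum(pool[i][0] for i in s))
# the next item administered is the best free item in the shadow test
next_item = max((i for i in shadow if i not in administered), key=lambda i: pool[i][0])
```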
The extension of the test-assembly model for the shadow test with these ineligibility constraints gives rise to two new issues. The first is potential overconstraining of the item selection. In principle, it is possible that a temporary combination
of content and ineligibility constraints in the model is unfortunate and no feasible
solution is left. Generally, however, for a typical testing program, the number of
ineligibility constraints in the model is small, and the likelihood of an infeasible
solution can be ignored. In fact, in our empirical studies with the unconditional
version of this method, infeasibility never occurred (van der Linden & Veldkamp,
2004). We expect the same to happen with the conditional version of the method
except when the exposure rates are driven to their minimum (see the next section).
Also, the likelihood of an infeasibility depends entirely on the appropriateness of
the item pool for the test. A method of item-pool assembly that guarantees a
balanced distribution of the items in the pool with respect to the content con-
straints is presented in van der Linden, Ariel, and Veldkamp (2006). For methods
of item-pool design that guarantee such distributions, see van der Linden (2005).
If infeasibility occurs, a straightforward solution is to remove all ineligibility
constraints from the model and use the full pool for item selection. This measure
may lead to an occasional extra exposure for some of the items, but the adaptive
mechanism in Equation 14 automatically corrects for them.
The second issue has to do with the fact that the use of shadow tests reinvokes
the distinction between item selection and administration on which the SH
method rests. An item can now be selected for the shadow test but may not be
administered because it was dominated by the other free items in the test.
To deal with these new issues, we distinguish the following possible events:
Ei: item i is eligible;
F: the model for the shadow test with the ineligibility constraints is feasible;
Si: item i is selected in a shadow test;
Ai: item i is administered.
It holds for these four events that

Ai ⊂ Si ⊂ {Ei ∪ F̄},  (17)

where F̄, the complement of F, is the event of an infeasible shadow test. Following the same argument as in van der Linden and Veldkamp (2004, Equations 8 through 13), an upper bound rmax on the conditional exposure rates P(Ai | y) can be shown to lead to

P(Ei | y) ≤ 1 − 1/P(F | y) + rmax P(Ei ∪ F̄ | y) / [P(Ai | y) P(F | y)],  (18)

with P(Ai | y) > 0 and P(F | y) > 0.
This inequality implies the following version of Equation 13 for the case of adaptive testing with content constraints:

P(j+1)(Ei | yk) = min{1 − 1/P(j)(F | yk) + rmax P(j)(Ei ∪ F̄ | yk) / [P(j)(Ai | yk) P(j)(F | yk)], 1},  (19)

still with P(j)(Ai | yk) > 0 and P(j)(F | yk) > 0.

This equation looks more complicated than Equation 13, but the relation between the two becomes clear if we consider the case in which the shadow test is always feasible and set P(j)(F | yk) = 1. It then holds that P(j)(Ei ∪ F̄ | yk) = P(j)(Ei | yk), and Equation 13 directly follows from Equation 19. In addition, Equations 13 and 19 have the same adaptive behavior. The only difference between the two relations is a correction of the probability of item eligibility for possible infeasibility in Equation 19. For these and other details, see van der Linden and Veldkamp (2004).
To estimate the right-hand probabilities in Equation 19, we need the following counts:

njk: number of test takers through j who visited item pool k;
aijk: number of test takers through j who visited item pool k and took item i;
jjk: number of test takers through j who visited item pool k when the test was feasible;
rijk: number of test takers through j who visited item pool k when item i was eligible or the test was infeasible.

The left-hand probabilities are then estimated as

P̂(j+1)(Ei | yk) = min{1 − njk/jjk + rmax njk rijk / (aijk jjk), 1},  (20)

with aijk > 0 and jjk > 0.
To begin the test, it is recommended to set P̂(j+1)(Ei | yk) = 1 for item i until both aijk > 0 and jjk > 0. This initialization prevents indeterminate values while the conditions in Equation 20 are not yet satisfied.
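A sketch of the estimate in Equation 20, including the recommended initialization (function and argument names are ours):

```python
def estimated_eligibility(n_k, a_ik, f_k, r_ik, r_max):
    """Equation 20 for one item i and pool k: n_k = visits n_jk, a_ik =
    administrations a_ijk, f_k = feasible visits j_jk, r_ik = eligible-or-
    infeasible visits r_ijk. Returns 1 until both counts are positive."""
    if a_ik == 0 or f_k == 0:
        return 1.0  # recommended initialization
    return min(1.0 - n_k / f_k + r_max * n_k * r_ik / (a_ik * f_k), 1.0)

# with every shadow test feasible (f_k = n_k), Equation 20 reduces to Equation 15
p_next = estimated_eligibility(n_k=1000, a_ik=300, f_k=1000, r_ik=900, r_max=0.25)
```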
Fading
In van der Linden and Veldkamp (2004), it is recommended to update the
counts using the technique of fading, which is used in applications of Bayesian
networks for updating posterior probabilities. The basic idea underlying this
technique is to weigh the old data by a fading factor when new data are added.
As a result, the effective size of the sample remains fixed at a predetermined
level (Jensen, 2001), and the probabilities of eligibility have the same level of
stability throughout operational testing. For a demonstration of the effectiveness
of fading for updating eligibility probabilities, see van der Linden and Veld-
kamp (2004).
Suppose we use a fading factor w. The number of test takers njk who visited item pool k in Equation 20 is no longer a direct count but a number updated as

n*(j+1)k = w n*jk + 1.  (21)
The updates of the other counts are analogous. For example, the number of test
takers who visited pool k and received item i is updated as
a*(j+1)ik = w a*ijk + 1 if item i was administered to test taker j;
a*(j+1)ik = w a*ijk otherwise.  (22)
These updates produce estimates of the probabilities based on a moving window, with weights close to one for recent test takers but approaching zero for earlier test takers. The method can be shown to yield estimates based on an effective sample size equal to 1/(1 − w) (Jensen, 2001). In the following empirical study, we used w = .999, which amounts to an effective sample size of 1,000.
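The faded counts in Equations 21 and 22 share one update pattern; a minimal sketch that also checks the effective-sample-size claim numerically:

```python
def fade_count(count, hit, w=0.999):
    """Fading update (Equations 21-22): weigh the old count by w and add 1
    only if the event occurred for the new test taker."""
    return w * count + (1.0 if hit else 0.0)

# a count incremented for every test taker converges to the effective
# sample size 1 / (1 - w), here 1,000
n = 0.0
for _ in range(20000):
    n = fade_count(n, hit=True, w=0.999)
```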
The use of fading is particularly important when the goal value for the expo-
sure rates is set close to its minimum (see the next section). Typically, the prob-
abilities of eligibility for the item pool go down to a value smaller than 1 in an
order determined by the level of dominance of the items (see Figure 1). When
the goal value approaches its minimum, the process has to be continued until the
last items in the pool are reached. But when these items become active, the numbers of test takers that have visited the item pools, njk, have already grown large.
As a result, Equation 20 adapts only slowly to the changes in the counts of the
item administrations, aijk, which had been close to zero thus far. The technique
of fading prevents this slower adaptation because the weighted counts in
Equation 20 are based on the same effective number of test takers for later items
as for earlier ones.
Minimizing Exposure Rates
It is tempting to explore how low rmax can be set before the method breaks
down. Unfortunately, although we are able to discuss a useful lower bound on
the marginal exposure rates, exact bounds for conditional rates are hard to find.
Marginal Rates
For the marginal exposure rates, the following relation holds for any population of test takers:

Σ_{i=1}^{N} P(Ai) = n,  (23)

where n is the length of the adaptive test.
The relation was presented without formal proof in van der Linden (2003). However, a straightforward proof runs as follows: Let xij be an indicator variable equal to 1 if examinee j takes item i and equal to 0 otherwise. Then, for a population of J test takers,

Σ_{j=1}^{J} xij / J

is the empirical marginal exposure rate of item i and

Σ_{i=1}^{N} xij = n

is the common length of the test. Hence, the sum of the exposure rates can be written as

Σ_{i=1}^{N} P(Ai) = Σ_{i=1}^{N} E(Σ_{j=1}^{J} xij / J)
= Σ_{j=1}^{J} E(Σ_{i=1}^{N} xij) / J
= n,  (24)

where E denotes the expectation over replicated test administrations. The transition from the second to the third expression in Equation 24 is valid because the test length is supposed to be the same for all j.
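Equation 23 holds regardless of the selection rule, which is easy to check numerically; the simulation below administers n arbitrary items per test taker and sums the empirical exposure rates (all sizes are illustrative):

```python
import random

random.seed(3)
N, n, J = 40, 8, 500   # pool size, test length, number of test takers
x = [[0] * N for _ in range(J)]
for j in range(J):
    # any rule that administers exactly n distinct items works here
    for i in random.sample(range(N), n):
        x[j][i] = 1

exposure = [sum(x[j][i] for j in range(J)) / J for i in range(N)]
total = sum(exposure)   # Equation 23: the rates must sum to the test length
```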
From a security point of view, it would be ideal if the exposure rates of all
items were distributed uniformly with a common low value. From Equation 23,
it follows that this type of distribution is only possible for
P(Ai) = n/N, for all i.  (25)
To illustrate the capability of the current method to realize a uniform distribution of exposure rates, we simulated adaptive administrations of a test of 25 items
from a pool of 305 items. The pool and the test are described in the section on the
empirical study that follows. For this combination of test length and pool size,
uniform exposure rates are only possible with P(Ai) = 25/305 = .0814 for all items. We simulated 10,000 administrations of the test to random test takers at y = −2.0, −1.5, ..., 2.0 for each of the goal values rmax = .25, .20, .15, .10, and
.0814. (To make the results comparable, the range of y values was chosen to be
identical to that in the main study that follows.)
The resulting exposure rates for the items are shown in Figure 2. The lower the goal value, the lower the maximum exposure rate and the larger the portion of the item pool that became active in the test. The curve for rmax = .0814 shows a uniform distribution except for minor disturbances at its extremes due to the probabilistic nature of the method. It is interesting to observe the difference between the curves for rmax = .10 and rmax = .0814. Although the difference between the two goal values seems negligible, for the former, some 50 items were still inactive, whereas all items became active for the latter.
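The same effect can be reproduced on a small made-up pool: running the item-ineligibility control with the goal value set at the minimum n/N from Equation 25 activates the whole pool. The selection rule below is again a crude stand-in for maximum-information selection, and all sizes are illustrative:

```python
import random

random.seed(11)
n_items, test_len, n_takers = 20, 4, 4000
r_max = test_len / n_items   # Equation 25: the minimum uniform rate, here .20
elig = [1.0] * n_items
e_cnt = [0] * n_items        # visits for which the item was eligible
a_cnt = [0] * n_items        # administrations

for j in range(n_takers):
    eligible = [i for i in range(n_items) if random.random() < elig[i]]
    for i in eligible:
        e_cnt[i] += 1
    for i in eligible[:test_len]:   # stand-in selection rule
        a_cnt[i] += 1
    for i in range(n_items):        # Equation 15 update
        if a_cnt[i] > 0:
            elig[i] = min(r_max * e_cnt[i] / a_cnt[i], 1.0)

rates = [a / n_takers for a in a_cnt]
active = sum(1 for r in rates if r > 0)   # every item should become active
```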
Conditional Rates
For conditional exposure rates, the version of Equation 23 is more complicated. In this case, we have a different set of rates for each item pool at yk. In addition, the number of test takers who visit a pool is no longer fixed but random. Consequently, to derive an expression for the sum of the exposure rates, we have to weigh the sums of the rates for the individual pools by the probability of a visit to the pool. Because the probability of a visit depends on how far the test taker is in the test, we arrive at the following version of Equation 23:

Σ_{k=1}^{K} Σ_{g=1}^{n} Σ_{i=1}^{N} Pg(Ai | yk) Pgk = n,  (26)

where g = 1, ..., n still indexes the items in the adaptive test, Pgk is the probability of a test taker visiting the item pool at yk for the selection of item g, and Pg(Ai | yk) is the probability of administering item i during this visit.
Because all probabilities in Equation 26 are dependent on each other, it is
impossible to use this relation for deriving lower bounds on the conditional
item-exposure rates. This conclusion holds a fortiori for adaptive testing with
item content constraints. We therefore took an empirical approach and explored
what happened to the conditional exposure rates when the goal values rmax were
systematically decreased in a series of simulated test administrations.
Empirical Study
An adaptive version of a section of the LSAT was simulated for 10,000 test takers at yk = −2.0, −1.5, ..., 2.0. The test consisted of 25 discrete items, whereas the pool had a size of 305 items.
The following versions of the test were simulated:
1. Tests with and without content constraints.
2. Tests with conditional item-exposure control at yk = −2.0, −1.5, ..., 2.0 with the goal values rmax = 1 (no control), .20, .15, .10, .05, and .025.
The content constraints were the usual constraints for the paper-and-pencil
version of the LSAT section; they dealt with such attributes as item type and
content, answer key distribution, word counts, and gender/minority orientation
of the items. The constraints were imposed on the adaptive test using the shadow-test approach. The total number of constraints was equal to 30. The last
two levels of exposure control were lower than the bound of .0814 calculated
from Equation 25 for the marginal exposure control in the preceding section;
they were therefore expected to be critical.
FIGURE 2. Estimated marginal exposure rates of the items in the pool for the goal values rmax = .25, .20, .15, .10, and .0814.
Note: For each curve, the items on the horizontal axis are ordered by their exposure rates.

The test administrations were simulated with the maximum-information criterion for item selection in Equation 2. The initial ability estimate was set equal to ŷ(0) = 0 for all test takers. The interim estimates of y were calculated using the method of expected a posteriori (EAP) estimation with a noninformative prior.
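A sketch of EAP estimation on a grid with a flat (noninformative) prior, using made-up item parameters:

```python
import numpy as np

def eap_estimate(a, b, c, responses, grid=None):
    """EAP ability estimate under the 3-PL model with a flat prior on a grid."""
    if grid is None:
        grid = np.linspace(-4.0, 4.0, 81)
    # 3-PL response probabilities for each administered item at each grid point
    p = c[:, None] + (1 - c[:, None]) / (1 + np.exp(-a[:, None] * (grid - b[:, None])))
    u = np.array(responses)[:, None]
    likelihood = np.prod(np.where(u == 1, p, 1 - p), axis=0)
    posterior = likelihood / likelihood.sum()   # flat prior drops out
    return float((grid * posterior).sum())      # posterior mean

# three administered items (illustrative parameters, not LSAT values)
a = np.array([1.2, 0.9, 1.5])
b = np.array([-0.5, 0.0, 0.5])
c = np.array([0.2, 0.2, 0.2])
theta_correct = eap_estimate(a, b, c, [1, 1, 1])   # all answers correct
theta_wrong = eap_estimate(a, b, c, [0, 0, 0])     # all answers incorrect
```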
During the simulations, we recorded (a) the actual exposure rates and (b) the errors in the ability estimates at each yk.

The exposure rates for adaptive testing without the content constraints are shown in Figure 3. The four panels in this figure are for yk = −1.5, −0.5, 0.5, and 1.5. (The results at the other values of yk fitted the patterns in these panels exactly and are omitted for space.) Without exposure control, the maximum exposure rates tended to be close to 1.0. The largest rate was that of .99 observed
FIGURE 3. Estimated conditional exposure rates of the items given y ¼ �1:5, −0.5,
0.5, and 2.0 for adaptive testing without the content constraints.Note: Each curve is for a different goal value rmax ¼ :25, .20, .15, .10, .05, and .025. For each curve,
the items on the horizontal axis are ordered by their exposure rates. The maximum values for the
curves decrease with rmax.
van der Linden and Veldkamp
412
at yk ¼ 0:5: For the conditions with control, the maximum rate decreased sys-
tematically with rmax. The lowest goal value of rmax ¼ :025 had a maximum rate
of .04 observed for a few items at each yk. The number of items that were still
inactive for this goal value varied from 10 to 15 items at yk ¼ �1:5 and 1.5 to 60
to 70 items at yk ¼ �0:5 and 0:5. These results suggest that for the current item
pool, rmax ¼ :025 must have been close to the minimum value possible at the
two outmost values of yk but that a lower value might have been possible at the
yk values in the middle of the scale.
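The conditional exposure rates reported here are simple relative frequencies: the proportion of simulees at a given θk who saw a given item. A minimal sketch of this bookkeeping, assuming a hypothetical record layout of (θk, administered items) pairs:

```python
from collections import defaultdict

def conditional_exposure_rates(records):
    """records: list of (theta_k, administered_item_ids) pairs, one per
    simulated test taker. Returns, for each theta_k, a dict mapping
    item id -> exposure rate (relative frequency of administration)."""
    counts = defaultdict(lambda: defaultdict(int))
    takers = defaultdict(int)
    for theta_k, items in records:
        takers[theta_k] += 1
        for i in items:
            counts[theta_k][i] += 1
    return {t: {i: n / takers[t] for i, n in c.items()}
            for t, c in counts.items()}

# Toy example: item 7 is administered to both simulees at theta_k = 0.5
rates = conditional_exposure_rates([(0.5, {7, 12}), (0.5, {7, 31})])
print(rates[0.5][7])  # 1.0
```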
FIGURE 4. Estimated conditional exposure rates of the items given θ = −1.5, −0.5, 0.5, and 2.0 for adaptive testing with the content constraints. Note: Each curve is for a different goal value rmax = .25, .20, .15, .10, .05, and .025. For each curve, the items on the horizontal axis are ordered by their exposure rates. The maximum values for the curves decrease with rmax. The only exception is the maximum value for rmax = .025, which has moved to the second position for θ = −1.5 and θ = 1.5.
The exposure rates for adaptive testing with the content constraints in Figure 4 show the same general pattern. The only exceptions are the rates for rmax = .025 at θk = −1.5 and 1.5. The curves for this goal value jumped to the second position (that is, directly after the condition without control). The reason for this jump was that the goal value had become much too low to produce feasible shadow tests at each of the values θk. Table 1 shows the percentages of feasible shadow tests for rmax = .025 during the simulations. Although infeasibility was hardly a problem at θk = 0, for the more extreme values of θk the percentage of feasibility quickly decreased to 7.1 (θk = −2.0) and 16.4 (θk = 2.0). Recall that when infeasibility occurred, the items were returned to the pool. For the higher values of rmax, replacement was an occasional random event, which was automatically corrected for by the adaptive mechanism in Equation 14. However, for rmax = .025 we appeared to have dropped below what was possible at the majority of the θk values, and the method was no longer able to cope with the number of replacements.
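Equation 14 itself is not reproduced in this excerpt. Purely as an illustration of how a self-adaptive exposure-control update can behave, the sketch below applies a generic multiplicative rule to item-eligibility probabilities: the function, its exact form, and the numbers in the example are our own, not the authors' Equation 14.

```python
def update_eligibility(p_elig, p_admin, r_max):
    """Generic multiplicative eligibility update (illustrative only).
    If an item's administration probability p_admin exceeds the goal
    value r_max, its eligibility probability is lowered proportionally;
    items with no exposure become fully eligible again."""
    new = {}
    for i, pe in p_elig.items():
        pa = p_admin.get(i, 0.0)
        if pa > 0.0:
            new[i] = min(1.0, r_max * pe / pa)
        else:
            new[i] = 1.0
    return new

# Overexposed item 1 (rate .50 vs goal .10) loses eligibility;
# rarely used item 2 (rate .01) stays fully eligible.
print(update_eligibility({1: 1.0, 2: 1.0}, {1: 0.50, 2: 0.01}, 0.10))
```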
It is interesting to note that for the combination of rmax = .025 and adaptive testing without the content constraints in Figure 3, only 2 out of the 9 × 10,000 simulated administrations resulted in an infeasible shadow test. The reason for rmax = .025 being too low for adaptive testing with the content constraints was thus not that the number of eligible items in the pool was smaller than the length of the test but that none of the possible combinations of eligible items satisfied the set of content constraints for the test.
We also recorded the errors in the ability estimates during the test. Figures 5 and 6 show the estimated bias and mean square error (MSE) functions calculated from these errors. As expected, the curves were generally ordered in the values of rmax; for smaller goal values, the curves were closer to the horizontal axis.

It should be observed that for rmax = .025, the errors were smaller for adaptive testing with the content constraints than without them. This finding might seem surprising because the addition of content constraints to adaptive testing generally results in less space for optimizing the item selection and consequently larger errors. However, the finding is explained by the fact discussed earlier that rmax = .025 was too low to satisfy the content constraints at the majority of the θk values. Because the algorithm frequently had to return all the items to the pool, the exposure rates of the more dominant items went up and the accuracy of the ability estimation improved.
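The plotted bias and MSE functions are standard conditional error summaries over the simulees sharing a true ability. A minimal sketch of their computation at a fixed θk (function and variable names are illustrative):

```python
def bias_and_mse(theta_true, theta_hats):
    """Conditional bias and mean square error of the final ability
    estimates for simulees with the same true ability theta_true."""
    errors = [est - theta_true for est in theta_hats]
    n = len(errors)
    bias = sum(errors) / n          # average signed error
    mse = sum(e * e for e in errors) / n  # average squared error
    return bias, mse

# Example: four estimates for simulees at theta_k = 0.5
print(bias_and_mse(0.5, [0.4, 0.6, 0.7, 0.3]))  # bias ~ 0.0, MSE ~ 0.025
```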
TABLE 1. Percentage of Feasible Shadow Tests During Adaptive Testing With Content Constraints for rmax = .025

θk          −2.0  −1.5  −1.0  −0.5   0    0.5   1.0   1.5   2.0
% Feasible   7.1  14.5   9.8  87.2  99.9  57.3  29.0  12.1  16.4
Practical Conclusion
For a high-stakes testing program, we expect item security to be the number
one priority. On the other hand, the accuracy of the ability estimates cannot be
sacrificed too much. The empirical study in this article was based on only one combination of test and item pool. It is therefore dangerous to generalize. But if we had to choose a goal value for the conditional exposure rates for this combination, our choice would have been rmax = .15 or .10. The bias and MSE curves for the two lower levels of .05 and .025 in Figures 5 and 6 are much less favorable. Such levels may only become acceptable if the item pool is made larger and/or the test shorter (see the lower bound on the exposure rates in Equation 25).

FIGURE 5. Estimated conditional bias and MSE functions for adaptive testing without content constraints for rmax = .25, .20, .15, .10, .05, and .025. Note: The curves in both panels are ordered in rmax. MSE = mean square error.
FIGURE 6. Estimated conditional bias and MSE functions for adaptive testing with content constraints for rmax = .25, .20, .15, .10, .05, and .025. Note: The curves in both panels are ordered in rmax. MSE = mean square error.
References
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s
ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores
(pp. 397-479). Reading, MA: Addison-Wesley.
Chang, S.-W., & Harris, D. J. (2002, April). Redeveloping the exposure control parameters of CAT items when a pool is modified. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Hetter, R. D., & Sympson, J. B. (1997). Item exposure control in CAT-ASVAB. In W. A. Sands, B. K. Waters, & J. R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 141-144). Washington, DC: American Psychological Association.
Jensen, F. V. (2001). Bayesian networks and decision graphs. New York: Springer.
McBride, J. R., & Martin, J. T. (1983). Reliability and validity of adaptive ability tests in
a military setting. In D. J. Weiss (Ed.), New horizons in testing (pp. 223-236). San
Diego, CA: Academic Press.
Stocking, M. L., & Lewis, C. (1998). Controlling item exposure conditional on ability in
computerized adaptive testing. Journal of Educational and Behavioral Statistics, 23,
57-75.
Stocking, M. L., & Lewis, C. (2000). Methods of controlling the exposure of items in
CAT. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing:
Theory and practice (pp. 163-182). Boston: Kluwer.
Sympson, J. B., & Hetter, R. D. (1985, October). Controlling item-exposure rates in computerized adaptive testing. In Proceedings of the 27th annual meeting of the Military Testing Association (pp. 973-977). San Diego, CA: Navy Personnel Research and Development Center.
van der Linden, W. J. (2000). Constrained adaptive testing with shadow tests. In W. J.
van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and
practice (pp. 27-52). Boston: Kluwer.
van der Linden, W. J. (2003). Some alternatives to Sympson-Hetter item-exposure control in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 28, 249-265.
van der Linden, W. J. (2005). Linear models for optimal test design. New York: Springer.
van der Linden, W. J., Ariel, A., & Veldkamp, B. P. (2006). Assembling a CAT item
pool as a set of linear test forms. Journal of Educational and Behavioral Statistics, 31,
81-100.
van der Linden, W. J., & Reese, L. M. (1998). A model for optimal constrained adaptive
testing. Applied Psychological Measurement, 22, 259-270.
van der Linden, W. J., & Veldkamp, B. P. (2004). Constraining item exposure in computerized adaptive testing with shadow tests. Journal of Educational and Behavioral Statistics, 29, 273-291.
Authors
WIM J. VAN DER LINDEN is professor, Department of Research Methodology, Measurement, and Data Analysis, University of Twente, PO Box 217, 7500 AE Enschede, The Netherlands; [email protected]. His areas of interest are test theory, applied statistics, and research methods.
BERNARD P. VELDKAMP is assistant professor, Department of Research Methodology, Measurement, and Data Analysis, University of Twente, PO Box 217, 7500 AE Enschede, The Netherlands; [email protected]. His areas of interest are educational measurement and statistics.
Manuscript received June 21, 2005
Accepted March 17, 2006