
Conditional Item-Exposure Control in Adaptive Testing Using Item-Ineligibility Probabilities

Wim J. van der Linden

Bernard P. Veldkamp

University of Twente

Two conditional versions of the exposure-control method with item-ineligibility constraints for adaptive testing in van der Linden and Veldkamp (2004) are presented. The first version is for unconstrained item selection, the second for item selection with content constraints imposed by the shadow-test approach. In both versions, the exposure rates of the items are controlled using probabilities of item ineligibility given y that adapt the exposure rates automatically to a goal value for the items in the pool. In an extensive empirical study with an adaptive version of the Law School Admission Test, the authors show how the method can be used to drive conditional exposure rates below goal values as low as 0.025. Obviously, the price to be paid for minimal exposure rates is a decrease in the accuracy of the ability estimates. This trend is illustrated with empirical data.

Keywords: adaptive testing; conditional item-exposure control; item eligibility method; uniform exposure rates

Items in adaptive tests are selected as a solution to an optimization problem in

which an objective function is maximized over the item pool. If the test has to

meet a set of content specifications, the optimization becomes an instance of a

more complicated constrained combinatorial optimization problem. A popular

choice for the objective function in these problems is the value of the information function at the current ability estimate for the items in the pool.

Suppose the items have been calibrated using the three-parameter logistic

(3-PL) model

$$p_i(y) \equiv \Pr\{U_{ij} = 1\} \equiv c_i + (1 - c_i)\,\frac{\exp[a_i(y - b_i)]}{1 + \exp[a_i(y - b_i)]}, \qquad (1)$$

where y ∈ (−∞, ∞) is a parameter representing the ability of the test taker and b_i ∈ (−∞, ∞), a_i > 0, and c_i ∈ [0, 1] are the difficulty, discriminating power,

The authors are grateful to Wim M. M. Tielen for his computational assistance.


Journal of Educational and Behavioral Statistics

December 2007, Vol. 32, No. 4, pp. 398–418

DOI: 10.3102/1076998606298044

© 2007 AERA and ASA. http://jebs.aera.net

and guessing parameter for item i = 1, ..., N in the pool, respectively (Birnbaum, 1968). For this model, the item information function is

$$I_i(y) = \frac{[p_i'(y)]^2}{p_i(y)[1 - p_i(y)]}. \qquad (2)$$

Let g = 1, ..., n denote the items in the adaptive test and ŷ^(g−1) the estimate of y after the first g − 1 items. If the gth item is selected, the objective function that is maximized over the items in the pool is the information in Equation 2 at y = ŷ^(g−1).

Because we optimize, the items in the test tend to be picked only from a small

subset of the pool. This point is illustrated in the topmost curve in Figure 1,

which is composed of the segments of the information functions in Equation 2

that are locally best among the items in a pool for a section from the Law School

Admission Test (LSAT). We refer to this curve as the Level 1 dominance curve

for the item pool. The small subset of items on which this curve is based (in this

case, only 6 items from a pool of 305) dominates the other items in the pool

everywhere over the interval; consequently, they are always chosen. The basic

message from this curve is that unless special precautions are taken, the majority

of the items in the pool are bound to remain inactive.
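To make the dominance phenomenon concrete, the sketch below implements Equations 1 and 2 and applies maximum-information selection over an ability grid. The pool is synthetic (an assumption for illustration; it is not the LSAT pool used in the article), yet only a handful of locally dominant items are ever selected.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic 3-PL item pool (an illustrative assumption, not the LSAT pool).
N = 300
a = rng.lognormal(0.0, 0.3, N)      # discriminating power a_i > 0
b = rng.normal(0.0, 1.0, N)         # difficulty b_i
c = rng.uniform(0.10, 0.25, N)      # guessing parameter c_i

def p_3pl(y):
    """Response probability p_i(y), Equation 1."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (y - b)))

def info_3pl(y):
    """Item information I_i(y) = [p_i'(y)]^2 / (p_i(y)[1 - p_i(y)]), Equation 2."""
    p = p_3pl(y)
    s = (p - c) / (1.0 - c)                  # logistic part of the model
    dp = a * (1.0 - c) * s * (1.0 - s)       # derivative p_i'(y)
    return dp ** 2 / (p * (1.0 - p))

# Maximum-information selection: at every grid point only the locally
# dominant (Level 1) item wins, so most of the pool stays inactive.
grid = np.linspace(-3.0, 3.0, 121)
winners = {int(np.argmax(info_3pl(y))) for y in grid}
print(f"{len(winners)} of {N} items ever selected")
```

The set of winners is the Level 1 dominance set for this synthetic pool; its small size is the point of the dominance curves in Figure 1.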

If the adaptive test is administered in a continuous testing program, a set of

dominant items is easily memorized. If the stakes for the test takers are high, it

thus is possible for a few of them to plot and identify a critical portion of the

FIGURE 1. Dominance curves for Levels 1, 2, 3, 4, 5, 10, 20, ..., 100 for an item pool from the Law School Admission Test.
Note: Each curve is composed of the segments of the information functions of the items that dominate the other items at the y values.


item pool. Subsequent test takers are then able to familiarize themselves with these items

and increase their test scores.

Fortunately, selecting items below the Level 1 dominance set does not neces-

sarily involve a large loss. The second curve in Figure 1 shows the Level 2 domi-

nance curve for the same item pool, that is, the curve consisting of the segments

of the information functions of the locally second-best items in the pool. Because the

curves for the two levels hardly differ in height (except at the upper end of the

scale where the pool had a few strongly discriminating items), not much accuracy

of scoring would be lost if we relaxed the criterion for item selection somewhat

and selected items from both subsets. This fact has led to the idea of probabilistic

control of item exposure in adaptive testing.

The first probabilistic method was proposed by McBride and Martin (1983).

Their method simply consisted of picking the items randomly from the first five

levels of dominance along the y scale (i.e., the first five curves in Figure 1). The

fact that the method is probabilistic is important because it treats all test takers

at a given ability level equally fairly in the sense that each of them has the same

probability of getting each item.

A more advanced probabilistic method was proposed by Sympson and Hetter

(1985; see also Hetter & Sympson, 1997). This method, which will be discussed

in more detail in the next section, is based on a probabilistic experiment that

is conducted each time an item is selected. The outcome of this experiment is

either the decision to go on and administer the item or to pass and rerun the

experiment for the next best item in the pool. The conditional probabilities of

administration given the selection of an item are the control parameters used

to restrict the item-exposure rates. The values of these parameters are to be set

through an iterative process of simulated adaptive test administrations.

An alternative method of probabilistic exposure control was presented in van

der Linden and Veldkamp (2004). The probability experiment in this method

differs from that in the Sympson-Hetter method in the following three aspects.

First, the experiment is not conducted each time after an item is selected but only

once before a test taker begins the test. Second, the critical parameters in the

experiment are not the conditional probabilities of administering an item given it

has been selected but the probabilities of the items being eligible for the test taker.

If an item is eligible, it is available for administration to the test taker. If the item

is ineligible, it is removed from the pool for the test taker. Third, the probabilities

of ineligibility are used in a recurrence relation that allows them to adapt automa-

tically to appropriate levels during testing. The differences between the two meth-

ods will become precise when we discuss them in more detail in the following.

The present article serves three different goals. Our first goal is to generalize

the item-ineligibility method, which was developed for use with constrained item

selection through the shadow-test method, to any type of item selection in adap-

tive testing. The generalization makes the basic structure of the method trans-

parent and allows us to highlight some of its features. Our second goal is to


formulate a version of the method for conditional item-exposure control given y.

Conditional control is generally accepted as more effective because it reduces the

likelihood that test takers of approximately the same ability level detect the subset

of items in the pool specific to them (Stocking & Lewis, 1998). As will become

clear in the following, to formulate a conditional version of the method, we have

to reconceptualize the adaptive test as one conducted from multiple item pools.

Our final goal is to explore the behavior of the new versions of these methods

when the exposure rates of the items are driven to their minimum. This appears

to be possible provided we are willing to pay the price of a decrease in the accu-

racy of the ability estimates. The Level 10 through Level 100 curves in Figure 1

explain the decrease in accuracy: If the exposure rates are set lower and lower,

we are required to select items from the lower dominance levels in the pool and

eventually have to accept considerable loss of accuracy in testing. In addition, if

content constraints are imposed on the test, lowering the exposure rates may

eventually lead to overconstraining of the item selection, namely, the case where

no feasible test is left in the pool. Thus, if the conditional item-exposure rates are

chosen to be too low, the test becomes ineffective at some of the ability levels.

Sympson-Hetter Method

To highlight the differences with the ineligibility method in the next sections,

we briefly discuss the Sympson-Hetter (hereafter SH) method, which is based

on the following two events for each item in the pool:

Si: item i is selected;

Ai: item i is administered.

Because an item can never be administered without being selected, it always

holds that

$$A_i \subset S_i, \qquad (3)$$

for all i. Hence, for the conditional exposure rate of item i given y (i.e., probability P(A_i | y)), it follows that

$$P(A_i \mid y) = P(A_i, S_i \mid y) = P(A_i \mid S_i, y)\,P(S_i \mid y) \qquad (4)$$

for all possible values of y.

The SH method is used to force the item-exposure rates of all items in the

pool below an upper bound r_max. Typically, r_max is chosen to be in the range of

.20 to .30. From Equations 3 and 4 it follows that the bound is realized when

$$P(A_i \mid S_i, y)\,P(S_i \mid y) \le r_{\max} \qquad (5)$$

for i = 1, ..., N.


The probability of item selection, P(S_i | y), depends on a variety of factors,

including the distribution of the item parameters in the pool, the objective func-

tion that is optimized, and the initial item that is chosen. Because each of these

factors is fixed by design, the only parameters left in Equation 5 to manipulate

the exposure rates are the conditional probabilities P(A_i | S_i, y).

It is always possible to meet the bound in Equation 5 by setting the probabilities P(A_i | S_i, y) at low values for all items. However, implicit in Equation 4 is

the idea that the exposure rates for the best items in the pool should not be much lower than r_max because we do not want to lose them entirely. In other words, r_max should be viewed as a goal value that has to be approached from below rather than just an upper bound.

Values for P(A_i | S_i, y) that approach the goal value cannot be found by an

analytic method. Sympson and Hetter therefore proposed to find them using an

iterative process of computer simulations that has to be continued until admissi-

ble exposure rates are found. At each step in this process, a large set of simu-

lated test administrations is replicated at the y values for which the exposure

rates have to be controlled. At the end of the step, the new conditional exposure

rates of the items given these values are estimated and the control parameters

are adjusted. The fact that the control parameters are set at the true y values,

whereas in operational testing we control the exposure rates at estimated y values, is not much of a problem provided the set of y values is well chosen

(Stocking & Lewis, 2000).

Let t = 1, 2, ... denote the sets of simulations. The adjustment rule used in

the SH method is

$$P^{(t+1)}(A_i \mid S_i, y) := \begin{cases} 1 & \text{if } P^{(t)}(S_i \mid y) \le r_{\max},\\[2pt] r_{\max}/P^{(t)}(S_i \mid y) & \text{if } P^{(t)}(S_i \mid y) > r_{\max}, \end{cases} \qquad (6)$$

where i = 1, ..., N. The rule is based on the idea that if at step t an item was selected with a probability smaller than r_max, no control is needed. However, if an item was selected with a conditional probability larger than r_max, its control parameter should have been set such that P(A_i | y) = r_max. From Equation 4 it follows that this requirement would have been realized if P^(t)(A_i | S_i, y) had been equal to r_max / P^(t)(S_i | y). Hence, the new value of P^(t+1)(A_i | S_i, y) is set at this

level. For a more formal treatment of the SH method and some alternative meth-

ods based on variations of this adjustment rule, the reader should consult van

der Linden (2003).
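The adjustment process is easy to sketch. The simulation below is a deliberately simplified stand-in for the SH setup, not the authors' procedure: it assumes a single y value and a fixed attractiveness ranking in place of a full adaptive-test simulation. Items are offered in rank order, the probability experiment decides between administering and passing, and Equation 6 adjusts the control parameters between rounds of simulations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the SH process at a single y value (an assumption for
# illustration): the test walks down a fixed attractiveness ranking and
# administers the first n items that pass the probability experiment.
N, n, r_max = 40, 10, 0.25
P_adm = np.ones(N)                        # control parameters P(A_i | S_i, y)

def selection_rates(P_adm, reps=1000):
    """Monte Carlo estimate of P(S_i | y) under the current parameters."""
    selected = np.zeros(N)
    for _ in range(reps):
        administered = 0
        for i in range(N):                # item 0 is the most attractive
            selected[i] += 1              # item i is selected
            if rng.random() < P_adm[i]:   # experiment: administer or pass?
                administered += 1
                if administered == n:
                    break
    return selected / reps

# Iterative adjustment of the control parameters (Equation 6).
for _ in range(15):
    P_sel = selection_rates(P_adm)
    P_adm = np.where(P_sel <= r_max, 1.0, r_max / np.maximum(P_sel, r_max))

exposure = selection_rates(P_adm) * P_adm  # P(A_i | y), as in Equation 4
```

In runs like this, an occasional exposure rate still overshoots r_max between rounds, which mirrors the convergence problems with the adjustment rule discussed in the following.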

In practical settings, the use of the SH method has been found to be time con-

suming. The number of y values at which the exposure rates are controlled is

usually in the range of 10 to 12. The number of iterations of the adaptive test

simulations for one y value is generally of the same order. It is therefore not

uncommon to have to run 100 to 200 simulations before an admissible set of

exposure rates is found. In addition, if some of the items have to be replaced


during operational testing because they are compromised or appear to be flawed,

the values of the control parameters become invalid and the procedure has to be

repeated (Chang & Harris, 2002).

A more fundamental problem with the SH method is the fact that the item-

exposure rates do not necessarily converge to values below rmax during the

adjustment process. It can be regularly observed that the rates of some of the

overexposed items increase rather than decrease after adjustment. Also, rates

that were below rmax for some steps may suddenly jump back to a value larger

than this target. Because of this behavior, it is necessary to inspect the exposure

rates of all items after each step and use personal judgment to decide when to

stop.

Conditional Item-Ineligibility Methods

We first discuss the case of adaptive testing without content constraints. The

modifications of the method necessary to deal with adaptive testing with content

constraints are introduced in a later section.

Testing Without Content Constraints

To formulate the item-ineligibility method we consider the following two

events:

Ei: item i is eligible;

Ai: item i is administered.

If the item is eligible, it remains in the pool during the entire test for the test

taker; otherwise it is removed. Unlike the SH method, it is not necessary to

allow for an event S_i of selecting item i: An item is always administered if it is selected. More formally, it holds that A_i = S_i, and we need not consider the latter.

Analogous to Equation 3,

$$A_i \subset E_i \qquad (7)$$

for all i. We are therefore able to write

$$P(A_i \mid y) = P(A_i, E_i \mid y) = P(A_i \mid E_i, y)\,P(E_i \mid y) \qquad (8)$$

for all possible values of y.

If we impose r_max as a goal value for the exposure rates P(A_i | y), we obtain

$$P(A_i \mid y) = P(A_i \mid E_i, y)\,P(E_i \mid y) \le r_{\max}, \qquad (9)$$


or

$$P(E_i \mid y) \le \frac{r_{\max}}{P(A_i \mid E_i, y)}, \qquad (10)$$

with P(A_i | E_i, y) > 0. From Equation 7,

$$P(A_i \mid E_i, y) = \frac{P(A_i \mid y)}{P(E_i \mid y)}, \qquad (11)$$

with P(E_i | y) > 0. Hence, Equation 10 can be rewritten as

$$P(E_i \mid y) \le \frac{r_{\max}}{P(A_i \mid y)}\,P(E_i \mid y), \qquad (12)$$

still with P(A_i | y) > 0.

The basic idea is to conceive of Equation 12 as a recurrence relation. Suppose

j test takers have already taken the test, and we want to establish the probabilities of eligibility for test taker j + 1. If r_max is our goal value for the exposure rates at selected points y_k, k = 1, ..., K, the probabilities of eligibility P^(j+1)(E_i | y_k) can be calculated as

$$P^{(j+1)}(E_i \mid y_k) = \min\left\{\frac{r_{\max}}{P^{(j)}(A_i \mid y_k)}\,P^{(j)}(E_i \mid y_k),\; 1\right\}, \qquad (13)$$

with P^(j)(A_i | y_k) > 0.

The rationale for the recurrence relation in Equation 13 is that it automatically

maintains r_max as a goal value for the exposure rates. It is easy to show that

$$\begin{aligned}
&\text{if } P^{(j)}(A_i \mid y_k) > r_{\max},\ \text{then } P^{(j+1)}(E_i \mid y_k) < P^{(j)}(E_i \mid y_k);\\
&\text{if } P^{(j)}(A_i \mid y_k) = r_{\max},\ \text{then } P^{(j+1)}(E_i \mid y_k) = P^{(j)}(E_i \mid y_k);\\
&\text{if } P^{(j)}(A_i \mid y_k) < r_{\max},\ \text{then } P^{(j+1)}(E_i \mid y_k) > P^{(j)}(E_i \mid y_k).
\end{aligned} \qquad (14)$$

Thus, if an exposure rate is larger than the goal value, the probability of eligibility of the item always goes down. As a result, because of Equation 7, the expected exposure rate also goes down. On the other hand, if an exposure rate is below r_max, the probability of eligibility of the item goes up.
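The three cases in Equation 14 can be checked directly from the recurrence (the numbers below are arbitrary illustrations):

```python
# A direct check of the adaptive behavior in Equation 14 using the
# recurrence of Equation 13 (arbitrary illustrative numbers).
r_max = 0.20

def update_eligibility(P_A, P_E):
    """Equation 13: next eligibility probability given the current exposure rate."""
    return min((r_max / P_A) * P_E, 1.0)

P_E = 0.8
assert update_eligibility(0.30, P_E) < P_E   # overexposed  -> eligibility down
assert update_eligibility(0.20, P_E) == P_E  # at the goal  -> unchanged
assert update_eligibility(0.10, P_E) > P_E   # underexposed -> eligibility up
```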

Practical Implementation

To implement the method for exposure control conditional on a set of values

y_k, k = 1, ..., K, we have to abandon the idea of an adaptive test from a single

item pool. Before a person takes the test, the current probabilities of eligibility

are used to decide which items are eligible for each of the values y_k. The result


of these experiments is K different versions of the item pool, one at each y_k. During the test, the person visits the version of the item pool that is closest to his or her current ability estimate.

To estimate the probabilities of eligibility, we have to record the following

two counts:

a_ijk: number of test takers through j who visited item pool k and took item i;

e_ijk: number of test takers through j who visited item pool k when item i was eligible.

P^(j)(A_i | y_k) and P^(j)(E_i | y_k) can then be estimated as a_ijk/j and e_ijk/j, and the estimated probability of eligibility for test taker j + 1 in Equation 13 is obtained as

$$\hat{P}^{(j+1)}(E_i \mid y_k) = \min\left\{\frac{r_{\max}\,e_{ijk}}{a_{ijk}},\; 1\right\}, \qquad (15)$$

with a_ijk > 0. The estimates in Equation 15 ignore the differences between the test taker's true and estimated ability. Just as for the SH method, we expect the

impact of estimation error on the actual exposure rates to be negligible for all

practical purposes (Stocking & Lewis, 2000). The simulation results presented

later in this article confirm this expectation.
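The count-based update can be sketched as follows for a single virtual item pool at one y_k. The selection rule is a toy stand-in (administer the n most informative eligible items), not an actual adaptive test; everything else follows Equation 15.

```python
import numpy as np

rng = np.random.default_rng(3)

# One virtual item pool at a single y_k; the "adaptive test" is a toy rule
# (an assumption): administer the n most informative eligible items.
N, n, r_max = 50, 10, 0.25
J = 2000                              # simulated test takers
info = np.sort(rng.random(N))[::-1]   # item 0 is most informative at y_k
P_E = np.ones(N)                      # current eligibility probabilities
a_count = np.zeros(N)                 # a_ijk: administrations of item i
e_count = np.zeros(N)                 # e_ijk: times item i was eligible

for j in range(J):
    eligible = rng.random(N) < P_E    # eligibility experiment before the test
    e_count += eligible
    ranked = [i for i in np.argsort(-info) if eligible[i]]
    a_count[ranked[:n]] += 1          # administer the n best eligible items
    # Equation 15; eligibility stays at 1 until the item has been administered
    P_E = np.where(a_count > 0,
                   np.minimum(r_max * e_count / np.maximum(a_count, 1.0), 1.0),
                   1.0)

exposure = a_count / J
```

In such runs the exposure rates of the dominant items settle near the goal value r_max without any prior simulation study, which is the "on the fly" implementation mentioned above.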

Two different implementations of the method are possible:

1. The method can be used to control the exposure rates on the fly, that is, without

any prior adjustment of the eligibility probabilities.

2. The probabilities can be adjusted prior to operational testing through a computer

simulation of administrations of the test.

In the previous study with the unconditional version of the method (van der

Linden & Veldkamp, 2004), we were able to report that for a typical goal value

of r_max = .25, both the probabilities of eligibility and the exposure rates were

already stable after 1,000 test administrations. In the empirical study reported

later in this article, we minimized the goal values and found that for values close

to their minimum (see Equation 25 in the following), the method should be used

in combination with the technique of fading to get stability after the same number

of administrations. (The technique of fading is explained in a following section.)

Whatever implementation is used, it is always possible to deal on the fly with the replacement of a few items in the operational pool, for instance, because they appear to be compromised.

Testing With Content Constraints

Usually, content constraints are to be imposed on the adaptive test. If so, the

shadow-test approach offers an effective implementation. Shadow tests are full-size tests calculated prior to the selection of the items that (a) are optimal at the


last ability estimate, (b) meet all content constraints, and (c) include all items

already administered to the current person. The next item to be administered is

the best item among the free items in the current shadow test. Shadow tests can

be easily assembled using the technique of 0-1 integer programming (van der

Linden, 2000, 2005; van der Linden & Reese, 1998).

A natural way to implement the control in adaptive testing with content con-

straints is through the inclusion of ineligibility constraints in the models for the

shadow tests. If the decision is that item i is ineligible for the test taker, the following constraint is added to the model:

$$x_i = 0, \qquad (16)$$

where x_i is the 0-1 decision variable for item i in the model (that is, item i is selected if x_i = 1 but not if x_i = 0). If item i remains eligible, no constraint is added.
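The role of the ineligibility constraints is easiest to see in a miniature shadow-test model. The sketch below is an illustration only: the pool, information values, content constraint, and test length are invented, and a brute-force search stands in for the 0-1 integer programming solver used in practice.

```python
from itertools import combinations

# Miniature shadow-test model (an invented example): 8 items with an
# information value at the current ability estimate and a content category.
info     = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
category = ["A", "A", "A", "B", "B", "B", "B", "A"]
administered = {3}          # items the current test taker has already seen
ineligible   = {0}          # ineligibility constraint: x_0 = 0 (Equation 16)
n = 4                       # shadow-test length

def assemble_shadow_test():
    best, best_value = None, -1.0
    for test in combinations(range(len(info)), n):
        test = set(test)
        if not administered <= test:                   # (c) keep administered items
            continue
        if test & ineligible:                          # Equation 16: x_i = 0
            continue
        if sum(category[i] == "A" for i in test) != 2: # (b) content constraint
            continue
        value = sum(info[i] for i in test)             # (a) optimal at the estimate
        if value > best_value:
            best, best_value = test, value
    return best

shadow = assemble_shadow_test()                 # → {1, 2, 3, 4}
# The next item to administer is the best free item in the shadow test.
next_item = max(shadow - administered, key=lambda i: info[i])
```

A production system would pass the same constraints, including Equation 16, to a 0-1 integer programming solver; the brute force here only serves to make the constraint structure concrete.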

The extension of the test-assembly model for the shadow test with these ineligibility constraints gives rise to two new issues. The first is potential overconstraining of the item selection. In principle, it is possible that a temporary combination

of content and ineligibility constraints in the model is unfortunate and no feasible

solution is left. Generally, however, for a typical testing program, the number of

ineligibility constraints in the model is small, and the likelihood of an infeasible

solution can be ignored. In fact, in our empirical studies with the unconditional

version of this method, infeasibility never occurred (van der Linden & Veldkamp,

2004). We expect the same to happen with the conditional version of the method

except when the exposure rates are driven to their minimum (see the next section).

Also, the likelihood of an infeasibility depends entirely on the appropriateness of

the item pool for the test. A method of item-pool assembly that guarantees a

balanced distribution of the items in the pool with respect to the content con-

straints is presented in van der Linden, Ariel, and Veldkamp (2006). For methods

of item-pool design that guarantee such distributions, see van der Linden (2005).

If infeasibility occurs, a straightforward solution is to remove all ineligibility

constraints from the model and use the full pool for item selection. This measure

may lead to an occasional extra exposure for some of the items, but the adaptive

mechanism in Equation 14 automatically corrects for them.

The second issue has to do with the fact that the use of shadow tests reinvokes

the distinction between item selection and administration on which the SH

method rests. An item can now be selected for the shadow test but may not be

administered because it was dominated by the other free items in the test.

To deal with these new issues, we distinguish the following possible events:

Ei: item i is eligible;

F: the model for the shadow test with the ineligibility constraints is feasible;

Si: item i is selected in a shadow test;

Ai: item i is administered.


It holds for these four events that

$$A_i \subset S_i \subset \{E_i \cup \bar{F}\}, \qquad (17)$$

where F̄ is the event of an infeasible shadow test. Following the same argument

as in van der Linden and Veldkamp (2004, Equations 8 through 13), an upper bound r_max on the conditional exposure rates P(A_i | y) can be shown to lead to

$$P(E_i \mid y) \le 1 - \frac{1}{P(F \mid y)} + \frac{r_{\max}\,P(E_i \cup \bar{F} \mid y)}{P(A_i \mid y)\,P(F \mid y)}, \qquad (18)$$

with P(A_i | y) > 0 and P(F | y) > 0.

This inequality implies the following version of Equation 13 for the case of

adaptive testing with content constraints:

$$P^{(j+1)}(E_i \mid y_k) = \min\left\{1 - \frac{1}{P^{(j)}(F \mid y_k)} + \frac{r_{\max}\,P^{(j)}(E_i \cup \bar{F} \mid y_k)}{P^{(j)}(A_i \mid y_k)\,P^{(j)}(F \mid y_k)},\; 1\right\}, \qquad (19)$$

still with P^(j)(A_i | E_i, y_k) > 0 and P^(j)(F | y_k) > 0.

This equation looks more complicated than Equation 13, but the relation between the two becomes clear if we consider the case that the shadow test is always feasible and set P^(j)(F | y_k) = 1. It then holds that P^(j)(E_i ∪ F̄ | y_k) = P^(j)(E_i | y_k), and Equation 13 directly follows from Equation 19. In addition,

Equations 13 and 19 have the same adaptive behavior. The only difference

between the two relations is a correction of the probability of item eligibility for

possible infeasibility in Equation 19. For these and other details, see van der

Linden and Veldkamp (2004).

To estimate the right-hand probabilities in Equation 19, we need the following

counts:

n_jk: number of test takers through j who visited item pool k;

a_ijk: number of test takers through j who visited item pool k and took item i;

φ_jk: number of test takers through j who visited item pool k when the test was feasible;

ρ_ijk: number of test takers through j who visited item pool k when item i was eligible or the test was infeasible.

The left-hand probabilities are then estimated as

$$\hat{P}^{(j+1)}(E_i \mid y_k) = \min\left\{1 - \frac{n_{jk}}{\varphi_{jk}} + \frac{r_{\max}\,n_{jk}\,\rho_{ijk}}{a_{ijk}\,\varphi_{jk}},\; 1\right\}, \qquad (20)$$

with a_ijk > 0 and φ_jk > 0.


To begin the test, it is recommended to set P̂^(j+1)(E_i | y_k) = 1 for item i until both a_ijk > 0 and φ_jk > 0. This initialization prevents indeterminate values while the conditions in Equation 20 are not yet satisfied.

Fading

In van der Linden and Veldkamp (2004), it is recommended to update the

counts using the technique of fading, which is used in applications of Bayesian

networks for updating posterior probabilities. The basic idea underlying this

technique is to weigh the old data by a fading factor when new data are added.

As a result, the effective size of the sample remains fixed at a predetermined

level (Jensen, 2001), and the probabilities of eligibility have the same level of

stability throughout operational testing. For a demonstration of the effectiveness

of fading for updating eligibility probabilities, see van der Linden and Veldkamp (2004).

Suppose we use a fading factor w. The number of test takers n_jk who visited item pool k in Equation 20 is no longer a direct count but a number updated as

$$n^{*}_{(j+1)k} = w\,n^{*}_{jk} + 1. \qquad (21)$$

The updates of the other counts are analogous. For example, the number of test

takers who visited pool k and received item i is updated as

$$a^{*}_{(j+1)ik} = \begin{cases} w\,a^{*}_{ijk} + 1, & \text{if item } i \text{ was administered to test taker } j;\\[2pt] w\,a^{*}_{ijk}, & \text{otherwise.} \end{cases} \qquad (22)$$

These updates produce estimates of the probabilities based on a moving win-

dow with weights close to one for recent test takers but approaching zero for ear-

lier test takers. The method can be shown to have estimates based on an effective sample size equal to 1/(1 − w) (Jensen, 2001). In the following empirical study, we used w = .999, which amounts to an effective sample size of 1,000.
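A minimal sketch of the fading update in Equation 21 (the fading factor and loop length are illustrative choices): a count that increments for every test taker converges to the effective sample size 1/(1 − w).

```python
# Fading sketch (Equations 21 and 22): weigh the old count by w before
# adding new data, so estimates rest on an effective sample of 1/(1 - w).
w = 0.999

def fade(count, increment):
    """One faded update: w * old count + (1 if the event occurred else 0)."""
    return w * count + increment

# A count that increments for every test taker approaches 1/(1 - w) = 1000.
n_star = 0.0
for _ in range(20000):
    n_star = fade(n_star, 1)
print(round(n_star))   # prints 1000
```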

The use of fading is particularly important when the goal value for the expo-

sure rates is set close to its minimum (see the next section). Typically, the prob-

abilities of eligibility for the item pool go down to a value smaller than 1 in an

order determined by the level of dominance of the items (see Figure 1). When

the goal value approaches its minimum, the process has to be continued until the

last items in the pool are reached. But when these items become active, the numbers of test takers that have visited the item pools, n_jk, have already grown large. As a result, Equation 20 adapts only slowly to the changes in the counts of the item administrations, a_ijk, which had been close to zero thus far. The technique

of fading prevents this slower adaptation because the weighted counts in

Equation 20 are based on the same effective number of test takers for later items

as for earlier ones.


Minimizing Exposure Rates

It is tempting to explore how low rmax can be set before the method breaks

down. Unfortunately, although we are able to discuss a useful lower bound on

the marginal exposure rates, exact bounds for conditional rates are hard to find.

Marginal Rates

For the marginal exposure rates, the following relation holds for any population of test takers:

$$\sum_{i=1}^{N} P(A_i) = n, \qquad (23)$$

where n is the length of the adaptive test.

The relation was presented without formal proof in van der Linden (2003).

However, a straightforward proof runs as follows: Let x_ij be an indicator variable equal to 1 if examinee j takes item i and equal to 0 otherwise. Then, for a population of J test takers,

$$\sum_{j=1}^{J} x_{ij}/J$$

is the empirical marginal exposure rate of item i and

$$\sum_{i=1}^{N} x_{ij} = n$$

is the common length of the test. Hence, the sum of the exposure rates can be

written as

$$\sum_{i=1}^{N} P(A_i) = \sum_{i=1}^{N} \mathcal{E}\!\left(\sum_{j=1}^{J} x_{ij}/J\right) = \sum_{j=1}^{J} \mathcal{E}\!\left(\sum_{i=1}^{N} x_{ij}\right)\!\Big/J = n, \qquad (24)$$

where $\mathcal{E}$ denotes the expectation over replicated test administrations. The transition from the second to the third expression in Equation 24 is valid because the test length is supposed to be the same for all j.
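Equation 23 is easy to confirm empirically: whatever the selection rule, the marginal exposure rates of a fixed-length test sum to n. A minimal sketch (with an arbitrary random selection rule standing in for adaptive item selection):

```python
import numpy as np

rng = np.random.default_rng(5)

# Empirical check of Equation 23: for a fixed-length test, the marginal
# exposure rates always sum to the test length n, however items are chosen.
N, n, J = 100, 20, 500
x = np.zeros((N, J))                    # x_ij = 1 if examinee j takes item i
for j in range(J):
    items = rng.choice(N, size=n, replace=False)   # any selection rule works
    x[items, j] = 1

exposure = x.sum(axis=1) / J            # empirical P(A_i)
print(exposure.sum())                   # sums to n = 20 (up to rounding)
```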


From a security point of view, it would be ideal if the exposure rates of all

items were distributed uniformly with a common low value. From Equation 23,

it follows that this type of distribution is only possible for

$$P(A_i) = nN^{-1}, \quad \text{for all } i. \qquad (25)$$

To illustrate the capability of the current method to realize a uniform distribu-

tion of exposure rates, we simulated adaptive administrations of a test of 25 items

from a pool of 305 items. The pool and the test are described in the section on the

empirical study that follows. For this combination of test length and pool size,

uniform exposure rates are only possible with P(A_i) = 25/305 = .0814 for all items. We simulated 10,000 administrations of the test to random test takers at y = −2.0, −1.5, ..., 2.0 for each of the goal values r_max = .25, .20, .15, .10, and .0814. (To make the results comparable, the range of y values was chosen to be identical to that in the main study that follows.)

The resulting exposure rates for the items are shown in Figure 2. The lower

the goal value, the lower the maximum exposure rate and the larger the portion

of the item pool that became active in the test. The curve for r_max = .0814 shows a uniform distribution except for minor disturbances at its extremes due to the probabilistic nature of the method. It is interesting to observe the difference between the curves for r_max = .10 and r_max = .0814. Although the difference between the two goal values seems negligible, for the former, some 50 items were still inactive, whereas all items became active for the latter.

Conditional Rates

For conditional exposure rates, the version of Equation 23 is more compli-

cated. In this case, we have a different set of rates for each item pool at y_k. In

addition, the number of test takers who visit a pool is no longer fixed but ran-

dom. Consequently, to derive an expression for the sum of the exposure rates,

we have to weigh the sums of the rates for the individual pools by the probabil-

ity of a visit to the pool. Because the probability of a visit depends on how far

the test taker is in the test, we arrive at the following version of Equation 23:

\sum_{k=1}^{K} \sum_{g=1}^{n} \sum_{i=1}^{N} P_g(A_i | \theta_k) P_{gk} = n, (26)

where g = 1, ..., n still indexes the items in the adaptive test, P_{gk} is the probability of a test taker visiting the item pool at θ_k for the selection of item g, and P_g(A_i | θ_k) is the probability of administering item i during this visit.
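Equation 26 is an accounting identity: for each selection g the visit probabilities P_{gk} sum to 1 over the pools, and during each visit exactly one item is administered, so the conditional administration probabilities sum to 1 over the pool. The triple sum therefore counts exactly the n items administered. A toy numeric check with randomly generated probabilities (purely illustrative, not estimates from the study):

```python
import random

random.seed(1)
K, n, N = 9, 25, 305  # ability levels, test length, pool size

# P_gk: probability of visiting the pool at theta_k for the selection of
# item g; each selection happens at exactly one pool, so rows sum to 1 over k.
P_gk = [[random.random() for _ in range(K)] for _ in range(n)]
P_gk = [[p / sum(row) for p in row] for row in P_gk]

# P_g(A_i | theta_k): probability of administering item i during visit (g, k);
# exactly one item is administered per visit, so these sum to 1 over i.
P_gA = [[[random.random() for _ in range(N)] for _ in range(K)] for _ in range(n)]
P_gA = [[[p / sum(pool) for p in pool] for pool in pools_g] for pools_g in P_gA]

# Equation 26: the weighted triple sum equals the test length n.
total = sum(P_gA[g][k][i] * P_gk[g][k]
            for k in range(K) for g in range(n) for i in range(N))
assert abs(total - n) < 1e-9
```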

Because all probabilities in Equation 26 are dependent on each other, it is

impossible to use this relation for deriving lower bounds on the conditional

item-exposure rates. This conclusion holds a fortiori for adaptive testing with


item content constraints. We therefore took an empirical approach and explored what happened to the conditional exposure rates when the goal values r_max were systematically decreased in a series of simulated test administrations.

Empirical Study

An adaptive version of a section of the LSAT was simulated for 10,000 test takers at θ_k = −2.0, −1.5, ..., 2.0. The test consisted of 25 discrete items, whereas the pool had a size of 305 items.

The following versions of the test were simulated:

1. Tests with and without content constraints.

2. Tests with conditional item-exposure control at θ_k = −2.0, −1.5, ..., 2.0 with the goal values r_max = 1 (no control), .20, .15, .10, .05, and .025.

The content constraints were the usual constraints for the paper-and-pencil

version of the LSAT section; they dealt with such attributes as item type and

content, answer key distribution, word counts, and gender/minority orientation

of the items. The constraints were imposed on the adaptive test using the shadow-test approach. The total number of constraints was equal to 30. The last two levels of exposure control were lower than the bound of .0814 calculated from Equation 25 for the marginal exposure control in the preceding section; they were therefore expected to be critical.

FIGURE 2. Estimated marginal exposure rates of the items in the pool for the goal values r_max = .25, .20, .15, .10, and .0814. Note: For each curve, the items on the horizontal axis are ordered by their exposure rates.

The test administrations were simulated with the maximum-information criterion for item selection in Equation 2. The initial ability estimate was set equal to \hat{\theta}^{(0)} = 0 for all test takers. The interim estimates of θ were calculated using the method of expected a posteriori (EAP) estimation with a noninformative prior.
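The selection and estimation steps just described can be sketched as follows. The 3PL response function is Equation 1, the information function is the standard 3PL form, and the EAP estimate is computed with a uniform (noninformative) prior on a grid. All function names are ours, and the sketch omits the shadow-test machinery:

```python
import math

def p3pl(theta, a, b, c):
    """3PL probability of a correct response (Equation 1)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def info(theta, a, b, c):
    """Fisher information of a 3PL item at theta."""
    p = p3pl(theta, a, b, c)
    return a ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def select_max_info(theta, items, eligible):
    """Maximum-information criterion: the most informative eligible item."""
    return max(eligible, key=lambda i: info(theta, *items[i]))

def eap(responses, items_taken):
    """EAP estimate of theta with a uniform prior on a grid over [-4, 4]."""
    grid = [g / 10.0 for g in range(-40, 41)]
    post = []
    for t in grid:
        like = 1.0
        for u, (a, b, c) in zip(responses, items_taken):
            p = p3pl(t, a, b, c)
            like *= p if u == 1 else 1.0 - p
        post.append(like)
    norm = sum(post)
    return sum(t * w for t, w in zip(grid, post)) / norm
```

For example, with items parameterized as (a, b, c) tuples, `select_max_info` at θ = 0 picks an item with difficulty near 0, and a string of correct answers pulls the EAP estimate upward.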

During the simulations, we recorded (a) the actual exposure rates and (b) the errors in the ability estimates at each θ_k.

The exposure rates for adaptive testing without the content constraints are shown in Figure 3. The four panels in this figure are for θ_k = −1.5, −0.5, 0.5, and 1.5. (The results at the other values of θ_k fitted the patterns in these panels exactly and are omitted for space.) Without exposure control, the maximum exposure rates tended to be close to 1.0. The largest rate was .99, observed at θ_k = 0.5. For the conditions with control, the maximum rate decreased systematically with r_max. The lowest goal value of r_max = .025 had a maximum rate of .04, observed for a few items at each θ_k. The number of items that were still inactive for this goal value varied from 10 to 15 items at θ_k = −1.5 and 1.5, to 60 to 70 items at θ_k = −0.5 and 0.5. These results suggest that for the current item pool, r_max = .025 must have been close to the minimum value possible at the two outermost values of θ_k but that a lower value might have been possible at the θ_k values in the middle of the scale.

FIGURE 3. Estimated conditional exposure rates of the items given θ = −1.5, −0.5, 0.5, and 2.0 for adaptive testing without the content constraints. Note: Each curve is for a different goal value r_max = .25, .20, .15, .10, .05, and .025. For each curve, the items on the horizontal axis are ordered by their exposure rates. The maximum values for the curves decrease with r_max.

FIGURE 4. Estimated conditional exposure rates of the items given θ = −1.5, −0.5, 0.5, and 2.0 for adaptive testing with the content constraints. Note: Each curve is for a different goal value r_max = .25, .20, .15, .10, .05, and .025. For each curve, the items on the horizontal axis are ordered by their exposure rates. The maximum values for the curves decrease with r_max. The only exception is the maximum value for r_max = .025, which has moved to the second position for θ = −1.5 and θ = 1.5.


The exposure rates for adaptive testing with the content constraints in Figure 4 show the same general pattern. The only exceptions are the rates for r_max = .025 at θ_k = −1.5 and 1.5. The curves for this goal value jumped to the second position (that is, directly after the condition without control). The reason for this jump was that the goal value had become much too low to produce a feasible shadow test at each of the values θ_k. Table 1 shows the percentages of feasible shadow tests for r_max = .025 during the simulations. Although infeasibility was hardly a problem at θ_k = 0, for the more extreme values of θ_k the percentage of feasibility quickly decreased, to 7.1 (θ_k = −2.0) and 16.4 (θ_k = 2.0). Recall that when infeasibility occurred, the items were returned to the pool. For the higher values of r_max, replacement was an occasional random event that was automatically corrected for by the adaptive mechanism in Equation 14. However, for r_max = .025 we appeared to have dropped below what was possible at the majority of the θ_k values, and the method was no longer able to cope with the number of replacements.
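The interplay between the ineligibility draws, shadow-test infeasibility, and the self-correcting update can be illustrated with a deliberately simplified simulation. The assembly routine below is a toy stand-in for the constrained shadow-test solver, and the multiplicative update is only in the spirit of the adaptive mechanism in Equation 14, not its exact form; all names and numbers are ours:

```python
import random

def assemble(eligible, n):
    """Toy stand-in for shadow-test assembly: feasible iff at least n
    eligible items; deterministically 'prefers' low-index items."""
    return sorted(eligible)[:n] if len(eligible) >= n else None

def simulate(r_max, n=5, N=40, takers=2000, seed=7):
    random.seed(seed)
    prob_E = [1.0] * N      # item-eligibility probabilities
    counts = [0] * N        # administration counts
    for t in range(1, takers + 1):
        eligible = [i for i in range(N) if random.random() < prob_E[i]]
        test = assemble(eligible, n)
        if test is None:    # infeasibility: return all items to the pool
            test = assemble(list(range(N)), n)
        for i in test:
            counts[i] += 1
        # Multiplicative update in the spirit of Equation 14: eligibility
        # shrinks when an item's exposure rate exceeds the goal value and
        # recovers automatically when it falls below it.
        for i in range(N):
            rate = counts[i] / t
            if rate > 0:
                prob_E[i] = min(1.0, (r_max / rate) * prob_E[i])
    return [c / takers for c in counts]

controlled = simulate(r_max=0.2)
uncontrolled = simulate(r_max=1.0)
assert max(controlled) < max(uncontrolled)  # control lowers the peak rate
```

With no control the most 'attractive' item is administered to everyone, whereas under control its rate settles near the goal value while previously inactive items pick up the difference.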

It is interesting to note that for the combination of r_max = .025 and adaptive testing without the content constraints in Figure 3, only 2 of the 9 × 10,000 simulated administrations resulted in an infeasible shadow test. The reason r_max = .025 was too low for adaptive testing with the content constraints was thus not that the number of eligible items in the pools was smaller than the length of the test but that none of the possible combinations of eligible items satisfied the set of content constraints for the test.

We also recorded the errors in the ability estimates during the test. Figures 5 and 6 show the estimated bias and mean square error (MSE) functions calculated from these errors. As expected, the curves were generally ordered in the values of r_max; for smaller goal values, the curves were farther from the horizontal axis.
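The bias and MSE functions plotted in Figures 5 and 6 are direct summaries of the recorded errors: at each θ_k they are the average error and the average squared error over the simulated test takers. A minimal sketch (function name is ours):

```python
def bias_and_mse(theta_true, estimates):
    """Estimated bias and MSE of the ability estimates at one value of theta."""
    errors = [est - theta_true for est in estimates]
    bias = sum(errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return bias, mse

# Errors symmetric around the true value give zero bias but positive MSE.
b, m = bias_and_mse(0.5, [0.3, 0.5, 0.7])
assert abs(b) < 1e-12
assert abs(m - 0.08 / 3) < 1e-12
```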

It should be observed that for r_max = .025, the errors were smaller for adaptive testing with the content constraints than without them. This finding may be surprising because the addition of content constraints to adaptive testing generally leaves less room for optimizing the item selection and consequently produces larger errors. However, the finding is explained by the fact, discussed earlier, that r_max = .025 was too low to satisfy the content constraints at the majority of the θ_k values. Because the algorithm frequently had to return all the items to the pool, the exposure rates of the more dominant items went up and the accuracy of the ability estimation improved.

TABLE 1
Percentage of Feasible Shadow Tests During Adaptive Testing With Content Constraints for r_max = .025

θ_k         −2.0  −1.5  −1.0  −0.5   0     0.5   1.0   1.5   2.0
% Feasible   7.1  14.5   9.8  87.2  99.9  57.3  29.0  12.1  16.4


Practical Conclusion

For a high-stakes testing program, we expect item security to be the number one priority. On the other hand, the accuracy of the ability estimates cannot be sacrificed too much. The empirical study in this article was based on only one combination of test and item pool. It is therefore dangerous to generalize. But if we had to choose a goal value for the conditional exposure rates for this combination, our choice would have been r_max = .15 or .10. The bias and MSE curves for the two lower levels of .05 and .025 in Figures 5 and 6 are much less favorable. Such levels may only become acceptable if the item pool is made larger and/or the test shorter (see the lower bound on the exposure rates in Equation 25).

FIGURE 5. Estimated conditional bias and MSE functions for adaptive testing without content constraints for r_max = .25, .20, .15, .10, .05, and .025. Note: The curves in both panels are ordered in r_max. MSE = mean square error.
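The lower bound in Equation 25 can be inverted to make "a larger pool and/or a shorter test" concrete: a goal value r_max is attainable only when r_max ≥ n/N, that is, when N ≥ n/r_max. A quick calculation for the goal values used here (function name is ours):

```python
import math

def min_pool_size(test_length: int, r_max: float) -> int:
    """Smallest pool size N satisfying r_max >= n/N (from Equation 25)."""
    return math.ceil(test_length / r_max)

# For the 25-item test, goal values of .05 and .025 would require pools of
# at least 500 and 1,000 items, far beyond the 305 items available here.
assert min_pool_size(25, 0.05) == 500
assert min_pool_size(25, 0.025) == 1000
```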

FIGURE 6. Estimated conditional bias and MSE functions for adaptive testing with content constraints for r_max = .25, .20, .15, .10, .05, and .025. Note: The curves in both panels are ordered in r_max. MSE = mean square error.


References

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397-479). Reading, MA: Addison-Wesley.

Chang, S.-W., & Harris, D. J. (2002, April). Redeveloping the exposure control parameters of CAT items when a pool is modified. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Hetter, R. R., & Sympson, J. B. (1997). Item-exposure in CAT-ASVAB. In W. A. Sands, J. R. Waters, & J. R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 141-144). Washington, DC: American Psychological Association.

Jensen, F. V. (2001). Bayesian networks and graphs. New York: Springer.

McBride, J. R., & Martin, J. T. (1983). Reliability and validity of adaptive ability tests in a military setting. In D. J. Weiss (Ed.), New horizons in testing (pp. 223-236). San Diego, CA: Academic Press.

Stocking, M. L., & Lewis, C. (1998). Controlling item exposure conditional on ability in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 23, 57-75.

Stocking, M. L., & Lewis, C. (2000). Methods of controlling the exposure of items in CAT. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 163-182). Boston: Kluwer.

Sympson, J. B., & Hetter, R. D. (1985, October). Controlling item-exposure rates in computerized adaptive testing. In Proceedings of the 27th annual meeting of the Military Testing Association (pp. 973-977). San Diego, CA: Navy Personnel Research and Development Center.

van der Linden, W. J. (2000). Constrained adaptive testing with shadow tests. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 27-52). Boston: Kluwer.

van der Linden, W. J. (2003). Some alternatives to Sympson-Hetter item-exposure control in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 28, 249-265.

van der Linden, W. J. (2005). Linear models for optimal test design. New York: Springer.

van der Linden, W. J., Ariel, A., & Veldkamp, B. P. (2006). Assembling a CAT item pool as a set of linear test forms. Journal of Educational and Behavioral Statistics, 31, 81-100.

van der Linden, W. J., & Reese, L. M. (1998). A model for optimal constrained adaptive testing. Applied Psychological Measurement, 22, 259-270.

van der Linden, W. J., & Veldkamp, B. P. (2004). Constraining item exposure in computerized adaptive testing with shadow tests. Journal of Educational and Behavioral Statistics, 29, 273-291.

Authors

WIM J. VAN DER LINDEN is professor, Department of Research Methodology, Measurement, and Data Analysis, University of Twente, PO Box 217, 7500 AE Enschede, The Netherlands; [email protected]. His areas of interest are test theory, applied statistics, and research methods.

BERNARD P. VELDKAMP is assistant professor, Department of Research Methodology, Measurement, and Data Analysis, University of Twente, PO Box 217, 7500 AE Enschede, The Netherlands; [email protected]. His areas of interest are educational measurement and statistics.

Manuscript received June 21, 2005

Accepted March 17, 2006
