Determining targets for multi-stage adaptive tests using integer programming


European Journal of Operational Research 205 (2010) 709–718


Interfaces with Other Disciplines


Ronald D. Armstrong (a), Mabel T. Kung (b), Louis A. Roussos (c)

(a) Department of Management Science and Information Systems, Rutgers Business School, Rutgers University, Piscataway, NJ 08854-8054, USA
(b) Department of Information Systems and Decision Sciences, Mihaylo College of Business and Economics, California State University at Fullerton, 800 North State College Blvd., Fullerton, CA 92834, USA
(c) Measured Progress, 100 Education Way, Dover, NH 03920, USA

Article info

Article history: Received 19 March 2009; Accepted 5 December 2009; Available online 11 December 2009

Keywords: Integer programming; Simulation; Testing; Item response theory; Poisson trials


Abstract

This paper considers a multi-stage adaptive test (MST) where the testlets at each stage are determined prior to the administration. The assembly of a MST requires target information and target response functions for the MST design. The targets are chosen to create tests with accurate scoring and high utilization of items in an operational pool. Forcing all MSTs to have information and response function plots within an interval about the targets will yield parallel MSTs, in the sense that standardized paper-and-pencil tests are considered parallel. The objective of this paper is to present a method to determine targets for the MST design based on an item pool and an assumed distribution of examinee ability. The approach is applied to a Skills Readiness Inventory test designed to identify logical reasoning deficiencies of examinees. This method can be applied to obtain item response theory targets for a linear test, as this is a special case of a MST.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Skill assessment test selection and web-based learning have become important business models related to recommendation and personalized systems (Csondes et al., 2002; Chen and Duh, 2008). Because the ability of individuals may depend on their experiences and willingness to learn different fields of study, taking learner ability into account can improve knowledge assessment and future learning performance. Recent studies consider personalized mechanisms to promote insights into learning performance using techniques such as item response theory (Chen et al., 2005; Hatzilygeroudis and Prentzas, 2004), heuristics and integer programming (Csondes et al., 2002), causal modeling methodologies and Bayesian networks (Anderson and Vastag, 2004) and structural equation modeling (Lu et al., 2007; Gupta and Kim, 2008).

The classical linear test has a fixed number of items where an item consists of a stimulus (passage) and a question relating to the stimulus. The linear test can be administered to a large number of examinees. For example, a single version of the Law School Admission Test (LSAT) is given concurrently to around 30,000 examinees. The large sample size facilitates equitable scoring and detailed analyses of the items' statistical properties. Every form (version) of the LSAT is assembled to be parallel to all previous forms. Parallel forms have approximately the same measurement error and number correct score distribution, and the items cover the same cognitive skills. The use of integer programming allows testing agencies to assemble parallel linear forms where small adjustments are needed to equate the results across administrations. The number of correct responses from a parallel linear test can translate directly to a reported score that is comparable to scores of previous administrations. An item response theory (IRT) model (Lord, 1980) is used to represent the information and response pattern obtained over a range of ability. The item response function can be used to calculate the probability of a correct response to an item when given an ability level. A plot of this function is the item characteristic curve. Armstrong et al. (2005) present an overview of the assembly process for the linear version of the LSAT.

A computerized adaptive test (CAT) tailors each form to the ability level of the individual examinee. The item selection routine of CAT adapts based on responses to items delivered earlier in the form. After each response an ability estimate is updated and the next item is selected such that it has optimal properties according to the new estimate (Adema et al., 1991; van der Linden and Glas, 2003). The estimated ability of the examinee is re-evaluated until the accuracy of the estimate reaches a statistically acceptable level or until some limit is reached, such as a maximum number of items. The reported score is determined from the ability estimate and, as a result, the percentage of questions answered correctly has only an indirect effect on the reported score. Current operational CAT systems include the armed services vocational aptitude battery for the US Department of Defense, the Nursing Certification Exam, employment and training, scenario planning, hybrid types of e-learning and assessment of non-cognitive skills in industry (van der Linden and Glas, 2003; O'Brien, 2004; Mavrommatis, 2008).

Table 1
A MST design with 3 stages and 6 bins. Each bin in the depicted MST is targeted at a particular population ability percentile range, as indicated in the left column. Each bin contains 10 items.

Percentile range      Stage 1 (10 items)   Stage 2 (10 items)   Stage 3 (10 items)
[67, 100]             -                    -                    bin 6
[50, 100]             -                    bin 3                -
[0, 100] / [33, 67]   bin 1                -                    bin 5
[0, 50]               -                    bin 2                -
[0, 33]               -                    -                    bin 4

A compromise between a linear test and a CAT is a multi-stage adaptive test (MST), which consists of testlets. A testlet is a group of items administered to an examinee at one stage of the exam and the examinee must complete the testlet before continuing. The testlets of a MST are assembled before the administration and each testlet is intended for a subpopulation of the examinees. An examinee is routed to a testlet based on either an ability level estimate or the number of correct responses. A MST combines the test design components of a linear test with the adaptive nature of a CAT (Armstrong et al., 2004; Luecht and Nungester, 2000). The references show that a MST implementation of a standardized test obtains as much scoring accuracy as the linear version with the administration of about one-third fewer items.

The major advantages of a MST over a standard CAT are (1) the ability to have a review by test specialists because all MSTs are pre-assembled, (2) more accurate equating across tests because a single MST would have a large number of examinees responding to the same set of items and (3) better test security when an MST form is administered concurrently to a large group of examinees and no items on the form have been previously exposed. Recent articles using the design features of MSTs include Papanikolaou et al. (2002), Hatzilygeroudis and Prentzas (2004), Xing and Hambleton (2004) and Chen et al. (2005). Belov and Armstrong (2008) present a Monte Carlo solution technique for handling MST analysis. Edmonds and Armstrong (2008) use mathematical programming techniques for test assembly (van der Linden, 2005).

A major concern in any high-stakes testing environment is test security. The security issue becomes even more pronounced when the test is delivered on a computer. A linear test, for example the LSAT, can be delivered concurrently to several thousand examinees across the United States and Canada. Once administered, the items are disclosed and never administered again. This approach essentially eliminates the possibility of examinees obtaining prior knowledge of the items. This type of administration is not possible with a computerized test because facilities are not available to securely deliver tests concurrently on this scale. As mentioned, one advantage of the MST is the administration of the same test to many examinees. However, this advantage changes to a disadvantage when the administration is spread out over several days. The content of exposed items can be distributed among subsequent examinees. This is a major reason why the LSAT remains a linear (paper-and-pencil) test. A natural use of the MST format is for a low-stakes test that can be administered over the internet.

The Skill Readiness Inventory (SRI) test is designed to evaluate the skills of potential LSAT examinees approximately one year before they take the operational version. It is a multiple-choice test and uses items written for the LSAT. Responses to a SRI form are evaluated and suggestions are made to the examinees concerning how to improve the skills tested by the LSAT. This exam is administered over the internet. It is counterproductive to cheat because a fair assessment of skills is the goal from the standpoint of the examinee. This testing program uses the MST design. An operational version of this test can be found at sri.lsac.org. Even though the stakes are relatively low, the need still exists to assemble parallel forms for purposes of consistent study recommendations. The remainder of this paper focuses on a method to determine targets that allow for the creation of parallel forms.

Luecht and Nungester (2000) discuss determining targets by matching the reciprocal test information function to a desired degree of accuracy. The conditional error variance of the ability estimate is this reciprocal. Luecht and Burgin (2004) take a more detailed look at target creation, but they concentrate on targets for proficiency tests while this paper considers the problem for a skills assessment test. The MSTs assembled based on the targets must provide enough information to yield reliable scores and the item pool must support the assembly of multiple MSTs. There is a trade-off between these conflicting goals. A formal method is needed to create targets that remain stable over time and meet the objectives of the testing agency. The targets developed here are a weighted average of information and response functions of items on forms assembled by integer programming where randomly generated ability levels representative of the population's ability distribution are given. The items came from an operational pool of the Law School Admission Council, which administers the Law School Admission Test. This research is based on the target creation procedure of Armstrong and Roussos (2005).

2. Multi-stage adaptive test design

A multi-stage adaptive test design (MSTD) provides the MST requirements and the population subgroup intended to visit a bin at a given level. The layout and the development of the number of bins, number of items per bin, the number of stages, and the routing decision can be found in Edmonds and Armstrong (2008). For simplicity, this paper assumes a fixed number of items in every bin; however, the procedures presented can be extended to a variable number of items in each bin. An examinee visits exactly one bin at each stage and at the completion of each stage a decision is made as to the bin to visit next, or the test ends. Table 1 gives the outline of a MSTD with three stages. The items of bin 1 are administered to all the examinees, the items of bin 2 are aimed at the lower half of the ability range and those of bin 3 at the upper half. There are three levels at stage 3 where the items of bins 4, 5 and 6 are aimed at the bottom, middle and upper one-third of the population, respectively. A test cannot be so exact in practice and it is realized that examinees outside the specified percentile groups will slip over into other groups. For example, someone in the 34th percentile of ability is almost as likely to go to bin 4 or bin 5. However, a good design should make it unlikely for this individual to go to bin 6. We restrict the design to a binary routing; that is, an examinee coming out of bin 3 can only proceed to either bin 5 or 6. Similarly, an examinee coming out of bin 2 can only proceed to either bin 4 or 5. This design has four paths through the MST: 1 → 2 → 4, 1 → 2 → 5, 1 → 3 → 5, and 1 → 3 → 6.
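To make the design concrete, the bins, stages and binary routing of Table 1 can be encoded in a small data structure from which the four paths listed above are enumerated. The sketch below is illustrative only; the dictionary layout and names such as MST_DESIGN are ours and not part of the SRI implementation.

```python
# Hypothetical encoding of the 3-stage, 6-bin design of Table 1.
MST_DESIGN = {
    "items_per_bin": 10,
    "bins": {
        1: {"stage": 1, "percentile_range": (0, 100)},
        2: {"stage": 2, "percentile_range": (0, 50)},
        3: {"stage": 2, "percentile_range": (50, 100)},
        4: {"stage": 3, "percentile_range": (0, 33)},
        5: {"stage": 3, "percentile_range": (33, 67)},
        6: {"stage": 3, "percentile_range": (67, 100)},
    },
    # Binary routing: each non-final bin can send an examinee to exactly two bins.
    "routes": {1: (2, 3), 2: (4, 5), 3: (5, 6)},
}

def paths(design, bin_id=1):
    """Enumerate all root-to-leaf paths through the design."""
    nxt = design["routes"].get(bin_id)
    if not nxt:
        return [[bin_id]]
    return [[bin_id] + tail for b in nxt for tail in paths(design, b)]

print(paths(MST_DESIGN))  # [[1, 2, 4], [1, 2, 5], [1, 3, 5], [1, 3, 6]]
```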

The difficulty of a testlet is determined by its position in the MST. The testlet assigned to a bin is aimed at a percentile range of the general cognitive skill covered. All the testlets in a MST form cover the same general skill (for example, reading comprehension). However, a testlet assigned to a bin does not necessarily contain items covering all areas within the general skill range. There are constraints on the number of items covering the sub-skills. These constraints must hold over every path. For example, at stage 1 of two different MST forms, the two stage 1 testlets do not have to cover the same sub-skills. The sub-skill constraints are over paths and not at the individual testlet. Thus, a different routing could be given for examinees whose sub-skill levels differ but whose overall skill levels are the same. At the end of the examination, all sub-skill levels will be evaluated and the appropriate classifications made.

The examples of this paper were derived from a project to develop a Skill Readiness Inventory (SRI) test to evaluate the cognitive skills tested by the LSAT. These skills are logical reasoning (LR), analytical reasoning (AR) and reading comprehension (RC). Here, sample results are given only with LR items but the approach has been extended to both AR and RC. The MSTD presented in Table 1 was one of the designs considered for the SRI; however, it was not the chosen operational design. The operational design has fewer items in testlets and more frequent branching. To illustrate the target creation method, we work with the design of Table 1.

Let θ denote a random variable giving the ability of an examinee. A more general term used for θ is latent trait; however, this paper will use the ability descriptor because the SRI measures a skill ability level. The distribution of θ may be represented by a probability density function or derived empirically; for example, a table with the ability estimates of a previous test administered to the population can be used. No data were available for the population taking the SRI test, so it was assumed that the population ability was the same as the LSAT population ability. The distribution of LSAT test-takers' ability over recent years was studied and found to be approximately N(μ₀, σ₀), where μ₀ was close to 0 and σ₀ was close to 1; to be explicit, a N(0.122, 0.932) distribution. Here, the simulations are performed assuming a N(0, 1) distribution of ability.

We define the information function of item i as f_i(θ) (Lord, 1980, p. 73) and the response function for item i as p_i(θ). Let T represent the number of bins in the MST design. The information and response functions of a testlet assigned to bin t are denoted as IF_t(θ) and RF_t(θ), t = 1, ..., T. These are taken to be the sum of the testlet's individual item information and response functions. RF_t(θ) is also referred to as the true (expected) score of an examinee with ability θ when administered the testlet.

This paper assumes local independence between items within a testlet; that is, the probability of a correct response to an item within a testlet is independent of the responses to other items in the testlet. Li et al. (2006) define a testlet to be a group of items with a common stimulus. They note that, under this definition of a testlet, the responses to items within a testlet are likely "dependent due to factors associated with the stimulus". Alternative IRT models designed to capture the testlet factors are presented and analyzed by Li, Bolt and Fu. The method to determine targets given in the current paper requires only the score distribution of the testlet. Targets can be developed when local item dependence exists as long as the responses between testlets are independent.

Establishing target test information and response functions is essential for the creation of parallel MSTs. The IF_t(θ) and RF_t(θ) should match these targets within some level of accuracy, for instance, ±10%. The approach to create the targets is now summarized. The ability distribution of the population and "ideal" item pools determine the targets. The targets are a weighted average of information and response functions of items on forms assembled by integer programming. It is possible to develop an operational MST without target characteristic curves, but similar response functions across MSTs promote a similar score distribution across MSTs. The following gives an overview of the steps used to create the targets.

Step 1. Apply integer programming to assemble multiple linear forms with knowledge of the true ability of each examinee. All constraints for the MST design must be satisfied and exposure control is enforced. The true abilities for the simulation are randomly drawn from the population's ability distribution. The true ability and the items on the form are recorded once the process has stabilized the relationship between exposure control and information.

Step 2. Set the current bin to 1. All examinees start at bin 1.

Step 3. Calculate the probability of reaching the current bin for each recorded examinee from Step 1 given the examinee's true ability, routing rules and targets of previous stages. Create examinee weights by dividing the probability of visiting the bin by the overall probability of visiting the bin; thus, the sum over all weights equals one. The targets for the bin are the weighted sum of the observed information and response functions associated with the bin as recorded in Step 1.

Step 4. Based on the targets obtained in Step 3, determine the rules for routing examinees out of the current bin to bins at the next stage whenever bins at the next stage exist.

Step 5. Proceed to the next bin and repeat Steps 3 and 4 until all bins have been considered.

Fig. 1 gives a macro flowchart of the process. The number of forms needed to stabilize the item exposure rates is K. The number of forms used to determine the targets is also given by K. There are three distinct phases in the process: (1) stabilize exposure rates; (2) assemble and record forms; (3) create targets for bins sequentially beginning at bin 1.

2.1. Targets and routing

MSTDs have target bin information functions (TIFs) and target bin response functions (TRFs). The targets facilitate accurate test equating by providing tests that closely match information and score distributions over the ability range. Target IRT curves are common with linear tests, but the problem of choosing targets for the MST testing approach is more complex because routing decisions have to be made. The targets for each bin and the routing rules are interrelated. Routing rules should depend on the targets and vice versa. Since MSTs associated with a MSTD are assembled after the targets are created, this report concentrates on routing for a MSTD and one routing method is summarized later. It is assumed that the information and response functions of testlets assigned to a bin closely match the target of the bin.

The application uses a 3-parameter IRT model (Lord, 1980; Hambleton et al., 1991) to establish target curves. The Bernoulli random variable U_i indicates whether the ith item is answered correctly or incorrectly, and θ gives the true ability of the examinee. It is assumed that the IRT model accurately represents examinee response patterns and the U_i's are conditionally independent. The probability of a correct response for item i over the range of θ is given by p_i(θ) ≡ P(U_i = 1 | θ). As previously mentioned, the IF_t(θ) and RF_t(θ) for bin t are the sum of the f_i(θ) and p_i(θ) over all items assigned to bin t. A good target for a bin will be informative for the subpopulation visiting the bin.
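The paper does not restate the 3PL formulas, but they are the standard ones (Lord, 1980; Hambleton et al., 1991). A minimal sketch of p_i(θ), the corresponding item information f_i(θ), and their testlet-level sums RF_t(θ) and IF_t(θ) follows; the scaling constant D = 1.7 and the example parameters (taken from items 1 and 3 of Table 2) are assumptions for illustration, since the paper does not specify them at this point.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response; D = 1.7 is the usual scaling constant."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def info_3pl(theta, a, b, c, D=1.7):
    """Fisher information of a 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c, D)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

def testlet_functions(theta, items):
    """RF_t(theta) and IF_t(theta): sums of the item response and information functions."""
    rf = sum(p_3pl(theta, *it) for it in items)
    tif = sum(info_3pl(theta, *it) for it in items)
    return rf, tif

# Example with the (a, b, c) parameters of items 1 and 3 of Table 2:
testlet = [(1.12, 0.21, 0.20), (1.04, -1.40, 0.16)]
print(testlet_functions(0.0, testlet))
```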

Let TIF_t(θ) and TRF_t(θ) represent the target information and response functions at bin t, t = 1, ..., T. The procedure does not work with the actual TIF_t(θ) and TRF_t(θ), but with their values at discrete points on the θ-axis. Label these points θ̃_m, m = 1, ..., M. This study used M = 21 points from −3.0 to +3.0 in steps of 0.3. The θ values were chosen to be distributed uniformly to include most abilities under a standard normal distribution. Linear interpolation is performed to obtain values between the discrete points.

The target creation process assumes an item pool and population ability distribution that accurately represent the future. Agencies resist changing the targets for assembling parallel tests because this may affect the equating between administrations. However, time changes almost everything. Issues related to test equating (Kolen and Brennan, 2004) should be considered. It may be necessary to reassess the viability of the targets on a regular basis. One consideration should be the number of non-overlapping forms contained in the item pool. If this number reduces over time, an evaluation of pool and targets should take place.

Fig. 1. Macro flowchart of process to determine targets.

This study used a LSAT operational pool with logical reasoning items. Each logical reasoning item has its own stimulus; thus, the assumption of local item independence is reasonable. The items were calibrated based on data from administrations of variable (not scored) sections of the LSAT.

3. Omniscient testing

Omniscient testing is the delivery of a CAT where the true ability of the examinee is known. This can only be achieved with simulation. An omniscient testing method is used to create data for deriving the bin targets. The method was inspired by the shadow CAT proposed by van der Linden and Reese (1998) where an active test form, satisfying all constraints, is assembled and items to deliver are chosen from this active form. The shadow CAT updates ability based on responses to those items already administered and updates the test accordingly. However, in omniscient testing the true ability is known throughout and no updates are needed. Also, there is no need to simulate the administration of the form when the purpose is to construct targets. Constraints specified for a MST path must be satisfied, but there are no information or response targets. Limitations on item exposure are imposed to assure a broad usage of items from the pool.

An important issue for a testing agency is item pool usage. Every item has its own information function. An item with high information around the center of the population's ability distribution will be a likely item to be placed on several forms. We wish to control the usage of "good" items. The control approach employed here is to adjust the information provided by an item with a penalty based on the exposure rate. The exposure rate of an item is defined to be the ratio of the number of times the item has appeared over the total number of forms assembled up to that point. The exposure rate changes over time and early in the testing process will fluctuate greatly; for example, after one form has been administered, all exposure rates are 0 or 1. If an item has a high exposure rate after many administrations of the test, then it has an increased risk of being known by prospective examinees, which in turn threatens test security. Therefore, the number of overly exposed items is an important evaluation criterion for any successful CAT program (Leung et al., 2003).

Consider the sequential assembly of an individualized form for each of 2K simulated examinees. The first K examinees are used to stabilize item exposure rates and the second K examinees are used to create the targets. The form for the kth examinee is assembled knowing all forms administered to the previous k − 1 examinees. Let x_i denote a zero-one decision variable for the selection of each item in the pool. The assembly process has x_i = 1 when the ith item is present on the test, and x_i = 0 when it is not present. Let θ_k be the true ability of the kth examinee, randomly drawn from the population's ability distribution.

A cost is associated with each item. This cost is the value of a function H_ik(ũ_i, ū_i, θ_k) where ũ_i is a random number uniformly distributed between 0 and 1, and ū_i is the exposure rate of the ith item immediately before the kth test is assembled. Thus, ū_i is equal to the number of times item i has appeared on the previous k − 1 forms divided by k − 1. The objective of the test assembly problem for the kth examinee is the following:

Minimize Σ_i H_ik(ũ_i, ū_i, θ_k) x_i.   (1)

The study reported here uses the following representation for H_ik(ũ_i, ū_i, θ_k):

H_ik(ũ_i, ū_i, θ_k) = a₁ũ_i + a₂ū_i − a₃I_i(θ_k).   (2)

The coefficients a₁, a₂, and a₃ are determined by repeated simulations to achieve the desired exposure rates for items in the pool. The first term, a₁ũ_i, creates randomization in the item selection process. This assures no discernible pattern for selecting items. The first term of (2) introduces randomness, but does not dominate the other terms. The value of ũ_i is generated anew for each examinee. Since the objective coefficients matter only relative to one another, a₁ = 1.0 for all simulations. The second term, a₂ū_i, penalizes items that have already been exposed as a linear expression of the exposure rate. The value of a₂ > 0 is chosen based on desired pool usage. This approach gives an acceptable method to distribute items over the omniscient test forms. The results of the simulation for the first K examinees are not recorded, but are used to stabilize the exposure rates. The ultimate goal is to produce targets that will effectively utilize the item pool. The success of achieving this goal can only be evaluated fully after the targets are defined and MSTs are assembled. Experience with the items used in this study indicates that a goal of keeping the maximum exposure rate under 15% and the median exposure rate around 2% produces acceptable pool utilization. This exposure rate level is consistent with studies on exposure rates in CAT; for example, van der Linden and Veldkamp (2007) noted that a possible goal value for exposure rates for CAT would have a maximum of between 10% and 15%. In general, if we want to maintain a consistent maximum exposure rate, a₂ should increase when the pool size increases and decrease when the number of items on a MST path increases.

The third term, −a₃I_i(θ_k), is the focus in a standard CAT implementation where the objective is to maximize information. High information items at a θ_k point should be utilized more than items with lower information, but over the course of the simulation, all acceptable items should be administered. There is a trade-off between information and exposure rate. The higher the value of a₃ relative to a₂, the higher the target information curves and the fewer non-overlapping MSTs that can be assembled from the item pool. A search method was used in this study to identify parameters yielding the desired exposure rates.
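A rough simulation of this trade-off can be written directly from Eq. (2). The sketch below is a simplification, not the authors' implementation: it keeps only the single constraint of selecting exactly n items per form (for which sorting by cost solves the integer program exactly) and omits the content, answer-key and word-count constraints handled by the MIP in Section 3.1. It reuses info_3pl from the earlier 3PL sketch; the default coefficients are the values reported later for the full study.

```python
import random

def assemble_omniscient_forms(pool, n_items, K, a1=1.0, a2=75.0, a3=20.0, seed=0):
    """Sequentially assemble 2K omniscient forms, penalizing exposure as in Eq. (2).

    pool: list of (a, b, c) 3PL item parameters. Returns the final exposure rates
    and the (theta_k, form) pairs recorded for the second K examinees.
    info_3pl(...) is the item information function from the 3PL sketch above.
    """
    rng = random.Random(seed)
    counts = [0] * len(pool)              # number of forms each item has appeared on
    recorded = []
    for k in range(1, 2 * K + 1):
        theta_k = rng.gauss(0.0, 1.0)     # true ability drawn from N(0, 1)
        costs = []
        for i, (a, b, c) in enumerate(pool):
            u_tilde = rng.random()                          # randomization term
            u_bar = counts[i] / (k - 1) if k > 1 else 0.0   # exposure rate so far
            costs.append((a1 * u_tilde + a2 * u_bar - a3 * info_3pl(theta_k, a, b, c), i))
        form = [i for _, i in sorted(costs)[:n_items]]      # minimizes the sum of costs
        for i in form:
            counts[i] += 1
        if k > K:                          # the first K forms only stabilize exposure
            recorded.append((theta_k, form))
    rates = [ct / (2 * K) for ct in counts]
    return rates, recorded
```

With a pool of realistic size and the coefficients above, the resulting exposure rates can be compared against the maximum of 12.7% and median of 1.9% reported in Section 3.1.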

3.1. Assembly and constraints

A mixed integer programming package was used for the assembly of omniscient forms. The objective function for all the problems was given by (1). The model reported here was for discrete items where the stimulus and the question can be treated as a unit. The model can be extended to set-based items (van der Linden, 2000) where a common stimulus exists for multiple items. The RC and AR testlets do have a common stimulus for all items in a testlet. As mentioned previously, the effects of local dependence were not considered in the IRT parameter estimation process. Ignoring the impact of the common stimulus provides an overestimate of the test reliability (Wainer and Thissen, 1996). Although not reported in this article, MST targets for RC and AR were created in a manner similar to LR.

The constraints for the omniscient testing were scaled versions of the linear LSAT constraints. A parameter assured that the final solution to the integer programming problem was within 10% of the optimal solution. Since randomness was built into the problem, the lack of a true optimal solution was not a concern. The solution should, however, be close enough to the optimum to allow the objective function to impact the solution. Time for solving the MIP was not an issue as the number of constraints was relatively small and feasible solutions were easily found. The assembly of MST forms was considerably more challenging because the target constraints must be satisfied.

The constraints included assigning a fixed number of items to a form, number of items in cognitive sub-skill categories, answer key count distribution, and word count requirements. Armstrong et al. (2005) and van der Linden and Glas (2003) give explicit statements of these constraints. The sample pool for the study had 1336 discrete items. The representative MSTD from Table 1 had S = 3 stages and n = 10 items at each stage; thus, every form contained 30 items. The number of zero-one variables was 1336, the number of constraints was 20 and the number of nonzero entries in the constraint matrix was 4043. The objective function, (1), had a₁ = 1.0, a₂ = 75.0 and a₃ = 20.0. A high exposure rate was considered to be around 0.10 and a comparatively high amount of information at an ability point would be 0.30. This gave a little insight into the trade-off between information and exposure control. Omniscient form assemblies run on a laptop with a 2.13 GHz CPU took about 20 minutes with K = 5000; that is, 10,000 omniscient forms were assembled. The maximum exposure rate was 12.7%, the minimum exposure rate was 0.2% and the median exposure rate was 1.9%. This was considered to be a good utilization of the pool.

The omniscient test is a linear test, but the items were sequenced similarly to a MST and placed in testlets with one testlet at each of the S stages. All testlets contained n items. There is no reason to simulate the administration of the test. Testlets were created randomly with items on the test having equal probability of being placed in any testlet. Should additional constraints exist on the placement of items into testlets, nonrandom groupings may be necessary. Any nonrandom placement may bias the process. For example, a constraint may exist where at least a specified number of items on a specific sub-skill must be present in the first testlet. In this case, the constraint should be imposed. The models of this project had constraints only on summations along paths.

3.2. Data saved from omniscient testing

After the exposure rate stabilized (K tests assembled), data from each omniscient form was saved and used when deriving the targets. This data should provide a good cross section of forms assembled for the population. The items of the kth omniscient form are denoted by the following indices:

i(k, s, j),   k = 1, ..., 2K; s = 1, ..., S; j = 1, ..., n,   (3)

where s is the stage and j is the sequencing index of the items within the testlet at stage s.

Each item's p_i(θ) and f_i(θ) can be computed based on the calibrated IRT parameters. The values of the stage information functions, SIF_{k,s}(θ), and stage response functions, SRF_{k,s}(θ), observed for the kth examinee at stage s can be computed as follows:

SIF_{k,s}(θ̃_m) = Σ_{j=1}^{n} f_{i(k,s,j)}(θ̃_m),   k = K+1, ..., 2K; s = 1, ..., S; m = 1, ..., M,   (4)

SRF_{k,s}(θ̃_m) = Σ_{j=1}^{n} p_{i(k,s,j)}(θ̃_m),   k = K+1, ..., 2K; s = 1, ..., S; m = 1, ..., M.   (5)

Weighted sums of the SRF_{k,s}(θ̃_m) and SIF_{k,s}(θ̃_m) are used to create the targets TRF_t(θ̃_m) and TIF_t(θ̃_m). The weights are derived from the probabilities as described in the next section.

3.3. Routing and probabilities for weights

The target generation method of this paper can be implemented with any well defined routing rule. One rule would be to take an ability estimate of the examinee at the time of the routing decision and send the examinee to the bin at the next stage whose target group contains the estimate. This has the disadvantage of calculating an ability estimate, which can be done in several ways, and possibly needing to explain the routing motivation to someone unfamiliar with IRT. Also, routing based on ability estimation complicates the target creation process. Sample testlets would have to be assembled and numerical integration or simulation performed to obtain the weights to be multiplied by (4) and (5). The SRI test uses number correct at the point of routing to determine the next bin, and we will use that approach in this section.

Fig. 2. Flowchart of the process to determine targets after simulated data are recorded (compute all SIF_{k,s}(θ̃_m) from Eq. (4); set the bin 1 target to the plain average over the K recorded forms; then, for t = 2, ..., T, use the targets of previous stages to calculate p_kt, the probability that examinee k reaches bin t, and apply Eq. (6)).

The routing rule can be obtained once the TRFs of previous stages have been created. All bins of the MSTD are intended to attract a specific group of examinees and the ability range for this group is stated in the design. The extreme θ values for the group ability range and the sum of the corresponding TRF_t(θ) values serve as cutoffs for routing based on number correct. The testlet assigned to a bin should match the target for the bin; therefore, TRF_t(θ) estimates the expected number correct for a testlet assigned to bin t given θ. The expectation from previous bins must be included and this may introduce an error. Consider the MSTD of Table 1 and routing out of bin 2. The inverse standard normal of 0.33 is θ = −0.44, and the rule for routing to bin 4 would be a number correct less than TRF_1(−0.44) + TRF_2(−0.44). However, the analysis could condition on the examinee being routed to bin 2. This conditioning would compute the expectation by taking into account the possibility that an examinee with ability −0.44 may score higher than TRF_1(0.0) and be routed to bin 3 instead of bin 2. The error associated with failure to condition can be removed, but will be ignored during the analysis presented here. Further, designs may have multiple paths to a bin requiring a routing rule. An average of the expected number correct from earlier bins should be taken.

The bin targets are determined by weighting the observed values from (4) and (5) by the probability that the kth simulated examinee with ability θ_k visits the bin. The expected number correct at bin t for examinee k is estimated to be TRF_t(θ_k). The distribution of number correct at bin t is projected to be binomial with parameters n (the number of items in the testlet) and p = TRF_t(θ_k)/n; that is, all items in a testlet have equal probability of a correct response and any testlet assigned to bin t closely matches the target. This estimate of the distribution maximizes the variance over all other estimates in which the individual probabilities sum to TRF_t(θ_k). Nedelman and Wallenius (1986) provide properties of Bernoulli trials with fixed p versus Bernoulli trials where p varies (called Poisson trials).

All examinees are administered the testlet in bin 1; thus, the target at bin 1 is the average of all K stage 1 testlets from the omniscient test. After stage 1, the routing rules of previous bins are needed to obtain the distribution of number correct at the current bin. The distribution entering the bin is computed from Bayes' rule (Ross, 2001) where the conditioning is on the routing rule. The full score distribution at bin t can be calculated as a mixed binomial where the distribution within the bin is binomial as described in the preceding paragraph. Possible multiple paths to the bin must be considered.
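Under the binomial projection above, the probability of a routing decision at stage 2 reduces to a binomial tail probability. A minimal sketch, assuming scipy is available, using a hypothetical bin 1 target response function trf1 and the cutoff of 6 correct responses discussed later in this section:

```python
from scipy.stats import binom

def stage2_probs(theta_k, trf1, cutoff, n=10):
    """P(routed to bin 2) and P(routed to bin 3) for an examinee with true ability theta_k.

    The number correct on the bin 1 testlet is projected as Binomial(n, trf1(theta_k)/n);
    fewer than `cutoff` correct responses routes to bin 2, otherwise to bin 3.
    """
    p = min(max(trf1(theta_k) / n, 0.0), 1.0)
    p_bin3 = 1.0 - binom.cdf(cutoff - 1, n, p)   # P(number correct >= cutoff)
    return {"bin2": 1.0 - p_bin3, "bin3": p_bin3}

# Illustration with a hypothetical (linearized) bin 1 target in the spirit of Fig. 4:
trf1 = lambda theta: 5.85 + 1.5 * theta
print(stage2_probs(0.0, trf1, cutoff=6))
```

Probabilities for stage 3 bins follow the same idea but mix over the stage 2 score distributions, as described above.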

Let p_kt represent the probability that examinee k will visit bin t. The value of p_kt is obtained by the method described in the previous paragraph. The values obtained for the targets are the following:

TIF_t(θ̃_m) = [Σ_{k=K+1}^{2K} p_kt SIF_{k,s}(θ̃_m)] / [Σ_{k=K+1}^{2K} p_kt],   m = 1, ..., M,   (6)

TRF_t(θ̃_m) = [Σ_{k=K+1}^{2K} p_kt SRF_{k,s}(θ̃_m)] / [Σ_{k=K+1}^{2K} p_kt],   m = 1, ..., M,   (7)

where s is the stage containing bin t.

Fig. 2 provides a flowchart of the process used to determine the information targets once the results from the omniscient testing have been recorded. This flowchart expands the third phase shown in Fig. 1. Since p_k1 = 1 for every examinee, the targets at stage 1 can be calculated directly.
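A compact sketch of this third phase under the assumptions above: the recorded SIF_{k,s}(θ̃_m) and SRF_{k,s}(θ̃_m) arrays are weighted by the visit probabilities p_kt and averaged as in Eqs. (6) and (7). Array shapes and names are illustrative; numpy is assumed.

```python
import numpy as np

def bin_targets(sif, srf, p_visit):
    """Eqs. (6) and (7): probability-weighted averages of recorded stage functions.

    sif, srf: arrays of shape (K, M) with SIF_{k,s} and SRF_{k,s} evaluated on the
              theta grid for the K recorded examinees, at the stage containing bin t.
    p_visit:  length-K array of p_kt, the probability that examinee k visits bin t.
    """
    w = np.asarray(p_visit, dtype=float)
    tif = (w[:, None] * np.asarray(sif)).sum(axis=0) / w.sum()
    trf = (w[:, None] * np.asarray(srf)).sum(axis=0) / w.sum()
    return tif, trf

# Stage 1: p_k1 = 1 for every examinee, so the targets are plain averages.
# tif1, trf1 = bin_targets(sif_stage1, srf_stage1, np.ones(K))
```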

Fig. 3 shows the information targets obtained at the three stages for the design of Table 1. Recall that at stage 1, bin 1 contains items administered to all the examinees. In stage 2, the items of bin 2 are aimed at the lower half of the ability range and those of bin 3 are aimed at the upper half. In stage 3, the items of bins 4, 5 and 6 are aimed at the bottom, middle and upper one-third of the population, respectively. The general shape of each curve is consistent with what we would expect. The information target for bin 1 has its highest point (2.58) at ability θ = 0.0, but attempts to provide adequate information over the ability range. The upper tail of the target is heavier than the lower tail. This is because high information items are much more common for high ability examinees than low ability examinees. The two information targets at stage 2 adapt to the lower and upper half of the ability percentiles. The highest information target for bin 2 is 2.89 at θ = −0.3 and the highest information target for bin 3 is 2.73 at θ = 0.6. These results show a slight rise in the maximum information, that is, 2.58 for all examinees at θ = 0.0 in bin 1; 2.89 for the lower half of the ability range at θ = −0.3 in bin 2; and 2.73 for the upper half of the ability range at θ = 0.6 in bin 3.

To see how the target values have shifted from bin 1, let us take values of θ equal to −0.9 and +0.9. The target information values at θ = −0.9 are 1.83, 2.56 and 1.25 for bins 1, 2 and 3, respectively. Thus, at θ = −0.9, the highest information target of 2.56 is observed in bin 2, where the examinees with the lower abilities are to be directed. Similarly, the target information values at θ = 0.9 are 2.14, 1.50 and 2.65 for bins 1, 2 and 3, respectively. An examinee with θ = 0.9 should have a high probability of being routed to bin 3, with a corresponding improvement in scoring accuracy.

Fig. 3. The information targets obtained at the three stages for the design (panels: Stage 1 TIF for bin 1; Stage 2 TIFs for bins 2 and 3; Stage 3 TIFs for bins 4, 5 and 6; ability on the horizontal axis, information on the vertical axis).

Fig. 4. The expected number correct targets obtained at the three stages for the design (panels: TRF Stage 1 for bin 1; TRF Stage 2 for bins 2 and 3; TRF Stage 3 for bins 4, 5 and 6; ability on the horizontal axis, expected score on the vertical axis).

Table 2
Item pool used for numerical example of a MST design with 2 stages and 3 bins.

Item number   a (Discrimination)   b (Difficulty)   c (Guessing)   Exposure rate
1             1.12                 0.21             0.20           0.341
2             1.09                 1.85             0.17           0.147
3             1.04                 -1.40            0.16           0.274
4             0.85                 1.36             0.14           0.183
5             1.22                 2.11             0.07           0.155
6             1.04                 -1.96            0.20           0.168
7             1.02                 1.75             0.18           0.150
8             1.02                 -0.12            0.12           0.407
9             0.98                 2.30             0.18           0.096
10            0.90                 -0.97            0.10           0.306
11            0.87                 -0.95            0.15           0.264
12            0.97                 -1.90            0.08           0.209
13            1.02                 0.10             0.11           0.378
14            1.12                 0.34             0.15           0.397
15            0.85                 -1.64            0.21           0.179
16            1.00                 0.30             0.12           0.345



The stage 3 targets are significantly shifted to the ability percentiles the bins are designed to attract. The highest target information values for bins 4, 5 and 6 occur at θ = −0.6, 0.0 and 1.2. The values of the targets evaluated at these ability points are 2.80, 3.38 and 2.88. For bin 4, the target information is less than 1.0 for any examinee with ability above 0.9. The information targets and routing of the MST improve the assessment accuracy of logical reasoning ability. This permits more detailed guidance on how to rectify deficiencies, which leads to improved scores in future testing.

Fig. 4 shows the expected number correct score targets, TRF_t(θ), over the ability range. Assume the testlet RF_t(θ) for bin t closely matches TRF_t(θ). The easier testlets are found at those bins designed to attract the lower ability percentiles. The cutoff for the routing rule out of bin 1 would be based on 6 correct responses because the expectation at θ = 0.0 was around 5.85. The examinee goes to bin 2 if fewer than 6 correct responses are given at bin 1 and goes to bin 3 otherwise. The routing rule out of bin 2 would focus on expected scores at θ = −0.44. The expectation at bin 1 is a little less than 5 correct responses and the expectation at bin 2 is around 6 correct responses. Thus, the routing rule at bin 2 would be to direct the examinee to bin 5 if 11 or more correct responses are given up to the completion of bin 2; otherwise, send the examinee to bin 4. The targets created in this study demonstrate a high level of consistency with regard to information and expected number of correct responses across the ability percentiles. It should be noted, however, that the results may be limited by the characteristics of the item pool used.
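The cutoffs just described follow directly from the cumulative response targets evaluated at the bin boundary abilities. A short sketch of one plausible reading (rounding the expected cumulative number correct to the nearest integer, which reproduces the cutoffs of 6 and 11 quoted above; the paper does not state an explicit rounding rule):

```python
def routing_cutoff(theta_boundary, trfs_on_path):
    """Number-correct routing cutoff at an ability boundary.

    trfs_on_path: target response functions of the bins completed so far, e.g.
    [trf1] when leaving bin 1, or [trf1, trf2] when leaving bin 2. The cutoff is
    the expected cumulative number correct at the boundary, rounded to the
    nearest integer.
    """
    return round(sum(trf(theta_boundary) for trf in trfs_on_path))

# Leaving bin 1 the boundary is theta = 0.0 (the median); leaving bin 2 it is the
# 33rd percentile, theta = -0.44.
```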

3.4. A numerical example

A small numerical example is now presented to illustrate the creation of targets. Table 2 shows the item pool for this example, containing 16 discrete items. This MST design is derived by truncating the design in Table 1. Rather than having three stages with 10 items each, the example MSTD has S = 2 stages and n = 2 items at each stage. The testlets assigned to bin 1 are designed for the entire population of examinees assuming a N(0, 1) distribution of ability. The testlets for bin 2 are for the lowest 50% of the population and the testlets for bin 3 for the upper 50%. The two paths for the MSTD are 1 → 2 and 1 → 3.

Fig. 5. The information targets and the path information from two separate MSTs (panels: Target Path 1 with the Path 1 MST1 and MST2 information curves; Target Path 2 with the Path 2 MST1 and MST2 information curves; ability on the horizontal axis, information on the vertical axis).


Every form constructed by the omniscient test assembly contains four items. The only constraint of the test assembly problem is that the number of items in the test must equal 4. Omniscient forms are assembled with K = 5000. Table 2 shows the maximum exposure rate from the assembly to be 40.7% and the minimum exposure rate to be 9.6%. The average exposure rate is 25% because the pool contains only 16 items with 4 items per test. The items with level of difficulty, b, between −0.5 and +0.5 correspond to the ones with the higher exposure rates, between 34.1% and 40.7%.

All the omniscient tests are recorded along with the ability of the simulated examinees. The target for bin 1 becomes the average of the information and response functions of the first two items on the 5000 tests. The targets are recorded on the ability grid [−3.0, +3.0] in steps of 0.3. Any simulated ability above +3.0 is brought back to +3.0 and, similarly, any ability below −3.0 is brought up to −3.0. The bin 1 target at θ = 0.0 has a value of 1.18. The split at the second stage is at the center of the population (θ = 0.0); thus, the routing rule of bin 1 is to route to bin 2 when the examinee has 0 or 1 correct responses and route to bin 3 when 2 correct responses are observed. The targets for bins 2 and 3 are obtained by a weighted average of the information and response functions of the last two items on the 5000 tests. The weights are determined by the probability of each examinee being routed to the associated bin. The probabilities are obtained assuming that the bin 1 target accurately represents the response functions of testlets assigned to bin 1. Assume simulated examinee k has a true ability of 1.3. The response targets at 1.2 and 1.5 are 1.65 and 1.73, respectively. Hence, the expected score for the examinee can be computed as λ₁(1.65) + λ₂(1.73) = 1.68, where λ₁ = (1.5 − 1.3)/0.3 and λ₂ = (1.3 − 1.2)/0.3. The probability of a correct response to each of the two items in the first testlet becomes p = 1.68/2 = 0.84. The probability of both items in testlet 1 being responded to correctly is p_k3 = 0.84 × 0.84 = 0.71. The probabilities of being routed to bin 3 are calculated for all 5000 examinees and then equations (6) and (7) are applied to obtain the targets for bin 3. The targets for bin 2 are then calculated with p_k2 = 1 − p_k3.
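The interpolation and probability calculations of this example are easy to reproduce. The sketch below uses only the two grid values quoted in the text (1.65 at θ = 1.2 and 1.73 at θ = 1.5) and mirrors the paper's two-decimal rounding, recovering p = 0.84 and p_k3 = 0.71; the helper name interp_target is ours.

```python
def interp_target(theta, grid, values):
    """Linear interpolation of a target function recorded on a theta grid."""
    for (t0, v0), (t1, v1) in zip(zip(grid, values), zip(grid[1:], values[1:])):
        if t0 <= theta <= t1:
            lam1 = (t1 - theta) / (t1 - t0)   # weight on the left grid point
            lam2 = (theta - t0) / (t1 - t0)   # weight on the right grid point
            return lam1 * v0 + lam2 * v1
    raise ValueError("theta is outside the grid")

grid, values = [1.2, 1.5], [1.65, 1.73]                 # bin 1 response targets from the text
expected = round(interp_target(1.3, grid, values), 2)   # 1.68
p = round(expected / 2, 2)                              # 0.84 per item (2-item testlet)
p_k3 = round(p * p, 2)                                  # 0.71: both items answered correctly
p_k2 = round(1 - p_k3, 2)                               # 0.29
print(expected, p, p_k3, p_k2)
```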

The final step in the numerical example assembles two MSTs from the item pool using the obtained targets. The only constraints on the MST assembly are requiring exactly two items to be assigned to each of the three bins and that an item can be used at most once. The objective is to minimize the maximum overall deviation from the path targets. The assembled MSTs are MST1 = {bin 1: (12, 13), bin 2: (8, 7), bin 3: (14, 3)} and MST2 = {bin 1: (1, 3), bin 2: (10, 4), bin 3: (9, 16)}. The path information curves and the path information targets are found in Fig. 5. Because the testlets contain only two items, the path information curves are not as symmetric as the results in Fig. 3 with testlets containing more items. Also, because the item pool has only 16 items, the variation of information over the ability range is not as significant as with a larger pool.

4. Conclusions

This paper has presented a method to obtain targets for multi-stage adaptive tests (MSTs). The fundamental operations research techniques of integer programming, simulation and stochastic processes were used. The design specified a desired population ability percentile group to visit each bin. Targets for the bins were created by assuming the ability distribution of the population being tested. This population can be represented by a probability density function, as was used for the examples of the paper, or a table giving the abilities of examinees from a previous administration of a related test. Also, the targets depend on the data from an item pool. The item pool must be representative of future pools to be used for the MST approach. This study used an existing operational pool created to support paper-and-pencil (P&P) tests. It is reasonable to expect that a MST approach will use items in a different manner than P&P testing, and the item development process may be modified from the one used for P&P testing. Thus, further research is taking place to specify the characteristics of the item bank for a MST approach.

The results indicate the capability of the targets to capture the desired attributes. If more scoring accuracy is desired at certain ability levels, a bin targeting a narrower percentile of the population can be specified in the design. In general, it has been found that three, or at most four, target levels are desirable at the final stage. The small benefit of the additional levels in improving scoring accuracy is outweighed by the added complexity. The additional bins would require the use of more items from the pool. Targets for paths can be obtained by summing the targets for the bins on the path. The Skills Readiness Inventory (SRI) test was implemented with path targets and routing rules derived separately for each assembled MST. It is possible that routing rules can differ slightly across MSTs and differ from the routing rules used to obtain the bin targets.

The omniscient test assembly has parameters to create randomness, control the item exposure rate and improve information. An attempt has been made to balance the accuracy of the test scores and the effective usage of the item pool. It does not appear to be practical to define "optimal" targets. The future distribution of ability and the components of the future item pools would be needed and, even then, the multiple objectives make a precise definition of optimality difficult. A lengthy appraisal process is required to set targets. Once a test based on the bin targets has been made operational, the parallel properties of the forms are set and changing the targets may violate parallelism. The significant time and effort to create and evaluate targets prior to implementation is highly justified.

The MST approach to testing is currently being used in a SRI test for examinees preparing for the LSAT. Multiple versions of this test have been assembled using targets created as described here. The SRI test is designed to be administered over the internet and to identify deficiencies in the skills of future LSAT examinees. Each item in a MST is aimed at the same main cognitive skill type: analytical reasoning, reading comprehension or logical reasoning. The MST assesses the main skill but also sub-skills. For example, sub-skills for analytical reasoning could be "basic understanding of rules", "things that could be true", "things that must be true" and "conditional rules". Every MST path has constraints on the number of items evaluating the sub-skills. Examinees are classified into one of nine groups based on percentile rankings in the main skill and sub-skills. Learning specialists advise students on how to improve skills necessary for a high LSAT score.

The MST used in the SRI setting eliminates the concerns involved in administering most high-stakes tests. The incentive to cheat is eliminated because the purpose of the test is to help the examinee identify skill deficiencies. However, targets are still necessary to create parallel forms and provide consistent study recommendations. The targets determined as described in this paper assume parallel item pools and homogeneous examinee populations over time. Violation of these assumptions will undermine the objectives of the target creation, which were high pool utilization and consistent assessment of the examinee population. When the testing environment changes, for example through changes in item pools, the examinee population, economic settings, educational backgrounds, unemployment rates, curriculum requirements, emphasis on other cognitive skills, or the skills demanded by employers, the procedure may need to be rerun to create new targets. Evaluating both the makeup of the item pools and the examinee ability distributions should take place regularly. Significant shifts may require the targets to be constructed anew; however, a careful reassessment is needed to equate tests once new targets are implemented.

Acknowledgement

This research has been partially funded by a grant from the Law School Admission Council, 662 Penn Street, Newtown, PA 18940, USA.

References

Adema, J.J., Boekkooi-Timminga, E., van der Linden, W., 1991. Achievement test construction using 0-1 linear programming. European Journal of Operational Research 55, 103-111.
Anderson, R.D., Vastag, G., 2004. Causal modeling alternatives in operations research: Overview and application. European Journal of Operational Research 156, 92-109.
Armstrong, R.D., Belov, D.I., Weissman, A., 2005. Developing and assembling the Law School Admission Test. Interfaces 35, 140-151.
Armstrong, R.D., Roussos, L.A., 2005. A method to determine targets for multi-stage adaptive tests. Law School Admission Council Computerized Testing Report 02-07, December, pp. 1-17.
Armstrong, R.D., Jones, D.H., Koppel, N.B., Pashley, P.J., 2004. Computerized adaptive testing with multiple form structures. Applied Psychological Measurement 28, 147-164.
Belov, D.I., Armstrong, R.D., 2008. A Monte Carlo approach to the design, assembly and evaluation of multistage adaptive tests. Applied Psychological Measurement 32, 119-137.
Chen, C.M., Duh, L.J., 2008. Personalized web-based tutoring system based on fuzzy item response theory. Expert Systems with Applications 34, 2298-2315.
Chen, C.M., Lee, H.M., Chen, Y.H., 2005. Personalized e-learning system using item response theory. Computers and Education 44, 237-255.
Csondes, T., Kotnyek, B., Szabo, J.Z., 2002. Application of heuristic methods for conformance test selection. European Journal of Operational Research 142, 203-218.
Edmonds, J.J., Armstrong, R.D., 2008. A mixed integer programming model for multiple stage adaptive testing. European Journal of Operational Research 193, 342-350.
Gupta, S., Kim, H.W., 2008. Linking structural equation modeling to Bayesian networks: Decision support for customer retention in virtual communities. European Journal of Operational Research 190, 818-833.
Hambleton, R.K., Swaminathan, H., Rogers, H.J., 1991. Fundamentals of Item Response Theory. Sage Publications, Newbury Park, CA.
Hatzilygeroudis, I., Prentzas, J., 2004. Using a hybrid rule-based approach in developing an intelligent tutoring system with knowledge acquisition and update capabilities. Expert Systems with Applications 26, 477-492.
Kolen, M.J., Brennan, R.L., 2004. Test Equating, Scaling, and Linking: Methods and Practices. Springer-Verlag, New York.
Li, Y., Bolt, D., Fu, J., 2006. A comparison of alternative models for testlets. Applied Psychological Measurement 30, 3-21.
Leung, C., Chang, H., Hau, K., 2003. Incorporation of content balancing requirements in stratification designs for computerized adaptive testing. Educational and Psychological Measurement 63, 257-270.
Lord, F., 1980. Applications of Item Response Theory to Practical Testing Problems. Lawrence Erlbaum Associates, Hillsdale, NJ.
Lu, C.S., Lai, K.H., Cheng, T.C.E., 2007. Application of structural equation modeling to evaluate the intention of shippers to use internet services in liner shipping. European Journal of Operational Research 180, 845-867.
Luecht, R.M., Burgin, W., 2004. Test information targeting strategies for adaptive multistage testing designs. American Institute of Certified Public Accountants, Technical Report, Series 2, No. 6, May.
Luecht, R.M., Nungester, R.J., 2000. Computer-adaptive sequential testing. In: van der Linden, W., Glas, C.A. (Eds.), Computerized Adaptive Testing: Theory and Practice. Kluwer, Boston, pp. 117-128.
Mavrommatis, G., 2008. Learning objects and objectives towards automatic learning construction. European Journal of Operational Research 187, 1449-1458.
Nedelman, J., Wallenius, T., 1986. Bernoulli trials, Poisson trials, surprising variances and Jensen's inequality. The American Statistician 40, 286-289.
O'Brien, F.A., 2004. Scenario planning - Lessons for practice from teaching and learning. European Journal of Operational Research 152, 709-722.
Papanikolaou, K.A., Grigoriadou, M., Magoulas, G.D., Kornilakis, H., 2002. Towards new forms of knowledge communication: The adaptive dimension of a web-based learning environment. Computers and Education 39, 333-360.
Ross, S., 2001. A First Course in Probability, sixth ed. Prentice Hall, Upper Saddle River, NJ.
van der Linden, W.J., 2000. Optimal assembly of tests with item sets. Applied Psychological Measurement 24, 225-240.
van der Linden, W.J., 2005. Linear Models for Optimal Test Design. Springer, New York.
van der Linden, W.J., Glas, C.A.W., 2003. Preface. In: van der Linden, W.J., Glas, C.A.W. (Eds.), Computerized Adaptive Testing. Kluwer, Dordrecht, Boston, London, pp. vi-xii.
van der Linden, W.J., Reese, L.M., 1998. A model for optimal constrained adaptive testing. Applied Psychological Measurement 22, 259-270.
van der Linden, W.J., Veldkamp, B.P., 2007. Conditional item-exposure control in adaptive testing using item-ineligibility probabilities. Journal of Educational and Behavioral Statistics 32, 398-418.
Wainer, H., Thissen, D., 1996. How is reliability related to the quality of test scores? What is the effect of local item dependence on reliability? Educational Measurement: Issues and Practice 15, 22-29.
Xing, D., Hambleton, R., 2004. Impact of test design, item quality, and item bank size on the psychometric properties of computer-based credentialing examinations. Educational and Psychological Measurement 64, 5-21.