British Journal of Mathematical and Statistical Psychology (2011), 64, 208–232
© 2010 The British Psychological Society
www.wileyonlinelibrary.com

A generalized dimensionality discrepancy measure for dimensionality assessment in multidimensional item response theory

Roy Levy* and Dubravka Svetina
Arizona State University, Tempe, Arizona, USA

*Correspondence should be addressed to Dr Roy Levy, Arizona State University, PO Box 873701, Tempe, AZ 85287-3701, USA (e-mail: [email protected]).

DOI:10.1348/000711010X500483

A generalized dimensionality discrepancy measure is introduced to facilitate a critique of dimensionality assumptions in multidimensional item response models. Connections between dimensionality and local independence motivate the development of the discrepancy measure from a conditional covariance theory perspective. A simulation study and a real-data analysis demonstrate the utility of the discrepancy measure’s application at multiple levels of analysis in a posterior predictive model checking framework.

1. Introduction
Zhang and Stout (1999a) characterized approaches to dimensionality assessment for item response data in terms of those that (a) are more exploratory in nature and attempt full dimensionality assessment by estimating the number of latent dimensions and determining which items reflect which dimension(s) or (b) are more confirmatory in that they assess (departures from) unidimensionality. In addressing the latter of these, Stout (1987) framed the DIMTEST procedure as responding to Lord’s (1980, p. 21) call for a statistical procedure to assess unidimensionality in item response theory (IRT). However, assessments are often multidimensional in the sense that performance on tasks depends on distinct though possibly related proficiencies or aspects of proficiency.

Advances in estimation capabilities, including those afforded by adopting Bayesian approaches to statistical modelling, have supported the growth in the popularity of research on and the use of multidimensional IRT (MIRT) models for item response and similar types of data (Béguin & Glas, 2001; Bolt & Lall, 2003; Clinton, Jackman, & Rivers, 2004; Yao & Boughton, 2007). With the emergence of such models comes the need for appropriate model checking and model criticism procedures. Thus, Lord’s (1980) call may be extended to a call for procedures for assessing dimensionality assumptions in multidimensional models. The analysis of the assumed dimensionality has
long been recognized as an integral part of any application of unidimensional IRT models, and though evaluating dimensionality assumptions takes on no less importance when multidimensional models are employed, dimensionality and data–model fit procedures for multidimensional models are relatively underdeveloped (Swaminathan, Hambleton, & Rogers, 2007). To this end, we introduce a new discrepancy measure for assessing the dimensionality assumptions applicable to multidimensional (as well as unidimensional) models in the context of MIRT and present the application of this discrepancy measure using a posterior predictive model checking (PPMC) framework. PPMC has been successfully applied to unidimensional IRT models (e.g., Hoijtink, 2001; Sinharay, 2005, 2006; Sinharay, Johnson, & Stern, 2006), including the assessment of dimensionality for unidimensional models (Levy, 2010; Levy, Mislevy, & Sinharay, 2009; Sinharay et al., 2006). To date, no study has examined the use of PPMC for examining dimensionality assumptions when fitting MIRT models. In the current work, we capitalize on the flexibility of PPMC as a model checking framework and introduce a discrepancy measure suitable for assessing a model’s assumed (multi)dimensional structure for a wide class of latent variable models.

The current work considers the multidimensional normal ogive IRT model for dichotomous observables (i.e., scored item responses), which specifies the probability of an observed value of 1 (i.e., a correct response) as

$$P(X_{ij} = 1 \mid \boldsymbol{\theta}_i, \mathbf{a}_j, d_j, c_j) = c_j + (1 - c_j)\,\Phi(\mathbf{a}_j'\boldsymbol{\theta}_i + d_j), \qquad (1)$$

where $X_{ij}$ is the observable response (coded as 0 or 1) of examinee $i$ to item $j$, $\boldsymbol{\theta}_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{iM})'$ is a vector of $M$ latent variables that characterize examinee $i$, $\mathbf{a}_j = (a_{j1}, a_{j2}, \ldots, a_{jM})'$ is a vector of $M$ coefficients for item $j$ that capture the discriminating power of the associated examinee variables, $c_j$ is a lower asymptote parameter for item $j$, $d_j$ is an intercept related to the marginal difficulty of the item, and $\Phi$ is the standard normal distribution function (e.g., Béguin & Glas, 2001; McDonald, 1997). The discrepancy measures and procedures presented in the current work apply to MIRT models using logistic distributions, including models with conjunctive relationships (Embretson, 1997; Reckase, 1997).
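As an informal illustration of the item response function in (1), the following Python sketch evaluates the response probability for one examinee and one item. The function names (`normal_cdf`, `mirt_prob`) and all numerical values are illustrative assumptions and are not taken from the study.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mirt_prob(theta, a, d, c=0.0):
    """P(X = 1 | theta) under a compensatory normal ogive MIRT model.

    theta: latent variables for one examinee (length M)
    a:     discrimination coefficients for one item (length M)
    d:     intercept; c: lower asymptote (hypothetical default of 0)
    """
    if len(theta) != len(a):
        raise ValueError("theta and a must have the same length")
    linear = sum(a_m * t_m for a_m, t_m in zip(a, theta)) + d
    return c + (1.0 - c) * normal_cdf(linear)


# Illustrative values only: an item that depends on the first dimension alone.
p = mirt_prob(theta=[0.5, -1.0], a=[1.0, 0.0], d=-0.5)
print(round(p, 3))  # the linear predictor is 0, so the probability is 0.5
```

Setting a coefficient in `a` to zero removes the item's dependence on that dimension, which is how a simple-structure pattern can be expressed in this parameterization.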

In Section 2, the role of local independence and dimensionality in multidimensional models is discussed, which serves as the foundation for the subsequent development of the new discrepancy measure and its variants. Following an overview of the mechanics of PPMC, results from a simulation study in which the performance of the proposed discrepancy measure is evaluated and compared to existing procedures are presented, evidencing the utility of formulating the discrepancy measure at different levels of the analysis to support inference. A real-data example illustrates the procedures in which multiple graphical representations are employed in the analysis.

2. Dimensionality and local independence in multidimensional models
The relation between the assumptions of (a) proper specification of dimensionality and (b) local independence is not without ambiguity both in terminology and in meaning. Owing to the primacy of unidimensional models, local independence is commonly phrased as the statistical independence of items conditional on a single underlying latent variable (Ip, 2001; Stout, 1987). However, the principle of local independence may hold for $M = 0, 1, 2, \ldots$ latent variables (Hattie, 1984). Accordingly, a more general treatment defines local independence as the statistical independence of items conditional on a possibly vector-valued latent variable (Hattie, Krakowski, Rogers, & Swaminathan, 1996; Stout, 1990). As typically formulated, this more general notion asserts (e.g., Stout, 1990) that for all values of a possibly vector-valued latent variable $\boldsymbol{\theta}$,

$$P(X_1 = x_1, \ldots, X_J = x_J \mid \boldsymbol{\theta}) = \prod_{j=1}^{J} P(X_j = x_j \mid \boldsymbol{\theta}).$$

This definition views local independence with respect to a set of latent variables. Under this formulation, if the model specifies the correct number of variables (or more), local independence will hold. In this vein, Hattie et al. (1996, p. 1) argued that the principle of local independence means that ‘Once these trait values are fixed at a given value (i.e., conditioned on), the responses to items become statistically independent. Thus, in order to determine the dimensionality of a set of items, it is necessary and sufficient to identify the minimal set of traits such that at all fixed levels of these traits the item responses are independent’.

Accordingly, if the model underspecifies the number of variables, local independence will not hold. To illustrate, we extend Ip’s (2001) argument for the presence of local dependence when fitting a unidimensional model to multidimensional data to the case of multidimensional modelling. Once extended to multidimensional models, this argument is recast to advocate a more refined definition of local independence.

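Suppose the true model for a set of $J$ items is three-dimensional with latent variables $\boldsymbol{\theta} = (\theta_1, \theta_2, \theta_3) \sim N(\mathbf{0}, \boldsymbol{\Sigma})$ with covariance matrix

$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \rho_{13}\sigma_1\sigma_3 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 & \rho_{23}\sigma_2\sigma_3 \\ \rho_{13}\sigma_1\sigma_3 & \rho_{23}\sigma_2\sigma_3 & \sigma_3^2 \end{pmatrix},$$

where $\sigma_m^2$ is the variance of $\theta_m$ and $\rho_{mm'}$ is the correlation between $\theta_m$ and $\theta_{m'}$. Consider a linear model for continuous responses to the set of $J$ items:

$$Y_j = \mu + a_{j1}\theta_1 + a_{j2}\theta_2 + a_{j3}\theta_3 + e_j,$$

where $\mu$ is an overall mean parameter, $a_{j1}$, $a_{j2}$, and $a_{j3}$ are item-specific parameters capturing the dependence of the item on the latent variables, and $e_j$ is a random error due to measurement where each value of $e_j$ is independent and identically distributed with mean zero and variance $\sigma_{e_j}^2$. Any two items $Y_j$ and $Y_{j'}$ are conditionally (locally) independent given $\theta_1$, $\theta_2$, and $\theta_3$. However, if one conditions upon $\theta_1$ and $\theta_2$, $E(Y_j \mid \theta_1, \theta_2) = \mu + a_{j1}\theta_1 + a_{j2}\theta_2$ and

$$\mathrm{Cov}(Y_j, Y_{j'} \mid \theta_1, \theta_2) = a_{j3}\,a_{j'3}\,\sigma_3^2\,(1 - \rho_{3|12}^2), \qquad (2)$$

where $\rho_{3|12}^2$ is the squared multiple correlation between $\theta_3$ and $\theta_1$ and $\theta_2$. Non-zero values for $\mathrm{Cov}(Y_j, Y_{j'} \mid \theta_1, \theta_2)$ indicate local dependence, and it is readily seen that local dependence will exist unless (a) at least one of the items does not depend on $\theta_3$ (i.e., $a_{j3}$ and/or $a_{j'3}$ equals zero) or (b) $\theta_3$ is perfectly correlated with some linear combination of $\theta_1$ and $\theta_2$. The argument can be extended in straightforward ways to situations with more true and conditioned dimensions and to situations in which there are multiple relevant dimensions (i.e., a dimension $\theta_m$ where both $a_{jm}$ and $a_{j'm}$ are non-zero) that are not conditioned on.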
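The conditional covariance argument can be checked numerically. The sketch below, a rough illustration under assumed values, computes the variance of the third latent variable that remains after conditioning on the first two (the Schur complement of the covariance matrix, which equals the third variance times one minus the squared multiple correlation) and multiplies it by the two item coefficients. The function names and the example covariance matrix are hypothetical.

```python
def residual_variance(sigma):
    """Var(theta3 | theta1, theta2) for a 3 x 3 covariance matrix (nested lists)."""
    s11, s12, s13 = sigma[0]
    s22, s23 = sigma[1][1], sigma[1][2]
    s33 = sigma[2][2]
    det = s11 * s22 - s12 * s12
    if det <= 0.0:
        raise ValueError("covariance of (theta1, theta2) must be nonsingular")
    # Quadratic form v' A^{-1} v with v = (s13, s23) and A the 2 x 2 upper block.
    explained = (s22 * s13 * s13 - 2.0 * s12 * s13 * s23 + s11 * s23 * s23) / det
    return s33 - explained


def conditional_cov(a_j3, a_k3, sigma):
    """Cov(Y_j, Y_k | theta1, theta2) for the linear model with independent errors."""
    return a_j3 * a_k3 * residual_variance(sigma)


# Hypothetical example: unit variances, theta3 correlated .5 with theta1 and theta2.
sigma = [[1.0, 0.0, 0.5],
         [0.0, 1.0, 0.5],
         [0.5, 0.5, 1.0]]
print(conditional_cov(1.0, 0.8, sigma))  # both items load on theta3: non-zero
print(conditional_cov(0.0, 0.8, sigma))  # condition (a): one loading is zero
```

In this example half of the variance of the third latent variable is left unexplained, so two items that both depend on it remain locally dependent after conditioning on the first two dimensions.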

The above perspective of formulating local independence with respect to a number of latent variables implicitly assumes that the model correctly specifies the pattern of dependence of the items on the dimensions (or allows all items to depend on all the dimensions, up to the limits of identification). However, the assertion that local independence follows from correctly specifying the number of latent variables will not necessarily hold if the pattern of dependence of the items on the dimensions is incorrect. To illustrate, suppose a set of $J$ items follows a three-dimensional model, as above, and that the analyst (correctly) specifies that there are three dimensions. However, suppose the analyst incorrectly specifies that $a_{j3}$ and $a_{j'3}$ equal zero. In this case, the model-implied covariance conditional on $\theta_1$, $\theta_2$, and $\theta_3$ is equal to the model-implied covariance conditional on just $\theta_1$ and $\theta_2$; that is, $\mathrm{Cov}(Y_j, Y_{j'} \mid \theta_1, \theta_2, \theta_3) = \mathrm{Cov}(Y_j, Y_{j'} \mid \theta_1, \theta_2)$, in which case the items will be locally dependent (see (2)). More generally, we hypothesize that an incorrect specification of the dependence of the items on the latent variables (e.g., failing to model an item as dependent on a relevant latent variable, improperly constraining item parameters to be equal) will result in local dependence, even when the correct number of latent variables is specified. Synthesizing the above discussion, we make the assumption of correct model specification explicit and state that items are locally independent with respect to a model if

$$P(X_1 = x_1, \ldots, X_J = x_J \mid \boldsymbol{\theta}, \boldsymbol{\omega}) = \prod_{j=1}^{J} P(X_j = x_j \mid \boldsymbol{\theta}, \boldsymbol{\omega}_j),$$

where $\boldsymbol{\omega}$ is the collection of item-specific parameters $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_J$; in the current case for the model in (1), $\boldsymbol{\omega}_j = (\mathbf{a}_j, d_j, c_j)$. The inclusion of $\boldsymbol{\omega}$ indicates the conditioning is with respect to the assumed structure, as well as the number of latent variables.

A weaker form of local independence requires pairwise item independence (Ip, 2001; Stout, Habing, Douglas, Kim, Roussos, & Zhang, 1996). For items $j$ and $j'$, weak local independence holds if

$$E_{\boldsymbol{\theta}}\left[\mathrm{Cov}(X_j, X_{j'} \mid \boldsymbol{\theta}, \boldsymbol{\omega})\right] = 0, \qquad (3)$$

where $E_{\boldsymbol{\theta}}$ denotes that the expectation is taken with respect to the distribution of $\boldsymbol{\theta}$. In practice, weak local independence is most often investigated in terms of pairs of observables as it is reasonable to assume that if variables are pairwise independent, higher-order dependencies, though possible, are highly implausible (McDonald, 1997; McDonald & Mok, 1995).

The current work adopts the perspective that local dependence can be thought of as unmodelled dimensionality, regardless of whether the extraneous dimensionality is of substantive interest or merely a nuisance (Ip, 2001). This supports a conditional covariance theory perspective on dimensionality assessment, which has provided the foundation for a number of dimensionality assessment procedures for unidimensional models (Levy et al., 2009; Stout et al., 1996; Zhang, 2007; Zhang & Stout, 1999a, 1999b).

3. The generalized dimensionality discrepancy measure
To develop a discrepancy measure and associated testing procedure to assess the assumed dimensionality, consider the model-based covariance (MBC; Reckase, 1997) for item pairs,

$$\mathrm{MBC}_{jj'} = \frac{1}{N}\sum_{i=1}^{N}\left[X_{ij} - E(X_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_j)\right]\left[X_{ij'} - E(X_{ij'} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_{j'})\right], \qquad (4)$$

where $N$ is the number of examinees and in the current context $E(X_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_j)$ is the conditional expectation of the value of the observable (i.e., the scored item response) for examinee $i$ to item $j$, which is given by the item response function of the model. MBC is closely related to $Q_3$ (Yen, 1984); $Q_{3jj'} = r_{e_{ij}e_{ij'}}$, where $r$ refers to the correlation and $e_{ij} = X_{ij} - E(X_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_j)$. For any pair of items, values of MBC and $Q_3$ that are greater (less) than zero indicate the presence of positive (negative) local dependence, whereas values of zero indicate local independence. Note that both MBC and $Q_3$ condition on the latent variables through their use of $E(X_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_j)$, the (conditional) expectation of the value of the response for examinee $i$ to item $j$. Research on discrepancy measures for dimensionality and local independence assumptions in unidimensional models has shown that measures that examine conditional associations, including MBC and $Q_3$, outperform those that examine marginal associations (Levy, 2010; Levy et al., 2009).
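A minimal Python sketch of the two item-pair quantities follows, assuming the responses and the model-implied expectations for each item are available as plain lists over examinees. The helper names (`residuals`, `mbc`, `q3`) and the toy data are hypothetical.

```python
import math


def residuals(x, expected):
    """Residuals e_i = X_i - E(X_i | theta_i, omega) for one item."""
    return [x_i - e_i for x_i, e_i in zip(x, expected)]


def mbc(x_j, e_j, x_k, e_k):
    """Model-based covariance: mean cross-product of the two residual vectors."""
    r_j, r_k = residuals(x_j, e_j), residuals(x_k, e_k)
    return sum(a * b for a, b in zip(r_j, r_k)) / len(r_j)


def q3(x_j, e_j, x_k, e_k):
    """Q3: Pearson correlation between the two residual vectors."""
    r_j, r_k = residuals(x_j, e_j), residuals(x_k, e_k)
    n = len(r_j)
    m_j, m_k = sum(r_j) / n, sum(r_k) / n
    s_jk = sum((a - m_j) * (b - m_k) for a, b in zip(r_j, r_k))
    s_jj = sum((a - m_j) ** 2 for a in r_j)
    s_kk = sum((b - m_k) ** 2 for b in r_k)
    return s_jk / math.sqrt(s_jj * s_kk)


# Toy data: four examinees, expected value .5 everywhere, identical responses.
x_j, x_k = [1, 0, 1, 0], [1, 0, 1, 0]
e = [0.5, 0.5, 0.5, 0.5]
print(mbc(x_j, e, x_k, e), q3(x_j, e, x_k, e))  # both positive: positive local dependence
```

Both functions retain the sign of the conditional association, which is the property that makes them useful at the item-pair level.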

Viewing MBC as an estimator of the conditional covariance in (3), we employ MBC for item pairs as a building block, constructing a generalized dimensionality discrepancy measure (GDDM) as the mean of the absolute values of MBC over unique item pairs:

$$\mathrm{GDDM} = \frac{\sum_{j > j'} \left|\mathrm{MBC}_{jj'}\right|}{J(J-1)/2}. \qquad (5)$$

It is clear that GDDM ≥ 0, with equality holding when weak local independence holds for all pairs of items. GDDM can be viewed as an estimator of

$$\frac{\sum_{j > j'} \left|E_{\boldsymbol{\theta}}\left[\mathrm{Cov}(X_j, X_{j'} \mid \boldsymbol{\theta}, \boldsymbol{\omega})\right]\right|}{J(J-1)/2}, \qquad (6)$$

which captures the average absolute amount of local dependence amongst the item pairs. The absolute value is taken because, when estimating a model, the presence of positive local dependence implies the presence of negative local dependence (Habing & Roussos, 2003). By taking the absolute value prior to aggregation, all bivariate information in the data is employed; in this way negative local dependence contributes to the total local dependence assessed in the model. Procedures that focus on positive local dependence (e.g., DIMTEST) might suffer in situations in which unmodelled multidimensionality most prominently manifests itself in terms of negative local dependence (Levy et al., 2009).
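The aggregation over item pairs can be sketched as follows. This is an informal illustration rather than the authors' program: it assumes the responses and the model-implied expectations are supplied as examinee-by-item nested lists, and the function name `gddm` and the toy data are hypothetical.

```python
from itertools import combinations


def gddm(x, expected):
    """Mean absolute model-based covariance over unique item pairs.

    x:        N x J nested list of 0/1 responses
    expected: N x J nested list of model-implied expectations E(X_ij | theta_i, omega_j)
    """
    n, n_items = len(x), len(x[0])
    if n_items < 2:
        raise ValueError("at least two items are required")
    resid = [[x[i][j] - expected[i][j] for j in range(n_items)] for i in range(n)]
    total, n_pairs = 0.0, 0
    for j, k in combinations(range(n_items), 2):
        mbc_jk = sum(resid[i][j] * resid[i][k] for i in range(n)) / n
        total += abs(mbc_jk)  # absolute value taken before aggregation
        n_pairs += 1
    return total / n_pairs


# Toy data: items 1 and 2 agree, item 3 is reversed; expectations are .5 throughout.
x = [[1, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 1]]
expected = [[0.5, 0.5, 0.5] for _ in x]
print(gddm(x, expected))
```

In the toy data one pair has a positive MBC and two pairs have negative MBC of the same magnitude, so a signed average would partly cancel, whereas the mean absolute value reflects all three dependencies. Passing only the columns for a subset of items gives the subtest-level version.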

The quantity in (6) may be thought of as a multidimensional extension of the maximum possible value of the DETECT index (Stout et al., 1996). Whereas the DETECT index conditions on the (single) dimension of best measurement and seeks to identify clusters of items assuming simple structure (Stout et al., 1996; Zhang, 2007; Zhang & Stout, 1999b), the proposed quantity conditions on the possibly multiple dimensions in the model and the possibly factorially complex pattern of dependence of items on the dimensions. The quantity in (6) may similarly be viewed as a multidimensional extension of quantities associated with Stout’s notion of essential unidimensionality (see equation (14) of Stout, 1987). As such GDDM is, in spirit, an extension of the DIMTEST statistic to support confirmatory dimensionality assessment for multidimensional as well as unidimensional models.

GDDM may be formulated at the subtest level in a straightforward manner by taking $J$ in (5) to be the number of items to be investigated in the subtest. At the level of a single item pair (i.e., a subtest with $J = 2$), MBC or $Q_3$ is preferred, as they reflect the direction (positive or negative) of the conditional association in addition to its magnitude. In Section 4, we describe the model checking framework that supports the use of GDDM at the test and subtest level, and MBC and $Q_3$ at the item-pair level.

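4. Posterior predictive model checking
The current section advances a Bayesian approach to modelling and model checking to facilitate inference regarding the model’s assumed dimensionality based on GDDM. Given the usual conditional independence assumptions, the posterior distribution of person parameters $\boldsymbol{\theta}$ and item parameters $\boldsymbol{\omega}$ given data $\mathbf{X}$ is

$$P(\boldsymbol{\theta}, \boldsymbol{\omega} \mid \mathbf{X}) \propto \prod_{i=1}^{N}\prod_{j=1}^{J} P(X_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_j)\,\prod_{i=1}^{N} P(\boldsymbol{\theta}_i)\,\prod_{j=1}^{J} P(\boldsymbol{\omega}_j). \qquad (7)$$

PPMC (Gelman, Meng, & Stern, 1996; Meng, 1994; Rubin, 1984) analyses characteristics of the observed data and/or the discrepancy between the data and the model by referring to the posterior predictive distribution

$$P(\mathbf{X}^{\mathrm{rep}} \mid \mathbf{X}) = \int\!\!\int P(\mathbf{X}^{\mathrm{rep}} \mid \boldsymbol{\theta}, \boldsymbol{\omega})\,P(\boldsymbol{\theta}, \boldsymbol{\omega} \mid \mathbf{X})\,d\boldsymbol{\theta}\,d\boldsymbol{\omega}, \qquad (8)$$

where $P(\boldsymbol{\theta}, \boldsymbol{\omega} \mid \mathbf{X})$ is the posterior distribution in (7), and $\mathbf{X}^{\mathrm{rep}}$ refers to replicated data of the same size (i.e., sample size and number of observables) as the original data. PPMC is conducted by evaluating discrepancy measures that are constructed to capture relevant features of the data as well as the discrepancy between data and the model (i.e., GDDM). Data–model fit is evaluated by examining the realized values of the discrepancy measures based on the observed data, denoted $D(\mathbf{X}; \boldsymbol{\theta}, \boldsymbol{\omega})$, relative to the values obtained by employing the posterior predictive distribution, denoted $D(\mathbf{X}^{\mathrm{rep}}; \boldsymbol{\theta}, \boldsymbol{\omega})$. In addition to graphical procedures, a common way to summarize the results of a PPMC analysis is via the posterior predictive p (PPP) value (Gelman et al., 1996; Meng, 1994), the tail area of the posterior predictive distribution of the discrepancy measure corresponding to the realized value(s):

$$\mathrm{PPP} = P\left(D(\mathbf{X}^{\mathrm{rep}}; \boldsymbol{\theta}, \boldsymbol{\omega}) \geq D(\mathbf{X}; \boldsymbol{\theta}, \boldsymbol{\omega}) \mid \mathbf{X}\right). \qquad (9)$$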

The current work adopts a perspective that views the results of PPMC as diagnostic pieces of evidence for, rather than a hypothesis test of, data–model (mis)fit (Gelman, 2003, 2007; Gelman et al., 1996; Gill, 2007; Stern, 2000). Such an interpretation is particularly relevant in the context of assessing dimensionality and local dependence (Chen & Thissen, 1997) and is consistent with the approach to model criticism that views statistical diagnostics as a component in the larger enterprise of evaluating model adequacy (Sinharay, 2005). From this perspective, PPP values have direct interpretations as expressions of our expectation of the extremity in future replications, conditional on the model (Gelman, 2003, 2007), and serve to summarize the results of the PPMC. PPP values near .5 indicate that realized discrepancies fall in the middle of the posterior predictive distribution of the discrepancy, evidencing adequate data–model fit in terms of the features captured by the discrepancy measure. Extreme PPP values indicate a lack of data–model fit in terms of the features captured by the discrepancy measure. For GDDM, a PPP value close to zero will result when the realized values consistently exceed those in the posterior predictive distribution, indicating that the observed data exhibit more local dependence than would be expected based on the model.
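Given paired realized and posterior predicted discrepancy values, one pair per posterior draw, the PPP value can be estimated as a simple proportion. The sketch below assumes that setup; the function name `ppp_value` and the numbers are hypothetical, chosen only to resemble the scale of GDDM.

```python
def ppp_value(realized, predicted):
    """Proportion of draws in which the posterior predicted discrepancy
    is at least as large as the realized discrepancy for the same draw."""
    if not realized or len(realized) != len(predicted):
        raise ValueError("realized and predicted must be non-empty and paired")
    count = sum(1 for r, p in zip(realized, predicted) if p >= r)
    return count / len(realized)


# Hypothetical values: realized discrepancies mostly exceed predicted ones.
realized = [0.0042, 0.0041, 0.0043, 0.0040]
predicted = [0.0038, 0.0039, 0.0044, 0.0037]
print(ppp_value(realized, predicted))  # 0.25
```

A value near .5 would correspond to points scattered evenly around the unit line in a plot of predicted against realized values, while a value near zero corresponds to realized values that are consistently larger.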

It is noted that the propriety of interpreting PPP values from a hypothesis-testing perspective has been critiqued, as PPP values will not necessarily be uniformly distributed under the null hypothesis of exact data–model fit (Robins, van der Vaart, & Ventura, 2000). As a result, employing PPP values to conduct hypothesis testing could lead to conservative tests. Alternatives to PPP values include conditional predictive and partial predictive p values (Bayarri & Berger, 2000a), which do have desirable properties (from a frequentist, hypothesis-testing perspective) under null conditions (Bayarri & Berger, 2000a; Robins et al., 2000). However, these approaches have several drawbacks (Stern, 2000), including that they do not support discrepancy measures that are a function of the data and model parameters (Bayarri & Berger, 2000a). Research has shown that when investigating dimensionality and local independence assumptions, discrepancy measures that condition on (some function of) the model parameters and examine conditional associations outperform discrepancy measures that are not functions of the model parameters and examine marginal associations (Levy, 2010; Levy et al., 2009; McDonald & Mok, 1995). As a function of the model-implied expectations, GDDM is explicitly constructed to embody the principles of a conditional covariance theory perspective on (multi)dimensionality assessment and precludes the usage of conditional and partial predictive p values. In contrast, PPMC handles GDDM (and other) discrepancy measures naturally. Importantly, it will not necessarily be the case that PPP values will yield overly conservative tests; appropriately chosen discrepancy measures should yield PPP values that perform quite well under null conditions (Bayarri & Berger, 2000b). Levy et al. (2009) found that PPP values for MBC, from a hypothesis-testing perspective, yielded Type I error rates only slightly below the nominal values in the case of unidimensional IRT. Moreover, researchers wishing to adopt a hypothesis-testing perspective may employ methods to calibrate the PPP values to obtain p values with desirable frequentist properties (Hjort, Dahl, & Steinbakk, 2006).

PPMC is a flexible framework for model criticism and holds many potential advantages over traditional techniques (Levy et al., 2009). PPMC makes no appeal to regularity conditions or asymptotic arguments to arrive at reference distributions; the ubiquity of ill-defined reference distributions has hampered traditional approaches to model checking in unidimensional IRT (e.g., Chen & Thissen, 1997). Such problems could potentially be exacerbated in MIRT models studied here. As a consequence of the flexibility of PPMC, the analyst is free to choose from a broad class of functions, including those that pose difficulties for traditional model checking techniques (e.g., Levy et al., 2009). In addition, the results of PPMC may be compiled and communicated in a variety of ways, including numerically via PPP values or graphically (Gelman, 2003; Gelman et al., 1996; Levy et al., 2009; Sinharay et al., 2006). Furthermore, by using the posterior distribution rather than point estimates for the model parameters, PPMC incorporates the uncertainty in the estimation of the model parameters in the construction of the distributions of the discrepancy measures (Meng, 1994).

5. Simulation study
This section describes a Monte Carlo study examining the utility of GDDM in evaluating the dimensionality assumptions of MIRT models. As described below, data are generated from two- and three-dimensional MIRT models and fitted with a series of models to evaluate the behaviour of GDDM under various conditions.

5.1. Models
All data sets consisted of simulated responses from 1,000 simulated examinees on a test of 36 items following one of several multidimensional model structures of the form in (1) where, for simplicity, all $c_j$ parameters are equal to 0. Table 1 lays out the models’ structures in terms of the values of the parameters used to generate the data. The second column gives the values of the location parameter for the items used for all the models. The columns under the heading M0 give the values of $a_{j1}$ and $a_{j2}$, the discrimination parameters for $\theta_1$ and $\theta_2$, respectively, for M0. As a two-dimensional model with simple structure, M0 serves as the baseline model; the remaining models may be thought of as extensions of this model. M1, M2, and M3 are three-dimensional models in which some of the items are dependent on a third dimension, $\theta_3$. For each of these models, the structure of the dependence of the items on $\theta_1$ and $\theta_2$ remains the same as in M0. The columns under the headings M1, M2, and M3 indicate which items reflect $\theta_3$ in each of these models, and give the value of $a_{j3}$ for these items. M1 is a three-dimensional model in which some of the items that depend on $\theta_1$ additionally depend on $\theta_3$. M2 is a three-dimensional model in which $\theta_3$ influences items that depend on $\theta_2$ as well as items that depend on $\theta_1$. M3 has a similar structure to M2, but has fewer items that depend on $\theta_3$. The structures in M1–M3 correspond to situations in which $\theta_3$ may be a substantive dimension or a dimension akin to a method factor, say, for use in modelling testlet structures. Finally, M4 is a two-dimensional model; the columns under M4 give the values of $a_{j1}$ and $a_{j2}$. M4 differs from M0 in that several items reflect both $\theta_1$ and $\theta_2$, corresponding to a scenario where multiple skills or proficiencies are involved in solving some of the items. For simplicity, we assume all bivariate correlations between the latent variables are constant, denoted by $\rho$. For each model structure, we consider conditions in which $\rho = 0$. In addition, we consider conditions in which $\rho = .5$ for each of M0, M1, M2, and M4.

5.2. Analyses
For each model, the following Monte Carlo procedures were replicated 50 times. For each replication, 1,000 values of $\theta_1$, $\theta_2$, and $\theta_3$ were generated from a multivariate normal distribution with all variances equal to one and correlations dictated by the condition. These values and the item parameters (Table 1) were entered into the MIRT item response function in (1) to yield the model-implied probability for each of the 1,000 simulated examinees to each of the 36 items. Each of these probabilities was compared to a simulated random variable from a uniform distribution over the unit interval. If the probability exceeded the random uniform draw, the value of the item response was set to one; otherwise it was set to zero.
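The data generation steps just described can be sketched in Python. This is a rough sketch under stated assumptions: the item parameters below are placeholders rather than the Table 1 values, the lower asymptotes are fixed at 0, and equicorrelated latent variables are produced by a shared-component construction (valid for correlations between 0 and 1), which is one convenient way, not necessarily the authors', to obtain unit variances and a constant correlation.

```python
import math
import random


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def draw_thetas(n_dims, rho, rng):
    """Unit-variance normal variables with common correlation rho (0 <= rho <= 1)."""
    shared = rng.gauss(0.0, 1.0)
    return [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(n_dims)]


def simulate_responses(n_examinees, a, d, rho, seed=0):
    """Simulate 0/1 responses; a is J x M discriminations, d is a list of J intercepts."""
    rng = random.Random(seed)
    n_dims = len(a[0])
    data = []
    for _ in range(n_examinees):
        theta = draw_thetas(n_dims, rho, rng)
        row = []
        for a_j, d_j in zip(a, d):
            prob = normal_cdf(sum(a_m * t_m for a_m, t_m in zip(a_j, theta)) + d_j)
            # Response is 1 when the probability exceeds a uniform draw on (0, 1).
            row.append(1 if prob > rng.random() else 0)
        data.append(row)
    return data


# Placeholder parameters: four items, two dimensions, simple structure.
a = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
d = [0.5, -0.5, 0.5, -0.5]
data = simulate_responses(1000, a, d, rho=0.5, seed=1)
print(len(data), len(data[0]))
```

Fixing the seed makes a replication reproducible; a different seed per replication would give the independent data sets needed for a Monte Carlo study.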

To explore the utility of using the proposed procedures, we conducted PPMC using GDDM at the test and subtest levels, and using MBC and $Q_3$ at the item-pair level for a variety of combinations of data generation models and data analysis models. A cross in Table 2 indicates which models were fitted to data from each of the models.

Table 1. The structure of the models in the simulation study

For each of the combinations listed in Table 2, the analysis proceeded as follows. For each of the data sets from the generation model, the analysis model was fitted using Markov chain Monte Carlo procedures (Béguin & Glas, 2001; Bolt & Lall, 2003; Clinton et al., 2004; Yao & Boughton, 2007) via the pscl package in R (Jackman, 2008). This estimation routine specifies independent standard normal priors for the latent examinee variables and requires user input of the prior distributions for the item parameters. The location parameters were assigned diffuse normal prior distributions. The function requires the input of prior distributions for all discrimination parameters. Those specified by the model were assigned diffuse normal prior distributions. Those not specified by the model (e.g., discrimination parameters for items 1–18 on $\theta_2$ in M0) were assigned normal distributions with a mean of zero and a variance of $1 \times 10^{-12}$, effectively constraining the parameters to be equal to zero.

Table 2. Analysis plan of the simulation study

For five replications in each analysis condition (i.e., each combination listed in Table 2), trace plots and convergence diagnostics (Brooks & Gelman, 1998) were analysed and it was determined that, in all cases, the chains converged by 5,000 iterations. It was concluded that 5,000 iterations would be sufficient to discard as burn-in for the remaining replications in each condition. Thus, for each analysis, three chains were run from overdispersed starting-points for 2,000 iterations after the burn-in phase of 5,000 iterations. These iterations were thinned by 20 and the remaining iterations were pooled to yield 300 draws from the posterior distribution for use in conducting PPMC. These draws are employed with the observed data $\mathbf{X}$ to compute the realized values of the discrepancy measures and then used to generate the $\mathbf{X}^{\mathrm{rep}}$, which are then used in computing the posterior predicted values of the discrepancy measures. The PPP value is estimated as the proportion of draws for which the posterior predicted value of the discrepancy measure exceeds the corresponding realized value. Programs to conduct the PPMC were written by the authors and are available upon request.
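The PPMC procedure described above can be outlined generically. The sketch below is not the authors' program: `ppmc`, `discrepancy`, and `simulate` are hypothetical names, and the toy discrepancy and simulator at the end stand in for GDDM and for data generation from the fitted MIRT model. The first lines simply reproduce the arithmetic behind the 300 pooled draws.

```python
# Three chains of 2,000 post-burn-in iterations, thinned by 20 and pooled.
chains, iterations, thin = 3, 2000, 20
n_draws = chains * (iterations // thin)
print(n_draws)  # 300


def ppmc(draws, observed, discrepancy, simulate):
    """Generic PPMC loop over posterior draws.

    draws:       iterable of parameter values sampled from the posterior
    observed:    the observed data
    discrepancy: function (data, params) -> number
    simulate:    function (params) -> replicated data of the same size
    """
    realized, predicted = [], []
    for params in draws:
        realized.append(discrepancy(observed, params))
        replicated = simulate(params)
        predicted.append(discrepancy(replicated, params))
    exceed = sum(1 for p, r in zip(predicted, realized) if p > r)
    return realized, predicted, exceed / len(realized)


# Toy stand-ins (assumptions for illustration only): the parameter is a
# proportion, and the discrepancy is its distance from the data mean.
observed = [1, 1, 0, 1]
toy_draws = [0.5, 0.6]
toy_discrepancy = lambda data, p: abs(sum(data) / len(data) - p)
toy_simulate = lambda p: [p] * len(observed)
realized, predicted, ppp = ppmc(toy_draws, observed, toy_discrepancy, toy_simulate)
print(ppp)
```

Because the discrepancy is evaluated with the same parameter draw for the observed and the replicated data, the uncertainty in the parameters carries through to both sets of values.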

To facilitate a comparison between the performance of the proposed methods and existing procedures, the models were fitted in NOHARM (Fraser & McDonald, 1988) and programs were written to compute two statistics targeting the goodness of fit of the model. Specifically, we consider a statistic introduced by Gessaroli and De Champlain (1996; Finch & Habing, 2005, 2007),

$$\chi^2_{GD} = (N - 3)\sum_{j=2}^{J}\sum_{j'=1}^{j-1} z_{jj'}^2,$$

where $j$ and $j'$ serve to index the items to define the unique pairings of items, and

$$z_{jj'} = \frac{1}{2}\ln\left(\frac{1 + r_{jj'}}{1 - r_{jj'}}\right)$$

is the Fisher z transformation of the residual correlation for the pairing of items $j$ and $j'$, given by

$$r_{jj'} = \frac{p^{(r)}_{jj'}}{\sqrt{p^{(0)}_j\left(1 - p^{(0)}_j\right)p^{(0)}_{j'}\left(1 - p^{(0)}_{j'}\right)}},$$

where $p^{(0)}_j$ is the observed proportion of examinees getting item $j$ correct and $p^{(r)}_{jj'}$ is the residual covariance between items $j$ and $j'$. To facilitate hypothesis testing, the value of $\chi^2_{GD}$ is referred to a central $\chi^2$ distribution with degrees of freedom given by $0.5J(J-1) - t$, where $t$ is the number of parameters estimated in fitting the model (Finch & Habing, 2007).
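A small Python sketch of this statistic follows, assuming it takes the form of $N - 3$ times the sum of squared Fisher z values over unique item pairs. The function names, the dictionary layout for the residual covariances, and the example inputs (including the parameter count) are hypothetical.

```python
import math


def fisher_z(r):
    """Fisher z transformation of a correlation r, with -1 < r < 1."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def chi2_gd(resid_cov, p_obs, n):
    """Sum of squared Fisher z values of residual correlations, scaled by N - 3.

    resid_cov: dict mapping an item pair (j, k) to its residual covariance
    p_obs:     list of observed proportions correct, indexed by item
    n:         number of examinees
    """
    total = 0.0
    for (j, k), cov in resid_cov.items():
        scale = math.sqrt(p_obs[j] * (1.0 - p_obs[j]) * p_obs[k] * (1.0 - p_obs[k]))
        total += fisher_z(cov / scale) ** 2
    return (n - 3) * total


def df_gd(n_items, n_params):
    """Degrees of freedom: number of unique item pairs minus estimated parameters."""
    return n_items * (n_items - 1) // 2 - n_params


# Hypothetical inputs: three items with small residual covariances.
resid_cov = {(0, 1): 0.01, (0, 2): -0.02, (1, 2): 0.0}
p_obs = [0.5, 0.6, 0.7]
print(chi2_gd(resid_cov, p_obs, n=1000))
print(df_gd(36, 72))  # 630 unique pairs for 36 items, minus an assumed 72 parameters
```

With 36 items there are 630 unique item pairs, so the reference distribution has a large number of degrees of freedom for any plausible number of estimated parameters.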

In addition, we consider a statistic introduced by Gessaroli, De Champlain, and Folske (1997; Finch & Habing, 2005, 2007),

$$\mathrm{ALR} = \sum_{j=2}^{J}\sum_{j'=1}^{j-1} G^2_{jj'},$$

where

$$G^2_{jj'} = 2N\sum_{k_j=0}^{1}\sum_{k_{j'}=0}^{1} p_{k_jk_{j'}}\ln\left(\frac{p_{k_jk_{j'}}}{\hat{p}_{k_jk_{j'}}}\right),$$

where $k_j$ and $k_{j'}$ are the scores (0 or 1) for items $j$ and $j'$, respectively, and $p_{k_jk_{j'}}$ and $\hat{p}_{k_jk_{j'}}$ are the observed and model-implied proportions of examinees with response patterns given by the combination of $k_j$ and $k_{j'}$, respectively (see Finch & Habing, 2005, for further details on the calculation using NOHARM output). To facilitate hypothesis testing, ALR is referred to the same $\chi^2$ distribution as $\chi^2_{GD}$ (Finch & Habing, 2007).

5.3. Results

Following Gelman et al. (1996), we recommend the use of graphical representations of the results of PPMC in an applied analysis. Figure 1 contains scatterplots of the realized and posterior predicted values of GDDM from an analysis of one data set generated from M1 with $\rho = 0$. Figure 1a contains a scatterplot of the realized and posterior predicted values of GDDM obtained by fitting M1, which is the correct model. Figure 1b contains the scatterplot obtained when the data were fitted with M0, which underspecifies the dimensionality. In each plot, the unit line is added as a reference. It is seen that in Figure 1a the points appear to be randomly distributed around the line, indicating that the realized and posterior predicted values are comparable, evidencing adequate data–model fit. In Figure 1b, the points fall below and to the right of the line, indicating that the realized values are frequently larger than the posterior predicted values, hence that the observed data exhibit more local dependence than the posterior predicted data, as captured by GDDM. The PPP values summarize these graphical representations as the proportions of points above and to the left of the unit line, which in this case are .46 and .01, respectively. For ease of exposition, the majority of the results of the simulation study are presented in terms of PPP values as summaries of each PPMC analysis.

Figure 1. Scatterplots of the realized (horizontal axis) and posterior predicted (vertical axis) values of GDDM from the analysis of a data set from M1: (a) results from fitting M1; and (b) results from fitting M0.
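The summary of a scatterplot by a PPP value, as described above, amounts to counting the points that lie above the unit line. The following Python sketch illustrates that count; the function name is hypothetical, and ties are counted as lying above the line, which is an assumption that matters only when realized and predicted values coincide exactly.

```python
def ppp_value(realized, predicted):
    """Proportion of posterior draws whose predicted discrepancy is at least
    as large as the realized discrepancy (points on or above the unit line).

    realized[s] and predicted[s] are the discrepancy values for draw s.
    """
    if len(realized) != len(predicted) or not realized:
        raise ValueError("need two non-empty sequences of equal length")
    above = sum(1 for d_obs, d_rep in zip(realized, predicted) if d_rep >= d_obs)
    return above / len(realized)
```

A value near .5 corresponds to points scattered evenly around the line, whereas a value near 0 corresponds to realized values that are consistently larger than the posterior predicted values.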

Table 3 summarizes the results of the study for GDDM at the test level, $\chi^2_{GD}$, and ALR in each of the conditions. For each condition defined by the first two columns, the proportions of replications with p values below .10 and .05 obtained from fitting the correct model are given first, followed by the proportions of replications with p values below .10 and .05 obtained from fitting M0. For GDDM, the p values are PPP values based on PPMC; for $\chi^2_{GD}$ and ALR the p values are obtained via the $\chi^2$ reference distributions as described above.

Table 4 presents the proportion of PPP values below .10 and .05 for GDDM evaluated on subtests when M0 was fitted to data from the various conditions. The first and second subtests were defined as the first half of the test (items 1–18) and second half of the test (items 19–36), respectively. From the perspective of fitting M0, investigating GDDM for these subtests corresponds to investigating the dimensionality that is assumed for each modelled dimension. For all models except M1, the results of the PPP values from the two subtests were pooled following an assumption that the two subtests are symmetric with respect to their dimensionality. For M1, the subtests are not symmetric; the first half of the test reflects $\theta_1$ and $\theta_3$ while the second half of the test reflects $\theta_2$ only. For M1, the proportions of PPP values below .10 and .05 are listed for the two subtests separately.

Table 3. Proportions of p values for GDDM, $\chi^2_{GD}$, and ALR in the analysis conditions

Table 4. Proportions of PPP values beyond .10 and .05 for subtests when M0 was fitted to data from the generation models: except where noted, the results of the two subtests are pooled

Table 5 presents the median and proportion of extreme PPP values of MBC for different types of item pairs from fitting M0 to data in each of the conditions where $\rho = 0$. Extremely high PPP values are considered as well as extremely low PPP values because MBC is a directional measure of local dependence. For each model, item pairs are defined by types based on the dimension(s) they reflect. Item-pair types that have an exchangeable dimensional structure are pooled. For example, in M0 (1–1) refers to item pairs in which both items reflect $\theta_1$ and (2–2) refers to item pairs in which both items reflect $\theta_2$. These item pairs have the same dimensional structure in the sense that they both reflect one of the correctly modelled dimensions. As such they are pooled together but kept separate from item pair (1–2), in which one item reflects $\theta_1$ and the other item reflects $\theta_2$. Results for the analysis of Q3 are not presented as they were quite close to those of MBC. Similarly, results for MBC and Q3 when the latent variables were correlated ($\rho = .5$) are not presented as they exhibited patterns consistent with those in Table 5.

Table 5. Median PPP values and proportions of extreme PPP values for MBC by types of item pairs when M0 was fit to data from the generation models with $\rho = 0$

5.4. Discussion

Viewing a PPMC analysis as a diagnostic approach to evaluating data–model fit (Gelman, 2003; Gelman et al., 1996; Levy et al., 2009; Sinharay, 2005), the results in Table 3 indicate that when M0 is fitted to data from more dimensionally complex models, PPMC using GDDM is likely to yield patterns indicative of data–model misfit akin to that in Figure 1b. When the correct model is fitted to data, PPMC is unlikely to yield such patterns. Instead, PPMC will typically yield patterns indicative of adequate data–model fit as in Figure 1a. The results for M1, M2, and M3 indicate that GDDM is able to detect data–model misfit of the two-dimensional model when fitted to data that follow a three-dimensional model, both when the influence of the unmodelled dimension is concentrated on items that depend on one of the other dimensions (M1) and when its influence is more widely distributed across the total set of items (M2 and M3).

The results for fitting M0 to data that follow M4 indicate that GDDM is able to detect data–model misfit of the simpler two-dimensional model when fitted to data that follow a two-dimensional model with more complex structure. Similarly, the results from fitting M3 to data that follow M2 indicate that GDDM is able to detect data–model misfit when the analysis model correctly specifies three dimensions but the pattern of the dependence of the items on those dimensions is incorrectly specified. These results support the argument advanced above that local independence should be viewed not merely with respect to the number of dimensions, but with respect to the model as constituted by dimensions and the patterns of dependence of the observables on those dimensions.

A hypothesis-testing perspective views the proportions in columns 3 through 8 in Table 3 as being akin to empirical Type I error rates and the proportions in the remaining columns as akin to power rates for detecting data–model misfit due to the improperly specified dimensionality. Although the current work adopts a diagnostic perspective on PPMC, it is interesting to note that, from a hypothesis-testing perspective, GDDM has considerable power to detect data–model misfit in dimensionally misspecified models while maintaining Type I error rates. With only a few exceptions (under M4 with $\rho = 0$, M2 with $\rho = .5$, and M1 with $\rho = .5$ and $\alpha = .05$), the Type I error rates were slightly below the nominal values, which is consistent with theoretical work on the distributions of PPP values under null conditions (Robins et al., 2000) and previous work on the use of PPP values in IRT (e.g., Sinharay et al., 2006). In the context of dimensionality and local dependence assessment for unidimensional models, Levy et al. (2009) found that MBC, Q3, and related indices exhibited empirical Type I error rates at or slightly below nominal values. The present study finds that GDDM, which extends MBC, yields similar rates in the multidimensional models studied here.
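The proportions discussed above are simple tallies over replications. As a minimal sketch (the function name and the input values in the usage note are hypothetical, not results from the study), the proportion of replications flagged at a given level can be computed as follows.

```python
def flag_rate(p_values, level):
    """Proportion of replications whose p value (or PPP value) falls below level."""
    if not p_values:
        raise ValueError("need at least one replication")
    return sum(1 for p in p_values if p < level) / len(p_values)
```

When the fitted model is the generating model, `flag_rate(p_values, 0.05)` plays the role of an empirical Type I error rate; when the fitted model is misspecified, the same tally plays the role of an empirical power rate.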

In contrast, the empirical Type I error rates for $\chi^2_{GD}$ and ALR were generally well below nominal values. GDDM also considerably outperformed $\chi^2_{GD}$ and ALR in terms of detecting data–model misfit when the data followed M1 and M3. GDDM performed as well as or better than $\chi^2_{GD}$ and ALR when the data followed M2. In M1–M3, ALR outperformed $\chi^2_{GD}$, and performed nearly as well as GDDM when the data followed M2. When the data followed M4, $\chi^2_{GD}$ and ALR performed similarly and outperformed GDDM. Within any model structure, all of the discrepancy measures performed as well or better in detecting unmodelled dimensionality when the latent variables were uncorrelated compared to when they were correlated, as is consistent with theoretical and previous empirical work in unidimensional modelling (e.g., Levy et al., 2009; Nandakumar & Stout, 1993; Nandakumar, Yu, Li, & Stout, 1998; Stout, 1987; Zhang & Stout, 1999a). Overall, GDDM performed favourably as compared to $\chi^2_{GD}$ and ALR.

The results from fitting M2, M3, and M0 to the data from the M2 condition illustrate the way GDDM captures increasingly poor data–model fit. In these conditions, M2 is the correct model, M3 departs from M2 by failing to model the dependencies of some of the items on $\theta_3$, and M0 further departs by failing to model the dependencies of any of the items on $\theta_3$. As such, M3 and M0 are increasingly misspecified models relative to the correct model structure of M2. Table 3 indicates that the use of PPP values from GDDM allows for the detection of data–model misfit associated with fitting M3 and M0 in every replication. To further illustrate how GDDM assesses the magnitude of the data–model misfit, Figure 2 plots the distributions of the means of the realized GDDM values from fitting each model to the data from the M2 condition with $\rho = 0$. The solid line represents the distribution of mean GDDM values obtained from fitting M2; the dotted and dashed lines represent the distributions of mean GDDM values obtained from fitting M3 and M0, respectively. As is evident, for all data sets the values of the mean GDDM from both M3 and M0 exceeded that from M2. Furthermore, the mean values from M0 (dashed) were in general larger than those from M3 (dotted). This finding supports the interpretation that larger values of GDDM are indicative of worse data–model fit.

Figure 2. Distributions of mean GDDM values fitting M0 (dashed line), M2 (solid), and M3 (dotted) to data from M2.

The results in Table 4 indicate that GDDM at the subtest level exhibits patterns similar to those of GDDM at the test level. When the data are generated from M0 and fitted with M0, extreme PPP values occur rarely (i.e., in hypothesis-testing terms: slightly above the nominal level of .10 when $\rho = 0$, below the nominal level of .10 when $\rho = .5$, and at the nominal level of .05 in both conditions). When M0 is fitted to data from other models, the proportions are all larger than when fitted to data from M0, indicating GDDM at the subtest level is sensitive to the unmodelled dimensionality. The results for fitting M0 to data from M1 exhibit a pattern consistent with expectation where the model detects the presence of the extraneous dimensionality in the first subtest, but not in the second subtest. When the unmodelled dimension influences items on both subtests (M2), GDDM indicates data–model misfit on each subtest in all replications. Reducing the number of items that reflect the unmodelled dimension in M3 reduces the capacity for GDDM to detect data–model misfit at the subtest level. This further supports the previous findings of GDDM as being sensitive to the magnitude of the data–model misfit. Consistent with the findings at the test level, the performance of GDDM at the subtest level suffered when the latent variables were correlated.

The results for the investigation of MBC at the item-pair level when fitting M0 (Table 5) reveal a number of patterns. In the M0 condition the analysis model is correctly specified. Accordingly, the median PPP values for the two types of item pairs are near .5, and the proportion of extreme PPP values in the tails beyond .10 and .05 is just under those respective values. These results parallel analogous results from the investigation of MBC in unidimensional models (Levy et al., 2009).

Under M1, the influence of $\theta_3$ is localized to the first half of the test, and does not influence any of the items that reflect $\theta_2$; associations involving these items should be well modelled. Accordingly, the pairings of items where one or both of the items reflect $\theta_2$ (labelled 1–2, 2–13, and 2–2) yield medians of PPP values at .5 and proportions of extreme PPP values just below the nominal levels, as in the M0 condition. A conditional covariance theory perspective on estimation and dimensionality (Levy et al., 2009; Stout et al., 1996; Zhang & Stout, 1999a) implies that the dimension estimated as $\theta_1$ when M0 is fitted to data that follow M1 is a complex combination of the true $\theta_1$ and $\theta_3$. As a result, pairings of items with the same dimensional structure (either 1–1 or 13–13) should exhibit positive local dependence, whereas item pairs with different dimensional structures with respect to the estimated $\theta_1$ (1–13) should exhibit negative local dependence. This is exactly what was observed in the M1 condition, where the 1–1 pairings and the 13–13 pairings yielded increasingly high proportions of small PPP values, indicating positive local dependence, and the 1–13 pairings yielded relatively high proportions of large PPP values, indicating negative local dependence. These patterns of positive and negative local dependence mirror those found using MBC and related indices in the analysis of unidimensional IRT in light of unmodelled multidimensionality (Levy, 2010; Levy et al., 2009).

Under M2, the influence of $\theta_3$ is distributed over both halves of the test. A conditional covariance theory perspective implies that the dimension estimated as $\theta_1$ when M0 is fitted to data is a complex combination of the true $\theta_1$ and $\theta_3$ and the dimension estimated as $\theta_2$ when M0 is fitted to data is a complex combination of the true $\theta_2$ and $\theta_3$. Akin to the results in M1, pairs in which both items reflect $\theta_3$ (13–13, 23–23, 13–23) yielded low PPP values, indicating positive local dependence, as did pairs where both items reflected just $\theta_1$ or $\theta_2$ (1–1, 2–2). Similarly, item pairs in which one item reflects $\theta_3$ and the other item reflects either $\theta_1$ or $\theta_2$ (1–13, 2–23, 1–23, 2–13) yielded high PPP values, indicating negative local dependence. Finally, the PPP values for the 1–2 pairs were somewhat smaller than .5. Under M3, pairings in which one or both items did not reflect $\theta_3$ yielded more moderate PPP values than their counterparts under M2. In contrast, the PPP values for item pairs in which both items reflect $\theta_3$ (13–13, 23–23, 13–23) were smaller than their counterparts under M2.

Under M4, the PPP values for the 12–12 item pairs are small, indicating positive local dependence. The PPP values for the remaining types of item pairs are closer to .5, indicating their associations are well modelled. Under M4, the unmodelled multidimensionality predominantly manifests itself in terms of item pairs in which both items reflect $\theta_1$ and $\theta_2$. We speculate that the results for the remaining item pairs are consistent with a conditional covariance theory perspective. However, it is unclear whether the results for the remaining types of item pairs indicate true patterns or are merely reflective of random variation around .5. Theoretical research on the extension of conditional covariance theory to multidimensional models and empirical research investigating patterns of local dependence in multidimensional contexts are necessary to further investigate this possibility.

Synthesizing across the results, it is clear that the higher level of aggregation (subtests above item pairs, test above subtests) yields better performance in terms of higher rates of indicating the presence of unmodelled multidimensionality. This is because GDDM aggregates the item-pair level local dependencies, akin to how measures of differential bundle functioning aggregate item-level differential item functioning, capitalizing on the amplification of the effects in the aggregation (Nandakumar, 1993).
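The aggregation described above can be sketched in Python. The sketch assumes, in line with the definitions given earlier in the paper, that MBC for an item pair is the average over examinees of the product of the two items' residuals (observed response minus model-implied expectation), and that GDDM is the mean absolute MBC over the unique item pairs in the chosen set of items. The function names and the list-of-lists data layout are hypothetical.

```python
from itertools import combinations


def mbc(x, e, j, k):
    """Model-based covariance for items j and k.

    x[i][j] is the observed response of examinee i to item j and
    e[i][j] is the corresponding model-implied expected value.
    """
    n = len(x)
    return sum((x[i][j] - e[i][j]) * (x[i][k] - e[i][k]) for i in range(n)) / n


def gddm(x, e, items=None):
    """Mean absolute MBC over the unique pairs among `items`.

    With items=None the whole test is used; passing a subset of item
    indices gives the subtest-level version.
    """
    items = list(range(len(x[0]))) if items is None else list(items)
    pairs = list(combinations(items, 2))
    if not pairs:
        raise ValueError("need at least two items")
    return sum(abs(mbc(x, e, j, k)) for j, k in pairs) / len(pairs)
```

Because absolute values are averaged, positive and negative local dependencies do not cancel in the aggregate, whereas the signed MBC retains direction at the item-pair level. In a PPMC analysis the expectations, and hence these quantities, would be recomputed for each posterior draw.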

6. Analysis of National Assessment of Educational Progress data

The procedures introduced above are illustrated in an analysis of item response data from the 1996 National Assessment of Educational Progress (NAEP). The NAEP science assessment framework specifies three content areas: physical science, earth science, and life science. Each item is classified in terms of one of these areas; subscale scores are reported on each of these dimensions (Allen, Carlson, & Zelenak, 1999). Item responses to 16 items in block S20 from students from the national sample were analysed. The block contained four items in life science and six items in each of physical science and earth science. Eight of the items were multiple-choice items scored dichotomously and eight were constructed response items scored via integers. For the purposes of this analysis, the responses to the constructed response items were dichotomized in a manner such that the collapsing of categories results in the most balanced dichotomous response frequencies for each item. Missing responses prior to the last observed response were regarded as intentional omissions and were scored as incorrect. The analysis was performed on 1,020 examinees with complete data.
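One way to read the dichotomization rule described above is to choose, for each constructed response item, the cut score whose resulting proportion of 1s is closest to .5. The Python sketch below implements that reading. The function names are hypothetical, and the handling of ties (the lowest qualifying cut score is kept) is an assumption, since the rule for ties is not stated.

```python
def balanced_cut(scores):
    """Return the cut score c for which scoring (score >= c) as 1 gives
    the proportion of 1s closest to .5.

    Only cut scores that leave both categories non-empty are considered.
    """
    categories = sorted(set(scores))
    if len(categories) < 2:
        raise ValueError("need at least two distinct score categories")
    best_cut, best_gap = None, None
    for c in categories[1:]:
        proportion = sum(1 for s in scores if s >= c) / len(scores)
        gap = abs(proportion - 0.5)
        if best_gap is None or gap < best_gap:
            best_cut, best_gap = c, gap
    return best_cut


def dichotomize(scores):
    """Collapse integer scores to 0/1 at the most balanced cut score."""
    c = balanced_cut(scores)
    return [1 if s >= c else 0 for s in scores]
```

For example, with scores 0 to 3 observed with frequencies 3, 2, 3, and 2, the cut between 1 and 2 yields a 5/5 split and is the one selected.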

6.1. MIRT model structure

A three-dimensional MIRT model is analysed, where the latent variables correspond to proficiency in the areas of physical science, earth science, and life science. Each item is modelled as reflecting one of the latent variables in accordance with the NAEP classifications of items in terms of the content areas. Figure 3 contains a path diagram representation of the hypothesized model.

Figure 3. The MIRT model for the NAEP analysis. [Path diagram: physical science ($\theta_1$) with items X3, X6, X7, X11, X15, X16; earth science ($\theta_2$) with items X2, X4, X5, X8, X9, X12; life science ($\theta_3$) with items X1, X10, X13, X14.]

6.2. Bayesian analysis and Markov chain Monte Carlo estimation

For the eight multiple-choice items, the probability of a correct response from examinee i to item j was given via the MIRT model in (1). For the eight constructed response items, the probability of a correct response from examinee i to item j was given via the MIRT model in (1) where $c_j = 0$. The model was identified by specifying the discrimination parameter for the first item on each latent variable to be unity; that is, the discrimination parameters for items 3, 2, and 1 for $\theta_1$, $\theta_2$, and $\theta_3$, respectively, were fixed at one. The remaining unknown discrimination parameters, lower asymptote parameters, and all of the location parameters were assigned prior distributions:

where I(0, ∞) for the specification of the prior on each $a_{jm}$ constrains the distribution to have support over the positive real line, modelling the hypothesis that the probability of a correct response monotonically increases with increases in any $\theta_m$.

For each examinee, the prior distribution for the latent variables was multivariate normal with a mean vector set at 0 to identify the model

A diffuse inverse-Wishart prior distribution was specified for the covariance matrix

where I is the identity matrix of rank M = 3.

The model was estimated in WinBUGS 1.4 (Spiegelhalter, Thomas, Best, & Lunn, 2007).¹ Three chains from dispersed start values were run for 6,000 iterations using program-chosen sampling algorithms, including a Metropolis algorithm in which the variance of the normal proposal distribution is adapted for the first 4,000 iterations. As measured by visual inspection of trace plots and Brooks–Gelman–Rubin diagnostics (Brooks & Gelman, 1998), the 4,000 iterations needed to adapt the proposal distribution in the Metropolis sampler were sufficient for the chains to converge. The remaining iterations were thinned by a factor of 20 and pooled to yield 300 draws used to conduct PPMC.
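The count of 300 draws follows from the chain settings: each of the three chains contributes (6,000 − 4,000) / 20 = 100 draws after burn-in and thinning. The Python sketch below reproduces that bookkeeping on stored chains. The function name is hypothetical, and keeping the first iteration of every block of 20 is an assumption, as the exact thinning offset used by the software is not stated.

```python
def thin_and_pool(chains, burn_in, thin):
    """Discard the first `burn_in` iterations of each chain, keep every
    `thin`-th remaining iteration, and pool the chains into one list."""
    pooled = []
    for chain in chains:
        pooled.extend(chain[burn_in::thin])
    return pooled


# Three chains of 6,000 iterations; integers stand in for sampled values.
chains = [list(range(6000)) for _ in range(3)]
draws = thin_and_pool(chains, burn_in=4000, thin=20)
n_draws = len(draws)  # 3 chains x 100 retained iterations each
```

Each retained draw then supplies one realized and one posterior predicted value of the discrepancy measure.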

6.3. PPMC analysis

A PPMC analysis was conducted on the NAEP data, evaluating GDDM at the test level, GDDM at the subtest level where three subtests are defined in terms of the three content areas (Figure 3), and MBC and Q3 at the item-pair level.

Figure 4 is a scatterplot of the 300 realized and posterior predicted values of GDDM evaluated on the entire test, where it is clearly seen that the realized values tend to be larger than their posterior predicted counterparts; the PPP value from this analysis was .08. This indicates that the solution to the model, in terms of the posterior distribution, suffers in terms of adequately accounting for the associations in the data.

Figure 5 shows scatterplots for the 300 realized and posterior predicted values of GDDM evaluated in each of three subtests. The PPP values for the physical science, earth science, and life science subtests were .38, .25, and .48, respectively. These results suggest that, within content areas, the associations among the items are well accounted for by the dimensional structure in the model.

¹ WinBUGS parameterizes distributions slightly differently from conventions adopted here. Specifically, it employs the precision (i.e., the inverse of the variance) in specifying normal distributions and a parameterization of the beta distribution such that the priors are specified in WinBUGS as $a_{jm} \sim N(0, 0.10)I(0, \infty)$, $d_j \sim N(0, 0.10)$, and $c_j \sim \mathrm{Beta}(4, 16)$.

Figure 4. Scatterplot of the realized (horizontal axis) and posterior predicted (vertical axis) values of GDDM from the analysis of the NAEP data.

Figure 5. Scatterplot of the realized (horizontal axis) and posterior predicted (vertical axis) values of the subtest GDDM from the analysis of the NAEP data: results from (a) the physical science subtest; (b) the earth science subtest; (c) the life science subtest.

Figure 6. Graphical representation of PPP values of MBC for item pairs from the analysis of NAEP data. [Matrix plot with items 1–16 along the diagonal; the shading key has boundaries at 0, .02, .05, .5, .95, .98, and 1.]

At the item-pair level, the results for MBC and Q3 were nearly identical; results for MBC will be presented and discussed. Figure 6 contains a matrix graphical plot of the PPP values for MBC at the item-pair level, where the numbers along the diagonal indicate the item in that row and column of the matrix. The shading of the square in each element in the matrix conveys the value of the PPP value as indicated by the key (e.g., a white square indicates that the PPP value lies between 0 and .02). Focusing on the more extreme PPP values, the results suggest that the strongest residual associations include: (a) the positive residual associations between items 10 and 12, 10 and 16, 1 and 12, 13 and 16, and to a lesser extent between 6 and 10, 7 and 14, and 8 and 13, and (b) the negative residual associations between items 7 and 10, 10 and 11, 5 and 6, and to a lesser extent between 5 and 8, and 3 and 16. With the exception of these last two pairings, these item pairs involve the pairings where the items come from different content areas.
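Reading a matrix of item-pair PPP values for a directional measure such as MBC can be sketched as a simple classification: small PPP values point to positive residual association and large PPP values to negative residual association. In the Python sketch below the function names are hypothetical, the default cutoffs of .05 and .95 are borrowed from the shading key as an assumption rather than a rule stated by the authors, and the values in the usage example are made up for illustration.

```python
def classify_ppp(ppp, low=0.05, high=0.95):
    """Label a PPP value for a directional discrepancy measure."""
    if ppp < low:
        return "positive"  # realized values tend to exceed predicted values
    if ppp > high:
        return "negative"  # realized values tend to fall below predicted values
    return "none"


def flag_pairs(ppp_by_pair, low=0.05, high=0.95):
    """Return only the item pairs whose PPP values fall in either tail."""
    flags = {}
    for pair, ppp in ppp_by_pair.items():
        label = classify_ppp(ppp, low, high)
        if label != "none":
            flags[pair] = label
    return flags


# Hypothetical PPP values for three item pairs (not taken from the NAEP analysis).
example = {(1, 2): 0.01, (1, 3): 0.99, (2, 3): 0.50}
flagged = flag_pairs(example)
```

Tightening the cutoffs to .02 and .98 would restrict attention to the most extreme shading bands in the key.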

Synthesizing the above results, the PPMC analysis using GDDM at the test level suggests that the three-dimensional MIRT model with dimensions defined by the content classifications of the items does not adequately account for the dependencies among the items. However, within subtests the associations are well accounted for, as evidenced by the results for GDDM at the subtest level. Rather, the additional dependencies appear to be in terms of several sets of item pairs drawn from different subtests, in particular several item pairs involving item 10. At this point, this diagnostic information could be leveraged to investigate substantive reasons for the weaknesses of the model with subject matter experts and the assessment design team. Of the many possible explanations, Wei and Mislevy (2007), who interpreted results from exploratory factor analyses of data from these items, suggested that a factor structure based on a distinction between conceptual understanding and practical reasoning may be more suitable than a factor structure based on content domains.

7. Conclusion

This paper has described a new discrepancy measure supported by a PPMC framework for assessing the dimensionality assumption in MIRT models. Grounded in conditional covariance theory, the discrepancy measure assesses the aggregate magnitude of estimated pairwise conditional covariances in the data, relying on connections between the assumptions of dimensionality and local independence. It was argued that local independence should be viewed with respect to a model in terms of both the number of hypothesized dimensions and the specified patterns of dependence of items on the dimensions.

GDDM is designed to assess the specified dimensional structure for a collection of measured variables. Ideally, the choice to evaluate GDDM on a subset of the items should be based on whether meaningful collections of the items can be determined a priori, as is warranted if the test is constructed to measure subdomains (as in the NAEP example), or if subscale results along those domains will be employed, or if the administered items may be viewed as testlets. In the absence of a priori defined subtests, we recommend the following procedure for critiquing the assumed dimensionality of a set of items. If an analysis of GDDM at the test level indicates the assumed dimensional structure of the model is untenable, the researcher may then follow up with an analysis of subtests if the grouping of items into subtests can be theoretically justified or with an analysis at the finer-grained level of item pairs. Simultaneously, considering the results over the item pairs can be suggestive of the specific weaknesses of the model and the types of structures that might better account for the relationships in the data.
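The recommended sequence can be written as a small decision helper. This Python sketch is only a schematic of the procedure described above: the function name, the returned labels, and the default cutoff of .10 are hypothetical, and in practice the judgement that a structure is untenable would rest on the graphical and numerical PPMC results rather than on a fixed threshold.

```python
def next_step(test_level_ppp, subtests_justified, cutoff=0.10):
    """Suggest the next level of analysis after the test-level GDDM check.

    test_level_ppp: PPP value for GDDM evaluated on the full test.
    subtests_justified: True if a grouping of items into subtests can be
        defended on theoretical grounds.
    cutoff: illustrative threshold below which the test-level result is
        treated as signalling misfit (an assumption, not a rule from the paper).
    """
    if test_level_ppp >= cutoff:
        return "retain"  # no indication that the assumed structure is untenable
    if subtests_justified:
        return "subtests"  # follow up with GDDM on each subtest
    return "item_pairs"  # follow up with MBC or Q3 on item pairs
```

With the test-level PPP value of .08 from the NAEP analysis and content-based subtests available, the helper points to the subtest-level follow-up, which mirrors the order of the analyses reported above.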

GDDM is parametric in the sense that it employs the model-based expectations of the observed values. However, it is not restricted to MIRT models discussed here. GDDM is constructed to be sufficiently general to be applicable to assess dimensionality assumptions across a broad class of latent variable modelling paradigms that make a variety of distributional assumptions regarding the latent and observable variables, including factor analytic and latent class models in addition to item response models. Similarly, this work has demonstrated the use of PPMC for model criticism for MIRT models. As a flexible framework for evaluating model diagnostics, PPMC may be used to support data–model fit across a variety of psychometric modelling paradigms.

Acknowledgements

We wish to acknowledge the two anonymous reviewers whose comments prompted improvements to this paper.


References

Allen, N. L., Carlson, J. E., & Zelenak, C. A. (1999). The NAEP 1996 technical report. Washington, DC: National Center for Education Statistics.

Bayarri, M. J., & Berger, J. O. (2000a). P values for composite null models. Journal of the American Statistical Association, 95, 1127–1142. doi:10.2307/2669749

Bayarri, M. J., & Berger, J. O. (2000b). Rejoinder. Journal of the American Statistical Association, 95, 1168–1170. doi:10.2307/2669756

Béguin, A. A., & Glas, C. A. W. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66, 541–562. doi:10.1007/BF02296195

Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 27, 395–414. doi:10.1177/0146621603258350

Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434–455. doi:10.2307/1390675

Chen, W., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265–289.

Clinton, J., Jackman, S., & Rivers, D. (2004). The statistical analysis of roll call data. American Political Science Review, 98, 355–370. doi:10.1017/S0003055404001194

Embretson, S. E. (1997). Multicomponent response models. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 305–321). New York: Springer.

Finch, H., & Habing, B. (2005). Comparison of NOHARM and DETECT in item cluster recovery: Counting dimensions and allocating items. Journal of Educational Measurement, 42(2), 149–169. doi:10.1111/j.1745-3984.2005.00008

Finch, H., & Habing, B. (2007). Performance of DIMTEST- and NOHARM-based statistics for testing unidimensionality. Applied Psychological Measurement, 31, 292–307. doi:10.1177/0146621606294490

Fraser, C., & McDonald, R. P. (1988). NOHARM: Least squares item factor analysis. Multivariate Behavioral Research, 23, 267–269. doi:10.1207/s15327906mbr2302_9

Gelman, A. (2003). A Bayesian formulation of exploratory data analysis and goodness-of-fit testing. International Statistical Review, 71, 369–382.

Gelman, A. (2007). Comment: Bayesian checking of the second levels of hierarchical models. Statistical Science, 22, 349–352. doi:10.1214/07-STS235A

Gelman, A., Meng, X. L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733–807.

Gessaroli, M. E., & De Champlain, A. F. (1996). Using an approximate chi-square statistic to test the number of dimensions underlying the responses to a set of items. Journal of Educational Measurement, 33, 157–192. doi:10.1111/j.1745-3984.1996.tb00487.x

Gessaroli, M. E., De Champlain, A. F., & Folske, J. C. (1997). Assessing dimensionality using a likelihood-ratio chi-square test based on a non-linear factor analysis of item response data. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, March.

Gill, J. (2007). Bayesian methods: A social and behavioral sciences approach (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.

Habing, B., & Roussos, L. A. (2003). On the need for local item dependence. Psychometrika, 68, 435–451. doi:10.1007/BF02294736

Hattie, J. (1984). An empirical study of various indices for determining unidimensionality. Multivariate Behavioral Research, 19, 49–78. doi:10.1207/s15327906mbr1901_3

Hattie, J., Krakowski, K., Rogers, H. J., & Swaminathan, H. (1996). An assessment of Stout’s index of essential unidimensionality. Applied Psychological Measurement, 20, 1–14. doi:10.1177/014662169602000101


Hjort, N. L., Dahl, F. A., & Steinbakk, G. H. (2006). Post-processing posterior predictive p

values. Journal of the American Statistical Association, 101, 1157–1174. doi:10.1198/016214505000001393

Hoijtink, H. (2001). Conditional independence and differential item functioning in the two-parameter logistic model. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.),Essays in item response theory (pp. 109–129). New York: Springer.

Ip, E. H. (2001). Testing for local dependency in dichotomous and polytomous item responsemodels. Psychometrika, 66 , 109–132. doi:10.1007/BF02295736

Jackman, S. (2008). pscl: Classes and methods for R developed in the political science computa-

tional laboratory, Stanford University. Department of Political Science, Stanford University,Stanford, CA. R package version 1.03. Retrieved from http://pscl.stanford.edu/

Levy, R. (2010). Posterior predictive model checking for conjunctive multidimensionality initem response theory. Journal of Educational and Behavioral Statistics. Advance onlinepublication.

Levy, R., Mislevy, R. J., & Sinharay, S. (2009). Posterior predictive model checking for multi-dimensionality in item response theory. Applied Psychological Measurement, 33, 519–537.doi:10.1177/0146621608329504

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale,NJ: Erlbaum.

McDonald, R. P. (1997). Normal-ogive multidimensional model. In W. J., van der Linden & R. K.Hambleton (Eds.), Handbook of modern item response theory (pp. 257–269). New York:Springer.

McDonald, R. P., & Mok, M. M. C. (1995). Goodness of fit in item response models. Multivariate Behavioral Research, 30, 23–40. doi:10.1207/s15327906mbr3001_2

Meng, X. L. (1994). Posterior predictive p-values. Annals of Statistics, 22, 1142–1160. doi:10.1214/aos/1176325622

Nandakumar, R. (1993). Simultaneous DIF amplification and cancellation: Shealy-Stout’s test for DIF. Journal of Educational Measurement, 30, 293–311. doi:10.1111/j.1745-3984.1993.tb00428.x

Nandakumar, R., & Stout, W. F. (1993). Refinement of Stout’s procedure for assessing latent trait dimensionality. Journal of Educational Statistics, 18, 41–68. doi:10.2307/1165182

Nandakumar, R., Yu, F., Li, H., & Stout, W. (1998). Assessing unidimensionality of polytomous data. Applied Psychological Measurement, 22, 99–115. doi:10.1177/01466216980222001

Reckase, M. D. (1997). A linear logistic multidimensional model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 271–286). New York: Springer.

Robins, J. M., van der Vaart, A., & Ventura, V. (2000). The asymptotic distribution of P values in composite null models. Journal of the American Statistical Association, 95, 1143–1172. doi:10.2307/2669750

Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics, 12, 1151–1172. doi:10.1214/aos/1176346785

Sinharay, S. (2005). Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement, 42, 375–394. doi:10.1111/j.1745-3984.2005.00021.x

Sinharay, S. (2006). Bayesian item fit analysis for unidimensional item response theory models. British Journal of Mathematical and Statistical Psychology, 59, 429–449. doi:10.1348/000711005X66888

Sinharay, S., Johnson, M., & Stern, H. S. (2006). Posterior predictive assessment of item response theory models. Applied Psychological Measurement, 30, 298–321. doi:10.1177/0146621605285517

Spiegelhalter, D. J., Thomas, A., Best, N. G., & Lunn, D. (2007). WinBUGS user manual: Version 1.4.3. Cambridge: MRC Biostatistics Unit. Retrieved from http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml

Stern, H. S. (2000). Comment. Journal of the American Statistical Association, 95, 1157–1159. doi:10.2307/2669751

Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52, 589–617. doi:10.1007/BF02294821

Stout, W. (1990). A new item response theory modeling approach with applications to unidimensionality assessment and ability estimation. Psychometrika, 55, 293–325. doi:10.1007/BF02295289

Stout, W., Habing, B., Douglas, J., Kim, H. R., Roussos, L., & Zhang, J. (1996). Conditional covariance-based nonparametric multidimensionality assessment. Applied Psychological Measurement, 20, 331–354. doi:10.1177/014662169602000403

Swaminathan, H., Hambleton, R. K., & Rogers, H. J. (2007). Assessing the fit of item response models. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics (Vol. 26, pp. 683–718). Amsterdam: Elsevier/North-Holland.

Wei, H., & Mislevy, R. J. (2007). Multidimensionality in the NAEP science assessment. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, April.

Yao, L., & Boughton, K. A. (2007). A multidimensional item response modeling approach for improving subscale proficiency estimation and classification. Applied Psychological Measurement, 31, 83–105. doi:10.1177/0146621606291559

Yen, W. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 21, 93–111.

Zhang, J. (2007). Conditional covariance theory and DETECT for polytomous items. Psychometrika, 72, 69–91. doi:10.1007/s11336-004-1257-7

Zhang, J., & Stout, W. (1999a). Conditional covariance structure of generalized compensatory multidimensional items. Psychometrika, 64, 129–152. doi:10.1007/BF02294532

Zhang, J., & Stout, W. (1999b). The theoretical DETECT index of dimensionality and its application to approximate simple structure. Psychometrika, 64, 213–249. doi:10.1007/BF02294536

Received 18 August 2009; revised version received 12 March 2010