Wind, S.A., & Engelhard, G. (2013). How Invariant and Accurate are Domain Ratings in Writing...


Transcript of Wind, S.A., & Engelhard, G. (2013). How Invariant and Accurate are Domain Ratings in Writing...

Assessing Writing 18 (2013) 278–299


How invariant and accurate are domain ratings in writing assessment?

Stefanie A. Wind*, George Engelhard Jr., The University of Georgia, United States

Article info

Article history: Received 19 June 2013; Received in revised form 11 August 2013; Accepted 2 September 2013

Keywords: Rating quality; Analytic rubric; Rasch measurement theory; Rater invariance; Rater accuracy

Abstract

The use of evidence to guide policy and practice in education (Cooper, Levin, & Campbell, 2009) has included an increased emphasis on constructed-response items, such as essays and portfolios. Because assessments that go beyond selected-response items and incorporate constructed-response items are rater-mediated (Engelhard, 2002, 2013), it is necessary to develop evidence-based indices of quality for the rating processes used to evaluate student performances. This study proposes a set of criteria for evaluating the quality of ratings based on the concepts of measurement invariance and accuracy within the context of a large-scale writing assessment. Two measurement models are used to explore indices of quality for raters and ratings: the first model provides evidence for the invariance of ratings, and the second model provides evidence for rater accuracy. Rating quality is examined within four writing domains from an analytic rubric. Further, this study explores the alignment between indices of rating quality based on these invariance and accuracy models within each of the four domains of writing. Major findings suggest that rating quality varies across analytic rubric domains, and that there is some correspondence between indices of rating quality based on the invariance and accuracy models. Implications for research and practice are discussed.

Crown Copyright © 2013 Published by Elsevier Ltd. All rights reserved.

* Corresponding author at: Educational Studies, Emory University, North Decatur Building, Suite 240, Atlanta, GA 30322, United States. Tel.: +1 8506879237.

E-mail address: [email protected] (S.A. Wind).

1075-2935/$ – see front matter. Crown Copyright © 2013 Published by Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.asw.2013.09.002

In response to the increased emphasis on evidence to guide policy and practice in education (Cooper, Levin, & Campbell, 2009; Huff, Steinberg, & Matts, 2010; Mislevy, Steinberg, Breyer, Almond, & Johnson, 2002), the use of assessments that require constructed responses, i.e., performance assessments, is increasing. Many current large-scale assessments that are implemented worldwide extend beyond selected-response items and require students to perform a task such as composing an essay. Some salient examples include the Program for International Student Assessment (PISA; OECD, 2012), the Test of English as a Foreign Language (TOEFL; Educational Testing Service, 2010), the International Association for the Evaluation of Educational Achievement's (IEA) studies in mathematics (Trends in International Mathematics and Science Study; Mullis, Martin, Foy, & Arora, 2012) and reading (Progress in International Reading Literacy Study; Mullis, Martin, Foy, & Drucker, 2012), and the "next-generation assessments" that are part of the Race to the Top initiative in the United States (U.S. Department of Education, 2010). Essentially, these assessments provide opportunities for students to construct a response that is judged by a rater according to a rating scale that is designed to represent a construct. The use of these assessments reflects a view that a rater's judgment using a rating scale provides information beyond what could be provided by a more "objective" measure, such as a set of multiple-choice items.

Hamp-Lyons (2011) points out that rating scales for performance assessments serve two major roles. First, they are tools that can be used to describe a student's achievement in terms of a set of performance criteria; second, they serve as operational definitions of a construct. In the context of writing assessment, she writes: "Often we think of rating scales as tools, and perhaps they are. But they are also realizations of theoretical constructs, of beliefs about what writing is and what matters about writing" (p. 3). Human raters combine these two purposes for rating scales when they use a set of performance criteria to describe the quality of a response in terms of a theoretical construct, such as writing achievement. Because raters are the mechanism through which a student's knowledge, skills, and abilities are described in terms of a rating scale, it is important to recognize that ratings are not a direct representation of a student in terms of a construct; rather, performance assessment scores are mediated through human raters who exist within complex ecological contexts. As a result, the interpretation and use of rater-mediated descriptions of a student's performance depend on the rater's ability to interpret and apply the rating scale as intended. In order to trust ratings as useful descriptions of student performance, it is necessary to systematically examine the quality of ratings. Within the context of a large-scale writing assessment, this study proposes the use of measurement models as methodological tools for examining the quality of analytic ratings.

1. Evaluating rating quality

In this study, rating quality is defined as the degree to which the rating assigned to a response is warranted by the level of the performance. Concerns with rating quality are persistent in performance assessment research, and typically focus on the subjectivity that is associated with human judgment and decision-making in high-stakes contexts. Edgeworth's (1890) research that describes disagreements among judges scoring written compositions is often highlighted as an early example of concerns about the quality of rater-assigned scores. An examination of research on rater-mediated assessment since Edgeworth's early documentation reveals a persistent effort in measurement research towards the development of methods that can be used to monitor the quality of ratings during operational scoring. In general, research on rating quality over the last century has focused on examining rating patterns within and across individual raters as a method for determining the degree to which ratings can be interpreted as an accurate reflection of a performance (Elliot, 2005; Saal, Downey, & Lahey, 1980). For example, Guilford's classic (1936) Psychometric Methods includes a chapter on rating scale methods that summarizes errors in ratings related to leniency and severity, central tendency, halo effects, and logical errors. Later, Saal et al. (1980) conducted a systematic review of research on rating quality indices and identified three major categories of rating quality indices: (1) leniency and severity errors, (2) halo errors, and (3) indices of interrater reliability and agreement. Included within these categories are methods for evaluating rating quality based on measures of central tendency, correlations, factor analysis, analysis of variance, and reliability coefficients. Subsequent work in this area includes indices of rater reliability and errors, along with indices of rater accuracy, defined as a match between ratings from operational raters and "expert" raters (Murphy & Cleveland, 1991; Sulsky & Balzer, 1988).

Wolfe and McVay (2012) note that current research on rater-mediated assessments focuses on two major themes: (1) the influence of human characteristics of raters, such as experience, training, and prior knowledge and (2) statistical procedures to monitor the quality of ratings. Rating quality research based on the first theme has increased through the use of cognitive models to examine rater decision-making processes (Barkaoui, 2011; DeCarlo, 2005), the impact of rater background characteristics on rating quality (Lumley & McNamara, 1995; Pula & Huot, 1993), as well as the impact of specific types of rater training procedures (Weigle, 1998), scoring criteria (Clauser, 2000) and feedback (Elder, Knoch, Barkhuizen, & von Randow, 2005; Knoch, 2011) on rating quality. However, as Hamp-Lyons (2011) points out, advances in these areas are not yet sufficient to fully understand rating processes. She writes:

The familiar interaction of contexts with tasks, texts, and raters remains at the heart of writing assessments. Despite many studies of raters and rating processes in recent years, we still do not fully understand the characteristics of raters: should they be experts or novices? Does it matter whether or not they are teachers? How much difference does rater training make? (p. 4)

Despite increased interest in these human- and training-related aspects of rating procedures, concerns about rating quality are addressed in practice through the use of statistical techniques for monitoring raters during operational scoring (Johnson, Penny, & Gordon, 2009). Current operational methods for monitoring ratings typically include indices of rating quality that are classified within three major categories: (1) rater agreement, (2) rater errors and systematic biases, and (3) rater accuracy (Johnson et al., 2009; Murphy & Cleveland, 1991). Indices of rater agreement describe the degree to which raters assign matching scores to the same performance; these indices include measures of categorical agreement and association among raters. Second, rater errors and systematic biases are used to describe specific patterns or trends in rating behavior that are believed to contribute to the assignment of scores different from those warranted by a student's performance. Third, rater accuracy is defined in practice as a match between operational ratings and those established as "true" or "known" ratings by individuals or committees of expert raters (Johnson et al., 2009; Murphy & Cleveland, 1991). Accuracy is determined by comparing an observed rating to an expert-assigned rating; smaller differences between these two scores are associated with higher accuracy. When indices of rating quality based on these three categories are applied, high levels of agreement, low levels of error and systematic bias, and high levels of accuracy are assumed to reflect high-quality ratings.

1.1. Problems with current rating quality indices

The commonly used indicators of rater agreement, error and systematic bias, and accuracy described above focus on identifying specific types of rating patterns among groups of raters that are assumed to threaten the overall quality of a set of ratings. These traditional methods for monitoring rating quality are incomplete as a result of two major issues. First, the set of traditional indices does not provide consistent information about the overall quality of a set of ratings. Reviews and meta-analyses of rating quality indices based on the categories of rater agreement, error and systematic bias, and accuracy reveal that indices across these categories do not provide consistent information about a set of raters. In their review of rating scale methods in psychological research, Saal et al. (1980) demonstrate a lack of consistency among definitions of rating quality indices; they claim that this lack of consistency increases the already subjective nature of ratings for three main reasons:

First, there is less than unanimous agreement regarding conceptual definitions for several of the criteria of rating quality. Second, there is even less agreement regarding the operational definitions (Downey & Saal, [1978]). Third, different researchers have used different research designs or data collection procedures with inherently limited capabilities of aggregating and yielding particular statistical indices of rating quality. It is therefore easy to find two or more studies in the literature that use the same label for a particular criterion of rating quality (e.g., halo) even though the conceptual and operational definitions of that particular rating error are not identical and the data collection strategies are sufficiently different to preclude calculation of similar statistical indices. (p. 414)

As a result, it is not possible to generalize information about rating quality across these three categories; for example, evidence of rater agreement does not imply a lack of halo error or rater accuracy. The second major problem with these traditional indices of rating quality is their lack of consideration of domains when evaluating ratings based on analytic rubrics. Although traditional rater monitoring procedures sometimes include an examination of similarities across domain ratings in terms of halo error, the set of rating quality indices based on the traditional approach does not include a method for considering differences in rater interpretation of a rating scale across different domains. Rather, these indicators appear to assume that rater agreement, error and systematic bias, and accuracy have similar implications across distinct components of analytic rubrics.

1.2. Measurement approach to monitoring ratings

In contrast to the disparate set of methods for examining rating quality presented above, it is also possible to examine the quality of ratings from the perspective of a measurement theory. Essentially, measurement theories are a combination of a conceptual framework and statistical machinery that provides a system for drawing inferences from scores (Messick, 1989). This study describes indices of rating quality based on the theory of invariant measurement. Invariant measurement requires that the particular method by which a variable is measured should not influence the outcome of the measurement procedure. Wright and Stone (1979) describe invariant measurement in terms of a set of requirements for the measurement of persons and items in the context of an educational achievement test. In their words:

The calibration of test-item difficulty must be independent of the particular persons used for the calibration. The measurement of person ability must be independent of the particular test items used for measuring. When we compare one item with another in order to calibrate a test it should not matter whose responses to these items we use for the comparison. ... When we expose persons to a selection of items in order to measure their ability, it should not matter which selection of items we use or which items they complete. (p. xii)

When raters are introduced into a measurement process, ratings can be examined for evidence of rater-invariant measurement. Rater-invariant measurement requires that the measurement of student achievement not depend on the particular raters that happen to be used for the measuring. Likewise, rater-invariant measurement requires that estimates of rater severity not depend on the particular students that they score.

1.2.1. Rasch measurement theory

Rasch (1960/1980) developed a set of measurement models that meet the requirements for invariant measurement. Accordingly, models based on Rasch measurement theory (Rasch, 1960/1980) can be used to examine the hypothesis of rater-invariant measurement in empirical rating data. An extension of the original Rasch model that is particularly useful in the context of rater-mediated assessments is the many-facet Rasch model for polytomous ratings (Linacre, 1989; Wright & Masters, 1982). This model extends the original Rasch model to describe the probability of an observed rating as a function of student achievement, rater severity, and other researcher-specified facets, such as the domains in an analytic rubric.

1.2.2. Many-facet Rasch model for rater invariance

When the many-facet Rasch (MFR) model is applied to rating data, statistics and displays based on the model can be used to identify individual or groups of raters who may be exhibiting rating patterns that suggest a lack of rater-invariant measurement (Engelhard, 2013; Eckes, 2011; Myford & Wolfe, 2003, 2004; Wolfe, 2009). Engelhard (2002, 2013) extends the concept of rater errors and systematic biases to a set of statistics and displays based on the MFR model that can be summarized in three major categories: (A) Rater calibration, (B) Model-data fit, and (C) Interactions. Table 1 displays indices of rating quality within these three categories using Rasch-based statistics that are computed within the Facets computer program (Linacre, 2010). These statistics and displays can be viewed as indices of rater invariance.

Table 1
Model I: Rating quality indices based on the MFR model for rater invariance.

Category | Rating quality indicator | Questions | Statistics and displays
A. Rater calibrations | Rater leniency/severity | What is the location of each rater (severity/leniency)? | Calibration and location of elements within facet; variable map
A. Rater calibrations | Rater precision | How precisely has each rater been calibrated? | Standard errors for raters
A. Rater calibrations | Rater separation | How spread out are the individual rater severities? | Reliability of separation statistic for raters
A. Rater calibrations | Rater separation | Can the raters be considered exchangeable? | Chi-square statistic for raters
B. Model-data fit | Model-data fit for raters | How consistently has each rater interpreted the domains and rating scale categories across students? | Mean square error fit statistics (Outfit MSE)
C. Interactions | Rater interactions | Are ratings invariant over internal construct-irrelevant components (e.g., prompts)? | Interaction effects between rater and internal facets (bias analysis)

Note. This description of Rasch-based rating quality indices is based on Engelhard (2013).

1.2.3. Many-facet Rasch model for rater accuracy

In addition to indices of rater invariance, the MFR model can also be used to evaluate rater accuracy in a direct way when expert ratings are available. Engelhard (2013) describes an application of the MFR model to describe rater accuracy based on the match between operational and expert ratings. First, dichotomous accuracy scores are computed for each observed rating. An accuracy score of zero is assigned when there is a discrepancy between the rating assigned by an operational rater and an expert rater. In contrast, an accuracy score of one reflects a perfect match between an operational and expert rater. Using this dichotomous scoring scheme, rater accuracy can be modeled as a latent variable on which raters, essays, and domains are calibrated. Table 2 summarizes and extends Engelhard's (2013) indices of rater accuracy based on the MFR model. In this study, indices of rater accuracy are considered within the domains of an analytic rubric. Further, the correspondence between these indices and indices of rater invariance (i.e., between Tables 1 and 2) is explored within analytic rubric domains.

Table 2
Model II: Rating quality indices based on the MFR model for rater accuracy.

Category | Rating quality indicator | Questions | Statistics and displays
A. Rater accuracy calibrations | Rater leniency/severity accuracy | What is the accuracy location of each rater? | Variable map; calibration and location of elements within facet
A. Rater accuracy calibrations | Rater accuracy precision | How precisely has each rater been calibrated in terms of accuracy? | Standard errors for raters
A. Rater accuracy calibrations | Rater accuracy separation | How spread out are the individual raters in terms of accuracy? | Reliability of separation statistic for raters
A. Rater accuracy calibrations | Rater accuracy separation | Can the raters be considered exchangeable in terms of accuracy? | Chi-square statistic for raters
B. Model-data fit | Model-data fit for rater accuracy | How consistently does each rater demonstrate accuracy across the domains, rating scale categories, and students? | Mean square error fit statistics (Infit and Outfit MSE)
C. Interactions | Rater accuracy interactions | Is rater accuracy invariant over internal construct-irrelevant components (e.g., prompts)? | Interaction effects between rater and internal facets (bias analysis)

Note. This description of Rasch-based rating quality indices is based on Engelhard (2013).

1.2.4. Correspondence between invariance and accuracy indicators

A previous study (Wind & Engelhard, 2012) examined the correspondence between indicators of rater invariance and direct indices of rater accuracy within the framework of Rasch measurement theory (Rasch, 1960/1980). This study found a strong negative correlation between indicators of model-data fit based on observed ratings, which were used to describe rater error, and rater calibrations on an accuracy scale (a direct indicator of rater accuracy). The negative correlation between these two rating quality indices suggested that model-data fit improves as rater accuracy increases. Thus, the major conclusion from this study was that indices of model-data fit for raters may provide similar diagnostic information to direct indicators of rater accuracy. The current study applies similar techniques, but examines the correspondence between these invariance and accuracy indicators separately within domains on an analytic rubric.

2. Purpose

The purpose of this study is to examine the quality of ratings in writing assessment within separate rubric domains (Conventions, Ideas, Organization, and Style) using a coherent set of rating quality indices based on models that meet the requirements for invariant measurement. Two MFR models are used to explore indices of rating quality: the first model provides evidence for the invariance of rater judgments, and the second model provides evidence for rater accuracy. This study extends previous research that has applied the MFR model to rater-mediated assessments by illustrating the use of Rasch-based rating quality indices within the domains on an analytic rubric, and by examining the correspondence between indicators of rating quality based on rater invariance and rater accuracy. Three research questions are used to guide the analyses:

1. Are rater judgments invariant over analytic rubric domains in writing assessment?
2. How accurate are rater judgments over analytic rubric domains in writing assessment?
3. Do the invariance and accuracy indices provide comparable indices of rating quality over analytic rubric domains in writing assessment?

3. Methods

3.1. Instrument

Data from the 2011 administration of the Georgia Middle Grades Writing Assessment (MGWA) are examined in this study. Students were given 90 min to compose a one- or two-page essay in response to a persuasive prompt, and compositions were rated analytically on a five-point scale within four separate domains: Conventions, Ideas, Organization, and Style. Each point value describes a level of control of a particular domain; the labels for each score point are as follows: 1 = lack of control, 2 = minimal control, 3 = sufficient control, 4 = consistent control, and 5 = full command. Due to a low frequency of '5' ratings in these data (1 = 5.1%, 2 = 45%, 3 = 44%, 4 = 5.6%, 5 = 0.3%), categories four and five were combined prior to the analysis. Operational definitions for the domains are presented in Appendix A (Georgia Department of Education, 2010).

3.2. Participants

Ratings for this analysis come from a group of ten operational raters and one expert rater who scored 2120 essays that were composed during the 2011 administration of the MGWA. The operational raters were randomly selected from the total group of raters hired to score essays for this administration of the assessment. Prior to scoring, raters were required to complete a 14-hour training program at the Georgia Center for Assessment, which involved instruction related to the rubric, prompt, and scoring practices specific to the 2011 MGWA. In order to successfully complete training, raters were required to earn a passing score on a qualifying test designed to assess their use of the rubric in terms of the specific requirements and intentions for prompts used in the 2011 administration. The expert rater works closely with developers of the MGWA, and is also involved in training operational raters for scoring this assessment. Each of the ten operational raters scored at least 377 essays in all four domains. The operational raters scored at least 48 essays in common with each of the other nine raters so that raters would be linked in the data set. These links were created randomly. Further, the fact that the expert rater scored all 2120 essays links the set of raters. Eckes (2009) describes this rating plan as an incomplete but connected design.

3.3. Procedures

Data analyses were conducted with the Facets computer program (Linacre, 2010) using two separate many-facet Rasch (MFR) models. First, the models were used to examine the quality of the overall ratings. Then, the models were applied within the analytic rubric domains. Model I is a rating scale formulation of the MFR model (Linacre, 1989; Wright & Masters, 1982), and it was used to explore rating quality indices related to rater invariance. Model II is a MFR model for dichotomous data, and it was used to explore indices of rater accuracy. Finally, the correspondence between indices of rating quality under Model I and Model II was examined within individual rubric domains.

3.3.1. Model I: Rater invariance

In the context of a rater-mediated writing assessment, Rasch models provide a method for modeling the judgmental process of raters and evaluating the degree to which these judgments provide evidence for invariant measurement. The first model that was applied in this study was a MFR model through which three major categories of statistics and displays were used to examine rating quality: (A) Rater calibrations, (B) Model-data fit indices, and (C) Interaction analyses (see Table 1). Model I is based on a three-facet formulation of the rating scale MFR model (Linacre, 1989; Wright & Masters, 1982):

ln[Pnjik / Pnjik-1] = θn − λj − δi − τk,   (1)

where Pnjik = probability of student n rated by rater j on domain i receiving a rating of k; Pnjik-1 = probability of student n rated by rater j on domain i receiving a rating of k − 1; θn = location of student n on the latent variable; λj = severity of rater j; δi = difficulty of domain i; and τk = difficulty of moving from category k − 1 to k.

The dependent variable in this model is the log of the odds that a student receives a rating, given their location on the latent variable, the severity of the rater, and the difficulty of the domain. A major benefit of Rasch models is that estimates for each facet are calculated on the log-odds, or logit, scale that represents the latent variable (in this case, writing achievement). Because estimates for each facet are described on the same scale, it is possible to compare latent-variable locations across facets. Logit-scale locations for students and raters under Model I describe writing achievement and rater severity for each student and rater, respectively. Similarly, domain calibration on the latent variable describes the judged difficulty of each of the four analytic rubric domains across the set of raters. The tau term in the model (τk) does not represent a facet; instead, this term represents the category coefficients for the rating scales used to score the essays. In addition, interaction terms can be added to the model to examine differences in rater interpretations of domain difficulty. Standard errors for each of these facets describe the precision of the measurement on the logit scale.
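To make the structure of Model I concrete, the following Python sketch evaluates the category probabilities implied by Eq. (1) for a single student-rater-domain combination. The parameter values are invented for illustration and are not estimates from this study.

import numpy as np

def category_probabilities(theta, lam, delta, taus):
    # Rating-scale MFR model: ln(P_k / P_{k-1}) = theta - lam - delta - tau_k.
    # Cumulative sums of the adjacent-category logits give unnormalized
    # log-probabilities; the lowest category is fixed at 0 by convention.
    eta = theta - lam - delta
    logits = np.concatenate(([0.0], np.cumsum(eta - np.asarray(taus))))
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Hypothetical values: a student at 1.0 logits, a severe rater at 0.4 logits,
# a domain at -0.2 logits, and three category coefficients for a four-category scale.
print(category_probabilities(theta=1.0, lam=0.4, delta=-0.2, taus=[-1.0, 0.2, 0.8]))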

3.3.2. Rater calibrations

As shown in Table 1, the first category of rating quality indices based on Model I is rater calibrations. First, rater leniency and severity can be examined by comparing the estimates of rater locations on the logit scale (λ) obtained from Model I. Because the logit scale serves as an operational definition of writing achievement (the latent variable), logit-scale locations for individual raters describe their interpretation of the level of writing achievement required for ratings in each category. Raters with high locations on the logit scale require "more" writing achievement to assign higher ratings, and thus are considered severe; raters with low locations on the logit scale require "less" writing achievement to assign higher ratings, and thus are considered lenient. The logit-scale location for each rater summarizes their overall leniency and severity across facets. Standard errors are computed for each rater that describe the precision with which each rater is calibrated on the logit scale.

In addition to logit-scale locations and standard errors, separation statistics based on Rasch models describe the differences between logit-scale locations for elements within a facet, such as individual students, raters, and domains. Conceptually, the reliability of separation statistic for students is comparable to Cronbach's coefficient alpha and KR20, in that it reflects an estimate of true score to observed score variance.1 For the other facets, the reliability of separation statistic describes the overall spread, or differences, between elements within a facet, such as differences in rater severity. Andrich (1982) provides a detailed derivation of this reliability of separation index. The Facets computer program (Linacre, 2010) for MFR analyses also computes a chi-square statistic that can be used to investigate the degree to which differences among the individual elements within a facet are statistically significant.
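As a sketch of how this statistic is typically computed (adjusted "true" variance of the element calibrations divided by their observed variance), the Python snippet below applies the standard formula to the ten operational rater severities and standard errors reported later in Table 4. This simplified calculation gives a value close to the 0.98 reported in Table 3, but it is an illustration rather than the Facets program's exact output.

import numpy as np

def reliability_of_separation(measures, standard_errors):
    # Observed variance of the logit calibrations minus mean error variance
    # gives the adjusted ("true") variance; reliability is their ratio.
    observed_var = np.var(measures, ddof=1)
    error_var = np.mean(np.square(standard_errors))
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var

# Operational rater severity calibrations and standard errors from Table 4.
rater_severities = np.array([0.41, 0.21, 0.10, 0.09, -0.06, -0.14, -0.30, -0.46, -0.47, -1.07])
rater_ses = np.full(10, 0.06)
print(round(reliability_of_separation(rater_severities, rater_ses), 2))  # approximately 0.98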

3.3.3. Model-data fit

As shown in Table 1, rating quality indices based on Model I also include evidence of model-data fit. Essentially, model-data fit is a measure of the match between the requirements of a model and the structure of empirical data. For Rasch models, evidence of good model-data fit suggests that the requirements for invariant measurement are met in empirical data. Model-data fit analyses for Rasch models typically focus on fit statistics that summarize residuals, or differences, between model expectations and empirical observations. In the context of a rater-mediated writing assessment, fit statistics can be used to identify individual raters or students for whom rating patterns do not match the values predicted from the MFR model. This study examines model-data fit within the four writing domains using a fit statistic that is computed in the Facets program (Linacre, 2007): Outfit Mean Square Error (Outfit MSE). This statistic can be calculated for facets related to students and raters. In light of this study's focus on rating quality, fit statistics for raters are examined in detail. Outfit MSE is calculated by averaging the squared standardized residuals for a facet. Because it is unweighted, the Outfit MSE statistic is particularly sensitive to "outliers," or extreme unexpected observations. Another commonly used MSE statistic that is computed in Facets (Linacre, 2007) is Infit MSE. Infit MSE statistics are weighted by the statistical information for a particular facet; as a result, they are not as sensitive to extreme outliers. For this reason, the current study focuses on the Outfit MSE statistic. Although limitations of Rasch fit statistics have been noted in previous research (e.g., Karabatsos, 2000; Smith, Schumacker, & Bush, 2000), useful applications of Outfit MSE statistics have been demonstrated (Engelhard, 2013; Linacre, 1994). Because the exact sampling distribution of MSE statistics is not known, various rules of thumb have been proposed for interpreting the values of Rasch fit statistics as they apply to specific types of facets, such as raters and students (Engelhard, 2013; Wright & Masters, 1982). Engelhard (2013) describes an acceptable range of Outfit MSE statistics of about 0.80–1.20. Recently, Wolfe (2013) demonstrated a bootstrap procedure for determining critical values for Rasch MSE statistics. Despite these advances, the current study uses the critical values of 0.80 and 1.20 that have been explored in depth in the context of rater-mediated assessments, and have been found to signal meaningful patterns in rating data. Specifically, values of MSE statistics below about 0.80 have been found to signal "muted" rating patterns, or possible dependencies among the ratings that raters assign across domains; extreme values in both directions warrant further investigation.

1 However, these coefficients are slightly different because alpha is based on the assumption of linearity, whereas the Rasch-based reliability of separation statistic is based on a linear, interval-level scale when good model-data fit is observed.
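The sketch below illustrates, under simplified assumptions, how the unweighted (Outfit) and information-weighted (Infit) mean square statistics described above can be computed for a single rater from observed ratings, model-expected ratings, and model variances. The numeric values are invented placeholders.

import numpy as np

def outfit_mse(observed, expected, variance):
    # Unweighted mean square: the average squared standardized residual.
    z = (observed - expected) / np.sqrt(variance)
    return np.mean(z ** 2)

def infit_mse(observed, expected, variance):
    # Information-weighted mean square: squared residuals weighted by model variance.
    return np.sum((observed - expected) ** 2) / np.sum(variance)

# Hypothetical ratings assigned by one rater, with model expectations and variances.
observed = np.array([2.0, 3.0, 3.0, 2.0, 4.0])
expected = np.array([2.3, 2.8, 3.1, 2.2, 3.6])
variance = np.array([0.55, 0.60, 0.58, 0.52, 0.50])
print(round(outfit_mse(observed, expected, variance), 2),
      round(infit_mse(observed, expected, variance), 2))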


3.3.4. Interactions

Finally, MFR model analyses can be extended to examine "bias," or interactions between facets; these interaction analyses constitute the third category of rating quality indices based on Model I. Tests for significant interactions based on the MFR model investigate the null hypothesis that there is no significant interaction between the logit-scale locations for particular facets. For the purposes of this study, it is of interest to examine interactions between rater severity and the difficulty of writing domains. Examination of interactions between these two facets provides information about differences in individual raters' interpretations of the difficulty of a particular domain.

Interaction analyses for the MFR model can be conducted with the Facets computer program (Linacre, 2010) using a bias analysis procedure. For this study, the interaction between rater severity and writing domains was specified for a bias analysis. The program provides information about the rater interactions at an overall level and at the level of individual raters. First, an overall fixed chi-square (χ2) test for the significance of a set of interactions is conducted. A significant value of the chi-square statistic suggests that interactions between rater severity and domain difficulty were significant at an overall level. In addition to the overall test for significant interactions, it is possible to examine individual interactions between two facets. These individual terms provide information about the direction and magnitude of each interaction, and patterns of interactions can be used to inform the interpretation of measurement outcomes. For this study, interactions were examined between individual rater severity within particular domains and the overall difficulty of the domain based on the group of operational raters. Significant interactions for an individual rater suggest that the rater interprets the difficulty of a domain as either significantly higher or lower than the overall group of raters.
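The snippet below sketches one simplified way to approximate a rater-by-domain bias term of the kind reported in a Facets bias analysis: the bias size is estimated from the residuals of the main-effects model within a single rater-by-domain cell, and a t-statistic tests whether it differs from zero. This is an illustrative approximation, not the program's exact algorithm, and the values are invented.

import numpy as np

def interaction_bias(observed, expected, variance):
    # Approximate bias (in logits) for one rater-by-domain cell, with its t-statistic.
    residual_sum = np.sum(observed - expected)
    information = np.sum(variance)
    bias = residual_sum / information
    se = 1.0 / np.sqrt(information)
    return bias, bias / se

# Hypothetical ratings from one rater within one domain.
observed = np.array([3.0, 2.0, 3.0, 3.0, 2.0, 3.0, 2.0, 2.0])
expected = np.array([2.4, 2.5, 2.6, 2.4, 2.5, 2.6, 2.3, 2.4])
variance = np.full(8, 0.55)
print(interaction_bias(observed, expected, variance))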

3.3.5. Model II: Rater accuracy

The second part of the data analysis for this study involved the use of a MFR model for rater accuracy. In order to conduct accuracy analyses, a comparison of operational ratings with those assigned by the expert rater was used to create a second data set of dichotomous accuracy ratings. Ratings that matched the expert were assigned an accuracy score of one, and ratings that were discrepant from the expert were assigned an accuracy score of zero, such that higher accuracy ratings reflect more-accurate scoring in terms of a match between operational rater and expert scores. Using these dichotomous accuracy ratings, the latent variable in the second MFR model is rater accuracy, and the rater is the object of measurement. Students and domains provide opportunities for raters to assign accurate or inaccurate ratings. In Model II, rater accuracy is modeled as a combination of rater ability to assign accurate ratings, the difficulty for raters to score particular students accurately, and the difficulty for raters to assign accurate ratings within writing domains. Stated mathematically, Model II is:

ln[Pnji(x=1) / Pnji(x=0)] = βn − λj − δi,   (2)

where Pnji(x=1) = probability of essay n scored by rater j within domain i being rated accurately (x = 1); Pnji(x=0) = probability of essay n scored by rater j within domain i being rated inaccurately (x = 0); βn = difficulty of providing an accurate rating on the essay composed by student n; λj = accuracy of rater j; and δi = difficulty of providing an accurate rating of domain i.
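As a minimal numeric illustration, the sketch below evaluates the probability of an accurate rating implied by Eq. (2), combining the parameters exactly as the equation is written above; the parameter values are invented and are not estimates from the study.

import math

def probability_accurate(beta, lam, delta):
    # Dichotomous Rasch form of Eq. (2): P(x = 1) = exp(eta) / (1 + exp(eta)),
    # where eta = beta - lam - delta follows the sign convention printed above.
    eta = beta - lam - delta
    return math.exp(eta) / (1.0 + math.exp(eta))

# Hypothetical essay, rater, and domain parameters on the accuracy scale.
print(round(probability_accurate(beta=0.5, lam=0.2, delta=-0.1), 3))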

Similar to Model I, there are three major categories of rating quality indices for Model II: (A) Rater accuracy calibrations, (B) Model-data fit, and (C) Interactions. As shown in Table 2, these rating quality indices parallel the Rasch-based statistics and displays for Model I. Rating quality indices based on rater accuracy calibrations describe differences in the accuracy of individual raters using logit-scale locations on the accuracy construct, a variable map, and the reliability of separation statistic. Then, fit statistics for rater accuracy are examined to identify raters whose accuracy scores tend to differ significantly from the values expected by the model. Finally, interactions between rater accuracy locations and the difficulty for accuracy within each analytic rubric domain are examined in order to explore whether rater accuracy is comparable across writing domains.


3.3.6. Correspondence between Model I and Model II

In order to address the third research question for this study, scatterplots and regression analyses (R2) were used to examine the relationship between Outfit MSE statistics from Model I and rater accuracy calibrations from Model II within each of the four writing domains. These two indices of rating quality were selected based on previous research that demonstrates the usefulness of Outfit MSE statistics for identifying raters who meet the requirements of invariant measurement (e.g., Engelhard, 2002, 2013; Myford & Wolfe, 2003, 2004). Rater accuracy measures describe individual raters in terms of their ability to assign accurate ratings across a set of essays and domains with known scores based on an expert rating (Engelhard, 2013). For this analysis, the correspondence between the standardized Outfit MSE statistic and rater accuracy measures was examined within the four writing domains.
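A sketch of this correspondence analysis is shown below: within one domain, raters' Outfit MSE statistics are regressed on their accuracy measures and R2 is reported. The ten paired values are invented placeholders chosen to show a mild negative trend; they are not the study's estimates.

import numpy as np
from scipy import stats

# Hypothetical rater accuracy measures (logits) and Outfit MSE statistics for one domain.
accuracy_measures = np.array([0.40, 0.34, 0.27, 0.16, -0.06, -0.08, -0.11, -0.15, -0.26, -0.50])
outfit_mse = np.array([0.78, 0.82, 0.85, 0.90, 0.98, 1.02, 1.05, 1.04, 1.12, 1.18])

# Simple linear regression of Outfit MSE on accuracy; R^2 summarizes the correspondence.
result = stats.linregress(accuracy_measures, outfit_mse)
print("slope:", round(result.slope, 2), "R^2:", round(result.rvalue ** 2, 2))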

4. Results

In this section, results from the overall rating quality analyses under Model I (rater invariance) and Model II (rater accuracy) are described separately. Then, the correspondence between the indices of rater invariance and rater accuracy within the Conventions, Ideas, Organization, and Style domains is presented. The subsequent section provides conclusions in terms of the research questions that were used to guide this study.

4.1. Model I: Rater invariance

Table 3 summarizes the findings from the analyses based on Model I. Summary statistics based on Rasch models were used to examine the overall fit between the model and the empirical observations before individual raters are examined in detail in terms of rating quality. First, the Measures provide the overall logit-scale locations for the students, raters, and domains. In order to provide a frame of reference for interpreting the locations of the three facets on the logit scale, the average location for raters and domains was set to zero, and only the average location of the student facet was allowed to vary. Chi-square statistics suggested that the overall differences in logit-scale locations between students (θ), raters (λ), and domains (δ) were significant (p < 0.05). Further, the value of the reliability of separation statistic for students (similar in interpretation to coefficient alpha) was quite high (Rel = 0.94). For the Rater and Domain facets, the value of the reliability of separation statistic describes the spread, or differences, between elements within a facet. High values of the reliability of separation statistics for raters (Rel = 0.98) and domains (Rel = 0.99) suggested that there were significant differences among the individual raters in terms of severity and among the difficulty estimates of the four domains examined in this study. The somewhat-low Outfit MSE statistics in Table 3 suggest that the raters might have applied holistic scoring patterns when rating the MGWA.

Table 3
Model I: Summary statistics for ratings.

                          | Student   | Rater  | Domain
Measure: M                | -0.01     | -0.15  | 0.00
Measure: SD               | 3.43      | 0.39   | 0.38
Measure: N                | 2120      | 11     | 4
Outfit MSE: M             | 0.74      | 0.89   | 0.74
Outfit MSE: SD            | 0.56      | 0.09   | 0.13
Reliability of separation | 0.94      | 0.98   | 0.99
Chi-square (χ2)           | 35,059.1* | 464.0* | 577.9*
(df)                      | 2119      | 10     | 3

Note. Rater calibrations are anchored on the expert rater, whose location is set to 0.00 logits.
* p < 0.05.


Fig. 1. Model I: variable map for rating data. Note. The logit scale in the first column represents the construct being measured by Model I (writing achievement). Higher logit-scale values represent higher writing achievement, and lower logit-scale values represent lower writing achievement. The locations of each individual student, rater, and domain reflect their location on this common scale.

Fig. 1 is a variable map for Model I that displays student writing achievement measures, rater severity calibrations, domain difficulty locations, and the rating scale categories on a common linear logistic scale. This visual display corresponds to the information provided in Table 3, and it represents the overall shared understanding of student writing achievement, domain difficulty, and rating-scale categories for the MGWA among the group of raters. The first column is the logit scale that represents writing achievement, which is the latent variable examined in this study. The second column displays student locations on the latent variable (N = 2120). Each asterisk represents 15 students, and a period represents one student. As can be seen in the variable map, there was a wide range in student locations on the latent variable; the spread of students suggests that the group of raters detected differences among the students in terms of writing achievement. Students who are located higher on the logit scale received higher ratings, and students who are located lower on the logit scale received lower ratings. Examination of the Rater column reveals that there was not much variation among the locations of the raters on the logit scale compared to the spread of student locations. Raters who are located higher on the logit scale were more severe; i.e., they assigned lower ratings more often. Raters who are located lower on the logit scale are less severe; i.e., they assigned higher ratings more often. Finally, the location of the domains on the logit scale reflects the judged difficulty of the Conventions, Ideas, Style, and Organization domains for the MGWA. Domains that are located higher on the logit scale were associated with more severe (i.e., lower) ratings, and domains that are located lower on the logit scale were associated with less severe (i.e., higher) ratings.

Table 4
Model I: Calibration of the rater facet.

Rater            | Average rating | Severity measure (logits) | SE   | Outfit MSE
3 (most severe)  | 2.49           | 0.41                      | 0.06 | 0.75
2                | 2.46           | 0.21                      | 0.06 | 0.94
6                | 2.40           | 0.10                      | 0.06 | 0.82
1                | 2.51           | 0.09                      | 0.06 | 0.73
Expert           | 2.50           | 0.00                      | 0.03 | 0.60
7                | 2.45           | -0.06                     | 0.06 | 0.87
10               | 2.54           | -0.14                     | 0.06 | 0.92
4                | 2.62           | -0.30                     | 0.06 | 0.85
8                | 2.56           | -0.46                     | 0.06 | 0.77
5                | 2.57           | -0.47                     | 0.06 | 0.65
9 (most lenient) | 2.65           | -1.07                     | 0.06 | 0.79

Note. Raters are ordered by severity measure (logits) from high (severe) to low (lenient).

4.1.1. Rating quality indices based on Model I

Next, rating quality indices based on Model I are investigated using the three major categories of Rasch-based statistics and displays presented in Table 1: (A) Rater calibrations, (B) Model-data fit, and (C) Interactions.

Table 4 summarizes the individual rater calibrations based on Model I; these rater calibrations correspond to the rater column in the variable map shown in Fig. 1. The rater calibrations were centered on the expert rater (M = 0.00). Among the operational raters, Rater 3 assigned the lowest ratings on the three-point scale overall (M = 2.49), and is located highest on the logit scale (Measure = 0.41 logits, SE = 0.06). Rater 3 is located about 0.56 logits above the group mean (−0.15 logits), which suggests that this rater assigned ratings that are severe compared to the other raters in the group. In contrast, Rater 9 assigned the highest ratings overall (M = 2.65), and is located lowest on the logit scale (Measure = −1.07 logits, SE = 0.06). Rater 9 is located about 0.92 logits below the group mean, which suggests that this rater assigned ratings that are lenient compared to the other raters in the group. As mentioned above, separation statistics for the rater facet suggest that there were significant differences among individual rater severity calibrations (Reliability of separation = 0.98; χ2(10) = 464.0, p < 0.05). Detailed analyses of the ratings assigned by these raters and qualitative interviews may shed further light on differences in rater severity.

Table 4 also provides Outfit MSE statistics for the raters examined in this study. The Expert rater and Raters 3, 1, 8, 5, and 9 display muted ratings. On the other hand, MSE statistics higher than about 1.20 have been found to signal "noisy" rating patterns that suggest more variation in ratings than expected by the model. No Outfit MSE statistics with values higher than 1.20 are observed among these raters.

In order to examine interactions between rater severity and domain difficulty, significance tests for interaction terms are explored using statistical evidence and visual displays. For Model I, the overall chi-square test for the interaction between rater severity and domain difficulty is significant: χ2(44) = 357.1, p < 0.05. Fig. 2 illustrates interactions between rater severity and domain difficulty based on Model I. Raters are listed along the x-axis, and the test statistic for the null hypothesis that there is no interaction effect between rater severity and domain difficulty is plotted along the y-axis. Lines that mark the positive and negative two boundaries are used to highlight significant values of t-statistics. Absolute values of the t-statistic that are greater than two suggest that rater severity in a domain is higher or lower than expected based on its overall difficulty measure. Inspection of Fig. 2 reveals that there are significant interactions between rater severity and domain difficulty within each of the four domains. Significant interactions between these two facets suggest that rater severity may not be invariant across the rubric domains on the MGWA. In other words, significant interactions suggest that raters may be systematically more or less severe when assigning ratings within a particular domain than would be expected if their ratings matched the expectations of the ideal-type model. Significant interactions between these two facets were most frequently observed in the Ideas domain, and least frequently observed in the Style domain.

Fig. 2. Model I: interactions between rater severity and domain difficulty. Note. Values of the t-statistic shown here are tests of the hypothesis that there is no interaction between rater severity and domain difficulty. Values higher than +2.00 suggest that the rater assigned higher ratings (i.e., was more lenient) than expected on a domain, based on its overall judged difficulty measure across the raters. Values of the t-statistic that are lower than −2.00 suggest that the rater assigned lower ratings (i.e., was more severe) than expected on a domain.

4.2. Model II: Rater accuracy

Table 5 summarizes the findings from the rater accuracy analyses based on Model II. This set of summary statistics describes the overall locations for the essays, raters, and domains on the logit scale that represents rater accuracy, and Table 5 is used to examine the overall fit of the MFR model to the rating data. Similar to the analyses under Model I, the average logit-scale locations for the essay and domain facets were set to zero in order to provide a frame of reference for interpreting rater accuracy in Model II. Based on the chi-square statistic, the overall differences among the students (β) and raters (λ) in terms of rater accuracy are significant (p < 0.05). Because the rater facet is the object of measurement in Model II, the reliability of separation statistic for raters is interpreted similarly to Cronbach's alpha; findings suggest that there are significant differences among the group of operational raters in terms of their ability to assign accurate ratings (Rel = 0.95). In contrast, low reliability of separation statistics for the Student (β) facet (Rel = 0.45) and the Domain (δ) facet (Rel = 0.00) suggest that rater accuracy was not significantly different across the set of 2120 essays or across the four domains on the analytic rubric. Examination of the Outfit MSE statistics in Table 5 suggests adequate fit to Model II (Outfit MSE ≈ 1.00).

Table 5
Model II: Summary statistics for accuracy ratings.

                          | Student | Rater  | Domain
Measure: M                | -0.64   | 0.00   | 0.00
Measure: SD               | 1.26    | 0.27   | 0.04
Measure: N                | 2120    | 10     | 4
Outfit MSE: M             | 1.06    | 1.06   | 1.06
Outfit MSE: SD            | 0.40    | 0.10   | 0.05
Reliability of separation | 0.45    | 0.95   | 0.00
Chi-square (χ2)           | 3215.1* | 191.4* | 3.6
(df)                      | 2119    | 9      | 3

* p < 0.05.

Fig. 3 is a variable map that summarizes findings from the Facets analysis (Linacre, 2010) for rater accuracy using Model II. In the context of accuracy ratings, the variable map represents rater accuracy in terms of the difficulty for raters to provide accurate ratings on each student essay, individual rater accuracy measures, and the difficulty for raters to provide accurate ratings within each of the writing domains. The first column is the logit scale that represents rater accuracy, which is the latent variable for Model II. The second column displays the location of each student on the accuracy construct; each asterisk represents 61 students, and a period represents one student. The location of the students on the logit scale represents the difficulty for operational raters to assign accurate ratings to each essay. As can be seen in the variable map, there is a wide range in the locations of the 2120 students on the logit scale, which suggests that there are differences among the set of essays in terms of difficulty for the operational raters to assign accurate ratings. Students who are located higher on the logit scale received accurate ratings more often, and students who are located lower on the logit scale received inaccurate ratings more often. Examination of the Rater column reveals that there is some variation among the locations of the raters on the logit scale. Raters who are located higher on the logit scale were more accurate; i.e., their ratings matched expert ratings often. Raters who are located lower on the logit scale are less accurate; i.e., they assigned ratings that matched the expert less often. Finally, the location of the domains on the logit scale reflects the difficulty for raters to assign accurate ratings within the Ideas, Style, Conventions, and Organization domains for the MGWA. Domains that are located higher on the logit scale are associated with more-accurate ratings, and domains that are located lower on the logit scale are associated with less-accurate ratings.

Fig. 3. Model II: variable map for accuracy ratings. Note. The logit scale in the first column represents the construct being measured by Model II (rater accuracy). Higher logit-scale values represent higher accuracy, and lower logit-scale values represent lower accuracy. The locations of each individual student, rater, and domain reflect their location on this common scale.

4.2.1. Rating quality indices based on Model II

Rating quality indices based on Model II were investigated using the three major categories of Rasch-based statistics and displays presented in Table 2: (A) Rater accuracy calibrations, (B) Model-data fit, and (C) Interactions.

Table 6 summarizes the rater calibrations under Model II. Among the operational raters, Rater 6 is most accurate, as defined by the dichotomous accuracy ratings (M = 0.76), and is located highest on the logit scale (Measure = 0.40 logits, SE = 0.06). Rater 9 is least accurate (M = 0.63), and is located lowest on the logit scale (Measure = −0.50 logits, SE = 0.06). As mentioned above, separation statistics for the rater accuracy facet suggest that there are significant differences among rater accuracy calibrations on the logit scale (Reliability of separation = 0.95; χ2(9) = 191.4, p < 0.05). Detailed analyses of the ratings assigned by these raters and qualitative interviews may shed further light on differences in rater accuracy among the group of operational raters.

Table 6 also provides Outfit MSE statistics for the rater accuracy calibrations under Model II. As can be seen in the table, fit statistics for the ten operational raters fall within the range of approximately 0.80 ≤ Outfit MSE ≤ 1.20, which suggests good fit between the dichotomous accuracy scores and the MFR model.

Table 6
Model II: Calibration of the rater accuracy facet.

Rater ID           | Average rating | Accuracy measure (logits) | SE   | Outfit MSE
6 (most accurate)  | 0.76           | 0.40                      | 0.06 | 0.91
3                  | 0.75           | 0.34                      | 0.06 | 0.95
4                  | 0.73           | 0.27                      | 0.06 | 0.99
5                  | 0.73           | 0.16                      | 0.06 | 0.95
7                  | 0.70           | -0.06                     | 0.06 | 1.05
2                  | 0.70           | -0.08                     | 0.06 | 1.05
8                  | 0.70           | -0.11                     | 0.06 | 1.06
1                  | 0.69           | -0.15                     | 0.06 | 1.04
10                 | 0.65           | -0.26                     | 0.06 | 1.14
9 (least accurate) | 0.63           | -0.50                     | 0.06 | 1.15

Note. Raters are ordered by accuracy measure (logits), from high (accurate) to low (inaccurate).

Analyses under Model II also include an examination of interactions between rater accuracy and the four writing domains. Because of the unique nature of the accuracy scale, a slightly different technique is used to display interaction analyses for the accuracy data. Specifically, average accuracy ratings on the dichotomous accuracy scale were examined for each rater within the four domains. If a rater were consistently accurate when scoring students in terms of Conventions, Ideas, Style, and Organization, then their average accuracy rating would be consistent across these analytic rubric domains. In contrast, raters whose accuracy varied across the analytic rubric domains would have different average accuracy ratings for Conventions, Ideas, Style, and Organization. Fig. 4 provides illustrative plots that demonstrate differences in rater accuracy across domains. Each rater profile indicates an individual rater's average accuracy score within each domain; the average accuracy score is plotted along the y-axis, and the four domains are shown along the x-axis. Three rater profiles are presented that demonstrate different patterns in a particular rater's ability to assign accurate ratings across the four writing domains. Rater 1 (Panel A) assigns accurate ratings in the Ideas and Style domains more often than in the Organization and Conventions domains. In contrast, Rater 9 (Panel C) assigns accurate ratings in the Ideas and Organization domains more often than in the Style and Conventions domains. Finally, Rater 6 (Panel B) is generally consistent in terms of accuracy across the four domains. The primary utility of these differential rater accuracy profiles is as a diagnostic tool through which trends in accuracy can be examined for individual raters, and compared across groups of raters in terms of analytic rubric domains.
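The sketch below shows, with invented data and hypothetical column names, how such differential rater accuracy profiles can be tabulated by averaging the dichotomous accuracy scores for each rater within each domain.

import pandas as pd

# Invented accuracy scores (0 = inaccurate, 1 = accurate) for two raters across two domains.
scores = pd.DataFrame({
    "rater":    [1, 1, 1, 1, 9, 9, 9, 9],
    "domain":   ["Ideas", "Ideas", "Style", "Style", "Ideas", "Ideas", "Style", "Style"],
    "accuracy": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Average accuracy per rater within each domain: one row per rater, one column per domain.
profiles = scores.groupby(["rater", "domain"])["accuracy"].mean().unstack("domain")
print(profiles)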

4.3. Correspondence between Model I and Model II

The final step in the analyses for this study involved an examination of the correspondence between rating quality indices based on rater invariance (Model I) and direct measures of rater accuracy (Model II). In order to examine the relationship between indices of rating quality based on Model I and Model II, the alignment between model-data fit statistics from Model I and rater accuracy measures from Model II was explored using scatterplots and regression analyses (R²) within the four writing domains. In the context of a rater-mediated writing assessment, low values of Outfit MSE statistics (y-axis) suggest “muted” ratings that have been shown to signal dependencies among ratings, or halo error. High values of Outfit MSE statistics have been shown to signal “noisy” ratings, or haphazard use of the rating scale. Rater accuracy measures (x-axis) describe individual raters in terms of their ability to assign ratings that match known scores from an expert rater. Findings are summarized statistically and graphically in Fig. 5.

In general, examination of the correspondence between Model I and Model II within the four writing domains suggests that there is a significant association between Outfit MSE and rater accuracy measures (p < 0.05) within all four domains. This finding suggests that indices based on Outfit MSE statistics and direct measures of accuracy may provide comparable information about rating quality. Although visual displays of the correspondence between these indices reveal that the nature of the relationship between model-data fit and accuracy varies slightly across the four domains, there is generally a negative relationship between these two indices of rating quality.


Fig. 4. Model II: differential rater accuracy profiles: three examples. Note. For each rater, the average observed accuracy based on the dichotomous accuracy scale (0 = inaccurate rating; 1 = accurate rating) is plotted within the four domains.


Fig. 5. Correspondence between rater accuracy measures (Model II) and Outfit MSE statistics (Model I).

In general, low logit-scale measures of rater accuracy on the x-axis correspond to higher values of Outfit MSE statistics on the y-axis, which signal noisy ratings, whereas high logit-scale measures of rater accuracy correspond to lower Outfit MSE statistics. These findings are likely related to the fact that fit statistics from Model I for the expert rater, whose ratings were used to define accuracy for the operational raters, are somewhat muted (low). As a result, raters who match the expert rater are likely to also display muted ratings.
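A minimal sketch of the within-domain correspondence analysis is shown below, assuming one Outfit MSE value and one accuracy measure per rater for a given domain. The Outfit values are hypothetical and are chosen only to illustrate the negative slope described above; the actual MGWA relationships appear in Fig. 5.

import numpy as np
from scipy import stats

# Hypothetical per-rater summaries within one domain (not the MGWA estimates).
accuracy_measure = np.array([0.40, 0.34, 0.27, 0.16, -0.06, -0.08, -0.11, -0.15, -0.26, -0.50])
outfit_mse = np.array([0.86, 0.88, 0.91, 0.94, 1.00, 1.01, 1.03, 1.05, 1.10, 1.17])

# Regress Outfit MSE (Model I) on rater accuracy measures (Model II); a negative
# slope with a sizable R-squared reflects the inverse relationship described in the text.
fit = stats.linregress(accuracy_measure, outfit_mse)
print(f"slope = {fit.slope:.2f}, R^2 = {fit.rvalue ** 2:.2f}, p = {fit.pvalue:.4f}")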

5. Summary of results

In this section, findings are discussed as they relate to the three research questions that were used to guide this study.

5.1. Are rater judgments invariant over analytic rubric domains in writing assessment?

The first research question focuses on rater invariance across four analytic rubric domains for a large-scale rater-mediated writing assessment. In order to address this research question, a three-facet rating scale formulation of the MFR model was used to examine the locations of individual students (θ), raters (λ), and domains (δ) on a logit scale that represents writing achievement. Indices of model-data fit based on this model were used as evidence that rater-assigned scores met the requirements for rater-invariant measurement of student writing achievement. Further, interactions between rater severity and domain difficulty were used to explore the degree to which rater judgments of writing achievement were invariant across the Conventions, Ideas, Organization, and Style domains of the MGWA.
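For readers unfamiliar with this formulation, a three-facet rating scale MFR model of the kind described here can be written as follows; the notation is our rendering in standard many-facet Rasch form, and the symbols are illustrative rather than the authors' exact specification:

\ln\!\left(\frac{P_{nmik}}{P_{nmi(k-1)}}\right) = \theta_n - \lambda_m - \delta_i - \tau_k

where P_{nmik} is the probability that student n receives a rating in category k from rater m on domain i, θ_n is the writing achievement of student n, λ_m is the severity of rater m, δ_i is the difficulty of domain i, and τ_k is the threshold between categories k − 1 and k.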

Findings from analyses with Model I revealed that rater severity is not invariant over the analytic rubric domains on the MGWA. Specifically, results from interaction analyses indicated significant interactions within each of the writing domains. Furthermore, the direction and magnitude of these interactions varied across the four domains. There are several plausible explanations for the lack of rater invariance across analytic rubric domains that was observed in this study. One possible explanation is that differences related to the overall quality of an essay may impact rater severity within individual domains. For example, simple mechanical errors may seem less excusable in an essay that demonstrates full control of organization and style. On the other hand, a lack of mechanical errors may overshadow infelicities in stylistic components of an essay.

It is also likely that the lack of rater invariance across domains reflects the inability of the methods used in practice to monitor ratings during rater training and operational scoring to detect this aspect of rating quality.

The finding of rater-variant interpretation of analytic rubric domains demonstrates the usefulness of the MFR model as a methodological tool that has potential benefits for practical settings as well as in mixed-methods research (Myford, 2012). Specifically, the MFR model can be used to examine rater effects related to analytic rubric domains during rater training procedures or operational rating to identify raters whose interpretation of an analytic rubric is inconsistent with the domain specifications set forth by assessment developers. Further, qualitative interviews with raters who are “flagged” by this procedure may shed additional light on quantitative indices of rating quality that can be used to target further training in order to ensure invariant score interpretation across raters and domains in large-scale rater-mediated writing assessments.

5.2. How accurate are rater judgments over analytic rubric domains in writing assessment?

The second research question focuses on rater accuracy across the four analytic rubric domains on the MGWA. In order to address this question, an MFR model was specified that modeled rater accuracy as a combination of rater accuracy ability (β), the difficulty for raters to assign accurate ratings on student essays (θ), and the difficulty for raters to assign accurate ratings in rubric domains (δ). Findings from analyses under Model II revealed that there were significant differences in accuracy among the operational raters, and that individual student essays are significantly different in terms of the difficulty for raters to assign accurate ratings. Despite the non-significant differences in the overall logit-scale locations for the four domains (the main effect for domains), findings from the interaction analyses suggest that the difficulty ordering for accuracy within each domain varies across the group of operational raters. In other words, interactions between the rater accuracy and domain facets from Model II suggest that rater accuracy was not invariant across these domains. Similar to the findings from Model I, the interaction analyses illustrate a method for identifying individual raters whose rating patterns may warrant further attention during rater training or operational rating. Furthermore, findings of interactions between rater accuracy and analytic rubric domains highlight differences in the quality of ratings within domains that may not be detected in rating quality analyses based on holistic ratings.
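Under the same caveat about notation, the dichotomous accuracy formulation described here can be sketched as a plausible rendering rather than the authors' exact parameterization:

\ln\!\left(\frac{P_{nmi(A=1)}}{P_{nmi(A=0)}}\right) = \beta_m - \theta_n - \delta_i

where A = 1 denotes an accurate rating (a match with the expert rating), β_m is the accuracy of rater m, θ_n is the difficulty of assigning an accurate rating to the essay of student n, and δ_i is the difficulty of assigning an accurate rating in domain i.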

5.3. Do the invariance and accuracy indices provide comparable indices of rating quality over analytic rubric domains in writing assessment?

The third research question focuses on the alignment between indices of rater invariance and rater accuracy within the four domains on the MGWA. In order to address this research question, scatterplots and regression analyses were examined between Outfit MSE statistics from Model I and rater accuracy measures on the logit scale from Model II within the Conventions, Ideas, Style, and Organization domains on the MGWA. Findings revealed that there was a significant correlation between these two indices of rating quality within all four domains. Further, fit statistics that suggest haphazard rating patterns (mid-to-high standardized Outfit) were associated with low accuracy measures, and fit statistics that suggest expected and muted rating patterns (low standardized Outfit) were associated with high accuracy measures. This finding suggests that model-data fit statistics based on operational ratings may provide information about rater accuracy for individual raters. Based on these findings, it is possible to conclude that Outfit statistics may provide information that can be used to identify raters during training and operational scoring who require additional training in order to ensure high-quality ratings.

6. Discussion and conclusions

Large-scale writing assessments such as the Middle Grades Writing Assessment are widely used to inform educational decisions for students, including promotion, placement, and admissions. Although these performance-based assessments are assumed to provide evidence about a construct in a more valid way than multiple-choice assessments (Yancey, 1999), evidence of rating quality is necessary to support the interpretation and use of rater-assigned scores. As Lumley (2002) observed:

The rater, not the scale, lies at the centre of the process. It is the rater who decides:

1. which features of the scale to pay attention to;
2. how to arbitrate between the inevitable conflicts in the scale wordings; and
3. how to justify her impression of the text in terms of the institutional requirements represented by the scale and rater training. (p. 267)

Although concern with the quality of ratings has been a recurring theme in writing assessment research for the last century, traditional methods for evaluating ratings are limited by a lack of consistency across statistical indicators of rater agreement, error and systematic bias, and accuracy, along with a lack of concern for identifying differences in rating quality across analytic rubric domains. In this study, a coherent set of rating quality indicators based on the theory of invariant measurement was applied to rating data, and the alignment between Rasch model-based indices of rater invariance and rater accuracy was explored within analytic rubric domains. Although previous research has demonstrated the use of the MFR model as a methodological tool for examining rater invariance (Engelhard, 2013; Eckes, 2011; Myford & Wolfe, 2003, 2004; Wolfe, 2009) and rater accuracy (Engelhard, 2013), the use of these models within analytic rubric domains and the correspondence between rating quality indices based on the two models has not been fully developed. Differences in rating quality across rubric domains hold significant implications for the interpretation and use of ratings from rater-mediated writing assessments that employ analytic rubrics. Findings from this study suggest that indices of rating quality based on the MFR model provide a useful method for identifying individual raters for whom score meaning may not be consistent across rubric domains. The Rasch-based indices of rater invariance and rater accuracy for domains that were demonstrated in this study can be used in practice to inform rater-training procedures and to monitor individual raters during operational scoring. For example, the indicators of rater invariance (Model I) that were illustrated in this study can be used during rater training to identify raters for remediation or as qualification criteria. Similarly, these indicators can be included in ongoing monitoring procedures during operational scoring in order to identify raters who may need retraining, or individual student essays or domains on which ratings should be reexamined.

Further, the correspondence between rating quality indices based on invariance (Model I) and accuracy (Model II) holds important implications for methods used to examine rating quality. In particular, the finding that indices of model-data fit for raters correspond to rater accuracy measures contrasts with previous research on performance assessment that has not found an alignment between indices of model-data fit and direct measures of rater accuracy (Landy & Farr, 1980; Murphy & Cleveland, 1991; Sulsky & Balzer, 1988). Because practical limitations in large-scale performance assessment often limit the use of expert raters, the correspondence between these two indices has important implications for monitoring raters during rater training and operational scoring. Specifically, ongoing model-data fit analyses for ratings using the MFR model can be used to identify raters whose scores may warrant further attention to ensure that they are providing high-quality information about student achievement in terms of a construct.

Additional research should further examine the correspondence between indices of rater invariance and rater accuracy using Rasch-based indices that were not explored in this study, including additional indices of model-data fit and summaries of residuals for both models. Further research should also consider the alignment between rater invariance and rater accuracy using conceptualizations of rater accuracy besides the dichotomous scale employed in this study, along with accuracy measures computed from groups of expert raters rather than a single expert. In addition, research should explore the utility of these quantitative indices of rater invariance and rater accuracy as tools for quality control during rater training and operational scoring procedures.

A possible limitation to this study is the finding of muted rating patterns (defined in this context as limited category usage within a rater) for the expert rater and among the operational raters, indicated by low values of Outfit MSE statistics. However, these muted rating patterns are not surprising, given that this is a study of raters grounded in an ongoing statewide assessment of writing. Because students must earn a score of at least ‘3’ in order to “pass” the MGWA, it is likely that rater training focuses on distinguishing between achievement levels at the cut point between scores of ‘2’ and ‘3’. Further, the muted ratings among the operational raters may be due to the role of the expert rater in training the group of operational raters who were examined in this study. In spite of the statistical concerns generated by the muted and uniform rating patterns, we believe that these analyses reflect an authentic examination of the reliability, validity, and fairness of the scores obtained from this operational assessment system. This finding also highlights the value of this study as an example of an application of Rasch-based rating quality indicators with data from an operational, high-stakes rater-mediated assessment. Thus, the muted rating patterns and potential halo effects are not seen as a limitation. Rather, these findings highlight the usefulness of the rating quality indicators illustrated in this study as methods for describing the quality of ratings within analytic rubric domains that can inform the interpretation and use of scores in operational settings. Future studies should consider the relationship between rating quality indicators on assessments with different implications. Further, analyses using these rating quality indicators with simulated data may bring to light additional aspects of the relationship between indicators of rater invariance and rater accuracy.

As the number of constructed-response items increases in large-scale assessments worldwide, critical evaluation of the quality of ratings is becoming increasingly important. Rater-mediated assessments offer many advantages over assessments based on multiple-choice items, but these advantages may be illusory if the quality of the ratings yields low levels of reliability and validity for the proposed uses of the rater-mediated assessments. The quality control indices proposed in this study offer promising areas for further research to provide warrants for reliable, valid, and fair use of rater-mediated scores.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.asw.2013.09.002.

References

Andrich, D. A. (1982). An index of person separation in latent trait theory, the traditional KR.20 indices and the Guttman scale response pattern. Education Research and Perspectives, 9, 95–104.
Barkaoui, K. (2011). Think-aloud protocols in research on essay rating: An empirical study of their veridicality and reactivity. Language Testing, 28(1), 51–75.
Clauser, B. E. (2000). Recurrent issues and recent advances in scoring performance assessments. Applied Psychological Measurement, 24(4), 310–324.
Cooper, A., Levin, B., & Campbell, C. (2009). The growing (but still limited) importance of evidence in education policy and practice. Journal of Educational Change, 10, 159–171.
DeCarlo, L. T. (2005). A model of rater behavior in essay grading based on signal detection theory. Journal of Educational Measurement, 42(1), 53–76.
Eckes, T. (2009). Many-facet Rasch measurement. In S. Takala (Ed.), Reference supplement to the manual for relating language examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment (Section H). Strasbourg, France: Council of Europe/Language Policy Division.
Eckes, T. (2011). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments. Frankfurt am Main: Peter Lang.
Edgeworth, F. Y. (1890). The element of chance in competitive examinations. Journal of the Royal Statistical Society, 53(3), 460–475.

Educational Testing Service. (2010). TOEFL iBT test scores. Retrieved from: http://ets.org/toefl/ibt/scores/


Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to enhance rater training: Does it work? Language Assessment Quarterly, 2(3), 175–196.

Elliot, N. (2005). On a scale: A social history of writing assessment in America. New York: Peter Lang.
Engelhard, G., Jr. (2002). Monitoring raters in performance assessments. In G. Tindal, & T. Haladyna (Eds.), Large-scale assessment programs for all students: Development, implementation, and analysis (pp. 261–287). Mahwah, NJ: Erlbaum.
Engelhard, G., Jr. (2013). Invariant measurement: Using Rasch models in the social, behavioral and health sciences. New York: Routledge.
Georgia Department of Education. (2010). Georgia grade 8 writing assessment interpretive guide. Atlanta, GA: Author.
Guilford, J. P. (1936). Psychometric methods. New York: McGraw Hill.
Hamp-Lyons, L. (2011). Writing assessment: Shifting issues, new tools, enduring questions. Assessing Writing, 16(1), 3–5.
Huff, K., Steinberg, L., & Matts, T. (2010). The promises and challenges of implementing evidence-centered design in large-scale assessment. Applied Measurement in Education, 23(4), 310–324.
Johnson, R. L., Penny, J. A., & Gordon, B. (2009). Assessing performance: Designing, scoring, and validating performance tasks. New York: The Guilford Press.
Karabatsos, G. (2000). A critique of Rasch residual fit statistics. Journal of Applied Measurement, 1(2), 152–176.
Knoch, U. (2011). Investigating the effectiveness of individualized feedback to rating behavior: A longitudinal study. Language Testing, 28(2), 179–200.
Landy, F. J., & Farr, J. L. (1980). Performance rating. Psychological Bulletin, 87(1), 72–107.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.
Linacre, J. M. (2007). Facets: Rasch measurement computer program. Chicago: Winsteps.com.
Linacre, J. M. (2010). Facets Rasch measurement computer program, version 3.67.1. Chicago: Winsteps.com.
Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19(3), 246–276.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54–71.
Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement. Washington, DC: American Council on Education.
Mislevy, R. J., Steinberg, L. S., Breyer, F. J., Almond, R. G., & Johnson, L. (2002). Making sense of data from complex assessments. Applied Measurement in Education, 15(4), 363–389.
Mullis, I. V. S., Martin, M. O., Foy, P., & Arora, A. (2012). TIMSS 2011 international results in mathematics. Chestnut Hill, MA, USA: TIMSS & PIRLS International Study Center, Lynch School of Education; Amsterdam, The Netherlands: International Association for the Evaluation of Educational Achievement (IEA).
Mullis, I. V. S., Martin, M. O., Foy, P., & Drucker, K. T. (2012). TIMSS 2011 international results in reading. Chestnut Hill, MA, USA: TIMSS & PIRLS International Study Center, Lynch School of Education; Amsterdam, The Netherlands: International Association for the Evaluation of Educational Achievement (IEA).
Murphy, K. R., & Cleveland, J. N. (1991). Performance appraisal: An organizational perspective. Boston: Allyn & Bacon.
Myford, C. M. (2012). Rater cognition research: Some possible directions for the future. Educational Measurement: Issues and Practice, 31(3), 48–49.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 371–398.
Pula, J. J., & Huot, B. A. (1993). A model of background influences on holistic raters. In M. M. Williamson, & B. A. Huot (Eds.), Validating holistic scoring for writing assessment: Theoretical and empirical foundations (3rd edition, pp. 237–265). Cresskill, NJ: Hampton Press.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research (expanded ed., Chicago: University of Chicago Press).
Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88(2), 413–428.
Smith, R. M., Schumacker, R. E., & Bush, J. J. (2000). Examining replication effects in Rasch fit statistics. In M. Wilson, & G. Engelhard, Jr. (Eds.), Objective measurement: Theory into practice (Vol. 5, pp. 303–317). Stamford, CT: Ablex Publishing Corp.
Sulsky, L. M., & Balzer, W. K. (1988). Meaning and measurement of performance rating accuracy: Some methodological and theoretical concerns. Journal of Applied Psychology, 73(3), 497–506.
U.S. Department of Education. (2010). Race to the top assessment program executive summary. Washington, DC: Author.
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263–287.
Wind, S. A., & Engelhard, G., Jr. (2012). Examining rating quality in writing assessment: Rater agreement, error, and accuracy. Journal of Applied Measurement, 13(4), 321–335.
Wolfe, E. W. (2009). Item and rater analysis of constructed response items via the multi-faceted Rasch model. Journal of Applied Measurement, 10(3), 335–347.
Wolfe, E. W. (2013). A bootstrap approach to evaluating person and item fit to the Rasch model. Journal of Applied Measurement, 14(1), 1–9.
Wolfe, E. W., & McVay, A. (2012). Application of latent trait models to identifying substantively interesting raters. Educational Measurement: Issues and Practice, 31(3), 31–37.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA Press.
Wright, B. D., & Stone, M. (1979). Best test design: Rasch measurement. Chicago, IL: MESA Press.

Stefanie A. Wind is a doctoral candidate in educational measurement in the Division of Educational Studies at Emory University in Atlanta, GA, USA.

George Engelhard, Jr. is a professor of educational measurement and policy at The University of Georgia in Athens, GA, USA.