Ill-structured measurement designs in organizational research: Implications for estimating...

23
Ill-Structured Measurement Designs in Organizational Research: Implications for Estimating Interrater Reliability Dan J. Putka Human Resources Research Organization Huy Le University of Central Florida Rodney A. McCloy and Tirso Diaz Human Resources Research Organization Organizational research and practice involving ratings are rife with what the authors term ill-structured measurement designs (ISMDs)— designs in which raters and ratees are neither fully crossed nor nested. This article explores the implications of ISMDs for estimating interrater reliability. The authors first provide a mock example that illustrates potential problems that ISMDs create for common reliability estimators (e.g., Pearson correlations, intraclass correlations). Next, the authors propose an alternative reliability estimator—G(q,k)—that resolves problems with traditional estimators and is equally appro- priate for crossed, nested, and ill-structured designs. By using Monte Carlo simulation, the authors evaluate the accuracy of traditional reliability estimators compared with that of G(q,k) for ratings arising from ISMDs. Regardless of condition, G(q,k) yielded estimates as precise or more precise than those of traditional estimators. The advantage of G(q,k) over the traditional estimators became more pronounced with increases in the (a) overlap between the sets of raters that rated each ratee and (b) ratio of rater main effect variance to true score variance. Discussion focuses on implications of this work for organizational research and practice. Keywords: measurement design, ratings, reliability Supplemental materials:http://dx.doi.org/10.1037/0021-9010.93.5.959.supp The use of ratings is ubiquitous in the social and organizational sciences (e.g., Hoyt & Kerns, 1999; Saal, Downey, & Lahey, 1980). In organizational research and practice, analyses of ratings commonly arise in several domains, including studies of job per- formance and multisource feedback (e.g., Greguras & Robie, 1998; Rothstein, 1990; Scullen, Mount, & Goff, 2000; Viswesva- ran, Schmidt, & Ones, 2005), employment interviews (e.g., Con- way, Jako, & Goodman, 1995; Latham, Saari, Pursell, & Campion, 1980; Pulakos & Schmitt, 1995), and assessment centers (e.g., Harris, Becker, & Smith, 1993; Lance, Foster, Gentry, & Thore- son, 2004; Sackett & Dreher, 1982). Analyses of ratings also commonly arise in the context of concurrent and predictive vali- dation studies (e.g., Campbell, 1990; Nathan & Tippins, 1990). Of primary concern in this article are analyses conducted for purposes of estimating interrater reliability. Reliability has traditionally been defined as the proportion of observed score variance that reflects true score variance (Guilford, 1954) or the squared correlation between true score and observed score (Lord & Novick, 1968). Reliability is a critical psychometric property on which the quality of ratings (and scores in general) is judged (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999; Society for Industrial and Organizational Psychology, 2003). Besides being used to judge the psychometric quality of measures (e.g., tests, ratings), reliability estimates also play central roles in the correction of observed validity coefficients for attenuation due to measurement error (Schmidt & Hunter, 1996) and the calculation of standard errors of measurement, which are critical for evaluating the precision of individual scores (Feldt & Brennan, 1989). Thus, obtaining accurate estimates for the reliability of ratings is a fundamental concern of many domains of organizational research and practice. The impetus for the current article stems from the juxtaposition of three observations regarding (a) ratings and (b) psychometric theory and practice. First, there is an implicit link between the structure of measurement designs that give rise to scores and the structure of coefficients that estimate the reliability of those scores (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; McGraw & Wong, 1996). Second, measurement designs underlying ratings gathered in organizational settings often deviate substantially from the fully crossed and nested measurement designs discussed in psychometrics texts (e.g., Crocker & Algina, 1986; Cronbach et al., 1972; McDonald, 1999; Nunnally & Bernstein, 1994) and commonly cited articles regarding reliability estimation (e.g., Feldt Dan J. Putka and Tirso Diaz, Human Resources Research Organization, Alexandria, Virginia; Huy Le, Department of Psychology, University of Central Florida; Rodney A. McCloy, Human Resources Research Organi- zation, Louisville, Kentucky. Support for this study was provided by the Human Resources Research Organization’s Internal Research and Development Program. Correspondence concerning this article should be addressed to Dan J. Putka, Human Resources Research Organization, 66 Canal Center Plaza, Suite 400, Alexandria, VA 22314-1591. E-mail: [email protected] Journal of Applied Psychology Copyright 2008 by the American Psychological Association 2008, Vol. 93, No. 5, 959 –981 0021-9010/08/$12.00 DOI: 10.1037/0021-9010.93.5.959 959

Transcript of Ill-structured measurement designs in organizational research: Implications for estimating...

Ill-Structured Measurement Designs in Organizational Research:Implications for Estimating Interrater Reliability

Dan J. PutkaHuman Resources Research Organization

Huy LeUniversity of Central Florida

Rodney A. McCloy and Tirso DiazHuman Resources Research Organization

Organizational research and practice involving ratings are rife with what the authors term ill-structuredmeasurement designs (ISMDs)—designs in which raters and ratees are neither fully crossed nor nested.This article explores the implications of ISMDs for estimating interrater reliability. The authors firstprovide a mock example that illustrates potential problems that ISMDs create for common reliabilityestimators (e.g., Pearson correlations, intraclass correlations). Next, the authors propose an alternativereliability estimator—G(q,k)—that resolves problems with traditional estimators and is equally appro-priate for crossed, nested, and ill-structured designs. By using Monte Carlo simulation, the authorsevaluate the accuracy of traditional reliability estimators compared with that of G(q,k) for ratings arisingfrom ISMDs. Regardless of condition, G(q,k) yielded estimates as precise or more precise than those oftraditional estimators. The advantage of G(q,k) over the traditional estimators became more pronouncedwith increases in the (a) overlap between the sets of raters that rated each ratee and (b) ratio of rater maineffect variance to true score variance. Discussion focuses on implications of this work for organizationalresearch and practice.

Keywords: measurement design, ratings, reliability

Supplemental materials:http://dx.doi.org/10.1037/0021-9010.93.5.959.supp

The use of ratings is ubiquitous in the social and organizationalsciences (e.g., Hoyt & Kerns, 1999; Saal, Downey, & Lahey,1980). In organizational research and practice, analyses of ratingscommonly arise in several domains, including studies of job per-formance and multisource feedback (e.g., Greguras & Robie,1998; Rothstein, 1990; Scullen, Mount, & Goff, 2000; Viswesva-ran, Schmidt, & Ones, 2005), employment interviews (e.g., Con-way, Jako, & Goodman, 1995; Latham, Saari, Pursell, & Campion,1980; Pulakos & Schmitt, 1995), and assessment centers (e.g.,Harris, Becker, & Smith, 1993; Lance, Foster, Gentry, & Thore-son, 2004; Sackett & Dreher, 1982). Analyses of ratings alsocommonly arise in the context of concurrent and predictive vali-dation studies (e.g., Campbell, 1990; Nathan & Tippins, 1990). Ofprimary concern in this article are analyses conducted for purposesof estimating interrater reliability.

Reliability has traditionally been defined as the proportion ofobserved score variance that reflects true score variance (Guilford,

1954) or the squared correlation between true score and observedscore (Lord & Novick, 1968). Reliability is a critical psychometricproperty on which the quality of ratings (and scores in general) isjudged (American Educational Research Association, AmericanPsychological Association, & National Council on Measurementin Education, 1999; Society for Industrial and OrganizationalPsychology, 2003). Besides being used to judge the psychometricquality of measures (e.g., tests, ratings), reliability estimates alsoplay central roles in the correction of observed validity coefficientsfor attenuation due to measurement error (Schmidt & Hunter,1996) and the calculation of standard errors of measurement,which are critical for evaluating the precision of individual scores(Feldt & Brennan, 1989). Thus, obtaining accurate estimates forthe reliability of ratings is a fundamental concern of many domainsof organizational research and practice.

The impetus for the current article stems from the juxtapositionof three observations regarding (a) ratings and (b) psychometrictheory and practice. First, there is an implicit link between thestructure of measurement designs that give rise to scores and thestructure of coefficients that estimate the reliability of those scores(Cronbach, Gleser, Nanda, & Rajaratnam, 1972; McGraw &Wong, 1996). Second, measurement designs underlying ratingsgathered in organizational settings often deviate substantially fromthe fully crossed and nested measurement designs discussed inpsychometrics texts (e.g., Crocker & Algina, 1986; Cronbach etal., 1972; McDonald, 1999; Nunnally & Bernstein, 1994) andcommonly cited articles regarding reliability estimation (e.g., Feldt

Dan J. Putka and Tirso Diaz, Human Resources Research Organization,Alexandria, Virginia; Huy Le, Department of Psychology, University ofCentral Florida; Rodney A. McCloy, Human Resources Research Organi-zation, Louisville, Kentucky.

Support for this study was provided by the Human Resources ResearchOrganization’s Internal Research and Development Program.

Correspondence concerning this article should be addressed to Dan J.Putka, Human Resources Research Organization, 66 Canal Center Plaza,Suite 400, Alexandria, VA 22314-1591. E-mail: [email protected]

Journal of Applied Psychology Copyright 2008 by the American Psychological Association2008, Vol. 93, No. 5, 959–981 0021-9010/08/$12.00 DOI: 10.1037/0021-9010.93.5.959

959

& Brennan, 1989; McGraw & Wong, 1996; Shrout & Fleiss,1979). Third, methods for estimating the reliability of scoresarising from fully crossed or nested designs are commonly used byorganizational researchers to estimate the reliability of ratings evenwhen the measurement designs that give rise to those ratings areneither fully crossed nor nested (Conway et al., 1995; Viswesva-ran, Ones, & Schmidt, 1996).

In this article, we examine the implications of applying tradi-tional approaches to interrater reliability estimation to ratings thatarise from ill-structured measurement designs (ISMDs)—designsin which ratees and raters are neither fully crossed nor nested. Webegin with a brief primer on measurement designs to clarifyterminology and discuss factors that give rise to ISMDs in orga-nizational research and practice. Next, by using a mock example,we examine problems that arise when traditional approaches toreliability estimation are applied to ratings arising from ISMDs.Illustration of these problems frames discussion of an alternativeapproach to estimating the reliability of ratings that is appropriatefor ISMDs and grounded in the generalizability theory (G-theory)literature. Finally, we present results from a Monte Carlo simula-tion designed to evaluate the performance of traditional ap-proaches for estimating interrater reliability, as well as the alter-native approach we offer.

A Primer on Measurement Designs Underlying Ratings inOrganizational Research

A measurement design refers to the orientation of objects ortargets of measurement (e.g., ratees) to the facets (or conditions) ofmeasurement (e.g., raters, items, tasks, occasions) that underlie anassessment (Shavelson & Webb, 1991). To simplify discussion inthis article, we will focus only on the orientation of targets ofmeasurement to one potential facet of measurement, raters. If eachrater (r) in a study rates all of the targets of measurement (t),targets are fully crossed with raters. If each rater rates a unique,non-overlapping set of targets, targets are nested within raters.Alternatively, if each target is rated by a unique, non-overlappingset of raters, raters are nested within targets. In the terminology ofG-theory, these designs can be labeled t � r, t:r, and r:t, respec-tively (Brennan, 2001).

Several researchers have recognized that when ratings gatheredin organizational settings allow for local estimation of interraterreliability, the orientation of ratees to raters is often neither fullycrossed nor nested (McCloy & Putka, 2004; Schmidt, Viswesva-ran, & Ones, 2000; Viswesvaran et al., 2005).1 For example, onillustrating an example of an ISMD in the context of job perfor-mance ratings, Viswesvaran et al. (2005) stated, “Almost all theinterrater reliability estimates we have found in the literature are[based on measurement designs] of this nature” (p. 111).

Despite such anecdotes regarding the prevalence of ISMDs,organizational researchers have found it difficult to quantify theuse of ISMDs in research and practice. As Scullen et al. (2000)noted,

We examined a representative sample of studies included in theViswesvaran et al. (1996) meta-analysis to determine which type ofdesign had been used. In many cases, the rating procedures were notdescribed precisely enough by the authors of the primary studies. (p.968)

The state of the literature begs the question of whether ISMDs areindeed pervasive—not just in the domain of performance ratingsbut for other types of ratings as well. Although difficult to quan-tify, the pervasiveness of ISMDs can be logically affirmed byexamining the logistical and practical constraints under whichratings are typically gathered in organizations. As we note below,the specific types of constraints that give rise to ISMDs tend tocovary with the type of ratings being collected.

When gathering job performance ratings, researchers often seekraters who have had the greatest opportunity to observe a givenratee perform. When researchers are fortunate enough to identifytwo or more raters for each ratee, such raters are typically not thesame for each ratee (indicative of a fully crossed t � r design), norare they completely different for each ratee (indicative of a nestedr:t design). Characteristics of the specific organization in whichthe data are being collected (e.g., participating supervisors’ spansof control, size of workgroups, level of autonomy with whichratees perform their work) affect the availability of potential ratersfor each ratee and therefore the nature of the measurement designthat gives rise to performance ratings.

In the case of assessment center ratings and interview ratingsinvolving at least two raters (i.e., assessors, interviewers) per ratee,logistical considerations often dictate the measurement design. Forexample, in the interest of efficiently processing a large number ofgeographically dispersed ratees, an employer or consulting firmmight use teams of assessors/interviewers. These teams might notbe unique to each ratee and might shift in their composition overthe course of the assessment center or interview program due torater availability, conflicts of interest (e.g., prior familiarity withthe ratee), or the desire to prevent any team of raters from devel-oping a unique shared policy (e.g., Campion, Campion, & Hudson,1994; Harris et al., 1993; Pulakos & Schmitt, 1995; Tsacoumis,2007; Tziner & Dolan, 1982). As such, the assessors/interviewersproviding ratings may vary in their degree of overlap across ratees,resulting in a measurement design that is neither fully crossed nornested.

Despite the clear existence of ISMDs in organizational researchand practice, the literature has been plagued by a lack of clarity indescribing such designs. For example, Viswesvaran et al. (2005)labeled the ISMD they described as “nested,” even though it isclear from the example they provided that the sets of raters whorated each ratee were not unique (p. 111). Unfortunately, misla-beling these designs is not just a matter of semantic inaccuracy. Aswe discuss below, the structure of measurement designs underly-ing ratings has implications for how reliability is estimated.

ISMDs and Reliability Estimation

The literature on intraclass correlations (ICCs) and G-theoryclearly reveals that the structure of coefficients used to estimatereliability is closely linked to the measurement design that pro-duced the scores. For example, McGraw and Wong (1996) andShrout and Fleiss (1979) clearly differentiated between ICCs thatare appropriate when raters are nested within ratees and ICCs that

1 The primary characteristic of measurement designs that allow for localestimation of interrater reliability is that there are two or more raters perratee for a sufficient number of ratees in the study sample.

960 PUTKA, LE, MCCLOY, AND DIAZ

are appropriate when ratees are fully crossed with raters. Brennan(2001) and others in the G-theory literature (e.g., Shavelson &Webb, 1991) provided analogous generalizability (G) coefficientsfor nested and crossed designs, respectively. This begs the follow-ing question: If measurement design has implications for thestructure of coefficients used to estimate reliability, what is thestructure of coefficients that are appropriate for estimating thereliability of ratings arising from ISMDs? For the moment, weleave that question open and offer a mock example to illustrate theproblems ISMDs can create for common approaches to interraterreliability estimation. Illustrating these problems will frame dis-cussion of our answer to the question we pose above and clarifywhy distinguishing ISMDs from fully crossed and nested designsis important.

ISMDs: A Mock Example

Imagine that a team of researchers is conducting a small pilotstudy to assess the psychometric properties of a behaviorallyanchored rating scale (BARS) assessing contextual performance.The researchers gather performance ratings from two raters foreach incumbent participating in the study. The two raters whoprovide ratings for each incumbent are those who have the mostopportunity to observe each incumbent perform the job. Thoughsome incumbents have a common rater, others do not (i.e., thedesign is neither fully crossed nor nested). For simplicity, say thatonly 10 incumbents were sampled and 13 raters provided ratings.The complete data set gathered in the pilot study is presented inTable 1. To facilitate the analyses, a research team member (Re-searcher A) randomly assigns the rating from one rater for eachincumbent to a column labeled Rater 1 in their data analysis file,and the rating from the other rater for each incumbent to a columnlabeled Rater 2 (e.g., Mount, Judge, Scullen, Sytsma, & Hezlett,1998; Van Iddekinge, Raymark, Eidson, & Attenweiler, 2004).The researchers then estimate single-rater reliability and k-raterreliability (i.e., the reliability of the average or summed ratingacross k raters) based on the resulting file.

The researchers are aware of several common approaches toestimating the interrater reliability of ratings. One approach wouldbe to calculate the Pearson correlation (r) between the two col-

umns of ratings to arrive at an estimate for single-rater reliability,and then to “step up” the correlation by using the Spearman–Brown prophecy formula to arrive at a reliability estimate for theaverage rating based on the two raters (Rothstein, 1990; Schmidt& Hunter, 1996; Viswesvaran et al., 1996). Another approachwould be to calculate an intraclass correlation. In this case, theresearchers might rationalize that ICC(1,k) is the most appropriatechoice, given that each incumbent is rated by a different set ofraters (LeBreton, Burgess, Kaiser, Atchley, & James, 2003; Shrout& Fleiss, 1979).2

Although consistent with current practice, three problems arisefrom applying these methods to ratings from ISMDs. A problemwith using the Pearson r approach regards the potential for obtain-ing different reliability estimates for the same underlying data set.The other two problems are related, in that both regard the poten-tial for underestimating reliability—one results from failure toexplicitly model rater main effects, and the other results fromfailure to properly scale the contribution of rater main effects toobserved score variance. As a result of these issues, both Pearsonr and ICC(1,k) have the potential to underestimate reliability whencalculated on ratings arising from ISMDs. We elaborate on each ofthese problems in the sections that follow.

Problem 1: Different estimates can arise from the same under-lying data set. Continuing with the example above, say thatanother researcher on the project team (Researcher B) indepen-dently performed the random assignment of ratings data to Rater 1and Rater 2 columns for each ratee. The original data file that theteam constructed and the file that Researcher B constructed areboth presented in Table 2. Note that the two files share manycharacteristics: The samples of ratees are identical, the observedmean scores for ratees (averaging across raters) are identical, andthe pairs of raters that rated each ratee are identical. Despite theaforementioned similarities, a key difference between these files isthe correlation between the Rater 1 and Rater 2 columns.

If the researchers use the original analysis file, their estimate ofsingle-rater reliability based on a Pearson correlation betweenRaters 1 and 2 will be .33, and the reliability of the mean scoreacross two raters will be .50 (based on Spearman–Brown). If theresearchers use the file created by Researcher B, their estimate ofsingle-rater reliability will be .53, and the reliability of the meanscore across two raters will be .69. If the researchers use Re-searcher B’s file, they will likely interpret their findings as beingconsistent with past meta-analytic work (e.g., Viswesvaran et al.,1996). In contrast, use of Researcher A’s file would likely leadthem to interpret their estimates as lower than those found inprevious research, thereby prompting the research team to look forsubstantive reasons why their reliability was lower than the norm,or potentially to attribute their results to sampling error. All thiswould occur despite the samples of ratees and raters being iden-tical across files. The differences in reliability estimates arisecompletely as an artifact of how the data files were constructed.

2 Just because different sets of raters rate each ratee, it does not meanthat individual raters are nested within ratees. Rather, it implies that ratersets are nested within ratees. As we note later, this subtlety, rarely pointedout in descriptions of nested designs in organizational research, can haveimportant implications for interrater reliability estimation.

Table 1Mock Data Set

Rater ID

RateeID A B C D E F G H I J K L M

01 5 702 4 203 3 204 5 405 1 306 4 407 7 508 6 409 5 310 5 2

Note. Cells contain the rating assigned by each rater for the given ratee.Empty cells indicate that the rater did not rate the given ratee.

961ILL-STRUCTURED MEASUREMENT DESIGNS

The estimate the researchers actually report when writing up theirresults is a matter of chance.

Although this example provides a clear illustration of the po-tential for different reliability estimates with Pearson r based onthe same underlying data set, we acknowledge that it is a contrivedexample based on only 10 ratees. The degree of dissimilarity inestimates above may simply reflect the sensitivity of Pearson r tosmall changes in the values being correlated when sample sizes aresmall. The simulation we present later in this article explores theextent of such “within-sample” variability in Pearson r-basedreliability estimates when they are applied to ISMDs with rateesamples of more realistic size (e.g., N � 50, 150, 300).

Problem 2: Failing to model rater main effects leads to biasedreliability estimates. Although the potential for obtaining differ-ent estimates based on the same underlying data set is disconcert-ing, a more fundamental problem regards the possibility that usingPearson r or ICC(1,k) to estimate the reliability of ratings arisingfrom ISMDs could result in substantially biased estimates ofinterrater reliability because neither approach explicitly models thecontribution of rater main effects to observed score variance. Tounderstand the source of this bias, it is necessary to examine thescore model often associated with ratings.

Psychometricians have long acknowledged that a score modelunderlies the calculation of any reliability coefficient (Cronbach etal., 1972; Feldt & Brennan, 1989). In the case of ratings, the mostcommonly cited score model is

yij � ti � rj � trij � eij, (1)

where yij is the observed rating for the ith ratee provided by the jthrater; ti indexes the ratee main effect for the ith ratee (analogous tothe ratee’s true score); rj indexes the rater main effect (e.g., a

rater’s leniency/severity) for the jth rater; trij indexes the interac-tion between the ith ratee and jth rater (e.g., rater idiosyncrasiesspecific to the given ratee), and eij is a residual term (Hoyt, 2000;McGraw & Wong, 1996; Murphy & DeShon, 2000; Schmidt et al.,2000). These score effects are typically assumed to be independentand normally distributed with a mean of 0 and a variance �x

2, wherex identifies the effect of interest (e.g., ratee main effect, rater maineffect, Ratee � Rater interaction effect).3

Which of these score components has the potential to contributeto observed variance in scores across ratees is a function ofmeasurement design (Cronbach et al., 1972). In the case of fullycrossed designs, the variance components associated with rateemain effects (�T

2), Ratee � Rater interaction effects (�TR2 ), and

residual error (�e2) contribute to observed score variance. In nested

designs (r:t), these variance components along with variance as-sociated with rater main effects (�R

2) contribute to observed scorevariance. When designs are ill-structured, organizational research-ers have correctly recognized that rater main effects have thepotential to contribute to observed score variance because allratees are not rated by the same raters (Schmidt et al., 2000;Viswesvaran et al., 2005). What is not clear from this research,however, is that when there is at least some overlap between thesets of raters that rate each ratee (i.e., one or more ratees share oneor more raters in common) and researchers fail to account for thatoverlap when estimating interrater reliability, it can lead to down-wardly biased estimates of true score variance (�T

2).Neither the calculation of Pearson r nor the calculation of

ICC(1,k) involves explicitly modeling the contribution of ratermain effects to observed score variance. Failing to model theseeffects leads to the violation of an assumption underlying theratings score model in Equation 1—independence of residualerrors across ratees. If one does not explicitly model rater effects(rj) in ISMDs, they are essentially treated as residual effects (eij)for purposes of decomposing variance in ratings. Because at leastsome ratees in ISMDs share one or more raters in common, theresiduals associated with those ratees’ scores are not independent.The implications of this type of residual non-independence forinterrater reliability estimates can be derived from past research onthe implications of non-independence for variance estimates inanalysis of variance (ANOVA) models (Kenny & Judd, 1986).

Kenny and Judd (1986) showed that if residuals are non-independent across treatment levels (in this case, ratees), thevariance attributable to treatment effects (in this case, true scorevariance) is biased downward. Such non-independence has littleeffect on estimates for residual variance, however. Kenny and Juddalso noted that the extent of downward bias in variance attributableto treatments is a function of the (a) proportion of observedvariance attributable to the source of non-independence in theresiduals (in this case, the proportion of observed variance asso-ciated with rater main effects) and (b) number of “linked” obser-

3 Murphy and DeShon (2000) have raised questions regarding the ade-quacy of this score model for performance ratings made for administrativepurposes. Our focus, however, is on ratings gathered in organizationalcontexts in general.

Table 2Mock Data Analysis Files Based on Random Assignment ofRaters Within Ratees

RateeID

Rater 1ID

Rater 2ID

Rater 1rating

Rater 2rating Mean

Data analysis file based on random assignment by Researcher A01 H G 7 5 6.002 E I 4 2 3.003 C D 3 2 2.504 A F 5 4 4.505 M B 3 1 2.006 F J 4 4 4.007 L B 5 7 6.008 M H 4 6 5.009 F E 3 5 4.010 K H 2 5 3.5

Data analysis file based on random assignment by Researcher B01 G H 5 7 6.002 E I 4 2 3.003 C D 3 2 2.504 F A 4 5 4.505 M B 3 1 2.006 J F 4 4 4.007 B L 7 5 6.008 H M 6 4 5.009 E F 5 3 4.010 H K 5 2 3.5

962 PUTKA, LE, MCCLOY, AND DIAZ

vations (i.e., observed scores provided by a common rater).4 Interms of reliability estimation, the net effect of underestimatingtrue score variance while accurately estimating residual variance isa downwardly biased estimate of reliability. The simulation wepresent later in this article explores the extent of such downwardbias in Pearson r and ICC(1,k) under a variety of ISMD conditions.

Problem 3: Failing to model rater main effects prevents properscaling of rater variance. Failing to model the contribution ofrater main effects to observed score variance also creates anotherpotential problem for the common approaches to reliability esti-mation when they are applied to ISMDs: It prevents researchersfrom properly scaling the contribution of rater main effects toobserved score variance. Failure to properly scale the contributionof rater main effects to observed score variance also leads todownward bias in Pearson r and ICC(1,k)-based interrater reliabil-ity estimates for ratings arising from ISMDs, thus serving tocompound the problem of non-independent residual errors de-scribed above.

In fully crossed and nested designs, the potential contribution ofrater main effects, Ratee � Rater interactions, and residual error toobserved score variance is a function of the magnitude of theirassociated variance components and the number of raters per ratee,“k” (McGraw & Wong, 1996; Shavelson & Webb, 1991). InISMDs, the potential contribution of rater main effects to observedvariance is a function of the magnitude of the rater main effectvariance component and the number of raters that each pair ofratees shares in common (Brennan, 2001).

In his discussion of crossed measurement designs with missingdata, Brennan (2001, p. 236) provided a general formula for variancein observed scores arising from measurement designs in which targetsof measurement (e.g., ratees) are “crossed” with a facet of measure-ment (e.g., raters, items), but not all conditions of that measurementfacet are associated with all targets (e.g., ratings are missing for somepairings of raters and ratees). Such a design is analogous to the notionof an ISMD. Based on Brennan’s (2001) formulation, the expectedvalue of the observed variance in ratings (�Y

2) that have been averagedacross k raters per ratee can be expressed as

�Y2 � �T

2 � q�R2 �

�TR,e2

k̂(2)

In this equation, k̂ is the harmonic mean number of raters per ratee,and q is a multiplier that scales the contribution of varianceattributable to rater main effects (�R

2).5 Conceptually, the purposeof the q multiplier is to scale the contribution of �R

2 to observedscore variance based on the amount of overlap between the sets ofraters who rate each ratee. Mathematically, q is a function of thenumber of raters that each pair of ratees (i, i�) share, which wedenote as ci,i� in the equation below:

q �1

k̂�

�i

�i�

ci,i�

kiki�

Nt�Nt � 1�(3)

Here, k̂ is as defined earlier; Nt is the total number of ratees in thesample; and ki and ki� are the number of raters who rated ratees iand i�, respectively (i � i�). Given the formula above, the qmultiplier effectively ranges from 0 to 1/k̂. Figure 1 provides avisual example of how the q multiplier varies as a function of

measurement design as well as a worked example of how tocalculate q. Note that as the overlap between the sets of raters thatrate each ratee increases, q approaches 0. Indeed, when there iscomplete overlap between the sets of raters that rate each ratee,ratees are fully crossed with raters and q equals zero, thus effec-tively cancelling the contribution of rater main effects to observedvariance between ratees (Schmidt et al., 2000). As the overlapbetween the sets of raters that rate each ratee decreases, q ap-proaches 1/k̂. When there is no overlap between the sets of ratersthat rate each ratee, raters are fully nested within ratees and qequals 1/k̂, thus the expected contribution of rater main effects toobserved score variance is �R

2/k̂ (McGraw & Wong, 1996). Giventhese properties of the q multiplier, readers familiar with theliterature on ICCs and G-theory may observe that Equation 2provides a more general version of the formulae typically offeredfor expected observed score variance in fully crossed and nesteddesigns (Cronbach et al., 1972; McGraw & Wong, 1996).

The definition of reliability as the proportion of observed scorevariance that reflects true score variance (Guilford, 1954; Lord &Novick, 1968) and Equations 2 and 3 jointly suggest that interraterreliability for ratings arising from ISMDs based on the average ofk ratings per ratee may be estimated by

G�q,k� ��̂T

2

�̂T2 � �q�̂R

2 ��̂TR,e

2

k̂ � (4)

where �̂T2, �̂R

2, and �̂TR2 ,e are estimated variance components for

ratee main effects (true score), rater main effects, and the combi-nation of the Ratee � Rater interaction and residual effects (whichare confounded).6 Given the properties of the q multiplier notedabove, Equation 4 reduces to the formula for ICC(1,k) provided byShrout and Fleiss (1979) and McGraw and Wong (1996) whenindividual raters are nested within ratees, and it reduces to theformula for ICC(C,k) provided by McGraw and Wong (1996)when ratees are fully crossed with raters. Thus, the coefficientprovided in Equation 4 offers researchers and practitioners ageneral formula for estimating interrater reliability that appliesregardless of whether ratings arise from a fully crossed design,nested design (r:t), or an ISMD.

Equation 4 allows us to illustrate the implications that failing toproperly scale the contribution of rater main effects to observedscore variance has for Pearson r and ICC(1,k) when they are used

4 If individual raters were nested within ratees, no ratees would share anyraters in common and thus residuals would presumably be independentacross ratees. For such a design, ICC(1,k) would be an appropriate choicefor estimating interrater reliability (McGraw & Wong, 1996; Shrout &Fleiss, 1979). It is important, therefore, not to confuse designs in whichpotentially overlapping rater sets are nested within ratees and designs inwhich individual raters are nested within ratees.

5 Use of the harmonic mean for k accounts for situations in which anunequal number of raters rate each ratee (Winer, 1971). Note also that thisformula applies even if only a single rating is available for each ratee (k � 1).

6 We use G to label this coefficient to signify its grounding in thegeneralizability theory literature. This coefficient can be viewed as ameasure of the generalizability of ratings to similarly structured measure-ment designs (in terms of q and k) formed via random sampling of ratersfrom the population of raters.

963ILL-STRUCTURED MEASUREMENT DESIGNS

to estimate the reliability of ratings arising from ISMDs. BothPearson r and ICC(1,k) will underestimate interrater reliabilitybecause neither distinguishes rater main effect variance from Ra-tee � Rater interaction and residual variance (McGraw & Wong,1996; Schmidt et al., 2000). Therefore, when estimating interraterreliability by using one of these coefficients, one is essentiallydividing variance attributable to rater main effects by k rather thanmultiplying it by q (see Equation 4). The q multiplier will neces-sarily be less than 1/k in an ISMD, which means that Pearson r andICC(1,k) will underestimate interrater reliability because they willoverestimate the contribution of rater main effects to observedscore variance. The simulation we present later in this articleexplores the extent of such downward bias in Pearson r andICC(1,k) as a result of failing to properly scale the contribution ofrater main effects under a variety of ISMD conditions.

Estimating Interrater Reliability for ISMDs by UsingG(q,k)

In theory, use of G(q,k) should yield a less biased estimate ofinterrater reliability for ratings arising from ISMDs than shoulduse of either Pearson r or ICC(1,k). A more practical questionregards how researchers and practitioners can calculate this coef-ficient for ratings that arise from ISMDs. Like other generalizabil-ity coefficients, calculating G(q,k) requires estimation of variancecomponents. First, we discuss two general strategies for estimatingsuch variance components, both of which we frame in terms ofG-theory.7 After that, we provide a worked example of how tocalculate G(q,k) by using the mock data set presented in Table 1.

The traditional G-theory approach. The first strategy for es-timating the variance components underlying G(q,k) can be de-scribed as a traditional G-theory approach, and it involves twosteps. First, researchers conduct a generalizability study (G-study)using a fully crossed Ratee � Rater design and estimate thevariance components in Equation 4 by fitting a two-way randomeffects ANOVA model to the resulting data (Shavelson & Webb,1991). Second, researchers use the aforementioned variance com-ponents to estimate the expected generalizability of ratings forseveral potential measurement designs they might use to makeoperational decisions regarding ratees (i.e., D-study designs). Spe-cifically, researchers estimate the G-coefficients (in this case,G[q,k]) associated with different measurement designs by applyingvalues of q and k associated with those designs to the G-studyvariance component estimates (Brennan, 2001). A primary benefitof a traditional G-theory approach is that it offers researchers theopportunity to better understand sources of error in their measurebefore it is implemented operationally or used for formal valida-tion research (Cronbach et al., 1972). Unfortunately, calls to cap-italize on G-theory in this manner have gone relatively unheededin the organizational research literature (DeShon, 2002).

7 In addition to the approaches offered here, a third approach would beto obtain meta-analytic estimates of the variance components in Equation4. Unfortunately, a unique meta-analytic estimate for each of these vari-ance components exists only for ratings in general (e.g., Hoyt & Kerns,1999), not ratings specific to domains of interest to organizational research-ers.

Figure 1. Example of the relation between the q multiplier and measurement design.

964 PUTKA, LE, MCCLOY, AND DIAZ

Despite its noted merits, the traditional G-theory approach hassome drawbacks that may be particularly salient for organizationalresearchers and practitioners. First, adopting a traditional G-theoryapproach may not always be feasible. For example, practitionersmay often face situations where they lack the funding and supportto conduct both a well designed G-study and their primary study ofinterest. Similarly, academicians fortunate enough to acquire andanalyze archival organizational data involving ISMDs are not wellpositioned to go back and conduct a G-study if one was notpreviously conducted. In addition to these practical issues, therealso may be issues with the external validity of G-study variancecomponent estimates, particularly if one has reason to believe thatkey contextual factors present in the primary study (or operationaldecision context) would not be adequately reflected in the G-study(Cook & Campbell, 1979; Sussmann & Robertson, 1986). Forexample, previous research has acknowledged differences betweenperformance ratings gathered for research purposes as opposed toadministrative purposes (Landy & Farr, 1980). In light of themyriad factors influencing performance ratings, inferring the mag-nitude of variance components for such ratings based on variancecomponents estimated from ratings gathered under “research only”conditions is problematic for administrative purposes (Murphy &DeShon, 2000).

The “D-study-only” approach. If it is not feasible to conducta carefully designed G-study that accounts for salient threats toexternal validity of the variance component estimates, a secondstrategy researchers may adopt can be described as a “D-study-only” approach. Specifically, researchers may forego a G-studyand attempt to estimate variance components underlying G(q,k)based on the ratings they have collected for their primary study,even though such ratings may have arisen from an ISMD. Indeed,in his recent text on G-theory, Brennan (2001) recognized this notonly as a possibility but also a common occurrence:

Often, of course, the available data, with their missing patterns, are theonly data available, and an investigator is interested in generalizabilitywith respect to sample sizes and sample size patterns in these data.Presumably, such an investigator believes that a replication of thedesign (in the D-study sense) would involve much the same pattern ofmissing data. (p. 236)

Interestingly, the extant organizational research literature ap-pears to suggest that it is not possible in ISMDs to distinguishvariance attributable to rater main effects (i.e., �R

2) from thecombination of variance attributable to Ratee � Rater interactionsand residual error (i.e., �TR,e

2 )—a necessity when calculatingG(q,k) (Schmidt et al., 2000). Although this might seem a reason-able claim, it is something that to our knowledge has not beenempirically evaluated. Based on developments in the literature onvariance component estimation in random effect models, it may bepossible to generate unique variance component estimates for ratermain effects for a wide range of ISMDs.

For purposes of variance component estimation, one might viewISMDs as crossed designs with potentially large amounts of miss-ing data (Brennan, 2001). Recent work on the estimation of vari-ance components has suggested that restricted maximum likeli-hood (REML) estimators can accurately reproduce variancecomponents in designs with missing data (Searle, Casella, &McCulloch, 1992) and are surprisingly robust to violations ofnormality assumptions typically associated with such estimators

(Marcoulides, 1990). REML variance component estimators standin contrast to the traditional ANOVA-based estimators typicallyassociated with G-theory, which become problematic in designswith missing data or unbalancing in terms of nesting (DeShon,2002). Given the lack of research regarding the feasibility of usinga D-study-only approach to calculate G(q,k), as well as the fre-quent need of researchers and practitioners to estimate interraterreliability with ratings arising from ISMDs, the section belowprovides a worked example of how to calculate G(q,k) by using themock data set presented in Table 1.

Calculating G(q,k) with ratings arising from an ISMD. Thefirst step in calculating G(q,k) is to structure one’s data in a mannersuch that each record (row) reflects a unique Ratee � Ratercombination. Each ratee in the data file should have one record(row) for each rater who rated him or her. The data file should havethree columns: one that identifies the ratee (e.g., ratee ID), asecond that identifies the rater who rated the ratee (e.g., rater ID),and a third that reflects the rating for the given Ratee � Rater pair.As an example of how the data file should be structured, we havereformatted the mock data set presented in Table 1 and included itin the online supplemental materials with this article.

With the data formatted in this manner, one can easily estimatethe three variance components underlying the calculation of G(q,k)noted in Equation 4—�T

2, �R2, and �TR,e

2 —by using standard sta-tistical software packages (e.g., SAS, SPSS, R). Specifically, thesevariance components can be estimated by fitting a simple randomeffects model that specifies ratee ID and rater ID as randomfactors, and the rating variable as the outcome. Table 3 shows thecode required for fitting this model in SAS, SPSS, and R. By usingthe reformatted version of the mock data set presented in Table 1,we fit this model to the data and obtained estimates of 1.24 forratee main effect variance (�T

2), 0.52 for rater main effect variance(�R

2), and 1.06 for the combination of Ratee � Rater interactionand residual error variance (�TR,e

2 ).Another key step in the process of estimating G(q,k) is to

calculate the q multiplier. Calculating q can be the most laborintensive part of estimating G(q,k) in that for each pair of ratees inthe sample, it requires (a) counting up the number of raters that thepair of ratees share in common and (b) dividing that quantity bythe product of the number of raters who rated each ratee in the pair(i.e., the ci,i�/kiki� part of Equation 3). Without the help of acomputer program, such a calculation would be arduous for all butthe smallest data sets (see Figure 1). We therefore developed anSAS program that quickly calculates not only q but also G(q,k) andthe variance components underlying it. This program is providedin the supplemental online material associated with this article andeasily can be applied to any ISMD where raters are the facet ofmeasurement. With our program, we obtained a value of .45 for qby using the mock data set presented in Table 1. With this value ofq and estimates for the three variance components noted above,Equation 4 can be used to calculate G(q,k) for the mock data set.Specifically,

G(q,k) ��̂T

2

�̂T2��q�̂R

2��̂TR,e

2

k̂ � �1.24

1.24 � � .45�.52� �1.06

2 � � .62

(5)

965ILL-STRUCTURED MEASUREMENT DESIGNS

This value reflects the reliability of the mean rating for eachratee—that is, the proportion of expected observed score varianceattributable to true score variance. Again, although the process forcalculating G(q,k) might appear lengthy, it is for illustration only.The program provided in the supplementary online material com-putes the coefficient and its constituent variance componentsquickly and easily.

Implications of ISMDs for Interrater ReliabilityEstimation: A Simulation

For the next phase of our study, we designed a Monte Carlosimulation to evaluate the performance of Pearson r, ICC(1,k), andthe proposed G(q,k) interrater reliability estimator under a varietyof ISMD conditions. We designed the simulation to address fourresearch questions that follow from issues raised in the introduc-tion. First, how variable are Pearson r estimates of interraterreliability within a given sample when the raters who rated eachratee are assigned randomly as “Rater 1” and “Rater 2”? The mockexample suggests such “within-sample” variability may be great,but such findings may simply reflect sensitivity in estimates due tothe small sample size used (i.e., Nt � 10). Second, how dotraditional estimators of interrater reliability such as Pearson r andICC(1,k) compare with G(q,k) in terms of bias when applied toratings arising from ISMDs? Based on the work of Kenny andJudd (1986) and Brennan (2001), we expect the traditional esti-mators to be more downwardly biased as (a) variance attributableto rater main effects increases and (b) the overlap between the setsof raters that rate each ratee becomes greater (i.e., as the totalnumber of raters participating in the simulated study [Nr] ap-proaches the number of raters per ratee [k]). Third, how do tradi-tional estimators of interrater reliability such as Pearson r andICC(1,k) compare with G(q,k) in terms of their precision (asindexed by root mean square error) when applied to ratings arisingfrom ISMDs? Fourth, under what ISMD conditions can we useREML variance component estimators to produce accurate, uniqueestimates for variance attributable to rater main effects? Presum-ably, the answer to this question would have implications for theanswer to the third question, in that a failure to estimate rater maineffects with adequate precision would lead to imprecision in thereliability estimate provided by G(q,k).

Method

To address the four research questions, our simulation involvedmanipulating three independent variables: (a) ratee sample size, (b)total number of raters participating in the simulated study, and (c)variance component profiles underlying the score model that gaverise to the simulated ratings. In all, 480 simulated conditions wereexamined, each representing a unique Ratee Sample Size (Nt) �Total Number of Participating Raters (Nr) � Variance ComponentProfile combination. For each condition, 1,000 independent sam-ples of data were simulated.

Ratee Sample Size (Nt)

We examined conditions in which 50, 150, and 300 ratees weresampled. These levels were chosen to represent relatively small,moderate, and large sample sizes in terms of primary researchstudies. For each ratee sample size (Nt), we simulated data for 165conditions that varied in terms of the total number of ratersparticipating in the simulated study and the variance componentprofile used to simulate ratings.

Total Number of Raters Participating in the SimulatedStudy (Nr)

If one conceives of each simulated sample of data as a separate“study,” then this design factor reflects the total number of raterswho participated in the given study (as opposed to k, the numberof raters per ratee). Recall that a key determinant of the q multi-plier for any given ISMD is the extent to which the sets of ratersthat rate each ratee overlap. We manipulated the degree of overlapbetween the sets of raters that rated each ratee in the ISMDs weexamined by varying the total number of raters participating in thesimulated study (Nr) yet holding the number of raters per ratee (k)constant at 2 regardless of simulation condition. The simulationprogram we designed initially generated a fully populated Nt

(Ratee Sample Size) � Nr (Total Number of Participating Raters)matrix of ratings for each of the 1,000 samples within eachcondition. To introduce “ill-structure” into the measurement de-sign underlying each sample of data, we independently and ran-domly deleted data for Nr – 2 ratings for each ratee. For example,in conditions where Nr � 3, we independently and randomlydeleted one of the three ratings for each ratee, thus producing adesign where each ratee was rated by two of three raters partici-

Table 3SAS, SPSS, and R Code for Fitting a Random Effects Model with Ratees and Raters Treated as Crossed Random Factors

SAS Code SPSS Code R Code

PROC VARCOMP method � reml; VARCOMP lmer(rating 1 � (1 ratee_ID) � (1 rater_ID)).CLASS ratee_id rater_id; rating BY ratee_ID rater_IDMODEL rating � ratee_id rater_id; /RANDOM � ratee_ID rater_ID

/METHOD � REML/DESIGN � ratee_ID rater_ID/INTERCEPT � INCLUDE.

Note. This code will estimate three REML variance components, one reflecting variance due to ratee main effects (�T2), a second reflecting variance due

to rater main effects (�R2), and a third reflecting variance due to the Ratee � Rater interaction and residual error variance (�TR,e

2 ). To fit a model in whichraters were nested within ratees (i.e., a one-factor random effects model for the r:t design underlying ICC[1,k]), one would simply omit rater_id from allinstances where it appears in the code above. REML � restricted maximum likelihood; ICC � intraclass correlation.

966 PUTKA, LE, MCCLOY, AND DIAZ

pating in the study. In contrast, in conditions where Nr � 50, weindependently and randomly deleted 48 of the 50 ratings for eachratee, thus producing a design where each ratee was rated by only2 of the 50 raters participating in the study. Thus, by varying thetotal number of raters participating in the study, yet holding thenumber of raters per ratee constant, we were able to simulateISMDs that varied in the degree of overlap between the sets ofraters that rated each ratee—therefore manipulating the magnitudeof the q multiplier associated with the simulated design.

In choosing levels for this design factor for our simulation, weidentified a core set of Nr levels that would allow us to illustratevarying degrees of overlap between the sets of raters that ratedeach ratee. For this purpose, we examined Nr levels of 3, 4, 5, 7,10, 15, 20, 30, and 50, regardless of how many ratees weresampled. We also wanted to use the simulation to examine Nr

levels that produced total number of raters-to-ratee sample sizeratios (Nr/Nt) of 1:1 and 2:1, respectively, as we have encounteredthese in practice (particularly with regard to performance ratings).As such, for conditions where ratee sample sizes were 50, we alsoexamined an Nr level of 100. Similarly, for conditions where rateesamples sizes were 150, we also examined Nr levels of 150 and300; and for conditions where ratee sample sizes were 300, we alsoexamined Nr levels of 300 and 600.

Variance Component Profiles

The data generation model underlying ratings in our simulationreflected the sum of four independent and normally distributedrandom variables corresponding to the score effects ti, rj, trij, andeij in Equation 1. We independently simulated each set of scoreeffects to have a mean of zero and a given variance in thepopulation (�T

2, �R2, � TR

2 , and �e2, respectively). Simulated ratings

in each condition were based on 1 of 15 profiles of variancecomponents underlying these score effects (see Table 4). Theprofiles we used can be grouped into three clusters: (a) thoserepresenting relatively low levels of true score variance (�T

2 � .35),(b) those representing moderate levels of true score variance (�T

2 �

.50), and (c) those representing relatively high levels of true scorevariance (�T

2 � .75). Profiles with relatively low levels of truescore variance might be viewed as characteristic of peer perfor-mance ratings and unstructured interview ratings (e.g., Conway etal., 1995; Scullen et al., 2000). Profiles with moderate levels oftrue score variance might be viewed as characteristic ofsupervisor-based performance ratings (e.g., Viswesvaran et al.,1996). Profiles with relatively high levels of true score variancemight be viewed as characteristic of assessment center ratings andstructured interview ratings (e.g., Conway et al., 1995; Tsacoumis,2007).

Given the centrality of rater main effect variance (�R2) to the

potential for bias in Pearson r and ICC(1,k) estimates of interraterreliability, we examined a wide range of rater main effect variancecomponents for each true score variance level noted above. Ex-amining a wide range of rater main effect variance components isconsistent with the general ratings literature, which suggests thatthe range of estimates for �R

2 may vary greatly depending oncharacteristics of primary studies (Hoyt & Kerns, 1999). To com-plete each variance component profile, we set variance associatedwith Ratee � Rater interaction and residual error (which areconfounded) equal to 1 – (�T

2 � �R2). This allowed each variance

component profile to sum to 1.0, thereby aiding the interpretabilityof our results.

Reliability Estimators

Four sets of reliability estimators were evaluated in the simula-tion. First, we estimated single-rater reliability and two-rater reli-ability by using Pearson r and the Spearman–Brown prophecy fortwo raters (using Pearson r as the starting point), respectively(Schmidt & Hunter, 1996). Second, we estimated single-raterreliability and two-rater reliability by using ICC(1,1) and ICC(1,2),respectively (McGraw & Wong, 1996; Shrout & Fleiss, 1979).ICCs typically applied to designs in which raters are nested withinratees. Third, we estimated single-rater reliability and two-raterreliability by using G(q,1) and G(q,2), respectively, based onEquation 4. Fourth, to examine the effects of explicitly modelingrater main effects but failing to scale them properly when estimat-ing interrater reliability, we calculated an alternative versionG(q,k) in which we set q equal to 1/k, regardless of the conditionexamined. Comparing G(q,k) with G(k) allowed us to isolate theimprovement in reliability estimation that arises from properlyscaling the contribution of rater main effects to observed scorevariance (i.e., allowing the q multiplier to be a function of thenumber of raters that ratees share in common as opposed to fixingit at 1/k for all conditions). In the section that follows, we providedetails on the data generation process, calculation of the reliabilityestimators described above, and calculation of the populationreliabilities against which these reliability estimators were evalu-ated.

Data Generation and Reliability Calculations

All data for this study were simulated with a program that wedeveloped in R (R Development Core Team, 2007). This programalso calculated all reliability coefficients of interest for this studyand aggregated results for purposes of reporting. The programsimulates data based on a linear mixed model framework (LMM;

Table 4Variance Component Profiles for Simulation

Profile �T2 �R

2 �TR2 �e

2

1 .350 .550 .050 .0502 .350 .450 .100 .1003 .350 .350 .150 .1504 .350 .250 .200 .2005 .350 .150 .250 .2506 .350 .050 .300 .3007 .500 .450 .025 .0258 .500 .350 .075 .0759 .500 .250 .125 .125

10 .500 .150 .175 .17511 .500 .050 .225 .22512 .750 .200 .025 .02513 .750 .150 .050 .05014 .750 .100 .075 .07515 .750 .050 .100 .100

Note. �T2 � ratee main effect (true score) variance component; �R

2 � ratermain effect variance component; �TR

2 � ratee � rater interaction effectvariance component; �e

2 � residual error variance component.

967ILL-STRUCTURED MEASUREMENT DESIGNS

Littell, Milliken, Stroup, & Wolfinger, 1996; Searle et al., 1992).The LMM is a more general version of the random effectsANOVA model that underlies G-theory (Cronbach et al., 1972)and multilevel random coefficient models (Bliese, 2002). Theprogram takes a vector of nine parameters as input: (a) number ofsamples per condition, (b) ratee sample size (Nt), (c) total numberof raters participating in the simulated study (Nr), (d) ratee maineffect variance (�T

2), (e) rater main effect variance (�R2), (f) Ra-

tee � Rater interaction variance (�TR2 ), (g) residual variance (�e

2),(h) covariance between ratee and Ratee � Rater interaction effectsacross ratees (e.g., �T,TR1, �T,TR2), and (i) covariance betweenRatee � Rater interaction effects across ratees (e.g., �TR1,TR2).Given the assumptions of independent score effects that underlieboth classical test theory (CTT) and G-theory perspectives onreliability (Brennan, 2001; Lord & Novick, 1968), we set the twocovariance parameters to zero for all simulations in this study. Theprogram’s structure, however, will permit future research to pursueissues regarding the potential for non-zero covariance among theseeffects that may occur in certain contexts (Murphy & DeShon,2000). The steps used to simulate data, calculate reliability esti-mates, and derive population reliabilities for each of the 480conditions we examined are described below.

Step 1. For each simulation condition, we created 1,000 inde-pendent, fully crossed Nt (Ratee Sample Size) � Nr (Total Numberof Participating Raters) matrixes of simulated ratings based on thescore model outlined in Equation 1 (i.e., yij � ti � rj � trij � eij).These score effects were independently and randomly generatedfrom normal distributions with means of 0 and variances equal to�T

2, �R2, �TR

2 , and �e2, respectively. Observed ratings (yij) were

calculated by summing the four randomly generated score effectsfor each Ratee � Rater combination. For each of the 1,000matrixes of sample data that we generated for each condition, wesaved the Nt � 1 vector of true scores (i.e., the vector of ti scoreeffects). These true scores were used later in our simulation forcalculating population reliabilities (Steps 9 and 10).

Step 2. For each ratee in the matrixes created in Step 1, weindependently and randomly deleted simulated ratings for Nr – 2raters. This resulted in a data set for each sample that was a matrixNt � Nr in magnitude, but only two of the Nr columns werepopulated with ratings for each ratee.

Step 3. To examine “within-sample” variability in Pearsonr-based estimates of interrater reliability, we constructed a set of

100 “analysis files” for each of the 1,000 samples of data generatedfor each condition. Each of these 100 analysis files consists ofexactly the same set of raters and ratees. Figure 2 provides anillustration of the nested nature of the files generated by thesimulation program. Each analysis file was constructed by takingthe matrixes of sample data resulting from Step 2 and randomlyassigning one rating for each ratee to column 1, and the otherrating for each ratee to column 2. Thus, each analysis file consistedof a fully populated Nt � 2 matrix of ratings on which Pearson rcould be calculated. We calculated Pearson r for each of the 100analysis files within each sample and subsequently used theSpearman–Brown prophecy formula to “step up” those Pearson rsto two-rater reliability estimates. We recorded the range of theseestimates across analysis files within each sample to assess within-sample variability in Pearson r-based reliability estimates.

Step 4. For each of the 1,000 samples of data we generated foreach condition, we randomly selected one of the analysis filesdescribed in Step 3 and used the Pearson r and Spearman–Brownprophecy estimates based on that file as the estimates for single-rater and two-rater reliability (respectively) for the given sample.As noted when discussing our mock example, calculating Pearsonr based on only one assignment of raters (i.e., based on only oneanalysis file) for a given sample of data appears to reflect currentpractice in organizational research (e.g., Mount et al., 1998;Scullen et al., 2000; Van Iddekinge et al., 2004).

Step 5. To facilitate calculation of the remaining reliabilityestimates, namely ICC(1,k), G(k), and G(q,k), we restructured eachsample data set resulting from Step 2 such that there was one rowfor each unique Ratee � Rater combination. This resulted in a dataset for each sample that consisted of 2Nt rows and three fullypopulated columns of data. One column identified the ratee andtook on values from 1 to Nt. Another column identified the actualrater that rated the given ratee and took on values from 1 to Nr.Lastly, the final column reflected the simulated observed ratingassociated with the given Ratee � Rater combination.

Step 6. By using the restructured data sets created in Step 5,we calculated ICC(1,1) and ICC(1,2) for each sample (McGraw &Wong, 1996; Shrout & Fleiss, 1979). To calculate these ICCs, wefirst used the lmer procedure in R’s lme4 package to fit a one-factor random effects model in which raters were treated as nestedwithin ratees. This model provided REML estimates for true scoreand residual variance components (�T

2 and �R,TR,e2 ) for each sample

Condition 1 Sample 1

Analysis File 1

Analysis File 100 Sample 2

Sample 1000 Condition 2

Condition 480

1000 independent samples nested within each condition

100 analysis files nested within each sample and created by randomly assigning one rater for each ratee to column 1 and the other rater to column 2 (used to address Research Question 1)

480 simulated conditions ereflecting a uniquevariance component profile × N

ach

t × Nr combination

Figure 2. Illustration of levels of nesting in the simulation design.

968 PUTKA, LE, MCCLOY, AND DIAZ

(Bates, 2007).8 We then used these variance components to cal-culate estimates of single-rater reliability and two-rater reliabilityfor each sample by using the formula for ICC(1,k) provided byMcGraw and Wong (1996).

Step 7. Again using the restructured data sets constructed inStep 5, we calculated G(q,1) and G(q,2) by using the formula forG(q,k) provided in Equation 4. To calculate these coefficients, wefirst used the lmer procedure to fit a crossed random effects model(t � r) that provided separate REML estimates for true score, ratermain effect, and residual variance components (i.e., �T

2, �R2, and

�TR,e2 ) for each sample. (Recall that Table 3 provides the code for

fitting this model not only in R but also in SAS and SPSS.) Next,we calculated the q multiplier associated with each simulationcondition. The manner in which we simulated ISMDs for thisstudy enabled us to derive a simplified formula for the q multiplierthat was solely a function of the number of raters per ratee (k) andthe total number of raters participating in the simulated study(Nr).

9 Specifically, when k of Nr raters are independently andrandomly assigned to rate each ratee, the expected value of q canbe shown to be

E(q) �1

k�

1

Nr(6)

.

Table 5 provides the values of q we used for calculating estimatesof single-rater reliability (G[q,1]) and two-rater reliability (G[q,2])for each sample of data as a function of the total number of ratersparticipating in the given simulated study condition. With thevariance component estimates and q multiplier generated via theprocess above, we calculated single-rater and two-rater reliabilityestimates for each sample by using Equation 4.

Step 8. By using the variance component estimates generatedfor each sample in Step 7, we calculated single-rater and two-raterreliability estimates that were similar in structure to G(q,1) andG(q,k), respectively, except that instead of multiplying the vari-ance component estimate for rater main effects (i.e., �R

2) by q, we

divided it by k. This allowed us to examine a coefficient (G[k]) thatwas based on a random effects model that explicitly accounted forrater main effects (therefore mitigating the issue of non-independence of residuals noted in the introduction) but failed toproperly scale their contribution to observed score variance.

Step 9. On calculating each reliability estimate above, wegenerated two-rater population reliabilities for each of the 480simulation conditions via the following process. These populationreliabilities were necessary for evaluating the quality of the reli-ability estimators described above. The process for calculatingpopulation reliabilities proceeded as follows: First, for each of the1,000 samples within each condition, we merged the Nt � 1 vectorof true scores created in Step 1 with the observed rating data setsresulting from Step 2. Next, we calculated the average observedrating for each ratee across their two raters. Third, we calculatedthe squared correlation between the resulting average rating andthe vector of true scores to generate an estimate for two-raterpopulation reliability based on the given sample. Finally, we tookthese sample estimates for each condition and averaged themacross the 1,000 samples within each condition to derive a two-rater population reliability for the condition.

Step 10. To derive a single-rater population reliability for eachcondition, we started with the data sets that resulted from Step 2and merged them with the Nt � 1 vector of true scores created inStep 1 (as we did at the start of Step 9). Next, we randomlyassigned the rating of one of the two raters for each ratee to asingle column and calculated the squared correlation between theresulting vector of observed ratings and the vector of true scores.For each sample, within each condition, we repeated this process100 times, took the average of the resulting squared correlationsacross replications within samples, and then averaged those resultsacross samples within condition. We used the resulting averagesquared correlations as the population single-rater reliability foreach condition. It would have been inappropriate to use theSpearman–Brown prophecy formula in this instance to “stepdown” the two-rater population reliabilities we calculated in Step9 to derive single-rater population reliabilities; this is because thecontribution of rater main effect variance to observed variance isnot simply a function of the number of raters who rated each ratee(k), but rather a function of the number of raters that pairs of rateesshare in common (Brennan, 2001). Thus, as described here, wetook a more direct approach to generating a single-rater populationreliability for each condition.

Evaluation Criteria

To evaluate the performance of each interrater reliability esti-mator (i.e., Pearson r, ICC[1,k], G[q,k], G[k]) examined in the

8 This model could also easily be fitted within by using the VARCOMPor MIXED procedures in SAS or the VARCOMP procedure in SPSS andproduce equivalent results (see note for Table 3; DeShon, 2002; McCloy &Putka, 2004). We used the lmer procedure in R as a matter of convenience,given that our simulation program was designed to carry out all phases ofour study: (a) simulating data, (b) calculating reliabilities, and (c) aggre-gating results.

9 Derivation of the formula presented in Equation 6 can be obtained fromDan J. Putka on request. Its accuracy is reinforced by the quality of resultsobtained for the G(q,k) reliability estimator presented in the next section.

Table 5Expected Values for the q Multiplier Used in the SimulationStudy

Nr

k � 1 k � 2

E(ci,i�) E(q) E(ci,i�) E(q)

3 .333 .667 1.333 .1674 .250 .750 1.000 .2505 .200 .800 .800 .3007 .143 .857 .571 .357

10 .100 .900 .400 .40015 .067 .933 .267 .43320 .050 .950 .200 .45030 .033 .967 .133 .46750 .020 .980 .080 .480

100 .010 .990 .040 .490150 .007 .993 .027 .493300 .003 .997 .013 .497600 .002 .998 .007 .498

Note. Nr � total number of raters participating in the study; k � numberof raters per rate; E(ci,i�) � expected number of raters that any given pairof ratees (i,i') share in common; E(q) � expected value of the q multiplier.

969ILL-STRUCTURED MEASUREMENT DESIGNS

simulation, we adopted two well known criteria: bias and rootmean square error (Enders, 2003; Miller & Miller, 1999). The biasof each interrater reliability estimator was calculated as the differ-ence between the mean reliability estimate across samples withina condition minus the corresponding population reliability for thatcondition. Negative (positive) bias values indicate that the estima-tor underestimated (overestimated) the population reliability. Toease interpretation, we report measures of relative bias. Relativebias was calculated as the bias for a given reliability estimator ina given condition, divided by the population reliability for thatcondition, and multiplied by 100:

� �i�1

Ns

ryy_i

Ns� yy

� �1

yy� 100, (7)

where ryy_i is the reliability estimate for sample i, yy is thepopulation reliability for the given condition, and Ns is the numberof samples per condition. These relative bias values can be inter-preted as the percentage by which a given estimator over- orunderestimated the population reliability for a condition.

Root mean square error (RMSE) is a function of the averagesquared difference between the reliability estimates for sampleswithin a condition and the corresponding population reliability forthat condition. Specifically, it is calculated as

RMSE ���

i�1

Ns

(ryy_i � yy)2

Ns(8)

where ryy_i, yy,, and Ns are as defined above. RMSE is affected byboth the bias (to the extent it exists) and the sampling variability ofan estimator (i.e., the average deviation of sample estimates fromthe mean of sample estimates within a condition). If an estimatoris unbiased, the RMSE provides a pure measure of samplingvariability. Given that RMSE can reflect both types of estimationerror, it is viewed as an omnibus measure of the accuracy orprecision of an estimator. It indicates how distant, on average, thesample estimates resulting from a given estimator are from thepopulation parameter. In this study, we calculated RMSEs for eachreliability estimator for each condition as well as the ratio ofRMSEs for each traditional estimator (e.g., Pearson r, ICC[1,k])and G(k) over the RMSE for G(q,k) for each condition. Theseratios index the inaccuracy of these estimators relative to G(q,k)(Enders, 2003).

Results

Given the large number of conditions for which data weresimulated, we have streamlined the results by presenting only asampling of conditions that are sufficient for illustrating our mainfindings.10 An initial review of results for each condition sug-gested that, with the exception of results for Research Question 1,the results for each condition differed very little as a function ofratee sample size (Nt). Thus, to simplify our presentation, we limitresults pertinent to Research Questions 2 to 4 to conditions involv-ing ratee sample sizes of 150. Nevertheless, we do briefly note

results for conditions with ratee samples sizes of 50 and 300 whenthey differed from those presented.

Research Question 1: How Much “Within-Sample”Variability Is There in Pearson r?

To address our first research question, we calculated the rangeof Pearson r estimates across the 100 data analysis files generatedfor each sample within a condition (per Step 3 in the simulationprocess outlined above). Next, we calculated the mean, minimum,and maximum of these ranges across samples within each condi-tion. In reviewing results for the 480 conditions, we found thatthere was almost no difference in ranges as a function of the totalnumber of raters participating in the simulated study (Nr). Thus,we collapsed results across Nr levels, and we present results forconditions defined by each Variance Component Profile � RateeSample Size combination in Table 6.

The pattern of results in Table 6 suggests that the variability inPearson r was not substantial when ratee samples sizes were 150or 300. For example, the mean range in Pearson r estimates acrossfiles within samples for conditions based on a �T

2/�R2/�TR,e

2 variancecomponent profile of 50/25/25 and Nt � 150 was only .021, andthe maximum range observed across estimates within any samplefor such conditions was .056. The potential for within-samplevariability in Pearson r was greater when ratee sample sizes weresmall. For example, the mean range in Pearson r estimates acrossfiles within samples for conditions based on a �T

2/�R2/�TR,e

2 variancecomponent profile of 50/25/25 and Nt � 50 was .064, and themaximum range observed across estimates within any sample forsuch conditions was .170. This suggests that when working withsmall ratee sample sizes, researchers should be wary of the poten-tial for variability in Pearson r simply based on the way theystructure their data files for analysis. That is, as our mock exampleillustrated, the same underlying sample of data may produce dif-ferent estimates for interrater reliability based on Pearson r, par-ticularly in cases where the ratings are relatively unreliable (i.e.,when �T

2 is small relative to �R2 and �TR,e

2 ).

Research Question 2: How Do Interrater ReliabilityEstimators Compare in Terms of Bias?

To establish a frame of reference for examining issues of bias inPearson r, ICC(1,k), G(k), and G(q,k) interrater reliability estima-tors, Table 7 provides population reliabilities for a sampling ofsimulation conditions that capture the trend we observed in ourresults. The process for deriving the population reliabilities pre-sented in Table 7 was described earlier in Steps 9 and 10 of thesimulation process. By definition, population reliabilities for eachcondition are invariant across different ratee sample sizes. There-fore, results in Table 7 reflect population reliabilities for conditionsdefined by different Variance Component Profile � Total Numberof Raters Participating in the Simulated Study (Nr) combinations.Examination of Table 7 reveals a clear trend: As the overlapbetween the sets of raters that rated each ratee decreased (i.e., as

10 A Microsoft Excel file containing complete results for all conditionsdescribed in the Method section can be obtained from Dan J. Putka onrequest.

970 PUTKA, LE, MCCLOY, AND DIAZ

the total number of raters participating in the study [Nr] divergedfrom the number of raters per ratee [k � 2]), the populationreliability decreased and approached the ratio �T

2 / (�T2 � �R

2 / k ��TR,e

2 / k). Note that this ratio is actually the formula for ICC(1,k)presented by McGraw and Wong (1996) for nested designs (r:t).Conversely, as the overlap between the sets of raters that ratedeach ratee increased (i.e., as Nr approached k), the populationreliability increased and approached the ratio �T

2 / (�T2 � �TR,e

2 /k)—reflecting the diminished contribution of rater main effects toobserved score variance as designs effectively became more“crossed” (per Equation 2). These trends were most evident forconditions having large ratios of rater main effect variance to truescore variance (i.e., �R

2/�T2).

Tables 8 and 9 show relative bias in single-rater and two-raterreliability estimators for conditions in which ratee sample sizeswere 150. These tables reveal several findings of note. First, as theoverlap between the sets of raters that rated each ratee increased

(i.e., as Nr approached k), and the �R2/�T

2 ratio increased, bothPearson r and ICC(1,k) became notably more negatively biased,whereas G(q,k) exhibited little (�2%) or no bias regardless ofcondition. Nevertheless, for conditions where there was little over-lap between the sets of raters who rated each ratee (e.g., conditionsin which the total number of raters participating in the study [Nr]was greater than 30) all estimators appeared to be relatively free ofbias (�5%). This suggests that when one has designs where thereis little overlap between the sets of raters who rate each ratee, allof these estimators may produce results of similar quality—at leastin terms of bias. We view this finding as good news, particularlyfor past large-scale primary studies that have relied heavily onPearson r (e.g., Mount et al., 1998; Rothstein, 1990; Scullen et al.,2000).

For those conditions where the bias in Pearson r and ICC(1,k)was most problematic (i.e., small number of participating raters[Nr] and high �R

2/�T2 ratio), the bias appeared to be worse for

Table 6Within-Sample Variability of Pearson r-Based Estimates of Interrater Reliability

�T2/�R

2/�TR,e2

Nt � 50 Nt � 150 Nt � 300

Mrange Minrange Maxrange Mrange Minrange Maxrange Mrange Minrange Maxrange

Statistics for single-rater reliability estimates (Pearson r)

35/55/10 .073 .027 .196 .024 .010 .064 .012 .005 .03135/45/20 .072 .030 .197 .024 .010 .062 .012 .005 .03335/35/30 .072 .032 .190 .024 .011 .065 .012 .005 .03135/25/40 .072 .032 .182 .024 .010 .063 .012 .005 .03235/15/50 .072 .032 .187 .024 .010 .064 .012 .005 .03035/05/60 .072 .031 .183 .024 .011 .060 .012 .005 .02950/45/05 .063 .018 .176 .021 .007 .055 .010 .004 .02850/35/15 .063 .022 .179 .021 .009 .059 .011 .005 .02850/25/25 .064 .024 .170 .021 .010 .056 .011 .005 .02850/15/35 .063 .025 .164 .021 .009 .053 .011 .005 .02750/05/45 .063 .026 .161 .021 .010 .057 .011 .005 .02775/20/05 .039 .010 .112 .013 .004 .040 .007 .002 .02075/15/10 .039 .012 .121 .013 .005 .036 .007 .003 .01875/10/15 .040 .013 .111 .013 .006 .037 .007 .003 .01875/05/20 .040 .014 .112 .013 .006 .037 .007 .003 .016

Statistics for two-rater reliability estimates (Spearman–Brown prophesy of Pearson r)

35/55/10 .086 .020 .390 .028 .008 .099 .014 .004 .04635/45/20 .084 .023 .329 .028 .009 .092 .014 .005 .04835/35/30 .083 .026 .299 .028 .010 .090 .014 .005 .04335/25/40 .082 .024 .289 .027 .009 .078 .014 .005 .04035/15/50 .081 .027 .292 .027 .010 .079 .013 .005 .03735/05/60 .080 .024 .259 .026 .011 .074 .013 .006 .03550/45/05 .060 .011 .245 .019 .005 .070 .010 .003 .03450/35/15 .059 .014 .238 .019 .006 .070 .010 .003 .03150/25/25 .058 .016 .233 .019 .007 .059 .010 .004 .03050/15/35 .057 .017 .197 .019 .007 .052 .010 .004 .02750/05/45 .057 .018 .199 .019 .008 .054 .009 .004 .02575/20/05 .026 .006 .111 .009 .002 .033 .004 .001 .01775/15/10 .026 .006 .106 .009 .003 .030 .004 .002 .01475/10/15 .026 .007 .090 .009 .003 .028 .004 .002 .01375/05/20 .026 .008 .084 .009 .003 .025 .004 .002 .011

Note. �T2 � ratee main effect (true score) variance component; �R

2 � rater main effect variance component; �TR,e2 � ratee � rater interaction effect and

residual error variance component; Mrange � mean range in Pearson r-based reliability estimates across analysis files within samples (mean is acrosssamples, within simulation condition defined by the given variance component profile � ratee sample size [Nt] combination); Minrange � minimum rangein Pearson r-based reliability estimates across analysis files within samples (minimum is across samples, within condition); Maxrange � maximum rangein Pearson r-based reliability estimates across analysis files within samples (maximum is across samples, within condition).

971ILL-STRUCTURED MEASUREMENT DESIGNS

two-rater reliability estimates than for single-rater estimates. Ourrationale for this finding is that the bias resulting from failing toproperly scale the contribution of rater main effects to observedvariance is exacerbated when one begins to project single-raterestimates to k-rater estimates. Neither Spearman–Brown nor ICC-based methods for projecting single-rater estimates to k-rater esti-mates consider q in making that projection. As such, the downwardbias that results from failing to properly scale the contribution ofrater main effects appears to get magnified as one projects single-rater estimates to k-rater estimates by using these traditional meth-ods in the context of ISMDs.

A final noteworthy aspect of Tables 8 and 9 regards the results forG(1) and G(2) relative to the other estimators. Recall from the intro-duction that there are two expected sources of negative bias in Pearsonr and ICC(1,k) when they are used to estimate the reliability of ratingsarising from ISMDs. First, both estimators fail to explicitly model thecontribution of rater main effects to observed score variance (resultingin deflation of true score variance due to non-independence among

ratings for ratees that share raters). Second, both estimators fail toproperly scale the contribution of rater main effects to observed scorevariance using the q multiplier (per Equation 4). Examination ofresults for G(k) relative to the other estimators can help tease apart therelative contribution of these sources of bias.

The average relative bias of G(1) across conditions summarizedin Table 8 was –2.5%, whereas the average relative biases ofPearson r and ICC(1,1) were –5.5% and –.5.7%, respectively.Contrast these values with the average relative bias of G(q,1)across conditions (–0.4%), and it appears that a little more thanhalf of the negative bias in Pearson r and ICC(1,1) was attributableto their failure to model rater main effects (3.0 – 3.2 percentagepoints), and a little less than half of the bias in those estimators wasattributable to their failure to properly scale the contribution ofrater main effects using the q multiplier (2.1 percentage points).With regard to two-rater reliability estimators, the average relativebias of G(2) across conditions summarized in Table 9 was –3.5%,whereas the average relative bias of the Spearman–Brown proph-esy for two raters (based on Pearson r) and ICC(1,2) was –6.2%.Contrast these values with the average relative bias of G(q,2)across conditions (–0.4%), and it appears that a little less than halfof the negative bias in the Spearman–Brown prophesy andICC(1,2) was attributable to their failure to model rater maineffects (2.7 percentage points), and a little more than half of thebias in those estimators was attributable to their failure to properlyscale the contribution of rater main effects using the q multiplier(3.1 percentage points).

Research Question 3: How Do Interrater ReliabilityEstimators Compare in Terms of RMSE?

Although bias is certainly an important criterion to considerwhen evaluating the quality of a statistical estimator, so too is thestability of the estimates produced across samples. Tables 10 and11 provide insight into the stability of estimates produced bysingle-rater and two-rater reliability estimators for conditions inwhich ratee sample sizes were 150.

The values presented in Tables 10 and 11 reflect RMSE ratios aswell as RMSEs. RMSEs are presented for G(q,1) and G(q,2), andRMSE ratios are presented for all other estimators. Recall thatthese RMSE ratios reflect the RMSE of a given estimator over theRMSE of G(q,k) for a given condition, and thus they reflect theinaccuracy of that estimator relative to G(q,k). The results pre-sented in Tables 10 and 11 suggest that, regardless of conditionexamined, G(q,k) produced estimates that were as precise or moreprecise than those provided by the other estimators. G(q,1) pro-duced estimates that were on average 1.30 times closer to popu-lation reliabilities (SD � 0.25, range � 1.00–2.13) than Pearson rand ICC(1,1) single-rater reliability estimators for conditions ex-amined in Table 10. Similarly, G(q,2) produced estimates thatwere, on average, 1.63 times closer to population reliabilities(SD � 0.77, range � 1.00–4.50) than were Spearman–Brown andICC(1,2) two-rater reliability estimators for conditions examinedin Table 11. As with the bias statistics presented in Tables 8 and9, the advantage in precision gained from using G(q,k) increased

(text continues on pg. 977)

Table 7Single-Rater and Two-Rater Population Reliabilities

�T2/�R

2/�TR,e2

Total number of raters participating the study (Nr)

3 5 10 15 30 50 150 300

Single-rater population reliability35/55/10 .49 .43 .39 .38 .36 .36 .36 .3535/45/20 .45 .41 .38 .37 .36 .36 .35 .3535/35/30 .42 .39 .37 .36 .36 .36 .35 .3535/25/40 .39 .38 .36 .36 .35 .35 .36 .3535/15/50 .37 .36 .36 .35 .35 .35 .35 .3535/05/60 .36 .35 .36 .35 .35 .35 .35 .3550/45/05 .64 .59 .54 .54 .51 .51 .50 .5050/35/15 .60 .55 .53 .52 .51 .51 .50 .5050/25/25 .56 .53 .52 .52 .51 .50 .50 .5050/15/35 .53 .52 .51 .51 .50 .50 .50 .5050/05/45 .51 .51 .50 .50 .50 .50 .50 .5075/20/05 .82 .79 .78 .76 .75 .75 .75 .7575/15/10 .80 .77 .76 .76 .75 .75 .75 .7575/10/15 .78 .76 .76 .76 .75 .75 .75 .7575/05/20 .76 .76 .75 .75 .75 .75 .75 .75

Two-rater population reliability35/55/10 .73 .64 .58 .56 .54 .53 .52 .5235/45/20 .68 .61 .57 .54 .53 .53 .52 .5235/35/30 .63 .59 .55 .54 .53 .53 .52 .5235/25/40 .59 .56 .54 .53 .52 .52 .52 .5235/15/50 .56 .54 .53 .52 .52 .52 .52 .5235/05/60 .53 .53 .53 .52 .52 .52 .52 .5250/45/05 .84 .77 .72 .71 .68 .67 .67 .6750/35/15 .80 .74 .71 .69 .68 .67 .67 .6750/25/25 .75 .72 .69 .69 .67 .67 .67 .6750/15/35 .71 .70 .68 .68 .67 .67 .67 .6750/05/45 .68 .67 .67 .67 .67 .67 .67 .6775/20/05 .93 .90 .88 .87 .86 .86 .86 .8675/15/10 .91 .89 .87 .87 .86 .86 .86 .8675/10/15 .89 .88 .87 .86 .86 .86 .86 .8675/05/20 .87 .87 .86 .86 .86 .86 .86 .86

Note. Cell values reflect population reliabilities for simulation conditionsdefined by each Variance Component Profile � Total Number of RatersParticipating in the Study (Nr) combination. �T

2 � ratee main effect (truescore) variance component; �R

2 � rater main effect variance component;�TR,e

2 � ratee � rater interaction effect and residual error variance com-ponent.

972 PUTKA, LE, MCCLOY, AND DIAZ

Tab

le8

Rel

ativ

eB

ias

ofSi

ngle

-Rat

erR

elia

bili

tyE

stim

ator

s(N

t�

150)

�T2/�

r2

/�T

R,e

2

Tot

alnu

mbe

rof

rate

rspa

rtic

ipat

ing

inth

est

udy

(Nr)

Pear

son

rIC

C(1

,1)

G(1

)G

(q,1

)

35

1015

3050

300

35

1015

3050

300

35

1015

3050

300

35

1015

3050

300

35/5

5/10

�38

.5�

25.6

�14

.9�

10.7

�6.

3�

4.3

�0.

9�

34.5

�25

.5�

15.1

�10

.9�

6.6

�4.

6�

1.2

�12

.4�

8.9

�5.

1�

3.6

�2.

5�

1.4

�1.

2�

0.2

�0.

2�

0.2

�0.

2�

0.8

�0.

3�

1.0

35/4

5/20

�32

.8�

22.0

�12

.3�

8.3

�5.

1�

2.5

�1.

0�

31.0

�22

.0�

12.5

�8.

5�

5.3

�2.

8�

1.3

�10

.6�

7.8

�4.

6�

2.8

�1.

8�

1.2

�0.

3�

0.4

�0.

6�

0.6

0.0

�0.

3�

0.4

�0.

235

/35/

30�

27.5

�18

.5�

10.5

�7.

2�

4.1

�3.

2�

0.8

�26

.5�

18.6

�10

.7�

7.5

�4.

3�

3.5

�1.

0�

8.9

�6.

6�

4.0

�2.

9�

2.5

�1.

6�

0.5

�0.

7�

0.8

�0.

8�

0.7

�1.

4�

0.9

�0.

435

/25/

40�

21.0

�13

.7�

7.7

�5.

0�

2.7

�1.

5�

0.2

�20

.9�

13.9

�8.

0�

5.3

�3.

1�

1.8

�0.

4�

7.4

�4.

9�

3.5

�2.

4�

1.0

�0.

9�

0.1

�1.

1�

0.7

�1.

2�

0.9

�0.

2�

0.4

0.0

35/1

5/50

�12

.3�

9.5

�4.

9�

3.2

�2.

7�

2.2

�0.

6�

12.5

�9.

8�

5.2

�3.

5�

3.0

�2.

4�

0.9

�4.

3�

4.2

�2.

6�

1.2

�1.

9�

1.6

�1.

0�

0.4

�1.

6�

1.2

�0.

2�

1.4

�1.

3�

0.9

35/0

5/60

�5.

5�

3.1

�2.

7�

1.8

�1.

1�

1.1

�0.

9�

5.8

�3.

4�

3.0

�2.

2�

1.4

�1.

4�

1.2

�2.

7�

1.5

�1.

9�

1.5

�1.

1�

1.2

�1.

1�

1.1

�0.

5�

1.4

�1.

2�

1.0

�1.

1�

1.1

50/4

5/05

�22

.7�

15.0

�8.

2�

5.8

�3.

4�

2.1

�0.

5�

22.5

�15

.1�

8.4

�6.

0�

3.6

�2.

4�

0.7

�10

.4�

7.4

�4.

1�

3.1

�1.

5�

1.2

�0.

4�

0.1

�0.

3�

0.1

�0.

4�

0.1

�0.

3�

0.2

50/3

5/15

�18

.1�

12.7

�7.

1�

4.8

�2.

2�

1.2

�0.

6�

18.0

�12

.9�

7.4

�5.

0�

2.4

�1.

4�

0.8

�8.

1�

5.9

�3.

5�

2.2

�1.

1�

0.7

�0.

50.

00.

0�

0.4

0.0

0.0

0.0

�0.

450

/25/

25�

14.9

�9.

6�

4.7

�3.

4�

2.3

�1.

0�

0.7

�15

.0�

9.8

�4.

9�

3.6

�2.

6�

1.3

�0.

8�

6.9

�4.

5�

2.4

�1.

6�

1.5

�1.

1�

0.6

�0.

6�

0.2

�0.

1�

0.1

�0.

7�

0.6

�0.

550

/15/

35�

9.5

�5.

7�

3.0

�2.

6�

1.5

�0.

6�

0.3

�9.

8�

5.9

�3.

3�

2.8

�1.

8�

0.9

�0.

6�

4.6

�3.

1�

1.5

�1.

4�

1.2

�0.

5�

0.4

�0.

3�

0.5

0.0

�0.

5�

0.7

�0.

2�

0.4

50/0

5/45

�3.

4�

2.1

�1.

2�

1.5

�0.

6�

0.4

�0.

1�

3.6

�2.

3�

1.4

�1.

8�

0.8

�0.

6�

0.4

�1.

8�

1.3

�0.

9�

1.3

�0.

7�

0.6

�0.

4�

0.4

�0.

4�

0.5

�1.

0�

0.6

�0.

5�

0.4

75/2

0/05

�7.

5�

5.0

�2.

5�

1.9

�1.

0�

0.5

�0.

1�

7.7

�5.

2�

2.7

�2.

1�

1.2

�0.

7�

0.3

�5.

2�

3.7

�2.

0�

1.5

�0.

8�

0.6

0.0

0.0

�0.

2�

0.2

�0.

2�

0.1

�0.

20.

075

/15/

10�

5.8

�3.

9�

2.0

�1.

3�

0.6

�0.

5�

0.1

�5.

9�

4.1

�2.

1�

1.5

�0.

8�

0.7

�0.

3�

4.1

�2.

9�

1.6

�1.

0�

0.6

�0.

4�

0.2

�0.

1�

0.1

�0.

10.

0�

0.1

�0.

1�

0.1

75/1

0/15

�4.

1�

2.7

�1.

3�

0.9

�0.

6�

0.3

0.1

�4.

3�

2.9

�1.

5�

1.1

�0.

7�

0.4

�0.

1�

3.1

�2.

0�

1.2

�0.

9�

0.5

�0.

30.

0�

0.3

�0.

1�

0.2

�0.

2�

0.2

�0.

10.

075

/05/

20�

2.3

�1.

3�

0.7

�0.

4�

0.2

0.0

0.0

�2.

5�

1.5

�0.

9�

0.6

�0.

4�

0.2

�0.

2�

1.8

�1.

1�

0.6

�0.

4�

0.3

�0.

2�

0.1

�0.

2�

0.1

�0.

2�

0.1

�0.

1�

0.1

�0.

1

Not

e.C

ell

valu

esre

flec

tth

epe

rcen

tage

byw

hich

the

give

nes

timat

orun

der-

orov

eres

timat

edth

esi

ngle

-rat

erpo

pula

tion

relia

bilit

yfo

rth

esi

mul

atio

nco

nditi

onde

fine

dby

the

give

nV

aria

nce

Com

pone

ntPr

ofile

�T

otal

Num

ber

ofR

ater

sPa

rtic

ipat

ing

inth

eSt

udy

(Nr)

com

bina

tion.

Perc

enta

ges

for

thos

ees

timat

ors

that

prod

uced

mor

eth

ana

5%un

dere

stim

ate

ofth

epo

pula

tion

relia

bilit

yfo

rth

egi

ven

sim

ulat

ion

cond

ition

appe

arin

bold

face

type

.�T2

�ra

tee

mai

nef

fect

(tru

esc

ore)

vari

ance

com

pone

nt;�

R2�

rate

rm

ain

effe

ctva

rian

ceco

mpo

nent

;�T

R,e

2�

Rat

ee�

Rat

erin

tera

ctio

nef

fect

and

resi

dual

erro

rva

rian

ceco

mpo

nent

.

973ILL-STRUCTURED MEASUREMENT DESIGNS

Tab

le9

Rel

ativ

eB

ias

of2-

Rat

erR

elia

bili

tyE

stim

ator

s(N

t�

150)

�T2

/�r2

/�T

R,e

2

Tot

alnu

mbe

rof

rate

rspa

rtic

ipat

ing

inth

est

udy

(Nr)

Spea

rman

–Bro

wn

prop

hecy

ofPe

arso

nr

ICC

(1,2

)G

(2)

G(q

,2)

35

1015

3050

300

35

1015

3050

300

35

1015

3050

300

35

1015

3050

300

35/5

5/10

�47

.7�

28.4

�15

.9�

11.4

�6.

7�

4.8

�1.

3�

41.0

�28

.0�

16.0

�11

.5�

6.9

�5.

0�

1.5

�20

.7�

13.5

�7.

5�

5.2

�3.

2�

1.9

�1.

1�

0.2

0.0

�0.

2�

0.2

�0.

6�

0.4

�0.

835

/45/

20�

38.7

�24

.2�

13.1

�8.

8�

5.5

�2.

8�

1.5

�35

.8�

24.1

�13

.3�

8.9

�5.

7�

3.0

�1.

7�

17.4

�11

.7�

6.5

�4.

0�

2.4

�1.

4�

0.6

�0.

3�

0.5

�0.

50.

2�

0.2

�0.

1�

0.4

35/3

5/30

�31

.7�

20.0

�11

.2�

7.8

�4.

5�

3.6

�1.

2�

29.9

�20

.1�

11.4

�7.

9�

4.7

�3.

8�

1.4

�14

.3�

9.8

�5.

6�

3.9

�2.

9�

1.9

�0.

8�

0.7

�0.

8�

0.9

�0.

6�

1.2

�0.

9�

0.6

35/2

5/40

�23

.3�

14.7

�8.

2�

5.6

�3.

1�

1.8

�0.

5�

22.9

�14

.9�

8.4

�5.

8�

3.3

�2.

1�

0.7

�11

.2�

7.2

�4.

6�

3.2

�1.

5�

1.2

�0.

4�

0.9

�0.

7�

1.2

�0.

9�

0.2

�0.

4�

0.3

35/1

5/50

�13

.5�

10.2

�5.

4�

3.6

�3.

1�

2.5

�1.

0�

13.6

�10

.4�

5.6

�3.

8�

3.3

�2.

7�

1.2

�6.

6�

5.8

�3.

3�

1.9

�2.

3�

2.0

�1.

2�

0.4

�1.

8�

1.3

�0.

4�

1.6

�1.

5�

1.1

35/0

5/60

�6.

0�

3.5

�3.

2�

2.2

�1.

5�

1.4

�1.

3�

6.2

�3.

7�

3.4

�2.

5�

1.7

�1.

6�

1.5

�3.

7�

2.2

�2.

4�

1.9

�1.

5�

1.5

�1.

4�

1.3

�0.

8�

1.7

�1.

4�

1.2

�1.

3�

1.4

50/4

5/05

�26

.9�

16.6

�8.

6�

6.1

�3.

6�

2.3

�0.

6�

26.3

�16

.6�

8.8

�6.

2�

3.7

�2.

5�

0.8

�15

.6�

10.2

�5.

5�

4.0

�2.

1�

1.4

�0.

40.

0�

0.1

0.0

�0.

3�

0.1

�0.

2�

0.2

50/3

5/15

�21

.0�

13.6

�7.

5�

5.0

�2.

3�

1.3

�0.

7�

20.7

�13

.7�

7.7

�5.

2�

2.5

�1.

4�

0.8

�12

.2�

8.3

�4.

7�

3.0

�1.

4�

0.8

�0.

60.

00.

0�

0.4

0.0

0.1

0.1

�0.

450

/25/

25�

16.6

�10

.2�

5.0

�3.

6�

2.5

�1.

2�

0.8

�16

.6�

10.4

�5.

1�

3.7

�2.

6�

1.4

�0.

9�

10.0

�6.

4�

3.2

�2.

2�

1.7

�1.

1�

0.7

�0.

5�

0.3

�0.

1�

0.1

�0.

6�

0.5

�0.

650

/15/

35�

10.3

�6.

0�

3.3

�2.

8�

1.7

�0.

7�

0.4

�10

.5�

6.2

�3.

4�

2.9

�1.

8�

0.9

�0.

6�

6.5

�4.

1�

2.1

�1.

9�

1.4

�0.

6�

0.5

�0.

4�

0.4

�0.

2�

0.6

�0.

7�

0.2

�0.

450

/05/

45�

3.6

�2.

2�

1.3

�1.

7�

0.8

�0.

6�

0.3

�3.

8�

2.4

�1.

5�

1.8

�0.

9�

0.7

�0.

5�

2.5

�1.

7�

1.1

�1.

5�

0.9

�0.

7�

0.5

�0.

4�

0.4

�0.

5�

1.1

�0.

6�

0.5

�0.

575

/20/

05�

8.3

�5.

3�

2.6

�2.

0�

1.1

�0.

6�

0.1

�8.

3�

5.3

�2.

7�

2.1

�1.

2�

0.7

�0.

2�

6.5

�4.

4�

2.3

�1.

7�

0.9

�0.

6�

0.1

0.0

�0.

1�

0.1

�0.

2�

0.1

�0.

10.

075

/15/

10�

6.2

�4.

1�

2.0

�1.

4�

0.7

�0.

5�

0.2

�6.

3�

4.2

�2.

1�

1.5

�0.

8�

0.6

�0.

3�

5.0

�3.

4�

1.8

�1.

2�

0.7

�0.

5�

0.2

0.0

0.0

�0.

10.

0�

0.1

�0.

1�

0.1

75/1

0/15

�4.

4�

2.8

�1.

3�

1.0

�0.

6�

0.3

0.1

�4.

5�

2.9

�1.

4�

1.1

�0.

7�

0.4

0.0

�3.

7�

2.3

�1.

3�

0.9

�0.

5�

0.3

0.0

�0.

2�

0.1

�0.

1�

0.2

�0.

2�

0.1

0.1

75/0

5/20

�2.

4�

1.3

�0.

7�

0.4

�0.

20.

00.

0�

2.5

�1.

5�

0.8

�0.

5�

0.3

�0.

1�

0.1

�2.

1�

1.2

�0.

7�

0.4

�0.

2�

0.1

�0.

1�

0.1

�0.

1�

0.1

�0.

1�

0.1

0.0

�0.

1

Not

e.C

ellv

alue

sre

flec

tthe

perc

enta

geby

whi

chth

egi

ven

estim

ator

unde

r-or

over

estim

ated

the

two-

rate

rpo

pula

tion

relia

bilit

yfo

rth

esi

mul

atio

nco

nditi

onde

fine

dby

the

give

nV

aria

nce

Com

pone

ntPr

ofile

�T

otal

Num

ber

ofR

ater

sPa

rtic

ipat

ing

inth

eSt

udy

(Nr)

com

bina

tion.

Perc

enta

ges

for

thos

ees

timat

ors

that

prod

uced

mor

eth

ana

5%un

dere

stim

ate

ofth

epo

pula

tion

relia

bilit

yfo

rth

egi

ven

sim

ulat

ion

cond

ition

appe

arin

bold

face

type

.�T2

�ra

tee

mai

nef

fect

(tru

esc

ore)

vari

ance

com

pone

nt;�

R2�

rate

rm

ain

effe

ctva

rian

ceco

mpo

nent

;�T

R,e

2�

rate

e�

rate

rin

tera

ctio

nef

fect

and

resi

dual

erro

rva

rian

ceco

mpo

nent

;IC

C�

intr

acla

ssco

rrel

atio

n.

974 PUTKA, LE, MCCLOY, AND DIAZ

Tab

le10

Rel

ativ

eP

reci

sion

ofSi

ngle

-Rat

erR

elia

bili

tyE

stim

ator

sC

ompa

red

toG

(q,1

)(N

t�

150)

�T2

/�r2

/�T

R,e

2

Tot

alnu

mbe

rof

rate

rspa

rtic

ipat

ing

inth

est

udy

(Nr)

Pear

son

rIC

C(1

,1)

G(1

)G

(q,1

)

35

1015

3050

300

35

1015

3050

300

35

1015

3050

300

35

1015

3050

300

35/5

5/10

1.99

1.61

1.46

1.43

1.53

1.52

1.38

1.78

1.59

1.45

1.43

1.52

1.52

1.38

1.14

1.06

1.02

1.01

1.01

1.00

1.00

.171

.132

.096

.080

.061

.057

.053

35/4

5/20

2.11

1.64

1.43

1.40

1.47

1.45

1.23

1.98

1.63

1.43

1.40

1.48

1.45

1.22

1.18

1.08

1.03

1.01

1.00

1.00

1.00

.125

.109

.083

.071

.059

.052

.060

35/3

5/30

2.13

1.70

1.44

1.40

1.34

1.36

1.13

2.02

1.69

1.44

1.40

1.34

1.36

1.13

1.18

1.09

1.03

1.02

1.01

1.00

1.00

.100

.087

.072

.067

.059

.054

.065

35/2

5/40

1.98

1.57

1.32

1.30

1.25

1.19

1.05

1.92

1.57

1.32

1.29

1.25

1.19

1.04

1.16

1.07

1.02

1.01

1.00

1.00

1.00

.080

.075

.067

.067

.061

.063

.071

35/1

5/50

1.58

1.31

1.19

1.17

1.12

1.08

1.02

1.57

1.31

1.19

1.17

1.12

1.08

1.01

1.09

1.04

1.01

1.00

1.00

1.00

1.00

.071

.072

.066

.064

.068

.069

.072

35/0

5/60

1.13

1.07

1.04

1.03

1.03

1.02

1.01

1.13

1.07

1.03

1.03

1.03

1.02

1.00

1.01

1.00

1.00

1.00

1.00

1.00

1.00

.070

.071

.073

.070

.071

.070

.071

50/4

5/05

1.79

1.44

1.27

1.24

1.32

1.32

1.32

1.75

1.44

1.27

1.24

1.32

1.32

1.31

1.20

1.09

1.04

1.03

1.01

1.01

1.00

.169

.146

.107

.093

.066

.059

.048

50/3

5/15

1.84

1.50

1.33

1.27

1.27

1.30

1.14

1.81

1.50

1.33

1.27

1.27

1.30

1.13

1.22

1.12

1.05

1.03

1.01

1.00

1.00

.128

.107

.088

.075

.059

.056

.053

50/2

5/25

1.87

1.52

1.28

1.28

1.24

1.19

1.07

1.86

1.52

1.28

1.28

1.24

1.19

1.07

1.25

1.13

1.05

1.03

1.01

1.01

1.00

.096

.081

.071

.064

.058

.055

.060

50/1

5/35

1.68

1.35

1.21

1.17

1.11

1.09

1.04

1.68

1.36

1.21

1.17

1.11

1.09

1.03

1.20

1.09

1.03

1.02

1.01

1.00

1.00

.071

.065

.062

.059

.058

.057

.060

50/0

5/45

1.20

1.09

1.06

1.05

1.02

1.02

1.00

1.20

1.09

1.05

1.05

1.02

1.02

1.00

1.05

1.02

1.01

1.00

1.00

1.00

1.00

.062

.059

.062

.062

.063

.062

.059

75/2

0/05

1.66

1.36

1.18

1.17

1.16

1.16

1.08

1.67

1.36

1.18

1.18

1.17

1.17

1.08

1.34

1.19

1.09

1.06

1.03

1.02

1.00

.098

.087

.068

.059

.046

.041

.036

75/1

5/10

1.65

1.36

1.20

1.16

1.11

1.13

1.06

1.66

1.37

1.21

1.17

1.11

1.13

1.06

1.36

1.19

1.08

1.05

1.02

1.01

1.00

.076

.069

.055

.048

.043

.038

.036

75/1

0/15

1.59

1.35

1.15

1.11

1.08

1.06

1.03

1.60

1.35

1.16

1.11

1.10

1.06

1.02

1.35

1.18

1.07

1.04

1.02

1.01

1.00

.058

.053

.044

.041

.038

.037

.036

75/0

5/20

1.42

1.17

1.08

1.07

1.02

1.03

1.00

1.43

1.18

1.09

1.08

1.03

1.03

1.00

1.25

1.10

1.04

1.02

1.01

1.00

1.00

.041

.041

.038

.036

.037

.035

.035

Not

e.C

ell

valu

esfo

rG

(q,1

)ar

ero

otm

ean

squa

reer

rors

(RM

SEs)

ofG

(q,1

)fo

rth

esi

mul

atio

nco

nditi

onde

fine

dby

the

give

nV

aria

nce

Com

pone

ntPr

ofile

�T

otal

Num

ber

ofR

ater

sPa

rtic

ipat

ing

inth

eSt

udy

(Nr)

com

bina

tion.

Cel

lva

lues

for

all

othe

res

timat

ors

are

RM

SEra

tios

(RM

SEfo

rgi

ven

estim

ator

/R

MSE

for

G[q

,1])

for

the

give

nsi

mul

atio

nco

nditi

on.

RM

SEra

tios

grea

ter

than

1.0

indi

cate

the

give

nes

timat

orw

asle

sspr

ecis

eth

anG

(q,1

).�

T2�

rate

em

ain

effe

ct(t

rue

scor

e)va

rian

ceco

mpo

nent

;�

R2�

rate

rm

ain

effe

ctva

rian

ceco

mpo

nent

;�

TR

,e2

�ra

tee

�ra

ter

inte

ract

ion

effe

ctan

dre

sidu

aler

ror

vari

ance

com

pone

nt;

ICC

�in

trac

lass

corr

elat

ion.

975ILL-STRUCTURED MEASUREMENT DESIGNS

Tab

le11

Rel

ativ

eP

reci

sion

ofT

wo-

Rat

erR

elia

bili

tyE

stim

ator

sC

ompa

red

toG

(q,2

)(N

t�

150)

�T2

/�r2

/�T

R,e

2

Tot

alnu

mbe

rof

rate

rspa

rtic

ipat

ing

inth

est

udy

(Nr)

Spea

rman

–Bro

wn

Prop

hecy

ofPe

arso

nr

ICC

(1,2

)G

(2)

G(q

,2)

35

1015

3050

300

35

1015

3050

300

35

1015

3050

300

35

1015

3050

300

35/5

5/10

4.50

2.42

1.81

1.70

1.67

1.65

1.41

3.62

2.35

1.81

1.70

1.67

1.65

1.40

2.04

1.39

1.15

1.09

1.04

1.02

1.00

.116

.117

.095

.082

.066

.061

.059

35/4

5/20

4.50

2.41

1.75

1.59

1.60

1.50

1.26

3.97

2.36

1.75

1.59

1.60

1.50

1.26

2.12

1.40

1.15

1.08

1.03

1.01

1.00

.088

.100

.085

.074

.063

.056

.067

35/3

5/30

4.09

2.38

1.72

1.57

1.43

1.42

1.14

3.68

2.37

1.72

1.57

1.43

1.42

1.14

1.95

1.39

1.14

1.07

1.03

1.01

1.00

.078

.082

.075

.072

.065

.060

.072

35/2

5/40

3.16

2.03

1.49

1.40

1.31

1.22

1.06

3.00

2.03

1.49

1.40

1.31

1.22

1.05

1.66

1.27

1.09

1.04

1.01

1.00

1.00

.071

.075

.071

.072

.066

.070

.080

35/1

5/50

2.06

1.50

1.27

1.23

1.15

1.10

1.02

2.05

1.51

1.27

1.23

1.15

1.10

1.02

1.29

1.13

1.04

1.02

1.01

1.00

1.00

.071

.077

.073

.071

.077

.077

.081

35/0

5/60

1.21

1.11

1.05

1.04

1.04

1.02

1.01

1.21

1.11

1.05

1.04

1.03

1.02

1.00

1.04

1.01

1.00

1.00

1.00

1.00

1.00

.078

.079

.082

.078

.079

.078

.080

50/4

5/05

4.06

2.17

1.56

1.43

1.44

1.39

1.34

3.87

2.15

1.56

1.43

1.45

1.39

1.33

2.36

1.48

1.19

1.12

1.06

1.03

1.00

.088

.104

.086

.077

.057

.052

.043

50/3

5/15

3.95

2.21

1.62

1.44

1.35

1.35

1.15

3.76

2.22

1.62

1.45

1.35

1.35

1.15

2.28

1.54

1.20

1.11

1.04

1.02

1.00

.072

.077

.073

.064

.052

.049

.048

50/2

5/25

3.62

2.13

1.49

1.42

1.31

1.22

1.08

3.56

2.14

1.49

1.42

1.31

1.22

1.08

2.21

1.49

1.16

1.09

1.04

1.02

1.00

.058

.061

.059

.055

.052

.049

.054

50/1

5/35

2.55

1.68

1.32

1.25

1.14

1.11

1.04

2.56

1.69

1.33

1.25

1.15

1.11

1.03

1.72

1.30

1.09

1.06

1.02

1.01

1.00

.050

.053

.053

.052

.052

.051

.054

50/0

5/45

1.34

1.14

1.08

1.07

1.02

1.03

1.00

1.34

1.15

1.08

1.08

1.02

1.03

1.00

1.13

1.05

1.02

1.01

1.01

1.00

1.00

.053

.052

.056

.056

.057

.056

.053

75/2

0/05

3.71

1.99

1.41

1.33

1.24

1.20

1.08

3.73

2.01

1.42

1.34

1.25

1.21

1.09

2.85

1.70

1.28

1.18

1.08

1.05

1.00

.037

.045

.040

.037

.029

.026

.024

75/1

5/10

3.49

1.96

1.41

1.29

1.16

1.16

1.06

3.51

1.98

1.42

1.30

1.17

1.17

1.07

2.79

1.68

1.26

1.15

1.06

1.04

1.00

.029

.037

.033

.030

.028

.025

.024

75/1

0/15

2.93

1.81

1.30

1.19

1.12

1.07

1.03

2.95

1.83

1.31

1.21

1.14

1.08

1.03

2.44

1.57

1.21

1.12

1.05

1.02

1.00

.025

.029

.027

.026

.025

.024

.024

75/0

5/20

2.02

1.36

1.14

1.11

1.03

1.04

1.00

2.04

1.38

1.16

1.12

1.04

1.04

1.00

1.76

1.26

1.09

1.05

1.02

1.01

1.00

.021

.025

.024

.023

.024

.023

.023

Not

e.C

ell

valu

esfo

rG

(q,2

)ar

ero

otm

ean

squa

reer

rors

(RM

SEs)

ofG

(q,2

)fo

rth

esi

mul

atio

nco

nditi

onde

fine

dby

the

give

nV

aria

nce

Com

pone

ntPr

ofile

�T

otal

Num

ber

ofR

ater

sPa

rtic

ipat

ing

inth

eSt

udy

(Nr)

com

bina

tion.

Cel

lva

lues

for

all

othe

res

timat

ors

are

RM

SEra

tios

(RM

SEfo

rgi

ven

estim

ator

/RM

SEfo

rG

[q,2

])fo

rth

egi

ven

sim

ulat

ion

cond

ition

.R

MSE

ratio

sgr

eate

rth

an1.

0in

dica

teth

egi

ven

estim

ator

was

less

prec

ise

than

G(q

,2).

�T2

�ra

tee

mai

nef

fect

(tru

esc

ore)

vari

ance

com

pone

nt;

�R2

�ra

ter

mai

nef

fect

vari

ance

com

pone

nt;

�T

R,e

2�

rate

e�

rate

rin

tera

ctio

nef

fect

and

resi

dual

erro

rva

rian

ceco

mpo

nent

;IC

C�

intr

acla

ssco

rrel

atio

n.

976 PUTKA, LE, MCCLOY, AND DIAZ

as the overlap between the sets of raters who rated each rateeincreased (i.e., as Nr approached k) and the �R

2/�T2 ratio increased.

Additionally, the precision advantage gained from using G(q,k) asopposed to Pearson r or ICC(1,k) appeared to be greater fortwo-rater estimates than for single-rater estimates.

Unlike results presented for the bias of these estimators, we didobserve some differences in RMSE results for conditions in whichratee samples sizes of 50 or 300 were used. The primary differencefor those conditions was that the RMSEs for each estimator tendedto be somewhat larger for conditions in which ratee samples sizeswere 50 and somewhat smaller when ratee sample sizes were 300.Nevertheless, the overall pattern of results with regard to RMSEratios remained quite similar.

Research Question 4: How Well Can Rater Main EffectVariance Be Estimated in ISMDs?

Table 12 shows bias and sampling variability statistics for ratermain effect variance component (�R

2) estimates for conditions inwhich ratee samples sizes were 150. The pattern of findings showsthat REML estimation produced unbiased estimates of rater maineffect variance components regardless of condition. The samplingvariability of these estimates (i.e., the standard deviation of �R

2

estimates across samples within a condition) appeared to be largelya function of the (a) total number of raters participating in thestudy (Nr) and (b) magnitude of the rater main effect variancecomponent (�R

2). Sampling variability decreased as the total num-ber of participating raters increased and as the magnitude of therater main effect variance component decreased. Although consis-tent with the literature that has examined the sampling variabilityof variance component estimators, this finding seems a bit coun-terintuitive in light of conventional wisdom in the organizationalresearch literature. Specifically, despite the fact that increasing thetotal number of raters participating in the study caused the Ratee �Rater data matrix to become more sparsely populated (given thenumber of raters per ratee [k] was being held constant at 2), wewere able to achieve more precise estimates of variance attribut-able to rater main effects. These findings suggest that the conven-tional wisdom regarding the inability to generate unique estimatesof �R

2 for ratings arising from ISMDs should be revisited (Schmidtet al., 2000).

Finally, it is worth noting a few connections between resultspresented in Table 12 and results regarding the precision of thereliability estimators presented earlier. Despite the fact that thesampling variability of the rater main effect variance component(�R

2) was high in several conditions (e.g., particularly when thetotal number of raters participating in the study was small) usingsuch estimates to calculate G(q,k) in those conditions still pro-duced estimates of interrater reliability that were more precise thanthose produced by Pearson r and ICC(1,k). For example, for thecondition based on a �T

2/�R2/�TR,e

2 variance component profile of50/35/15 and Nr � 10, the sampling variability of �R

2 was .173 (seeTable 12). Despite this high sampling variability, Table 11 showsthat G(q,2) provided estimates of interrater reliability in this con-dition that were an average of 1.62 times closer to the two-raterpopulation reliability than were estimates provided by both theSpearman–Brown reliability estimator (based on Pearson r) andICC(1,2). Specifically, the RMSE of G(q,2) in this condition was.073, whereas for the Spearman–Brown estimator and ICC(1,2) itwas .118.

Discussion

This article explores the implications of ISMDs for estimationof interrater reliability and evaluates the performance of traditionalinterrater reliability estimators relative to a new, alternative esti-mator—G(q,k). In doing so, we believe this article makes severalcontributions to the extant organizational research literature, in thatit (a) provides the first published treatment of the implications ofISMDs for traditional estimators of interrater reliability; (b) intro-duces and evaluates a new interrater reliability estimator that canbe used regardless of whether one’s design is crossed, nested, orill-structured; and (c) illustrates the feasibility of uniquely estimat-

Table 12Bias and Sampling Variability of Rater Main Effect VarianceComponent Estimates (Nt � 150)

�T2/�r

2

/�TR,e2

Total number of raters participating in the study (Nr)

3 5 10 15 30 50 150 300

Bias in estimate of �R2

35/55/10 .012 �.002 �.009 �.009 .010 �.008 �.002 .00435/45/20 .007 .006 .002 .003 �.004 �.004 .002 �.00235/35/30 .000 .003 .003 .001 �.002 �.004 .002 .00135/25/40 .001 �.001 .001 �.005 .001 �.001 �.002 .00035/15/50 �.005 .000 .000 .000 .001 .000 �.003 �.00135/05/60 .000 .002 .001 .000 .000 .000 .007 .01250/45/05 �.009 �.003 �.004 �.010 �.001 .002 .002 �.00250/35/15 �.010 .004 .000 .002 .000 �.002 .000 .00050/25/25 .004 �.002 .000 �.006 .001 .002 �.004 �.00250/15/35 .004 �.003 .002 .000 .001 .000 .001 �.00250/05/45 �.001 .000 .000 .001 .001 .001 .004 .00375/20/05 �.009 .001 �.008 �.001 .002 .000 .001 .00075/15/10 �.004 .008 .001 .003 .001 .004 �.001 .00175/10/15 .000 .004 .000 .001 .001 .001 .000 �.00275/05/20 .002 .001 �.001 �.001 .000 .001 .001 .002

Sampling variability in estimate of �R2

35/55/10 .549 .385 .254 .201 .152 .123 .085 .08035/45/20 .457 .331 .224 .177 .125 .102 .082 .08735/35/30 .358 .253 .176 .142 .103 .088 .078 .08435/25/40 .253 .180 .127 .100 .086 .073 .071 .08435/15/50 .159 .116 .083 .073 .060 .058 .064 .08035/05/60 .057 .046 .036 .033 .035 .036 .049 .06150/45/05 .436 .332 .213 .178 .117 .097 .069 .06450/35/15 .366 .238 .173 .147 .099 .082 .065 .07150/25/25 .258 .172 .124 .101 .079 .068 .063 .07150/15/35 .152 .109 .079 .067 .056 .052 .057 .06750/05/45 .055 .043 .035 .033 .030 .034 .045 .05575/20/05 .195 .141 .095 .078 .055 .045 .036 .03875/15/10 .149 .112 .075 .061 .045 .039 .036 .04075/10/15 .107 .075 .051 .044 .035 .033 .035 .04075/05/20 .054 .039 .029 .026 .024 .024 .030 .036

Note. Cell values reflect bias and sampling variability of estimates of �R2

(i.e., the standard deviation of �R2 estimates across samples) for the simu-

lation condition defined by the given variance component profile � totalnumber of raters participating in the study (Nr) combination. �T

2 � rateemain effect (true score) variance component; �R

2 � rater main effectvariance component; �TR,e

2 � ratee � rater interaction effect and residualerror variance component.

977ILL-STRUCTURED MEASUREMENT DESIGNS

ing variance attributable to rater main effects in ISMDs by capi-talizing on advances in random effects modeling. The results ofthis study provide several key insights and have implications forexisting and future research.

First, the seriousness of the first issue we raised regarding thewithin-sample variability of Pearson r-based estimates of interraterreliability appears to be largely dependent on ratee sample size.With sufficiently large samples of ratees, it appears that arbitraryassignment of raters to columns within ratees to permit calculationof Pearson r will result in little within-sample variability in esti-mates. Nevertheless, when researchers are dealing with fairlysmall sample sizes (e.g., 50 ratees or fewer) and ratings that appearto be fairly unreliable, they should be wary of the potential forwithin-sample variability in Pearson r as a function of how theyconstruct their analysis file. Of course, researchers can avoidwithin-sample variability altogether by using any of the otherapproaches to interrater reliability estimation discussed in thisarticle when faced with ISMDs. ICC(1,k) is invariant to how ratersare “labeled” within ratees, and G(q,k) explicitly models the ef-fects associated with unique individual raters—assuming that rat-ers are labeled in a manner that uniquely identifies them.

With regard to the bias and precision of Pearson r and ICC-based estimators, we found these estimators to be negativelybiased under several ISMD conditions. The negative bias tended toincrease with increases in the (a) overlap between the sets of ratersthat rated each ratee, and (b) ratio of rater main effect variance totrue score variance (i.e,. �R

2/�T2). In contrast, G(q,k) exhibited little

(�2.0%) or no bias regardless of ISMD conditions. Despite thesepositive findings for G(q,k), all reliability estimators examinedappeared to be relatively free of bias when there was little overlapbetween the sets of raters who rated each ratee (e.g., when Nr 30and k � 2). Such a finding might lead one to conclude that G(q,k)offers little advantage over more traditional estimators under suchconditions. We believe, however, that results presented in Table12, as well as results from Tables 10 and 11, contradict such aconclusion.

One side benefit of calculating G(q,k) compared with Pearson rand ICC(1,k) is that it allows researchers to distinguish betweentwo sources of error: (a) rater main effects (e.g., leniency/severitydifferences) and (b) Ratee � Rater interaction effects and residualerror. Differentiating between these effects is of theoretical value(Hoyt, 2000). Second, even though there was little difference in therelative bias of estimators for conditions where there was littleoverlap between the sets of raters who rated each ratee, G(q,k) stilloffered an advantage over Pearson r and ICC(1,k) in terms ofprecision in these conditions—particularly for estimates of two-rater reliability (see Table 11). These findings, combined with theease with which G(q,k) and its components can be calculated byusing the program we have provided as supplemental online ma-terial, make G(q,k) an attractive option relative to traditionalmethods for estimating interrater reliability of ratings that arisefrom ISMDs.

In addition to having implications for studies that allow for localestimation of interrater reliability, the results also have implica-tions for researchers attempting to apply meta-analytic estimates ofinterrater reliability to data collected locally. Researchers haveoften acknowledged that one frequently encounters designs wherethere is but one rater per ratee (Murphy & DeShon, 2000; Schmidtet al., 2000). In this case, common practice is to apply meta-

analytic estimates from previous work to estimate reliability of theratings (e.g., Conway et al., 1995; Viswesvaran et al., 1996). Nometa-analysis to date, however, has taken into account the impli-cations ISMDs have for interrater reliability estimation. This, inturn, has implications for how accurate a particular meta-analyticestimate will be for the local study.

For example, consider a situation in which the meta-analyticestimate of single-rater reliability is .40 and we have three localstudies, each with 50 raters and only one rating per ratee. In StudyA, 2 raters each rate 25 ratees. In Study B, 5 raters each rate 10ratees. In Study C, 25 raters each rate 2 ratees. Researchers in eachof the three studies use the meta-analytic estimate of .40 toestimate single-rater reliability for ratings gathered in their study.Assume that the same variance components underlie ratings gath-ered in all three studies: �T

2 � .40, �R2 � .30, and �TR,e

2 � .30. Theexpected single-rater reliability based on G(q,1) would be .47 inStudy A (q � .50), .43 in Study B (q � .80), and .40 in Study C(q � .96; based on Equation 4). By applying the meta-analyticestimate of .40 to each study, interrater reliability would be un-derestimated by 15.0% in Study A, 6.0% in Study B, and wouldprovide a relatively unbiased estimate in Study C (without round-ing, the underestimate is 1.2%). If meta-analytic estimates ofvariance components underlying G(q,k) currently existed in theliterature (e.g., Hoyt & Kerns, 1999), then a meta-analytic estimateof G(q,1) could be calculated for the three studies above bycalculating the q multiplier for the local study and using it incombination with the meta-analytic variance component estimatesto solve Equation 4. The point of this example is that researchersneed to be cognizant that current meta-analytic estimates of inter-rater reliability are averages across studies where the q multiplierlikely varied, and, as such, these may not provide an accurateestimate of reliability for the ratings arising from the type of designobserved in their local study. As noted above, however, meta-analytically derived variance components can potentially be usefulfor estimating interrater reliability using G(q,k) if locally availabledata do not support their estimation.

The simulation results have other practical implications. Forexample, when assessment centers or interviews are the basis forselection or promotion decisions, banding is sometimes used togroup applicants into categories based on their performance on thegiven assessment (Campion et al., 2001). Under one common formof banding, the width of the bands is a function of the standarderror of the difference (SED) between assessment scores, which inturn is a function of the reliability of the assessment (Cascio,Outtz, Zedeck, & Goldstein, 1991). To the extent that ISMDsunderlie ratings gathered on such assessments, and standard prac-tices are followed for estimating the reliability of ratings for theassessment, SED-based score bands will tend to be too wide, asreliability will tend to be underestimated. Our results indicate thiswill be particularly true for small-scale assessment programswhere the total number of participating raters can be small. Makinga band wider than it should be would be detrimental for theorganization hiring individuals by using a top down process be-cause it increases the chance that individuals who are substantivelydifferent on the assessment of interest would be treated similarly.

Finally, another important outcome of this study is that itcounters claims in past research suggesting that the sparsely pop-ulated nature of ISMDs eliminates researchers’ ability to uniquelydistinguish the contribution of rater main effects and residual

978 PUTKA, LE, MCCLOY, AND DIAZ

variance to observed score variance (Schmidt et al., 2000). Thesimulation results suggest that it is quite possible. Regardless ofISMD condition examined, REML variance component estimateswere unbiased, and the precision with which they were estimatedincreased as the total number of raters participating in the simu-lated study increased. Aside from the benefit this offers for calcu-lating G(q,k), differentiating these effects from other sources oferror is important theoretically in that it will allow researchers todetermine whether the error in their ratings is arising more fromdifferences in rater leniency/severity (rater main effect variance) orfrom a combination of raters’ idiosyncratic perceptions of a rateeand random response error (Ratee � Rater interaction and residualvariance; Hoyt & Kerns, 1999).

Such findings suggest that it might be worth revisiting large datasets and attempting to fit the types of random effects models wedescribed here to partition error into rater main effect and residualcomponents. Reanalyzing ratings data from the Project A database(Campbell, 1990), or other large sets of ratings data (e.g., Knapp,McCloy, & Heffner, 2004; Knapp & Tremble, 2007, Mount etal.,1998; Rothstein, 1990), would go quite a way toward establish-ing a meta-analytic database of variance component estimates thatare specific to the types of ratings used by organizational research-ers and practitioners. As noted above, such a database could thenbe used to support the calculation of G(q,k) in primary studies thatpreclude its estimation on locally available data.

Suggestions for Future Research

Though the current study contributes to the organizational re-search literature in several ways, it is just a first step in consideringthe implications of ISMDs for interrater reliability estimation. Inlight of the novelty of the material presented herein, we focusedour efforts on a relatively well defined set of simulation conditions.Future research efforts could extend our work in several ways.

First, we suggest that researchers extend the present work byexamining ISMD conditions that involve more than two raters perratee and differing numbers of raters per ratee. We have frequentlyencountered ISMDs in which ratees have different numbers ofraters. When a Pearson r approach to estimating interrater reliabil-ity for such designs is employed by randomly selecting two of theraters for each ratee and then randomly assigning them to be“Rater 1” or “Rater 2” for purposes of calculating a correlation, wehave observed far more “within-sample” variability in estimatesthan what we observed in the present study (Knapp & Tremble,2007). In other words, we have found that which two raters areselected for each ratee for purposes of calculating Pearson r hasmuch more of an effect on the resulting correlation than how thosetwo raters are assigned to columns within ratees. Indeed, theformer variability is akin to the variability seen across split-halfreliability estimates—even though such estimates are based on thesame underlying set of data, one can observe quite differentestimates depending on the split half one takes. Future work canexamine the magnitude of the relative advantage that G(q,k) offersover Pearson r under such ISMD conditions.

Second, future research should investigate the effects of allow-ing non-independence between different types of score effects(e.g., between ti and trij; Murphy & DeShon, 2000). In the presentstudy, we explored the effects of non-independence of residualeffects across ratees. This was in contrast to much past research

regarding the robustness of reliability estimators to non-independence, which has primarily focused on the implications ofnon-independence of residuals within ratees that perhaps arisesfrom the presence of an additional factor (other than true score) orfrom temporal ordering of the data (e.g., Bost, 1995; Maxwell,1968). Although the issue of non-independence between differenttypes of score effects may not be an issue for all types of ratingsdata, Murphy and DeShon (2000) have made a case for suchnon-independence when job performance ratings are gathered foradministrative purposes. Their work suggests that such non-independence would likely downwardly bias estimates of interraterreliability based on Pearson r. Given that the type of non-independence they discussed (i.e., non-independence between dif-ferent score effects) and the type of non-independence we dis-cussed in this study (i.e., non-independence of residual effectsacross ratees) are different, the simultaneous presence of bothtypes of non-independence would likely produce even more neg-ative bias in Pearson r-based interrater reliability estimates thanpresented here. Future research could examine this possibility andalso explore how the type of non-independence discussed byMurphy and DeShon affects both the interpretation of, and qualityof estimation provided by, G(q,k).

Our third suggestion regards future reporting standards for re-search involving ratings. While reviewing the organizational re-search literature for this effort, we struggled to discern the exactnature of the measurement design used in nearly all studies.Indeed, previous authors have lamented that organizational re-searchers typically have not elaborated on their measurement de-signs (Scullen et al., 2000). Our study reinforces the notion thatsuch lack of detail is unfortunate because it limits the ability ofreviewers and future researchers to evaluate the appropriateness ofmethods that were used to analyze data in these studies. Absentsuch information, consumers of research could easily (a) assumethat the measurement design used in a study mimics those com-monly discussed in psychometric texts and monographs or, equallyas hazardous, (b) conclude that measurement design has littleimplication for the analytic techniques used. We hope this articleleads future researchers to be more wary of data arising fromISMDs and offer more detail when describing their designs.

At a minimum, we suggest that future researchers report thefollowing features of their measurement design any time theirratings data deviate from either a fully crossed or fully nesteddesign: (a) number of ratees sampled for the study, (b) totalnumber of raters participating in the study, (c) the number of ratersper ratee (or harmonic mean number of raters per ratee in the caseof unequal numbers of raters per ratee), and (d) the q multiplier fortheir design. Having information on the number of raters per rateeas well as the q multiplier for the design will allow future research-ers to gauge the amount of overlap between the sets of raters thatrated each ratee. All of this information can be easily generated byusing the program we provided as part of the supplemental onlinematerial. In addition, researchers should note any factors unique totheir assessment situation that resulted in some particular syste-maticity in their measurement design (e.g., entire blocks of can-didates being rated by the same rater pair as a function of the dayon which an assessment was conducted). Until such details beginto appear in the published literature, it will limit future researchers’ability to aggregate and interpret interrater reliability estimates in

979ILL-STRUCTURED MEASUREMENT DESIGNS

light of uncertainty regarding the measurement designs that gaverise to those estimates (Scullen et al., 2000).

Finally, if researchers report G(q,k) for their study, they shouldalso report estimates for the variance components underlyingG(q,k). Again, this information is easily generated by using theprogram we provided as part of the supplemental online materialand would contribute to the construction of a meta-analytic data-base of variance components which, as noted earlier, could be usedto calculate G(q,k) in primary studies where locally available datapreclude its estimation.

Closing Thoughts

This article highlights how ISMDs can present organizationalresearchers and practitioners with situations for which traditionalreliability estimators are either inappropriate or do not provide themost accurate result. Unfortunately, such problems appear to havebeen masked by verbiage in the existing organizational researchliterature that reinforces the notion that single-facet measurementdesigns are either crossed or nested. This article makes clear thatISMDs fall between these two design extremes (crossed andnested) and, though pervasive in research and practice, representan area that has seen little exploration in the organizational liter-ature. To our knowledge, there has been no published piece oforganizational literature that has delineated the problems ISMDscreate for commonly used approaches to interrater reliability esti-mation as we do here. Although such problems have been alludedto in previous presentations (e.g., McCloy & Putka, 2004), therehas been little substantive discussion and empirical evaluation of aviable alternative for estimating reliability under ISMDs. Thepresent article fills both voids.

In the interest of advancing organizational research and practice,we implore researchers and practitioners to (a) be more explicitabout the measurement design underlying their ratings data; (b) becognizant of the effects that something as seemingly pedestrian ashow they build their data files could have on their results; and (c)let that cognizance be reflected in how they code their data,describe their measurement designs, perform their analyses, andinterpret and present their results. We look forward to future workin this area.

References

American Educational Research Association, American Psychological As-sociation, & National Council on Measurement in Education. (1999).Standards for educational and psychological testing. Washington, DC:American Educational Research Association.

Bates, D. (2007). Linear mixed model implementation in lme4. RetrievedAugust 3, 2007, from http://cran.cnr.berkeley.edu/doc/vignettes/lme4/Implementation.pdf

Bliese, P. D. (2002). Multilevel random coefficient modeling in organiza-tional research. In F. Drasgow & N. Schmitt (Eds.), Measuring andanalyzing behavior in organizations (pp. 401–445). San Francisco:Jossey-Bass.

Bost, J. E. (1995). The effects of correlated errors on generalizability anddependability coefficients. Applied Psychological Measurement, 19,191–203.

Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.

Campbell, J. P. (Ed.). (1990). Project A: The U.S. Army selection andclassification project [Special issue]. Personnel Psychology, 43(2).

Campion, M. A., Campion, J. E., & Hudson, J. P. (1994). Structuredinterviewing: A note on incremental validity and alternative questiontypes. Journal of Applied Psychology, 79, 998–1002.

Campion, M. A., Outtz, J. L., Zedeck, S., Schmidt, F. L., Kehoe, J. F.,Murphy, K. R., & Guion, R. M. (2001). The controversy over scorebanding in personnel selection: Answers to 10 key questions. PersonnelPsychology, 54, 149–185.

Cascio, W. F., Outtz, J., Zedeck, S., & Goldstein, I. L. (1991). Statisticalimplications of six methods of score use in personnel selection. HumanPerformance, 4, 233–264.

Conway, J. M., Jako, R. A., & Goodman, D. F. (1995). A meta-analysis ofinterrater and internal consistency reliability of selection interviews.Journal of Applied Psychology, 80, 565–579.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Designand analysis issues for field settings. Chicago: Rand McNally.

Crocker, L., & Algina, L. (1986). Introduction to classical and modern testtheory. New York: Holt, Rinehart, and Winston.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). Thedependability of behavioral measurements: Theory of generalizabilityfor scores and profiles. New York: Wiley.

DeShon, R. P. (2002). Generalizability theory. In F. Drasgow & N. Schmitt(Eds.), Measuring and analyzing behavior in organizations (pp. 189–220). San Francisco: Jossey-Bass.

Enders, C. K. (2003). Using the expectation maximization algorithm toestimate coefficient alpha for scales with item-level missing data. Psy-chological Methods, 8, 322–337.

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.),Educational measurement (3rd ed., pp. 105–146). New York: AmericanCouncil on Education and Macmillan.

Greguras, G. J., & Robie, C. (1998). A new look at within-source interraterreliability of 360 degree feedback ratings. Journal of Applied Psychol-ogy, 83, 960–968.

Guilford, J. P. (1954). Psychometric methods (2nd ed.). New York:McGraw Hill.

Harris, M. M., Becker, A. S., & Smith, D. E. (1993). Does the assessmentcenter scoring method affect the cross-situational consistency of ratings?Journal of Applied Psychology, 78, 675–678.

Hoyt, W. T. (2000). Rater bias in psychological research: When is it aproblem and what can we do about it? Psychological Methods, 5,64–86.

Hoyt, W. T., & Kerns, M. (1999). Magnitude and moderators of bias inobserver ratings: A meta-analysis. Psychological Methods, 4, 403–424.

Kenny, D. A., & Judd, C. M. (1986). Consequences of violating theindependence assumption in analysis of variance. Psychological Bulle-tin, 99, 422–431.

Knapp, D. J., McCloy, R. A., & Heffner, T. (Eds.). (2004). Validation ofmeasures designed to maximize 21st-century Army NCO performance(Technical Report 1145). Arlington, VA: U.S. Army Research Institutefor the Behavioral and Social Sciences.

Knapp, D. J., & Tremble, T. R. (Eds.). (2007). Concurrent validation ofexperimental Army enlisted personnel selection and classification mea-sures (Technical Report 1205). Arlington, VA: U.S. Army ResearchInstitute for the Behavioral and Social Sciences.

Lance, C. E., Foster, M. R., Gentry, W. A., & Thoresen, J. D. (2004).Assessor cognitive processes in an operational assessment center. Jour-nal of Applied Psychology, 89, 22–35.

Landy, F. J., & Farr, J. L. (1980). Performance rating. PsychologicalBulletin, 87, 72–107.

Latham, G. P., Saari, L. M., Pursell, E. D., & Campion, M. A. (1980). Thesituational interview. Journal of Applied Psychology, 65, 422–427.

LeBreton, J. M., Burgess, J. R. D., Kaiser, R. B., Atchley, E. K., & James,L. R. (2003). The restriction of variance hypothesis and interrater reli-ability and agreement: Are ratings from multiple sources really dissim-ilar. Organizational Research Methods, 6, 80–128.

980 PUTKA, LE, MCCLOY, AND DIAZ

Littell, R. C., Milliken, G. A., Stroup, W. W., & Wolfinger, R. D. (1996).SAS system for mixed models. Cary, NC: SAS Institute Inc.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental testscores. Reading, MA: Addison-Wesley.

Marcoulides, G. A. (1990). An alternative method for estimating variancecomponents in generalizability theory. Psychological Reports, 66, 102–109.

Maxwell, A. E. (1968). The effect of correlated error on reliability coef-ficients. Educational and Psychological Measurement, 28, 803–811.

McCloy, R. A., & Putka, D. J. (2004, April). Estimating interrater reli-ability: Conquering the messiness of real-world data. Master tutorialconducted at the 19th annual Society for Industrial and OrganizationalPsychology conference, Chicago, IL.

McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ:Erlbaum.

McGraw, K. O., & Wong, S. P. (1996). Forming inferences about someintraclass correlation coefficients. Psychological Methods, 1, 30–46.

Miller, I., & Miller, M. (1999). John E. Freund’s mathematical statistics(6th ed.). Saddle River, NJ: Prentice Hall.

Mount, M. K., Judge, T. A., Scullen, S. E., Sytsma, M. R., & Hezlett, S. A.(1998). Trait rater and level effects in 360-degree performance ratings.Personnel Psychology, 51, 557–576.

Murphy, K. R., & DeShon, R. (2000). Interrater correlations do notestimate the reliability of job performance ratings. Personnel Psychol-ogy, 53, 873–900.

Nathan, B. R., & Tippins, N. (1990). The consequences of halo “error” inperformance ratings: A field study of the moderating effect of halo ontest validation results. Journal of Applied Psychology, 75, 290–296.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.).New York: McGraw Hill.

Pulakos, E. D., & Schmitt, N. (1995). Experience-based and situationalinterview questions: Studies of validity. Personnel Psychology, 48,289–308.

R Development Core Team. (2007). R Language Definition, Version 2.5.1(Draft). Retrieved August 8, 2007, from http://www.r-project.org

Rothstein, H. R. (1990). Interrater reliability of job performance ratings:Growth to asymptote level with increasing opportunity to observe.Journal of Applied Psychology, 75, 322–327.

Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings:Assessing the psychometric quality of rating data. Psychological Bulle-tin, 88, 413–428.

Sackett, P. R., & Dreher, G. F. (1982). Constructs and assessment centerdimensions: Some troubling empirical findings. Journal of AppliedPsychology, 67, 401–410.

Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychologicalresearch: Lessons from 26 research scenarios. Psychological Methods, 1,199–223.

Schmidt, F. L., Viswesvaran, C., & Ones, D. S. (2000). Reliability is notvalidity and validity is not reliability. Personnel Psychology, 53, 901–912.

Scullen, S. E., Mount, M. K., & Goff, M. (2000). Understanding the latentstructure of job performance ratings. Journal of Applied Psychology, 85,956–970.

Searle, S. R., Casella, G., & McCulloch, C. E. (1992). Variance compo-nents. New York: Wiley.

Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer.Newbury Park, CA: Sage.

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses inassessing rater reliability. Psychological Bulletin, 86, 420–428.

Society for Industrial and Organizational Psychology. (2003). Principlesfor the validation and use of personnel selection procedures (4th ed.).Bowling Green, OH: Author.

Sussmann, M., & Robertson, D. U. (1986). The validity of validity: Ananalysis of validation study designs. Journal of Applied Psychology, 71,461–468.

Tsacoumis, S. (2007). Assessment centers. In D. L. Whetzel & G. R.Wheaton (Eds.), Applied measurement: Industrial psychology in humanresources management (pp. 259–292). Mahwah, NJ: Erlbaum.

Tziner, A., & Dolan, S. (1982). Validity of an assessment center foridentifying future female officers in the military. Journal of AppliedPsychology, 67, 728–736.

Van Iddekinge, C. H., Raymark, P. H., Eidson, C. E., & Attenweiler, W. J.(2004). What do structured selection interviews really measure? Theconstruct validity of behavior description interviews. Human Perfor-mance, 17, 71–93.

Viswesvaran, C., Ones, D. S., & Schmidt, F. L. (1996). Comparativeanalysis of the reliability of job performance ratings. Journal of AppliedPsychology, 81, 557–574.

Viswesvaran, C., Schmidt, F. L., & Ones, D. S. (2005). Is there a generalfactor in ratings of job performance? A meta-analytic framework fordisentangling substantive and error influences. Journal of Applied Psy-chology, 90, 108–131.

Winer, B. J. (1971). Statistical principles in experimental design (2nd ed.).New York: McGraw Hill.

Received September 8, 2006Revision received February 25, 2008

Accepted February 26, 2008 �

981ILL-STRUCTURED MEASUREMENT DESIGNS