Peer Assessment in the Digital Age: A Meta-Analysis Comparing Peer and Teacher Ratings
Hongli Li, Georgia State University
Yao Xiong, Xiaojiao Zang, Mindy Kornhaber, Youngsun Lyu, Kyung Sun Chung, Hoi Suen,
The Pennsylvania State University
Correspondence should be addressed to Hongli Li, Department of Educational Policy Studies,
Georgia State University, P. O. Box 3977, Atlanta GA 30303. Email: hli24@gsu.edu.
Paper presented at the 2014 Annual Meeting of the American Educational Research Association
(AERA), Philadelphia, PA.
Abstract
Given the wide use of peer assessment, especially in higher education, the relative
accuracy of peer ratings compared to teacher ratings is a major concern for both educators and
researchers. This concern has grown with the increasing use of peer assessment on digital platforms. In
this meta-analysis, using a variance-known hierarchical linear modeling approach, we synthesize
findings from studies on peer assessment since 1999 when computer-assisted peer assessment
started to proliferate. The estimated average Pearson correlation between peer and teacher ratings
is found to be .63, which is moderately strong. This correlation is significantly higher when (a)
the peer assessment is paper-based rather than computer-assisted; (b) the subject area is not
medical/clinical; (c) the course is graduate level rather than undergraduate or K-12; (d)
individual work instead of group work is assessed; (e) the assessors and assessees are matched at
random; (f) the peer assessment is voluntary instead of compulsory; (g) the peer assessment is
non-anonymous; (h) peer raters provide both scores and qualitative comments instead of only
scores; and (i) peer raters are involved in developing the rating criteria. The findings are
expected to inform practitioners regarding peer assessment practices that are more likely to
exhibit better agreement with teacher assessment.
Keywords: Peer assessment, meta-analysis
Introduction
Peer assessment has been increasingly used to reduce teachers’ workload and to promote
student learning. Based on a meta-analysis of 48 studies published between 1959 and 1999,
Falchikov and Goldfinch (2000) found the weighted correlation coefficient between peer ratings
and teacher ratings to be .69, which is moderately strong. Given the rise of digital and blended
courses, the use of computer-assisted peer assessment has grown remarkably in recent years
(Topping, 1998). However, among the 48 studies included in Falchikov and Goldfinch’s meta-
analysis, very few involved computer-assisted peer assessment. Thus, they did not draw any
conclusion regarding the agreement between peer and teacher ratings in computer-assisted peer
assessment.
It is, therefore, necessary to synthesize more recent studies on peer assessment in order to
understand how well peer ratings correlate with teacher ratings in the current digital age. Also,
given the diversity of peer assessment practices (Gielen, Dochy, & Onghena, 2011), it is
important to understand which factors are likely to influence the agreement between peer and
teacher ratings. The purpose of the present study is to conduct a meta-analysis on the agreement
between peer and teacher ratings based on studies published since 1999. Specifically, two
research questions are addressed: (1) What is the agreement between peer and teacher ratings?
(2) Which factors are likely to influence this agreement?
Literature Review
Peer assessment is defined as “an arrangement in which individuals consider the amount,
level, value, worth, quality or success of the products or outcomes of learning of peers of similar
status” (Topping, 1998, p. 250). In addition to increasing teachers’ efficiency in grading, peer
assessment is advocated as an effective pedagogical strategy for enhancing learning. For
instance, peer assessment has been found to increase students’ engagement (Bloxham & West,
2004), promote students’ critical thinking (Sims, 1989), and increase students’ motivation to
learn (Topping, 2005). Additionally, peer assessment can function as a formative pedagogical
tool (Topping, 2009) or a summative assessment tool (Tsai, 2013). However, despite the great
potential of peer assessment, researchers and practitioners remain concerned about whether
students have the ability to assign reliable and valid ratings to their peers’ work (Liu & Carless,
2006). Specifically, reliability refers to inter-rater consistency across peer raters, whereas validity
refers to the consistency between peer ratings and teacher ratings, assuming teacher ratings to be
the gold standard (Falchikov & Goldfinch, 2000). Validity is a major concern for teachers who
are interested in using peer assessment and is also the primary focus of the present meta-analysis.
In this meta-analysis we use the term “teacher ratings” to be consistent with Falchikov and
Goldfinch (2000), whereas an alternative term used in the literature is “expert ratings.” A
“teacher” could refer to a classroom instructor, a teaching assistant, or a supervisor.
Previous studies on peer assessment showed inconsistent levels of agreement between
peer and teacher ratings. For instance, some studies found low correlations between peer and
teacher ratings, such as .29 in Kovach, Resch, and Verhulst (2009). Others, such as Cho, Schunn,
and Wilson (2006), reported a moderate agreement of .60 between peer and teacher ratings. Yet
others found high agreement between peer and teacher ratings, for example, .97 and .98 in Harris
(2011). In this study, a group of undergraduate students graded their peers’ scientific laboratory
reports. Students had to sign the reports that they assessed, and they would be penalized if there
was a serious discrepancy between their ratings and the teacher ratings. Harris attributed the high
agreement to a few factors, such as the well-structured rating task, the fairly constrained rating
schedule, and the sufficient guidance provided to peer raters. Falchikov and Goldfinch’s meta-
analysis provided a synthesis of the agreement between peer and teacher ratings for studies
conducted from 1959 to 1999. However, it is not clear whether studies published since 1999
would generate a similar result, particularly given that peer assessment is increasingly mediated
by computer technology.
In their meta-analysis, Falchikov and Goldfinch (2000) further investigated the influence
of various factors on the agreement between peer and teacher ratings. Among the major factors
they explored were subject area, quality of study, number of peer raters, level of course, nature of
assessment task, and dimensional versus global judgment. For instance, they found that
global judgments with clearly stated criteria were superior to judgments on separate dimensions.
However, many of their findings were inconclusive, depending on whether they included or
excluded a study by Burnett and Cavaye (1980). Burnett and Cavaye reported a correlation of .99
between peer ratings and grades given by teachers. When this study was omitted from the meta-analysis, the number of peer raters, whether the courses were at an advanced level, and whether the
courses were about social sciences became statistically non-significant, but the quality of studies
became statistically significant. Given the large influence of this study, Falchikov and Goldfinch
provided different findings depending on whether the Burnett and Cavaye study was included in
the meta-analysis or not. This inconclusiveness, alongside the growing use of computer-assisted
peer assessment, necessitates the present meta-analysis, in which we examine the influence of
different factors on the agreement between peer and teacher ratings in the digital age.
Four reviews shaped the factors included in the present meta-analysis. Topping (1998)
proposed a typology of 17 variables to help describe the diversity of peer assessment activities.
van den Berg, Admiraal, and Pilot (2006) and van Gennip, Segers, and Tillema (2009)
reorganized these 17 variables. Gielen et al. (2011) further added three more variables to
Topping’s list and then classified the 20 variables into five categories: decisions concerning the
use of peer assessment, the link between peer assessment and other elements in the learning
environment, interactions between peers, the composition of assessment groups, and the
management of the assessment procedure. Based on these four reviews, as shown in Table 1, we
list 17 variables that might influence the agreement between peer and teacher ratings. These
variables were classified into two categories, those related to peer assessment settings and those
related to peer assessment procedures.
Regarding peer assessment settings, a primary factor to be considered is the mode of peer
assessment, i.e., whether the assessment is paper-based or computer-assisted. Computer-assisted
peer assessment is advocated as having some advantages over traditional paper-based peer
assessment: (a) online environments ensure anonymity and promote fair assessment without
being influenced by “friendship bias” (Lin, Liu, & Yuan, 2001; Wen & Tsai, 2008); (b)
computer assistance enhances efficiency especially for teachers in large classes; and (c)
assessment can be performed freely without time and location restrictions (Wen & Tsai, 2008).
In addition to peer assessment mode, other important variables include the subject area
and the task being assessed (Falchikov & Boud, 1989; Falchikov & Goldfinch, 2000).
Furthermore, the level of the courses in which the peer assessment occurs is worth considering.
Falchikov and Goldfinch’s meta-analysis included only peer assessment in higher education
because few studies had been conducted at the K-12 level then. In the present meta-analysis, we
examine whether the agreement between peer and teacher ratings varies across courses at the K-
12, undergraduate, and graduate levels.
Regarding peer assessment procedures, one factor is the constellation of assessors and the
constellation of assessees (i.e., those who are assessed). For instance, assessors and assessees can
be individuals, pairs, or groups (van Gennip et al., 2009). Other factors are the number of peer
raters per assignment and the number of teachers per assignment (Falchikov & Goldfinch, 2000).
In addition, because the social context of peer assessment may introduce pressure, risk, or
competition among peers (Falchikov, 1995), it is important to examine whether assessors and
assessees are matched at random or not.
Another potential factor is whether peer assessment is compulsory or voluntary (Topping,
2005). Some studies entailed compulsory participation in peer assessment (e.g., Bouzidi &
Jaillet, 2009), whereas in others, participants were self-selected (e.g., Hafner & Hafner, 2003).
Additionally, friendships among peers may result in scoring bias (Magin, 2001). Thus, it is
necessary to examine whether the peer assessment is anonymous or not. Furthermore, feedback
format may be an influential factor (Falchikov, 1995). Sometimes, peer raters provided only
scores (e.g., Sealey, 2013), whereas at other times they provided both scores and qualitative
comments (Liang & Tsai, 2010).
Finally, establishing high-quality peer assessment requires organizing, training, and
monitoring peer raters. Rating quality is said to improve when peer assessments are supported by
training, checklists, exemplification, teacher assistance, and monitoring (Berg, 1999; Miller,
2003; Pond, Ul-Haq, & Wade, 1995). As discussed by Falchikov and Goldfinch (2000), peer
raters’ familiarity with and ownership of assessment criteria tend to improve the accuracy of peer
assessment. Therefore, the present meta-analysis also includes three variables to reflect whether
peer raters receive training, whether explicit rating criteria are used, and whether peer raters are
involved in developing rating criteria.
Methods
Selecting Studies and Coding Procedures
The following criteria were used to select studies for inclusion in the present meta-
analysis. First, we included only studies in which one or more numerical measures of the
agreement between peer and teacher ratings were presented or could be directly obtained from
information provided in the study. Typically, the agreement was measured by Pearson
correlation. Second, to address the file drawer problem, i.e., studies producing significant effects
are more likely to be published (Glass, 1977; Rosenthal, 1979), both published and unpublished
studies were considered. Third, only studies written in English and published (or released) since
1999 were considered. Finally, studies conducted in both educational and medical/clinical
settings were included.
The following procedures were used to search for eligible studies. First, using various
key words such as peer assessment, peer evaluation, peer rating, peer grading, peer scoring, and
peer feedback, we searched several well-known online databases (ERIC, PsycINFO, JSTOR, and
ProQuest). We also used the same key words to search Google Scholar for the most recent
published and unpublished studies not yet included in the databases. Second, we searched review
articles on peer assessment (e.g., Gielen et al., 2011; Speyer, Pilz, Kruis, & Brunings, 2011; van
Gennip et al., 2009) for relevant studies. Third, we reviewed references cited in the studies that
we had already determined to be eligible and added those that we had not already found through
other sources. Finally, we contacted scholars in the field to ask for recent studies that they may
have encountered but that had not yet been published. The initial search located 292 articles in
total. After carefully examining each article and applying the inclusion criteria, we found that 70
of them were eligible. In most cases, an article was excluded because it did not provide
numerical measures of the agreement between peer and teacher ratings. For example, a paper
authored by Senger and Kanthan (2012) showed the comparison between peer and teacher
ratings in graphs instead of numerical values and thus could not be included.
Many of the eligible studies involved more than one comparison (or effect size). For
instance, Liu, Lin, and Yuan (2002) reported the agreement between peer and teacher ratings for
six assignments separately, such that this article generated six comparisons (or effect sizes). The
multiple effect sizes within each study can be averaged, or a single effect size can be selected
from each study. Either approach, however, drastically reduces the number of effect sizes,
making it very challenging to study the effects of the comparison characteristics. Alternatively,
the dependence can be ignored when appropriate (Borenstein, Hedges, Higgins, & Rothstein,
2009). Because the samples used to calculate the six effect sizes in Liu et al. (2002) were
mutually exclusive, each comparison from that study was treated as a unit of analysis. Applying
the same reasoning throughout the screening process, we determined that 269 comparisons from 70 studies were eligible for
inclusion in the present meta-analysis.
Based on the literature discussed previously, a coding sheet was built after several rounds
of discussion among the authors, as shown in Table 1. The authors coded all the studies
collaboratively, and each article was coded by at least two coders in order to establish reliability.
A measure of inter-coder reliability, percentage agreement, was calculated for each coded
variable. The percentage agreement for variables related to peer assessment settings ranged from
95% to 100%, and the percentage agreement for variables related to peer assessment procedures
ranged from 80% to 100%. Disagreements about the assigned codes were resolved through
discussion among the coders until agreement was reached.
Table 1 presents details pertaining to the coded variables. Here are two examples.
Regarding subject area, given the small number of studies involving the arts, we collapsed social
science and the arts into one category. The subject area of science or engineering was coded as a
second category, and medical or clinical constituted a third category. Regarding task being rated,
two categories were established. Essays, reports, proposals, portfolios, design tasks, and exams
were coded as “writing or project.” Teaching demonstrations and musical or medical performances
were coded as “performance.” As shown in Table 1, categorical variables were dummy coded for
subsequent analyses. These variables were subsequently used as potential predictors to account
for variations across the 269 effect sizes.
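The dummy coding described above can be sketched as follows (a minimal Python illustration; the column names below are hypothetical, and the actual variable definitions are those in Table 1):

```python
# A minimal sketch of dummy coding a three-category predictor such as
# subject area, with "social science/arts" as the reference category
# (both dummies zero). The column names are illustrative, not the
# actual labels from the coding sheet.
subject_areas = ["social science/arts", "science/engineering",
                 "medical/clinical", "science/engineering"]

rows = []
for area in subject_areas:
    rows.append({
        "science_engineering": int(area == "science/engineering"),
        "medical_clinical": int(area == "medical/clinical"),
    })
# Each row now carries 0/1 indicators usable as predictors in the model.
```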
Insert Table 1 here
Data Analysis Procedure
Raudenbush and Bryk (1985, 2002) proposed a two-level variance-known hierarchical
linear modeling approach to meta-analysis. This approach focuses on discovering and explaining
variations in effect sizes, and groups of subjects are regarded as nested within the primary studies
included in the meta-analysis. The level-1 model investigates how effect sizes vary across
studies, whereas the level-2 model focuses on explaining the potential sources of this variation.
In particular, the level-2 model examines multiple predictors of effect sizes simultaneously. With
this approach, analysts are able to estimate the average effect size across a set of studies and test
hypotheses about the effects of study characteristics on study outcomes. This approach was
adopted for the current meta-analysis.
The meta-analysis was conducted following the variance-known hierarchical linear
modeling approach with the HLM 6.08 software (Raudenbush, Bryk, & Congdon, 2004). As
demonstrated in Raudenbush and Bryk (2002), when Pearson correlation is reported in the
studies, the correlation coefficient rj is transformed to the standardized effect size measure dj
(i.e., the Fisher’s z transformation):
dj = ½ ln[(1 + rj)/(1 - rj)].  [1]
The sampling variance of dj is approximately
Vj = 1/(nj - 3),  [2]
where
rj is the sample correlation between two variables observed in study j; and
nj is the sample size in study j.
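As a concrete illustration, Equations 1 and 2 can be computed as follows (a minimal Python sketch; the meta-analysis itself was run in the HLM software, and the sample size of 50 below is hypothetical):

```python
import math

def fisher_z(r):
    """Fisher's z transformation of a Pearson correlation r (Equation 1)."""
    return 0.5 * math.log((1 + r) / (1 - r))

def sampling_variance(n):
    """Approximate sampling variance of the transformed effect size (Equation 2)."""
    return 1 / (n - 3)

# A Pearson correlation of .63 corresponds to about .74 in the z metric.
d = fisher_z(0.63)
v = sampling_variance(50)  # hypothetical sample size of 50
```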
With the hierarchical linear modeling approach, the level-1 outcome variable in the meta-
analysis is dj, which is the Fisher’s z transformation of rj reported for each comparison. When
the variation in effect sizes is statistically significant, level-2 analysis is used to determine the
extent to which the predictors contribute to explaining that variation. As described by
Raudenbush and Bryk (2002), the level-1 model (often referred to as the unconditional model) is
dj = δj + ej [3]
where
δj is the true overall effect size across comparisons; and
ej is the sampling error associated with dj as an estimate of δj.
Here, we assume that ej ~ N(0, Vj).
In the level-2 model, the true population effect size, δj, depends on comparison
characteristics and a level-2 random error term:
δj = γ0 + γ1W1j + γ2W2j + … + γkWkj + µj [4]
where
W1j …Wkj are the predictors explaining δj across the effect sizes (see Table 1 for the list of
predictors);
γ0 is the expected overall effect size when all predictors are zero;
γ1 … γk are regression coefficients associated with the comparison characteristics W1 to
Wk; and
µj is a level-2 random error.
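The fixed part of Equations 3 and 4 can be illustrated with a precision-weighted least squares sketch. This is a simplification that omits the level-2 random error µj (i.e., a fixed-effect rather than a random-effects meta-regression), and the effect sizes, variances, and single dummy predictor below are hypothetical; the actual analysis used HLM 6.08:

```python
import numpy as np

# Hypothetical data: effect sizes d (Fisher's z), their known sampling
# variances V (Equation 2), and one dummy predictor W
# (e.g., computer-assisted = 1, paper-based = 0).
d = np.array([0.85, 0.60, 0.75, 0.50, 0.55])
V = np.array([1/47, 1/27, 1/37, 1/57, 1/17])
W = np.array([0, 1, 0, 1, 1])

# Precision-weighted least squares: weights are the inverse variances.
X = np.column_stack([np.ones_like(W), W])  # intercept (gamma_0) and slope (gamma_1)
w = 1 / V
XtWX = X.T @ (w[:, None] * X)
XtWd = X.T @ (w * d)
gamma = np.linalg.solve(XtWX, XtWd)
# gamma[0] estimates the effect size when W = 0; gamma[1], its shift when W = 1.
```

With a single dummy predictor, this reduces to the precision-weighted mean of each group, which makes the estimates easy to check by hand.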
Results
With the procedure described in the methods section, 269 comparisons from 70 studies
were included in the present meta-analysis. The distribution of the Pearson correlations is
illustrated in Figure 1. The correlations ranged from -.19 to .98, with a mean of .57 and a
standard deviation of .24. There were no obvious outliers beyond ±3 standard deviations. In
addition, Table 1 shows the predictors and the corresponding frequencies. The variable
“constellation of assessors” was not included as a predictor because only one comparison
involved assessors working in groups.
Insert Figure 1 Here
To begin, an intercept-only model was fit with no predictors included. The intercept (i.e.,
the estimated grand-mean effect size) was .74, which was statistically different from zero (t(268)
= 28.69, p < .001). This value indicates that on average the agreement between peer and teacher
ratings was .74 in the Fisher’s z metric, which is equivalent to .63 in the Pearson correlation
metric. Furthermore, the estimated variance of the effect size was .14, significantly different
from zero (χ2 = 2,471.35, df = 268, p < .001). This result suggests that variability existed in the
true effect sizes across the comparisons. Therefore, we proceeded to a level-2 conditional model
to determine which characteristics explain this variability.
multicollinearity among the 17 predictors using the diagnostics of Belsley, Kuh, and Welsch
(1980). The multicollinearity among these predictors did not appear to be serious, so all 17
predictors were entered into the model simultaneously. For model parsimony, the predictors that lacked statistical
significance at the .05 level were dropped from the model one at a time starting from the one
with the largest p value. Because assessment mode was our focal interest, this predictor was
always kept in the model. Among the listed predictors in Table 1, eight were dropped eventually,
including W2 (subject area is science/engineering), W4 (task is performance), W6 (level of course
is K-12), W8 (number of peer raters is between 6 and 10), W9 (number of peer raters is larger than
10), W10 (number of teachers per assignment), W15 (there are explicit rating criteria), and W17
(peer raters receive training). As a result, nine significant predictors were retained in the final
model. Below we report detailed results of the final model shown in Table 2. Specifically, we
describe the regression coefficient of each predictor in the Fisher’s z metric, i.e., how a
predictor influenced the correlation between peer and teacher ratings when all the other
predictors were controlled for in the model.
Insert Table 2 here
Among the variables related to peer assessment settings, the first significant factor was
the assessment mode. When the peer assessment was computer-assisted, the
correlation between peer and teacher ratings was significantly lower by .14 standard deviation
units than when the peer assessment was paper-based. Second, when the subject area was
medical/clinical compared to when the subject area was social science/arts, the correlation was
significantly lower by .35 standard deviation units. In the full model with all the predictors, the
correlation was slightly higher when the subject area was science/engineering than when it was
social science/arts, though the difference was not statistically significant. Finally, when the
course was graduate level, the correlation was significantly higher by .18 standard deviation
units than when the course was undergraduate level. Undergraduate and K-12 courses, however,
did not show any significant difference.
Among the variables related to peer assessment procedures, the first significant factor was
the constellation of assessees. When assessees were a group, the correlation between
peer and teacher ratings was significantly lower by .26 standard deviation units compared to
when assessees were individuals. Second, when assessors and assessees were not matched at
random, the correlation was significantly lower by .27 standard deviation units. Third, when peer
assessment was voluntary, the correlation was significantly higher by .28 standard deviation
units than when peer assessment was compulsory. Fourth, when peer raters were non-
anonymous, the correlation was significantly higher by .15 standard deviation units than when
peer raters were anonymous. Fifth, when peer raters provided both scores and comments, the
correlation was also significantly higher by .15 standard deviation units than when peer raters
provided only scores. Finally, when peer raters were involved in developing the rating criteria,
the correlation was significantly higher by .60 standard deviation units than when peer raters
were not involved.
In the final model, when all nine significant predictors were included, the estimated
variance of the effect sizes was .09, significantly different from zero (χ2 = 1623.73, df = 259, p
< .001). Using the variance component in the intercept-only model as the baseline (Raudenbush
& Bryk, 2002), we found that these nine predictors explained 34.30% of the variance in the
effect sizes. This indicates that other sources of variability still exist among the effect sizes
beyond what has been accounted for in this meta-analysis.
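The proportion of variance explained is computed from the level-2 variance components, using the unconditional model as the baseline (Raudenbush & Bryk, 2002). With the rounded components reported above, a minimal sketch looks like this:

```python
# Variance of the true effect sizes in the intercept-only (unconditional) model
var_unconditional = 0.14
# Residual variance after the nine predictors are added (conditional model)
var_conditional = 0.09

# Proportion of between-comparison variance explained by the predictors
explained = (var_unconditional - var_conditional) / var_unconditional
# With these rounded components the value is about .36; the reported 34.30%
# was computed from the unrounded variance estimates.
```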
Insert Table 3 here
Table 3 presents the estimated effect sizes in the Pearson correlation metric calculated
based on the final model. First, consider a default scenario in which all the predictors are zero,
i.e., assessment mode is paper-based; subject area is social science/arts; course is undergraduate
level; assessees are individuals; assessors and assessees are matched at random; peer assessment
is compulsory; peer raters are anonymous; only scores are provided; and peer raters are not
involved in developing criteria. In this default scenario, the estimated correlation in the Fisher’s z
metric is .69 (i.e., the intercept in the final model), which is equivalent to .60 in the Pearson
correlation metric. When assessment mode is computer-assisted and all the other predictors are
held at zero as described in the default scenario, the estimated correlation in the Fisher’s z metric
is .55 (i.e., .69 - .14), which is equivalent to .50 in the Pearson correlation metric. In a similar
way, we calculated the estimated Pearson correlations for other scenarios listed in Table 3. In
general, the estimated Pearson correlations between peer and teacher ratings vary substantially
across different conditions, ranging from .33 to .86.
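The scenario estimates in Table 3 follow from back-transforming the model-implied effect size from the Fisher's z metric to a Pearson correlation via the inverse transformation r = tanh(d). A minimal sketch for the two scenarios described above:

```python
import math

def z_to_r(d):
    """Invert Fisher's z back to a Pearson correlation: r = tanh(d)."""
    return math.tanh(d)

intercept = 0.69            # default scenario, all predictors zero
computer_assisted = -0.14   # coefficient for computer-assisted mode

r_default = z_to_r(intercept)                       # about .60
r_computer = z_to_r(intercept + computer_assisted)  # about .50
```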
Discussion
Correlation between Peer and Teacher Ratings
As shown in the intercept-only model, based on 269 effect sizes from 70 studies, the
estimated average Pearson correlation between peer and teacher ratings was .63. This is
significantly different from zero and moderately strong in a practical sense. This result does not
depart much from what was reported by Falchikov and Goldfinch (2000), in which the
weighted Pearson correlation between peer and teacher ratings was .69. In both the present meta-
analysis and the one conducted by Falchikov and Goldfinch, the correlations were adjusted based
on sample sizes, and thus the results are comparable. The present meta-analysis confirms that
peer ratings generally show a moderately high level of agreement with teacher ratings. It also
finds that the peer-teacher rating agreement based on studies since 1999 is only slightly lower
than that based on studies before 1999. At the same time, the present meta-analysis reveals
insights about factors influencing peer assessment conducted in the digital age, discussed below.
Factors Related to Peer Assessment Settings
The present meta-analysis reveals that computer-assisted peer assessment generates
significantly lower agreement between peer and teacher ratings than traditional paper-based peer
assessment (γ = -.14, t = -2.25, p < .05). Though computer-assisted peer assessment is seen as
more efficient (Lin et al., 2001; Wen & Tsai, 2008), peer raters in a computer-assisted
environment might perform worse, perhaps due to reduced attention, effort, or instructional
support (Suen, 2014). It is also the case that computer-assisted peer assessment is still in its
infancy and has thus yet to show its full potential. Furthermore, the “computer-assisted” mode could
cover a broad range regarding the extent to which the computer technology is used in peer
assessment. For example, in Lin et al. (2001), a sophisticated web-based peer assessment system,
named NetPeas, was used to administer the peer assessment tasks, whereas in Chen and Tsai
(2009), the peer raters mainly used computer technology for uploading and downloading peer
assessment materials. It would be desirable to further classify the computer-assisted mode into
more refined categories according to how much technology is used. However, many studies
included in the meta-analysis did not provide sufficient information on this point, so we retained
only a broad category of “computer-assisted” peer assessment. Further research is needed to study the
effective design and use of technologies to improve the accuracy of peer assessment in digital
environments.
When the subject area was medical/clinical compared to when it was social science/arts
or science/engineering, the correlation between peer and teacher ratings was significantly lower
(γ = -.35, t = -3.19, p < .001). This accords with Falchikov and Goldfinch (2000) in that peer
ratings in professional practice (e.g., clinical skills or teaching practice) were more problematic
than those in academic practices. The present meta-analysis also shows that the correlation was
slightly higher when the subject area was science/engineering compared to when it was social
science/arts although the difference was not statistically significant. A plausible reason is that the
science and engineering tasks are more likely to have clear-cut right or wrong answers, making
them easier for peers to assess. Determining the proper level of granularity in categorizing
subject areas has been a challenge and is always somewhat arbitrary. Having too many
categories will not only reduce the sample size of each category but also introduce too many
dummy-coded predictors for the analysis. We opted to use three categories for the subject areas:
“medical/clinical,” “science/engineering,” and “social science/arts.” With only four studies
in the arts, further subdividing the category of “social science/arts” would be problematic.
In addition, the correlation between peer and teacher ratings for graduate courses was
significantly higher than for undergraduate courses (γ = .18, t = 2.49, p < .05). This was to be
expected as students taking advanced courses are likely to be more cognitively advanced and
potentially have higher reflection skills than those taking introductory courses (Falchikov &
Boud, 1989; Nulty, 2011). Reflection skills are argued to be an important factor related to the
peer assessment quality (Sluijsmans, Brand-Gruwel, van Merriënboer, & Bastiaens, 2002). We
also expected peer and teacher ratings on undergraduate courses to have a higher correlation than
those on K-12 courses. However, the difference we found was not statistically significant,
probably because the present meta-analysis included only a small number of studies involving K-
12 students.
Factors Related to Peer Assessment Procedures
As shown in the present meta-analysis, when assessees were groups, the correlation
between peer and teacher ratings was significantly lower than when assessees were individuals (γ
= -.26, t = -3.56, p < .001). When assessees are groups, the work being assessed is typically
group work, which usually involves interactions and dynamics among group members.
Therefore, assessing group work is likely to be more challenging compared to assessing
individual work (Panitz, 2003). This partially explains why peer ratings were less accurate when
assessees were groups than when assessees were individuals.
When assessors and assessees were matched at random, the correlation between peer and
teacher ratings was higher than when the matching was not random (γ = -.27, t = -3.83, p < .001).
Assessment bias exists when there is a systematic tendency for peer assessment scores “to be
influenced by anything other than the trait, behavior, or outcome they are supposed to be
measuring” (Kane & Lawler, 1978, p. 558). It is reasonable that randomly matching assessors
and assessees helps to reduce certain systematic biases and thus leads to higher agreement
between peer and teacher ratings.
Voluntary peer ratings showed more agreement with teacher ratings than did compulsory
peer ratings (γ = .28, t = 3.40, p < .01). When students are given choices, they are more likely to
engage in the task and the learning process (Boud, 2012). Also, voluntary peer assessment
usually happens when peers are interested in and motivated to participate in peer assessment
activities, which might lead to more accurate peer ratings. Therefore, it would be helpful for
teachers to boost students’ interest in and motivation for conducting peer assessment.
The correlation between peer and teacher ratings was significantly higher when peer
raters were non-anonymous than when they were anonymous (γ = .15, t = 2.40, p < .05).
Anonymity is believed to lead to a fairer environment and more honest ratings (Joinson, 1999).
However, as discussed by Cestone, Levine, and Lane (2008), when peer assessment is
anonymous, students may provide harsher criticism and evaluations. Also, previous research
noted that peer raters may lack effort or seriousness when performing the assessment (Hanrahan
& Isaacs, 2001). It is possible that non-anonymity may lead peer raters to take the assessment
more seriously, thereby generating more accurate ratings. Bloxham and West (2004) asked peer
raters to evaluate each other’s peer rating quality and found this practice encouraged peer raters
to engage seriously in the peer rating process.
When peer raters provided both scores and comments, the correlation between peer and
teacher ratings was significantly higher than when peer raters provided only scores (γ = .15, t =
2.57, p < .05). It is reported that students feel more comfortable giving qualitative feedback than
giving a purely quantitative evaluation of their peers’ work (Cestone et al., 2008). Moreover, it is
likely that qualitative comments enable students to more actively reflect on their peers’ work,
thereby supporting a clearer rationale for their numerical ratings (Avery, 2014). This reflection
and documentation help to improve the accuracy of peer ratings. In addition, Davies (2006,
2009) reported strong correlations between qualitative comments (i.e., negative/positive) and
numerical ratings, which supports the accuracy of qualitative comments.
A striking finding is that peer rater involvement in developing rating criteria yielded
much higher correlations between peer and teacher ratings than when peer raters were not
involved in developing the criteria (γ = .60, t = 5.55, p < .001). Discussion, negotiation, and joint
construction of assessment criteria are likely to give students a greater sense of ownership and
investment in their evaluations (Cestone et al., 2008; Topping, 2003). Such student involvement
can also make the rating criteria more understandable and easier to apply (Orsmond, Merry, &
Reiling, 1996). In addition, through this involvement, students can gain an opportunity to reflect
on their own learning. For this reason, we suggest the practice of involving peer raters in
developing rating criteria.
However, the correlation between peer and teacher ratings was not significantly higher
when peer raters had received training. A possible reason for this is that the quality of peer-rater
training may vary across the studies included in this meta-analysis such that a dichotomous
coding of whether peers had received training may not have been sufficient to capture its effect.
In addition, raters might have performed peer assessment or received training before the
peer assessment activity reported in the study; however, this information was not available in
most articles and thus was not included in the present meta-analysis. Further, we did not find the
correlation to be significantly higher when explicit criteria were used. This is probably because
almost all the studies included in this meta-analysis used explicit criteria, so that it was difficult
to detect a statistically significant effect for this predictor.
Neither the number of teachers per assignment nor the number of peer raters per
assignment was statistically significant. In the original full model, the correlation between peer
and teacher ratings was slightly higher when the number of peer raters per assignment was
medium (6–10) or high (more than 10) rather than low (1–5), although the difference was not
statistically significant at the .05 level. This
direction agrees with the previous literature (Cho et al., 2006; Winch & Anderson, 1967)
wherein the agreement between peer and teacher ratings is higher when more peer raters are
involved. Falchikov and Goldfinch (2000), however, found that the agreement was lower when
the number of peer raters was more than 20. Nevertheless, there is no agreed-upon value for the
optimal number of peer raters per assignment.
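The intuition that averaging over more peer raters should raise agreement with teacher ratings can be illustrated with the Spearman-Brown prophecy formula, which predicts the reliability of the mean of k parallel ratings. This formula is not part of the present analysis; the sketch below is purely illustrative, and the .40 single-rater reliability is a hypothetical input.

```python
def spearman_brown(r_single: float, k: int) -> float:
    """Predicted reliability of the mean of k parallel ratings,
    given the reliability r_single of a single rating (Spearman-Brown)."""
    return k * r_single / (1 + (k - 1) * r_single)

# A hypothetical single peer rating with reliability .40:
# averaging 5 raters is predicted to raise it to about .77,
# and averaging 10 raters to about .87.
for k in (1, 5, 10):
    print(k, round(spearman_brown(0.40, k), 2))
```

The gains diminish as k grows, which is consistent with there being no single optimal number of raters.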
Conclusion, Implications, and Future Study
Based on empirical studies published since 1999, this meta-analysis investigates the
agreement between peer and teacher ratings and factors that might significantly influence this
agreement. We found that peer and teacher ratings overall agree with each other at a moderately
high level (r = .63). This correlation is significantly higher when (a) the peer assessment is
paper-based rather than computer-assisted; (b) the subject area is not medical/clinical; (c) the
course is graduate level rather than undergraduate or K-12; (d) individual work instead of group
work is assessed; (e) the assessors and assessees are matched at random; (f) the peer assessment
is voluntary instead of compulsory; (g) the peer assessment is non-anonymous; (h) peer raters
provide both scores and qualitative comments instead of only scores; and (i) peer raters are
involved in developing the rating criteria. It is important to note that the effect of each variable
was examined when the other variables were controlled for. The findings of this meta-analysis
are expected to inform educational practitioners on how to structure peer assessment in ways that
maximize assessment quality in a variety of settings.
Given the prevalence of peer assessment in higher education, this meta-analysis
has especially important implications for higher education in the digital age. A noteworthy
finding is that the agreement between peer and teacher ratings was significantly lower when the
peer assessment was computer-assisted rather than paper-based. With the rapid growth of
technology and calls for cost containment in higher education, we anticipate that peer assessment
will be increasingly mediated by computer technology. Such trends are evident in the rapid rise
of massive open online courses (MOOCs) throughout higher education, where computer-assisted
peer assessment is the primary assessment method (Suen, 2014). Given these trends, it is essential to
conduct research on improving the quality and accuracy of peer assessment in the digital age. In
addition, the present meta-analysis confirms that agreement between peer and teacher ratings is
higher for graduate courses than for undergraduate or K-12 courses. This finding provides a basis
for educators to use peer assessment with more confidence in graduate-level courses.
In summary, the present meta-analysis provides important evidence on the agreement
between peer and teacher ratings and the factors likely to influence the extent of this agreement.
This evidence rests on the assumption that teacher ratings are the gold standard, an assumption
that is the norm in the field but is itself worthy of further investigation. Furthermore, peer assessment involves a wide
range of activities, and a list of variables influencing the agreement between peer and teacher
ratings will never be exhaustive (Gielen et al., 2011). We included only theoretically meaningful
predictors that could be reliably coded. As a result, the current meta-analysis explained only
about one third of the variation of the agreement between peer and teacher ratings, and it is
important to examine the influence of other potential factors in future research. For instance, peer
assessment can be formative or summative, but the present meta-analysis did not examine
whether the purpose of peer assessment influences the agreement between peer and teacher
ratings. Also, peer raters from different cultural backgrounds could differ in how they allocate
scores to their peers (Fan, 2011). Finally, the present meta-analysis did not examine the
reliability of peer assessment (i.e., the inter-rater consistency across peer raters) or the effect of
peer assessment on learning. All these issues need to be addressed
in future work.
A final note is to reflect on the use of meta-analysis. Meta-analysis is a quantitative
synthesis of findings from different studies. One criticism against meta-analysis is that
researchers compare apples and oranges across studies because each individual study is different
in nature (Sharpe, 1997). As stated by Glass (2000), a meta-analysis asks questions about fruit,
for which both apples and oranges contribute important information. Further, meta-analysis
researchers take into account the characteristics of different studies while combining their
results. In the present meta-analysis, our level-2 analysis ascertains whether differences in
course level, subject area, and other variables explain variance in the effect sizes, thus
minimizing the problem of taking an unqualified average effect size across all studies.
Nevertheless, the results of our meta-analysis are necessarily influenced by the level-2 predictors
we included and the way we chose to operationalize them. For example, we chose a coarse
level of granularity, dividing subject areas into only three categories:
medical/clinical, science/engineering, and social sciences/arts. Also, there are potentially other
predictors not included due to a lack of information provided in the studies. It is, therefore,
necessary to replicate or extend the current meta-analysis as new studies become available.
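As a minimal sketch of the variance-known (V-known) approach underlying this meta-analysis: each study contributes a Fisher-z effect size whose sampling variance, 1/(n - 3), is treated as known, and level-2 predictors are entered in a precision-weighted regression. The data below are hypothetical, and this fixed-effects weighted least squares step omits the between-study variance component estimated in the full HLM.

```python
import numpy as np

# Hypothetical studies: Fisher-z effect sizes, sample sizes, and one
# binary level-2 predictor (1 = computer-assisted peer assessment).
z = np.array([0.7, 0.5, 0.8, 0.4, 0.6])
n = np.array([40, 60, 35, 50, 45])
w1 = np.array([0, 1, 0, 1, 0])

v = 1.0 / (n - 3)                      # known level-1 sampling variances
X = np.column_stack([np.ones_like(z), w1])
W = np.diag(1.0 / v)                   # weight each study by its precision

# Precision-weighted least squares estimate of (intercept, slope):
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)
print(beta)  # in this toy data the slope is negative
```

With a single binary predictor, the estimates reduce to precision-weighted group means, which makes the weighting transparent.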
References
References marked with an asterisk indicate studies included in the meta-analysis.
*Ahangari, S., Rassekh-Alqol, B., & Hamed, L. A. (2013). The effect of peer assessment on oral
presentation in an EFL context. International Journal of Applied Linguistics & English
Literature, 2(3), 45–53.
*Avery, J. (2014). Leveraging crowdsourced peer-to-peer assessments to enhance the case
method of learning. Journal for Advancement of Marketing Education, 22(1), 1–15.
*Barker, T. & Bennet, S. (2010). Marking complex assignments using peer assessment with an
electronic voting system and an automated feedback tool. Proceedings of CAA 2010
Conference, Southampton, UK. Retrieved from
http://caaconference.co.uk/pastConferences/2010/Barker-CAA2010.pdf
*Basehore, P. M., Pomerantz, S. C., & Gentile, M. (2014). Reliability and benefits of medical
student peers in rating complex clinical skills. Medical Teacher, 36(5), 409–414.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential
data and sources of collinearity. New York, NY: John Wiley and Sons.
Berg, E. C. (1999). The effects of trained peer response on ESL students’ revision types and
writing quality. Journal of Second Language Writing, 8(3), 215–241.
Bloxham, S., & West, A. (2004). Understanding the rules of the game: Marking peer assessment
as a medium for developing students’ conceptions of assessment. Assessment &
Evaluation in Higher Education, 29(6), 721–733.
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-
analysis. Chichester, UK: John Wiley & Sons.
Boud, D. (2012). Developing student autonomy in learning (2nd ed.). New York, NY: Nichols
Publishing Company.
*Bouzidi, L., & Jaillet, A. (2009). Can online peer assessment be trusted? Journal of Educational
Technology & Society, 12(4), 257–268.
Burnett, W., & Cavaye, G. (1980). Peer assessment by fifth year students of surgery. Assessment
in Higher Education, 5(3), 273–278.
*Campbell, K. S., Mothersbaugh, D. L., Brammer, C., & Taylor, T. (2001). Peer versus self
assessment of oral business presentation performance. Business Communication
Quarterly, 64(3), 23–42.
Cestone, C. M., Levine, R. E., & Lane, D. R. (2008). Peer assessment and evaluation in team-
based learning. New Directions for Teaching and Learning, 2008 (116), 69–78.
*Chen, Y. C., & Tsai, C. C. (2009). An educational research course facilitated by online peer
assessment. Innovations in Education & Teaching International, 46(1), 105–117.
*Cheng, W., & Warren, M. (1999). Peer and teacher assessment of the oral and written tasks of a
group project. Assessment & Evaluation in Higher Education, 23(3), 301–314.
*Cho, K., Schunn, C., & Wilson, R. (2006). Validity and reliability of scaffolded peer
assessment of writing from instructor and student perspectives. Journal of Educational
Psychology, 98(4), 891–901.
*Coulson, M. (2009). Peer marking of talks in a small, second year biosciences course. 2009
UniServe Science Proceedings. Retrieved from http://ojs-
prod.library.usyd.edu.au/index.php/IISME/article/viewFile/6197/6845
*Daniel, R. (2005). Sharing the learning process: Peer assessment applications in practice.
Proceedings of the Effective Teaching and Learning Conference 2004. Retrieved from
http://researchonline.jcu.edu.au/5024/
Davies, P. (2006). Peer assessment: Judging the quality of students’ work by comments rather
than marks. Innovations in Education and Teaching International, 43(1), 69–82.
Davies, P. (2009). Review and reward within the computerised peer‐assessment of
essays. Assessment & Evaluation in Higher Education, 34(3), 321–333.
*Davis, D. J. (2002). Comparison of faculty, peer, self, and nurse assessment of obstetrics and
gynecology residents. Obstetrics & Gynecology, 99(4), 647–651.
Falchikov, N. (1995). Peer feedback marking: Developing peer assessment. Innovations in
Education and Training International, 32(2), 175–187.
Falchikov, N., & Boud, D. (1989). Student self-assessment in higher education: A meta-analysis.
Review of Educational Research, 59(4), 395–430.
Falchikov, N., & Goldfinch, J. (2000). Student peer assessment in higher education: A meta-
analysis comparing peer and teacher marks. Review of Educational Research, 70(3), 287–
322.
Fan, M. (2011). International students' perceptions, practices and identities of peer assessment
in the British university: a case study. Paper presented at the Internationalisation of
Pedagogy and Curriculum in Higher Education Conference, Coventry, UK. Retrieved
from https://www.heacademy.ac.uk/node/3770
*Freeman, S., & Parks, J. W. (2010). How accurate is peer grading? CBE Life Sciences
Education, 9(4), 482–488.
*Garcia-Ros, R. (2011). Analysis and validation of a rubric to assess oral presentation skills in
university contexts. Electronic Journal of Research in Educational Psychology, 9(3),
1,043–1,062.
Gielen, S., Dochy, F., & Onghena, P. (2011). An inventory of peer assessment diversity.
Assessment & Evaluation in Higher Education, 36(2), 137–155.
Glass, G. V. (1977). Integrating findings: The meta-analysis of research. Review of Research in
Education, 5, 351–379.
Glass, G. V. (2000). Meta-analysis at 25. Retrieved from
http://glass.ed.asu.edu/gene/papers/meta25.html
*Gracias, N., & Garcia, R. (2013). Can we trust peer grading in oral presentations? Towards
optimizing a critical resource nowadays: Teachers’ time. Paper presented at 5th
International Conference on Education and New Learning Technologies (EDULEARN),
Barcelona, Spain. Retrieved from
http://users.isr.ist.utl.pt/~ngracias/publications/Gracias13_Edulearn_1329.pdf
*Griesbaum, J. & Gortz, M. (2010). Using feedback to enhance collaborative learning: An
exploratory study concerning the added value of self- and peer-assessment by first-year
students in a blended learning lecture. International Journal of E-learning, 9(4), 481–
503.
*Hafner, J., & Hafner, P. (2003). Quantitative analysis of the rubric as an assessment tool: An
empirical study of student peer-group rating. International Journal of Science Education,
25(12), 1,509–1,528.
*Hamer, J., Purchase, H., Luxton-Reilly, A., & Denny, P. (2014, online first). A comparison of
peer and tutor feedback. Assessment & Evaluation in Higher Education.
DOI:10.1080/02602938.2014.893418.
*Hammer, R., Ronen, M., & Kohen-Vacs, D. (2012). On-line project-based peer assessed
competitions as an instructional strategy in higher education. Interdisciplinary Journal of
E-Learning and Learning Objects, 8, 1–14.
*Han, Y., James, D. H., & McLain, R. M. (2013). Relationships between student peer and
faculty evaluations of clinical performance: A pilot study. Journal of Nursing Education
and Practice, 3(8), 1–9.
Hanrahan, S. J., & Isaacs, G. (2001). Assessing self- and peer-assessment: The students’ views.
Higher Education Research & Development, 20(1), 53–70.
*Harris, J. (2011). Peer assessment in large undergraduate classes: An evaluation of a procedure
for marking laboratory reports and a review of related practices. Advances in Physiology
Education, 35(2), 178–187.
*Heyman, J. E., & Sailors, J. J. (2011). Peer assessment of class participation: Applying peer
nomination to overcome rating inflation. Assessment & Evaluation in Higher Education,
36(5), 605–618.
*Hidayat, M. T. (2013). Self-, peer- and teacher-assessment in translation course. Retrieved from
http://file.upi.edu/Direktori/FPBS/JUR._PEND._BAHASA_INGGRIS/19670609199403
1-
DIDI_SUKYADI/SELF,%20PEER%20AND%20TEACHER%20ASSESSMENT%20IN
%20TRANSLATION%20COURSE.pdf
Joinson, A. N. (1999). Social desirability, anonymity and Internet-based questionnaires.
Behavior Research Methods, Instruments and Computers, 31(3), 433–438.
*Jones, I., & Alcock, L. (2013, online first). Peer assessment without assessment criteria. Studies
in Higher Education. DOI:10.1080/03075079.2013.821974.
*Kakar, S. P., Catalanotti, J. S., Flory, A. L., Simmerns, S. J., Lewis, K. L., Mintz, M. L.,
Hwaywood, Y.C., & Blatt, B.C. (2013). Evaluating oral case presentations using a
checklist: How do senior student-evaluators compare with faculty? Academic Medicine,
88(9), 1,363–1,367.
Kane, J. S., & Lawler, E. E. (1978). Methods of peer assessment. Psychological Bulletin, 85(3),
555–586.
*Killic, G. B., & Cakan, M. (2007). Peer assessment of elementary science teaching skills.
Journal of Science Teacher Education, 18(1), 91–107.
*Kovach, A. R., Resch, S. R., & Verhulst, J. S. (2009). Peer assessment of professionalism: A
five-year experience in medical clerkship. Journal of General Internal Medicine, 24(6),
742–746.
*Langan, M. A., Shuker, D. M., Cullen, W. R., Penney, D., Preziosi, F .R., & Wheater, P. C.
(2008). Relationships between student characteristics and self-, peer and tutor evaluations
of oral presentations. Assessment & Evaluation in Higher Education, 33(2), 179–190.
*Lanning, S. K., Brickhouse, T. H., Gunsolley, J. C., Ranson, S. L., & Wilett, R. M. (2011).
Communication skills instruction: An analysis of self, peer-group, student instructors and
faculty assessment. Patient Education and Counseling, 83(2), 145–151.
*Liang, J. C., & Tsai, C. C. (2010). Learning through science writing via online peer assessment
in a college biology course. Internet and Higher Education, 13(4), 242–247.
*Lin, S. S. J., Liu, E. X. F. & Yuan, S. M. (2001). Web-based peer assessment: Feedback for
students with various thinking-styles. Journal of Computer Assisted Learning, 17(4),
420–432.
*Liow, J-L. (2008). Peer assessment in thesis oral presentation. European Journal of
Engineering Education, 33(5–6), 525–537.
*Lirely, R., Keech, M. K., Vanhook, C., & Little, P. (2011). Development and evaluative
contextual usage of peer assessment of research presentations in a graduate tax
accounting course. International Journal of Business and Social Science, 2(23), 89–94.
Liu, N. F., & Carless, D. (2006). Peer feedback: The learning element of peer assessment.
Teaching in Higher Education, 11(3), 279–290.
*Liu, E. Z.-F., Lin, S. S. J., & Yuan, S.-M. (2002). Alternatives to instructor assessment: A case
study of comparing self and peer assessment with instructor assessment under a
networked innovative assessment procedures. International Journal of Instructional
Media, 29(4), 395–404.
*Liu, C. C., & Tsai, C. M. (2005). Peer assessment through web-based knowledge acquisition:
Tools to support conceptual awareness. Innovations in Education and Teaching
International, 42(1), 43–59.
*MacAlpine, J. M. K. (1999). Improving and encouraging peer assessment of student
presentations. Assessment & Evaluation in Higher Education, 24(1), 15–25.
Magin, D. (2001). Reciprocity as a source of bias in multiple peer assessment of group work.
Studies in Higher Education, 26(1), 53–63.
*Magin, D., & Helmore, P. (2001). Peer and teacher assessments of oral presentation skills: How
reliable are they? Studies in Higher Education, 26(3), 287–298.
*Mehrdad, N., Bigdeli, S., & Ebrahimi, H. (2012). A comparative study on self, peer and teacher
evaluation to evaluate clinical skills of nursing students. Procedia-Social and Behavioral
Sciences, 47, 1,847–1,852.
*Mika, S. (2006). Peer- and instructor assessment of oral presentations in Japanese university
EFL classrooms: A pilot study. Waseda Global Forum, 3, 99–107. Retrieved from
https://dspace.wul.waseda.ac.jp/dspace/bitstream/2065/11344/1/13M.Shimura.pdf
Miller, P. J. (2003). The effect of scoring criteria specificity on peer and self-assessment.
Assessment & Evaluation in Higher Education, 28(4), 383–394.
*Napoles, J. (2008). Relationships among instructor, peer and self-evaluations of undergraduate
music education majors’ micro-teaching experiences. Journal of Research in Music
Education, 56(1), 82–91.
Nulty, D. D. (2011). Peer and self-assessment in the first year of university. Assessment &
Evaluation in Higher Education, 36(5), 493–507.
*Okuda, R. & Otsu, R. (2010). Peer assessment for speeches as an aid to teacher grading. The
Language Teacher, 34(4), 41–47.
Orsmond, P., Merry, S., & Reiling, K. (1996). The importance of marking criteria in the use of
peer assessment. Assessment & Evaluation in Higher Education, 21(3), 239–250.
*Ostafichuk, P. M. (2012). Peer-to-peer assessment in large classes: A study of several
techniques used in design courses. Paper presented at 2012 ASEE Annual Conference,
San Antonio, Texas. Retrieved from
http://www.asee.org/public/conferences/8/papers/4547/view
*Otoshi, J. & Heffernan, N. (2007). An analysis of peer assessment in EFL college oral
presentation classrooms. The Language Teacher, 31(11), 3–8.
*Panadero, E., Romero, M., & Strijbos, J. W. (2013). The impact of a rubric and friendship on
peer assessment: Effects on construct validity, performance, and perceptions of fairness
and comfort. Studies in Educational Evaluation, 39(4), 195–203.
Panitz, T. (2003). Faculty and student resistance to cooperative learning. In J. L. Cooper, P.
Robinson, & D. Ball (Eds), Small group instruction in higher education: Lessons from
the past, visions of the future (pp.193-200). Stillwater, OK: New Forums Press.
*Papinczak, T., Young, L., Groves, M., & Haynes, M. (2007). An analysis of peer, self, and tutor
assessment in problem-based learning tutorials. Medical Teacher, 29(5), 122–132.
*Patri, M. (2002). The influence of peer feedback on self- and peer-assessment of oral skills.
Language Testing, 19(2), 109–131.
Pond, K., Ul-Haq, R., & Wade, W. (1995). Peer review: A precursor to peer assessment.
Innovations in Education & Training International, 32(4), 314–323.
*Raes, A., Vanderhoven, E., & Schellens, T. (2013, online first). Increasing anonymity in peer
assessment by using classroom response technology within face-to-face higher education.
Studies in Higher Education. DOI:10.1080/03075079.2013.823930.
Raudenbush, S. W., & Bryk, A. S. (1985). Empirical Bayes meta-analysis. Journal of
Educational Statistics, 10(2), 75–98.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data
analysis methods (2nd ed.). London: Sage.
Raudenbush, S. W., Bryk, A. S, & Congdon, R. (2004). HLM 6 for Windows [Computer
software]. Lincolnwood, IL: Scientific Software International.
Rosenthal, R. (1979). The “file drawer problem” and tolerance for null results. Psychological
Bulletin, 86(3), 638–641.
*Rudy, D. W., Fejfar, M. C., Griffith, C. H., & Wilson, J. F. (2001). Self- and peer assessment
in a first-year communication and interviewing course. Evaluation & the Health
Professions, 24(4), 436–445.
*Sadler, M. P., & Good, E. (2006). The impact of self- and peer-grading on student learning.
Educational Assessment, 11(1), 1–31.
*Saito, H. (2008). EFL classroom peer assessment: Training effects on rating and commenting.
Language Testing, 25(4), 553–581.
*Saito, H., & Fujita, T. (2009). Peer-assessing peers’ contribution to EFL group presentations.
RELC Journal, 40(2), 149–171.
*Sealey, R. M. (2013). Peer assessing in higher education: Perspectives of students and staff.
Education Research and Perspectives, 40, 276–298.
Senger, J. L., & Kanthan, R. (2012). Student evaluations: Synchronous tripod of learning
portfolio assessment—self-assessment, peer-assessment, instructor-assessment. Creative
Education, 3(1), 155–163.
Sharpe, D. (1997). Of apples and oranges, file drawers and garbage: Why validity issues in meta-
analysis will not go away. Clinical Psychology Review, 17(8), 881–901.
*Sila, A., & Bartan, O. (2010). Self and peer assessment in different ability groups. Paper
presented at International Conference on New Trends in Education and Their
Implications, Antalya, Turkey. Retrieved from
http://www.iconte.org/FileUpload/ks59689/File/8.pdf
Sims, G. K. (1989). Student peer review in the classroom: A teaching and grading tool. Journal
of Agronomic Education, 8(2), 105–108.
*Sitthiworachart, J., & Joy, M. (2008). Computer support of effective peer assessment in an
undergraduate programming class. Journal of Computer Assisted Learning, 24(3), 217–
231.
Sluijsmans, D., Brand-Gruwel, S., van Merriënboer, J. J., & Bastiaens, T. J. (2002). The training
of peer assessment skills to promote the development of reflection skills in teacher
education. Studies in Educational Evaluation, 29(1), 23–42.
Speyer, R., Pilz, W., van der Kruis, J., & Brunings, J. (2011). Reliability and validity of student
peer assessment in medical education: A systematic review. Medical Teacher, 33(11),
572–585.
*Steensels, C., Leemans, L., Buelens, H., Laga, E., Lecoutere, A., Laekeman, G., & Simoens, S.
(2006). Peer assessment: A valuable tool to differentiate between student contributions to
group work? Pharmacy Education, 6(2), 111–118.
Suen, H. K. (2014). Peer assessment for massive open online courses (MOOCs). The
International Review of Research in Open and Distance Learning, 15(3), 312–327.
*Sullivan, M. E., Hitchcock, M. A., & Dunnington, G. L. (1999). Peer and self assessment during
problem-based tutorials. American Journal of Surgery, 177(3), 266–269.
*Tepsuriwong, S. & Bunson, T. (2013). Introducing peer assessment to a reading classroom:
Expanding Thai university students’ learning boundaries beyond alphabetical symbols.
Mediterranean Journal of Social Sciences, 4(14), 279–286.
*Thomas, P. A., Gebo, K. A., & Hellmann, D. B. (1999). A pilot study of peer review in
residency training. Journal of General Internal Medicine, 14(9), 551–554.
Topping, K. J. (1998). Peer assessment between students in colleges and universities. Review of
Educational Research, 68(3), 249–276.
Topping, K. J. (2003). Self and peer assessment in school and university: Reliability, validity and
utility. In M. S. R. Segers, F. J. R. C. Dochy, & E. C. Cascallar (Eds.), Optimizing new
modes of assessment: In search of qualities and standards (pp. 55–87). Dordrecht, The
Netherlands: Kluwer Academic.
Topping, K. J. (2005). Trends in peer learning. Educational Psychology, 25(6), 631–645.
Topping, K. J. (2009). Peer assessment. Theory into Practice, 48(1), 20–27.
*Tsai, P. V. (2013). Peering into peer assessment: An investigation of the reliability of peer
assessment in MOOCs. Unpublished master’s thesis, Princeton University, Princeton, NJ.
*Tsai, C.-C., & Liang, J.-C. (2009). The development of science activities via on-line peer
assessment: The role of scientific epistemological views. Instructional Science, 37(3),
293–310.
*Tsai, C.-C., Lin, S. S. J., & Yuan, S.-M. (2002). Developing science activities through a
networked peer assessment system. Computers & Education, 38(1–3), 241–252.
*Tseng, S.-C., & Tsai, C.-C. (2007). On-line peer assessment and the role of the peer feedback:
A study of high school computer course. Computers & Education, 49(4), 1,161–1,174.
*Tseng, S-C., & Tsai, C.-C. (2009). Exploring the role of student online peer assessment self-
efficacy in online peer assessment learning environments. In G. Siemens & C. Fulford
(Eds.), Proceedings of World Conference on Educational Multimedia, Hypermedia and
Telecommunications 2009 (pp. 3,357–3,361). Chesapeake, VA: AACE.
van den Berg, I., Admiraal, W., & Pilot, A. (2006). Peer assessment in university teaching:
Evaluating seven course designs. Assessment & Evaluation in Higher Education, 31(1),
19–36.
van Gennip, N., Segers, M., & Tillema, H. (2009). Peer assessment for learning from a social
perspective: The influence of interpersonal variables and structural features. Educational
Research Review, 4(1), 41–54.
*Verkade, H. & Bryson-Richardson, R. J. (2013). Student acceptance and application of peer
assessment in a final year genetics undergraduate oral presentation. Journal of Peer
Learning, 6(1), 1–18.
*Vozniuk, A., Holzer, A., & Gillet, D. (2014). Peer assessment based on ratings in a social
media course. Proceedings of the Fourth International Conference on Learning Analytics
And Knowledge. New York, NY.
*Wen, L. M., & Tsai, C.-C. (2008). Online peer assessment in an in-service science and
mathematics teacher education course. Teaching in Higher Education, 13(1), 55–67.
Winch, R. F., & Anderson, R. B. W. (1967). Two problems involved in the use of peer rating
scales and some observations on Kendall's coefficient of concordance. Sociometry, 30(3),
316–322.
*Xiao, Y., & Lucking, R. (2008). The impact of two types of peer assessment on students’
performance and satisfaction within a Wiki environment. Internet and Higher Education,
11(3), 186–193.
*Yinjaroen, P., & Chiramanee, T. (2011). Peer assessment of oral English proficiency. Paper
presented at The 3rd International Conference on Humanities and Social Sciences. Hat
Yai, Songkhla, Thailand. Retrieved from
http://tar.thailis.or.th/bitstream/123456789/660/1/001.pdf
*Zakian, M., Moradan, A., & Naghibi, S. E. (2012). The relationship between self-, peer-, and
teacher-assessments of EFL learners’ speaking. World J Arts, Languages, and Social
Sciences, 1(1), 1–5.
Table 1
Variables and Frequencies

Assessment mode: paper-based (reference group), 135; computer-assisted (W1), 134.
Subject area: social science/arts (reference group), 160; science/engineering (W2), 79;
medical/clinical (W3), 30.
Task being rated: writing or project (reference group), 149; performance (W4), 120.
Level of course: undergraduate level (reference group), 188; graduate level (W5), 65;
K-12 (W6), 16.
Constellation of assessees: individual (reference group), 216; group (W7), 53.
Number of peer raters per assignment: smaller than 6 (reference group), 108; between 6 and 10
(W8), 50; larger than 10 (W9), 111.
Number of teachers per assignment: one teacher (reference group), 188; more than one teacher
(W10), 81.
Matching of assessors and assessees: matched at random (reference group), 227; not matched at
random (W11), 42.
Requirement: peer assessment is compulsory (reference group), 225; peer assessment is
voluntary (W12), 44.
Interaction between peers: anonymous rating (reference group), 214; non-anonymous rating
(W13), 55.
Feedback format: only scores without written comments (reference group), 100; scores with
written comments (W14), 169.
Explicit rating criteria: no explicit rating criteria (reference group), 12; explicit rating criteria
(W15), 257.
Involvement of students: peer raters are not involved in developing the rating criteria (reference
group), 251; peer raters are involved in developing the rating criteria (W16), 18.
Peer rater training: peer raters receive training (reference group), 168; peer raters do not receive
training (W17), 101.
Table 2
Results of the Final Model with Only Significant Predictors

| Predictor | Regression coefficient* | Standard error | t ratio | p value |
|---|---|---|---|---|
| Intercept | .69 | .05 | 14.44 | .000 |
| Assessment mode is computer-assisted (vs. paper-based) | -.14 | .06 | -2.25 | .026 |
| Subject area is medical/clinical (vs. social science/arts) | -.35 | .11 | -3.19 | .002 |
| Course is graduate level (vs. undergraduate level) | .18 | .07 | 2.49 | .013 |
| Assessees are groups (vs. individuals) | -.26 | .07 | -3.56 | .000 |
| Assessors and assessees are not matched at random (vs. matched at random) | -.27 | .07 | -3.83 | .000 |
| Peer assessment is voluntary (vs. compulsory) | .28 | .08 | 3.40 | .001 |
| Peer raters are non-anonymous (vs. anonymous) | .15 | .06 | 2.40 | .017 |
| Both scores and comments are provided (vs. only scores) | .15 | .06 | 2.57 | .011 |
| Peer raters are involved in developing the rating criteria (vs. not involved) | .60 | .11 | 5.55 | .000 |

*Note: The regression coefficients are expressed in the Fisher's z metric.
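Because Pearson correlations are bounded and skewed near their limits, meta-analytic models are typically fitted on Fisher's z-transformed correlations and the results back-transformed afterward. A minimal sketch of the transformation, using only Python's standard library (the value 0.60 is chosen because it corresponds to the model intercept of .69):

```python
import math

# Fisher's z transformation maps a Pearson correlation r onto an
# approximately normal, unbounded metric:
#   z = atanh(r) = 0.5 * ln((1 + r) / (1 - r))
# The back-transformation is r = tanh(z).
r = 0.60
z = math.atanh(r)       # ≈ 0.69, the intercept of the final model
r_back = math.tanh(z)   # recovers 0.60
print(round(z, 2), round(r_back, 2))
```

Fitting in the z metric is also why the coefficients in Table 2 are additive: each predictor shifts the expected correlation on the z scale, not on the r scale.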
Table 3
Estimated Pearson Correlations between Peer and Teacher Ratings in the Final Model

| Condition | Pearson correlation |
|---|---|
| Default scenario (all predictors equal zero) | 0.60 |
| Assessment mode is computer-assisted, all else as in the default scenario | 0.50 |
| Subject area is medical/clinical, all else as in the default scenario | 0.33 |
| Course is graduate level, all else as in the default scenario | 0.70 |
| Assessees are groups, all else as in the default scenario | 0.41 |
| Assessors and assessees are not matched at random, all else as in the default scenario | 0.40 |
| Peer assessment is voluntary, all else as in the default scenario | 0.75 |
| Peer raters are non-anonymous, all else as in the default scenario | 0.69 |
| Both scores and comments are provided, all else as in the default scenario | 0.69 |
| Peer raters are involved in developing the rating criteria, all else as in the default scenario | 0.86 |
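As an illustration (not part of the paper), the correlations in Table 3 can be reproduced from the Table 2 coefficients by adding each coefficient to the intercept in the Fisher's z metric and back-transforming with r = tanh(z). The dictionary keys below are shorthand labels for the Table 2 predictors, not notation used in the paper:

```python
import math

# Intercept and significant-predictor coefficients from Table 2,
# all in the Fisher's z metric.
intercept = 0.69
coefficients = {
    "computer-assisted":      -0.14,
    "medical/clinical":       -0.35,
    "graduate level":          0.18,
    "group assessees":        -0.26,
    "not matched at random":  -0.27,
    "voluntary":               0.28,
    "non-anonymous":           0.15,
    "scores with comments":    0.15,
    "involved in criteria":    0.60,
}

# Default scenario: all predictors are zero, so r = tanh(intercept).
print(f"default scenario: r = {math.tanh(intercept):.2f}")

# Each one-predictor scenario shifts z by the coefficient, then
# back-transforms to the Pearson metric.
for condition, w in coefficients.items():
    print(f"{condition}: r = {math.tanh(intercept + w):.2f}")
```

Running this recovers the Table 3 column to two decimals (0.60 for the default scenario, down to 0.33 for medical/clinical, up to 0.86 when peer raters help develop the criteria), confirming that the two tables are consistent.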