Peer Assessment in the Digital Age: A Meta-Analysis Comparing Peer and Teacher Ratings
Hongli Li, Georgia State University
Yao Xiong, Xiaojiao Zang, Mindy Kornhaber, Youngsun Lyu, Kyung Sun Chung, Hoi Suen,
The Pennsylvania State University
Correspondence should be addressed to Hongli Li, Department of Educational Policy Studies,
Georgia State University, P. O. Box 3977, Atlanta GA 30303. Email: hli24@gsu.edu.
Paper presented at the 2014 Annual Meeting of the American Educational Research Association
(AERA), Philadelphia, PA.
Abstract
Given the wide use of peer assessment, especially in higher education, the relative
accuracy of peer ratings compared to teacher ratings is a major concern for both educators and
researchers. This concern has grown with the increasing use of peer assessment on digital platforms. In
this meta-analysis, using a variance-known hierarchical linear modeling approach, we synthesize
findings from studies on peer assessment since 1999 when computer-assisted peer assessment
started to proliferate. The estimated average Pearson correlation between peer and teacher ratings
is found to be .63, which is moderately strong. This correlation is significantly higher when (a)
the peer assessment is paper-based rather than computer-assisted; (b) the subject area is not
medical/clinical; (c) the course is graduate level rather than undergraduate or K-12; (d)
individual work instead of group work is assessed; (e) the assessors and assessees are matched at
random; (f) the peer assessment is voluntary instead of compulsory; (g) the peer assessment is
non-anonymous; (h) peer raters provide both scores and qualitative comments instead of only
scores; and (i) peer raters are involved in developing the rating criteria. The findings are
expected to inform practitioners regarding peer assessment practices that are more likely to
exhibit better agreement with teacher assessment.
Keywords: Peer assessment, meta-analysis
Introduction
Peer assessment has been increasingly used to reduce teachers’ workload and to promote
student learning. Based on a meta-analysis of 48 studies published between 1959 and 1999,
Falchikov and Goldfinch (2000) found the weighted correlation coefficient between peer ratings
and teacher ratings to be .69, which is moderately strong. Given the rise of digital and blended
courses, the use of computer-assisted peer assessment has grown remarkably in recent years
(Topping, 1998). However, among the 48 studies included in Falchikov and Goldfinch’s meta-
analysis, very few involved computer-assisted peer assessment. Thus, they did not draw any
conclusion regarding the agreement between peer and teacher ratings in computer-assisted peer
assessment.
It is, therefore, necessary to synthesize more recent studies on peer assessment in order to
understand how well peer ratings correlate with teacher ratings in the current digital age. Also,
given the diversity of peer assessment practices (Gielen, Dochy, & Onghena, 2011), it is
important to understand which factors are likely to influence the agreement between peer and
teacher ratings. The purpose of the present study is to conduct a meta-analysis on the agreement
between peer and teacher ratings based on studies published since 1999. Specifically, two
research questions are addressed: (1) What is the agreement between peer and teacher ratings?
(2) Which factors are likely to influence this agreement?
Literature Review
Peer assessment is defined as “an arrangement in which individuals consider the amount,
level, value, worth, quality or success of the products or outcomes of learning of peers of similar
status” (Topping, 1998, p. 250). In addition to increasing teachers’ efficiency in grading, peer
assessment is advocated as an effective pedagogical strategy for enhancing learning. For
instance, peer assessment has been found to increase students’ engagement (Bloxham & West,
2004), promote students’ critical thinking (Sims, 1989), and increase students’ motivation to
learn (Topping, 2005). Additionally, peer assessment can function as a formative pedagogical
tool (Topping, 2009) or a summative assessment tool (Tsai, 2013). However, despite the great
potential of peer assessment, researchers and practitioners remain concerned about whether
students have the ability to assign reliable and valid ratings to their peers’ work (Liu & Carless,
2006). Specifically, reliability refers to inter-rater consistency across peer raters, whereas validity
refers to the consistency between peer ratings and teacher ratings, assuming teacher ratings to be
the gold standard (Falchikov & Goldfinch, 2000). Validity is a major concern for teachers who
are interested in using peer assessment and is also the primary focus of the present meta-analysis.
In this meta-analysis we use the term “teacher ratings” to be consistent with Falchikov and
Goldfinch (2000), whereas an alternative term used in the literature is “expert ratings.” A
“teacher” could refer to a classroom instructor, a teaching assistant, or a supervisor.
Previous studies on peer assessment showed inconsistent levels of agreement between
peer and teacher ratings. For instance, some studies found low correlations between peer and
teacher ratings, such as .29 in Kovach, Resch, and Verhulst (2009). Others, such as Cho, Schunn,
and Wilson (2006), reported a moderate agreement of .60 between peer and teacher ratings. Yet
others found high agreement between peer and teacher ratings, for example, .97 and .98 in Harris
(2011). In this study, a group of undergraduate students graded their peers’ scientific laboratory
reports. Students had to sign the reports that they assessed, and they would be penalized if there
was a serious discrepancy between their ratings and the teacher ratings. Harris attributed the high
agreement to a few factors, such as the well-structured rating task, the fairly constrained rating
schedule, and the sufficient guidance provided to peer raters. Falchikov and Goldfinch’s meta-
analysis provided a synthesis of the agreement between peer and teacher ratings for studies
conducted from 1959 to 1999. However, it is not clear whether studies published since 1999
would generate a similar result, particularly given that peer assessment is increasingly mediated
by computer technology.
In their meta-analysis, Falchikov and Goldfinch (2000) further investigated the influence
of various factors on the agreement between peer and teacher ratings. Among the major factors
they explored were subject area, quality of study, number of peer raters, level of course, nature of
assessment task, and dimensional versus global judgment. For instance, they found that
global judgments with clearly stated criteria were superior to judgments on separate dimensions.
However, many of their findings were inconclusive, depending on whether they included or
excluded a study by Burnett and Cavaye (1980). Burnett and Cavaye reported a correlation of .99
between peer ratings and grades given by teachers. When this study was omitted from the meta-analysis, the number of peer raters, whether the courses were at an advanced level, and whether the
courses were about social sciences became statistically non-significant, but the quality of studies
became statistically significant. Given the large influence of this study, Falchikov and Goldfinch
provided different findings depending on whether the Burnett and Cavaye study was included in
the meta-analysis or not. This inconclusiveness, alongside the growing use of computer-assisted
peer assessment, necessitates the present meta-analysis, in which we examine the influence of
different factors on the agreement between peer and teacher ratings in the digital age.
Four reviews shaped the factors included in the present meta-analysis. Topping (1998)
proposed a typology of 17 variables to help describe the diversity of peer assessment activities.
van den Berg, Admiraal, and Pilot (2006) and van Gennip, Segers, and Tillema (2009)
reorganized these 17 variables. Gielen et al. (2011) further added three more variables to
Topping’s list and then classified the 20 variables into five categories: decisions concerning the
use of peer assessment, the link between peer assessment and other elements in the learning
environment, interactions between peers, the composition of assessment groups, and the
management of the assessment procedure. Based on these four reviews, as shown in Table 1, we
list 17 variables that might influence the agreement between peer and teacher ratings. These
variables were classified into two categories, those related to peer assessment settings and those
related to peer assessment procedures.
Regarding peer assessment settings, a primary factor to be considered is the mode of peer
assessment, i.e., whether the assessment is paper-based or computer-assisted. Computer-assisted
peer assessment is advocated as having some advantages over traditional paper-based peer
assessment: (a) online environments ensure anonymity and promote fair assessment without
being influenced by “friendship bias” (Lin, Liu, & Yuan, 2001; Wen & Tsai, 2008); (b)
computer assistance enhances efficiency especially for teachers in large classes; and (c)
assessment can be performed freely without time and location restrictions (Wen & Tsai, 2008).
In addition to peer assessment mode, other important variables include the subject area
and the task being assessed (Falchikov & Boud, 1989; Falchikov & Goldfinch, 2000).
Furthermore, the level of the courses in which the peer assessment occurs is worth considering.
Falchikov and Goldfinch’s meta-analysis included only peer assessment in higher education
because few studies had been conducted at the K-12 level then. In the present meta-analysis, we
examine whether the agreement between peer and teacher ratings varies across courses at the K-
12, undergraduate, and graduate levels.
Regarding peer assessment procedures, one factor is the constellation of assessors and the
constellation of assessees (i.e., those who are assessed). For instance, assessors and assessees can
be individuals, pairs, or groups (van Gennip et al., 2009). Other factors are the number of peer
raters per assignment and the number of teachers per assignment (Falchikov & Goldfinch, 2000).
In addition, because the social context of peer assessment may introduce pressure, risk, or
competition among peers (Falchikov, 1995), it is important to examine whether assessors and
assessees are matched at random or not.
Another potential factor is whether peer assessment is compulsory or voluntary (Topping,
2005). Some studies entailed compulsory participation in peer assessment (e.g., Bouzidi &
Jaillet, 2009), whereas in others, participants were self-selected (e.g., Hafner & Hafner, 2003).
Additionally, friendships among peers may result in scoring bias (Magin, 2001). Thus, it is
necessary to examine whether the peer assessment is anonymous or not. Furthermore, feedback
format may be an influential factor (Falchikov, 1995). Sometimes, peer raters provided only
scores (e.g., Sealey, 2013), whereas at other times they provided both scores and qualitative
comments (Liang & Tsai, 2010).
Finally, establishing high-quality peer assessment requires organizing, training, and
monitoring peer raters. Rating quality is said to improve when peer assessments are supported by
training, checklists, exemplification, teacher assistance, and monitoring (Berg, 1999; Miller,
2003; Pond, Ul-Haq, & Wade, 1995). As discussed by Falchikov and Goldfinch (2000), peer
raters’ familiarity with and ownership of assessment criteria tend to improve the accuracy of peer
assessment. Therefore, the present meta-analysis also includes three variables to reflect whether
peer raters receive training, whether explicit rating criteria are used, and whether peer raters are
involved in developing rating criteria.
Methods
Selecting Studies and Coding Procedures
The following criteria were used to select studies for inclusion in the present meta-
analysis. First, we included only studies in which one or more numerical measures of the
agreement between peer and teacher ratings were presented or could be directly obtained from
information provided in the study. Typically, the agreement was measured by Pearson
correlation. Second, to address the file drawer problem, i.e., studies producing significant effects
are more likely to be published (Glass, 1977; Rosenthal, 1979), both published and unpublished
studies were considered. Third, only studies written in English and published (or released) since
1999 were considered. Finally, studies conducted in both educational and medical/clinical
settings were included.
The following procedures were used to search for eligible studies. First, using various
key words such as peer assessment, peer evaluation, peer rating, peer grading, peer scoring, and
peer feedback, we searched several well-known online databases (ERIC, PsycINFO, JSTOR, and
ProQuest). We also used the same key words to search Google Scholar for the most recent
published and unpublished studies not yet included in the databases. Second, we searched review
articles on peer assessment (e.g., Gielen et al., 2011; Speyer, Pilz, Kruis, & Brunings, 2011; van
Gennip et al., 2009) for relevant studies. Third, we reviewed references cited in the studies that
we had already determined to be eligible and added those that we had not already found through
other sources. Finally, we contacted scholars in the field to ask for recent studies that they may
have encountered but that had not yet been published. The initial search located 292 articles in
total. After carefully examining each article and applying the inclusion criteria, we found that 70
of them were eligible. In most cases, an article was excluded because it did not provide
numerical measures of the agreement between peer and teacher ratings. For example, a paper
authored by Senger and Kanthan (2012) showed the comparison between peer and teacher
ratings in graphs instead of numerical values and thus could not be included.
Many of the eligible studies involved more than one comparison (or effect size). For
instance, Liu, Lin, and Yuan (2002) reported the agreement between peer and teacher ratings for
six assignments separately, such that this article generated six comparisons (or effect sizes). The
multiple effect sizes within each study can be averaged, or a single effect size can be selected
from each study. Either approach, however, drastically reduces the number of effect sizes,
making it very challenging to study the effects of the comparison characteristics. Alternatively,
the dependence can be ignored when appropriate (Borenstein, Hedges, Higgins, & Rothstein,
2009). Because the samples used to calculate the six effect sizes in Liu et al. (2002) were
mutually exclusive, each comparison from that study was treated as a unit of analysis. Applying
the same reasoning throughout the screening process, we determined that 269 comparisons from 70 studies were eligible for
inclusion in the present meta-analysis.
Based on the literature discussed previously, a coding sheet was built after several rounds
of discussion among the authors, as shown in Table 1. The authors coded all the studies
collaboratively, and each article was coded by at least two coders in order to establish reliability.
A measure of inter-coder reliability, percentage agreement, was calculated for each coded
variable. The percentage agreement for variables related to peer assessment settings ranged from
95% to 100%, and the percentage agreement for variables related to peer assessment procedures
ranged from 80% to 100%. Disagreements about the assigned codes were resolved through
discussion among the coders until agreement was reached.
Table 1 presents details pertaining to the coded variables. Here are two examples.
Regarding subject area, given the small number of studies involving the arts, we collapsed social
science and the arts into one category. The subject area of science or engineering was coded as a
second category, and medical or clinical constituted a third category. Regarding task being rated,
two categories were established. Essays, reports, proposals, portfolios, design tasks, and exams
were coded as “writing or project.” Teaching demonstrations and musical or medical performances
were coded as “performance.” As shown in Table 1, categorical variables were dummy coded for
subsequent analyses. These variables were subsequently used as potential predictors to account
for variations across the 269 effect sizes.
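The dummy coding described above can be sketched as follows (a minimal Python illustration; the column names below are hypothetical, and the actual variable definitions are those in Table 1):

```python
# A minimal sketch of dummy coding a three-category predictor such as
# subject area, with "social science/arts" as the reference category
# (both dummies zero). The column names are illustrative, not the
# actual labels from the coding sheet.
subject_areas = ["social science/arts", "science/engineering",
                 "medical/clinical", "science/engineering"]

rows = []
for area in subject_areas:
    rows.append({
        "science_engineering": int(area == "science/engineering"),
        "medical_clinical": int(area == "medical/clinical"),
    })
# Each row now carries 0/1 indicators usable as predictors in the model.
```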
Insert Table 1 here
Data Analysis Procedure
Raudenbush and Bryk (1985, 2002) proposed a two-level variance-known hierarchical
linear modeling approach to meta-analysis. This approach focuses on discovering and explaining
variations in effect sizes, and groups of subjects are regarded as nested within the primary studies
included in the meta-analysis. The level-1 model investigates how effect sizes vary across
studies, whereas the level-2 model focuses on explaining the potential sources of this variation.
In particular, the level-2 model examines multiple predictors of effect sizes simultaneously. With
this approach, analysts are able to estimate the average effect size across a set of studies and test
hypotheses about the effects of study characteristics on study outcomes. This approach was
adopted for the current meta-analysis.
The meta-analysis was conducted following the variance-known hierarchical linear
modeling approach with the HLM 6.08 software (Raudenbush, Bryk, & Congdon, 2004). As
demonstrated in Raudenbush and Bryk (2002), when Pearson correlation is reported in the
studies, the correlation coefficient rj is transformed to the standardized effect size measure dj
(i.e., the Fisher’s z transformation):
dj = ½ ln[(1 + rj)/(1 - rj)].  [1]
The sampling variance of dj is approximately
Vj = 1/(nj - 3),  [2]
where
rj is the sample correlation between two variables observed in study j; and
nj is the sample size in study j.
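As a concrete illustration, Equations 1 and 2 can be computed as follows (a minimal Python sketch; the meta-analysis itself was run in the HLM software, and the sample size of 50 below is hypothetical):

```python
import math

def fisher_z(r):
    """Fisher's z transformation of a Pearson correlation r (Equation 1)."""
    return 0.5 * math.log((1 + r) / (1 - r))

def sampling_variance(n):
    """Approximate sampling variance of the transformed effect size (Equation 2)."""
    return 1 / (n - 3)

# A Pearson correlation of .63 corresponds to about .74 in the z metric.
d = fisher_z(0.63)
v = sampling_variance(50)  # hypothetical sample size of 50
```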
With the hierarchical linear modeling approach, the level-1 outcome variable in the meta-
analysis is dj, which is the Fisher’s z transformation of rj reported for each comparison. When
the variation in effect sizes is statistically significant, level-2 analysis is used to determine the
extent to which the predictors contribute to explaining that variation. As described by
Raudenbush and Bryk (2002), the level-1 model (often referred to as the unconditional model) is
dj = δj + ej [3]
where
δj is the true overall effect size across comparisons; and
ej is the sampling error associated with dj as an estimate of δj.
Here, we assume that ej ~ N(0, Vj).
In the level-2 model, the true population effect size, δj, depends on comparison
characteristics and a level-2 random error term:
δj = γ0 + γ1W1j + γ2W2j + … + γkWkj + µj [4]
where
W1j …Wkj are the predictors explaining δj across the effect sizes (see Table 1 for the list of
predictors);
γ0 is the expected overall effect size when all predictors are zero;
γ1 … γk are regression coefficients associated with the comparison characteristics W1 to
Wk; and
µj is a level-2 random error.
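The fixed part of Equations 3 and 4 can be illustrated with a precision-weighted least squares sketch. This is a simplification that omits the level-2 random error µj (i.e., a fixed-effect rather than a random-effects meta-regression), and the effect sizes, variances, and single dummy predictor below are hypothetical; the actual analysis used HLM 6.08:

```python
import numpy as np

# Hypothetical data: effect sizes d (Fisher's z), their known sampling
# variances V (Equation 2), and one dummy predictor W
# (e.g., computer-assisted = 1, paper-based = 0).
d = np.array([0.85, 0.60, 0.75, 0.50, 0.55])
V = np.array([1/47, 1/27, 1/37, 1/57, 1/17])
W = np.array([0, 1, 0, 1, 1])

# Precision-weighted least squares: weights are the inverse variances.
X = np.column_stack([np.ones_like(W), W])  # intercept (gamma_0) and slope (gamma_1)
w = 1 / V
XtWX = X.T @ (w[:, None] * X)
XtWd = X.T @ (w * d)
gamma = np.linalg.solve(XtWX, XtWd)
# gamma[0] estimates the effect size when W = 0; gamma[1], its shift when W = 1.
```

With a single dummy predictor, this reduces to the precision-weighted mean of each group, which makes the estimates easy to check by hand.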
Results
With the procedure described in the methods section, 269 comparisons from 70 studies
were included in the present meta-analysis. The distribution of the Pearson correlations is
illustrated in Figure 1. The correlations ranged from -.19 to .98, with a mean of .57 and a
standard deviation of .24. There were no obvious outliers beyond ±3 standard deviations. In
addition, Table 1 shows the predictors and the corresponding frequencies. The variable
“constellation of assessors” was not included as a predictor because only one comparison
involved assessors working in groups.
Insert Figure 1 Here
To begin, an intercept-only model was fit with no predictors included. The intercept (i.e.,
the estimated grand-mean effect size) was .74, which was statistically different from zero (t(268)
= 28.69, p < .001). This value indicates that on average the agreement between peer and teacher
ratings was .74 in the Fisher’s z metric, which is equivalent to .63 in the Pearson correlation
metric. Furthermore, the estimated variance of the effect size was .14, significantly different
from zero (χ2 = 2,471.35, df = 268, p < .001). This result suggests that variability existed in the
true effect sizes across the comparisons. Therefore, we proceeded to a level-2 conditional model
to determine which characteristics explain this variability.
multicollinearity among the 17 predictors using the diagnostics of Belsley, Kuh, and Welsch
(1980). The multicollinearity among these predictors did not appear to be serious, so all 17
predictors were entered into the model simultaneously. For model parsimony, the predictors that lacked statistical
significance at the .05 level were dropped from the model one at a time starting from the one
with the largest p value. Because assessment mode was our focal interest, this predictor was
always kept in the model. Among the listed predictors in Table 1, eight were dropped eventually,
including W2 (subject area is science/engineering), W4 (task is performance), W6 (level of course
is K-12), W8 (number of peer raters is between 6 and 10), W9 (number of peer raters is larger than
10), W10 (number of teachers per assignment), W15 (there are explicit rating criteria), and W17
(peer raters receive training). As a result, nine significant predictors were retained in the final
model. Below we report detailed results of the final model shown in Table 2. Specifically, we
describe the regression coefficient of each predictor in the Fisher’s z metric, i.e., how a
predictor influenced the correlation between peer and teacher ratings when all the other
predictors were controlled for in the model.
Insert Table 2 here
Among the variables related to peer assessment settings, the first significant factor was
the assessment mode. When the peer assessment was computer-assisted, the
correlation between peer and teacher ratings was significantly lower by .14 standard deviation
units than when the peer assessment was paper-based. Second, when the subject area was
medical/clinical compared to when the subject area was social science/arts, the correlation was
significantly lower by .35 standard deviation units. In the full model with all the predictors, the
correlation was slightly higher when the subject area was science/engineering than when it was
social science/arts, though the difference was not statistically significant. Finally, when the
course was graduate level, the correlation was significantly higher by .18 standard deviation
units than when the course was undergraduate level. Undergraduate and K-12 courses, however,
did not show any significant difference.
Among the variables related to peer assessment procedures, the first significant factor was
the constellation of assessees. When assessees were a group, the correlation between
peer and teacher ratings was significantly lower by .26 standard deviation units compared to
when assessees were individuals. Second, when assessors and assessees were not matched at
random, the correlation was significantly lower by .27 standard deviation units. Third, when peer
assessment was voluntary, the correlation was significantly higher by .28 standard deviation
units than when peer assessment was compulsory. Fourth, when peer raters were non-
anonymous, the correlation was significantly higher by .15 standard deviation units than when
peer raters were anonymous. Fifth, when peer raters provided both scores and comments, the
correlation was also significantly higher by .15 standard deviation units than when peer raters
provided only scores. Finally, when peer raters were involved in developing the rating criteria,
the correlation was significantly higher by .60 standard deviation units than when peer raters
were not involved.
In the final model, when all nine significant predictors were included, the estimated
variance of the effect sizes was .09, significantly different from zero (χ2 = 1623.73, df = 259, p
< .001). Using the variance component in the intercept-only model as the baseline (Raudenbush
& Bryk, 2002), we found that these nine predictors explained 34.30% of the variance in the
effect sizes. This indicates that other sources of variability still exist among the effect sizes
beyond what has been accounted for in this meta-analysis.
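The proportion of variance explained is computed from the level-2 variance components, using the unconditional model as the baseline (Raudenbush & Bryk, 2002). With the rounded components reported above, a minimal sketch looks like this:

```python
# Variance of the true effect sizes in the intercept-only (unconditional) model
var_unconditional = 0.14
# Residual variance after the nine predictors are added (conditional model)
var_conditional = 0.09

# Proportion of between-comparison variance explained by the predictors
explained = (var_unconditional - var_conditional) / var_unconditional
# With these rounded components the value is about .36; the reported 34.30%
# was computed from the unrounded variance estimates.
```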
Insert Table 3 here
Table 3 presents the estimated effect sizes in the Pearson correlation metric calculated
based on the final model. First, consider a default scenario in which all the predictors are zero,
i.e., assessment mode is paper-based; subject area is social science/arts; course is undergraduate
level; assessees are individuals; assessors and assessees are matched at random; peer assessment
is compulsory; peer raters are anonymous; only scores are provided; and peer raters are not
involved in developing criteria. In this default scenario, the estimated correlation in the Fisher’s z
metric is .69 (i.e., the intercept in the final model), which is equivalent to .60 in the Pearson
correlation metric. When assessment mode is computer-assisted and all the other predictors are
held at zero as described in the default scenario, the estimated correlation in the Fisher’s z metric
is .55 (i.e., .69 - .14), which is equivalent to .50 in the Pearson correlation metric. In a similar
way, we calculated the estimated Pearson correlations for other scenarios listed in Table 3. In
general, the estimated Pearson correlations between peer and teacher ratings vary substantially
across different conditions, ranging from .33 to .86.
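The scenario estimates in Table 3 follow from back-transforming the model-implied effect size from the Fisher's z metric to a Pearson correlation via the inverse transformation r = tanh(d). A minimal sketch for the two scenarios described above:

```python
import math

def z_to_r(d):
    """Invert Fisher's z back to a Pearson correlation: r = tanh(d)."""
    return math.tanh(d)

intercept = 0.69            # default scenario, all predictors zero
computer_assisted = -0.14   # coefficient for computer-assisted mode

r_default = z_to_r(intercept)                       # about .60
r_computer = z_to_r(intercept + computer_assisted)  # about .50
```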
Discussion
Correlation between Peer and Teacher Ratings
As shown in the intercept-only model, based on 269 effect sizes from 70 studies, the
estimated average Pearson correlation between peer and teacher ratings was .63. This is
significantly different from zero and moderately strong in a practical sense. This result does not
depart much from what was reported by Falchikov and Goldfinch (2000), in which the
weighted Pearson correlation between peer and teacher ratings was .69. In both the present meta-
analysis and the one conducted by Falchikov and Goldfinch, the correlations were adjusted based
on sample sizes, and thus the results are comparable. The present meta-analysis confirms that
peer ratings generally show a moderately high level of agreement with teacher ratings. It also
finds that the peer-teacher rating agreement based on studies since 1999 is only slightly lower
than that based on studies before 1999. At the same time, the present meta-analysis reveals
insights about factors influencing peer assessment conducted in the digital age, discussed below.
Factors Related to Peer Assessment Settings
The present meta-analysis reveals that computer-assisted peer assessment generates
significantly lower agreement between peer and teacher ratings than traditional paper-based peer
assessment (γ = -.14, t = -2.25, p < .05). Though computer-assisted peer assessment is seen as
more efficient (Lin et al., 2001; Wen & Tsai, 2008), peer raters in a computer-assisted
environment might perform worse, perhaps due to reduced attention, effort, or instructional
support (Suen, 2014). It is also the case that computer-assisted peer assessment is still in its
infancy and has thus yet to show its full potential. Furthermore, the “computer-assisted” mode could
cover a broad range regarding the extent to which the computer technology is used in peer
assessment. For example, in Lin et al. (2001), a sophisticated web-based peer assessment system,
named NetPeas, was used to administer the peer assessment tasks, whereas in Chen and Tsai
(2009), the peer raters mainly used computer technology for uploading and downloading peer
assessment materials. It would be desirable to further classify the computer-assisted mode into
more refined categories according to how much technology is used. However, many studies
included in the meta-analysis did not provide sufficient information on this point, so we retained
only a broad category of “computer-assisted” peer assessment. Further research is needed to study the
effective design and use of technologies to improve the accuracy of peer assessment in digital
environments.
When the subject area was medical/clinical compared to when it was social science/arts
or science/engineering, the correlation between peer and teacher ratings was significantly lower
(γ = -.35, t = -3.19, p < .001). This accords with Falchikov and Goldfinch (2000) in that peer
ratings in professional practice (e.g., clinical skills or teaching practice) were more problematic
than those in academic practices. The present meta-analysis also shows that the correlation was
slightly higher when the subject area was science/engineering compared to when it was social
science/arts although the difference was not statistically significant. A plausible reason is that the
science and engineering tasks are more likely to have clear-cut right or wrong answers, making
them easier for peers to assess. Determining the proper level of granularity in categorizing
subject areas has been a challenge and is always somewhat arbitrary. Having too many
categories will not only reduce the sample size of each category but also introduce too many
dummy-coded predictors for the analysis. We opted to use three categories for the subject areas:
“medical/clinical,” “science/engineering,” and “social science/arts.” With only four studies
in the arts, further subdividing the category of “social science/arts” would be problematic.
In addition, the correlation between peer and teacher ratings for graduate courses was
significantly higher than for undergraduate courses (γ = .18, t = 2.49, p < .05). This was to be
expected as students taking advanced courses are likely to be more cognitively advanced and
potentially have higher reflection skills than those taking introductory courses (Falchikov &
Boud, 1989; Nulty, 2011). Reflection skills are argued to be an important factor related to the
peer assessment quality (Sluijsmans, Brand-Gruwel, van Merriënboer, & Bastiaens, 2002). We
also expected peer and teacher ratings on undergraduate courses to have a higher correlation than
those on K-12 courses. However, the difference we found was not statistically significant,
probably because the present meta-analysis included only a small number of studies involving K-
12 students.
Factors Related to Peer Assessment Procedures
As shown in the present meta-analysis, when assessees were groups, the correlation
between peer and teacher ratings was significantly lower than when assessees were individuals (γ
= -.26, t = -3.56, p < .001). When assessees are groups, the work being assessed is typically
group work, which usually involves interactions and dynamics among group members.
Therefore, assessing group work is likely to be more challenging compared to assessing
individual work (Panitz, 2003). This partially explains why peer ratings were less accurate when
assessees were groups than when assessees were individuals.
When assessors and assessees were matched at random, the correlation between peer and
teacher ratings was higher than when the matching was not random (γ = -.27, t = -3.83, p < .001).
Assessment bias exists when there is a systematic tendency for peer assessment scores “to be
influenced by anything other than the trait, behavior, or outcome they are supposed to be
measuring” (Kane & Lawler, 1978, p. 558). It is reasonable that randomly matching assessors
and assessees helps to reduce certain systematic biases and thus leads to higher agreement
between peer and teacher ratings.
Voluntary peer ratings showed more agreement with teacher ratings than did compulsory
peer ratings (γ = .28, t = 3.40, p < .01). When students are given choices, they are more likely to
engage in the task and the learning process (Boud, 2012). Also, voluntary peer assessment
usually happens when peers are interested in and motivated to participate in peer assessment
activities, which might lead to more accurate peer ratings. Therefore, it would be helpful for
teachers to boost students’ interest in and motivation for conducting peer assessment.
The correlation between peer and teacher ratings was significantly higher when peer
raters were non-anonymous than when they were anonymous (γ = .15, t = 2.40, p < .05).
Anonymity is believed to lead to a fairer environment and more honest ratings (Joinson, 1999).
However, as discussed by Cestone, Levine, and Lane (2008), when peer assessment is
anonymous, students may provide harsher criticism and evaluations. Also, previous research
noted that peer raters may lack effort or seriousness when performing the assessment (Hanrahan
& Isaacs, 2001). It is possible that non-anonymity may lead peer raters to take the assessment
more seriously, thereby generating more accurate ratings. Bloxham and West (2004) asked peer
raters to evaluate each other’s peer rating quality and found this practice encouraged peer raters
to engage seriously in the peer rating process.
When peer raters provided both scores and comments, the correlation between peer and
teacher ratings was significantly higher than when peer raters provided only scores (γ = .15, t =
2.57, p < .05). It is reported that students feel more comfortable giving qualitative feedback than
giving a purely quantitative evaluation of their peers’ work (Cestone et al., 2008). Moreover, it is
likely that qualitative comments enable students to more actively reflect on their peers’ work,
thereby supporting a clearer rationale for their numerical ratings (Avery, 2014). This reflection
and documentation help to improve the accuracy of peer ratings. In addition, Davies (2006,
2009) reported strong correlations between qualitative comments (i.e., negative/positive) and
numerical ratings, which supports the accuracy of qualitative comments.
A striking finding is that peer rater involvement in developing rating criteria yielded
much higher correlations between peer and teacher ratings than when peer raters were not
involved in developing the criteria (γ = .60, t = 5.55, p < .001). Discussion, negotiation, and joint
construction of assessment criteria are likely to give students a greater sense of ownership and
investment in their evaluations (Cestone et al., 2008; Topping, 2003). Such student involvement
can also make the rating criteria more understandable and easier to apply (Orsmond, Merry, &
Reiling, 1996). In addition, through this involvement, students can gain an opportunity to reflect
on their own learning. For this reason, we suggest the practice of involving peer raters in
developing rating criteria.
However, the correlation between peer and teacher ratings was not significantly higher
when peer raters had received training. A possible reason for this is that the quality of peer-rater
training may vary across the studies included in this meta-analysis such that a dichotomous
coding of whether peers had received training may not have been sufficient to capture its effect.
In addition, raters might have performed peer assessment or received training before the
peer assessment activity reported in the study; however, this information was not available in
most articles and thus was not included in the present meta-analysis. Further, we did not find the
correlation to be significantly higher when explicit criteria were used. This is probably because
almost all the studies included in this meta-analysis used explicit criteria, so that it was difficult
to detect a statistically significant effect for this predictor.
Neither the number of teachers per assignment nor the number of peer raters per
assignment was statistically significant. In the original full model, the correlation between peer
and teacher ratings was slightly higher when the number of peer raters per assignment was
medium (6–10) or high (more than 10) rather than low (1–5), although the difference was not
statistically significant at the .05 level. This
direction agrees with the previous literature (Cho et al., 2006; Winch & Anderson, 1967)
wherein the agreement between peer and teacher ratings is higher when more peer raters are
involved. Falchikov and Goldfinch (2000), however, found that the agreement was lower when
the number of peer raters was more than 20. Nevertheless, there is no agreed-upon value for the
optimal number of peer raters per assignment.
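The intuition that averaging over more peer raters should raise agreement with teacher ratings can be illustrated with the Spearman-Brown prophecy formula, which predicts the reliability of the mean of k parallel ratings. This formula is not part of the present analysis; the sketch below is purely illustrative, and the .40 single-rater reliability is a hypothetical input.

```python
def spearman_brown(r_single: float, k: int) -> float:
    """Predicted reliability of the mean of k parallel ratings,
    given the reliability r_single of a single rating (Spearman-Brown)."""
    return k * r_single / (1 + (k - 1) * r_single)

# A hypothetical single peer rating with reliability .40:
# averaging 5 raters is predicted to raise it to about .77,
# and averaging 10 raters to about .87.
for k in (1, 5, 10):
    print(k, round(spearman_brown(0.40, k), 2))
```

The gains diminish as k grows, which is consistent with there being no single optimal number of raters.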
Conclusion, Implications, and Future Study
Based on empirical studies published since 1999, this meta-analysis investigates the
agreement between peer and teacher ratings and factors that might significantly influence this
agreement. We found that peer and teacher ratings overall agree with each other at a moderately
high level (r = .63). This correlation is significantly higher when (a) the peer assessment is
paper-based rather than computer-assisted; (b) the subject area is not medical/clinical; (c) the
course is graduate level rather than undergraduate or K-12; (d) individual work instead of group
work is assessed; (e) the assessors and assessees are matched at random; (f) the peer assessment
is voluntary instead of compulsory; (g) the peer assessment is non-anonymous; (h) peer raters
provide both scores and qualitative comments instead of only scores; and (i) peer raters are
involved in developing the rating criteria. It is important to note that the effect of each variable
was examined when the other variables were controlled for. The findings of this meta-analysis
are expected to inform educational practitioners on how to structure peer assessment in ways that
maximize assessment quality in a variety of settings.
Given the prevalence of peer assessment in higher education, this meta-analysis
has especially important implications for higher education in the digital age. A noteworthy
finding is that the agreement between peer and teacher ratings was significantly lower when the
peer assessment was computer-assisted rather than paper-based. With the rapid growth of
technology and calls for cost containment in higher education, we anticipate that peer assessment
will be increasingly mediated by computer technology. Such trends are evident in the rapid rise
of massive open online courses (MOOCs) throughout higher education, where computer-assisted
peer assessment is the primary assessment method (Suen, 2014). Given these trends, it is essential to
conduct research on improving the quality and accuracy of peer assessment in the digital age. In
addition, the present meta-analysis confirms that agreement between peer and teacher ratings is
higher for graduate courses than for undergraduate or K-12 courses. This finding provides a basis
for educators to use peer assessment with more confidence in graduate-level courses.
In summary, the present meta-analysis provides important evidence on the agreement
between peer and teacher ratings and the factors likely to influence the extent of this agreement.
This evidence rests on the assumption that teacher ratings are the gold standard, an assumption
that is the norm in the field but is itself worthy of further investigation. Furthermore, peer assessment involves a wide
range of activities, and a list of variables influencing the agreement between peer and teacher
ratings will never be exhaustive (Gielen et al., 2011). We included only theoretically meaningful
predictors that could be reliably coded. As a result, the current meta-analysis explained only
about one third of the variation of the agreement between peer and teacher ratings, and it is
important to examine the influence of other potential factors in future research. For instance, peer
assessment can be formative or summative, but the present meta-analysis did not examine
whether the purpose of peer assessment influences the agreement between peer and teacher
ratings. Also, peer raters from different cultural backgrounds could differ in how they allocate
scores to their peers (Fan, 2011). Finally, the present meta-analysis did not examine the
reliability of peer assessment (i.e., the inter-rater consistency across peer raters) or the effect of
peer assessment on learning. All these issues need to be addressed
in future work.
A final note is to reflect on the use of meta-analysis. Meta-analysis is a quantitative
synthesis of findings from different studies. One criticism against meta-analysis is that
researchers compare apples and oranges across studies because each individual study is different
in nature (Sharpe, 1997). As stated by Glass (2000), a meta-analysis asks questions about fruit,
for which both apples and oranges contribute important information. Further, meta-analysis
researchers take into account the characteristics of different studies while combining their
results. In the present meta-analysis, our level-2 analysis ascertains whether differences in
course level, subject area, and other variables explain variance in the effect sizes, thus
minimizing the problem of taking an unqualified average effect size across all studies.
Nevertheless, the results of our meta-analysis are necessarily influenced by the level-2 predictors
we included and the way we chose to operationalize them. For example, we chose a coarse
level of granularity, dividing subject areas into only three categories:
medical/clinical, science/engineering, and social sciences/arts. Also, there are potentially other
predictors not included due to a lack of information provided in the studies. It is, therefore,
necessary to replicate or extend the current meta-analysis as new studies become available.
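As a minimal sketch of the variance-known (V-known) approach underlying this meta-analysis: each study contributes a Fisher-z effect size whose sampling variance, 1/(n - 3), is treated as known, and level-2 predictors are entered in a precision-weighted regression. The data below are hypothetical, and this fixed-effects weighted least squares step omits the between-study variance component estimated in the full HLM.

```python
import numpy as np

# Hypothetical studies: Fisher-z effect sizes, sample sizes, and one
# binary level-2 predictor (1 = computer-assisted peer assessment).
z = np.array([0.7, 0.5, 0.8, 0.4, 0.6])
n = np.array([40, 60, 35, 50, 45])
w1 = np.array([0, 1, 0, 1, 0])

v = 1.0 / (n - 3)                      # known level-1 sampling variances
X = np.column_stack([np.ones_like(z), w1])
W = np.diag(1.0 / v)                   # weight each study by its precision

# Precision-weighted least squares estimate of (intercept, slope):
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)
print(beta)  # in this toy data the slope is negative
```

With a single binary predictor, the estimates reduce to precision-weighted group means, which makes the weighting transparent.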
References
References marked with an asterisk indicate studies included in the meta-analysis.
*Ahangari, S., Rassekh-Alqol, B., & Hamed, L. A. (2013). The effect of peer assessment on oral
presentation in an EFL context. International Journal of Applied Linguistics & English
Literature, 2(3), 45–53.
*Avery, J. (2014). Leveraging crowdsourced peer-to-peer assessments to enhance the case
method of learning. Journal for Advancement of Marketing Education, 22(1), 1–15.
*Barker, T. & Bennet, S. (2010). Marking complex assignments using peer assessment with an
electronic voting system and an automated feedback tool. Proceedings of CAA 2010
Conference, Southampton, UK. Retrieved from
http://caaconference.co.uk/pastConferences/2010/Barker-CAA2010.pdf
*Basehore, P. M., Pomerantz, S. C., & Gentile, M. (2014). Reliability and benefits of medical
student peers in rating complex clinical skills. Medical Teacher, 36(5), 409–414.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential
data and sources of collinearity. New York, NY: John Wiley and Sons.
Berg, E. C. (1999). The effects of trained peer response on ESL students’ revision types and
writing quality. Journal of Second Language Writing, 8(3), 215–241.
Bloxham, S., & West, A. (2004). Understanding the rules of the game: Marking peer assessment
as a medium for developing students’ conceptions of assessment. Assessment &
Evaluation in Higher Education, 29(6), 721–733.
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-
analysis. Chichester, UK: John Wiley & Sons.
Boud, D. (2012). Developing student autonomy in learning (2nd ed.). New York, NY: Nichols
Publishing Company.
*Bouzidi, L., & Jaillet, A. (2009). Can online peer assessment be trusted? Journal of Educational
Technology & Society, 12(4), 257–268.
Burnett, W., & Cavaye, G. (1980). Peer assessment by fifth year students of surgery. Assessment
in Higher Education, 5(3), 273–278.
*Campbell, K. S., Mothersbaugh, D. L., Brammer, C., & Taylor, T. (2001). Peer versus self
assessment of oral business presentation performance. Business Communication
Quarterly, 64(3), 23–42.
Cestone, C. M., Levine, R. E., & Lane, D. R. (2008). Peer assessment and evaluation in team-
based learning. New Directions for Teaching and Learning, 2008 (116), 69–78.
*Chen, Y. C., & Tsai, C. C. (2009). An educational research course facilitated by online peer
assessment. Innovations in Education & Teaching International, 46(1), 105–117.
*Cheng, W., & Warren, M. (1999). Peer and teacher assessment of the oral and written tasks of a
group project. Assessment & Evaluation in Higher Education, 23(3), 301–314.
*Cho, K., Schunn, C., & Wilson, R. (2006). Validity and reliability of scaffolded peer
assessment of writing from instructor and student perspectives. Journal of Educational
Psychology, 98(4), 891–901.
*Coulson, M. (2009). Peer marking of talks in a small, second year biosciences course. 2009
UniServe Science Proceedings. Retrieved from http://ojs-
prod.library.usyd.edu.au/index.php/IISME/article/viewFile/6197/6845
*Daniel, R. (2005). Sharing the learning process: Peer assessment applications in practice.
Proceedings of the Effective Teaching and Learning Conference 2004. Retrieved from
http://researchonline.jcu.edu.au/5024/
Davies, P. (2006). Peer assessment: Judging the quality of students’ work by comments rather
than marks. Innovations in Education and Teaching International, 43(1), 69–82.
Davies, P. (2009). Review and reward within the computerised peer‐assessment of
essays. Assessment & Evaluation in Higher Education, 34(3), 321–333.
*Davis, D. J. (2002). Comparison of faculty, peer, self, and nurse assessment of obstetrics and
gynecology residents. Obstetrics & Gynecology, 99(4), 647–651.
Falchikov, N. (1995). Peer feedback marking: Developing peer assessment. Innovations in
Education and Training International, 32(2), 175–187.
Falchikov, N., & Boud, D. (1989). Student self-assessment in higher education: A meta-analysis.
Review of Educational Research, 59(4), 395–430.
Falchikov, N., & Goldfinch, J. (2000). Student peer assessment in higher education: A meta-
analysis comparing peer and teacher marks. Review of Educational Research, 70(3), 287–
322.
Fan, M. (2011). International students' perceptions, practices and identities of peer assessment
in the British university: a case study. Paper presented at the Internationalisation of
Pedagogy and Curriculum in Higher Education Conference, Coventry, UK. Retrieved
from https://www.heacademy.ac.uk/node/3770
*Freeman, S., & Parks, J. W. (2010). How accurate is peer grading? CBE Life Sciences
Education, 9(4), 482–488.
*Garcia-Ros, R. (2011). Analysis and validation of a rubric to assess oral presentation skills in
university contexts. Electronic Journal of Research in Educational Psychology, 9(3),
1,043–1,062.
Gielen, S., Dochy, F., & Onghena, P. (2011). An inventory of peer assessment diversity.
Assessment & Evaluation in Higher Education, 36(2), 137–155.
Glass, G. V. (1977). Integrating findings: The meta-analysis of research. Review of Research in
Education, 5, 351–379.
Glass, G. V. (2000). Meta-analysis at 25. Retrieved from
http://glass.ed.asu.edu/gene/papers/meta25.html
*Gracias, N., & Garcia, R. (2013). Can we trust peer grading in oral presentations? Towards
optimizing a critical resource nowadays: Teachers’ time. Paper presented at 5th
International Conference on Education and New Learning Technologies (EDULEARN),
Barcelona, Spain. Retrieved from
http://users.isr.ist.utl.pt/~ngracias/publications/Gracias13_Edulearn_1329.pdf
*Griesbaum, J. & Gortz, M. (2010). Using feedback to enhance collaborative learning: An
exploratory study concerning the added value of self- and peer-assessment by first-year
students in a blended learning lecture. International Journal of E-learning, 9(4), 481–
503.
*Hafner, J., & Hafner, P. (2003). Quantitative analysis of the rubric as an assessment tool: An
empirical study of student peer-group rating. International Journal of Science Education,
25(12), 1,509–1,528.
*Hamer, J., Purchase, H., Luxton-Reilly, A., & Denny, P. (2014, online first). A comparison of
peer and tutor feedback. Assessment & Evaluation in Higher Education.
DOI:10.1080/02602938.2014.893418.
*Hammer, R., Ronen, M., & Kohen-Vacs, D. (2012). On-line project-based peer assessed
competitions as an instructional strategy in higher education. Interdisciplinary Journal of
E-Learning and Learning Objects, 8, 1–14.
*Han, Y., James, D. H., & McLain, R. M. (2013). Relationships between student peer and
faculty evaluations of clinical performance: A pilot study. Journal of Nursing Education
and Practice, 3(8), 1–9.
Hanrahan, S. J., & Isaacs, G. (2001). Assessing self- and peer-assessment: The students’ views.
Higher Education Research & Development, 20(1), 53–70.
*Harris, J. (2011). Peer assessment in large undergraduate classes: An evaluation of a procedure
for marking laboratory reports and a review of related practices. Advances in Physiology
Education, 35(2), 178–187.
*Heyman, J. E., & Sailors, J. J. (2011). Peer assessment of class participation: Applying peer
nomination to overcome rating inflation. Assessment & Evaluation in Higher Education,
36(5), 605–618.
*Hidayat, M. T. (2013). Self-, peer- and teacher-assessment in translation course. Retrieved from
http://file.upi.edu/Direktori/FPBS/JUR._PEND._BAHASA_INGGRIS/19670609199403
1-
DIDI_SUKYADI/SELF,%20PEER%20AND%20TEACHER%20ASSESSMENT%20IN
%20TRANSLATION%20COURSE.pdf
Joinson, A. N. (1999). Social desirability, anonymity and Internet-based questionnaires.
Behavior Research Methods, Instruments and Computers, 31(3), 433–438.
*Jones, I., & Alcock, L. (2013, online first). Peer assessment without assessment criteria. Studies
in Higher Education. DOI:10.1080/03075079.2013.821974.
*Kakar, S. P., Catalanotti, J. S., Flory, A. L., Simmerns, S. J., Lewis, K. L., Mintz, M. L.,
Hwaywood, Y.C., & Blatt, B.C. (2013). Evaluating oral case presentations using a
checklist: How do senior student-evaluators compare with faculty? Academic Medicine,
88(9), 1,363–1,367.
Kane, J. S., & Lawler, E. E. (1978). Methods of peer assessment. Psychological Bulletin, 85(3),
555–586.
*Killic, G. B., & Cakan, M. (2007). Peer assessment of elementary science teaching skills.
Journal of Science Teacher Education, 18(1), 91–107.
*Kovach, A. R., Resch, S. R., & Verhulst, J. S. (2009). Peer assessment of professionalism: A
five-year experience in medical clerkship. Journal of General Internal Medicine, 24(6),
742–746.
*Langan, M. A., Shuker, D. M., Cullen, W. R., Penney, D., Preziosi, F .R., & Wheater, P. C.
(2008). Relationships between student characteristics and self-, peer and tutor evaluations
of oral presentations. Assessment & Evaluation in Higher Education, 33(2), 179–190.
*Lanning, S. K., Brickhouse, T. H., Gunsolley, J. C., Ranson, S. L., & Wilett, R. M. (2011).
Communication skills instruction: An analysis of self, peer-group, student instructors and
faculty assessment. Patient Education and Counseling, 83(2), 145–151.
*Liang, J. C., & Tsai, C. C. (2010). Learning through science writing via online peer assessment
in a college biology course. Internet and Higher Education, 13(4), 242–247.
*Lin, S. S. J., Liu, E. X. F. & Yuan, S. M. (2001). Web-based peer assessment: Feedback for
students with various thinking-styles. Journal of Computer Assisted Learning, 17(4),
420–432.
*Liow, J-L. (2008). Peer assessment in thesis oral presentation. European Journal of
Engineering Education, 33(5–6), 525–537.
*Lirely, R., Keech, M. K., Vanhook, C., & Little, P. (2011). Development and evaluative
contextual usage of peer assessment of research presentations in a graduate tax
accounting course. International Journal of Business and Social Science, 2(23), 89–94.
Liu, N. F., & Carless, D. (2006). Peer feedback: The learning element of peer assessment.
Teaching in Higher Education, 11(3), 279–290.
*Liu, E. Z.-F., Lin, S. S. J., & Yuan, S.-M. (2002). Alternatives to instructor assessment: A case
study of comparing self and peer assessment with instructor assessment under a
networked innovative assessment procedures. International Journal of Instructional
Media, 29(4), 395–404.
*Liu, C. C., & Tsai, C. M. (2005). Peer assessment through web-based knowledge acquisition:
Tools to support conceptual awareness. Innovations in Education and Teaching
International, 42(1), 43–59.
*MacAlpine, J. M. K. (1999). Improving and encouraging peer assessment of student
presentations. Assessment & Evaluation in Higher Education, 24(1), 15–25.
Magin, D. (2001). Reciprocity as a source of bias in multiple peer assessment of group work.
Studies in Higher Education, 26(1), 53–63.
*Magin, D., & Helmore, P. (2001). Peer and teacher assessments of oral presentation skills: How
reliable are they? Studies in Higher Education, 26(3), 287–298.
*Mehrdad, N., Bigdeli, S., & Ebrahimi, H. (2012). A comparative study on self, peer and teacher
evaluation to evaluate clinical skills of nursing students. Procedia-Social and Behavioral
Sciences, 47, 1,847–1,852.
*Mika, S. (2006). Peer- and instructor assessment of oral presentations in Japanese university
EFL classrooms: A pilot study. Waseda Global Forum, 3, 99–107. Retrieved from
https://dspace.wul.waseda.ac.jp/dspace/bitstream/2065/11344/1/13M.Shimura.pdf
Miller, P. J. (2003). The effect of scoring criteria specificity on peer and self-assessment.
Assessment & Evaluation in Higher Education, 28(4), 383–394.
*Napoles, J. (2008). Relationships among instructor, peer and self-evaluations of undergraduate
music education majors’ micro-teaching experiences. Journal of Research in Music
Education, 56(1), 82–91.
Nulty, D. D. (2011). Peer and self-assessment in the first year of university. Assessment &
Evaluation in Higher Education, 36(5), 493–507.
*Okuda, R. & Otsu, R. (2010). Peer assessment for speeches as an aid to teacher grading. The
Language Teacher, 34(4), 41–47.
Orsmond, P., Merry, S., & Reiling, K. (1996). The importance of marking criteria in the use of
peer assessment. Assessment & Evaluation in Higher Education, 21(3), 239–250.
*Ostafichuk, P. M. (2012). Peer-to-peer assessment in large classes: A study of several
techniques used in design courses. Paper presented at 2012 ASEE Annual Conference,
San Antonio, Texas. Retrieved from
http://www.asee.org/public/conferences/8/papers/4547/view
*Otoshi, J. & Heffernan, N. (2007). An analysis of peer assessment in EFL college oral
presentation classrooms. The Language Teacher, 31(11), 3–8.
*Panadero, E., Romero, M., & Strijbos, J. W. (2013). The impact of a rubric and friendship on
peer assessment: Effects on construct validity, performance, and perceptions of fairness
and comfort. Studies in Educational Evaluation, 39(4), 195–203.
Panitz, T. (2003). Faculty and student resistance to cooperative learning. In J. L. Cooper, P.
Robinson, & D. Ball (Eds), Small group instruction in higher education: Lessons from
the past, visions of the future (pp.193-200). Stillwater, OK: New Forums Press.
*Papinczak, T., Young, L., Groves, M., & Haynes, M. (2007). An analysis of peer, self, and tutor
assessment in problem-based learning tutorials. Medical Teacher, 29(5), 122–132.
*Patri, M. (2002). The influence of peer feedback on self- and peer-assessment of oral skills.
Language Testing, 19(2), 109–131.
Pond, K., Ul-Haq, R., & Wade, W. (1995). Peer review: A precursor to peer assessment.
Innovations in Education & Training International, 32(4), 314–323.
*Raes, A., Vanderhoven, E., & Schellens, T. (2013, online first). Increasing anonymity in peer
assessment by using classroom response technology within face-to-face higher education.
Studies in Higher Education. DOI:10.1080/03075079.2013.823930.
Raudenbush, S. W., & Bryk, A. S. (1985). Empirical Bayes meta-analysis. Journal of
Educational Statistics, 10(2), 75–98.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data
analysis methods (2nd ed.). London: Sage.
Raudenbush, S. W., Bryk, A. S, & Congdon, R. (2004). HLM 6 for Windows [Computer
software]. Lincolnwood, IL: Scientific Software International.
Rosenthal, R. (1979). The “file drawer problem” and tolerance for null results. Psychological
Bulletin, 86(3), 638–641.
*Rudy, D. W., Fejfar, M. C., Griffith, C. H., & Wilson, J. F. (2001). Self- and peer assessment
in a first-year communication and interviewing course. Evaluation & the Health
Professions, 24(4), 436–445.
*Sadler, M. P., & Good, E. (2006). The impact of self- and peer-grading on student learning.
Educational Assessment, 11(1), 1–31.
*Saito, H. (2008). EFL classroom peer assessment: Training effects on rating and commenting.
Language Testing, 25(4), 553–581.
*Saito, H., & Fujita, T. (2009). Peer-assessing peers’ contribution to EFL group presentations.
RELC Journal, 40(2), 149–171.
*Sealey, R. M. (2013). Peer assessing in higher education: Perspectives of students and staff.
Education Research and Perspectives, 40, 276–298.
Senger, J. L., & Kanthan, R. (2012). Student evaluations: Synchronous tripod of learning
portfolio assessment—self-assessment, peer-assessment, instructor-assessment. Creative
Education, 3(1), 155–163.
Sharpe, D. (1997). Of apples and oranges, file drawers and garbage: Why validity issues in meta-
analysis will not go away. Clinical Psychology Review, 17(8), 881–901.
*Sila, A., & Bartan, O. (2010). Self and peer assessment in different ability groups. Paper
presented at International Conference on New Trends in Education and Their
Implications, Antalya, Turkey. Retrieved from
http://www.iconte.org/FileUpload/ks59689/File/8.pdf
Sims, G. K. (1989). Student peer review in the classroom: A teaching and grading tool. Journal
of Agronomic Education, 8(2), 105–108.
*Sitthiworachart, J., & Joy, M. (2008). Computer support of effective peer assessment in an
undergraduate programming class. Journal of Computer Assisted Learning, 24(3), 217–
231.
Sluijsmans, D., Brand-Gruwel, S., van Merriënboer, J. J., & Bastiaens, T. J. (2002). The training
of peer assessment skills to promote the development of reflection skills in teacher
education. Studies in Educational Evaluation, 29(1), 23–42.
Speyer, R., Pilz, W., van der Kruis, J., & Brunings, J. (2011). Reliability and validity of student
peer assessment in medical education: A systematic review. Medical Teacher, 33(11),
572–585.
*Steensels, C., Leemans, L., Buelens, H., Laga, E., Lecoutere, A., Laekeman, G., & Simoens, S.
(2006). Peer assessment: A valuable tool to differentiate between student contributions to
group work? Pharmacy Education, 6(2), 111–118.
Suen, H. K. (2014). Peer assessment for massive open online courses (MOOCs). The
International Review of Research in Open and Distance Learning, 15(3), 312–327.
*Sullivan, M. E., Hitchcock, M. A., & Dunnington, G. L. (1999). Peer and self assessment during
problem-based tutorials. American Journal of Surgery, 177(3), 266–269.
*Tepsuriwong, S. & Bunson, T. (2013). Introducing peer assessment to a reading classroom:
Expanding Thai university students’ learning boundaries beyond alphabetical symbols.
Mediterranean Journal of Social Sciences, 4(14), 279–286.
*Thomas, P. A., Gebo, K. A., & Hellmann, D. B. (1999). A pilot study of peer review in
residency training. Journal of General Internal Medicine, 14(9), 551–554.
Topping, K. J. (1998). Peer assessment between students in colleges and universities. Review of
Educational Research, 68(3), 249–276.
Topping, K. J. (2003). Self and peer assessment in school and university: Reliability, validity and
utility. In M. S. R. Segers, F. J. R. C. Dochy, & E. C. Cascallar (Eds.), Optimizing new
modes of assessment: In search of qualities and standards (pp. 55–87). Dordrecht, The
Netherlands: Kluwer Academic.
Topping, K. J. (2005). Trends in peer learning. Educational Psychology, 25(6), 631–645.
Topping, K. J. (2009). Peer assessment. Theory into Practice, 48(1), 20–27.
*Tsai, P. V. (2013). Peering into peer assessment: An investigation of the reliability of peer
assessment in MOOCs. Unpublished master’s thesis, Princeton University, Princeton, NJ.
*Tsai, C.-C., & Liang, J.-C. (2009). The development of science activities via on-line peer
assessment: The role of scientific epistemological views. Instructional Science, 37(3),
293–310.
*Tsai, C.-C., Lin, S. S. J., & Yuan, S.-M. (2002). Developing science activities through a
networked peer assessment system. Computers & Education, 38(1–3), 241–252.
*Tseng, S.-C., & Tsai, C.-C. (2007). On-line peer assessment and the role of the peer feedback:
A study of high school computer course. Computers & Education, 49(4), 1,161–1,174.
*Tseng, S-C., & Tsai, C.-C. (2009). Exploring the role of student online peer assessment self-
efficacy in online peer assessment learning environments. In G. Siemens & C. Fulford
(Eds.), Proceedings of World Conference on Educational Multimedia, Hypermedia and
Telecommunications 2009 (pp. 3,357–3,361). Chesapeake, VA: AACE.
van den Berg, I., Admiraal, W., & Pilot, A. (2006). Peer assessment in university teaching:
Evaluating seven course designs. Assessment & Evaluation in Higher Education, 31(1),
19–36.
van Gennip, N., Segers, M., & Tillema, H. (2009). Peer assessment for learning from a social
perspective: The influence of interpersonal variables and structural features. Educational
Research Review, 4(1), 41–54.
*Verkade, H. & Bryson-Richardson, R. J. (2013). Student acceptance and application of peer
assessment in a final year genetics undergraduate oral presentation. Journal of Peer
Learning, 6(1), 1–18.
*Vozniuk, A., Holzer, A., & Gillet, D. (2014). Peer assessment based on ratings in a social
media course. Proceedings of the Fourth International Conference on Learning Analytics
And Knowledge. New York, NY.
*Wen, L. M., & Tsai, C.-C. (2008). Online peer assessment in an in-service science and
mathematics teacher education course. Teaching in Higher Education, 13(1), 55–67.
Winch, R. F., & Anderson, R. B. W. (1967). Two problems involved in the use of peer rating
scales and some observations on Kendall's coefficient of concordance. Sociometry, 30(3),
316–322.
*Xiao, Y., & Lucking, R. (2008). The impact of two types of peer assessment on students’
performance and satisfaction within a Wiki environment. Internet and Higher Education,
11(3), 186–193.
*Yinjaroen, P., & Chiramanee, T. (2011). Peer assessment of oral English proficiency. Paper
presented at The 3rd International Conference on Humanities and Social Sciences. Hat
Yai, Songkhla, Thailand. Retrieved from
http://tar.thailis.or.th/bitstream/123456789/660/1/001.pdf
*Zakian, M., Moradan, A., & Naghibi, S. E. (2012). The relationship between self-, peer-, and
teacher-assessments of EFL learners’ speaking. World J Arts, Languages, and Social
Sciences, 1(1), 1–5.
Table 1
Variables and Frequencies

Assessment mode: paper-based (reference group), 135; computer-assisted (W1), 134.
Subject area: social science/arts (reference group), 160; science/engineering (W2), 79;
medical/clinical (W3), 30.
Task being rated: writing or project (reference group), 149; performance (W4), 120.
Level of course: undergraduate level (reference group), 188; graduate level (W5), 65;
K-12 (W6), 16.
Constellation of assessees: individual (reference group), 216; group (W7), 53.
Number of peer raters per assignment: smaller than 6 (reference group), 108; between 6 and 10
(W8), 50; larger than 10 (W9), 111.
Number of teachers per assignment: one teacher (reference group), 188; more than one teacher
(W10), 81.
Matching of assessors and assessees: matched at random (reference group), 227; not matched at
random (W11), 42.
Requirement: peer assessment is compulsory (reference group), 225; peer assessment is
voluntary (W12), 44.
Interaction between peers: anonymous rating (reference group), 214; non-anonymous rating
(W13), 55.
Feedback format: only scores without written comments (reference group), 100; scores with
written comments (W14), 169.
Explicit rating criteria: no explicit rating criteria (reference group), 12; explicit rating criteria
(W15), 257.
Involvement of students: peer raters are not involved in developing the rating criteria (reference
group), 251; peer raters are involved in developing the rating criteria (W16), 18.
Peer rater training: peer raters receive training (reference group), 168; peer raters do not receive
training (W17), 101.
Table 2
Results of the Final Model with Only Significant Predictors

| Predictor | Regression coefficient* | Standard error | t ratio | p value |
|---|---|---|---|---|
| Intercept | .69 | .05 | 14.44 | .000 |
| Assessment mode is computer-assisted (vs. paper-based) | -.14 | .06 | -2.25 | .026 |
| Subject area is medical/clinical (vs. social science/arts) | -.35 | .11 | -3.19 | .002 |
| Course is graduate level (vs. undergraduate level) | .18 | .07 | 2.49 | .013 |
| Assessees are groups (vs. individuals) | -.26 | .07 | -3.56 | .000 |
| Assessors and assessees are not matched at random (vs. matched at random) | -.27 | .07 | -3.83 | .000 |
| Peer assessment is voluntary (vs. compulsory) | .28 | .08 | 3.40 | .001 |
| Peer raters are non-anonymous (vs. anonymous) | .15 | .06 | 2.40 | .017 |
| Both scores and comments are provided (vs. only scores) | .15 | .06 | 2.57 | .011 |
| Peer raters are involved in developing the rating criteria (vs. not involved) | .60 | .11 | 5.55 | .000 |

*Note: The regression coefficients are expressed in the Fisher's z metric.
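Because Pearson correlations are bounded and skewed near their limits, meta-analytic models are typically fitted on Fisher's z-transformed correlations and the results back-transformed afterward. A minimal sketch of the transformation, using only Python's standard library (the value 0.60 is chosen because it corresponds to the model intercept of .69):

```python
import math

# Fisher's z transformation maps a Pearson correlation r onto an
# approximately normal, unbounded metric:
#   z = atanh(r) = 0.5 * ln((1 + r) / (1 - r))
# The back-transformation is r = tanh(z).
r = 0.60
z = math.atanh(r)       # ≈ 0.69, the intercept of the final model
r_back = math.tanh(z)   # recovers 0.60
print(round(z, 2), round(r_back, 2))
```

Fitting in the z metric is also why the coefficients in Table 2 are additive: each predictor shifts the expected correlation on the z scale, not on the r scale.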
Table 3
Estimated Pearson Correlations between Peer and Teacher Ratings in the Final Model

| Condition | Pearson correlation |
|---|---|
| Default scenario (all predictors equal zero) | 0.60 |
| Assessment mode is computer-assisted, all else as in the default scenario | 0.50 |
| Subject area is medical/clinical, all else as in the default scenario | 0.33 |
| Course is graduate level, all else as in the default scenario | 0.70 |
| Assessees are groups, all else as in the default scenario | 0.41 |
| Assessors and assessees are not matched at random, all else as in the default scenario | 0.40 |
| Peer assessment is voluntary, all else as in the default scenario | 0.75 |
| Peer raters are non-anonymous, all else as in the default scenario | 0.69 |
| Both scores and comments are provided, all else as in the default scenario | 0.69 |
| Peer raters are involved in developing the rating criteria, all else as in the default scenario | 0.86 |
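As an illustration (not part of the paper), the correlations in Table 3 can be reproduced from the Table 2 coefficients by adding each coefficient to the intercept in the Fisher's z metric and back-transforming with r = tanh(z). The dictionary keys below are shorthand labels for the Table 2 predictors, not notation used in the paper:

```python
import math

# Intercept and significant-predictor coefficients from Table 2,
# all in the Fisher's z metric.
intercept = 0.69
coefficients = {
    "computer-assisted":      -0.14,
    "medical/clinical":       -0.35,
    "graduate level":          0.18,
    "group assessees":        -0.26,
    "not matched at random":  -0.27,
    "voluntary":               0.28,
    "non-anonymous":           0.15,
    "scores with comments":    0.15,
    "involved in criteria":    0.60,
}

# Default scenario: all predictors are zero, so r = tanh(intercept).
print(f"default scenario: r = {math.tanh(intercept):.2f}")

# Each one-predictor scenario shifts z by the coefficient, then
# back-transforms to the Pearson metric.
for condition, w in coefficients.items():
    print(f"{condition}: r = {math.tanh(intercept + w):.2f}")
```

Running this recovers the Table 3 column to two decimals (0.60 for the default scenario, down to 0.33 for medical/clinical, up to 0.86 when peer raters help develop the criteria), confirming that the two tables are consistent.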