EXPLORING EFFECTS OF CRITERIA AND
MULTIPLE GRADERS ON CASE GRADING¹
C. Gopinath*
Sawyer School of Management
Suffolk University
8 Ashburton Place
Boston, MA 02108
Phone: 617.305.1934
Internet: cgopinat@suffolk.edu
¹ Published in the Journal of Education for Business, 79(6):317-322
* I wish to thank Patricia Carlson for her assistance in data collection.
Abstract
Written analyses of cases help the student develop skills of logical analysis and written communication. However, the reliability of the grades received is often questioned by students. This paper reports on a study using six criteria to evaluate case analysis by two graders team teaching a course. Our results show that even with pre-determined criteria, there continue to be areas of disagreement between graders arising out of varied interpretation. Yet, student grades suggest that they benefit from a process that involves multiple cases and multiple graders. The implications of these findings are discussed.
Case discussion is a popular pedagogical technique
employed in business courses. Cases provide rich description
of a setting in which a business decision is to be made.
They provide the student an opportunity to apply analytical
skills within a real context and to arrive at decisions and
other recommendations.
Instructors use cases in several ways. Apart from
engaging the whole class in a case discussion, students may
be asked to prepare and present their analysis, or be asked
to submit a written report on the case. Written analyses of
cases are meant to help the student develop skills of
written communication, and those of logical development of
an argument. The written reports are graded and often go
towards the overall evaluation of the student in the course.
The literature on essay grading in the field of higher
education shows that it is an activity that is subject to
several biases and errors. Although grading cases, which is
similar to essay grading, is extensively undertaken in
business programs, the issues of bias, reliability, or
consistency have not been examined in the business education
literature.
This paper describes a study we undertook to examine
the extent of agreement on case grades between dual graders
when grading criteria are specified. Our results show that
even with prior specification of criteria, there continue to
be areas of misinterpretation among graders that account for
significant differences. Yet, students appear to benefit
from a process that involves multiple cases, and multiple
graders. The implications of these findings are discussed.
LITERATURE REVIEW
Written case analysis is extensively used in business
programs. In a survey of 177 faculty teaching business
policy/strategic management courses, Alexander et al. (1986)
found 84.2% of the respondents required individual written
case analysis in this course. This was second only to class
participation as a factor determining the student grade. We
believe similar high usage exists in other courses. However,
our search revealed almost no study on the grading of case
analyses in business schools. Thus, we will draw upon the
studies conducted in other fields such as law where cases
are used and in the humanities where studies on essay
marking (a close analogy to case grading) have been
conducted.
Essays, like case analyses, have no absolutely right or
wrong answers but must show student comprehension of an
advanced level of analysis. ‘An essay writer has to
identify the problems beneath the question posed, he or she
has to create a structure, display insight and provide a
coherent argument’ (Brown, 1997). When a grader reads an
essay, he or she has to make a determination as to whether
it satisfies the requirement of a good discussion and
reveals aspects of student learning. These include concerns
of both the content and process of the discussion.
Grading a case analysis also requires consideration of
both content and process. Content issues would include
evaluation of the grasp of the issues in the case, knowledge
of facts and their implications, and whether the student has
understood the main question being raised. Process issues
would include concerns of presentation, whether arguments
are logical and analytical, and quality of language
expression.
The biases that work on the grader have attracted wide
concern among scholars. These biases could arise out of the
gender of the essay writer (Wright, 1996), personal
knowledge of the student (Dennis, Newstead and Wright,
1996), and the sequence of grading wherein a few
consecutively good essays would bias the instructor to grade
a weak essay particularly harshly (Spear, 1997). In addition
to these biases, differences in marks may reflect the
different philosophies of learning subscribed to by the
graders (Blanke, 1999). Bilimoria (1995) uses the lens of
modernism and postmodernism to illustrate approaches to
grading. A ‘modernist’ views evaluation as being result-oriented
and meeting certain standards. Grades differentiate
between accomplishments, and indicate the extent to which
students meet criteria. On the other hand, a ‘postmodernist’
views evaluation as continuous and focused on making
improvements; it serves as feedback and reflects how learning
opportunities have been used. Thus, a difference in ideology
may influence the tendency of one grader to mark high and
another to mark low.
The subset of the literature that deals with our study
more closely has to do with the use of criteria to grade the
case/essay, and the effects of disagreement, if any, in the
interpretation of these criteria. Often, double marking, or
the use of multiple graders to assess an essay, is an attempt
to reduce the subjectivity in essay marking (Partington,
1994; Erskine et al., 1981). Likewise, establishing criteria
should allow for less subjectivity since both the student
and the instructor can be guided by the same set of
principles.
The belief that grading based on criteria can be
standardized has also led to its automation. It has added
attraction for educators who deal with a large volume of
tests. For them, automation and standardization lend speed,
consistency and a perceived measure of objectivity to the
process. In several studies of grading essays and writing
samples conducted as part of an extended project, Page
(1994) found it possible to achieve a high level of
correlation between grades assigned by a computer and
multiple human judges. The criteria were broken down into
measurable variables for content traits and for essay
content. The Graduate Management Admission Council, which
administers the GMAT test required by many business schools
of their graduate program applicants, introduced
computerized grading of essay answers in 1999 (Honan, 1999).
The test involves two essay questions, which were previously
graded by humans. Now, with about 400,000 test takers every
year, a human and a computerized essay scoring system are
used to grade GMAT essays. If the electronic grade differs
from the human grade by more than one point, a third (human)
expert assigns a final grade. The system looks for
organization of ideas and syntactical structure. These
include finding a subordinate clause, looking where a
discussion starts and ends, and examining vocabulary.
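The adjudication rule described above can be expressed compactly. The following is a minimal sketch, not the actual GMAT scoring system: the function name `needs_third_reader` and the `tolerance` parameter are hypothetical, and the sketch deliberately says nothing about how a final grade is formed when the two grades are close, since the text does not specify it.

```python
def needs_third_reader(human_grade: float,
                       machine_grade: float,
                       tolerance: float = 1.0) -> bool:
    """Return True when the two grades differ by more than `tolerance` points.

    Assumption: `tolerance` defaults to 1.0 to mirror the "more than one
    point" rule reported in the text; the name and signature are illustrative.
    """
    return abs(human_grade - machine_grade) > tolerance


# Hypothetical grades, for illustration only.
print(needs_third_reader(4, 2))  # gap of 2 points: a third reader is called in
print(needs_third_reader(4, 3))  # gap of 1 point: no third reader
```

A gap of exactly one point does not trigger the third reader, because the rule as reported applies only to differences of more than one point.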
Some scholars have pointed to high levels of student-instructor
agreement in the assessment of an essay exam to
suggest that scoring standards can be readily communicated
(Nealey, 1969). However, this issue has not been studied
with regard to case analysis. Case analysis, apart from
calling for good writing and logical development of an
argument, also requires specific application of theories or
concepts to the case situation.
The above discussion poses some questions of interest
to us in our study. One concerns the use of criteria to
reduce subjectivity in grading. If this is true, then there
should not be significant differences in grades assigned by
multiple graders working with specified criteria. Any bias
in grading will naturally find expression in differences in
the marks assigned by the two graders. The first research
question below explores this aspect:
RQ1: What is the extent of agreement between two graders of written case analysis when the criteria for grading have been specified?
When multiple instructors are grading cases, students
receive a grade that represents the expertise of the many
graders. This is true whether the multiple graders agree on
the grade or not. In cases of disagreement, a process of
reconciliation between graders is initiated in order to give
the student one grade. Wood and Quinn (1976) in their study
of English language essays show through correlation that
having multiple graders improves reliability by reducing
variation. Thus, they conclude that through a system of
multiple grading, the effects of erratic marking are reduced
and the student grades are less affected by who marked his
or her paper. However, in a criteria-based evaluation scheme
with subsequent reconciliation, the final grade could vary from
the initial grade. The final grade may be higher or lower than
the one the student would have received had there been only one
grader. Thus:
RQ2: Does double marking result in a different (higher or lower) grade for the student as compared to single marking when the criteria for grading have been specified?
A grade and written comments are meant to
evaluate and provide feedback to students. The learning
process requires that students work to improve areas where
they did not meet expectations. They also understand the
criteria better through repeated attempts. When students
submit multiple written reports, they have the opportunity
to improve by working on weak areas and demonstrating their
understanding. This suggests a third research question:
RQ3: Are student grades higher on a second case as compared to the first, when grading criteria are held constant?
METHOD
The absence of prior research addressing related
research questions has led us to adopt an exploratory
approach. Two instructors who were team-teaching an
introductory general business course that all MBA students
at our university are required to take in the first semester
of their program conducted the study. Both instructors were
present in the classroom throughout the semester and
participated in class activities. The course was designed to
introduce the students to (a) a set of skills that they
would need in the program and in a management career (such
as written analysis, presentation, discussion skills, etc.),
and (b) a set of perspectives such as viewing the company as
a whole, appreciating a globalized environment, and the
impact of technology on business.
Students were required to individually submit written
case analyses (WCA) on any two of the six cases to be
discussed in the course during the semester. They were
encouraged, but not required, to submit one case early in
the semester, and after considering the feedback on the
first, do the second one. Each WCA carried a weight of 15%
of the total grade for the course. The WCA had a 350-word
limit. The format to be followed was: (a) specify an issue
or a problem, (b) analyze the situation using a
concept/theory/model that had been discussed in any previous
class session in the course, and (c) bring the discussion to
a conclusion.
In the first semester that we taught this course, we
arrived at four criteria for grading the cases and these
were provided to the students. As the semester progressed,
we found several instances where we disagreed on
interpreting or applying the criteria to the WCA under
consideration. We discussed the criteria again prior to the
start of the second semester and agreed to expand the list
to six items in an effort to reduce the misinterpretation.
These items were: (a) Question or issue specified in the
beginning; (b) Question or issue is relevant; (c) Question
was answered/ issue brought to conclusion; (d) Depth of
analysis; (e) Use of appropriate theory/concepts; (f) Format
adhered to (writing style, error-free writing, word limit,
etc.). Scores ranging from 1 (Poor) to 5 (Excellent) were
given for each item. The criteria were developed based on
our experience and parallel what is expected in other
business courses. For example, conducting an analysis,
applying business policy theory and concepts, and writing
ability are among the top seven criteria used to grade cases
in the business policy course (Alexander et al., 1986).
Data were collected over a semester and the following
procedure was followed: When a student submitted a WCA, it
was first read by one instructor, and evaluated using the
grading sheet. No marks were made on the script (to
eliminate bias; Murphy, 1979), which was then passed on to
the second instructor who also read the case and evaluated
it separately. We then met and reconciled our evaluation of
each WCA. The process of reconciliation came into effect
when there was a difference between the individual grades
given by the instructors on a particular criterion. Each
instructor would then provide the reasons for his/her grade,
and the WCA would be read again. Each item on the grading
sheet on which there was a disagreement would be discussed
and reconciled. There were three possible outcomes of this
process. The final grade would be: (a) that which was given
by one of the instructors, indicating that one instructor
was able to convince the other, (b) the average of the two
grades reflecting a compromise, or (c) a common grade
different (higher or lower) from that originally given by
the instructors (if, in the process of reading and
discussing, it transpired that a different grade was
justified).
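The three reconciliation outcomes listed above can be sketched as a small classifier. This is an illustration only, under stated assumptions: the function name `classify_outcome` and the outcome labels are hypothetical, and the sketch treats any final grade that is neither one grader's score nor the mean as outcome (c).

```python
def classify_outcome(grade_a: float, grade_b: float, final: float) -> str:
    """Label how a reconciled grade relates to the two independent grades.

    Assumption: scores are on the 1 (Poor) to 5 (Excellent) scale described
    in the text; the labels below are illustrative, not the authors' wording.
    """
    if grade_a == grade_b:
        # Reconciliation was only triggered when the two grades differed.
        return "no reconciliation needed"
    if final in (grade_a, grade_b):
        return "one grader convinced the other"      # outcome (a)
    if final == (grade_a + grade_b) / 2:
        return "compromise (mean)"                   # outcome (b)
    return "new common grade"                        # outcome (c)


# Hypothetical scores on a single criterion, for illustration only.
for a, b, f in [(3, 4, 4), (3, 4, 3.5), (2, 3, 4), (3, 3, 3)]:
    print(a, b, f, "->", classify_outcome(a, b, f))
```

Checking for a match with either grader before checking the mean keeps outcome (a) distinct from outcome (b), which is the distinction the reconciliation process is meant to record.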
The individual scoring sheets with the independent
grades and comments were set aside, and the reconciled grade
along with grader comments were entered on a third grading
sheet, which was given to the student. The students were
aware that both instructors were involved in grading each
WCA but were not told of the detailed process or the study
in progress. There were 53 students in two sections of the
course from whom the data were collected. Two case reports
were missing and thus we had an ‘n’ of 104.
RESULTS
Quantitative analysis: RQ1 was examined by looking at
both the extent of initial agreement and subsequent
reconciliation between the two graders. There was a strong
positive correlation (.46, p<.01) between the scores of the
two graders across all the criteria (Table 1). There was
full agreement between the instructors in 71% of the cases
(Table 2). This compares favorably with the information
presented by Page (1994) that one US state educational
system required that interjudge agreement be at least 70% in
a 4-point rating.
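The two agreement measures used here, a correlation between the graders' scores and the share of cases with full agreement, can be computed as follows. This is a minimal sketch under stated assumptions: the helper names `pearson_r` and `exact_agreement_rate` and the sample scores are hypothetical, and the scores are not the study's data.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def exact_agreement_rate(x: Sequence[float], y: Sequence[float]) -> float:
    """Share of cases on which both graders gave exactly the same score."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Hypothetical 1-5 scores from two graders (NOT the study's data).
grader_1 = [3, 4, 5, 2, 4, 3, 5, 4]
grader_2 = [3, 4, 4, 2, 5, 3, 5, 3]

print(round(pearson_r(grader_1, grader_2), 2))
print(exact_agreement_rate(grader_1, grader_2))  # 5 of 8 cases agree: 0.625
```

The two measures answer different questions: correlation captures whether the graders rank papers similarly, while the agreement rate captures how often they assign the identical score, so one can be low while the other is high.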
____________________________________
Place Tables 1 & 2 about here
____________________________________
Disagreement between the graders has a richer story to
tell. Reconciliation came about in all cases of initial
disagreement. In the process of reconciliation, a majority
of the cases (27% out of 29% that required reconciliation)
were resolved with one grader convincing the other (columns
4 and 6, Table 2). In only 3% of the cases (column 5) did we
resort to taking the mean. This confirms the extensive
discussions and review that accompanied reconciliation
without resorting to a quick compromise through settling for
the mean. Moreover, very few students received grades that
fell outside the initial range of the two graders (columns 3
and 7, Table 2). This suggests that the initial two grades
represented the possible range that the student could have
received.
Looking at differences across specific criteria, it is
clear that criteria 2, 4 and 5 (Table 1) accounted for
significant differences between the two graders. Although
correlation on criterion 2 is low, there was a high level of
agreement between the graders. Criteria 4 and 5 represent a
different picture. On these two criteria, the number of
students whose final grade was equal to that of one or the
other grader was about equal. By contrast, on all the other
criteria, the final grade was more heavily weighted towards
one or the other grader. Only about 12% (4 out of 43 and 6
out of 39) of the students received a mean grade. This
suggests that the two instructors were adhering to their
initial grades more strongly on these two criteria than on
the others. An examination of the criteria themselves
suggests that these two were subject to greater
interpretation than the others; this is discussed under
qualitative analysis below.
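The "about 12%" figure can be recovered by pooling the two counts reported in the text. The sketch below shows one reading under which the number works out; the assignment of 4 out of 43 to criterion 4 and 6 out of 39 to criterion 5 is an assumption based on the order in which the text lists them.

```python
# (students given the mean grade, students whose grades were reconciled)
# Assumption: the first pair refers to criterion 4 and the second to criterion 5.
mean_grade_counts = {
    "criterion 4": (4, 43),
    "criterion 5": (6, 39),
}

# Per-criterion share of reconciled grades settled by taking the mean.
for name, (mean_count, total) in mean_grade_counts.items():
    print(f"{name}: {100 * mean_count / total:.1f}%")

# Pooled share across both criteria.
pooled_mean = sum(m for m, _ in mean_grade_counts.values())
pooled_total = sum(t for _, t in mean_grade_counts.values())
pooled_share = 100 * pooled_mean / pooled_total
print(f"pooled: {pooled_share:.1f}%")  # 10 of 82 cases: 12.2%
```

Taken separately the two shares differ somewhat, so the single pooled figure is a summary of both criteria rather than a value that holds for each.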
To examine RQ2, we compared the final grade received with
the higher and the lower of the two individual instructors’ grades and
found no significant variation. The data show that in 13.8%
of the cases students received a grade higher than single
marking (columns 6 & 7, Table 2). On the downside, in 12.8%
of the cases, they received a grade lower than single
marking (columns 3 & 4).
____________________________________
Place Table 3 about here
____________________________________
To address RQ3, we compared the grades (Table 3)
received by the students on their second case with those
received on the first. We found a significant difference in
the case of criterion 5. On the others, there was either no
difference or a marginal improvement. Since criterion 5
was also one of the two criteria on which there was the
highest initial disagreement between the two graders, the
improvement could suggest either that the feedback helped
the student understand the criterion better, or that it
helped improve application of theory.
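A comparison of each student's first and second case on a single criterion can be sketched as a paired difference. The text does not state which statistical test was used; the paired t statistic below is one plausible choice, assumed here for illustration only. The function name `paired_t_statistic` and the sample scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(first: Sequence[float],
                       second: Sequence[float]) -> Tuple[float, float]:
    """Return (mean improvement, paired t statistic) for matched scores.

    Assumption: each position holds the same student's score on the first
    and second case for one criterion. The test itself is an assumption;
    the source reports only that the difference was significant.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [b - a for a, b in zip(first, second)]
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # sample standard deviation of the differences
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_diff / (sd_diff / sqrt(len(diffs)))
    return mean_diff, t_stat


# Hypothetical criterion scores for five students (NOT the study's data).
first_case = [2, 3, 3, 4, 2]
second_case = [3, 4, 3, 5, 3]
improvement, t_stat = paired_t_statistic(first_case, second_case)
print(improvement, round(t_stat, 2))
```

Pairing by student removes between-student differences in ability from the comparison, which matters here because each student chose which two of the six cases to submit.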
To check if there was a grader ‘learning’ bias, i.e.,
if the graders were converging in their views over the
semester, we compared the disagreements on a case-by-case
basis (Table 4). Of the six cases, only the second showed a drop
in the number of disagreements. In all the other cases,
the disagreements remained around an average of 39,
suggesting very little convergence effect.
____________________________________
Place Table 4 about here
____________________________________
Qualitative analysis:
The notes that we maintained during the grading
reconciliation process helped us identify the areas that
resulted in disagreement:
a. Interpretation of the criteria. The criterion, ‘whether the issue
was relevant’ was interpreted by one grader broadly to
mean that it was among the issues in the case. The other
grader was looking to see if the student picked the more
important among the issues. Another cause of
disagreement in initial grading was confusion in
classification. For instance, if the analysis was not
dealing with the question or issue that had been
specified, were points to be taken off under ‘depth of
analysis’ or under ‘not brought to a conclusion’?
b. Grading philosophy. Although both instructors agreed on the
nature of the deficiency, there was a disagreement on the
severity and therefore the penalty. Sometimes one took
off more than the other did. This dealt directly with the
concern leading to RQ2 on the grading philosophy of the
instructor. One would argue that these are graduate
students and should know better, while the other would
argue that this is their first semester in the program
and they need more encouragement at this stage.
c. Relative grading. To assist the process of reconciliation,
the graders would often compare the case under discussion
with how other students had been graded. We would go back
and check if we had penalized or credited another student
on a similar issue, and the extent of penalty. Thus,
although not initially stipulated, consistency across a
particular case became an objective.
d. Errors of omission. In some cases, reconciliation was easily
arrived at because one grader had overlooked a deficiency
initially and was quickly convinced when the other drew
attention to it.
DISCUSSION
This study was undertaken to explore the effect of
using multiple graders and their interpretation of criteria
in evaluating written business case analysis. The literature
on essay grading suggests that having clear criteria for
grading helps narrow the differences and results in a high
level of agreement between multiple graders. Our results
show that the overall level of agreement found in this study
is consistent with what we found in the literature. However,
looking more closely at the criteria on which disagreement
is greatest suggests cause for concern.
As our results show, the wording of the criteria may be
such as to allow for multiple interpretations. While we were
clear about the criteria at the time of designing them, when
it came to application, they were still subject to diverse
interpretation. Thus, we recommend that instructors and
researchers be as precise as possible in laying out their
expectations. For instance, the criterion ‘Is the issue
relevant’ could be expanded, in parentheses, to say
‘relevant to the decision makers in the case’ or ‘relevant
to the topics of the session,’ etc.
One disturbing question that arises from the above is:
how can we expect students to understand criteria when even
instructors interpret them differently? Fortunately, we were
not dealing with an examination situation where the
possibility of repeat submission or appeal may not exist.
Thus, our study suggests that while having criteria is
better than not having them, there is still plenty of room
for misinterpretation. Instructors need to take care to
spell out, in as much detail as possible, what they mean by
their criteria, and perhaps spend time in class discussing
with the students before finalizing them. In addition,
students may be encouraged to discuss the evaluation
received with the instructor if the student is not clear
about the message.
Written case analysis is widely adopted in business
programs, as it is believed to improve both written
communication skills and analytical skills.
Thus, the process of writing cases, grading them, and
providing feedback is an important activity for the student
and the instructor. It serves both to evaluate the student’s
abilities and to assist the learning process by
providing feedback. We found support for this process. Our
grading form, apart from giving a numerical score
representing our decision, also provided an explanation
through written comments. Where we felt that analysis was
weak or where there was poor application of theory, we gave
examples of how the student could have dealt with the case.
This would have helped to reduce the confusion that the
student may otherwise have had from either not understanding
the criteria, or misinterpreting the numerical score
received.
Our assumption is that students read and made use of
the written feedback. In addition, we were providing general
comments in a subsequent class on the written reports
received. Our analysis does not allow us to pinpoint which
particular feedback was of use. The role of feedback, in
general, is an area worthy of further exploration. Questions
such as how much feedback is optimal, does feedback on
content or process work better, etc. need to be examined.
We found that having multiple graders benefited the
students in multiple ways. Although the students did not
secure better grades than they would have from a single
grader, the graders were able to check each other’s acts of
omission, and the reconciliation process helped to bring a
measure of relative consistency. In addition, resolving
differences between instructors through discussing their
opinions before providing the comments to the students
resulted in more considered feedback. Of course, while it
can be argued that reconciliation helps to reduce bias in
terms of grades, the different perspectives from multiple
graders can also aid the learning process of the students; this is
an area that needs further study.
We realize that very few schools have the resources to
support multiple graders on a regular basis. Moreover, many
instructors are not comfortable with having another
instructor in the classroom, or with sharing grading
responsibility with another person. One way
around this is to involve students in the grading process
although the difference in the quality of feedback that can
be expected from a faculty member versus a student must be
taken into account. Peer evaluation and assessment can be an
important source of feedback to the students (Gopinath,
1999). Moreover, asking students to evaluate cases helps the
process of developing analysis skills for those providing
the feedback (Schroeder and Fitzgerald, 1984). Thus, instead
of a second instructor as a grader, the instructor can use
peers (individuals or groups) to serve as a second grader.
The instructor trying to reconcile his or her views with the
peers’ comments before finalizing the grade would still
incorporate some of the benefits we observed in our study.
When large classes are involved, instructors often use
teaching assistants to grade cases based on a process of
providing criteria and then random examination of graded
papers for consistency. However, this may not allow for
multiple opinions of graders to come into play since the
teaching assistant is charged with trying to replicate the
standard and expectation set by the instructor, and may not
have the expertise or the experience to provide an
alternative viewpoint. An area that requires further
detailed study is whether the marginal benefits from
multiple graders (in terms of learning value) exceed the
marginal cost.
Overall, it is perhaps important for instructors to
carefully consider the weight placed on individual written
analysis reports so as to lessen the impact of varied
interpretation of the analysis. In addition, it is important
for instructors to allow students to challenge or question
the comments received and the grade given. Instructors often
have a mindset about not changing a grade. We would suggest
that the process of learning requires an instructor to be
able to justify the decision he or she has made about the
quality of a student’s work. Thus, the instructor must be
willing to defend his or her evaluation and be willing to
change the grade, if necessary. Where the final exam (with
limited opportunity to discuss and revise the grade) is in
the form of a written case analysis, the instructor will be
well advised to give the student the benefit of the doubt.
REFERENCES
Alexander, L. D., O’Neill, H. M., Snyder, N. H., & Townsend,
J. B. (1986). How academy members teach the business
policy/strategic management case course. Journal of
Management Case Studies, 2, 334-344.
Bilimoria, D. (1995). Modernism, postmodernism, and
contemporary grading practices. Journal of Management
Education, 19, 440-458.
Blanke, H. G. (1999). Grading by theory. College Teaching, 47,
136-139.
Brown, G. (1997). Assessing student learning in higher education.
London: Routledge, p. 59.
Dennis, I., Newstead, S. E. & Wright, D. E. (1996). A new
approach to exploring biases in educational assessment.
British Journal of Psychology, 87, 515-535.
Erskine, J. A., Leenders, M. R. & Mauffette-Leenders, L. A.
(1981). Teaching with cases. London, Canada: The
University of Western Ontario.
Gopinath, C. (1999). Alternatives to instructor assessment
of class participation. Journal of Education for Business, 75,
10-14.
Honan, W. H. (1999). High tech comes to the classroom:
Machines that grade essays. The New York Times, 148, B8.
Murphy, R. J. L. (1979). Removing marks from examination
scripts before re-marking them: Does it make any
difference? British Journal of Educational Psychology, 49, 73-78.
Nealey, S. M. (1969). Student-Instructor agreement in
scoring an essay examination. The Journal of Educational
Research, 63, 111-115.
Page, E. B. (1994). Computer grading of student prose, using
modern concepts and software. Journal of Experimental
Education, 62, 127-142.
Partington, J. (1994). Double-marking students’ works. Assessment and
Evaluation in Higher Education, 19(1), 57-61.
Schroeder, H. & Fitzgerald, P. (1984). Peer evaluation in
case analysis. Journal of Business Education, 60, 73-77.
Spear, M. (1997). The influence of contrast effects upon
teachers’ marks. Educational Research, 39, 229-233.
Wood, R. & Quinn, B. (1976). Double impression marking of
English language essay and summary questions. Educational
Review, 28, 229-246.
Wright, D. E. (1996). A new approach to exploring biases in
educational assessment. British Journal of Psychology, 87,
515-535.