Exploring the effects of criteria and multiple graders on case grading

EXPLORING EFFECTS OF CRITERIA AND MULTIPLE GRADERS ON CASE GRADING¹

C. Gopinath*

Sawyer School of Management

Suffolk University

8 Ashburton Place

Boston, MA 02108

Phone: 617.305.1934

Internet: [email protected]

1 Published in the Journal of Education for Business, 79(6):317-322

* I wish to thank Patricia Carlson for her assistance in data collection.


Abstract

Written analyses of cases help the student develop skills of logical analysis and written communication. However, the reliability of the grades received is often questioned by students. This paper reports on a study using six criteria to evaluate case analysis by two graders team teaching a course. Our results show that even with pre-determined criteria, there continue to be areas of disagreement between graders arising out of varied interpretation. Yet, student grades suggest that they benefit from a process that involves multiple cases and multiple graders. The implications of these findings are discussed.

Case discussion is a popular pedagogical technique

employed in business courses. Cases provide rich description

of a setting in which a business decision is to be made.

They provide the student an opportunity to apply analytical

skills within a real context and to arrive at decisions and

other recommendations.

Instructors use cases in several ways. Apart from

engaging the whole class in a case discussion, students may

be asked to prepare and present their analysis, or be asked

to submit a written report on the case. Written analyses of

cases are meant to help the student develop skills of

written communication, and those of logical development of

an argument. The written reports are graded and often go

towards the overall evaluation of the student in the course.

The literature on essay grading in the field of higher

education shows that it is an activity that is subject to

several biases and errors. Although grading cases, which is

similar to essay grading, is extensively undertaken in

business programs, the issues of bias, reliability, or

consistency have not been examined in the business education

literature.

This paper describes a study we undertook to examine

the extent of agreement on case grades between dual graders

when grading criteria are specified. Our results show that

even with prior specification of criteria, there continue to

be areas of misinterpretation among graders that account for

significant differences. Yet, students appear to benefit

from a process that involves multiple cases, and multiple

graders. The implications of these findings are discussed.

LITERATURE REVIEW

Written case analysis is extensively used in business

programs. In a survey of 177 faculty teaching business

policy/strategic management courses, Alexander et al. (1986)

found 84.2% of the respondents required individual written

case analysis in this course. This was second only to class

participation as a factor determining the student grade. We

believe similar high usage exists in other courses. However,

our search revealed almost no study on the grading of case

analyses in business schools. Thus, we will draw upon the

studies conducted in other fields such as law where cases

are used and in the humanities where studies on essay

marking (a close analogy to case grading) have been

conducted.

Essays, like case analyses, have no absolutely right or

wrong answers but must show student comprehension of an

advanced level of analysis. ‘An essay writer has to

identify the problems beneath the question posed, he or she

has to create a structure, display insight and provide a

coherent argument’ (Brown, 1997). When a grader reads an

essay, he or she has to make a determination as to whether

it satisfies the requirement of a good discussion and

reveals aspects of student learning. These include concerns

of both the content and process of the discussion.

Grading a case analysis also requires consideration of

both content and process. Content issues would include

evaluation of the grasp of the issues in the case, knowledge

of facts and their implications, and whether the student has

understood the main question being raised. Process issues

would include concerns of presentation, whether arguments

are logical and analytical, and quality of language

expression.

The biases that work on the grader have attracted wide

concern among scholars. These biases could arise out of the

gender of the essay writer (Wright, 1996), personal

knowledge of the student (Dennis, Newstead and Wright,

1996), and the sequence of grading wherein a few

consecutively good essays would bias the instructor to grade

a weak essay particularly harshly (Spear, 1997). In addition

to these biases, differences in marks may reflect the

different philosophies of learning subscribed to by the

graders (Blanke, 1999). Bilimoria (1995) uses the lens of

modernism and postmodernism to illustrate approaches to

grading. A ‘modernist’ views evaluation as being result-

oriented and meeting certain standards. Grades differentiate

between accomplishments, and indicate the extent to which

students meet criteria. On the other hand, a ‘postmodernist’

views evaluation as continuous, focused on making

improvements, serves as feedback, and reflects how learning

opportunities have been used. Thus, a difference in ideology

may influence the tendency of one grader to mark high and

another to mark low.

The subset of the literature that deals with our study

more closely has to do with the use of criteria to grade the

case/essay, and the effects of disagreement, if any, in the

interpretation of these criteria. Often, double marking, or

the use of multiple graders to assess an essay is an attempt

to reduce the subjectivity in essay marking (Partington,

1994; Erskine et al., 1981). Thus, establishing criteria

should allow for less subjectivity since both the student

and the instructor can be guided by the same set of

principles.

The belief that grading based on criteria can be

standardized has also led to its automation. It has added

attraction for educators who deal with a large volume of

tests. For them, automation and standardization lend speed,

consistency and a perceived measure of objectivity to the

process. In several studies of grading essays and writing

samples conducted as part of an extended project, Page

(1994) found it possible to achieve a high level of

correlation between grades assigned by a computer and

multiple human judges. The criteria were broken down into

measurable variables for content traits and for essay

content. The Graduate Management Admission Council, which

administers the GMAT test required by many business schools

of their graduate program applicants, introduced

computerized grading of essay answers in 1999 (Honan, 1999).

The test involves two essay questions, which were previously

graded by humans. Now, with about 400,000 test takers every

year, a human and a computerized essay scoring system are

used to grade GMAT essays. If the electronic grade differs

from the human grade by more than one point, a third (human)

expert assigns a final grade. The system looks for

organization of ideas and syntactical structure. These

include finding a subordinate clause, looking where a

discussion starts and ends, and examining vocabulary.
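
As an illustration only, the adjudication rule reported above (a third, human reader is called in when the electronic and human grades differ by more than one point) can be sketched in Python. The function name and the `tolerance` parameter are hypothetical and are not taken from the GMAT scoring system; the source reports only the more-than-one-point threshold and does not say how the final grade is set when the two grades are close.

```python
def needs_third_reader(human_score: float, machine_score: float,
                       tolerance: float = 1.0) -> bool:
    """Return True when the two scores differ by more than `tolerance` points.

    The one-point tolerance follows the rule described in the text; the name
    and signature are assumptions made for this sketch.
    """
    return abs(human_score - machine_score) > tolerance


# A two-point gap triggers a third reading; a one-point gap does not.
print(needs_third_reader(4, 6))
print(needs_third_reader(4, 5))
```

The sketch deliberately stops at the decision to escalate, since that is the only part of the procedure the source describes.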

Some scholars have cited high levels of student-instructor

agreement in the assessment of an essay exam to

suggest that scoring standards can be readily communicated

(Nealey, 1969). However, this issue has not been studied

with regard to case analysis. Case analysis, apart from

calling for good writing and logical development of an

argument, also requires specific application of theories or

concepts to the case situation.

The above discussion poses some questions of interest

to us in our study. One concerns the use of criteria to

reduce subjectivity in grading. If this is true, then there

should not be significant differences in grades assigned by

multiple graders working with specified criteria. Any bias

in grading will naturally find expression in differences in

the marks assigned by the two graders. The first research

question below explores this aspect:

RQ1: What is the extent of agreement between two graders of written case analysis when the criteria for grading have been specified?

When multiple instructors are grading cases, students

receive a grade that represents the expertise of the many

graders. This is true whether the multiple graders agree on

the grade or not. In cases of disagreement, a process of

reconciliation between graders is initiated in order to give

the student one grade. Wood and Quinn (1976) in their study

of English language essays show through correlation that

having multiple graders improves reliability by reducing

variation. Thus, they conclude that through a system of

multiple grading, the effects of erratic marking are reduced

and the student grades are less affected by who marked his

or her paper. However, in a criteria-based evaluation scheme

with subsequent reconciliation, the final grade could vary from

the initial grade. The final grade may be higher or lower than

the one they would have received if there was only one

grader. Thus:

RQ2: Does double marking result in a different (higher or lower) grade for the student as compared to single marking when the criteria for grading have been specified?

The purpose of a grade and written comments is to

evaluate and provide feedback to students. The learning

process requires that students work to improve areas where

they did not meet expectations. They also understand the

criteria better through repeated attempts. When students

submit multiple written reports, they have the opportunity

to improve by working on weak areas and demonstrating their

understanding. This suggests a third research question:

RQ3: Are student grades higher on a second case as compared to the first, when grading criteria are held constant?

METHOD

The absence of prior research addressing related

research questions has led us to adopt an exploratory

approach. The study was conducted by two instructors who were

team-teaching an introductory general business course that

all MBA students at our university are required to take in the

first semester of their program. Both instructors were

present in the classroom throughout the semester and

participated in class activities. The course was designed to

introduce the students to (a) a set of skills that they

would need in the program and in a management career (such

as written analysis, presentation, discussion skills, etc.),

and (b) a set of perspectives such as viewing the company as

a whole, appreciating a globalized environment, and the

impact of technology on business.

Students were required to individually submit written

case analyses (WCA) on any two of the six cases to be

discussed in the course during the semester. They were

encouraged, but not required, to submit one case early in

the semester, and after considering the feedback on the

first, do the second one. Each WCA carried a weight of 15%

of the total grade for the course. The WCA had a 350-word

limit. The format to be followed was: (a) specify an issue

or a problem, (b) analyze the situation using a

concept/theory/model that had been discussed in any previous

class session in the course, and (c) bring the discussion to

a conclusion.

In the first semester that we taught this course, we

arrived at four criteria for grading the cases and these

were provided to the students. As the semester progressed,

we found several instances where we disagreed on

interpreting or applying the criteria to the WCA under

consideration. We discussed the criteria again prior to the

start of the second semester and agreed to expand the list

to six items in an effort to reduce the misinterpretation.

These items were: (a) Question or issue specified in the

beginning; (b) Question or issue is relevant; (c) Question

was answered/ issue brought to conclusion; (d) Depth of

analysis; (e) Use of appropriate theory/concepts; (f) Format

adhered to (writing style, error-free writing, word limit,

etc.). Scores ranging from 1 (Poor) to 5 (Excellent) were

given for each item. The criteria were developed based on

our experience and parallel what is expected in other

business courses. For example, conducting an analysis,

applying business policy theory and concepts, and writing

ability are among the top seven criteria used to grade cases

(Alexander, et al., 1986) in the business policy course.
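
For readers who wish to record such evaluations electronically, the six-item grading sheet described above could be represented roughly as follows. This is a minimal sketch: the short criterion labels, the class name, and the validation rules are hypothetical, and the source does not say how (or whether) the six item scores were combined into a single grade, so no total is computed.

```python
from dataclasses import dataclass

# Hypothetical short labels for criteria (a) to (f) listed in the text.
CRITERIA = (
    "issue_specified",    # (a) question or issue specified in the beginning
    "issue_relevant",     # (b) question or issue is relevant
    "issue_concluded",    # (c) question answered / issue brought to conclusion
    "depth_of_analysis",  # (d) depth of analysis
    "use_of_theory",      # (e) use of appropriate theory/concepts
    "format_adhered",     # (f) format adhered to
)

MIN_SCORE, MAX_SCORE = 1, 5  # 1 = Poor, 5 = Excellent


@dataclass
class GradingSheet:
    """One grader's scores for one written case analysis (WCA)."""
    scores: dict

    def __post_init__(self):
        # Every sheet must score exactly the six criteria, no more and no less.
        if set(self.scores) != set(CRITERIA):
            raise ValueError("scores must cover exactly the six criteria")
        # Each item score must lie on the 1-to-5 scale.
        for name, score in self.scores.items():
            if not isinstance(score, (int, float)) or not MIN_SCORE <= score <= MAX_SCORE:
                raise ValueError(f"{name}: score must be between 1 and 5")


sheet = GradingSheet({name: 3 for name in CRITERIA})
print(sheet.scores["depth_of_analysis"])
```

Non-integer scores are allowed in the sketch because a reconciled grade may be the average of two differing grades.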

Data were collected over a semester and the following

procedure was followed: When a student submitted a WCA, it

was first read by one instructor, and evaluated using the

grading sheet. No marks were made on the script (to

eliminate bias; Murphy, 1979), which was then passed on to

the second instructor who also read the case and evaluated

it separately. We then met and reconciled our evaluation of

each WCA. The process of reconciliation came into effect

when there was a difference between the individual grades

given by the instructors on a particular criterion. Each

instructor would then provide the reasons for his/her grade,

and the WCA would be read again. Each item on the grading

sheet on which there was a disagreement would be discussed

and reconciled. There were three possible outcomes of this

process. The final grade would be: (a) that which was given

by one of the instructors, indicating that one instructor

was able to convince the other, (b) the average of the two

grades reflecting a compromise, or (c) a common grade

different (higher or lower) from that originally given by

the instructors (if, in the process of reading and

discussing, it transpired that a different grade was

justified).
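
The three possible outcomes of reconciliation on a single criterion can be expressed as a simple classification. The sketch below is illustrative only; the function name and the outcome labels are hypothetical, and the case of initial agreement is included merely so that every pair of grades receives a label.

```python
def reconciliation_outcome(grade_a: float, grade_b: float, final: float) -> str:
    """Classify how a final grade on one criterion relates to the two initial grades."""
    if grade_a == grade_b:
        # Reconciliation came into effect only when the initial grades differed.
        return "initial_agreement"
    if final in (grade_a, grade_b):
        # Outcome (a): one instructor convinced the other.
        return "one_grader_prevailed"
    if final == (grade_a + grade_b) / 2:
        # Outcome (b): the average of the two grades, reflecting a compromise.
        return "mean_compromise"
    # Outcome (c): a common grade different from both initial grades.
    return "different_common_grade"


print(reconciliation_outcome(2, 4, 4))
print(reconciliation_outcome(2, 4, 3))
print(reconciliation_outcome(3, 4, 5))
```

Tallying these labels across all criteria and all WCAs would yield the kind of breakdown needed to see how often each outcome occurred.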

The individual scoring sheets with the independent

grades and comments were set aside, and the reconciled grade

along with grader comments were entered on a third grading

sheet, which was given to the student. The students were

aware that both instructors were involved in grading each

WCA but were not told of the detailed process or the study

in progress. There were 53 students in two sections of the

course from whom the data were collected. Two case reports

were missing and thus we had an ‘n’ of 104.

RESULTS

Quantitative analysis: RQ1 was examined by looking at

both the extent of initial agreement and subsequent

reconciliation between the two graders. There was a statistically significant

positive correlation (r = .46, p < .01) between the scores of the

two graders across all the criteria (Table 1). There was

full agreement between the instructors in 71% of the cases

(Table 2). This compares favorably with the information

presented by Page (1994) that one US state educational

system required that interjudge agreement be at least 70% in

a 4-point rating.
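
The two measures used here, the rate of exact agreement and the correlation between the two graders' scores, can be computed as sketched below. The score lists are invented purely for illustration and are not the study data; the function names are hypothetical, and the correlation is assumed to be the usual Pearson coefficient, which the text does not state explicitly.

```python
from math import sqrt


def exact_agreement_rate(scores_a, scores_b):
    """Share of items on which both graders gave the identical score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


def pearson_r(scores_a, scores_b):
    """Pearson correlation between the two graders' scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two lists of equal length with at least two items")
    n = len(scores_a)
    mean_a = sum(scores_a) / n
    mean_b = sum(scores_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    ss_a = sum((a - mean_a) ** 2 for a in scores_a)
    ss_b = sum((b - mean_b) ** 2 for b in scores_b)
    if ss_a == 0 or ss_b == 0:
        raise ValueError("correlation is undefined when one grader's scores never vary")
    return cov / sqrt(ss_a * ss_b)


# Invented scores on a 1-to-5 scale (not the study data).
grader_1 = [4, 3, 5, 2, 4, 3, 4, 5, 3, 4]
grader_2 = [4, 3, 4, 2, 4, 4, 4, 5, 2, 4]

print(f"exact agreement: {exact_agreement_rate(grader_1, grader_2):.0%}")
print(f"correlation: {pearson_r(grader_1, grader_2):.2f}")
```

The two measures answer different questions: agreement counts identical scores, whereas correlation reflects whether the graders rank the work similarly even when their scores differ.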

____________________________________

Place Tables 1 & 2 about here

____________________________________

Disagreement between the graders has a richer story to

tell. Reconciliation came about in all cases of initial

disagreement. In the process of reconciliation, a majority

of the cases (27% out of 29% that required reconciliation)

were resolved with one grader convincing the other (columns

4 and 6, Table 2). In only 3% of the cases (column 5) did we

resort to taking the mean. This confirms the extensive

discussions and review that accompanied reconciliation

without resorting to a quick compromise through settling for

the mean. Moreover, very few students received grades that

fell outside the initial range of the two graders (columns 3

and 7, Table 2). This suggests that the initial two grades

represented the possible range that the student could have

received.

Looking at differences across specific criteria, it is

clear that criteria 2, 4 and 5 (Table 1) accounted for

significant differences between the two graders. Although

correlation on criterion 2 is low, there was a high level of

agreement between the graders. Criteria 4 and 5 represent a

different picture. On these two criteria, the number of

students whose final grade was equal to that of one or the

other grader was about equal. Against this, on all the other

criteria, the final grade was more heavily weighted towards

one or the other grader. Only about 12% (4 out of 43 and 6

out of 39) of the students received a mean grade. This

suggests that the two instructors were adhering to their

initial grades more strongly on these two criteria as

against the others. An examination of the criteria themselves

suggests that these two were subject to greater

interpretation than the others; this is discussed under

the qualitative analysis below.

To examine RQ2, we compared the final grade received with

the higher or lower of an individual instructor’s grade and

found no significant variation. The data show that in 13.8%

of the cases students received a grade higher than single

marking (columns 6 & 7, Table 2). On the down side, in 12.8%

of the cases, they received a grade lower than single

marking (columns 3 & 4).

____________________________________

Place Table 3 about here

____________________________________

To address RQ3, we compared the grades (Table 3)

received by the students on their second case with those

received on the first. We found a significant difference in

the case of criterion 5. On the others, there were either no

differences, or a marginal improvement. Since criterion 5

was also one of the two criteria on which there was the

highest initial disagreement between the two graders, the

improvement could suggest either that the feedback helped

the student understand the criterion better, or that it

helped improve application of theory.
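
A comparison of each student's first and second case on a given criterion is a paired comparison. The text does not name the statistical test that was used, so the following is only one plausible way to carry it out: a paired t statistic computed from the within-student differences. The function name is hypothetical and the example scores are invented, not taken from Table 3.

```python
from math import sqrt


def paired_t_statistic(first_scores, second_scores):
    """Return (mean improvement, t statistic) for paired first/second case scores."""
    if len(first_scores) != len(second_scores) or len(first_scores) < 2:
        raise ValueError("need two lists of equal length with at least two students")
    # Positive differences mean the second case was graded higher than the first.
    diffs = [second - first for first, second in zip(first_scores, second_scores)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("t is undefined when every student changes by the same amount")
    return mean_diff, mean_diff / sqrt(variance / n)


# Invented scores for four students on one criterion (not the study data).
first_case = [2, 3, 3, 4]
second_case = [3, 4, 3, 5]

mean_gain, t_value = paired_t_statistic(first_case, second_case)
print(mean_gain, t_value)
```

The resulting t value would be compared against the t distribution with n - 1 degrees of freedom to judge significance.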

To check if there was a grader ‘learning’ bias, i.e.,

if the graders were converging in their views over the

semester, we compared the disagreements on a case-by-case

basis (Table 4). Of the six cases, only the second showed a drop

in the number of disagreements. In all the other cases,

the disagreements remained around an average of 39,

suggesting very little convergence effect.

____________________________________

Place Table 4 about here

____________________________________

Qualitative analysis:

The notes that we maintained during the grading

reconciliation process helped us identify the areas that

resulted in disagreement:

a. Interpretation of the criteria. The criterion ‘whether the issue

was relevant’ was interpreted broadly by one grader to

mean that it was among the issues in the case. The other

grader was looking to see if the student picked the more

important among the issues. Another cause of

disagreement in initial grading was confusion in

classification. For instance, if the analysis was not

dealing with the question or issue that had been

specified, were points to be taken off under ‘depth of

analysis’ or under ‘not brought to a conclusion’?

b. Grading philosophy. Although both instructors agreed on the

nature of the deficiency, there was a disagreement on the

severity and therefore the penalty. Sometimes one took

off more than the other did. This dealt directly with the

concern leading to RQ2 on the grading philosophy of the

instructor. One would argue that these are graduate

students and should know better, while the other would

argue that this is their first semester in the program

and they need more encouragement at this stage.

c. Relative grading. To assist the process of reconciliation,

the graders would often compare the case under discussion

with how other students had been graded. We would go back

and check if we had penalized or credited another student

on a similar issue, and the extent of penalty. Thus,

although not initially stipulated, consistency across a

particular case became an objective.

d. Errors of omission. In some cases, reconciliation was easily

arrived at because one grader had overlooked a deficiency

initially and was quickly convinced when the other drew

attention to it.

DISCUSSION

This study was undertaken to explore the effect of

using multiple graders and their interpretation of criteria

in evaluating written business case analysis. The literature

on essay grading suggests that having clear criteria for

grading helps narrow the differences and results in a high

level of agreement between multiple graders. Our results

show that the overall level of agreement found in this study

is consistent with what we found in the literature. However,

looking more closely at the criteria on which disagreement

is greatest suggests cause for concern.

As our results show, the wording of the criteria may be

such as to allow for multiple interpretations. While we were

clear about their meaning when designing the criteria, in

application they were still subject to diverse

interpretation. Thus, we recommend that instructors and

researchers be as precise as possible in laying out their

expectations. For instance, the criterion ‘Is the issue

relevant’ could be expanded, in parentheses, to say

‘relevant to the decision makers in the case’ or ‘relevant

to the topics of the session,’ etc.

One disturbing question that arises from the above is:

how can we expect students to understand criteria when even

instructors interpret them differently? Fortunately, we were

not dealing with an examination situation where the

possibility of repeat submission or appeal may not exist.

Thus, our study suggests that while having criteria is

better than not having them, there is still plenty of room

for misinterpretation. Instructors need to take care to

spell out, in as much detail as possible, what they mean by

their criteria, and perhaps spend time in class discussing

with the students before finalizing them. In addition,

students may be encouraged to discuss the evaluation

received with the instructor if the student is not clear

about the message.

Written case analysis is widely adopted in business

programs, as it is believed to help both in written

communication skills, and in improving analytical skills.

Thus, the process of writing cases, grading them, and

providing feedback is an important activity for the student

and the instructor. It serves both to evaluate the student’s

abilities and to assist the learning process by

providing feedback. We found support for this process. Our

grading form, apart from giving a numerical score

representing our decision, also provided an explanation

through written comments. Where we felt that analysis was

weak or where there was poor application of theory, we gave

examples of how the student could have dealt with the case.

This would have helped to reduce the confusion that the

student may otherwise have had from either not understanding

the criteria, or misinterpreting the numerical score

received.

Our assumption is that students read and made use of

the written feedback. In addition, we were providing general

comments in a subsequent class on the written reports

received. Our analysis does not allow us to pinpoint which

particular feedback was of use. The role of feedback, in

general, is an area worthy of further exploration. Questions

such as how much feedback is optimal, does feedback on

content or process work better, etc. need to be examined.

We found that having multiple graders benefited the

students in multiple ways. Although the students did not

secure better grades than they would have from a single

grader, the graders were able to check each other’s acts of

omission, and the reconciliation process helped to bring a

measure of relative consistency. In addition, resolving

differences between instructors through discussing their

opinions before providing the comments to the students

resulted in more considered feedback. Of course, while it

can be argued that reconciliation helps to reduce bias in

terms of grades, the different perspectives from multiple

graders may also aid the learning process of the students; this is

an area that needs further study.

We realize that very few schools have the resources to

support multiple graders on a regular basis. Moreover, many

instructors are not comfortable with having another

instructor in the classroom, or with sharing grading

responsibility with another person. One way

around this is to involve students in the grading process

although the difference in the quality of feedback that can

be expected from a faculty member versus a student must be

taken into account. Peer evaluation and assessment can be an

important source of feedback to the students (Gopinath,

1999). Moreover, asking students to evaluate cases helps the

process of developing analysis skills for those providing

the feedback (Schroeder and Fitzgerald, 1984). Thus, instead

of a second instructor as a grader, the instructor can use

peers (individuals or groups) to serve as a second grader.

The instructor trying to reconcile his or her views with the

peers’ comments before finalizing the grade would still

incorporate some of the benefits we observed in our study.

When large classes are involved, instructors often use

teaching assistants to grade cases based on a process of

providing criteria and then random examination of graded

papers for consistency. However, this may not allow for

multiple opinions of graders to come into play since the

teaching assistant is charged with trying to replicate the

standard and expectation set by the instructor, and may not

have the expertise or the experience to provide an

alternative viewpoint. An area that requires further

detailed study is whether the marginal benefits from

multiple graders (in terms of learning value) exceed the

marginal cost.

Overall, it is perhaps important for instructors to

carefully consider the weight placed on individual written

analysis reports so as to lessen the impact of varied

interpretation of the analysis. In addition, it is important

for instructors to allow students to challenge or question

the comments received and the grade given. Instructors often

have a mindset about not changing a grade. We would suggest

that the process of learning requires an instructor to be

able to justify the decision he or she has made about the

quality of a student’s work. Thus, the instructor must be

willing to defend his or her evaluation and be willing to

change the grade, if necessary. Where the final exam (with

limited opportunity to discuss and revise the grade) is in

the form of a written case analysis, the instructor will be

well advised to give the student the benefit of the doubt.

REFERENCES

Alexander, L. D., O’Neill, H. M., Snyder, N. H., & Townsend,

J. B. (1986). How academy members teach the business

policy/strategic management case course. Journal of

Management Case Studies, 2, 334-344.

Bilimoria, D. (1995). Modernism, postmodernism, and

contemporary grading practices. Journal of Management

Education, 19, 440-458.

Blanke, H. G. (1999). Grading by theory. College Teaching, 47,

136-139.

Brown, G. (1997). Assessing student learning in higher education.

London: Routledge, p. 59.

Dennis, I., Newstead, S. E. & Wright, D. E. (1996). A new

approach to exploring biases in educational assessment.

British Journal of Psychology, 87, 515-535.

Erskine, J. A., Leenders, R. R. & Maufette-Leenders, L. A.

(1981). Teaching with cases. London, Canada: The

University of Western Ontario.

Gopinath, C. (1999). Alternatives to instructor assessment

of class participation. Journal of Education for Business, 75,

10-14.

Honan, W. H. (1999). High tech comes to the classroom:

Machines that grade essays. The New York Times, 148, B8.

Murphy, R. J. L. (1979). Removing marks from examination

scripts before re-marking them: Does it make any

difference? British Journal of Educational Psychology, 49, 73-78.

Nealey, S. M. (1969). Student-Instructor agreement in

scoring an essay examination. The Journal of Educational

Research, 63, 111-115.

Page, E. B. (1994). Computer grading of student prose, using

modern concepts and software. Journal of Experimental

Education, 62, 127-142.

Partington, J. (1994). Double-marking students’ works. Assessment and

Evaluation in Higher Education, 19(1), 57-61.

Schroeder, H. & Fitzgerald, P. (1984). Peer evaluation in

case analysis. Journal of Business Education, 60, 73-77.

Spear, M. (1997). The influence of contrast effects upon

teachers’ marks. Educational Research, 39, 229-233.

Wood, R. & Quinn, B. (1976). Double impression marking of

English language essay and summary questions. Educational

Review, 28, 229-246.

Wright, D. E. (1996). A new approach to exploring biases in

educational assessment. British Journal of Psychology, 87,

515-535.
