Computers & Education 58 (2012) 818–834


Student engagement with assessment and feedback: Some lessons from short-answer free-text e-assessment questions

Sally Jordan*

The Open University, Cintra House, 12 Hills Road, Cambridge CB2 1PF, United Kingdom

Article info

Article history: Received 23 June 2011; Received in revised form 4 September 2011; Accepted 11 October 2011

Keywords: Adult learning; Distance education and telelearning; Evaluation of CAL systems; Human computer interface

Abbreviations: CASA, Computers as social actors; IAT, Intelligent Assessment Technologies; iCMA, Interactive computer marked assessment; OU, The UK Open University.
* Tel.: +44 (0)1223 584656; fax: +44 (0)1223 355207.

E-mail address: [email protected].

0360-1315/$ – see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.compedu.2011.10.007

Abstract

Students were observed directly, in a usability laboratory, and indirectly, by means of an extensive evaluation of responses, as they attempted interactive computer-marked assessment questions that required free-text responses of up to 20 words and as they amended their responses after receiving feedback. This provided more general insight into the way in which students actually engage with assessment and feedback, which is not necessarily the same as their self-reported behaviour. Response length and type varied with whether the question was in summative, purely formative, or diagnostic use, with the question itself, and most significantly with students' interpretation of what the question author was looking for. Feedback was most effective when it was understood by the student, tailored to the mistakes that they had made and when it prompted students rather than giving the answer. On some occasions, students appeared to respond to the computer as if it was a human marker, supporting the 'computers as social actors' hypothesis, whilst on other occasions students seemed very aware that they were being marked by a machine.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Despite the wealth of literature relating to technology-enhanced assessment and to feedback on assessed tasks of all types, some aspects remain remarkably under-researched. This paper reports on an investigation that touches on two such under-researched topics, namely the way in which students actually engage with moderately sophisticated e-assessment tasks (short-answer free-text questions) and the way in which they actually respond to feedback, in this case generated by a computer.

1.1. Background

The term 'feedback intervention' (Kluger & DeNisi, 1996) can be used to describe externally generated information and tips for improvement; Ramaprasad (1983) and Sadler (1989) point the way to a theory of formative assessment in which this intervention can only truly be described as 'feedback' if the student uses it to 'close the gap' between their current level and one to which they aspire. The fact that feedback interventions are not necessarily helpful is well documented (e.g. Sadler, 1989; Bangert-Drowns, Kulik & Morgan, 1991), but the problem remains widespread and persistent: Kluger and DeNisi (1996) reported that 38% of feedback interventions resulted in decreased performance. More recently, Crisp (2007) and Price, Handley, Millar & O'Donovan (2010) have questioned the effectiveness of current feedback practices, whilst Carless (2006) and Orsmond and Merry (2011) found a lack of alignment between students' and tutors' views. Higgins, Hartley & Skelton (2002) report that students are trying to make use of the feedback provided, but when 27% of students are not able even to understand their tutors' comments (Walker, 2009), the usefulness of those comments is inevitably limited.

In a review of the effectiveness of feedback, Hattie and Timperley (2007) highlight the importance of affective factors such as self-esteem. Students' perceptions (Lizzio & Wilson, 2008) are also important; these perceptions can only be found out from the students themselves (Boud, 1995) and it is important to make a distinction between a student's construction of reality and their actions (Struyven, Dochy & Janssens, 2005). As long ago as 1995, Boud also made a plea for action research into what students are actually doing.

When the feedback comes from a computer, even less is known. Gipps (2005, p.197) asks "How does the student react to electronic feedback? What is the assessment relationship when the assessor is the computer?" In considering the effectiveness of e-assessment, attention needs to be paid to factors as diverse as the measured impact of medium on performance (Clariana & Wallace, 2002; Pommerich, 2004) and student perception of e-assessment tasks and computer-generated feedback (Miller, 2008; Walker, Topping & Rodrigues, 2008).

Reeves and Nass (1996) report, in what has become known as the 'Computers as Social Actors' (CASA) framework, that people respond socially and naturally to computers, as if they were people, even though they know that they are not. Mishra (2006) found some support for the CASA hypothesis in students' reaction to affective feedback from a computer. Lipnevich and Smith (2008) found no significant variation in overall score as a result of whether students believed feedback to come from a computer or a human. However, there were differences when the detail was considered, e.g. participants who believed they had been marked by a human marker outperformed those who believed they had been marked by a computer when only comments were provided, but when grades were provided along with comments the situation was reversed. In focus group discussions (Lipnevich & Smith, 2009), students reported that feedback they believed to come from a computer was helpful, unless they disagreed with the marking (typically because they thought they were right and the computer was wrong), at which point they ignored the feedback. Praise from a computer was not particularly well received. However, some individuals felt that it was good that their tutors had not seen their work and some felt that the computer was fairer. Other researchers report that students prefer feedback from a computer because it enables students to make mistakes in private (Miller, 2008) and is impersonal (Earley, 1988) and non-judgemental (Beevers et al., 2010).

As students become more and more familiar with the presence of computers in all aspects of their lives, it is likely that their reaction to e-learning and e-assessment will alter; they may cease to regard the use of technology in learning and assessment as anything special (Conole, de Laat, Darby & Dillon, 2006; Beevers et al., 2010). However, it remains important to check that students are engaging with e-assessment and computer-generated feedback in the way that we expect them to, especially since there may be no human intermediary to ensure that students understand the feedback and respond to it appropriately (Draper, 2009).

Observations at the UK Open University (OU) indicate that students may not always engage with e-assessment questions in the way that they report, in particular that they might not be making as much use of feedback as they claim (Jordan, 2011). This highlights a danger inherent in any evaluation that relies on self-reporting and is in line with Kluger and DeNisi's (1996) conclusion that people find feedback psychologically reassuring and so like to think that feedback interventions are helpful, even if they are not.

1.2. Research questions and context

The work described in this paper sought to investigate:

• How students actually engage with e-assessment questions;
• How students use the feedback provided;
• The influence of factors such as question wording and whether the assessment is summative or purely formative.

The investigation, also at the OU, took the form of direct and indirect observation of student use of interactive computer-marked assignments (iCMAs), but the findings are likely to be of relevance to a wide range of assessment tasks and for students in different learning environments.

The most significant feature of the modules included in the investigation is that student numbers are large: the modules are studied by between 500 and 4000 students per year, and are expected to run for about 10 years. These large numbers justify the considerable resource that has gone into the development of iCMA questions; they also enable robust quantitative analysis of student responses.

Another relevant feature is that students are usually allowed multiple attempts at each question, with increasing feedback provided after each attempt and an opportunity for students to amend their answer in the light of the feedback provided. In some modules iCMAs have a purely formative role, in others they carry a light summative weighting, designed to encourage student engagement.

Question types include multiple-choice, multiple-response, drag and drop and those requiring free-text entry of numbers, letters and words. To these types was added, in 2006, a question type requiring students to type their response as a free-text phrase or sentence of up to 20 words. The response matching initially used Intelligent Assessment Technologies' (IAT) FreeText Author software, which gave excellent results (Jordan & Mitchell, 2009). However, following the unexpected finding that equally accurate results could be obtained using much simpler pattern matching software (Butcher & Jordan, 2010), the OU's own PMatch is now used. Fig. 1 shows typical feedback following three attempts at a question. The move from IAT to PMatch has allowed greater flexibility in the provision of feedback, e.g. enabling more detailed targeted feedback after just one incorrect or incomplete response, when that is deemed to be appropriate. Both IAT and PMatch have allowed incorrectly spelt words to be accepted (Fig. 2b). In addition, a spell checker has recently been added, informing students when a word they have used is not recognised, and offering suggestions for correct spelling (Fig. 2a).
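The internal workings of FreeText Author and PMatch are not described here, but the following minimal Python sketch illustrates the general idea of pattern-based response matching with tolerance for misspellings: a mark scheme looks for required terms (or close misspellings of them) anywhere in the response. The function names and the 'Snowflake' mark scheme shown are hypothetical illustrations, not the OU's actual answer matching.

# A minimal illustration (not the OU's PMatch implementation) of short-answer
# response matching that tolerates misspellings: required terms, or close
# misspellings of them, must appear somewhere in the response.
import re
from difflib import SequenceMatcher

def similar(word, target, threshold=0.8):
    """Treat a word as matching a target if it is close enough in spelling."""
    return SequenceMatcher(None, word.lower(), target.lower()).ratio() >= threshold

def contains_term(response, synonyms):
    """True if any word in the response approximately matches any synonym."""
    words = re.findall(r"[A-Za-z']+", response)
    return any(similar(w, s) for w in words for s in synonyms)

def mark_snowflake(response):
    """Hypothetical mark scheme: the forces (gravity and air resistance)
    are balanced/equal."""
    mentions_balance = contains_term(response, ["balanced", "equal", "cancel"])
    mentions_forces = contains_term(response, ["forces", "gravity", "resistance"])
    return mentions_balance and mentions_forces

print(mark_snowflake("The forces are balanced"))                # True
print(mark_snowflake("gravity and air resistance are equall"))  # True, despite the misspelling
print(mark_snowflake("It falls at a steady speed"))             # False: required terms missing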

2. Method

2.1. Usability laboratory observations

Six student volunteers were observed attempting questions in a usability laboratory. The students were studying the module (identified by the code S103) in which developmental versions of the questions were first introduced, in a number of purely formative and optional iCMAs. The volunteers were asked to attempt one of two iCMAs, each including a number of short-answer free-text questions (given in Appendix A). Most of the students attempted the questions on Block 7, which they had studied recently; Student E had not read Block 7 so attempted the Block 2 questions instead. The questions in these iCMAs had not been previously seen by the students, though some had attempted earlier short-answer free-text questions. Students C and D were female; the others were male.

Fig. 1. Increasing feedback (top left to bottom right) on a typical short-answer free-text question. Since the introduction of PMatch response matching, tailored feedback, such as that shown in the central image (after a second incomplete attempt), can now also be given after a first incorrect or incomplete attempt, in place of the very simple feedback shown in the top left-hand image.


In line with accepted practice for usability laboratory observation (see, for example, Stone, Jarrett, Woodroffe & Minocha, 2005), participants interacted with the questions without assistance, but their interactions were observed live and recorded for subsequent analysis. A verbal think-aloud protocol was used, whereby the participants were asked to talk about what they were thinking and doing, and after the observation, each participant was asked to comment retrospectively on the reasons for their actions. However, the emphasis of the evaluation was on what the students actually did rather than what they said they did.

2.2. Large scale analysis of responses

Student responses to all iCMA questions are captured, and this enabled a retrospective and anonymous evaluation of both the form of the responses (length; whether the response was given as a single word, phrase, sentence or paragraph etc.; presence of spelling mistakes) and the way in which the responses were altered in response to the feedback given. The work has similarities with that of Fowler (2008) for a different type of e-assessment question, but again with the deliberate provision of multiple attempts.

Seven short-answer free-text questions (given in Appendix B) were selected for detailed evaluation, from the 82 that had been written and 26 in regular use. The questions were chosen to enable comparison of as wide as possible a range of modes of use: formative-only, diagnostic and low-stakes summative. As can be seen from Table 1, all of the questions were analysed for more than one presentation of the relevant module(s), allowing factors that have altered over time (e.g. word limits and advice on word length) to be compared. Wherever possible, the influence of confounding variables was reduced, e.g. by analysing uses of questions in otherwise similar presentations.

3. Results and comments

3.1. Usability laboratory observations

The observed student behaviour might have been different if the iCMAs had carried summative weighting, and also if the students had not been observed. However, the students reported that the fact they were being observed did not influence their behaviour.

Fig. 2. Two methods for dealing with incorrect spelling. (a) A spell-checker identifies words not in a dictionary. (b) Similar but incorrect words can be accepted (in this case "deceased" instead of "decreased").


3.1.1. Form of response

There was considerable variation in the type of response given by the different students and also from question to question. Table 2 compares typical responses from the five students who attempted the same iCMA, giving the mean length of their responses to four short-answer questions which they all attempted, the time spent composing their first response to Block 7 Q1, and their first attempt responses to Block 7 Q1 and Block 7 Q8.

At the time of the observations, no indication had been given to students as to the required form of the response and several of the volunteers were concerned about this, e.g. Student C commented:

I found it difficult to know what I should be putting in there because you don't know whether it's going to understand what you are putting in.

Student A appeared to take some time settling into the assignment (taking more than 6 min to compose his first response) and part of the reason for this appeared to be uncertainty about how to phrase the answer. In answer to the first question (Block 7 Q1) he initially typed:

The key characteristic

He then deleted this entry, commenting that 'obviously on tutor-marked assignments I'm typing the full answer' and implying that he would not do that here. The first attempt response that he submitted for checking, after making several further changes of content and checking in the course book, was:

produces emissions at similar energy levels

Later Student A explained that if this had been a question on a tutor-marked assignment his strategy would be to:

…include the question in the answer. Normally I would write 'The key characteristics of an emission spectrum' and then the answer.

What is not clear is whether Student A responded to iCMA questions differently because he knew that his response would be marked by a computer or whether this was to do with the positioning of the data-entry box immediately below the text of the question. Although Student A did not repeat the question before giving his answers, his answers were well expressed and not particularly short. His answer to Block 7 Q8 was written in two sentences and included an explanation as well as the answer itself.

Student F set out to give responses as if for a human marker. He explained his strategy when 'thinking aloud' at the very outset of the trial, saying:

I’m going to assume that I answer this question in the same way that I would a tutor-marked assignment.

In answer to Q1 he initially typed:

The spectrum will be characterised by a line showing

Table 1
Questions and responses included in the analysis. Modules are identified by their codes and the year and month of presentation (so 08J refers to a presentation that started in October 2008).

Snowflake: formative-only use in S103 (IAT matching; 450 students, 888 responses); summative use in S104 08J (IAT matching; 1341 students, 2057 responses), S104 09B (IAT matching; 1327 students, 1991 responses) and S104 10B (IAT matching; 1315 students, 1919 responses).
A-good-idea: formative-only use in S154 09J (IAT matching; 680 students, 1071 responses) and S154 10J (PMatch; 694 students, 1037 responses); diagnostic use in 'Are you ready for S104?', Nov 2009–Sept 2010 (IAT matching; 9317 users, 19889 responses) and Sept–Dec 2010 (PMatch; 2203 users, 2756 responses).
Oil-on-water: formative-only use in S154 08J (IAT matching; 591 students, 817 responses), S154 09J (Match*; 710 students, 876 responses) and S154 10J (PMatch; 692 students, 927 responses).
Metamorphic: SXR103 09E (IAT matching; 434 students, 603 responses) and SXR103 10E (PMatch; 544 students, 714 responses).
Sedimentary: SXR103 09E (IAT matching; 427 students, 682 responses) and SXR103 10E (PMatch; 533 students, 795 responses).
Sandstone: summative use in S104 09B (IAT matching; 1007 students, 1889 responses) and S104 10B (PMatch; 956 students, 1640 responses).
Electric-force: summative use in S104 08B (IAT matching; 787 students, 1127 responses), S104 09B (IAT matching; 934 students, 1355 responses) and S104 10B (PMatch; 850 students, 1187 responses).

* Match was the precursor to PMatch. For the purposes of this analysis, the main difference was that there was no pre-matching spell checker in use.


Then he hesitated before adding the word ‘vertical’:

The spectrum will be characterised by a vertical line showing

Then he added a full answer, complete with full stop:

The spectrum will be characterised by a vertical line showing where in the spectrum the particular colour absorbed by the vapour has been absorbed.

He then read through his response before adding the word 'is' and removing the final six words, to give the response which he submitted for checking:

The spectrum will be characterised by a vertical line showing where in the spectrum the particular colour is absorbed.

Throughout the assignment Student F took time in composing his answers and composed them carefully. They were characteristically longer than the other students' responses and were always written as sentences. All started with a capital letter and most ended with a full stop.

Table 2
Characteristics of responses from students A, B, C, D and F for various Block 7 questions. Columns: average word-length for responses to Block 7 Q1, 2, 3 and 8; time spent composing first response to Block 7 Q1 (seconds); first attempt response to Block 7 Q1; first attempt response to Block 7 Q8.

Student A: 7.6 | 377 s | "Produces emissions at similar energy levels" | "Reduces by a helium isotope. Therefore the mass number reduces by four and the atomic number reduces by two."
Student B: 3.4 | 46 s | "Dark lines in a spectrum" | "It deceases"
Student C: 3.4 | 88 s | "Coloured lines on a black background" | "It reduces by 4"
Student D: 3.5 | 75 s | "Lines of colour" | "Mass number reduces by 4"
Student F: 20.4 | 267 s | "The spectrum will be characterised by a vertical line showing where in the spectrum the particular colour is absorbed by the vapour." | "The mass number decreases by 2"


Student E admitted that he was conscious of being marked by a computer, but rather than this being a cause for concern he saw it as an opportunity to take a more relaxed approach to grammar and punctuation than he would for a human marker:

'…the computer is not concerned with my Ps and Qs, whether I start with a capital letter first word, punctuation and stuff. I'm not showing it as much respect as I would a human being… [for a tutor] I would have written perhaps a longer sentence or two shorter sentences.'

3.1.2. Response to brief feedback after first incorrect response

There were several instances where students were able to use the brief feedback given after an incorrect or incomplete first response (at the time of the observations this was simply 'Your answer does not appear to be correct') to correct their error, most commonly where there was only a two-way choice of answer or where the correct answer was the opposite of an incorrect one given. For example, Student C's first response to Block 7 Q11 was:

Diffraction is greater

However, as she typed this response she commented 'unless it's the other way round' and when she received the brief feedback that her answer was incorrect, she immediately altered her response to:

Diffraction is less

Elsewhere, the brief feedback misled Student C. In answer to Block 7 Q6 she initially gave the correct but incomplete response:

Differences between levels are higher with helium than hydrogen

When told that this response was not correct, she then gave the incorrect response:

Differences between levels are lower with helium than hydrogen

Finally, after receiving more detailed feedback she was able to give the correct response:

Differences between levels are 4 times the hydrogen

3.1.3. Response to feedback after second incorrect attempt

In the usability laboratory observations, more detailed feedback (sometimes tailored to the student's misunderstandings) was supplied on 14 occasions, 13 of which are considered here (in the 14th case the feedback required Student E to check something in study material that was not available). The use made of the feedback is summarised in Table 3.

Students C–F appeared to read all the feedback that was provided, followed the reference to the study material on 4 out of 7 occasions and gave the correct final answer on 3 out of 7 occasions. Sometimes the decision not to look back to the study material was deliberate. Student F commented:

Table 3
Students' responses to detailed feedback received after their second attempt. Columns: nature of feedback; did student appear to read feedback?; was course text consulted after prompt?; was final attempt correct?; comment.

Student A
  Block 7 Q1: General | Partially | Yes | No | Student read the first part, aloud, but omitted to read the final sentence, which contained a clue.
  Block 7 Q6: Tailored | Yes | Yes | Yes |
Student B
  Block 7 Q7: Tailored | Unclear* | No | No | This was a two-part question. One part of the student's answer was correct (throughout). The prompt supplied was for the incorrect part.
  Block 7 Q8: General | Unclear* | No | Yes | The response 'It deceases by 4' was marked incorrect because of the spelling mistake. 'It decreases by 4' is a correct response.
  Block 7 Q10: General | No | No | No |
  Block 7 Q11: General | Unclear* | No | No |
Student C
  Block 7 Q6: General | Yes | Yes | Yes |
Student D
  Block 7 Q6: Tailored | Yes | No | Yes | Student reviewed previous two questions after receiving feedback, which gave a clue to the correct answer.
  Block 7 Q7: General | Yes | Yes | No |
Student E
  Block 3 Q2: Tailored | Yes | Yes | No | The final response given was actually incorrect but due to a problem with the answer-matching it was marked as correct.
  Block 3 Q9: Tailored | Yes | Yes | Yes |
  Block 3 Q10: General | Yes | No | No |
  Block 3 Q13: Tailored | Yes | No | No | Reference was to the study file which was not available.
Student F
  Block 7 Q7: General | Yes | No | No |

* The time spent looking at the screen was insufficient for the student to read the feedback fully.


This is my first experience of the OU and to me having the book there and knowing that essentially the answers are there, it's not testing me. So for me personally it's important that I can answer the question, at least initially, without having to refer back.

Student A read the feedback carefully on one occasion but on another (following his second incorrect response to Block 7 Q1), despite spending considerable time (2 min 16 s) consulting the study materials and re-reading the question, he appeared to miss the final sentence of the feedback which included a vital clue.

Student B appeared to engage rather superficially with the second-attempt feedback provided to him on four occasions, generally either skim reading it or reading part of it; there was clearly insufficient time for a full reading. He never referred to the materials referenced in the feedback and was only able to get one of the four questions correct at the third attempt.

3.1.4. Response to final answer

A full solution, sometimes with additional explanation, was provided either when a student's response was deemed to be correct, or after three incorrect responses. The students used these specimen answers to differing extents. Student E was the most consistent in his behaviour, reading them carefully when his final attempt was incorrect (with one exception which he explained in the interview – the diagram on which the question was based was difficult to understand and he had resorted to guesswork) but he did not read the specimen answer when his final answer was correct.

Student D and Student F appeared to read the specimen answers irrespective of whether their final answer was correct, though Student D glossed over additional mathematical explanations.

All of Student C's final responses were correct and she usually just glanced at the specimen answers, though she read them more thoroughly when she was less confident of her understanding. Student A (who was incorrect in just one final response) read just one specimen answer, and that was following a correct response. He commented:

I’m just going to read the answer to make sure my thinking was right instead of being lucky.

The initial impression was that Student B did not make much use of the specimen answers; he certainly did not spend much time in reading them. However, following an incorrect answer to Block 7 Q10, he read the specimen answer rapidly, commenting 'wrong way', and in the post-observation interview he commented:

…the third time it gives you the full answer. I like it when it does that. It gives you a clue for the next one, what to look for.

3.1.5. When the computer got it wrong

An interesting, though unplanned, situation arose because the usability laboratory observations took place when the questions were at an early developmental stage. This led to a few occasions where student responses were inaccurately marked by the computer. In addition, although the software in use at the time generally coped well with poor spelling and grammar, when Student B entered the word deceases instead of decreases, this was not recognised as a misspelling (because 'deceases' is a word in its own right) and so 'it deceases' was marked as incorrect when 'it decreases' would have been a correct answer. It was very interesting to see how students reacted to being told that a correct answer was incorrect, and vice versa.

Student B failed to spot his spelling mistake and was quite upset that the computer had marked him as incorrect. He read the specimen answer, commented that he thought his answer was the same and ticked the box provided to indicate that he believed he had been incorrectly marked.

In contrast, when incorrect responses from Student C and Student F were marked as correct, neither student spotted the discrepancy. In the case of Student C, this could be explained by the fact that she only glanced at the specimen answers, and when the discrepancy was pointed out to her later she commented:

Yes, you think you get it right so you ignore this.

Student F's response to the first question was inaccurately marked as correct and he appeared to read the specimen answer very thoroughly. However, he missed the fact that the specimen answer contradicted the one he had given, going so far as to comment:

It’s so nice to get the first one right.

For these two students, the statement 'Your answer is correct' appeared to carry great weight and to over-ride the impact of any further comment or explanation.

3.2. Analysis of responses

3.2.1. Length of response

The length of the responses given was found to vary with question, permitted length, detail of question wording and mode of use.

The first use of the snowflake question ('Snowflake') was in an optional formative-only iCMA in S103. Since then the question has been in low-stakes summative use in S104 (S103's replacement). The distribution of word-length in formative use (Fig. 3) was found to be bimodal with peaks on one word (corresponding to responses such as 'balanced') and three words (e.g. 'They are balanced', 'equal and opposite'). At this time students were not given any guidance on expected length, and responses of any length were allowed. A small number of very long responses were received.

In early (08J) summative use, word-length was again not controlled, though the general guidance implied that students should answer in a single sentence. The word-length distribution (Fig. 4) was again bimodal, with peaks now on 3–4 words (e.g. 'The forces are balanced') and 8–9 words (e.g. 'The forces of gravity and air resistance are equal'). There were excessively long responses, such as the following:

Fig. 3. Distribution of length of 888 responses to Snowflake in formative-only use (S103).


Gravity from the earth core is pulling the snow flake and no other force is acting on it for example wind. When the snow flake reach the ground there are two forces acting on it. One is the gravitational pull from the Earth and the upward force from the ground. Both these force are balance with means that the snow flake is statioary. (63 words)

The response matching (both IAT and PMatch) copes well with responses of all lengths, but very long responses are more likely to contain a correct response within an incorrect one, acknowledged as a serious problem area for automated marking of short-answer questions (Mitchell, Broomhead, Russell & Aldridge, 2002). For this reason, from 2009, it was decided to impose a filter, only accepting responses of up to 20 words. If students give an answer that is too long they are told and given an opportunity (without penalty) to shorten their answer. For 09B, the warning 'You should give your answer as a short phrase or sentence. Answers of more than 20 words will not be accepted.' was added to the question. The resulting word-length distribution is shown in Fig. 5. There were still peaks at 4 words and 8–9 words and the filter was obviously effective in removing the excessively long responses. However, the general trend was towards longer answers (the median increased from 8 words in 08J to 10 words in 09B), apparently because some students interpreted the guidance as meaning that their answer should be close to 20 words in length.
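The kind of word-length analysis described above is straightforward to reproduce once responses are available as plain text. The sketch below assumes a simple list of response strings exported from the iCMA system (the variable names are illustrative): it applies the 20-word filter and summarises the word-length distribution and median.

# A sketch of the word-length analysis: count words per response, apply the
# 20-word filter, and summarise the distribution. 'responses' stands in for
# the captured response strings; the names used here are illustrative.
from collections import Counter
from statistics import median

WORD_LIMIT = 20

def word_count(response):
    return len(response.split())

def over_limit(response, limit=WORD_LIMIT):
    # Mirrors the filter: over-long responses are returned to the student
    # to shorten, without penalty, rather than being marked.
    return word_count(response) > limit

responses = [
    "The forces are balanced",
    "The forces of gravity and air resistance are equal",
    "balanced",
]

lengths = [word_count(r) for r in responses if not over_limit(r)]
distribution = Counter(lengths)      # word-length -> number of responses
print(sorted(distribution.items()))  # [(1, 1), (4, 1), (9, 1)]
print(median(lengths))               # median length in words: 4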

From 2010, the specific instruction about length was removed, leaving the general guidance on length as shown in Fig. 1. Students are only told that their response needs to be no more than 20 words if their original response is longer than this. This final change in wording has resulted in the word-length distribution shown in Fig. 6. The number of responses with length of 20 words or in the high teens has diminished and the median has returned to 8 words.

Over-riding the impact of whether a question is in summative or purely formative use and students' interpretation of guidance on word-length, the major factor in determining the length of responses appears to be the question itself. Fig. 7 shows word-length distributions for two other short-answer questions (Sandstone and Electric-force) for the same module and the same presentation as Snowflake in Fig. 6.

Whilst Sandstone (Fig. 7a) shows a broad distribution, Electric-force (Fig. 7b) has two distinct peaks at 9 and 20 words. On this occasion the peak on 20 words is a characteristic of the question itself; it was present when students were not given any guidance on response length and also when they were told that their response should be no more than 20 words. Whilst individual responses are worded in many different ways, the peak on 9 words generally corresponds to responses that precisely state the effect of moving the two charges apart (e.g. 'The force is decreased by a factor of four'), whilst the peak on 20 words corresponds to responses that also give an explanation, e.g. 'The force decreases by a factor of four since distance and electric force are related by an inverse square law'.

Fig. 4. Distribution of length of 2057 responses to Snowflake in summative use without length restriction or additional wording in the question (S104 08J).

Fig. 5. Distribution of length of 1991 responses to Snowflake in summative use with filter and advice 'You should give your answer as a short phrase or sentence. Answers of more than 20 words will not be accepted.' (S104 09B).

Where sufficient responses are available to make meaningful comparison, second and third attempt responses are seen to be slightly longer than first attempt responses (this appears to be due to the fact that students add to their responses after being told that they are incorrect). Also, correct responses tend to be longer than incorrect ones, though the extent of the difference varies considerably between questions (compare Fig. 8 and Fig. 9). For Electric-force (Fig. 9), the characteristic 9-word and 20-word responses are simply not present in incorrect responses. Incorrect and incomplete responses tend to give just one element of the answer, e.g. 'The force is increased' or 'The force is decreased' rather than 'The force is decreased by a factor of four'.

3.2.2. Form of response

Table 4 shows the number of responses classified as being empty (no response), a single word, a phrase, in note form, a sentence (containing a verb and a subject) or a paragraph (several sentences). The large proportion of responses which were in the form of a sentence may have been a result of general advice (in iCMA instructions) to students to give their answer in this way, of wording introduced into the question to 'give your answer as a short phrase or sentence' and of the general feedback after a first unsuccessful attempt to 'have another go, remembering to express your answer as a simple sentence.' For Snowflake, where use has been monitored over the widest range of presentations, relatively more answers of a single word were seen in purely formative use whilst relatively more answers of a sentence were seen in summative use. There were relatively more answers in the form of a paragraph when there was no limit on word-length and no indication in the question of the type of answer required (08J), whereas there were relatively more answers in the form of a phrase when the wording 'You should give your answer as a short phrase or sentence' had been added (10B). The number of answers in note form dropped away as soon as it was made clear (in the question itself rather than in the general iCMA instructions) that an answer was expected in the form of a phrase or sentence.
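The paper does not state how responses were assigned to these categories, and reliably detecting 'a verb and a subject' really requires part-of-speech tagging; the heuristic sketch below is therefore only a rough illustration of how such a classification might be automated, using a small hand-picked verb list as a stand-in.

# A rough heuristic (illustrative only) for the classification used in Table 4:
# empty, single word, phrase, note form, sentence or paragraph. Real sentence
# detection needs part-of-speech tagging; a small verb list is used here instead.
import re

COMMON_VERBS = {"is", "are", "was", "were", "has", "have", "reduces", "decreases",
                "increases", "forms", "formed", "contains", "balanced"}

def classify(response):
    text = response.strip()
    if not text:
        return "empty"
    words = re.findall(r"[A-Za-z']+", text)
    if len(words) == 1:
        return "word"
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(sentences) > 1:
        return "paragraph"
    if any(w.lower() in COMMON_VERBS for w in words):
        return "sentence"
    if "," in text:
        return "note"
    return "phrase"

examples = ["", "balanced", "equal and opposite",
            "gravity down, air resistance up",
            "The forces are balanced.",
            "The forces are balanced. The snowflake falls at a steady speed."]
for r in examples:
    print(repr(r), "->", classify(r))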

Fig. 6. Distribution of length of 1919 responses to Snowflake in summative use with filter and advice ‘You should give your answer as a short phrase or sentence.’ (S104 10B).

Fig. 7. Distribution of length of (a) 1640 responses to Sandstone; (b) 1187 responses to Electric-force (both in summative use and S104 10B presentation).

Fig. 8. Distribution of length of (a) 861 correct and (b) 779 incorrect responses to Sandstone (S104 10B).

Fig. 9. Distribution of length of (a) 783 correct and (b) 404 incorrect responses to Electric-force (S104 10B).


Table 4
The number of responses (and percentage of total) for each classification of response.

Question | Empty | Word | Phrase | Note | Sentence | Paragraph
Snowflake
  Formative: 20 (2.3%) | 174 (19.6%) | 82 (9.2%) | 22 (2.5%) | 570 (64.2%) | 20 (2.3%)
  S104 08J: 3 (0.1%) | 87 (4.2%) | 125 (6.1%) | 84 (4.1%) | 1608 (78.2%) | 150 (7.3%)
  S104 09B: 8 (0.4%) | 23 (1.2%) | 79 (4.1%) | 33 (1.7%) | 1805 (90.7%) | 43 (2.2%)
  S104 10B: 5 (0.3%) | 26 (1.4%) | 255 (13.3%) | 26 (1.4%) | 1561 (81.3%) | 46 (2.4%)
A-good-idea
  AYRF (IAT): 64 (4.4%) | 32 (2.2%) | 208 (14.2%) | 25 (1.7%) | 1073 (73.2%) | 63 (4.3%)
  AYRF (PMatch): 7 (0.9%) | 6 (0.7%) | 129 (15.7%) | 8 (1.0%) | 608 (74.1%) | 63 (7.7%)
  S154 09J: 37 (3.5%) | 6 (0.6%) | 175 (16.3%) | 13 (1.2%) | 758 (70.8%) | 82 (7.7%)
  S154 10J: 29 (2.8%) | 5 (0.5%) | 133 (12.8%) | 17 (1.6%) | 783 (75.5%) | 71 (6.8%)
Oil-on-water
  S154 09J: 12 (1.4%) | 3 (0.3%) | 18 (2.1%) | 4 (0.5%) | 836 (95.4%) | 3 (0.3%)
  S154 10J: 15 (1.6%) | 5 (0.5%) | 37 (4.0%) | 9 (1.0%) | 856 (92.3%) | 5 (0.5%)
Metamorphic
  SXR103 09E: 41 (6.8%) | 22 (3.6%) | 112 (18.6%) | 33 (5.5%) | 386 (64.0%) | 9 (1.5%)
  SXR103 10E: 23 (3.2%) | 13 (1.8%) | 173 (24.2%) | 21 (2.9%) | 478 (66.9%) | 6 (0.8%)
Sedimentary
  SXR103 09E: 21 (3.6%) | 23 (3.4%) | 261 (38.3%) | 44 (6.5%) | 316 (46.3%) | 17 (2.5%)
  SXR103 10E: 16 (2.0%) | 21 (2.6%) | 256 (32.2%) | 57 (7.2%) | 439 (55.2%) | 6 (0.8%)
Sandstone
  S104 09B: 7 (0.4%) | 6 (0.3%) | 467 (24.7%) | 176 (9.3%) | 1220 (64.6%) | 13 (0.7%)
  S104 10B: 9 (0.5%) | 11 (0.7%) | 396 (24.1%) | 166 (10.1%) | 1028 (62.7%) | 30 (1.8%)
Electric-force
  S104 09B: 16 (1.2%) | 9 (0.7%) | 46 (3.4%) | 17 (1.3%) | 1258 (92.8%) | 9 (0.7%)
  S104 10B: 9 (0.8%) | 5 (0.4%) | 62 (5.2%) | 16 (1.3%) | 1050 (88.5%) | 45 (3.8%)


However, the major factor in determining response type again appears to be the actual question being asked. Note again the different behaviour of Sandstone (considerably more phrases and notes) and Electric-force (considerably more sentences) in the same presentations of the same module. The percentage of sentences in response to A-good-idea varies between 71% and 76%, whether in formative or diagnostic use, whilst the percentage of sentences in response to Oil-on-water, in the same formative assignments as A-good-idea, lies between 92% and 95%. Metamorphic and Sedimentary have a lower percentage of responses classified as sentences, and a corresponding rise in other classifications. Perhaps this is to be expected in the jargon-rich subject of geology, where words such as 'fragmental' have a specific meaning that would take several sentences to describe accurately.

There were more blank responses in the first sample of responses (IAT) to A-good-idea in diagnostic use than in the second sample of responses (PMatch). This can be explained by the presence of a 'skip to answer' button, coincidentally introduced at the same time as the switch from IAT to PMatch matching, which students are using instead of leaving the answer blank when they have no idea of the correct answer. This finding is not directly relevant, but it highlights the importance of considering all variables in conducting an analysis of this type.

With the exception of the fact that empty responses are bound to be incorrect and that it is impossible to correctly answer some of the questions (A-good-idea, Oil-on-water, Sandstone and Electric-force) with a response of just one word, there was no consistent significant correlation between the classification of response type and whether it was correct or incorrect. It was impossible to compare first, second and third-attempt responses because there were insufficient responses in some categories.

Table 5
The number of responses (and percentage of total) containing at least one incorrectly spelt word.

Snowflake: Formative 71 (8.0%); S104 08J 148 (7.2%); S104 09B 155 (7.8%); S104 10B 117 (6.1%)
A-good-idea: AYRF (IAT) 2039 (10.3%); S154 09J 80 (7.5%)
Oil-on-water: S154 09J 38 (4.3%)
Metamorphic: SXR103 09E 89 (14.8%)
Sedimentary: SXR103 09E 64 (9.4%)
Sandstone: S104 09B 206 (10.9%)
Electric-force: S104 09B 166 (12.3%)

Table 6
The number and percentage of correct responses for responses that are spelt correctly and incorrectly. For each question: number of responses with correct spelling and the number (and percentage) of these that are correct; number of responses with at least one spelling error and the number (and percentage) of these that are correct; probability, p, that the variation arose by chance.

Snowflake
  Formative: 817, of which 267 (32.7%) correct | 71, of which 20 (28.2%) correct | p = 0.52
  S104 08J: 1909, of which 1097 (57.5%) correct | 148, of which 53 (35.8%) correct | p = 5 × 10⁻⁷
  S104 09B: 1836, of which 1094 (59.6%) correct | 155, of which 71 (45.8%) correct | p = 2 × 10⁻¹⁶
  S104 10B: 1802, of which 1129 (62.7%) correct | 117, of which 47 (40.2%) correct | p = 2 × 10⁻⁶
A-good-idea
  AYRF (IAT): 17850, of which 4762 (26.7%) correct | 2039, of which 249 (12.2%) correct | p = 2 × 10⁻¹⁶
  S154 09J: 991, of which 546 (55.1%) correct | 80, of which 31 (38.8%) correct | p = 0.0068
Oil-on-water
  S154 09J: 838, of which 634 (75.7%) correct | 38, of which 29 (76.3%) correct | p = 0.92
Metamorphic
  SXR103 09E: 514, of which 334 (65.0%) correct | 89, of which 54 (60.7%) correct | p = 0.51
Sedimentary
  SXR103 09E: 618, of which 313 (50.6%) correct | 64, of which 23 (35.9%) correct | p = 0.035
Sandstone
  S104 09B: 1683, of which 753 (44.7%) correct | 206, of which 107 (51.9%) correct | p = 0.059
Electric-force
  S104 09B: 1189, of which 735 (61.8%) correct | 166, of which 90 (54.2%) correct | p = 0.073



3.2.3. Spelling

The results of an investigation into the prevalence of spelling mistakes in responses are given in Table 5. Here responses with PMatch answer matching have been excluded, since the spell checker results in spelling errors being corrected before responses are captured. The main variation is again between actual questions rather than between mode of use, with the smallest number of spelling mistakes associated with questions that are easy to answer with words with straightforward spelling, e.g. 'The water has greater density' (Oil-on-water), 'The forces are balanced' (Snowflake), 'There is no verb' (A-good-idea). In contrast, the questions associated with the largest number of spelling mistakes are those that are frequently answered with words that many students find difficult to spell, e.g. 'align' and 'crystal' (Metamorphic), 'separation' and 'separated' (Electric-force).

Table 6 gives the results of an investigation into whether correct responses are more likely to be associated with correct spelling than incorrect spelling. For virtually all questions, a higher proportion of responses that are spelt correctly are correct than for responses including a spelling mistake. However, this result is only significant for some questions¹, interestingly those with a smaller proportion of spelling mistakes overall: perhaps when a spelling mistake is made in answering one of these questions, it is reasonable to assume that the content of the answer will be weaker too.
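To illustrate the form of significance testing reported in Tables 6 and 7, the sketch below runs a chi-squared test on a 2 × 2 contingency table (using the Snowflake S104 08J counts from Table 6) and applies the Bonferroni-corrected threshold described in Footnote 1. Whether the original analysis used exactly this variant of the test (for example, with or without Yates' continuity correction) is not stated, so the sketch is indicative only.

# A sketch of the chi-squared testing behind Tables 6 and 7, using the
# Snowflake S104 08J counts from Table 6 (correct/incorrect vs. correctly
# spelt/misspelt) and the Bonferroni-corrected threshold from Footnote 1.
from scipy.stats import chi2_contingency

correct_ok_spelling, total_ok_spelling = 1097, 1909
correct_bad_spelling, total_bad_spelling = 53, 148

# rows: correct / incorrect; columns: spelt correctly / at least one spelling error
table = [
    [correct_ok_spelling, correct_bad_spelling],
    [total_ok_spelling - correct_ok_spelling, total_bad_spelling - correct_bad_spelling],
]

chi2, p, dof, expected = chi2_contingency(table)
alpha = 0.05 / 11   # Bonferroni correction across the 11 tests in this sequence
print(f"chi2 = {chi2:.1f}, p = {p:.1e}, significant at corrected level: {p < alpha}")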

We speculated that students might attend to their spelling mistakes when told that their answer was incorrect, so decided to test the null hypothesis that the spelling error rate is the same in first, second and third-attempt responses. Table 7 shows the surprising outcome: there are more spelling mistakes in second and third attempt responses than in first attempt responses. Inspection of a sample of responses offered the possible explanation that whilst some typos (e.g. missing spaces between words) were corrected between attempts, genuine spelling mistakes (e.g. spelling 'separate' as 'seperate') were simply not recognised by the students. Furthermore, many second and third attempt responses included additional typographical and spelling errors, sometimes introduced by students trying to edit existing text and sometimes completely new spelling mistakes. Snowflake and A-good-idea show the greatest percentage difference between the number of spelling errors in first and subsequent attempts, perhaps because these questions contain relatively few spelling mistakes overall, so the introduction of typos at second or third attempt is likely to have a greater impact.

A random sample of responses from 200 students (yielding between 300 and 400 responses for each question) was used to investigate the prevalence of homophones and other incorrect words (e.g. decease instead of decrease). Around 1% of these responses were seen to include a word with a meaning other than that intended by the student, but there was no evidence of this causing any problems with response matching, largely because most of the homophones were in words such as to/too which were not required in a correct answer.

3.2.4. Changes made to responses following feedback

For all types of iCMA questions in formative-only and diagnostic use, students sometimes deliberately leave the response box blank, or leave their response unchanged after receiving feedback (Jordan, 2011). Reasons given by students for these behaviours include just wanting to get the final specimen answer, not understanding the question and not understanding the feedback. These behaviours are much less common for questions in summative use. The results for short-answer free-text questions were in line with those for other types of iCMA questions; there were higher proportions of blank and repeated responses for questions in diagnostic and formative-only use than is the case for questions in summative use.

¹ Note on significance: This sequence of tests involved 11 chi-squared tests, which means that it is not appropriate to follow the usual convention of saying that a result is significant if the probability was less than 0.05. The Bonferroni correction suggests that we should require p < 0.05/11, i.e. p < 0.0045, in order to claim significance.

Table 7
The number and percentage of 1st attempt, 2nd attempt and 3rd attempt responses that include a spelling mistake. For each question: number of first attempt responses and the number (and percentage) with a spelling error; the same for second and third attempt responses; and the probability, p, that the variation arose by chance.

Snowflake
  Formative: 450, 30 (6.7%) | 241, 22 (9.1%) | 197, 19 (9.6%) | p = 0.33
  S104 08J: 1341, 66 (4.9%) | 419, 43 (10.3%) | 297, 39 (13.1%) | p = 3 × 10⁻⁴
  S104 09B: 1327, 96 (7.2%) | 398, 35 (8.8%) | 266, 24 (9.0%) | p = 1 × 10⁻⁷
  S104 10B: 1315, 61 (4.6%) | 370, 27 (7.3%) | 234, 29 (12.4%) | p = 2 × 10⁻⁵
A-good-idea
  AYRF (IAT): 9317, 833 (8.9%) | 5636, 639 (11.3%) | 4936, 567 (11.5%) | p = 8 × 10⁻⁸
  S154 09J: 680, 35 (5.1%) | 237, 26 (11.0%) | 154, 19 (12.3%) | p = 6 × 10⁻⁴
Oil-on-water
  S154 09J: 710, 32 (4.5%) | 106, 5 (4.7%) | 60, 1 (1.7%) | p = 0.60
Metamorphic
  SXR103 09E: 434, 63 (14.5%) | 104, 13 (12.5%) | 65, 13 (20.0%) | p = 0.39
Sedimentary
  SXR103 09E: 427, 42 (9.8%) | 145, 15 (10.3%) | 110, 7 (6.4%) | p = 0.49
Sandstone
  S104 09B: 1007, 106 (10.5%) | 521, 57 (10.9%) | 361, 43 (11.9%) | p = 0.77
Electric-force
  S104 09B: 934, 106 (11.3%) | 245, 34 (13.9%) | 176, 26 (14.8%) | p = 0.31

Table 8
Changes made to student responses between 1st and 2nd attempts and between 2nd and 3rd attempts, for questions in summative use. For each question and attempt: total responses, followed by the number of responses (and percentage of total) in each category: Unchanged | Rephrased | Deleted | Added | Changed | New.

Snowflake 09B
  1st–2nd: 398 total: 17 (4.3%) | 41 (10.3%) | 33 (8.3%) | 76 (19.1%) | 76 (19.1%) | 155 (38.9%)
  2nd–3rd: 266 total: 12 (4.5%) | 5 (1.9%) | 4 (1.5%) | 67 (25.2%) | 84 (31.6%) | 94 (35.3%)
Snowflake 10B
  1st–2nd: 370 total: 12 (3.2%) | 32 (8.6%) | 22 (5.9%) | 62 (16.8%) | 77 (20.8%) | 165 (44.6%)
  2nd–3rd: 234 total: 9 (3.8%) | 5 (2.1%) | 3 (1.3%) | 59 (25.2%) | 78 (33.3%) | 80 (34.2%)
Sandstone 09B
  1st–2nd: 520 total: 10 (1.9%) | 106 (20.4%) | 50 (9.6%) | 227 (43.7%) | 54 (10.4%) | 73 (14.0%)
  2nd–3rd: 361 total: 8 (2.2%) | 10 (2.8%) | 1 (0.3%) | 216 (60.4%) | 55 (15.2%) | 69 (19.1%)
Sandstone 10B
  1st–2nd: 523 total: 11 (2.1%) | 11 (2.1%) | 11 (2.1%) | 393 (75.1%) | 43 (8.2%) | 54 (10.3%)
  2nd–3rd: 161 total: 7 (4.3%) | 6 (3.7%) | 7 (4.3%) | 65 (40.4%) | 35 (21.7%) | 41 (25.5%)
Electric-force 09B
  1st–2nd: 245 total: 16 (6.5%) | 47 (19.2%) | 22 (9.0%) | 42 (17.1%) | 53 (21.6%) | 65 (26.5%)
  2nd–3rd: 176 total: 10 (5.7%) | 12 (6.8%) | 2 (1.1%) | 29 (16.5%) | 76 (43.2%) | 47 (26.7%)
Electric-force 10B
  1st–2nd: 225 total: 9 (4.0%) | 14 (6.2%) | 4 (1.8%) | 78 (34.7%) | 76 (33.8%) | 44 (19.6%)
  2nd–3rd: 110 total: 12 (10.9%) | 8 (7.3%) | 1 (0.9%) | 12 (10.9%) | 61 (55.5%) | 16 (14.5%)

Fig. 10. Targeted feedback on a common incomplete response to Sandstone, provided after both first and second attempt for the 10B presentation. For the 09B presentation, only general feedback was provided at first attempt.


Fig. 11. The proportion of responses to Sandstone that were correct at first attempt, second attempt, third attempt, and not at all for (a) the 09B presentation of S104; (b) the 10B presentation of the same module.


Table 8 gives more information about the way in which students altered their responses following feedback. There are significant differences between the changes made between 1st and 2nd attempt and between 2nd and 3rd attempt, between different questions, and between different presentations of Sandstone (with p < 10⁻¹⁵ of the difference between the 09B and 10B presentations arising by chance) and Electric-force (p < 10⁻⁹ of the difference between presentations arising by chance), but not Snowflake. We hypothesise that the differences can all be explained by the differing feedback provided and by students' understanding of this. The answer matching for Snowflake was the same in 09B and 10B whilst for Sandstone and Electric-force the answer matching was changed from IAT to PMatch between 09B and 10B, which provided an opportunity to increase the targeted feedback provided and to provide some targeted feedback after an incorrect or incomplete first attempt.
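Table 8's categories (unchanged, rephrased, deleted, added, changed, new) are not given operational definitions in the paper, so the sketch below is only one plausible way of automating such a classification, using word-set overlap between successive attempts; the actual categorisation may well have been done differently.

# An illustrative word-overlap heuristic for the Table 8 categories describing
# how a response changed between attempts; not the classification actually used.
import re

def words(text):
    return re.findall(r"[a-z']+", text.lower())

def classify_change(previous, current):
    prev, curr = words(previous), words(current)
    if prev == curr:
        return "unchanged"
    prev_set, curr_set = set(prev), set(curr)
    if not prev_set & curr_set:
        return "new"            # nothing in common with the previous attempt
    if prev_set == curr_set:
        return "rephrased"      # same words, reordered or repunctuated
    if curr_set < prev_set:
        return "deleted"        # words removed, nothing added
    if prev_set < curr_set:
        return "added"          # words added, nothing removed
    return "changed"            # some words removed and some added

print(classify_change("The force is decreased",
                      "The force is decreased by a factor of four"))   # added
print(classify_change("The force is increased", "It gets smaller"))    # new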

When responses were unchanged, it is likely that students just did not know what to alter, an 'I haven't a clue' type of reaction. When responses were rephrased, students may have believed (probably incorrectly) that they had been marked as incorrect because the computer did not understand their response. When part of a response was deleted, students may have believed that the computer had not marked them as correct because of the presence of a word or phrase in the response. Thus these categories were more common between 1st and 2nd attempt for all the 09B questions and for both presentations of Snowflake, where no targeted feedback was given at the first attempt.

The wording of the standard feedback after an incorrect or incomplete first attempt response, 'Your response appears to be incorrect or incomplete in some way. Have another go, remembering to express your answer as a simple sentence.', may have encouraged students to rephrase, delete part of or add to their answer even when this was not needed. However, the proportion of responses that are added to goes up with targeted feedback and there is a marked increase between the 09B and 10B presentations in the proportion of responses which are added to after first attempt for Electric-force and especially Sandstone. This can be attributed to the introduction of targeted feedback for responses to Electric-force that just talk about the force decreasing but do not say by how much the force decreases and, for Sandstone, for those responses that just talk about the rock being formed in a desert. This is an incomplete response, but students do not always appreciate this, which led to student dissatisfaction with this question in the 09B presentation, despite the fact that its marking accuracy was very high. The targeted feedback shown in Fig. 10 has improved student satisfaction with this question; it has also markedly increased the proportion of students who get this question right at the second attempt (Fig. 11).

4. Discussion and conclusions

The responses given by students to short-answer free-text e-assessment questions varied in length, form and presence/absence of spelling mistakes with a number of factors, most notable of which was the actual question being asked. The detail of the question wording, and students' interpretation of this, was also significant, so when told 'Answers of more than 20 words will not be accepted', students were more likely to give responses of 19–20 words than in the absence of this instruction, and when asked to 'Remember to express your answer as a single sentence', they did just that. The influence of whether the question was in summative, formative-only or diagnostic use, although present, was less significant.

Features observed in some student responses, for example starting with a capital letter and/or ending with a full stop, could be interpreted as providing support for the Computers as Social Actors (CASA) hypothesis. However, students may simply have thought that the computer would mark their response as incorrect in the absence of these features. Other features, for example students' tendency to start an answer without first restating the question, could be seen as evidence that they were conscious that the marker was a computer. However, there is again an alternative explanation; this may simply have been because the question was immediately above the input box. Further work is needed in order to identify the reasons for these apparently contradictory behaviours.

Many students submitted answers which included spelling mistakes, even to summative questions, and they frequently did not spot these errors when told that their response was incorrect. This could be seen as support for the CASA hypothesis, but it seems more likely that students simply did not realise that they were making mistakes. For this reason, there is considerable benefit in the dual approach of providing a spell-checker and accepting some incorrect spellings. Some question authors require certain words (especially scientific words) to be spelt correctly; PMatch gives authors the choice on this point, and a small number of specialist words have been added to the dictionary. There is a tension between the requirement that students should know the correct spelling of a scientific word and a wish not to disadvantage certain students inappropriately, especially those for whom English is not their first language and those with dyslexia.
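
By way of illustration, the sketch below shows one way a marker could tolerate near-miss spellings of keywords while still insisting on exact spelling for words a question author chooses to protect. The edit-similarity threshold, word lists and helper names are assumptions for the example rather than a description of how PMatch or the module's spell-checker behaves.

```python
# Illustrative sketch only: accepting near-miss spellings by string similarity,
# while protecting words the question author requires to be spelt exactly.
from difflib import SequenceMatcher

MUST_BE_EXACT = {"igneous", "metamorphic"}   # hypothetical protected terms

def close_enough(word: str, keyword: str, threshold: float = 0.8) -> bool:
    """Treat a misspelling as a match unless the keyword must be exact."""
    if keyword in MUST_BE_EXACT:
        return word == keyword
    return SequenceMatcher(None, word.lower(), keyword.lower()).ratio() >= threshold

print(close_enough("sedimentry", "sedimentary"))   # True: small slip accepted
print(close_enough("metamorfic", "metamorphic"))   # False: exact spelling required
```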

When students were convinced that the computer had marked them incorrectly, whether or not this was actually the case, they appeared to be demonstrating acute awareness that they were being marked by a computer rather than a human marker. Walker et al. (2008) and Lipnevich and Smith (2009) reported similar results. Similarly, when students removed parts of responses or reworded them completely after being told that their first response was incorrect, this was probably because they believed the computer had not 'understood' what they were saying, an assumption they are less likely to make of a human marker. However, for human marking, the credibility of the assessment process and the perceived usefulness of the feedback provided have been found to relate to students' perceptions of the lecturers themselves (Poulos & Mahony, 2007). Assessment of any sort appears to be more effective when student confidence in the marker is high, a finding which reinforces the importance of accurate marking and of providing feedback which the student is able to understand and use.

In general, students appeared to engage with the feedback provided to a pleasing extent, even when not directly observed. Engagement with feedback was seen to vary from formative to summative use and from student to student, but the over-riding influence was the detail of what the feedback said and the student's interpretation of this. So, for example, the statement 'Your answer is correct' was seen to detract from the detailed feedback that followed it, whilst feedback was engaged with more carefully when the student had been told that their response was incorrect. Kulhavy and Stock (1989) introduced the concept of 'response certitude', arguing that feedback is most helpful when a student is confident that their answer is correct but it turns out to be incorrect.

Whilst brief feedback was sometimes seen to be used effectively by students, more detailed, tailored feedback was generally more useful. This is in line with Hsieh and O'Neil's (2002) finding that feedback that gives information about what is right and wrong, with hints for further improvement, is more effective than feedback that just tells a student whether their answer is correct or incorrect. Although the final 'specimen answers' were seen to be used, especially when a student had not managed to reach a correct answer after three attempts, in general prompting-answer strategies were seen to be more useful than giving-answer strategies (Ferreira & Atkinson, 2009).

Our first overall conclusion is that feedback can be made vastly more effective by making sure that it is understood by students and tailored to the mistakes that they have made. This effectiveness extends beyond the immediate usefulness of the feedback itself to an enhanced confidence in the ability of the marker, be that a human or a computer.

Our second conclusion is that students are strongly influenced by what they believe the question setter wants them to do. This conclusion almost certainly applies more generally than to computer-marked short-answer questions. As Struyven et al. (2005, p. 338) comment, "Do not we all want to 'figure out the teacher'?"

Finally, observation of students and careful inspection of their responses has led to an increased understanding of student behaviour, which in turn has led to demonstrably improved assessment items. This cyclical model of evaluation and development (Benson & Brack, 2010) has much to commend it as we seek to improve the student experience.

Acknowledgements

The author gratefully acknowledges the financial support of the Centre for Open Learning of Mathematics, Science, Computing and Technology (COLMSCT) and the invaluable assistance of Tom Mitchell (Intelligent Assessment Technologies Ltd.), Barbara Brockbank, Philip Butcher, Richard Jordan, Stephen Swithenby and two anonymous reviewers, whose comments enabled her to see the wood for the trees.

Appendix A. Questions included in the usability laboratory observation

Block 3 Q1
What does an object's velocity tell you that its speed does not?

Block 3 Q2
A hailstone falls vertically with a constant speed. What does this tell you about the forces acting on the hailstone?

Block 3 Q9
The photograph shows a selection of igneous rocks. How are igneous rocks formed?

Block 3 Q10
You are handed two rock specimens and you are told that one is an intrusive igneous rock whilst the other is an extrusive igneous rock. How would you know which was the intrusive specimen?

Block 3 Q11
Why do intrusive igneous rocks have larger crystals than extrusive ones?

Block 3 Q12
You are handed a rock specimen that consists of interlocking crystals. How could you be sure, from its appearance, that this was a metamorphic rock?

Block 3 Q13
You are handed a rock specimen from a cliff that appears to show some kind of layering. The specimen does not contain any fossils. How could you be sure, from its appearance, that this rock specimen was a sedimentary rock?

Block 7 Q1
What is the key characteristic of an emission spectrum from a vapour of atoms such as hydrogen, helium or sodium?

Block 7 Q2
What other thing happens when a hydrogen atom makes a transition from the n = 3 energy level to the n = 2 energy level?

Block 7 Q3
What other thing happens when a hydrogen atom makes a transition from the n = 1 energy level to the n = 5 energy level?

Block 7 Q6
How do the separations of the energy levels of a singly ionised helium atom compare with those of a hydrogen atom? Be as specific as possible.

Block 7 Q7
What are the two key features of the quantum world that apply when discussing the properties of atoms? First [ Box for first answer ] Second [ Box for second answer ]

Block 7 Q8
During an alpha decay, what happens to the mass number of the nucleus undergoing the decay? Be as specific as possible.

Block 7 Q9
During a beta-minus decay, what happens to the atomic number and the mass number of the nucleus undergoing the decay? Be as specific as possible in your answers. Put each answer in the relevant box. Atomic number [ Box for answer ] Mass number [ Box for answer ]

Block 7 Q10
Water waves in a ripple tank are diffracted by a narrow aperture, as shown in the 'Making waves' video in Block 7 Activity 8.1. What would be the effect of increasing the width of the aperture?

Block 7 Q11
Water waves in a ripple tank are diffracted by a narrow aperture, as shown in the 'Making waves' video in Block 7 Activity 8.1. What would be the effect on the observed diffraction of increasing the wavelength whilst keeping the aperture size the same?

Block 7 Q12
A red laser beam is shone through a diffraction grating (as shown in the 'Making waves' video in Block 7 Activity 8.1). What would be the effect on the observed diffraction pattern of replacing the diffraction grating with one in which the lines are closer together (i.e. d is smaller)?

Block 7 Q13
A red laser beam is shone through a diffraction grating (as shown in the 'Making waves' video in Block 7 Activity 8.1). What would be the effect (other than a change in colour) on the observed diffraction pattern of replacing the red laser with a blue one? Note: blue light has a higher frequency than red light.

Block 7 Q16
Light is shone on a metal and photoelectrons are emitted. The frequency of the light is then reduced. What does the photoelectric effect tell you about the two changes that might occur? Write one possible outcome in each of the boxes provided. First [ Box for first answer ] Second [ Box for second answer ]

Appendix B. Questions included in the analysis of responses

Snowflake
A snow flake falls vertically with constant speed. What does this tell you about the forces acting on the snow flake?

A-good-idea
Why is the following not a sentence?: A good idea.

Oil-on-water
The photograph shows a layer of oil floating on top of a glass of water. Why does the oil float?

Metamorphic
You are handed a rock specimen that consists of interlocking crystals. How could you be sure, from its appearance, that this was a metamorphic rock?

Sedimentary
You are handed a rock specimen from a cliff that appears to show some kind of layering. The specimen does not contain any fossils. How could you be sure, from its appearance, that this rock specimen was a sedimentary rock?

Sandstone
A sandstone observed in the field contains well-sorted, well-rounded, fine pitted and reddened grains. What does this tell you about the origins of this rock?

Electric-force
If the distance between two electrically charged particles is doubled, what happens to the electric force between them? Be as specific as possible.

References

Bangert-Drowns, R. L., Kulik, J. A., & Morgan, M. T. (1991). The instructional effect of feedback in test-like events. Review of Educational Research, 61, 213–238.
Beevers, C., et al. (2010). What can e-assessment do for learning and teaching? Part 1 of a draft of current and emerging practice: review by the E-Assessment Association Expert Panel. Paper presented by John Winkley on behalf of the panel to the International Computer Assisted Assessment (CAA) Conference, 20th–21st July 2011, Southampton, UK.
Benson, R., & Brack, C. (2010). Online learning and assessment in higher education: A planning guide. Oxford: Chandos Publishing.
Boud, D. (1995). Assessment and learning: contradictory or complementary? In P. Knight (Ed.), Assessment for learning in higher education (pp. 35–48). London: Kogan Page in association with SEDA.
Butcher, P. G., & Jordan, S. E. (2010). A comparison of human and computer marking of short free-text student responses. Computers and Education, 55(2), 489–499.
Carless, D. (2006). Differing perceptions in the feedback process. Studies in Higher Education, 31(2), 219–233.
Clariana, R., & Wallace, P. (2002). Paper-based versus computer-based assessment: key factors associated with test mode effect. British Journal of Educational Technology, 33(5), 593–602.
Conole, G., de Laat, M., Darby, J., & Dillon, T. (2006). JISC LXP: Student experiences of technologies: Final report. Bristol: JISC.
Crisp, B. R. (2007). Is it worth the effort? How feedback influences students' subsequent submission of assessable work. Assessment and Evaluation in Higher Education, 32(5), 571–581.
Draper, S. (2009). What are learners actually regulating when given feedback? British Journal of Educational Technology, 40(2), 306–315.
Earley, P. C. (1988). Computer-generated performance feedback in the subscription processing industry. Organizational Behavior and Human Decision Processes, 41, 50–64.
Ferreira, A., & Atkinson, J. (2009). Designing a feedback component of an intelligent tutoring system for foreign language. Knowledge Based Systems, 22, 496–501.
Fowler, A. (2008). Providing effective feedback on whole-phrase input in computer-assisted language learning. Paper presented at the 12th International Computer Assisted Assessment (CAA) Conference, 8th–9th July 2008, Loughborough, UK.
Gipps, C. V. (2005). What is the role for ICT-based assessment in universities? Studies in Higher Education, 30(2), 171–180.
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77, 81–112.
Higgins, R., Hartley, P., & Skelton, A. (2002). The conscientious consumer: reconsidering the role of assessment feedback. Studies in Higher Education, 27(1), 53–64.
Hsieh, I. L. G., & O'Neil, H. F., Jr. (2002). Types of feedback in computer-based collaborative problem-solving group task. Computers in Human Behaviour, 18(6), 699–715.
Jordan, S. (2011). Using interactive computer-based assessment to support beginning distance learners of science. Open Learning, 26(2), 151–168.
Jordan, S., & Mitchell, T. (2009). E-assessment for learning? The potential of short-answer free-text questions with tailored feedback. British Journal of Educational Technology, 40(2), 371–385.
Kluger, A. N., & DeNisi, A. (1996). The effects of feedback interventions on performance: a historical review, a meta-analysis and a preliminary feedback intervention theory. Psychological Bulletin, 119, 254–284.
Kulhavy, R. W., & Stock, W. A. (1989). Feedback in written instruction: the place of response certitude. Educational Psychology Review, 1(4), 279–308.
Lipnevich, A. A., & Smith, J. K. (2008). Response to assessment feedback: The effects of grades, praise and source of information. Princeton, NJ: Educational Testing Service.
Lipnevich, A. A., & Smith, J. K. (2009). "I really need feedback to learn": students' perspectives on the effectiveness of the differential feedback messages. Educational Assessment, Evaluation and Accountability, 21, 347–367.
Lizzio, A., & Wilson, K. (2008). Feedback on assessment: students' perceptions of quality and effectiveness. Assessment and Evaluation in Higher Education, 33(3), 263–275.
Miller, T. (2008). Formative computer-based assessment in higher education: the effectiveness of feedback in supporting student learning. Assessment and Evaluation in Higher Education, 34(2), 181–192.
Mishra, P. (2006). Affective feedback from computers and its effect on perceived ability and affect: a test of the computers as social actors hypothesis. Journal of Educational Multimedia and Hypermedia, 15, 107–131.
Mitchell, T., Russell, T., Broomhead, P., & Aldridge, P. (2002). Towards robust computerised marking of free-text responses. Paper presented at the 6th International Computer Assisted Assessment (CAA) Conference, 9th–10th July 2002, Loughborough, UK.
Orsmond, P., & Merry, S. (2011). Feedback alignment: effective and ineffective links between tutors' and students' understanding of coursework feedback. Assessment and Evaluation in Higher Education, 36(2), 125–136.
Pommerich, M. (2004). Developing computerized versions of pen-and-paper tests: mode effects for passage-based tests. Journal of Technology, Learning and Assessment, 2(6). Accessed 21st June 2011 from http://ejournals.bc.edu/ojs/index.php/jtla/article/view/1666/1508.
Poulos, A., & Mahony, M. J. (2007). Effectiveness of feedback: the students' perspective. Assessment and Evaluation in Higher Education, 33(2), 143–154.
Price, M., Handley, K., Millar, J., & O'Donovan, B. (2010). Feedback: all that effort, but what is the effect? Assessment and Evaluation in Higher Education, 35(3), 277–289.
Ramaprasad, A. (1983). On the definition of feedback. Behavioral Science, 23, 4–13.
Reeves, B., & Nass, C. (1996). The media equation. Stanford, CA: Center for the Study of Language and Information.
Sadler, D. R. (1989). Formative assessment and the design of instructional systems. Instructional Science, 18, 119–144.
Stone, D., Jarrett, C., Woodroffe, M., & Minocha, S. (2005). User interface design and evaluation. San Francisco: Morgan Kaufmann.
Struyven, K., Dochy, F., & Janssens, S. (2005). Students' perceptions about evaluation and assessment in higher education: a review. Assessment and Evaluation in Higher Education, 30(4), 325–341.
Walker, M. (2009). An investigation into written comments on assignments: do students find them usable? Assessment and Evaluation in Higher Education, 34(1), 67–78.
Walker, D. J., Topping, K., & Rodrigues, S. (2008). Student reflections on formative e-assessment: expectations and perceptions. Learning, Media and Technology, 33, 221–234.