The Social Aspect of Oral Assessment in Second Language Education

LLED 526 Second Language Assessment: Conceptual and Empirical Approaches

Melike Kesirli Erozlu

November 28, 2013

Abstract

This paper discusses the effect of social interaction on test takers'

performance and scores in oral exams. It aims to provide a background

for the importance of interaction in oral language testing, examine

the ways interaction shows itself in various testing contexts and

offer some alternative views on this issue. Drawing on McNamara's

(1997; 2001) and Lazaraton and Davis's (2006; 2008) views, it focuses on

how test takers position themselves as proficient and competent

language users, the consequences of interaction between strong and

weak candidates and the 'multiple identities' each brings to the test

context. Finally, the effects of interlocutor and rater interaction

on the performance and scores of candidates are discussed.

Introduction

The fact that language test performance is co-constructed and is

“essentially a social activity” (McNamara, 1997, p. 447) has shaped

the discussion on issues of validity, reliability and fairness in

oral assessment for quite some time. McNamara suggested that rather

than looking at test performance as a (solely) cognitive endeavour,

we should view speaking test scores as the product of a number of

factors such as the interlocutor, the rater, the task, and the rating

criteria. McNamara (2001) also linked speaking test performance to

post-structuralist conceptions of identity through “performativity”.

Indeed, McNamara (2001) argues that in language testing we cannot

just assume that constructs such as language proficiency or language

ability were present before the test performance or that these will

continue to exist in the target language use context after the test

is administered. What we need to keep reminding ourselves here is

that “it is the task of the language tester to allow these to be

expressed, to be displayed in the test performance. But what if the

direction of the action is the reverse, so that the act of testing

itself constructs the notion of language proficiency?” (p. 339).

My own teaching and learning experience in English has shown me that

interaction during the test plays a very important role in the

scoring of candidates' performance in oral examinations. As I carried

on with my duties as an oral examiner for Cambridge ESOL's KET/PET

exams between 2006 and 2012, I continuously questioned the

contribution of the many aspects of social interaction to the test

takers' oral performance scores. My observations and experiences as

an interlocutor and an assessor in these exams urged me to search for

answers to questions such as: What makes up social interaction during

an oral exam? If performance is a joint act of examiners and

examinees, or of peers in paired oral tests, are interlocutor frames and

detailed rubrics adequate to help assessors evaluate a single

performance objectively, or are there more factors affecting the

scores when it comes to the social aspect of oral assessment? How

does the wrong pairing of candidates affect their performance during the

exam? And how do the dynamics between the interlocutor and the

assessor influence the final scores of test takers?

In this paper, I will attempt to respond to these questions

regarding the effects of social interaction in oral assessment.

Background information

In order to understand how social interaction affects test takers'

oral performance scores, we first need to look at what we mean by

interaction and what its features are in a test environment.

In his work in language testing, McNamara (1997) defines

“interaction” as “a social/behavioural one, where joint behaviour

between individuals is the basis for the joint construction (and

interpretation) of performance” (p. 447).

Similarly, Joughin (1999) in his six dimensions of oral assessment

explains that it is essential to understand the ways examiners

interact with the test takers and what kinds of interaction are

needed for specific testing situations. According to Joughin (2010), one of the

distinctive features of oral assessment is that it allows for

interaction between the examiners and the candidates, and sometimes

others, with the interaction often being rapid and unpredictable. The

types of interaction may range from the examiner asking the examinee

supporting questions to help complete a turn to an examinee making

claims to persuade peers that an argument is worth supporting. Other

formats include the non-interactive one-way presentation and the

debate or discussion between the test taker and the examiner or

between a pair or group of test takers.

Interaction gives oral assessment meaning, and it has a positive

washback effect on classroom teaching because, when candidates know that

their interaction will affect their performance score, they may be

motivated to prepare comprehensively for the test. However, we should

not forget that the very same interaction can make assessment

unpredictable, so it must be our priority to ensure fairness and

equality in terms of opportunities we provide for the candidates to

show their knowledge and abilities. This requires careful planning of how

the interaction between the stakeholders will take place. When working

out the details of an oral exam, we need to consider the type of

interaction that will take place during the test, how the examiners are

expected to interact with the test takers and, if there is an audience,

what the role of that audience will be.

The oral interview

The oral interview is believed to be a valid way of testing

conversational communicative competence because the interaction that

takes place is dynamic in nature and unpredictable. However, Brown

(2003) argues that this unpredictability may compromise test

reliability. Lazaraton (as cited in Brown, 2003) explains that

consistent examiner conduct is necessary in order to ensure

consistent ratings during the procedure. After all, how can we

guarantee that all candidates will be given the same number and kinds

of opportunities to demonstrate their knowledge and abilities if

examiners do not administer the test in prescribed ways? In particular,

the ways in which some interviewers adjust their speech to the

proficiency level of the candidates, depending on their own

assumptions, are worth looking at in more detail (Brown, 2003).

Group Oral Tests

Research shows that group oral tests are likely to have high face

validity among students. For example, Fulcher (1996) studied the

students' reactions to oral tasks they were asked to complete. The

tasks included one-on-one interviews and a group discussion. Students

indicated that they felt least anxious about the group discussion

task, and that they thought the conversation they had during the

group discussion was more natural compared to one-on-one interviews.

However, McNamara (2001) questions whether scores based on joint

interaction among examinees in group oral tests can be regarded as

evidence of individual ability at all. Swain (2001) seems to

support this view:

“In a group, the performance is jointly constructed and distributed across the

participants. The dialogues construct cognitive and strategic processes which

in turn construct student performance. One implication for testing is,

minimally, that serious thought needs to be given to the most adequate and fair

means of scoring the linguistic activity and its product that derives from

group interaction. It also means that in a testing situation, whom one is

paired or grouped with, is not unimportant.” (p. 296).

Taken together, these studies make a convincing case that a

candidate's individual performance is closely tied to the interaction

maintained during an oral test. This becomes even more important in

group oral tests, where more factors, such as gender, proficiency

level, and familiarity or unfamiliarity with the other test takers,

affect the overall production than in the oral interview, where the

examiner's participation is limited to an interlocutor framework.

The role of proficiency identities in language assessment

Test takers bring multiple identities to language assessment

contexts, and the level of familiarity or acquaintanceship may affect

the discourse that is produced and the scores assigned to each

candidate (McNamara, 1997).

Holster (2007) argues that “the ability to create and project a range

of identities is fundamental to becoming a proficient user” of the L2

(p. 1). Lazaraton and Davis (2008) also agree that there is an

interlocutor effect in paired testing, but they stress that this

effect should not be examined only from the perspective of fixed

identities assigned to the participants by the researchers. They

believe that the language proficiency identity displayed in oral

tests by the participants can change and evolve during interaction.

This probably means that it is very difficult to determine what

“true proficiency” is, as it is continually reconstructed and

modified in the talk, depending on variables such as a partner's

acknowledgement or rejection of the speaker's language proficiency

identity.

Similarly, Davis (2007) takes a post-structuralist view of identity

to examine how perceptions of language proficiency are established in

a paired speaking test. In his study, he analyzed the language

produced by two pairs in a Chinese speaking test and looked at how

each candidate demonstrated their competence as well as how their

performance affected the proficiency level of their partners. The

findings suggest that the proficient speakers produced longer,

extended turns, used accurate language and often self-corrected their

speech. They listened to their partners and had a genuine interest in

what their partners had to say, and therefore positioned themselves

as “competent” and “proficient.” On the other hand, their weaker

partner was positioned as “weak” when these more proficient partners

helped him with words or responses, ignored some of his

contributions, or warned him for not fulfilling the task

requirements; the weaker speaker employed certain strategies to

resist this by not changing his point of view and disagreeing with

his partner.

In an earlier study, Lazaraton and Davis (2006) examined the discourse

features in the candidates' talk and found that identity

formulations such as “more proficient” and “less proficient” are

constructed in the test takers' discourse. This means that a test

taker comes to a paired oral assessment context with an already

formed identity of proficient L2 user and this identity is

constructed, maintained and re-developed during the test. As

Lazaraton and Davis (2008) explain, a good pairing can definitely

work to the advantage of the test takers as the discourse that they

jointly produce can benefit their performance when scores are

assigned. The question, then, is this: if two such candidates can

easily position themselves as “proficient” and “competent” through the

collaborative interactiveness of their oral production and so

receive high scores, what are the consequences in terms of

proficiency and/or interactivity when a highly proficient speaker is

randomly matched with a weaker partner?

McNamara (1997) stresses a need to re-problematize the extent to

which the interaction in role play resembles the interaction in the

target language use, and whether it is actually possible to simulate

the conditions of the target interactions during an oral test. In

reality, we can only assume that the sample of the candidate's oral

performance during the exam is a simulation of the target language

she would use in real life, but whether it is a good or a bad sample

depends on many other factors, as I have discussed above.

Construct interlocutor effect

The role of examiners in the candidates' performance has long been

discussed by researchers interested in oral assessment. According to

Yaphe and Street's (2003) findings, examiners go through a three-stage

process when they assign a score to each test taker. First,

they make a strong judgement based on the “first impressions”

formed by the candidate's initial response. Later, depending on the

test takers' responses to their further questions, they form a

“provisional grade” in their minds. Remaining questions are used to

confirm the candidates' final grade. This study reveals that the

candidates' personal qualities affect their grade. Yaphe and Street

(2003) note that confident, fluent test takers mostly score well,

while weaker ones receive low scores because their responses are not

comprehensive enough and lack coherence. They

point out that personal attributes like fluency and creativity should

not be included in the construct of an exam that aims to test

language proficiency. This becomes even more important when the

findings suggest that some examiners in the study were affected by

test takers' personal attributes and this influence had a direct

impact on the final scores they gave.

On the other hand, a study conducted by Weingarten et al. (2000)

revealed that experienced examiners, examiners with senior academic

rank and those who qualified in English-speaking countries were

likely to grade more harshly and fail more candidates. The study

concluded that examiners whose scores are markedly out of line should

not be allowed to mark oral exams.

Another study by Brown and Hill (as cited in Brown, 2003) analyzed

interviewer effects in relation to the speaking part of the IELTS

exam and found a tendency for test takers to receive higher scores

with certain interviewers than with others. Further analysis showed

significant differences between certain pairs of interviewers in

terms of their grading. (The IELTS interview format in this study was

replaced in July 2001 by a new version which restricted interviewer

behaviour to the use of ‘interlocutor frames’.) Brown (2003) also

analyzed the discourse in two interviews in the IELTS oral exam

involving the same candidate with two different interviewers and was

able to show how closely the interviewer participated in the

construction of test takers' oral proficiency:

“The behaviour of the two interviewers were different in terms of

the ways in which they structured sequences of topical talk, their

questioning techniques, and the type of feedback they provided. An

analysis of verbal reports produced by some of the raters confirmed

that these differences resulted in different impressions of the

candidate’s ability: in one interview the candidate was considered

to be more ‘effective’ and ‘willing’ as a communicator than in the

other.” (p. 1)

In light of these findings, Brown (2003) questions the

validity of conversational oral interviews, since the attempt

to assess real-world conversational skills seems to depend largely on the

unpredictable nature of the test interaction.

On the other hand, the findings of Gass and Varonis (1984) revealed the

effect of familiarity on the native speaker's comprehension of nonnative speech:

“1. Familiarity with the topic of discourse greatly facilitates comprehension.

2. Familiarity with nonnative speech in general facilitates comprehension.

3. Familiarity with a particular nonnative accent facilitates comprehension of the speech of another nonnative of that language background.

4. Familiarity with a particular nonnative speaker facilitates comprehension of that person's speech.”

(p. 81)

These results are not surprising for many of us in the field as we

often experience and observe that oral examiners who speak the

language of the country they teach in are used to hearing a

particular accent and can easily comprehend what the test takers from

that country say, whereas if the examiners are not familiar with the

candidates' accents, they may have difficulty understanding their

speech and may assign inconsistent scores.

This effect is also evident in the results from Winke, Gass, and Myford's (2012) bias

interaction analyses:

“Matches between the raters’ L2 and the test takers’ L1

resulted in some of the raters assigning ratings that were

significantly higher than expected. After an initial 4-hour

training period, a group of 107 raters (mostly of learners of

Chinese, Korean, and Spanish), listened to a selection of 432

speech samples that 72 test takers (native speakers of Chinese,

Korean, and Spanish) produced. As a whole, raters with Spanish

as an L2 were significantly more lenient toward test takers who

had Spanish as an L1, and raters with Chinese as an L2 were

significantly more lenient toward test takers who had Chinese

as an L1.” (p. )

Lumley and Brown (1996) also found that interlocutor

behaviour can determine the level of difficulty of the interaction

for the test takers. The candidate is disadvantaged if the

interlocutor uses sarcastic language, interrupts, or contributes very

little to the interaction. Conversely, if the interlocutor does

not go beyond factual questions, or if he/she simplifies his/her

language, this seems to help the candidates' performance scores.

Brown (2003) summarizes the interviewers' behaviour during oral exams

as varied in terms of

“· the level of rapport that they establish with candidates

· their functional and topical choices

· the ways in which they ask questions and construct prompts

· the extent to which or the ways in which they accommodate their speech to that of the candidate

· the ways in which they develop and extend topics” (p. 3)

Finally, Tarone (1998) explains that when learners produce

interlanguage, their production varies depending on the social context,

which can be reflected in their scores. Therefore, she suggests

documenting the situational features that affect interlanguage,

together with the changes in production, in order to have more valid

and reliable assessments of language ability.

Rater characteristics

Another major concern with the ways of interaction in oral assessment

is the subjectivity of the raters' decisions about the candidates'

performance. Many studies reveal that the rater can have a direct

effect on the candidates' scores. For example, O'Loughlin (1997)

showed that the live version of an ESL speaking test given to

immigrants in Australia, which included an interlocutor, could not be

regarded as equivalent to the taped version because of the social

interaction inherent in the former. His findings suggest that some

raters' reactions to certain interlocutors, tasks, or the test format

were negative and that these reactions influenced their scoring,

which worked in favour of the candidates, as they received the

score required to pursue their life goals.

Several other studies have also looked at the effects of rater bias

and ways to control them. For example, McNamara (1997) notes that

raters can vary in their scoring behaviour depending on the

particular group of examinees, the particular task, and the

particular occasion. Investigating ways to reduce such bias, Wigglesworth

(1993) found that raters were responsive to feedback and were able to

incorporate it into their ratings. However, despite rater training,

it was found that each rater had a unique background that could

affect their judgements about the candidates.

Likewise, a study conducted in Bangladesh (Khan, 2006) reveals that some

IELTS oral examiners favour certain topics and avoid others because of

test takers' poor language proficiency. They report doing this when

they are convinced that the candidates do not have enough information

on certain topics, or when they think that a topic is uncommon in

Bangladeshi culture or is difficult and confusing for candidates coming

from a Bangladeshi context. A majority of the examiners who took part

in this study also reported that they substitute an easy or familiar

word for a difficult one.

The results of these studies once again raise the question of whether

detailed rubrics and the training given to raters are adequate

measures for controlling rater bias and producing reliable tests.

Bachman's interactional/ability approach

Considering the points discussed above and drawing on McNamara's

(1997; 2001) views, we can say that our dilemma lies in the

fact that we isolate the test taker from his/her joint

performance in the name of objective rating. We expect him/her to

be solely responsible for a performance that we assume to be

separate from his/her partner's, and we rely on detailed rubrics,

interlocutor frames and rater training to achieve this.

McNamara (1997) claims that this perception is a

result of Bachman's (1990) “interactional/ability approach”, which

reduces performance to “an ability in cognitive terms”:

“a performance is not a simple projection of what

is in the head of the candidate, even if that

display is mediated by the candidate's strategies

for dealing with the interactional context in which

it is to be achieved.” (p. 453)

Surprisingly, too many testers still seem to think that the relation

between the candidate's performance and their score is transparent

and that the rater, based on his/her categorization, can give a

performance an objective score as long as he/she has evidence from

the test to back his/her decision.

McNamara (1997) problematizes the issue of interlocutor effect from a

Vygotskyan perspective. Vygotsky's (1978) theory focuses on the

learning potential of individuals and stresses that when the learners

are somewhere between their existing competence and their future

potential performance, they are in their zone of proximal development

(ZPD). When we apply this concept to language testing, we can better

understand how test takers can benefit from being in interaction with

more proficient and competent interlocutors.

Similarly, Halliday rejects the view that sees social interaction as

part of a cognitive process inside the individual's brain and instead

adopts a view that sees the organization of language as a

shared resource for meaning (as cited in McNamara, 1997). He expects

the candidates to demonstrate mastery of types of interactions such

as the ones in types of genre, or their components. However, McNamara

(1997) argues that this perception assumes that test validity can be

achieved by controlling content and agrees with Messick (1994) who

emphasizes that content validity cannot be sufficient to conclude

that a test is valid. McNamara goes on to say that the categorization

in systemic functional systems does not work in judging candidates'

performance in oral assessment because such judgments can only be

probabilistic rather than deterministic. I agree with McNamara that

trying to fit candidates' performance into systemic functional

systems will not work in an oral test, which is essentially natural,

unstructured and unpredictable production. Furthermore,

this kind of categorization cannot guarantee that the candidate will

perform in the same way as she did in the test if she were faced with a

similar situation in the real world.

Conclusion

In this paper I have focused on the aspects of social interaction in

oral assessment and how they may affect candidates' performance and

scores. The role of partners in paired/group oral tests and the

influence of examiners and raters on the oral production and test

scores were questioned based on findings from different studies. I

have also looked at the multiple identities that the L2 learners

bring to the oral assessment context and how those identities may

have an impact on test takers' interaction patterns during the exam.

Finally, drawing on McNamara's criticism (1997) of Bachman's (1990)

interactional/ability approach, the paper offered some alternative

views on interaction in oral tests.

Some important questions that I would like to elaborate on in the

future regarding interaction in oral exams are: 1) If oral performance

is so intimately tied to joint production in real-life situations,

as the studies suggest, is it adequate to depend on

detailed rubrics and to constrain interlocutor participation to a

framework? 2) Can we assume that these measures will automatically

provide the assessor with clear evidence of the candidate's

speech performance in the target language domain? 3)

What can language teachers do to help their students prepare for such

oral exams?

References

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing, 20(1), 1–25.

Davis, L. (2007). Proficiency and identity in tests of speaking: Are we measuring what we think we are measuring? Unpublished manuscript, University of Hawaii.

Fulcher, G. (1996). Testing tasks: Issues in task design and the group oral. Language Testing, 13, 23–51.

Gass, S. M., & Varonis, E. M. (1984). The effect of familiarity on the comprehensibility of nonnative speech. Language Learning, 34, 65–89.

Holster, T. (2007). Identity and proficiency: Meaningful approaches to learning and assessment. Retrieved from http://trevorholster.com/Papers/07-05_id_prof.pdf

Joughin, G. (1999). Dimensions of oral assessment and student approaches to learning. In S. Brown & A. Glasner (Eds.), Assessment matters (pp. 146–156). Buckingham: The Society for Research into Higher Education & Open University Press.

Joughin, G. (2010). A short guide to oral assessment. Retrieved from https://www.leedsmet.ac.uk/publications/files/100317_36668_ShortGuideOralAssess1_WEB.pdf

Khan, R. (2006). The IELTS Speaking Test: Analysing culture bias. Malaysian Journal of ELT Research, 2, 60–79.

Lazaraton, A., & Davis, L. (2006). Process and outcome in paired oral assessment: What is the ‘interlocutor effect’ on discourse and scores? Paper presented at the AAAL Conference, Montreal, Canada.

Lazaraton, A., & Davis, L. (2008). A microanalytic perspective on discourse, proficiency, and identity in paired oral assessment. Language Assessment Quarterly, 5(4), 313–335.

Lumley, T., & Brown, A. (1996). Specific-purpose language performance tests: Task and interaction. In G. Wigglesworth & C. Elder (Eds.), The language testing cycle: From inception to washback. Australian Review of Applied Linguistics, 13, 105–136.

McNamara, T. F. (1997). “Interaction” in second language performance assessment: Whose performance? Applied Linguistics, 18, 446–465.

McNamara, T. (2001). Language assessment as social practice: Challenges for research. Language Testing, 18, 333–349.

Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.

O'Loughlin, K. (1997). The comparability of direct and semi-direct speaking tests: A case study. PhD thesis, University of Melbourne.

Swain, M. (2001). Examining dialogue: Another approach to content specification and to validating inferences drawn from test scores. Language Testing, 18(3), 275–302.

Tarone, E. (1998). Research on interlanguage variation: Implications for language testing. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and language testing research (pp. 71–89). Cambridge, UK: Cambridge University Press.

Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes. Cambridge, MA: Harvard University Press.

Weingarten, M. A., Polliack, M. R., Tabenkin, H., & Kahan, E. (2000). Variations among examiners in family medicine residency board oral examinations. Medical Education, 34, 13–17.

Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in assessing oral interaction. Language Testing, 10(3), 305–323.

Winke, P., Gass, S., & Myford, C. (2012). Raters' L2 background as a potential source of bias in rating oral performance. Retrieved from http://ltj.sagepub.com/content/early/2012/08/16/0265532212456968

Yaphe, J., & Street, S. (2003). How do examiners decide? A qualitative study of the process of decision making in the oral examination component of the MRCGP examination. Medical Education, 37, 764–771.
