The Social Aspect of Oral Assessment in Second Language Education
LLED 526 Second Language Assessment: Conceptual and Empirical Approaches
Melike Kesirli Erozlu
November 28, 2013
Abstract
This paper discusses the effect of social interaction on test takers'
performance and scores in oral exams. It aims to provide a background
for the importance of interaction in oral language testing, examine
the ways interaction shows itself in various testing contexts and
offer some alternative views on this issue. Drawing on the views of
McNamara (1997; 2001) and Lazaraton and Davis (2006; 2008), it focuses on
how test takers position themselves as proficient and competent
language users, the consequences of interaction between strong and
weak candidates and the 'multiple identities' each brings to the test
context. Finally, the effects of interlocutor and rater interaction
on the performance and scores of candidates are discussed.
Introduction
The view that language test performance is co-constructed and is
“essentially a social activity” (McNamara, 1997, p. 447) has shaped
the discussion on issues of validity, reliability and fairness in
oral assessment for quite some time. McNamara suggested that rather
than looking at test performance as a (solely) cognitive endeavour,
we should view speaking test scores as the product of a number of
factors such as the interlocutor, the rater, the task, and the rating
criteria. He also linked speaking test performance to
post-structuralist conceptions of identity through “performativity”
(McNamara, 2001). Indeed, McNamara (2001) argues that in language
testing we cannot just assume that constructs such as language
proficiency or language ability were present before the test
performance or that these will continue to exist in the target
language use context after the test is administered. What we need to
keep reminding ourselves here is
that “it is the task of the language tester to allow these to be
expressed, to be displayed in the test performance. But what if the
direction of the action is the reverse, so that the act of testing
itself constructs the notion of language proficiency?” (p. 339)
My own teaching and learning experience in English has shown me that
interaction during the test plays a very important role in the
scoring of candidates' performance in oral examinations. As I carried
out my duties as an oral examiner for Cambridge ESOL's KET/PET
exams between 2006 and 2012, I continuously questioned the
contribution of the many aspects of social interaction to the test
takers' oral performance scores. My observations and experiences as
an interlocutor and an assessor in these exams urged me to search for
answers to questions such as the following: What makes up social
interaction during an oral exam? If performance is a joint act of
examiners and examinees, or of peers in paired oral tests, are
interlocutor frames and detailed rubrics adequate to help assessors
evaluate a single performance objectively, or do other factors affect
the scores when it comes to the social aspect of oral assessment? How
does an unsuitable pairing of candidates affect their performance
during the exam? How do the dynamics between the interlocutor and the
assessor influence the final scores of test takers?
In this paper, I will attempt to respond to these questions
regarding the effects of social interaction in oral assessment.
Background information
In order to understand how social interaction affects test takers'
oral performance scores, we first need to look at what we mean by
interaction and at its features in a test environment.
In his work on language testing, McNamara (1997) defines
“interaction” as “a social/behavioural one, where joint behaviour
between individuals is the basis for the joint construction (and
interpretation) of performance” (p. 447).
Similarly, Joughin (1999) in his six dimensions of oral assessment
explains that it is essential to understand the ways examiners
interact with the test takers and what kinds of interaction are
needed for specific testing situations. According to Joughin (2010), one
of the distinctive features of oral assessment is that it allows for
interaction between the examiners and the candidates, and sometimes
others, with the interaction often being rapid and unpredictable. The
types of interaction may range from the examiner asking supporting
questions to help the examinee complete a turn, to an examinee making
claims to persuade peers that an argument is worth supporting. Other
examples include the non-interactive one-way presentation and the
debate or discussion between the test taker and the examiner or
between a pair or group of test takers.
Interaction gives oral assessment meaning, and it has positive
washback on classroom teaching because when candidates know that
their interaction will affect their performance score, they may be
motivated to prepare comprehensively for the test. However, we should
not forget that the very same interaction can make assessment
unpredictable, so it must be our priority to ensure fairness and
equality in the opportunities we provide for candidates to show their
knowledge and abilities. This requires careful planning of how the
interaction will take place between the stakeholders. When working
out the details of an oral exam, we need to consider the type of
interaction that will take place during the test, how the examiners
are expected to interact with the test takers and, if there is an
audience, what the role of that audience will be.
The oral interview
The oral interview is believed to be a valid way of testing
conversational communicative competence because the interaction that
takes place is dynamic in nature and unpredictable. However, Brown
(2003) argues that this unpredictability may compromise test
reliability. Lazaraton (as cited in Brown, 2003) explains that
consistent examiner conduct is necessary in order to ensure
consistent ratings during the procedure. After all, how can we
guarantee that all candidates will be given the same number and kinds
of opportunities to demonstrate their knowledge and abilities if
examiners do not administer the test in prescribed ways? In
particular, the ways in which some interviewers adjust their speech
to the candidates' proficiency level on the basis of their own
assumptions are worth examining in more detail (Brown, 2003).
Group Oral Tests
Research shows that group oral tests are likely to have high face
validity among students. For example, Fulcher (1996) studied the
students' reactions to oral tasks they were asked to complete. The
tasks included one-on-one interviews and a group discussion. Students
indicated that they felt least anxious about the group discussion
task, and that they thought the conversation they had during the
group discussion was more natural compared to one-on-one interviews.
However, McNamara (2001) questions how, and indeed whether, scores
based on joint interaction among examinees in group oral tests can be
regarded as evidence of individual performance ability. Swain (2001)
seems to support this view:
“In a group, the performance is jointly constructed and distributed across the
participants. The dialogues construct cognitive and strategic processes which
in turn construct student performance. One implication for testing is,
minimally, that serious thought needs to be given to the most adequate and fair
means of scoring the linguistic activity and its product that derives from
group interaction. It also means that in a testing situation, whom one is
paired or grouped with, is not unimportant.” (p. 296).
Such studies make a convincing case that a candidate's individual
performance is closely tied to the interaction maintained during an
oral test. This becomes even more important in group oral tests,
where more factors, such as gender, proficiency level, and
familiarity or unfamiliarity with the other test takers, affect the
overall production than in the oral interview, where the oral
examiner's participation is limited to an interlocutor framework.
The role of proficiency identities in language assessment
Test takers bring multiple identities to language assessment
contexts, and the level of familiarity or acquaintanceship may affect
the discourse that is produced and the scores assigned to each
candidate (McNamara, 1997).
Holster (2007) argues that “the ability to create and project a range
of identities is fundamental to becoming a proficient user” of the L2
(p. 1). Lazaraton and Davis (2008) also agree that there is an
interlocutor effect in paired testing, but they stress that this
effect should not be examined only from the perspective of fixed
identities assigned to the participants by the researchers. They
believe that the language proficiency identity displayed in oral
tests by the participants can change and evolve during interaction.
This probably means that it is very difficult to determine what
“true proficiency” is, as it keeps changing, being reconstructed and
modified in the talk, depending on other variables such as a
partner's acknowledgement or rejection of one's language proficiency
identity.
Similarly, Davis (2007) takes a post-structuralist view of identity
to examine how perceptions of language proficiency are established in
a paired speaking test. The study analyzed the language
produced by two pairs in a Chinese speaking test and looked at how
each candidate demonstrated their competence as well as how their
performance affected the proficiency level of their partners. The
findings suggest that the proficient speakers produced longer,
extended turns, used accurate language and often self-corrected their
speech. They listened to their partners and had a genuine interest in
what their partners had to say, and therefore positioned themselves
as “competent” and “proficient.” On the other hand, their weaker
partner was positioned as “weak” when these more proficient partners
helped him with words or responses, ignored some of his
contributions, or warned him for not fulfilling the task
requirements; the weaker speaker employed certain strategies to
resist this by not changing his point of view and disagreeing with
his partner.
In an earlier study, Lazaraton and Davis (2006) examined the discourse
features in the candidates' talk and found that identity
formulations such as “more proficient” and “less proficient” are
constructed in the test takers' discourse. This means that a test
taker comes to a paired oral assessment context with an already
formed identity as a proficient L2 user, and this identity is
constructed, maintained and re-developed during the test. As
Lazaraton and Davis (2008) explain, a good pairing can
work to the advantage of the test takers, as the discourse that they
jointly produce can benefit their performance when scores are
assigned. The question, then, is this: if two such candidates can
easily position themselves as “proficient” and “competent” through the
collaborative interactiveness of their oral production and are able
to receive high scores, what are the consequences in terms of
proficiency and/or interactivity when a highly proficient speaker is
matched randomly with a weaker partner?
McNamara (1997) stresses the need to re-problematize the extent to
which the interaction in role play resembles the interaction in
target language use, and whether it is actually possible to simulate
the conditions of the target interactions during an oral test. In
reality, we can only assume that the sample of the candidate's oral
performance during the exam is a simulation of the target language
she would use in real life, but whether it is a good or a bad sample
depends on many other factors, as I have discussed above.
Construct interlocutor effect
The role of examiners in the candidates' performance has long been
discussed by researchers interested in oral assessment. According to
Yaphe and Street's (2003) findings, examiners go through a three-stage
process when they assign a score to each test taker. First,
they make a strong judgement based on the “first impressions”
formed by the candidate’s initial response. Later, depending on the
test takers' responses to further questions, they form a
“provisional grade” in their minds. The remaining questions are used
to confirm the candidates' final grade. This study reveals that the
candidates’ personal qualities affect their grade. Yaphe and Street
(2003) note that confident, fluent test takers mostly score well,
while weaker ones receive low scores because their responses are not
comprehensive enough and lack coherence. They
point out that personal attributes like fluency and creativity should
not be included in the construct of an exam that aims to test
language proficiency. This becomes even more important given that
some examiners in the study were affected by
test takers' personal attributes and this influence had a direct
impact on the final scores they gave.
On the other hand, a study conducted by Weingarten et al. (2000)
revealed that experienced examiners, examiners with senior academic
rank and those who qualified in English-speaking countries were
likely to grade more harshly and fail more candidates. The study
concluded that examiners whose scores are markedly out of line should
not be allowed to mark oral exams.
Another study by Brown and Hill (as cited in Brown, 2003) analyzed
interviewer effects in relation to the speaking part of the IELTS
exam and found a tendency for test takers to receive higher scores
with certain interviewers than with others. Further analysis showed
significant differences between certain pairs of interviewers in
terms of their grading. (The IELTS interview format examined in this
study was replaced in July 2001 by a new version, which restricted
interviewer behaviour to the use of ‘interlocutor frames’.) Brown
(2003) also analyzed the discourse in two interviews in the IELTS
oral exam involving the same candidate with two different
interviewers and was able to show how closely the interviewer
participated in the construction of test takers' oral proficiency:
“The behaviour of the two interviewers were different in terms of
the ways in which they structured sequences of topical talk, their
questioning techniques, and the type of feedback they provided. An
analysis of verbal reports produced by some of the raters confirmed
that these differences resulted in different impressions of the
candidate’s ability: in one interview the candidate was considered
to be more ‘effective’ and ‘willing’ as a communicator than in the
other.” (p. 1)
On the basis of these findings, Brown (2003) questions the
validity of conversational oral interviews, since the attempt
to assess real-world conversational skills depends largely on the
unpredictable nature of the test interaction.
On the other hand, the findings of Gass and Varonis (1984) revealed the
effect of familiarity on the native speaker’s comprehension of
non-native speech:
“1. Familiarity with the topic of discourse greatly facilitates comprehension.
2. Familiarity with non-native speech in general facilitates comprehension.
3. Familiarity with a particular non-native accent facilitates comprehension of the speech of another non-native of that language background.
4. Familiarity with a particular non-native speaker facilitates comprehension of that person’s speech.” (p. 81)
These results are not surprising for many of us in the field, as we
often observe that oral examiners who speak the
language of the country they teach in are used to hearing a
particular accent and can easily comprehend what the test takers from
that country say, whereas examiners who are not familiar with the
candidates' accents may have difficulty understanding their
speech and may assign inconsistent scores.
This effect is also evident in the results from the bias
interaction analyses of Winke, Gass, and Myford (2012):
“Matches between the raters’ L2 and the test takers’ L1
resulted in some of the raters assigning ratings that were
significantly higher than expected. After an initial 4-hour
training period, a group of 107 raters (mostly of learners of
Chinese, Korean, and Spanish), listened to a selection of 432
speech samples that 72 test takers (native speakers of Chinese,
Korean, and Spanish) produced. As a whole, raters with Spanish
as an L2 were significantly more lenient toward test takers who
had Spanish as an L1, and raters with Chinese as an L2 were
significantly more lenient toward test takers who had Chinese
as an L1.” (p. )
Lumley and Brown (1996) also found that interlocutor
behaviour can determine the level of difficulty of the interaction
for the test takers. The candidate is disadvantaged if the
interlocutor uses sarcastic language, interrupts, or contributes very
little to the interaction. Conversely, if the interlocutor does
not go beyond factual questions, or if he/she simplifies his/her
language, this seems to help the candidates' performance scores.
Brown (2003) summarizes the interviewers' behaviour during oral exams
as varied in terms of
“· the level of rapport that they establish with candidates
· their functional and topical choices
· the ways in which they ask questions and construct prompts
· the extent to which or the ways in which they accommodate their speech to that of the candidate
· the ways in which they develop and extend topics” (p. 3)
Finally, Tarone (1998) explains that when learners produce
interlanguage, their production varies depending on the social context,
which can be reflected in their scores. Tarone therefore suggests
documenting the situational features that affect interlanguage,
together with the changes in production, in order to have more valid
and reliable assessments of language ability.
Rater characteristics
Another major concern with interaction in oral assessment
is the subjectivity of the raters' decisions about the candidates'
performance. Many studies reveal that the rater can have a direct
effect on the candidates' scores. For example, O'Loughlin (1997)
was able to show that the live version of an ESL speaking
test given to immigrants in Australia, which included an
interlocutor, could not be regarded as equivalent to the taped version
because of the undeniable presence of social interaction in the
former. His findings suggest that some raters reacted negatively to
certain interlocutors, tasks, or the test format, and that these
negative reactions influenced their scoring,
which worked in favour of the candidates, as they received the
score required to pursue their life goals.
Several other studies have also looked at the effects of rater bias
and ways to control them. For example, McNamara (1997) notes that
raters can vary in their scoring behaviour depending on the
particular group of examinees, the particular task, and the
particular occasion. As for reducing the effects of such bias,
Wigglesworth (1993) found that raters were responsive to feedback and
were able to incorporate it into their ratings. However, despite
rater training, it was found that each rater had a unique background
that could affect their judgements about the candidates.
Likewise, another study conducted in Bangladesh reveals that some
IELTS oral examiners favour certain topics over others and may
avoid some topics because of the test takers' poor language
proficiency. They report doing this when they are convinced that the
candidates do not have enough information on certain topics, or when
they think that a topic is not common in Bangladeshi culture or is
difficult and confusing for candidates coming from a Bangladeshi
context. A majority of the examiners who took part in this study
reported that they substitute an easy or familiar word for a
difficult one (Khan, 2006).
The results of these studies once again call into question whether
the detailed rubrics used to control rater bias and
the training given to raters are adequate measures for producing
reliable tests.
Bachman's interactional/ability approach
Considering the points discussed above and drawing on McNamara's
(1997; 2001) views, we can say that our dilemma lies in the
fact that we isolate the test taker from his/her joint
performance in the name of objective rating. We expect him/her to
be solely responsible for his/her performance, which we assume to be
separate from his/her partner's, and we rely on detailed rubrics,
interlocutor frames and rater training as the way to success.
However, McNamara (1997) claims that this perception is a
result of Bachman's (1990) “interactional/ability approach”, which
reduces performance to “an ability in cognitive terms”:
“a performance is not a simple projection of what
is in the head of the candidate, even if that
display is mediated by the candidate's strategies
for dealing with the interactional context in which
it is to be achieved.” (p. 453)
Surprisingly, too many testers still seem to think that the relation
between the candidate's performance and their score is transparent
and that the rater, based on his/her categorization, can give a
performance an objective score as long as he/she has evidence from
the test to back his/her decision.
McNamara (1997) problematizes the issue of interlocutor effect from a
Vygotskian perspective. Vygotsky's (1978) theory focuses on the
learning potential of individuals and stresses that when learners
are somewhere between their existing competence and their future
potential performance, they are in their zone of proximal development
(ZPD). When we apply this concept to language testing, we can better
understand how test takers can benefit from interacting with
more proficient and competent interlocutors.
Similarly, Halliday rejects the view that sees social interaction as
part of a cognitive process inside the individual's brain and instead
adopts a view that sees the organization of language as a
shared resource for meaning (as cited in McNamara, 1997). He expects
the candidates to demonstrate mastery of types of interaction, such
as those in types of genre or their components. However, McNamara
(1997) argues that this perception assumes that test validity can be
achieved by controlling content, and he agrees with Messick (1994), who
emphasizes that content validity is not sufficient to conclude
that a test is valid. McNamara goes on to say that the categorization
in systemic functional systems does not work in judging candidates'
performance in oral assessment because such judgments can only be
probabilistic rather than deterministic. I agree with McNamara that
trying to fit candidates' performance into systemic functional
systems will not work in an oral test, which is basically a natural
and unpredictable production, as it is not structured. Furthermore,
this kind of categorization cannot guarantee that the candidate would
perform in the same way if she were faced with a similar
situation in the real world.
Conclusion
In this paper I have focused on the aspects of social interaction in
oral assessment and how they may affect candidates' performance and
scores. The role of partners in paired/group oral tests and the
influence of examiners and raters on the oral production and test
scores were questioned based on findings from different studies. I
have also looked at the multiple identities that the L2 learners
bring to the oral assessment context and how those identities may
have an impact on test takers' interaction patterns during the exam.
Finally, drawing on McNamara's (1997) criticism of Bachman's (1990)
interactional/ability approach, the paper offered some alternative
views on interaction in oral tests.
Some important questions regarding interaction in oral exams that I
would like to elaborate on in the future are: 1) If oral performance
is so intimately tied to joint production in real-life situations,
as the studies suggest, is it adequate to depend on
detailed rubrics and to constrain interlocutor participation to a
framework? 2) Can we assume that these measures will automatically
provide the assessor with clear evidence of the candidate's
sampling of speech performance in the target language domain? 3)
What can language teachers do to help their students prepare for such
oral exams?
References
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing, 20(1), 1–25.
Davis, L. (2007). Proficiency and identity in tests of speaking: Are we measuring what we think we are measuring? Unpublished manuscript, University of Hawaii.
Fulcher, G. (1996). Testing tasks: Issues in task design and the group oral. Language Testing, 13, 23–51.
Gass, S. M., & Varonis, E. M. (1984). The effect of familiarity on the comprehensibility of nonnative speech. Language Learning, 34, 65–89.
Holster, T. (2007). Identity and proficiency: Meaningful approaches to learning and assessment. Retrieved from http://trevorholster.com/Papers/07-05_id_prof.pdf
Joughin, G. (1999). Dimensions of oral assessment and student approaches to learning. In S. Brown & A. Glasner (Eds.), Assessment matters (pp. 146–156). Buckingham: The Society for Research into Higher Education & Open University Press.
Joughin, G. (2010). A short guide to oral assessment. Retrieved from https://www.leedsmet.ac.uk/publications/files/100317_36668_ShortGuideOralAssess1_WEB.pdf
Khan, R. (2006). The IELTS Speaking Test: Analysing culture bias. Malaysian Journal of ELT Research, 2, 60–79.
Lazaraton, A., & Davis, L. (2006). Process and outcome in paired oral assessment: What is the ‘interlocutor effect’ on discourse and scores? Paper presented at the AAAL Conference, Montreal, Canada.
Lazaraton, A., & Davis, L. (2008). A microanalytic perspective on discourse, proficiency, and identity in paired oral assessment. Language Assessment Quarterly, 5(4), 313–335.
Lumley, T., & Brown, A. (1996). Specific-purpose language performance tests: Task and interaction. In G. Wigglesworth & C. Elder (Eds.), The language testing cycle: From inception to washback. Australian Review of Applied Linguistics, 13, 105–136.
McNamara, T. F. (1997). “Interaction” in second language performance assessment: Whose performance? Applied Linguistics, 18, 446–465.
McNamara, T. (2001). Language assessment as social practice: Challenges for research. Language Testing, 18, 333–349.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.
O'Loughlin, K. (1997). The comparability of direct and semi-direct speaking tests: A case study. PhD thesis, University of Melbourne.
Swain, M. (2001). Examining dialogue: Another approach to content specification and to validating inferences drawn from test scores. Language Testing, 18(3), 275–302.
Tarone, E. (1998). Research on interlanguage variation: Implications for language testing. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and language testing research (pp. 71–89). Cambridge, UK: Cambridge University Press.
Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes. Cambridge, MA: Harvard University Press.
Weingarten, M. A., Polliack, M. R., Tabenkin, H., & Kahan, E. (2000). Variations among examiners in family medicine residency board oral examinations. Medical Education, 34, 13–17.
Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in assessing oral interaction. Language Testing, 10(3), 305–323.
Winke, P., Gass, S., & Myford, C. (2012). Raters' L2 background as a potential source of bias in rating oral performance. Retrieved from http://ltj.sagepub.com/content/early/2012/08/16/0265532212456968
Yaphe, J., & Street, S. (2003). How do examiners decide? A qualitative study of the process of decision making in the oral examination component of the MRCGP examination. Medical Education, 37, 764–771.