Improving the validity of Chinese speaking tests
-
Upload
independent -
Category
Documents
-
view
2 -
download
0
Transcript of Improving the validity of Chinese speaking tests
Improving the validity of Chinese speaking tests
Andy Castro
Essay submitted in partial fulfilment of the requirements
of the degree of
Master of Arts in Teaching Chinese as a Foreign Language
at the University of Sheffield
Contents
1 Introduction 3
2 Chinese speaking tests: CST and HSKK 4
3 Key constructs 4
3.1 Speaking 5
3.2 L2 oral proficiency 6
4 Validity of oral proficiency tests 7
4.1 Validity of rating scales 8
4.1.1 HSKK 8
4.1.2 CST 8
4.1.3 Increasing scale validity 9
4.2 Validity of tasks and scoring 9
4.2.1 CST 10
4.2.2 HSKK 11
5 Conclusions 12
References 12
3 / 25
1 Introduction
Over the last few decades there has been much debate
concerning the validity and reliability of a range of
second language speaking tests (Bachman & Palmer, 1996;
Fulcher, 2003; Meyer, 2014). The test developer must
first show that the test is valid. In other words, at the
most basic level, “does [it] measure what it is supposed
to measure?” (Lado, 1961:321) Assuming that the test,
including its scoring process, is valid, it must then be
shown to be reliable. That is to say the same candidate,
performing under the same conditions, should get a very
similar result every time they take the test (Brown &
Abeywickrama, 2010: 27).
Accurate assessment of oral proficiency is not easy.
This is partly because, as suggested in the essay
question, the scoring of spoken production is highly
subjective. In addition, spoken performances under test
conditions are almost inevitably influenced by numerous
psychological and environmental factors. More
fundamentally, though, the basic concepts of “speaking”
and “oral proficiency” are themselves very unclear
constructs. Many test specifications seem to lack the
theoretical and empirical underpinnings that are required
for a solid validity argument to be made (North, 2000:
571; Luoma, 2004: 68).
In this essay, I examine ways of strengthening the
validity of the scoring process for speaking tests,
4 / 25
specifically for criterion-referenced Chinese oral
proficiency tests. First, I describe two oral proficiency
tests commonly used in the TCFL (Teaching Chinese as a
Foreign Language) context. Secondly, I look at the
constructs of “speaking” and “L2 oral proficiency”, both
generally and as defined by the two tests under
consideration. Finally, I examine the validity of these
tests by relating their rating scales, their tasks and
their scoring procedures to the constructs defined in the
previous section, following Luoma’s (2004: 177) pattern
for validity checking. In addition, I discuss possible
ways of improving their validity, touching also on issues
of reliability.
5 / 25
2 Chinese speaking tests: CST and HSKK
I shall critique two widely-used tests of spoken Chinese
proficiency in this essay. The first, the Chinese
Speaking Test (CST), is a precursor to the new Oral
Proficiency Interview Computer test (OPIc) offered by the
American Council on the Teaching of Foreign Languages
(ACTFL) (Clark, 1998; ACTFL, 2012). The second, known as
the Hanyu Shuiping Kouyu Kaoshi (Chinese Proficiency Oral
Examination), or HSKK for short, is widely used both
inside and outside of China and has been promoted
extensively abroad by the Chinese government in recent
years (Meyer, 2014: 22; Hanban, 2014a).
Both of these tests claim to provide measures of
overall speaking proficiency in Chinese which are
criterion-referenced and syllabus-independent. The CST is
designed to test oral proficiency levels according to the
ACTFL Proficiency Guidelines (ACTFL, 2012). It is a
recording-based test comprising 15 tasks, five picture-
based, five situation-based (i.e. a role-play without a
live interviewer) and five topic-based. Each task is
designed to elicit language at a particular level
according to the ACTFL criteria. An advanced-level
picture-based task is telling a story based on a set of
pictures. The result of the test is reported as a single
proficiency rating according to the ACTFL guidelines
(Clark, 1988).
6 / 25
The HSKK consists of three independent tests, one
for each of Elementary, Intermediate and Advanced levels.
Like the CST, it is recording-based. The Intermediate
HSKK comprises one sentence repetition task (ten
sentences in total), two picture-description tasks and
two question-answer tasks. The Advanced HSKK comprises
three listen-and-summarise tasks, one read-aloud task and
two question-answer tasks. The result for any one test is
reported as a percentage, with 60% required to achieve a
“pass” (Hanban, 2014b).
3 Key constructs
3.1 Speaking
The ‘skill’ of speaking is very rarely exercised in
isolation. It is usually an interactional activity. It
does not simply involve formulating and articulating
speech, but a whole host of other skills and strategies.
As Fulcher (2003: 46) puts it, “No operational construct
definition can ever capture the richness of what happens
in a process as complex as human communication.”
Nevertheless, the act of speaking can be
deconstructed and elements that are useful for testing
can be identified. One of the earlier models of speaking
was proposed by Byegate (1987). In his model, speaking is
a three step process of planning, selection and
7 / 25
production, each step involving interactions between the
speaker’s knowledge and skills (Byegate, 1987: 49-50).
Martínez-Flor et al. (2006: 146-150) propose a model
of speaking in the overall framework of communicative
competence that has been the driving force behind
communicative language teaching (Littlewood, 1981) in
recent decades. Under this model, represented visually in
Figure 1, linguistic competence (vocabulary, grammar,
phonology and prosody), pragmatic competence (register
and politeness), intercultural competence (sociocultural
awareness and non-verbal means of communication) and
strategic competence (compensatory strategies such as
repetition, circumlocution and paraphrasing) are all
mediated by discourse competence (cohesive devices and
the use of turn-taking mechanisms) to produce speech.
Figure 1. Speaking in a communicative competence framework (Martínez-
Flor et al., 2006: 147).
8 / 25
Linguisticcompeten
ce
Intercultural
competence
Strategiccompete
nce
Pragmaticcompete
nce
Discourse
competence
SPEAKING
Fulcher (2003: 31-48) proposes a similar framework.
He divides core language competence into three broad
areas: phonology (pronunciation, stress and prosody),
accuracy (syntax, vocabulary and cohesion) and fluency
(cohesion and structure). He proposes that speaking is
also moderated by strategic capacity (achievement
strategies and avoidance strategies), textual knowledge
(i.e. knowledge about the structure of dialogue),
pragmatic knowledge (appropriateness, implicature) and
sociolinguistic knowledge.
Tarone (2005) hints that the construct of speaking
may in fact be different according to the context of
speaking. She observes that speech acts primarily fall
into two categories: referential communication (for
exchanging information) and ludic discourse (for
entertainment and for establishing and maintaining
relationships) (Tarone, 2005: 486-490).
These models will all be useful reference points as
I consider the validity of the two Chinese speaking tests
under consideration.
3.2 L2 oral proficiency
Having examined the construct of speaking, I now
turn to the construct of L2 oral proficiency. Both tests
under consideration purport to provide a measure of
general Chinese L2 proficiency. But what does L2
proficiency look like?
9 / 25
Many speaking tests have defined the construct of L2
proficiency in relation to the proficiency of native
speakers. For example, the HSK adopts Beijing
pronunciation of standard Mandarin as the criterion for
evaluating L2 pronunciation (Jin and Mak, 2013: 28).
Similarly, the Foreign Services Institute (FSI) oral
proficiency levels and its daughters (such as ACTFL),
were originally benchmarked according to the linguistic
ability of a “well-educated native speaker” (WENS)
(Fulcher, 2003: 84). Many people have criticised the
concept of generic WENSs, since such a group is far from
homogenous in its ability (Bachman & Savignon, 1986: 383)
and in any case, as Lowe (1986: 394) suggests, many
native speakers themselves do not achieve WENS
proficiency in their L1.
In response to such criticism, many language tests
(for example IELTS and ACTFL) redefined their proficiency
scales in terms of ‘can do’ statements (Jin & Mak, 2013:
24). Under this model, L2 proficiency constitutes a
progression of ability to perform an increasing number of
tasks in the L2. Unfortunately there is little empirical
research to back up the premise on which this is based –
that all language learners learn to do the same tasks at
the same point along the learning process.
Scale descriptors of components of oral proficiency,
for example grammar, vocabulary and fluency, are often
intuitively derived. For example, Fulcher (2003: 98)
observed that descriptors of “fluency” in test
10 / 25
specifications often refer to “hesitation” and
“repetition” as indicators of lack of fluency. However,
discourse analysis shows that both hesitation and
repetition perform important functions in native speech.
Fulcher therefore analysed the discourse of L2 speech to
discover the distinguishing features of L2 fluency,
developing a detailed, empirically-based scale of L2
fluency descriptors (Fulcher, 2003: 250-2).
This type of research into “distinguishing features”
is currently gaining momentum, and Jin & Mak (2013) have
shown high levels of correlation between quantitively
scored distinguishing features in L2 Chinese speech and
other objective measures of Chinese L2 proficiency.
Second language acquisition research also provides
valuable insights into the stages of acquisition of
grammatical and discourse features which, I suspect,
could usefully inform a theoretically and empirically
based construct description for L2 oral proficiency (see
Wu and Ortega, 2013: 681).
4 Validity of oral proficiency tests
Validity is more than just a matter of whether a test
truly measures what it purports to measure. Messick
(1989: 13) defines validity as “an integrated evaluative
judgment of the degree to which empirical evidence and
theoretical rationales support the adequacy and
appropriateness of inferences and actions based on test
11 / 25
scores or other modes of assessment.” Thus, the
consequences of being assigned a particular score on a
particular test must be shown to be reasonable and
justifiable.
In this section, I first look at the validity of the
rating scales, the tasks and the scoring for the two
tests being considered. I will make reference to test
result consequences, including washback as an important
component of consequence validity. I will not address the
issues of face validity (Hughes, 2003: 33) or criterion-
related validity (Chapelle, 1999: 255) in detail. In
terms of the former, both the CST and the HSKK carry a
reasonable amount of face validity and are both generally
accepted by a wide variety of stakeholders. In terms of
the latter, there are unfortunately no Chinese L2
proficiency tests of sufficient reliability and validity
with which the CST and the HSKK can usefully be compared.
Indeed, both the CST and the HSKK are so well established
that they tend to be used as benchmarks for establishing
criterion-related validity of other tests (Li and Li,
2014: 105).
4.1 Validity of rating scales
4.1.1 HSKK
I turn first to the HSKK, since this is probably the
most widely recognised test of Chinese spoken proficiency
12 / 25
today. The HSKK, like its sister exam the HSK (a general
Chinese proficiency test comprising listening, reading,
grammar and writing), is scored from levels 1 (lowest) to
6 (highest). These levels are based on the International
Chinese Proficiency Standards (ICPS) (OIPC, 2007), developed
around the key concept of communicative competence (Xie,
2011: 11). Within this framework, there are separate
descriptors for “language proficiency” and “sample
tasks”, both of which are framed in terms of “can-do”
standards. In addition, there are descriptors of
pronunciation, vocabulary and grammar for each level (Jin
& Mak, 2013: 28). HSK descriptions even provide
guidelines concerning expected range of vocabulary and
has published vocabulary lists for each level (Confucius
Institute, 2010). Overall, the rating scale for oral
proficiency for the HSKK appears to be comprehensive,
although discourse competences and strategic competences
are largely overlooked.
Xie (2011: 11) claims that the ICPS levels are
directly equivalent to the Common European Framework of
Reference for languages (CEFR) Levels A1 to C2 (Council
of Europe, 2011), although this is somewhat unlikely
given that the CEFR levels were empirically derived
(North and Schneider, 1998) whereas the ICPS levels were
not (Xu & Cheng, 2011; Meyer, 2014).
4.1.2 CST
13 / 25
The CST uses the ACTFL (2012) rating scale which,
like the ICPS, is based on communicative competence. The
developers admit that the scale is “not based on any
particular theory” and that they “do not describe how an
individual learns a language” (ACTFL, 2012: 3). The lack
of a theoretical basis reduces its validity. Each band
descriptor includes non-context specific “can-do”
statements along with descriptions of typical discourse
type (e.g. narrative or descriptive) and complexity,
typical content (e.g. can talk about abstract topics),
pronunciation, and intelligibility to “native speakers”.
The ACTFL scale has come under attack for its lack
of theoretical or empirical basis (Bachman, 1988;
Fulcher, 2003). Crucially, there appears to be no basis
for claiming that there is a relationship between second
language learners’ ability to deal with certain discourse
types, their accuracy of expression and the
intelligibility of their speech to native speakers
(Lantolf & Frawley, 1985). On the contrary, Munro,
Derwing & Morton (2006) have demonstrated that heavy
accentedness in L2 English does not necessarily result in
reduced intelligibility to native speakers (Kennedy &
Trofimovich, 2008, showed slightly different findings).
Neal’s (forthcoming) tentative findings similarly
indicate that poor tone production in L2 Chinese speech
does not affect native speaker intelligibility as much as
might be expected.
14 / 25
4.1.3 Increasing scale validity
In some ways, ICPS suffers from the same flaw as the
ACTFL in that it assumes a direct relationship between
what kind of tasks learners can handle and the type of
language they can produce.
To improve the theoretical basis, and therefore the
validity, of the scales used for both the CST and the
HSKK, I propose an empirically-driven reformulation of
the rating scales. This could either be done via more in-
depth discourse and grammatical analysis of Chinese L2
speech at different levels of proficiency, or via the
development of a set of distinguishing features based on
a comparison of descriptive assessments of L2 speakers
carried out by a large number of Chinese teachers and
examiners, along the lines of North and Schneider’s
(1998) approach for developing the CEFR.
4.2 Validity of tasks and scoring
Neither the CST nor the HSKK require an interviewer
to administer them. Their recorded responses are sent to
trained raters for grading. In this sense, both tests
suffer from invalid claims concerning what their scores
represent, viz. L2 communicative competence. In reality,
most communicative tasks are interactional in nature, and
interactional competence is not tested in either test.
Moreover, both tests almost exclusively elicit
15 / 25
referential communication, largely ignoring the ludic
discourse function of speaking (Tarone, 2005).
4.2.1 CST
I now look specifically at the tasks themselves,
already described in Section 2, and the way that they are
scored. Measured against the construct of oral
proficiency set out in the ACTFL Guidelines, the CST
tasks themselves seem to have a high level of validity.
They are focused enough to elicit the type of language
described in each proficiency band and they cover a wide
range of discourse types, situations and topics,
minimising the effects of topic familiarity.
However, there is a problem with the way the tasks
are scored. The ACTFL Guidelines constitute a holistic
scale, each band encompassing multiple areas of language
competency. It is difficult for raters to consistently
assign a level to each task because some aspects of the
candidate’s performance will likely indicate a different
proficiency level than other aspects (Bachman & Palmer,
1996: 209-210, are particularly critical of holistic
scales for this very reason).
The CST rater guidelines (see Kenyon, 1997) attempt
to solve this problem by giving precedence to content and
successful task completion over accuracy and fluency.
This is because the final rating is supposed to reflect
what a candidate can do in the L2. Despite this, each
16 / 25
rater’s judgment concerning “successful task completion”
may vary wildly, making the test unreliable. By way of
illustration, Clark (1988: 205) cites one instance in
which a qualified rater scored five particular candidates
as Intermediate-Mid, whereas another qualified rater
scored the same five candidates at four different levels
– two Intermediate-Low, one Intermediate-Mid, one
Intermediate-High and one Advanced Plus.
One way to improve the validity and reliability of
the CST, then, would be to use an analytic scale instead
of a holistic scale. Several sub-scales could be devised
according to a valid construct of L2 linguistic
competence (based on Martínez-Flor et al., 2006, for
example) and the raters could assign scores on each scale
according to a simpler set of criteria than those
provided by the current ACTFL guidelines. Rather than
providing a single overall rating, candidates could be
given more than one rating relating to key areas of
communicative competence (in a similar way to the CEFR,
which assigns ratings for both “spoken interaction” and
“spoken production”, CoE, 2001). This would better inform
institutions who make decisions based on CST results. It
would also provide useful feedback to the candidate,
showing them the specific areas of L2 speaking that they
need to work on (a washback effect).
4.2.2 HSKK
17 / 25
The HSKK tasks are very different to the CST. It
makes considerable use of “listen and repeat” tasks,
which are far more indirect measures of oral proficiency.
These tasks are very similar to those elicited by the
Versant® computerised English speaking test (Brown &
Abeywickrama, 2010: 188). They have the distinct
advantage over freer speaking tasks in that they can be
rated far more objectively. They have the disadvantage of
being unauthentic and not having much face validity.
Concerning their validity, Van Moere (2012: 328) argues
that listen and repeat tasks accurately measure “the
efficiency with which learners process language”, a
psycholinguistic measure that has shown significant
correlations with general communicative proficiency
(Bernstein, DeJong, Pisoni & Townshend, 2000).
The remaining tasks in HSKK are extremely
problematic. The Intermediate HSKK tasks require the
candidate to talk for two minutes about a photograph. The
question-answer tasks for Intermediate and Advanced HSKK
are extremely simple (for example, “You now have one
month’s holiday. What are you going to do?”). Neither
task gives further guidance concerning content. The
rating criteria for both tasks include: 1) volume and
breadth of relevant content; 2) fluency (little
hesitation or repetition); and 3) grammatical accuracy
(Hanban, 2014c). They could be argued to be “valid” in
that they attempt to elicit responses which can be rated
on the ICPS rating scale. However, it is doubtful that
18 / 25
four simple questions could elicit the range of target
content and linguistic structures that are stipulated in
the ICPS construct of oral proficiency. Furthermore, as
Zhang, Li, Li, Jie & Huang (2012: 51-2) point out, these
tasks probably test “logic, … textual organisation
skills, … imaginative skills, creativity [and] specific
topic knowledge” far more than they test actual L2
ability.
To improve the validity of the HSKK, I firstly
propose adding more tasks to cover a wider range of
interactional situations and discourse types, thus
encompassing the entire oral proficiency construct
described in ICPS. Secondly, I propose providing
candidates with more guidance on what specifically to say
and how to structure their response, thus reducing the
influence of non-language-related factors in their
performance and increasing test-retest reliability
(Hughes, 2003: 39).
5 Conclusions
In this essay, I have demonstrated the complexities
involved in validating oral proficiency tests. I have
suggested that the construct of “Chinese oral
proficiency” should be redefined in the light of real
data showing how L2 learners progress as they acquire
spoken Chinese. Rating criteria and test design can then
be developed from this redefined construct. Scoring of
19 / 25
Chinese speaking tests could be improved through the
introduction of a simple, concise, construct-linked
analytic scale. This would enhance consequence and
washback validity. Although I did not have space to
consider issues of reliability in depth, there is surely
much that can be done to improve in this area as well.
Even then, caution must still be exercised in
interpreting the results of such tests, and decisions
about people’s future should not be based upon oral
proficiency test results alone.
References
American Council on the Teaching of Foreign Languages
(ACTFL) (2012), ACTFL Proficiency Guidelines, Alexandria, VA:
ACTFL.
Bachman, Lyle F. (1988), “Problems in Examining the
Validity of the ACTFL Oral Proficiency Interview”,
Studies in Second Language Acquisition, 10(2):149-164.
Bachman, Lyle F. and Savignon, Sandra J. (1986), “The
Evaluation of Communicative Language Proficiency: A
Critique of the ACTFL Oral Interview”, The Modern
Language Teaching Journal, 70(4): 380-390.
Bachman, Lyle F. and Palmer, Adrian S. (1996), Language
Testing in Practice, Oxford: Oxford University Press.
Bernstein, Jared, DeJong, John, Pisoni, David and
Townshend, Brent (2000), “Two experiments on automatic
20 / 25
scoring of spoken language proficiency”, Integrating
Speech Technology in Learning, pp. 57-61.
Brown, H. Douglas and Abeywickrama, Priyanvada (2010),
Language Assessment: Principles and Classroom Practice, 2nd
Edition, White Plains, NY: Pearson.
Byegate, Martin (1987), Speaking, Oxford: Oxford
University Press.
Chapelle, Carol A. (1999), “Validity in language
assessment”, Annual Review of Applied Linguistics, 19: 254-272.
Clark, John L. D. (1988), “Validation of a tape-mediated
ACTFL/ILR-scale based test of Chinese speaking
proficiency”, Language Testing, 5(2): 187-205.
Confucius Institute (ed.) (2010), Xin Hanyu Shuiping Kaoshi
Dagang HSK Liu Ji [New Chinese Proficiency Examination Curriculum, HSK
Level 6], Beijing: Commercial Press.
Council of Europe (2001), Common European Framework of
Reference for Languages: Learning, Teaching and Assessment,
Cambridge: Cambridge University Press. Available
online:
http://www.coe.int/t/dg4/linguistic/source/framework_en
.pdf (Accessed 2014-05-12).
Fulcher, Glenn (2003), Testing Second Language Speaking,
Harlow: Pearson Education.
Hanban (National Office for Teaching Chinese as a Foreign
Language) (2014a), HSK [Chinese Proficiency Examination],
Beijing: Hanban, On-line at:
http://www.hanban.edu.cn/tests/node_7486.htm (Accessed
2014-05-17).
21 / 25
Hanban (2014b), HSKK [Chinese Proficiency Oral Examination],
Beijing: Hanban, On-line at:
http://www.chinesetest.cn/godownload.do (Accessed 2014-
05-17).
Hanban (2014c), HSKK Pingfen Shuoming [Chinese Proficiency Oral
Examination Scoring Explanation], Beijing: Hanban, On-line at:
http://www.hanban.edu.cn/tests/node_38289.htm (Accessed
2014-05-17).
Hughes, Arthur (2003), Testing for Language Teachers: Second
Edition, Cambridge: Cambridge University Press.
Jin, Tan and Mak, Barley (2013), “Distinguishing features
in scoring L2 Chinese speaking proficiency: How do they
work?”, Language Testing, 30(1): 23-47.
Kennedy, Sara and Trofimovich, Pavel (2008),
“Intelligibility, Comprehensibility, and Accentedness
of L2 Speech: The Role of Listener Experience and
Semantic Context”, The Canadian Modern Language Review,
64(3): 459-489.
Kenyon, Dorry M. (1997), “Further research on the
efficacy of rater self-training”, in Ari Huhta, Viljo
Kohonen, Liisa Kurki-Suonio & Sari Luoma (Eds.), Current
Developments and Alternatives in Language Assessment: Proceedings of
LTRC 96, Jyväskylä, Finland: University of Jyväskylä,
pp. 257-273.
Lado, Robert (1961), Language Testing: The Construction and Use of
Foreign Language Tests, London: Longman.
22 / 25
Lantolf, James P. and Frawley, William (1988),
“Proficiency: Understanding the Concept”, Studies in Second
Language Acquisition, 10: 181-195.
Li, Xiaoqi and Li, Jinghua (2014), “Hanyu Kouyu Kaoshi
(SCT) de Xiaodu Fenxi [Validity Analysis of Spoken
Chinese Test]”, Shijie Hanyu Jiaoxue [Global Chinese Teaching],
28(1): 103-112.
Littlewood, William (1981), Communicative Language Teaching: An
Introduction, Cambridge: Cambridge University Press.
Lowe, Pardee, Jr. (1986), “Proficiency: Panacea,
Framework, Process? A Reply to Kramsch, Schulz, and,
Particularly, to Bachman and Savignon”, The Modern
Language Journal, 70(4): 391-7.
Luoma, Sari (2004), Assessing Speaking, Cambridge: Cambridge
University Press.
Martínez-Flor, Alicia, Usó-Juan, Esther and Soler, Eva
Alcón (2006), “Towards Acquiring Communicative
Competence through Speaking”, in Esther Usó-Juan and
Alicia Martínez-Flor (eds.), Current Trends in the Development
and Teaching of the Four Language Skills, Berlin: Mouton de
Gruyter, pp. 139-157.
Messick, Samuel J. (1989), “Validity”, in Robert L. Linn
(ed.), Educational Measurement, New York: MacMillan,
pp.13-103.
Meyer, Florian Kağan (2014), Language Proficiency Testing for
Chinese as a Foreign Language: An Argument Based Approach for
Validating the Hanyu Shuiping Kaoshi (HSK), Frankfurt am Mein:
Peter Lang.
23 / 25
Munro, Murray J., Derwing, Tracey M. and Morton, Susan L.
(2006), “The mutual intelligibility of L2 speech”,
Studies in Second Language Acquisition, 28: 111–131.
Neal, Robert J. (forthcoming), PhD thesis, University of
Cambridge.
North, Brain and Schneider, Günther (1998), “Scaling
descriptors for language proficiency scales”, Language
Testing, 15(2): 217-262.
North, Brian (2000), “Linking language assessments: an
example in a low stakes context”, System, 28: 555-577.
Office of the National Working Team for the International
Promotion of Chinese (OIPC) (2007), Guoji Hanyu Nengli
Biaozhun [International Chinese Proficiency Standards], Beijing:
Foreign Language Teaching and Research Press.
Tarone, Elaine (2005), “Speaking in a Second Language”,
in Eli Hinkel (ed.), Handbook of Research in Second Language
Teaching and Learning, Mahwah, NJ and London: Lawrence
Erlbaum Associates, pp. 485-502.
Van Moere, Alistair (2012), “A psycholinguistic approach
to oral language assessment”, Language Testing, 29(3):
325-344.
Wu, Shu-Ling and Ortega, Lourdes (2013), “Measuring
Global Oral Proficiency in SLA Research: A New Elicited
Imitation Test of L2 Chinese”, Foreign Language Annals,
46(4): 680-704.
Xie, Xiaoqing (2011), “Wei shenme yao kaifa xin HSK
kaoshi? [Why Hanban developed the new Chinese
24 / 25
proficiency test?]”, Zhongguo Kaoshi [China Examinations], 3:
10-13.
Xu, Haibing and Cheng, Yan (2011), “ ‘Guoji Hanyu Nengli
Biaozhun’ zhong Zhide Shangque de Difang – Jian yu
‘Ouzhou Yuyan Gongtong Cankao Kuangjia: Xuexi, Jiaoxue,
Pinggu’ Bijiao [Debateable Issues on International Standard
of Chinese Ability: In Comparison with Common European
Framework of Reference for Languages: Learning, Teaching,
Assessment], Guangdong Haiyang Daxue Xuebao [Journal of
Guangdong Ocean University], 31(5): 91-94.
Zhang Jinjun, Li Peize, Li Yanan, Jie Nini and Huang Lei
(2012), “Dui Xin Hanyu Shuiping Kaoshi de Xin Sikao
[New Thinking on the New HSK]”, Kaoshi Luntan [Examination
Exploration], 2: 50-53.
25 / 25