Improving the validity of Chinese speaking tests

Improving the validity of Chinese speaking tests

Andy Castro

[email protected]

Essay submitted in partial fulfilment of the requirements

of the degree of

Master of Arts in Teaching Chinese as a Foreign Language

at the University of Sheffield

mailto:[email protected]

2 / 25

Contents

1 Introduction 3

2 Chinese speaking tests: CST and HSKK 4

3 Key constructs 4

3.1 Speaking 5

3.2 L2 oral proficiency 6

4 Validity of oral proficiency tests 7

4.1 Validity of rating scales 8

4.1.1 HSKK 8

4.1.2 CST 8

4.1.3 Increasing scale validity 9

4.2 Validity of tasks and scoring 9

4.2.1 CST 10

4.2.2 HSKK 11

5 Conclusions 12

References 12

3 / 25

1 Introduction

Over the last few decades there has been much debate

concerning the validity and reliability of a range of

second language speaking tests (Bachman & Palmer, 1996;

Fulcher, 2003; Meyer, 2014). The test developer must

first show that the test is valid. In other words, at the

most basic level, “does [it] measure what it is supposed

to measure?” (Lado, 1961:321) Assuming that the test,

including its scoring process, is valid, it must then be

shown to be reliable. That is to say the same candidate,

performing under the same conditions, should get a very

similar result every time they take the test (Brown &

Abeywickrama, 2010: 27).

Accurate assessment of oral proficiency is not easy.

This is partly because, as suggested in the essay

question, the scoring of spoken production is highly

subjective. In addition, spoken performances under test

conditions are almost inevitably influenced by numerous

psychological and environmental factors. More

fundamentally, though, the basic concepts of “speaking”

and “oral proficiency” are themselves very unclear

constructs. Many test specifications seem to lack the

theoretical and empirical underpinnings that are required

for a solid validity argument to be made (North, 2000:

571; Luoma, 2004: 68).

In this essay, I examine ways of strengthening the

validity of the scoring process for speaking tests,

4 / 25

specifically for criterion-referenced Chinese oral

proficiency tests. First, I describe two oral proficiency

tests commonly used in the TCFL (Teaching Chinese as a

Foreign Language) context. Secondly, I look at the

constructs of “speaking” and “L2 oral proficiency”, both

generally and as defined by the two tests under

consideration. Finally, I examine the validity of these

tests by relating their rating scales, their tasks and

their scoring procedures to the constructs defined in the

previous section, following Luoma’s (2004: 177) pattern

for validity checking. In addition, I discuss possible

ways of improving their validity, touching also on issues

of reliability.

5 / 25

2 Chinese speaking tests: CST and HSKK

I shall critique two widely-used tests of spoken Chinese

proficiency in this essay. The first, the Chinese

Speaking Test (CST), is a precursor to the new Oral

Proficiency Interview Computer test (OPIc) offered by the

American Council on the Teaching of Foreign Languages

(ACTFL) (Clark, 1998; ACTFL, 2012). The second, known as

the Hanyu Shuiping Kouyu Kaoshi (Chinese Proficiency Oral

Examination), or HSKK for short, is widely used both

inside and outside of China and has been promoted

extensively abroad by the Chinese government in recent

years (Meyer, 2014: 22; Hanban, 2014a).

Both of these tests claim to provide measures of

overall speaking proficiency in Chinese which are

criterion-referenced and syllabus-independent. The CST is

designed to test oral proficiency levels according to the

ACTFL Proficiency Guidelines (ACTFL, 2012). It is a

recording-based test comprising 15 tasks, five picture-

based, five situation-based (i.e. a role-play without a

live interviewer) and five topic-based. Each task is

designed to elicit language at a particular level

according to the ACTFL criteria. An advanced-level

picture-based task is telling a story based on a set of

pictures. The result of the test is reported as a single

proficiency rating according to the ACTFL guidelines

(Clark, 1988).

6 / 25

The HSKK consists of three independent tests, one

for each of Elementary, Intermediate and Advanced levels.

Like the CST, it is recording-based. The Intermediate

HSKK comprises one sentence repetition task (ten

sentences in total), two picture-description tasks and

two question-answer tasks. The Advanced HSKK comprises

three listen-and-summarise tasks, one read-aloud task and

two question-answer tasks. The result for any one test is

reported as a percentage, with 60% required to achieve a

“pass” (Hanban, 2014b).

3 Key constructs

3.1 Speaking

The ‘skill’ of speaking is very rarely exercised in

isolation. It is usually an interactional activity. It

does not simply involve formulating and articulating

speech, but a whole host of other skills and strategies.

As Fulcher (2003: 46) puts it, “No operational construct

definition can ever capture the richness of what happens

in a process as complex as human communication.”

Nevertheless, the act of speaking can be

deconstructed and elements that are useful for testing

can be identified. One of the earlier models of speaking

was proposed by Byegate (1987). In his model, speaking is

a three step process of planning, selection and

7 / 25

production, each step involving interactions between the

speaker’s knowledge and skills (Byegate, 1987: 49-50).

Martínez-Flor et al. (2006: 146-150) propose a model

of speaking in the overall framework of communicative

competence that has been the driving force behind

communicative language teaching (Littlewood, 1981) in

recent decades. Under this model, represented visually in

Figure 1, linguistic competence (vocabulary, grammar,

phonology and prosody), pragmatic competence (register

and politeness), intercultural competence (sociocultural

awareness and non-verbal means of communication) and

strategic competence (compensatory strategies such as

repetition, circumlocution and paraphrasing) are all

mediated by discourse competence (cohesive devices and

the use of turn-taking mechanisms) to produce speech.

Figure 1. Speaking in a communicative competence framework (Martínez-

Flor et al., 2006: 147).

8 / 25

Linguisticcompeten

ce

Intercultural

competence

Strategiccompete

nce

Pragmaticcompete

nce

Discourse

competence

SPEAKING

Fulcher (2003: 31-48) proposes a similar framework.

He divides core language competence into three broad

areas: phonology (pronunciation, stress and prosody),

accuracy (syntax, vocabulary and cohesion) and fluency

(cohesion and structure). He proposes that speaking is

also moderated by strategic capacity (achievement

strategies and avoidance strategies), textual knowledge

(i.e. knowledge about the structure of dialogue),

pragmatic knowledge (appropriateness, implicature) and

sociolinguistic knowledge.

Tarone (2005) hints that the construct of speaking

may in fact be different according to the context of

speaking. She observes that speech acts primarily fall

into two categories: referential communication (for

exchanging information) and ludic discourse (for

entertainment and for establishing and maintaining

relationships) (Tarone, 2005: 486-490).

These models will all be useful reference points as

I consider the validity of the two Chinese speaking tests

under consideration.

3.2 L2 oral proficiency

Having examined the construct of speaking, I now

turn to the construct of L2 oral proficiency. Both tests

under consideration purport to provide a measure of

general Chinese L2 proficiency. But what does L2

proficiency look like?

9 / 25

Many speaking tests have defined the construct of L2

proficiency in relation to the proficiency of native

speakers. For example, the HSK adopts Beijing

pronunciation of standard Mandarin as the criterion for

evaluating L2 pronunciation (Jin and Mak, 2013: 28).

Similarly, the Foreign Services Institute (FSI) oral

proficiency levels and its daughters (such as ACTFL),

were originally benchmarked according to the linguistic

ability of a “well-educated native speaker” (WENS)

(Fulcher, 2003: 84). Many people have criticised the

concept of generic WENSs, since such a group is far from

homogenous in its ability (Bachman & Savignon, 1986: 383)

and in any case, as Lowe (1986: 394) suggests, many

native speakers themselves do not achieve WENS

proficiency in their L1.

In response to such criticism, many language tests

(for example IELTS and ACTFL) redefined their proficiency

scales in terms of ‘can do’ statements (Jin & Mak, 2013:

24). Under this model, L2 proficiency constitutes a

progression of ability to perform an increasing number of

tasks in the L2. Unfortunately there is little empirical

research to back up the premise on which this is based –

that all language learners learn to do the same tasks at

the same point along the learning process.

Scale descriptors of components of oral proficiency,

for example grammar, vocabulary and fluency, are often

intuitively derived. For example, Fulcher (2003: 98)

observed that descriptors of “fluency” in test

10 / 25

specifications often refer to “hesitation” and

“repetition” as indicators of lack of fluency. However,

discourse analysis shows that both hesitation and

repetition perform important functions in native speech.

Fulcher therefore analysed the discourse of L2 speech to

discover the distinguishing features of L2 fluency,

developing a detailed, empirically-based scale of L2

fluency descriptors (Fulcher, 2003: 250-2).

This type of research into “distinguishing features”

is currently gaining momentum, and Jin & Mak (2013) have

shown high levels of correlation between quantitively

scored distinguishing features in L2 Chinese speech and

other objective measures of Chinese L2 proficiency.

Second language acquisition research also provides

valuable insights into the stages of acquisition of

grammatical and discourse features which, I suspect,

could usefully inform a theoretically and empirically

based construct description for L2 oral proficiency (see

Wu and Ortega, 2013: 681).

4 Validity of oral proficiency tests

Validity is more than just a matter of whether a test

truly measures what it purports to measure. Messick

(1989: 13) defines validity as “an integrated evaluative

judgment of the degree to which empirical evidence and

theoretical rationales support the adequacy and

appropriateness of inferences and actions based on test

11 / 25

scores or other modes of assessment.” Thus, the

consequences of being assigned a particular score on a

particular test must be shown to be reasonable and

justifiable.

In this section, I first look at the validity of the

rating scales, the tasks and the scoring for the two

tests being considered. I will make reference to test

result consequences, including washback as an important

component of consequence validity. I will not address the

issues of face validity (Hughes, 2003: 33) or criterion-

related validity (Chapelle, 1999: 255) in detail. In

terms of the former, both the CST and the HSKK carry a

reasonable amount of face validity and are both generally

accepted by a wide variety of stakeholders. In terms of

the latter, there are unfortunately no Chinese L2

proficiency tests of sufficient reliability and validity

with which the CST and the HSKK can usefully be compared.

Indeed, both the CST and the HSKK are so well established

that they tend to be used as benchmarks for establishing

criterion-related validity of other tests (Li and Li,

2014: 105).

4.1 Validity of rating scales

4.1.1 HSKK

I turn first to the HSKK, since this is probably the

most widely recognised test of Chinese spoken proficiency

12 / 25

today. The HSKK, like its sister exam the HSK (a general

Chinese proficiency test comprising listening, reading,

grammar and writing), is scored from levels 1 (lowest) to

6 (highest). These levels are based on the International

Chinese Proficiency Standards (ICPS) (OIPC, 2007), developed

around the key concept of communicative competence (Xie,

2011: 11). Within this framework, there are separate

descriptors for “language proficiency” and “sample

tasks”, both of which are framed in terms of “can-do”

standards. In addition, there are descriptors of

pronunciation, vocabulary and grammar for each level (Jin

& Mak, 2013: 28). HSK descriptions even provide

guidelines concerning expected range of vocabulary and

has published vocabulary lists for each level (Confucius

Institute, 2010). Overall, the rating scale for oral

proficiency for the HSKK appears to be comprehensive,

although discourse competences and strategic competences

are largely overlooked.

Xie (2011: 11) claims that the ICPS levels are

directly equivalent to the Common European Framework of

Reference for languages (CEFR) Levels A1 to C2 (Council

of Europe, 2011), although this is somewhat unlikely

given that the CEFR levels were empirically derived

(North and Schneider, 1998) whereas the ICPS levels were

not (Xu & Cheng, 2011; Meyer, 2014).

4.1.2 CST

13 / 25

The CST uses the ACTFL (2012) rating scale which,

like the ICPS, is based on communicative competence. The

developers admit that the scale is “not based on any

particular theory” and that they “do not describe how an

individual learns a language” (ACTFL, 2012: 3). The lack

of a theoretical basis reduces its validity. Each band

descriptor includes non-context specific “can-do”

statements along with descriptions of typical discourse

type (e.g. narrative or descriptive) and complexity,

typical content (e.g. can talk about abstract topics),

pronunciation, and intelligibility to “native speakers”.

The ACTFL scale has come under attack for its lack

of theoretical or empirical basis (Bachman, 1988;

Fulcher, 2003). Crucially, there appears to be no basis

for claiming that there is a relationship between second

language learners’ ability to deal with certain discourse

types, their accuracy of expression and the

intelligibility of their speech to native speakers

(Lantolf & Frawley, 1985). On the contrary, Munro,

Derwing & Morton (2006) have demonstrated that heavy

accentedness in L2 English does not necessarily result in

reduced intelligibility to native speakers (Kennedy &

Trofimovich, 2008, showed slightly different findings).

Neal’s (forthcoming) tentative findings similarly

indicate that poor tone production in L2 Chinese speech

does not affect native speaker intelligibility as much as

might be expected.

14 / 25

4.1.3 Increasing scale validity

In some ways, ICPS suffers from the same flaw as the

ACTFL in that it assumes a direct relationship between

what kind of tasks learners can handle and the type of

language they can produce.

To improve the theoretical basis, and therefore the

validity, of the scales used for both the CST and the

HSKK, I propose an empirically-driven reformulation of

the rating scales. This could either be done via more in-

depth discourse and grammatical analysis of Chinese L2

speech at different levels of proficiency, or via the

development of a set of distinguishing features based on

a comparison of descriptive assessments of L2 speakers

carried out by a large number of Chinese teachers and

examiners, along the lines of North and Schneider’s

(1998) approach for developing the CEFR.

4.2 Validity of tasks and scoring

Neither the CST nor the HSKK require an interviewer

to administer them. Their recorded responses are sent to

trained raters for grading. In this sense, both tests

suffer from invalid claims concerning what their scores

represent, viz. L2 communicative competence. In reality,

most communicative tasks are interactional in nature, and

interactional competence is not tested in either test.

Moreover, both tests almost exclusively elicit

15 / 25

referential communication, largely ignoring the ludic

discourse function of speaking (Tarone, 2005).

4.2.1 CST

I now look specifically at the tasks themselves,

already described in Section 2, and the way that they are

scored. Measured against the construct of oral

proficiency set out in the ACTFL Guidelines, the CST

tasks themselves seem to have a high level of validity.

They are focused enough to elicit the type of language

described in each proficiency band and they cover a wide

range of discourse types, situations and topics,

minimising the effects of topic familiarity.

However, there is a problem with the way the tasks

are scored. The ACTFL Guidelines constitute a holistic

scale, each band encompassing multiple areas of language

competency. It is difficult for raters to consistently

assign a level to each task because some aspects of the

candidate’s performance will likely indicate a different

proficiency level than other aspects (Bachman & Palmer,

1996: 209-210, are particularly critical of holistic

scales for this very reason).

The CST rater guidelines (see Kenyon, 1997) attempt

to solve this problem by giving precedence to content and

successful task completion over accuracy and fluency.

This is because the final rating is supposed to reflect

what a candidate can do in the L2. Despite this, each

16 / 25

rater’s judgment concerning “successful task completion”

may vary wildly, making the test unreliable. By way of

illustration, Clark (1988: 205) cites one instance in

which a qualified rater scored five particular candidates

as Intermediate-Mid, whereas another qualified rater

scored the same five candidates at four different levels

– two Intermediate-Low, one Intermediate-Mid, one

Intermediate-High and one Advanced Plus.

One way to improve the validity and reliability of

the CST, then, would be to use an analytic scale instead

of a holistic scale. Several sub-scales could be devised

according to a valid construct of L2 linguistic

competence (based on Martínez-Flor et al., 2006, for

example) and the raters could assign scores on each scale

according to a simpler set of criteria than those

provided by the current ACTFL guidelines. Rather than

providing a single overall rating, candidates could be

given more than one rating relating to key areas of

communicative competence (in a similar way to the CEFR,

which assigns ratings for both “spoken interaction” and

“spoken production”, CoE, 2001). This would better inform

institutions who make decisions based on CST results. It

would also provide useful feedback to the candidate,

showing them the specific areas of L2 speaking that they

need to work on (a washback effect).

4.2.2 HSKK

17 / 25

The HSKK tasks are very different to the CST. It

makes considerable use of “listen and repeat” tasks,

which are far more indirect measures of oral proficiency.

These tasks are very similar to those elicited by the

Versant® computerised English speaking test (Brown &

Abeywickrama, 2010: 188). They have the distinct

advantage over freer speaking tasks in that they can be

rated far more objectively. They have the disadvantage of

being unauthentic and not having much face validity.

Concerning their validity, Van Moere (2012: 328) argues

that listen and repeat tasks accurately measure “the

efficiency with which learners process language”, a

psycholinguistic measure that has shown significant

correlations with general communicative proficiency

(Bernstein, DeJong, Pisoni & Townshend, 2000).

The remaining tasks in HSKK are extremely

problematic. The Intermediate HSKK tasks require the

candidate to talk for two minutes about a photograph. The

question-answer tasks for Intermediate and Advanced HSKK

are extremely simple (for example, “You now have one

month’s holiday. What are you going to do?”). Neither

task gives further guidance concerning content. The

rating criteria for both tasks include: 1) volume and

breadth of relevant content; 2) fluency (little

hesitation or repetition); and 3) grammatical accuracy

(Hanban, 2014c). They could be argued to be “valid” in

that they attempt to elicit responses which can be rated

on the ICPS rating scale. However, it is doubtful that

18 / 25

four simple questions could elicit the range of target

content and linguistic structures that are stipulated in

the ICPS construct of oral proficiency. Furthermore, as

Zhang, Li, Li, Jie & Huang (2012: 51-2) point out, these

tasks probably test “logic, … textual organisation

skills, … imaginative skills, creativity [and] specific

topic knowledge” far more than they test actual L2

ability.

To improve the validity of the HSKK, I firstly

propose adding more tasks to cover a wider range of

interactional situations and discourse types, thus

encompassing the entire oral proficiency construct

described in ICPS. Secondly, I propose providing

candidates with more guidance on what specifically to say

and how to structure their response, thus reducing the

influence of non-language-related factors in their

performance and increasing test-retest reliability

(Hughes, 2003: 39).

5 Conclusions

In this essay, I have demonstrated the complexities

involved in validating oral proficiency tests. I have

suggested that the construct of “Chinese oral

proficiency” should be redefined in the light of real

data showing how L2 learners progress as they acquire

spoken Chinese. Rating criteria and test design can then

be developed from this redefined construct. Scoring of

19 / 25

Chinese speaking tests could be improved through the

introduction of a simple, concise, construct-linked

analytic scale. This would enhance consequence and

washback validity. Although I did not have space to

consider issues of reliability in depth, there is surely

much that can be done to improve in this area as well.

Even then, caution must still be exercised in

interpreting the results of such tests, and decisions

about people’s future should not be based upon oral

proficiency test results alone.

References

American Council on the Teaching of Foreign Languages

(ACTFL) (2012), ACTFL Proficiency Guidelines, Alexandria, VA:

ACTFL.

Bachman, Lyle F. (1988), “Problems in Examining the

Validity of the ACTFL Oral Proficiency Interview”,

Studies in Second Language Acquisition, 10(2):149-164.

Bachman, Lyle F. and Savignon, Sandra J. (1986), “The

Evaluation of Communicative Language Proficiency: A

Critique of the ACTFL Oral Interview”, The Modern

Language Teaching Journal, 70(4): 380-390.

Bachman, Lyle F. and Palmer, Adrian S. (1996), Language

Testing in Practice, Oxford: Oxford University Press.

Bernstein, Jared, DeJong, John, Pisoni, David and

Townshend, Brent (2000), “Two experiments on automatic

20 / 25

scoring of spoken language proficiency”, Integrating

Speech Technology in Learning, pp. 57-61.

Brown, H. Douglas and Abeywickrama, Priyanvada (2010),

Language Assessment: Principles and Classroom Practice, 2nd

Edition, White Plains, NY: Pearson.

Byegate, Martin (1987), Speaking, Oxford: Oxford

University Press.

Chapelle, Carol A. (1999), “Validity in language

assessment”, Annual Review of Applied Linguistics, 19: 254-272.

Clark, John L. D. (1988), “Validation of a tape-mediated

ACTFL/ILR-scale based test of Chinese speaking

proficiency”, Language Testing, 5(2): 187-205.

Confucius Institute (ed.) (2010), Xin Hanyu Shuiping Kaoshi

Dagang HSK Liu Ji [New Chinese Proficiency Examination Curriculum, HSK

Level 6], Beijing: Commercial Press.

Council of Europe (2001), Common European Framework of

Reference for Languages: Learning, Teaching and Assessment,

Cambridge: Cambridge University Press. Available

online:

http://www.coe.int/t/dg4/linguistic/source/framework_en

.pdf (Accessed 2014-05-12).

Fulcher, Glenn (2003), Testing Second Language Speaking,

Harlow: Pearson Education.

Hanban (National Office for Teaching Chinese as a Foreign

Language) (2014a), HSK [Chinese Proficiency Examination],

Beijing: Hanban, On-line at:

http://www.hanban.edu.cn/tests/node_7486.htm (Accessed

2014-05-17).

21 / 25

http://www.coe.int/t/dg4/linguistic/source/framework_en.pdf

http://www.coe.int/t/dg4/linguistic/source/framework_en.pdf

Hanban (2014b), HSKK [Chinese Proficiency Oral Examination],

Beijing: Hanban, On-line at:

http://www.chinesetest.cn/godownload.do (Accessed 2014-

05-17).

Hanban (2014c), HSKK Pingfen Shuoming [Chinese Proficiency Oral

Examination Scoring Explanation], Beijing: Hanban, On-line at:

http://www.hanban.edu.cn/tests/node_38289.htm (Accessed

2014-05-17).

Hughes, Arthur (2003), Testing for Language Teachers: Second

Edition, Cambridge: Cambridge University Press.

Jin, Tan and Mak, Barley (2013), “Distinguishing features

in scoring L2 Chinese speaking proficiency: How do they

work?”, Language Testing, 30(1): 23-47.

Kennedy, Sara and Trofimovich, Pavel (2008),

“Intelligibility, Comprehensibility, and Accentedness

of L2 Speech: The Role of Listener Experience and

Semantic Context”, The Canadian Modern Language Review,

64(3): 459-489.

Kenyon, Dorry M. (1997), “Further research on the

efficacy of rater self-training”, in Ari Huhta, Viljo

Kohonen, Liisa Kurki-Suonio & Sari Luoma (Eds.), Current

Developments and Alternatives in Language Assessment: Proceedings of

LTRC 96, Jyväskylä, Finland: University of Jyväskylä,

pp. 257-273.

Lado, Robert (1961), Language Testing: The Construction and Use of

Foreign Language Tests, London: Longman.

22 / 25

Lantolf, James P. and Frawley, William (1988),

“Proficiency: Understanding the Concept”, Studies in Second

Language Acquisition, 10: 181-195.

Li, Xiaoqi and Li, Jinghua (2014), “Hanyu Kouyu Kaoshi

(SCT) de Xiaodu Fenxi [Validity Analysis of Spoken

Chinese Test]”, Shijie Hanyu Jiaoxue [Global Chinese Teaching],

28(1): 103-112.

Littlewood, William (1981), Communicative Language Teaching: An

Introduction, Cambridge: Cambridge University Press.

Lowe, Pardee, Jr. (1986), “Proficiency: Panacea,

Framework, Process? A Reply to Kramsch, Schulz, and,

Particularly, to Bachman and Savignon”, The Modern

Language Journal, 70(4): 391-7.

Luoma, Sari (2004), Assessing Speaking, Cambridge: Cambridge

University Press.

Martínez-Flor, Alicia, Usó-Juan, Esther and Soler, Eva

Alcón (2006), “Towards Acquiring Communicative

Competence through Speaking”, in Esther Usó-Juan and

Alicia Martínez-Flor (eds.), Current Trends in the Development

and Teaching of the Four Language Skills, Berlin: Mouton de

Gruyter, pp. 139-157.

Messick, Samuel J. (1989), “Validity”, in Robert L. Linn

(ed.), Educational Measurement, New York: MacMillan,

pp.13-103.

Meyer, Florian Kağan (2014), Language Proficiency Testing for

Chinese as a Foreign Language: An Argument Based Approach for

Validating the Hanyu Shuiping Kaoshi (HSK), Frankfurt am Mein:

Peter Lang.

23 / 25

Munro, Murray J., Derwing, Tracey M. and Morton, Susan L.

(2006), “The mutual intelligibility of L2 speech”,

Studies in Second Language Acquisition, 28: 111–131.

Neal, Robert J. (forthcoming), PhD thesis, University of

Cambridge.

North, Brain and Schneider, Günther (1998), “Scaling

descriptors for language proficiency scales”, Language

Testing, 15(2): 217-262.

North, Brian (2000), “Linking language assessments: an

example in a low stakes context”, System, 28: 555-577.

Office of the National Working Team for the International

Promotion of Chinese (OIPC) (2007), Guoji Hanyu Nengli

Biaozhun [International Chinese Proficiency Standards], Beijing:

Foreign Language Teaching and Research Press.

Tarone, Elaine (2005), “Speaking in a Second Language”,

in Eli Hinkel (ed.), Handbook of Research in Second Language

Teaching and Learning, Mahwah, NJ and London: Lawrence

Erlbaum Associates, pp. 485-502.

Van Moere, Alistair (2012), “A psycholinguistic approach

to oral language assessment”, Language Testing, 29(3):

325-344.

Wu, Shu-Ling and Ortega, Lourdes (2013), “Measuring

Global Oral Proficiency in SLA Research: A New Elicited

Imitation Test of L2 Chinese”, Foreign Language Annals,

46(4): 680-704.

Xie, Xiaoqing (2011), “Wei shenme yao kaifa xin HSK

kaoshi? [Why Hanban developed the new Chinese

24 / 25

proficiency test?]”, Zhongguo Kaoshi [China Examinations], 3:

10-13.

Xu, Haibing and Cheng, Yan (2011), “ ‘Guoji Hanyu Nengli

Biaozhun’ zhong Zhide Shangque de Difang – Jian yu

‘Ouzhou Yuyan Gongtong Cankao Kuangjia: Xuexi, Jiaoxue,

Pinggu’ Bijiao [Debateable Issues on International Standard

of Chinese Ability: In Comparison with Common European

Framework of Reference for Languages: Learning, Teaching,

Assessment], Guangdong Haiyang Daxue Xuebao [Journal of

Guangdong Ocean University], 31(5): 91-94.

Zhang Jinjun, Li Peize, Li Yanan, Jie Nini and Huang Lei

(2012), “Dui Xin Hanyu Shuiping Kaoshi de Xin Sikao

[New Thinking on the New HSK]”, Kaoshi Luntan [Examination

Exploration], 2: 50-53.

25 / 25

Improving the validity of Chinese speaking tests

Documents

Transcript of Improving the validity of Chinese speaking tests