
Assessing EFL Speaking Skills in Vietnamese Tertiary Education

by

Thanh Nam Lam

B.A. (EFL Pedagogy), M.A. (TESOL)

A dissertation submitted in fulfilment of the requirements for

the Degree of Doctor of Philosophy (Education)

Supervisors:

Professor James Albright

Professor John Fischetti

Professor Greg Preston

School of Education

Faculty of Education and Arts

October 2018


STATEMENT OF ORIGINALITY

I hereby certify that the work embodied in the thesis is my own work, conducted

under normal supervision. The thesis contains no material which has been

accepted, or is being examined, for the award of any other degree or diploma in

any university or other tertiary institution and, to the best of my knowledge and

belief, contains no material previously published or written by another person,

except where due reference has been made. I give consent to the final version of

my thesis being made available worldwide when deposited in the University’s

Digital Repository, subject to the provisions of the Copyright Act 1968 and any

approved embargo.

Thanh Nam Lam

Signature: ………………

Date: …… October, 2018


ACKNOWLEDGEMENTS

After an intensive period of more than four years, today is the day: writing this note of

thanks is the finishing touch on my thesis. It has been a period of rewarding learning for

me, not only in the scientific arena, but also on a personal level. I would like to reflect on

the people who have supported and helped me so much throughout my academic journey.

First and foremost, I would like to express my sincere gratitude to my wonderful

supervisors Professor James Albright, Professor John Fischetti, and Professor Greg Preston

for their continuous support of my PhD study, for their patience, responsibility, enthusiasm,

and immense knowledge. I am fortunate to have had them as my supervisors. Without their

dedicated guidance, this research project would never have been possible.

I acknowledge my debt to the Government of Vietnam through the Ministry of

Education and Training and the University of Newcastle, Australia, for awarding me a

VIED-TUIT scholarship to support my doctoral studies.

I am very thankful to the University of Newcastle for providing such a rich research

resource and excellent student services and, in particular, to the friendly and supportive

academic staff of the School of Education. I must say a special thank you to Vietnam Aviation Academy, where I work, for facilitating my study under the most favourable conditions possible.

My profound gratitude goes to Professor Max Smith, Professor Allyson Holbrook,

and Associate Professor James Ladwig, who gave me interesting lectures and useful

research skills in my coursework at the University of Newcastle.

I wish also to express my great appreciation to Associate Professor Kylie Shaw,

Associate Professor Mitchell O’Toole, Dr. Maura Sellars, Dr. Rachel Burke, Ms. Helen

Thursby, Ms. Helen Hopcroft, and Mr. Nicholas Collier for their suggestions, interest,

language assistance and constructive feedback on my very first manuscript. My additional thanks go to Ms. Ruth McHugh for taking on the tedious, long-running job of proofreading.


I would like to express my heartfelt thanks to Dr Ho Thanh My Phuong. As my

former supervisor, she has taught me and motivated me more than I could ever give her

credit for here.

I would also like to take this opportunity to thank my former instructor, Dr Vu Thi

Phuong Anh. She inspired me with research ideas about language testing and assessment.

She has shown me, by her example, what a good scientist (and person) should be.

My deepest respect and appreciation go to the Heads of the EFL Departments at the tertiary institutions in Vietnam for generously giving consent to my data collection in their EFL classes. I would like to thank the EFL teacher and student participants in

Vietnam. Their information and ideas constituted the core of my study. Many of the

participants set aside considerable amounts of their time to provide me with a profound understanding of their experiences and perceptions of oral assessment. The EFL

experts’ insights enriched my research results and contributed enormously to the eventual

conclusions from the investigation. I am greatly appreciative of their enthusiastic

cooperation in my study.

I wish to thank my family for their unconditional love and encouragement during

my time of studying away from home. They have been and still are ever ready to assist me

in my various endeavours.

Finally, there are my lovely schoolmates here at the Callaghan campus. We were able not only to support each other by deliberating over our study problems and findings, but also to talk happily about things other than just our papers. I cannot forget my friends

from Vietnam and the US. Their messages and emails gave me motivational strength to

complete my study.

Newcastle, 30 October 2018

Nam Lam


TABLE OF CONTENTS

STATEMENT OF ORIGINALITY .................................................................................... ii

ACKNOWLEDGEMENTS ............................................................................................... iii

TABLE OF CONTENTS .................................................................................................... v

LIST OF ABBREVIATIONS ............................................................................................. x

LIST OF TABLES ............................................................................................................. xi

LIST OF FIGURES .......................................................................................................... xiv

LIST OF APPENDICES ................................................................................................... xv

GLOSSARY .................................................................................................................... xvii

ABSTRACT ................................................................................................................... xviii

Chapter One: INTRODUCTION ................................................................................... 21

1.1 Background of the study ............................................................................................. 21

1.1.1 A brief history of testing L2 speaking ................................................................. 22

1.1.2 Trends in language testing ................................................................................... 26

1.1.3 Context of the research ........................................................................................ 29

1.2 Research questions ...................................................................................................... 44

1.3 Significance of the research ........................................................................................ 47

1.4 Organisation of the thesis ............................................................................................ 49

Chapter Two: LITERATURE REVIEW ....................................................................... 51

2.1 Introduction ................................................................................................................. 51

2.2 Key issues in designing spoken language tests ........................................................... 51

2.2.1 Construct validity in oral language testing .......................................... 56

2.2.2 Content aspect of construct validity .................................................... 59

2.2.3 Reliability ............................................................................ 60

2.3 Conceptual framework for validating speaking tests .................................................. 63

2.4 Formats of speaking tests ............................................................................................ 65

2.5 Technological applications in oral assessment ........................................................... 68


2.6 Factors affecting test validity and reliability ............................................................... 69

2.6.1 Assessment criteria .............................................................................................. 69

2.6.2 Rating scales ........................................................................................................ 70

2.6.3 Test tasks ............................................................................................................. 71

2.7 Washback of oral language assessment ...................................................................... 74

2.8 Summary ..................................................................................................................... 76

Chapter Three: METHODOLOGY ............................................................................... 78

3.1 Introduction ................................................................................................................. 78

3.2 Rationale for the research design ................................................................................ 78

3.2.1 Adoption of a mixed methods approach .............................................................. 78

3.2.2 Using a convergent design .................................................................................. 80

3.3 Research setting .......................................................................................................... 81

3.3.1 Research sites ...................................................................................................... 82

3.3.2 Research participants ........................................................................................... 83

3.4 Data sources ................................................................................................................ 87

3.4.1 Test room observation ......................................................................................... 89

3.4.2. Questionnaire surveys ........................................................................................ 90

3.4.3 Interviews ............................................................................................................ 95

3.4.4 Expert judgements ............................................................................................... 98

3.4.5 Documents .......................................................................... 99

3.5 Data collection procedures ........................................................................................ 100

3.6 Methods of data analysis ........................................................................................... 105

3.6.1 Quantitative data processing ............................................................................. 105

3.6.2 Qualitative data processing ............................................................... 106

3.7 Presenting data analysis ............................................................................................ 112

3.8 Research ethics and compliance ................................................................................ 113

3.9 Assuring research quality .......................................................................................... 115

3.10 Summary ................................................................................................................. 117

Chapter Four: RESULTS: TEST TAKER CHARACTERISTICS AND TEST

ADMINISTRATION .................................................................................................... 119

4.1 Introduction ............................................................................................................... 119


4.2 Test taker characteristics ........................................................................................... 120

4.3 Speaking test administration across institutions ....................................................... 135

4.4 Candidates’ and raters’ perceptions of the oral test administration .......................... 145

4.4.1 Candidates’ perceptions of the oral test administration .................................... 145

4.4.2 Raters’ perceptions of the oral test administration ............................................ 150

4.5 Summary ................................................................................................................... 155

Chapter Five: RESULTS: CONTENT RELEVANCE OF SPEAKING TEST

QUESTIONS ................................................................................................................. 157

5.1 Introduction ............................................................................................................... 157

5.2 Defining test constructs ............................................................................................. 158

5.3 Designing the content judgement protocol ............................................................... 161

5.4 Selecting approaches to data analysis ....................................................................... 164

5.5 Relevance of test contents ......................................................................................... 165

5.5.1 EFL experts’ judgements on test content relevance .......................................... 166

5.5.2 Linking expert opinions with other data sources ............................................... 180

5.6 Summary ................................................................................................................... 190

Chapter Six: RESULTS: SPEAKING TEST TASKS ................................................ 192

6.1 Introduction ............................................................................................................... 192

6.2 A comparative analysis of speaking test tasks across institutions ............................ 193

6.2.1 Response formats .............................................................................................. 193

6.2.2 Task purposes .................................................................................................... 206

6.2.3 Time constraints ................................................................................................ 225

6.2.4 Channels of communication .............................................................................. 230

6.3 Raters’ and candidates’ perceptions of the test tasks ................................................ 235

6.3.1 Teachers’ perceptions of the test tasks .............................................................. 235

6.3.2 Candidates’ perceptions of the test tasks ........................................................... 238

6.4 Summary ................................................................................................................... 240

Chapter Seven: RESULTS: RATER CHARACTERISTICS AND RATING ORAL

SKILLS .......................................................................................................................... 242

7.1 Introduction ............................................................................................................... 242

7.2 Test rater characteristics ............................................................................................ 243


7.3 Rating and scoring ..................................................................................................... 246

7.3.1 Oral assessment criteria ..................................................................................... 247

7.3.2 Rating scales ...................................................................................................... 248

7.3.3 Oral rating process ............................................................................................. 251

7.4 Raters’ consistency in oral rating .............................................................................. 256

7.4.1 Scoring methods ................................................................................................ 256

7.4.2 Aspects of rating in oral assessment .................................................................. 258

7.4.3 Giving bonus points ........................................................................................... 261

7.4.4 Familiarity between the rater and candidates .................................................... 263

7.5 Test raters’ and candidates’ perceptions about the practice of rating and scoring .... 266

7.5.1 Test raters’ perceptions of the rating process .................................................... 266

7.5.2 Candidates’ perceptions of the rating and scoring ............................................ 269

7.6 Test score analysis ..................................................................................................... 272

7.6.1 Distribution of test scores .................................................................................. 273

7.6.2 Inter-rater reliability in direct scoring between pairs of raters .......................... 275

7.6.3 Inter-rater reliability in semi-direct scoring across raters ................................. 278

7.7 Summary ................................................................................................................... 283

Chapter Eight: RESULTS: IMPACT OF ORAL TESTING ON EFL TEACHING

AND LEARNING .......................................................................................................... 285

8.1 Introduction ............................................................................................................... 285

8.2 Impact of the oral test from candidates’ perspectives ............................................... 286

8.2.1 Impact of test scores .......................................................................................... 287

8.2.2 Learning activities candidates found useful for the oral test ............................. 290

8.2.3 Candidates’ perceptions of the test impact on EFL learning ............................. 292

8.3 Impact of the oral test on teaching from teacher raters’ perspectives ....................... 298

8.3.1 Major desired changes in teaching speaking skills ........................................... 298

8.3.2 Teaching activities to prepare learners for the oral test ..................................... 300

8.3.3 Teachers’ perceptions of the test impacts on teaching EFL speaking skills ..... 304

8.3.4 Implementing a new method of assessing speaking skills ................................ 314

8.4 Summary ................................................................................................................... 316


Chapter Nine: SUMMARY AND CONCLUSIONS ................................................... 317

9.1 Introduction ............................................................................................................... 317

9.2 Summary of research results ..................................................................................... 318

9.2.1 Issues in test administration affecting test fairness and candidates’ speaking

performance ................................................................................................................ 319

9.2.2 Test content relevance and inequality in test questions’ degree of difficulty ... 323

9.2.3 Diversity in test tasks elicited different speech patterns for assessment ........... 324

9.2.4 Inconsistency in rating and scoring spoken language performance .................. 326

9.2.5 Impact of oral testing on EFL teaching and learning ........................................ 327

9.3 Limitations of the study and recommendations for future research .......................... 335

9.4 Implications ............................................................................................................... 340

9.4.1 Implications for speaking test administrators .................................................... 340

9.4.2 Implications for speaking test designers ........................................................... 342

9.4.3 Implications for oral test raters and scorers ...................................... 343

9.4.4 Implications for oral test takers ......................................................................... 344

9.4.5 Implications for educational policy makers ...................................................... 345

9.4.6 Implications for language testing researchers .................................... 347

9.5 Conclusions ............................................................................................................... 348

REFERENCES ................................................................................................................ 352

APPENDICES ................................................................................................................. 385


LIST OF ABBREVIATIONS

BA       Bachelor of Arts
BEC      Business English Certificate
CA       Conversation analysis
CALT     Computer-assisted language testing
CBT      Computer-based testing
CEFR     Common European Framework of Reference for Languages
CLT      Communicative language teaching
EFL      English as a foreign language
ELT      English language teaching
ESL      English as a second language
ESP      English for specific purposes
ETS      Educational Testing Service
IELTS    International English Language Testing System
L1       First language
L2       Second/foreign language
MOET     Ministry of Education and Training (Vietnam)
OPI      Oral Proficiency Interview
SBA      School-based assessment
SE       Spoken English
SPSS     Statistical Package for the Social Sciences
TESOL    Teaching English to Speakers of Other Languages
TOEFL    Test of English as a Foreign Language
TOEIC    Test of English for International Communication
UCLES    University of Cambridge Local Examinations Syndicate


LIST OF TABLES

Table 2.1 Purposes of language assessment ...................................................................... 53

Table 2.2 Contrasting categories of L2 assessment .......................................................... 54

Table 2.3 Kinds of language tests ..................................................................................... 55

Table 2.4 Categories of speaking assessment tasks .......................................................... 73

Table 3.1 Sampling of participants and research instruments used in the study ............... 87

Table 3.2 Summary of key points covered in the test room observation scheme ............. 90

Table 3.3 Summary of contents covered in the questionnaire for EFL teachers .............. 92

Table 3.4 Summary of contents covered in the questionnaire for EFL students .............. 95

Table 3.5 Summary of the EFL teacher interview protocol .............................................. 96

Table 3.6 Summary of the EFL student interview protocol .............................................. 97

Table 3.7 Information about EFL experts ......................................................................... 99

Table 3.8 Data collection methods in a three-stage process ....................... 101

Table 3.9 Data collection before the test ......................................................................... 102

Table 3.10 Data collection during the test ...................................................................... 103

Table 3.11 Data collection after the test .......................................................................... 104

Table 4.1 Test takers’ gender profile .............................................................................. 123

Table 4.2 Speaking test anxiety across institutions ......................................................... 127

Table 4.3 Candidates’ oral test anxiety in relation to general self-evaluation of English ........ 129

Table 4.4 Test takers’ English education prior to the oral test ........................................ 130

Table 4.5 Test takers’ profile of English-speaking class attendance .............................. 133

Table 4.6 Candidates’ oral test anxiety in relation to class attendance ........................... 134

Table 4.7 Comparing test administration methods across institutions ............................ 136

Table 4.8 Advantages and disadvantages of direct assessment ...................................... 139

Table 4.9 Raters’ opinions for and against audio-recording oral tests ............................ 141

Table 4.10 Examiners’ performance during an oral test session .................................... 143

Table 4.11 Means and standard deviations (SD) for candidates’ perceptions of the test

administration ...................................................................................................... 146


Table 4.12 Candidates’ comments for and against computer-assisted oral testing .... 150

Table 4.13 Means and standard deviations (SD) for raters’ perceptions of the test

administration ...................................................................................................... 152

Table 4.14 Teacher raters’ comments for and against computer-assisted oral testing .... 155

Table 5.1 Comparison of the English-speaking courses across the institutions ............. 159

Table 5.2 Summary of test questions used for oral assessment across the institutions .. 163

Table 5.3 Information collected from EFL experts’ content judgement protocol .......... 166

Table 5.4 Content validity index of test items (I-CVI) across institutions ..................... 167

Table 5.5 Examples of test items with I-CVIs of 1.0 and .83 ......................................... 168

Table 5.6 The average of the I-CVIs for all items on the scale ...................................... 169

Table 5.7 Examples of items rated ‘Relevant/Highly relevant’ and positive comments by

all the experts ...................................................................................................... 171

Table 5.8 Examples of problematic items and experts’ suggestions for revision ........... 172

Table 5.9 Examples of confusing test questions with experts’ comments and

suggestions .......................................................................................................... 173

Table 5.10 Question lengths in numbers of total words (TW) and content words (CW)

across different task types adopted at the institutions ......................................... 176

Table 5.11 Examples of picture-cued questions with experts’ comments ...................... 179

Table 6.1 Response formats used in oral assessment across institutions ....................... 194

Table 6.2 Description of underlying purposes of oral test tasks ..................................... 207

Table 6.3 Summary of characteristics of the test tasks across institutions ..................... 208

Table 6.4 Rubric words and phrases indicating task purposes ........................................ 209

Table 6.5 Elicitation question types used for interview tasks ......................................... 220

Table 6.6 Time constraints in minutes for speaking tasks across institutions ................ 226

Table 6.7 Channels of communication in EFL oral assessment across institutions ........ 231

Table 6.8 Means and standard deviations (SD) for EFL teachers’ perceptions of test

tasks ..................................................................................................................... 236

Table 6.9 Means and standard deviations (SD) for EFL students’ perceptions of test

tasks ..................................................................................................................... 238

Table 7.1 Age profile of EFL teacher participants in the survey .................................... 244

Table 7.2 Assessment weighting of the Listening and Speaking components ............... 246


Table 7.3 Assessment criteria across institutions ............................................................ 247

Table 7.4 Features to be assessed and scored in a rating scale ....................................... 249

Table 7.5 Example of a holistic rating scale for the interlocutor .................................... 250

Table 7.6 Purposes of the oral test for EFL majors ......................................................... 253

Table 7.7 Raters’ tendency of giving bonus points ......................................................... 262

Table 7.8 Oral rating affected by candidates’ performance in class ............................... 265

Table 7.9 Means and standard deviations (SD) for raters’ perceptions of the oral rating

and scoring .......................................................................................................... 268

Table 7.10 Means and standard deviations (SD) for candidates’ perceptions of the oral

rating and scoring ................................................................................................ 270

Table 7.11 Candidates’ opinions for and against the number of raters in each test room ........ 272

Table 7.12 Descriptive statistics of oral test scores across the institutions ..................... 273

Table 7.13 Comparing categories of test scores across institutions ................................ 274

Table 7.14 Correlation of test scores by University A’s raters ....................................... 279

Table 7.15 Descriptive statistics of University A’s raters’ scores in the second time of

scoring ................................................................................................................. 280

Table 7.16 Correlation of test scores by University B’s raters ....................................... 281

Table 7.17 Descriptive statistics of University B’s raters’ scores in the second time of

scoring ................................................................................................................. 281

Table 7.18 Correlation of test scores by University C’s raters ....................................... 282

Table 7.19 Descriptive statistics of University C’s raters’ scores in the second time of

scoring ................................................................................................................. 283

Table 8.1 Means and standard deviations (SD) for EFL students’ perceptions of the test’s

impact on learning ............................................................................................... 292

Table 8.2 Candidates’ desired strategies to improve English speaking skills ................. 295

Table 8.3 Means and standard deviations (SD) for EFL teachers’ perceptions of the test’s

impact on teaching and learning speaking skills ................................................. 304

Table 8.4 Teachers’ changes and adjustments as a consequence of the oral test ............ 311


LIST OF FIGURES

Figure 2.1 Model of spoken language ability in the Cambridge Assessment .................. 57

Figure 2.2 Construct validity of test score interpretation ................................................. 58

Figure 2.3 Socio-cognitive framework for validating speaking tests (Weir, 2005, p. 46) 64

Figure 2.4 Interrelationship between assessment, teaching and learning ......................... 74

Figure 3.1 Example of a paragraph in the First Coding Cycle ....................................... 110

Figure 3.2 Framework for validating speaking tests (adapted from Weir, 2005) ........... 112

Figure 4.1 Age groups of test takers taking the oral test ................................ 122

Figure 4.2 Candidate self-evaluation of English proficiency in terms of general

performance, accuracy, and fluency ................................................................................ 124

Figure 4.3 Candidates’ experience with oral testing ..................................... 132

Figure 4.4 Seat arrangements for oral assessment at different institutions .................... 138

Figure 5.1 Content validity index (CVI) ......................................................................... 165

Figure 7.1 EFL teachers using spoken English in Speaking classes .............................. 244

Figure 7.2 Challenges for raters in oral rating ................................................................ 254

Figure 7.3 Aspects in candidates' oral performance the rater paid attention to .............. 259

Figure 7.4 Inter-rater agreement in oral test scoring across the institutions ................... 277

Figure 8.1 Impact of test scores on candidates ............................................................... 287

Figure 8.2 Useful learning activities for candidates before the test ............................... 291

Figure 8.3 Major desired changes in teaching speaking skills ....................................... 299

Figure 8.4 Factors affecting teaching oral skills ............................................................. 301

Figure 8.5 Useful activities to prepare students for the oral test .................................... 303

Figure 9.1 Example of a wrap-up activity for a speaking skills lesson ......................... 331


LIST OF APPENDICES

Appendix A: Ethics approval documents …………………………………………….. 385

A.1 HREC Approval (17/11/2015)

A.2 Information Statement for (a) Head of EFL Faculty

(b) EFL teachers

(c) EFL students (*)

A.3 Consent Form for (a) Head of EFL Faculty

(b) EFL teachers

(c) EFL students (*)

A.4 HREC Expedited Approval of a protocol variation (22/02/2017)

A.5 Information Statement for (a) EFL teacher raters

(b) EFL experts

A.6 Consent Form for (a) EFL teacher raters

(b) EFL experts

A.7 Verification of translated documents

Appendix B: Data collection instruments ……………………….…………………….. 422

B.1 Observation Protocol

B.2 Questionnaire for (a) EFL teachers

(b) EFL students (*)

B.3 Interview Protocol for (a) EFL teachers (*)

(b) EFL students (*)

B.4 Documents for EFL experts

(a) University A’s Test Content Judgement Protocol

(b) University B’s Test Content Judgement Protocol

(c) University C’s Test Content Judgement Protocol

(d) Demographic information about EFL experts

B.5 Focus group discussion protocol for EFL teacher raters (*)

Appendix C: Additional resources for data collection ……………………….………… 451

C.1 Procedure and speaking notes for initial contact with potential participants


C.2 Instructions for raters’ speech sample recording in test rooms

C.3 Checklist of documents to be collected

C.4 Interview transcription template

Appendix D: Examination documents ……………………….…………………………. 455

D.1 Assessment criteria, rating scales and scoring sheets at three institutions

D.2 Example of test material for oral examiners (interlocutor outline)

D.3 Speaking skills and discussion strategies introduced in course books

D.4 Scoring sheets for semi-direct assessment

Appendix E: Data management ……………………….……………………….……….. 463

E.1 Coding scheme of participants and data sources

E.2 Managing recorded speech samples

E.3 Multiple data sources used to examine key aspects in speaking test validation

E.4 Tabulated qualitative data

Appendix F: Examples of preliminary data processing ……………………………….. 471

F.1 Coding a rater interview transcript in two cycles

F.2 Rough transcript of a speech sample

F.3 Additional excerpts of task-based oral performances

F.4 Occurrence of language functions in speech samples

Appendix G: Sample rating scales for assessing speaking skills ……………………… 479

G.1 Sample rating scales from the literature: global and analytic

G.2 Common Reference Levels proposed by the CEFR

G.3 The CEFR-based English Competence Framework adopted in Vietnam (*)

Appendix H: Flowchart of procedures for data collection and data analysis ….………. 486

Appendix I: Transcription notation symbols ……………………….……….…………. 487

Appendix J: List of tertiary institutions in HCMC (Vietnam) …………….…………… 488

Appendix K: Original quotes in Vietnamese ……………………….……….…………. 489

K.1 Quotes from the Vietnamese press and literature

K.2 Quotes from Vietnamese interviews with research participants

___________________________ (*) a Vietnamese translation included


GLOSSARY

The following terms will be used as key words in this research study:

• achievement test: a test which aims to measure what has been learnt within a

course of instruction or up to a given time

• construct validity: the degree to which a test can measure what it claims to be

measuring

• content validity: the extent to which the test adequately samples the domain of knowledge and skills relevant to the target performance, according to the preset criterion

• inter-rater reliability: the extent to which two or more examiners would

agree with each other on the scores awarded to a single learner

• rating scale (also scoring rubric): an ordered set of descriptions of typical

performances in terms of their quality, used by raters in the rating process

• reliability: consistency of measurement across individuals by a test

• reliability coefficient: a statistic on a scale between -1 and +1, expressing the

extent to which individuals have been measured consistently by a test

• testing: the collection of any data that can be used to assess the language

abilities of respondents

• assessment: the inclusion of more variables when attempting to determine

language proficiency and achievement

• washback: the effect of a test on the teaching and learning leading up to it


ABSTRACT

English language mastery plays a crucial role in global communication, trade, and cultural exchange. Vietnam has made English proficiency a strategic goal of its national education system to boost regional and international integration. Vietnamese

education has been making a great effort to enhance the effectiveness of English as a

foreign language (EFL) instruction with a focus on communicative competence,

particularly listening and speaking skills. My study aimed to examine the operational

practices of oral assessment as an inseparable component in relation to L2 teaching and

learning in Vietnam.

Participants in my study project were EFL majors (candidates) and teachers

(examiners) involved in testing English speaking skills at three universities in South

Vietnam. My data collection for this empirical research started in late 2015 and continued

to early 2016. Data sources included oral test observation, questionnaire surveys,

interviews, testing documents, test scores, and EFL content experts’ judgements.

The results highlight a methodological diversity in oral assessment across the

institutions in terms of test administration, task design, language content, the process of

rating, and the impact of testing speaking. Test scores served more to complete a required unit of study than to make reliable inferences about learners’ oral production ability

and provide useful feedback for future improvement. Interview tasks did not reflect

characteristics of natural conversations when the interlocutor (also examiner) played a

predominant role in the assessment context. Several oral test questions required candidates’ theoretical knowledge in order to be answered. These factors had the potential to increase students’ test anxiety and to hinder their best performance. Discussion tasks enabled a wide

variety of speech functions to be produced and provided opportunities to manage verbal

interaction between paired candidates. Interactive speaking revealed students’ weaknesses in co-constructing an oral performance and a tendency to take individual turns talking about the assigned topic.


My study suggests implications for various stakeholders who could help improve the quality of oral testing and assessment in educational contexts. Rater training and double rating are necessary in oral assessment to minimise the measurement errors that human raters inevitably introduce. My research results indicate a need for more clearly defined assessment

criteria and descriptors in the rating scales to obtain higher consistency in assessing spoken

English abilities. The recent promulgation of the CEFR-based evaluation framework for L2

proficiency in Vietnam has brought both opportunities and challenges for those who are

concerned about enhancing and standardising the national quality of EFL education.



Chapter One

INTRODUCTION

1.1 Background of the study

Oral language testing and assessment for educational purposes play an inseparable part in

Communicative Language Teaching (CLT) as they can provide valuable evidence

regarding pedagogical effectiveness and learners’ progress and achievement in their ability to

communicate with others. Teachers and education administrators can make data-based

decisions to implement curriculum adjustments and/or teaching methods to increase

learners’ success in language study. Language assessment is widely used to gather information for making decisions that have definite influences on stakeholders – individuals, programmes, organisations, and societies (Bachman & Palmer, 2010).

Contemporary language assessment is based on theories of learning and cognition,

attaching importance to authentic skills and abilities necessary for learners’ future success

(Cheng, 2005). Particularly in this era of global integration, when English has become an

important means for global communication, cooperation and development, many

theoretical and practical issues have been emerging in research about measuring English as

a foreign or second language (EFL/ESL) competence not only in high-stakes examinations

but also in regular school-based assessment (SBA). I provide a brief historical overview of

foreign or second language (L2) assessment for better understanding of the advancements

in the use of various measurement methods and current trends in language testing

concomitant with innovations in applied linguistics and EFL/ESL teaching methodology.

My review focuses on L2 assessment in “the mother tongue English countries” (Jenkins, 2006, p. 160), such as the United Kingdom, the United States, and Australia, where English is the first and official language, and from which language testing originated and developed.


1.1.1 A brief history of testing L2 speaking

Language assessment, a branch of Applied Linguistics, was put into practice more than two

centuries ago. Testing L2 speaking skills is the youngest subfield of language assessment

(Alderson, 1991). EFL/ESL public examinations date back to the second half of the 19th

century when the University of Cambridge sent examination papers overseas for the first

time in 1863 (Giri, 2010). It was not until the first decades of the 20th century that

remarkable innovations in language tests were seen in Britain and subsequently in the

United States (O’Sullivan, 2012). The University of Cambridge Local Examinations

Syndicate (UCLES) first introduced the Cambridge Proficiency Examination (CPE) in

1913. The examination was administered to test the English language performance of people wishing to pursue an education in Great Britain. However, the testing of spoken English for foreigners did not become an area of interest until the outbreak of World War II in 1939 (Fulcher, 1997b).

Testing speaking before 1939

The beginning of language testing research in the United States was marked by an overriding concern with how to achieve reliable scores from L2 tests. Measuring oral language

skills could not produce consistent scores because the rating process depended on a scorer

who could be influenced by many uncontrolled factors (Heyden, 1920). Oral tests for large

numbers of candidates were considered to be impractical.

In 1913, a committee of the Modern Language Association recommended the

inclusion of an aural component in U.S. university entrance examinations (Spolsky, 1995).

Many institutions adopted or adapted the recommended format consisting of a 10-minute

dictation section, and a written response section requiring candidates to write down answers

to aural questions delivered by the examiner (Decker, 1925; Lundeberg, 1929). Candidates

were not actually required to speak in the test, but write down the phonetic script indicating

how written words were pronounced. Oral tests took the form of testing pronunciation

knowledge rather than communicative performance. Concern about reliability was such a major issue that “few tests in use at that time sought to assess the spoken language, even though elements within the profession had been aware of this deficiency for a long time” (Barnwell, 1996, p. 18).


In 1930, the College Entrance Examination Board introduced the first true speaking test component. Before that, language testing practitioners had concentrated solely on multiple-choice tests as objective measures of L2 proficiency rather than grappling with the complexity of testing speaking. This test was designed to meet the requirement of the U.S. government that

institutions had to have a clear indication of students’ English language ability prior to

admission (Spolsky, 1995, p. 55). Its Speaking component constituted one part of a five-

section test: Reading I, Reading II, Dictation, Speaking, and Essay writing. For the oral

section, the candidate was engaged in a topic-based conversation with the examiner. Multi-

aspect assessment criteria – fluency, responsiveness, rapidity, articulation, enunciation, command of construction, use of connectives, vocabulary and idioms – were rated on a 3-point scale: proficient, satisfactory, or unsatisfactory. Test takers’ shyness was recorded

as an extra consideration in test score interpretation (Fulcher, 2003, p. 3).

The story of L2 testing in the United Kingdom was slightly different from that in

the United States. Testing of English for foreigners by the University of Cambridge started

as early as the mid-19th century. By 1898, the University of Cambridge was in charge of 36

colonial centres for overseas examinations with more than 1,200 candidates (Giri, 2010). In

1913, the Certificate of Proficiency in English examination introduced by the University of

Cambridge Local Examinations Syndicate (UCLES) included a sub-test of spoken English

(Roach, 1945) in which there was a 30-minute section for Dictation and another 30-minute

section for Reading and Conversation, which was graded only for pronunciation. The main

purpose of the British tests was to support the training programme and encourage effective

language education in foreign schools (Brereton, 1944). The Cambridge examinations were

“innovative and almost non-academic” in their objectives, focusing on language use rather

than language knowledge (Giri, 2010). Unlike in the United States, assessing speaking

skills in the United Kingdom was not much concerned with consistency or measurement

theory (Fulcher, 2003). The College Board's examination put more emphasis on accuracy in

language evaluation with true/false questions and criterion-referenced grading, whereas the

Cambridge examination concentrated more on the local syllabus, with English literature and an absolute reliance on subjective grading (Fulcher, 2003; Taylor, 2011).

Testing speaking in the war years


The Army Specialized Training Program (ASTP) was launched in 1942 to address the

problem that many American service personnel did not have sufficient foreign language

skills required for carrying out their duties in World War II. This was the first language

instruction programme to equip trainees with colloquial spoken language and knowledge of

the context in which the language was to be used (Angiolillo, 1947, p. 32). The assessment

of a trainee’s success upon completing the programme shifted from grammar knowledge to

ability to use the language. The test included three tasks: securing services, asking for

information, and giving information. Well-trained interlocutors and raters were required to

administer the test in friendly and informal settings.

In 1956, the Foreign Service Institute (FSI) undertook the responsibility of assessing the language proficiency of all personnel engaged in foreign service. One considerable challenge in testing was the officers’ age variation, which was reported to influence raters’ evaluation. Two years later, the FSI testing board supplied raters with a 6-point scale and a 5-factor checklist for assessment covering accent, fluency, grammar, vocabulary, and

comprehension. This new testing procedure was later adopted by the Central Intelligence

Agency (CIA), the Defense Language Institute, and the Peace Corps during the 1960s

(Fulcher, 2003).

Testing speaking in the post-war years

The Test of English as a Foreign Language (TOEFL), which was developed by

the Educational Testing Service (ETS) in the United States in the early 1960s and remained

unchanged for 40 years, marked “a major innovation in the standardization of English

language tests” (O’Sullivan, 2012, p. 13). The TOEFL Test of Spoken English (TSE) was optional for candidates and “most commonly intended only for those applying for graduate teaching assistant positions” (Dalby, Rubenstone, & Weir, 1996, p. 38). The official

speaking component was not included in the TOEFL until late 2005 when the ETS

introduced a new format - the TOEFL Internet-based Test (iBT). The TOEFL scores

provided evidence of English language ability for non-native speakers to be considered for admission not only to colleges and universities in the United States but also to tertiary institutions in many other English-speaking countries such as the United Kingdom and Australia.


By the early 1980s, test developers from the English Language Testing Service

(ELTS) in the United Kingdom began to explore the potential of designing tests of

language for specific purposes. The ELTS soon became the International English Language

Testing System (IELTS) in 1989. The IELTS is a standardised test of English language

proficiency designed for non-native English language speakers worldwide. At present, the

IELTS is jointly managed by the British Council, IDP: IELTS Australia, and Cambridge English Language Assessment. The IELTS test

covers all four language skills in four separate tests: Listening, Speaking, Reading, and

Writing. The Speaking test is a face-to-face interview between the test taker and an

examiner. A speaking test session includes three parts: introduction and interview (4–5

minutes), long turn (3–4 minutes), and discussions (4–5 minutes). Speaking tests are

included in both IELTS Academic and IELTS General Training formats. These two types

of the IELTS test serve different purposes: the former “is for people applying for higher

education or professional registration in an English speaking environment”, and the latter

“is for those who are going to English speaking countries for secondary education, work

experience or training programs” (International English Language Testing System, 2018a).

Both are designed to cover the full range of ability from non-user to expert user.

Language testing in Australia took its first steps in the early years of the last century.

The 1980s and 1990s were “characterized by a striking growth in the application of

language testing – frequently in contexts governed by macropolitical pressures”

(Hawthorne, 1997, p. 248), and marked by the establishment of several institutes

specialising in ESL/EFL teaching and research. For example, the Australian Council for Educational Research (ACER), established in 1930, has grown into one of the world’s leading bodies in educational research. In 2012, ACER launched the PAT series

(Progressive Achievement Tests) as an online assessment system to provide teachers with

objective and norm-referenced information about students’ skills and understanding in their

learning and development, including language skills (ACER, n.d.). Founded in 1988, the

National Centre for English Language Teaching and Research (NCELTR) has become an

integrated centre for research, publishing, teacher education, TESOL/Applied Linguistics

resources, and English language programmes, in coordination with Cambridge Exams

(NCELTR, 2004). The late 1990s witnessed the Australian Second Language Proficiency


Ratings (ASLPR) become the standard tool adopted widely in Australia for “provid[ing]

performance descriptions couched in terms of the practical tasks that learners can carry out

and how they carry them out at nine points along the continuum from zero to native-like

proficiency” (Ingram, 1996, p. 3). This form of descriptive (or criterion-referenced) scale

presents three kinds of information about each proficiency level: (1) a general description of the language behaviour suitable for the level, (2) examples of specific language tasks and how they are performed, and (3) comments briefly explaining concepts and key features. The ASLPR can, therefore, be used for candidates with a diverse range of

employment and educational backgrounds (Giri, 2010; Ingram, 1996).

Australia’s language tests also served as devices of social policy “designed to

control the flow of immigrants and refugees and to determine access to education and

employment in settings of intergroup competition” (McNamara, 2005, p. 351). For

example, the implementation of the White Australia Policy, which ended in 1973 (Antecol,

Cobb-Clark, & Trejo, 2004), indicated that “covert policy hinged less on issues of language

and much more on race” (Smith-Khan, 2015). When used for this political purpose, language testing was aimed at facilitating the entry of immigrants who spoke a European language while restricting the migration of “non-European” people (Smith-Khan, 2015). Language testers in

Europe and America have been involved in considerable debates surrounding the role of

language testing in immigration procedures and citizenship policy (McNamara & Eades,

2004).

1.1.2 Trends in language testing

The above historical overview demonstrates that oral testing has achieved significant

advances from its beginning up to the present. Modelling in L2 testing is not a new

concept. Through the development of language testing, we can see that L2 theorists and

methodologists have developed and adopted language testing models in accordance with

the existing models of language teaching that changed over time to meet socio-political

needs in specific periods of history. Trends in language testing mirror those in language teaching. The following section provides a critical perspective on three different trends in

language testing based on various ways of defining language ability: the pre-scientific

trend, the psychometric trend, and the psycholinguistic-sociolinguistic trend.


The pre-scientific trend

Language testing theorists vary in their views of the nature of language and language abilities, which, in consequence, has led to different but at times overlapping trends in language testing. Before the 1960s, language was viewed as a means of learning about the target culture, and language learning as a form of intellectual training

(McGarrell, 1981). The purpose of language teaching and learning was to achieve

knowledge about the target language (vocabulary, grammatical rules) as objectives of

study. This “separatist” view (Davies, 1968, p. 216) gave little consideration to the relationship between society and language, or to the social context in which language was used. Language testing was viewed as testing the memorisation of

words and grammatical accuracy through writing and translation exercises.

In this “intuitive” stage (Madsen, 1983), also called the pre-scientific era (Spolsky,

1978) of language testing history, decisions on teaching and testing chiefly depended on

language teachers’ and testers’ personal discretion. Subjective judgements were, therefore,

inevitable. The pre-scientific trend resulted from the traditional grammar-translation

approach in language teaching that required learners to demonstrate understanding of

vocabulary meaning learnt by memorisation and to apply rules from grammar lessons

taught in a deductive manner.

The advantage of this language testing trend is that it has enabled a global

evaluation of learners' L2 ability through their composition and writing activities that

require them to synthesise their knowledge of linguistic rules and components to produce

syntactically and semantically acceptable language. Test tasks such as sentence structure

analysis can help learners gain control over formal accuracy when labelling sentence parts or

combining parts into larger units (Ingram, 1985).

The psychometric-structuralist trend

During the 1960s, behaviourist language learning theories emerging out of the work of Fries (1952) and Skinner (1957) led to a psychometric-structuralist trend in

language testing, also known as the era of scientific language testing (Madsen, 1983). This

trend provided another view of language as a combination of discrete-point patterns that

could be mastered by the habitual practice of stimulus responses (Ingram, 1985). Language


learning aimed at “acquiring conscious control of the phonological, grammatical, and

lexical patterns of a second language, largely through study and analysis of these patterns

as a body of knowledge” (Carroll, 1983). The audio-lingual approach, the oral structural

method, the mimicry-memorisation method, etc. were typical of L2 pedagogy during this

period. Language competence was viewed as the ability to handle discrete elements of the

language system and develop individual language skills.

The discrete-point approach to testing aimed to measure language proficiency by

examining learners' grammar and vocabulary knowledge of the discrete items (e.g. syntax,

correct choice of preposition), and discrete aspects of language skills by treating each of

them separately (e.g. listening for gist, reading for specific information). Discrete-point

language testing became the most widely used approach in the 1960s and 1970s. Tests of this type are still widely practised in several parts of the world today (Giri, 2010).

This approach to testing separate aspects of language received a great deal of

criticism in the L2 teaching community. The counter-argument was that language is not merely the sum of its parts; rather, the discrete parts must be mobilised and mutually integrated

to perform particular tasks in particular circumstances (Ingram, 1985). Language testers

should pay more attention to the development and measurement of communicative

competence rather than those of linguistic competence (Weir, 1988). This suggestion is in

line with beliefs that instead of establishing a learner’s L2 knowledge in terms of skills and

elements, testers should attempt to test his/her ability to perform in a specified

sociolinguistic context (Spolsky, 1978; Morrow, 1979).

The psycholinguistic-sociolinguistic trend

L2 learning involves mastering its skills (listening, speaking, reading and writing) and

elements (pronunciation, vocabulary, grammar, etc.). Nevertheless, a language is not learnt

as isolated components of a system but in its entirety, i.e. language skills and elements are

learnt together in their mutual relation. This integrative (or holistic) perspective of language

and language learning implies that language learning is a unitary process that goes beyond

the mastery of individual skills and elements and includes an appropriate organisation of

these components in diverse social situations (Giri, 2010).


The psycholinguistic-sociolinguistic trend, also referred to as the integrative-

sociolinguistic trend (McGarrell, 1981), came into being as an opposing trend to the

discrete-point approach to language testing. It acknowledges that language ability is

revealed in actual performance rather than the accumulation of discrete language elements.

Language learning is more than gaining control over a set of structures or usage, so the

purpose of language testing should be oriented to assessing communicative competence

demonstrated by a candidate's performance in a given social context (Howard, 1980; Weir,

1988).

1.1.3 Context of the research

Vietnam is a Southeast Asian country with a history stretching for over 4,000 years.

Vietnamese people have possessed a traditional fondness for learning (Zhao et al., 2011; Le

V. M., 2016). Education under Vietnam’s feudal monarchies through many centuries was dominantly influenced by Chinese Confucianism, from its script system to ideological

values (Lam & Albright, 2018; Welch, 2010; MOET, 2016a). The appearance of Western

elements in Vietnamese culture can be traced back to Catholic missions established by

Western priests in early decades of the 16th century (Fox, 2003; Alpha History, 2018; Viet

Vision Travel, 2007). Today’s Romanised Vietnamese script, known as Quốc Ngữ

(language of the country), was developed from important contributions by the French Jesuit

missionary Alexandre de Rhodes during the 17th century (MacKinnon & Hao, 2014; Pham

K., 2017). The teaching of Quốc Ngữ started in the South of Vietnam at the beginning of

1879 (X. V. Hoang, 2006). The policy of Quốc Ngữ utilisation expanded to education in the North in the early 20th century (Tran, 2009), and the script gradually became the Vietnamese people’s popular means of communication.

During almost 100 years of French colonial rule, from the mid-19th century until 1954, French education was imported, and French gradually became the official medium of instruction at all levels in Vietnam. This period witnessed “a mixed education system with

French schools, Franco-Vietnamese schools and Confucian feudalist schools and classes

existing side by side” (M. H. Pham, 1991 , p. 6). French schools were established “for

Europeans and not necessarily for Vietnamese” (Kelly, 1978, p. 96). Franco-Vietnamese


schools combined both the French and Vietnamese languages in teaching subjects related to

Vietnam such as morals, literature, geography, history, etc. These bilingual schools

inspired intellectuals to welcome the new Western-style learning, seeing it as a good approach to enlighten the people, develop the country, and consequently step towards national liberalization. (Tran, 2009, p. 19)

The employment of French in governmental affairs and official examinations

restricted the influence of Chinese culture and the use of Mandarin Chinese calligraphy (T.

G. Nguyen, 2006; H. T. Do, 2006). Colonisation resulted in “the fall of the old Mandarin

class and the rise of a new elite of French-speaking Vietnamese administrators” (Wright,

2002, p. 229). Meanwhile, the call for maintaining and developing the Vietnamese

language received great support from patriotic Vietnamese intellectuals (e.g. Association

for Quốc Ngữ Dissemination, Dong Kinh Nghia Thuc Movement) and local Vietnamese

press (e.g. Gia Dinh Newspaper, Dong Duong Magazine, Nam Phong Magazine) not only

for the sake of literacy enhancement but also for national destiny (H. H. Nguyen, 2017; V.

S. Nguyen, 2016; V. P. Le, 2014; H. A. Do, n.d.). On the one hand, the Vietnamese

intelligentsia of this period valued French in enriching knowledge about the French

civilisation and new advancements from the West. On the other hand, they did not advocate

the adoption of French as the country’s official language, but asserted that Quốc Ngữ was

the only language suitable for Vietnamese people and culture and should be consolidated for the nation’s existence and development (T. P. H. Tran, 2009; V. P. Le, 2017). Success of

the August Revolution of 1945 in northern Vietnam contributed to making Quốc Ngữ the

national and official language to achieve “a goal of full literacy throughout the population”

(Wright, 2002, p. 233).

The popularity of Quốc Ngữ, which uses the Latin script, created favourable

conditions for the introduction of English into Vietnam during the French colonial regime.

The French rulers needed English speakers capable of working in diplomatic affairs and

commercial services in Vietnam. For this reason, English was only promoted as a required

subject in senior high schools, not at lower educational levels (L. T. Lam, 2011). The

educational policy towards English in particular, or any other foreign languages in general,

from a historical perspective, “has been a barometer of Vietnam’s relations with other


countries and how the foreign language curriculum has been directly affected by those

relations” (Wright, 2002, p. 226) in each specific period. In the 20th century, “political and economic factors were identified as most important in determining what foreign languages were to be promoted” (Do, 1996, p. i).

1.1.3.1 Historical overview of EFL education in Vietnam

The history of English language education in Vietnam can be divided into two periods: (1)

before 1986 and (2) from 1986 up to the present. The year 1986 was chosen as a dividing

point of time because this year marked the Vietnamese government’s initiation of its

overall economic reform (Đổi Mới), implementing the open-door policy to the world “as a

departure from obsolete dogmatism” which had lasted for over a decade after Vietnam

became an independent country. English has emerged as a dominant foreign language in

Vietnam since then.

(1) Teaching English in Vietnam before 1986

English in Vietnam in the past experienced many ups and downs. English language

teaching (ELT) in Vietnam before 1986 can be subdivided into three periods: (i) from the

beginning of the French colonisation of Vietnam up to 1954, (ii) from 1954 to 1975, and

(iii) from 1975 to 1986.

Before 1954, English education was limited due to the promotion of French during

the French colonial period. However, English occupied an important place in the secondary language curriculum and became a mandatory subject in senior high schools. There are

almost no extant writings on ELT in Vietnam prior to 1954. What has been kept until today

are some English textbooks designed by French authors (e.g. L’anglais Vivant: Classe de

sixième, L’anglais Vivant Classes de troisième, 1942), and a few English-Vietnamese

dictionaries compiled by Vietnamese scholars. The contents of those teaching materials

include pronunciation drills and reading comprehension practice. The chief method of

teaching English in Vietnam during that time was the grammar-translation method (Hoang,

2010).

From 1954 to 1975, Vietnam was divided into two parts – North and South, as a

consequence of the Geneva Agreements signed in Switzerland in 1954, which resulted in two governments in the same country: the Democratic Republic of Vietnam (the North) and the

Republic of Vietnam (the South). During that period, Russian was studied in North

Vietnam as the main foreign language in the formal educational system to enable direct

interactions with the Soviet Union, whose major aid effort was “to help “industrialize”

Vietnam to show Southeast Asia the practical benefits of socialist orientation” (Radchenko,

2018). English was the dominant foreign language in South Vietnam only and was taught

for direct interactions with the U.S.A. ELT in South Vietnam was facilitated with a free

supply of English textbooks and teaching resources from the US Government (Tran, 2014).

The purpose of foreign language teaching was to equip students with sufficient knowledge

and communication skills in English to work with foreign organisations and promote co-

operation with advanced capitalist countries (T. H. Dang, 2004).

After the reunification of the country in 1975, foreign language teaching in Vietnam

was characterised by the dominance of Russian over English and French in the national

education system (Be & Crabbe, 1999). From 1975 to 1986, learners of Russian received

educational aids from the Russian Government. Every year, hundreds of Vietnamese

teachers and students were sent to the former Soviet Union for both undergraduate and

graduate studies. While Russian held the dominant position among foreign languages in Vietnam, English was taught in only a limited number of high school classes in big cities. Tertiary-level education witnessed a decrease in the number of students enrolling in English both as a discipline and as an elective subject (Hoang, 2010).

(2) Teaching English in Vietnam from 1986 up to the present

The period from 1986 up to the present is characterised by the rapid EFL development and

expansion in Vietnam. The English learning movement commenced in late 1986 as

Vietnam opened its diplomatic door to the whole world. In the context of economic

integration, English became the first option for the majority of language learners. A new 7-year series of English textbooks, entitled English 6 through English 12, was introduced for students from Grade 6 to Grade 12 accordingly. Although the textbooks claimed to train EFL students in all four language skills, more opportunity was provided to develop

reading skills while very few tasks were devoted to listening and speaking skills. There was

even a significant decrease in the proportion of listening and speaking activities in the

textbooks for the last three levels (Grades 10 to 12), to less than 20% for speaking and 0% for listening. This imbalance in language skill focus in formal EFL education was

subject to “the backwash effect of the structural end-of-level, graduation examination on

textbook design and methodology” (Be & Crabbe, 1999, p. 136) as the national high school

English examination designed for the 12th graders took the form of paper-based

assessments concentrating on reading comprehension, and grammatical and lexical

questions. This is the most important examination after 12 years of schooling because

“passing the exam certifies young people as having completed secondary education and

paves their way for further education at the university level” (World Bank, 2017).

Vietnam made a notable innovation in 2010 when the Vietnamese Education

Publishing House (VEPH) in collaboration with world leading publishers (e.g. MacMillan

Education and Pearson Education) designed and developed the 10-year English textbook

series (Hoang, 2016), i.e. new generations of school children start learning English three

years earlier than in the past. This reform in textbook development for language learners

from primary to high school aimed to improve the English teaching and learning quality in

Vietnam, and “provide an excellent vehicle for effective and long-lasting change”

(Hutchinson & Torres, 1994, p. 323). There used to be a 3-year English textbook series for

students who started learning English from Grade 10. Today, there are two popular systems

of English textbook series in nationwide use: the 7-year English programme for students

who learn English from Grade 6 to Grade 12, and the 10-year English programme for those

starting from Grade 3. The 10-year English programme – a product of the National Foreign

Languages Project – was initiated in large cities of Vietnam in 2010, first in Ha Noi and

HCMC. This programme aims to achieve comprehensive innovation in EFL teaching and

learning throughout the national education system (MOET, 2010; “Distinguishing English

programmes”, 2018). The methodological approach adopted for the new programme is

communicative language teaching (CLT). As such,

teaching and learning are organised in a diverse communicative environment with

several interactive activities (games, songs, story-telling, riddles, drawing, etc.), in

the forms of individual work, pair work, and group work. (Huy Lan & Lan Anh,

2010)


However, the programme implementation has faced numerous obstacles because of

a shortage of qualified teaching staff and inadequate school facilities (Anh Thu, 2017; Quynh Trang,

2017).

In major cities of Vietnam, secondary students are taught two simultaneous English

programmes: one is designed by the MOET (7-year programme), and the other is an

intensive English programme (10-year) by foreign textbook designers that helps prepare students for international English certificates, e.g. Cambridge ESOL examinations. It is unnecessary to require students who have already learnt English for at least three years to start again alongside those who only began learning English in Grade 6; however, this unreasonable arrangement has persisted for years (My Ha, 2016; Hoang Huong, 2018). The inevitable consequences are boring lessons for both teachers and students, and discrepancies in English knowledge and proficiency among students of the same grade.

Since 2014, Vietnam has adopted a 6-level evaluation benchmark of English

proficiency based on the Common European Framework of Reference (MOET, 2014).

According to this framework (Appendix G.3), primary school students should obtain Level 1 (A1) after 5 years of English training. Level 2 (A2) is required for secondary and vocational training students. High school graduates and non-English-major university graduates are obliged to achieve at least Level 3 (B1) in English proficiency. English major college and

university graduates are expected to reach Level 4 (B2) and Level 5 (C1) respectively

(Thanh Tam, 2016; MOET, 2014).
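For ease of reference, the level-to-stage mapping described above can be summarised as a simple lookup. The Python sketch below is illustrative only: the stage labels and data structure are my own shorthand for the benchmark in Appendix G.3, not part of the MOET (2014) framework itself.

# Illustrative summary of the MOET six-level, CEFR-based benchmark described
# above; the stage labels and data structure are hypothetical, not official.
MIN_LEVEL_BY_STAGE = {
    "Primary school graduate": "Level 1 (A1)",
    "Secondary / vocational training student": "Level 2 (A2)",
    "High school or non-English-major university graduate": "Level 3 (B1)",
    "English-major college graduate": "Level 4 (B2)",
    "English-major university graduate": "Level 5 (C1)",
}

for stage, level in MIN_LEVEL_BY_STAGE.items():
    print(f"{stage}: minimum {level}")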

1.1.3.2 Current situation of Vietnamese EFL education

In the present situation of internationalisation, the English language plays an increasingly

important role in many fields such as economics, science, education, etc. in Vietnamese

society (Hoang V. V., 2013; Phung V. M., 2015). Like many other Asian countries,

Vietnam considers foreign language teaching and learning, particularly English, to be “a

national mission” specified in its educational innovation strategies (Phan L. H., 2013, p.

162). Teaching and learning English in Vietnam have grown at “an unprecedented speed”

with the establishment of numerous foreign language centres (Do H. T., 2006). English has

become a compulsory subject in the national education system at both secondary (Grades 6-

9) and high school (Grades 10-12). At primary level (Grades 1-5), English is taught as an


elective subject (Hoang V. V., 2010) depending on specific conditions and needs of each

locality. There has been a growing need for Vietnamese learners to have their English

proficiency assessed and certified for employment, job promotion, graduation

acknowledgement, pursuit of higher education, studying abroad, etc. (Qandme, 2015;

Nunan, 2003).

My study on oral language testing and assessment was derived from the current

situation of English education in Vietnam, which I will discuss with regard to the crucial

role of English in socio-economic development, the ongoing challenge of L2 competency

testing and certification, and the adaptive application of a common reference framework in

foreign language assessment as an ultimate requirement of EFL education in the context of

international educational integration.

(1) Role of English education in socio-economic development of Vietnam

The initiation of political and economic reforms (‘Đổi Mới’) in 1986 marked a growing

recognition that foreign language learning plays a strategic role in facilitating social change

(Bui & Nguyen, 2016; Nguyen, 2012). Since Vietnam joined the World Trade Organisation

(WTO) in 2006, English has become an essential foreign language in the national education

system (Nguyen N. T., 2017; Ton & Pham, 2010). The growth of the English learning

movement in Vietnam has been in line with the popularity and wide spread of EFL teaching

and learning in South-East Asian countries and in the world (Kam, 2002).

Vietnamese educators assert that enhancing English competence is a step in the right direction towards helping Vietnamese education integrate into world education (Bao Dien

Tu Giao Duc Viet Nam, 2015). This strategy, as part of the National Foreign Languages

Project for the period from 2008 to 2020, aims at making substantial changes to foreign

language education in Vietnam, especially English (Vietnamese Government, 2008). This

project formally addresses the nationwide issues of teacher training and language teaching

quality (Nguyen C. D., 2017), and more broadly, its goal is to arouse the entire society’s

interest and create a universal motivation for English learning.

The increased use of English in universities, via imported educational programmes and English as a medium of instruction (EMI), was strongly encouraged as part of governmental policies to promote global cooperation and the modernisation of Vietnamese education (Vu P. A., 2018). Vietnam’s integration with the world attracted significant

investment from international organisations such as the World Bank (WB) and the United

Nations Educational, Scientific, and Cultural Organisation (UNESCO). WB-funded higher

education projects fostered the adoption of foreign programmes in English, which

contributed to the standardisation and competitiveness of Vietnamese tertiary curricula (Dang Q. A., 2009). Despite the availability of many other foreign languages (e.g. French, Japanese, Chinese, Korean), most Vietnamese students choose to learn English because it offers more well-paid job opportunities, access to information on the Internet, and chances to study abroad in advanced education systems such as those of the United States, the United Kingdom, Australia, Canada,

etc. Recently, many Vietnamese students have chosen to study in Asian countries like

Singapore, Thailand, Japan, or South Korea, where English is an important medium of instruction and knowing English is a considerable advantage for intercultural

communication in these countries (Le & Chen, 2018). English is now widely used in

various activities at school, at work, and in daily communication since millions of English-

speaking people from different countries come to Vietnam for travel, work, and education

(Thanh Binh, 2015; Le Q. K., 2016; Lam & Albright, 2018).

(2) Challenges in teaching and testing oral skills in Vietnam

Teaching and learning EFL for the purpose of communication has been an ultimate

necessity from primary schools to universities to support Vietnam’s open-door policy and

global integration. At no time in the recent history of Vietnam has there been such an

apparent shift from the grammar-translation method towards a communicative approach,

particularly focusing on the development of oral communication skills (Bui H., 2006; Le S.,

2011). In CLT classrooms, “the objective of teaching spoken language is the development

of the ability to interact successfully in that [target] language, and that this involves

comprehension as well as production” (Hughes, 2003, p. 113). To evaluate learners’

language acquisition and their ability to use spoken English for communication, oral test

development and administration should be placed under the serious consideration of policy

makers, test designers, and examiners.

At most universities in Vietnam, students’ performances in a test for a unit of study

(or module) are assessed using a 10-point grading system. The result will be converted into


letter grades, e.g. from A through D or F (MOET, 2007). This academic evaluation policy

requires more meticulous consideration to achieve the highest precision and objectivity in

scoring because numerical test results contribute to making important educational decisions

and orienting pedagogical activities (Language Testing Service, 2003). A disparity of 0.5 in

test scores, or even 0.1 according to institutional scoring regulations, may result in a

difference to the ‘Pass’ or ‘Fail’ status of a candidate, an obligation of re-enrolment (and

consequentially re-payment) for a ‘Fail’ subject, decisions regarding scholarship awards,

graduate degree classification, etc.
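To illustrate how such a conversion interacts with the pass threshold, a minimal sketch follows. The cut-off values are assumed for demonstration only; the actual thresholds are defined by institutional regulations under MOET (2007) and may differ.

# Hypothetical conversion from a 10-point score to a letter grade; the
# cut-offs are assumed for illustration, not taken from MOET (2007).
def to_letter_grade(score: float) -> str:
    if score >= 8.5:
        return "A"
    if score >= 7.0:
        return "B"
    if score >= 5.5:
        return "C"
    if score >= 4.0:
        return "D"   # lowest passing grade under these assumed cut-offs
    return "F"       # fail: re-enrolment (and re-payment) may follow

# A disparity of only 0.1 can flip a candidate between 'Pass' and 'Fail'.
print(to_letter_grade(4.0))  # D
print(to_letter_grade(3.9))  # F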

Despite many attempts to improve the quality of EFL teaching and learning such as

teacher training programmes, international TESOL conferences, teaching material changes, curriculum restructuring, etc., the most recent statistics from the MOET indicate that:

University graduates who can meet employers’ English language requirements account for only about 49%, approximately 18.9% of graduates cannot, and 31.8% of

them need further training. It means that more than half of graduates do not

sufficiently meet English language requirements. (V. Le, 2016)

Some Vietnamese university graduates have obtained sufficient credits for their

English modules at school, even holding international English certificates (TOEIC,

TOEFL, etc.) with a satisfactory score for graduation acknowledgement, but they cannot

speak English fluently (Thanh Ha, 2008; Moc Tra, 2017; Nguyen X. Q., 2016). It seems

that the question raised over a decade ago regarding the quality of university students’

English education currently remains an unsolved problem: after learning English for 10 years, students still cannot use the language (Vu, 2007). Four deficiencies associated with the EFL

training in Vietnamese tertiary education are (1) lack of attention to the English level of

student intake, (2) lack of comprehensive criteria to evaluate learners’ level, (3) lack of

concern about the role of testing and assessment in the training process, and (4) lack of

consideration of the learner’s role in the training process (Vu, 2007).

EFL testing and assessment in general and spoken English evaluation in particular

demonstrate noticeable shortcomings from secondary to tertiary level in Vietnam. These

deficiencies have direct or indirect impacts on young learners’ English-speaking ability,

from their early years of English learning at lower secondary school to their subsequent years of higher education at university. CLT started as a new approach to L2 pedagogy in

the 1970s and 1980s (Richards, 2006; Pham H. H., 2007; Walia, 2012; Murphy & Baker,

2015) and has been adopted into teaching English in Vietnam since the early 1990s (Ngoc

& Iwashita, 2012). The main goal of CLT is the teaching of communicative competence

rather than grammatical competence as in traditional lesson formats (Richards, 2006). The

practice of CLT does not neglect the role of grammar in language learning, but makes a

clear distinction “between knowing various grammatical rules and being able to use the

rules effectively and appropriately when communicating” (Nunan, 1989, p. 12).

Communicative competence emphasised in CLT includes the ability to use the target

language for different purposes and “maintain communication despite having limitations in

one’s language knowledge (e.g. through using different kinds of communication

strategies)” (Richards, 2006, p. 3). To achieve those objectives, CLT classrooms are

characterised by “practical hints and communicative activities designed to stimulate

conversation and communication among students” (Banciu & Jireghie, 2012, p. 96) such as

role-play, interviews, information gap activities, pairwork, groupwork, etc.

CLT has been implemented for almost three decades in Vietnamese EFL education

since the 1990s (Ngoc & Iwashita, 2012). However, oral testing to assess Vietnamese EFL

learners’ communicative competence is not included in the national graduation exams for

high school students. This is the most important examination of students’ 12 academic years of formal education, from primary school through the completion of high school: they must pass it and try to obtain the highest scores they can to compete for university admission.

The score of the English subject is calculated together with three or four other subjects’

scores to decide whether the student can pursue further education at tertiary level. A

difference of 0.25 points in the total score might decide whether a student will be admitted

to a university or not (Dieu Linh, 2018). Nevertheless, the national graduation exam papers

of English for high school students have not been designed in a standardised format but

have been altered year after year in the proportion of multiple-choice items and self-

constructed responses.

Innovations in foreign language examinations are like getting stuck in a roundabout

circle. Before 2006, candidates in the university entrance examination took the


English test comprising 20% multiple choice format and 80% creative writing.

From 2006, the English test is 100% multiple choice format. After that, the

multiple-choice format was criticised for not being able to evaluate students’

authentic ability, far away from practical skills, and then the English test was

adjusted to 80% multiple choice and 20% for creative self-constructed responses.

The change was maintained for two years. The 2017 exam had 100% multiple

choice format again. (Thanh Tam & Phuong Hoa, 2016)

English lessons at high schools focus mainly on grammar, vocabulary, and written

tests. Students virtually do not have listening and speaking practice, or group discussions

(Minh Nhat, 2007). Indeed, there exists a remarkable gap between methodological guidelines and classroom practices of English teaching, as indicated in the following

comment:

although the rhetoric of the Vietnamese Ministry of Education and Training stresses

the development of practical communication skills, this is rarely reflected at the

classroom level, where the emphasis is on the development of reading

comprehension, vocabulary and structural patterns for the purposes of passing the

end-of-school and university entrance examinations into colleges or universities.

(Hoang V. V., 2010, p. 16)

Oral assessment at tertiary level is included in the English major training

programmes in the form of mandatory end-of-course tests for the Listening-Speaking

modules. There is a high demand for spoken language ability from English majors because

these students need oral skills in English for studying their major subjects (e.g. linguistics,

translation-interpretation, language teaching, public speaking, etc.), and for professional

practice after graduation (e.g. office staff, translators, interpreters, language teachers, etc.).

The rating and marking are usually performed by the teacher-in-charge of Listening-

Speaking classes. What is to be assessed (the content) and how it is assessed (the method) are usually predetermined by the English faculty and relatively identical across classes of the same institution. Nevertheless, most Vietnamese universities do not require non-English

majors to take school-based speaking tests. Only about 65% of non-English majors

understand the importance of learning English. Students do not engage in serious study of English but learn it merely for form’s sake to obtain a programme completion certificate

(Anh Tu, 2013). This limitation would affect opportunities and competitive advantages for

career development, income, and higher education. End-of-term English tests for non-English majors focus mainly on grammatical accuracy, vocabulary range, reading comprehension, and sentence or short paragraph writing. Neglecting the assessment of

speaking skills leads to the possibility that “teachers may simply not teach certain important

skills if they are not in the test” (Weir, 2005, p. 18). The reasons institutions do not administer oral tests are numerous: students’ weak speaking skills resulting from their high school English learning, as presented above; difficulties in managing large classes of 50

or more students (Hoang, 2010); time-consuming administration due to insufficient

facilities for computer-based testing (CBT); teachers’ lack of training for oral assessment

(Le Nhi, 2008); and the desire to avoid raters’ subjectivity in scoring and candidates’ complaints about test results when speaking performances are not audio-recorded, etc.

Certification of English proficiency, including Speaking skills, is a real need in

today’s changing Vietnamese society, not only for academic acknowledgement, but also for

job applications, career promotion, and higher education admission. However, domestic

certification of English qualifications is not widely recognised. For example, the Vietnam

National English certificates corresponding to three levels: Elementary (Level A),

Intermediate (Level B), and Advanced (Level C), in use since 1993, are no longer considered reliable because extensive test administration and certificate issuance have reduced the quality of these national certificates (Nhu Lich & Ha Anh, 2014; Ngoc Ha &

Tran Quynh, 2014). Further, they are no longer compatible with the CEFR benchmark that

the MOET’s guidelines deploy across institutions (Appendix G.3), and so cannot be

competitive with international English certificates such as TOEIC, TOEFL, IELTS,

etc. Another example is the recent emergence of the Vietnam National University HCMC

English Proficiency Test (VNU-EPT), which can be regarded as a timely response to the need to certify the English competence of undergraduate and graduate students in and

outside the Vietnam National University. The VNU-EPT proves to be a good substitute for

the level A-B-C certificates in that it is low-fee (affordable for most students) and contributes to standardising graduates’ English outcomes against the 6-level CEFR. Nevertheless, there are strategic challenges in making the VNU-EPT popular with test users and


recognised among employers nationwide. A statement by a university principal in HCMC

indicates that the social reputation of the VNU-EPT certificate needs further enhancement:

Although the fee for the Vietnam National University - HCMC English Proficiency

Test (VNU-EPT) is very low in comparison with that for international English

certificate exams, very few students want to take the test. Most students suppose

that enterprises never use this type of certificate. In order to popularise the VNU-

EPT effectively, in my opinion, we need to establish the prestige of the certificate to

society, especially to job recruiters. (Phuong Chinh, 2017)

The above-presented context of EFL teaching, learning, and testing encouraged me

to conduct a research study into the practice of oral assessment in Vietnamese tertiary

education – an important level of training that determines qualifications and the success of

the professional labour force for the nation. Not only does the current situation indicate a persistent weakness in educational practice, but it also has negative effects on graduates and, ultimately, the entire society. The following quote from an expert in language

testing and assessment summarises the need for a more professional assessment component in Vietnam:

The core problem of enhancing the quality of foreign language teaching in the

world is how to integrate the three most basic and important components of the

pedagogical process, i.e. teaching, learning, and testing. Particularly for Vietnam at

the present time, testing and assessment is the weakest, and therefore, needs the

most attention. (Vu, 2007)

(3) Current situation of L2 competency testing and certification in Vietnam

In 2008, Vietnam launched the Project “Teaching and learning foreign languages in the

national education system for the period from 2008 to 2020” to achieve significant progress

on language competency and professional skills performance for human resources,

especially in some prioritised sectors (Vietnamese Government, 2008). The Project set as its general goal that, by the year 2020, most young Vietnamese graduates from vocational schools, colleges, and universities would be able to use a foreign language

independently (Le & Chen, 2018; P. A. Vu, 2018). In supporting the 2020 Project, the

Common European Framework of Reference for Languages (CEFR) has been utilised in


Vietnam as a national reference framework to assess language proficiency, justify

curriculum design, establish teaching and learning strategies, and measure the outcomes to

ensure the compatibility of different stages of foreign language teaching and learning in the

educational system (MOET, 2014). The CEFR covers all the main language skills

(listening, speaking, reading, and writing) at six levels from low to high: A1-A2 (Basic

User), B1-B2 (Independent User), and C1-C2 (Proficient User). The Framework provides a

comprehensive description of “what language learners have to learn to do in order to use a

language for communication and what knowledge and skills they have to develop so as to

be able to act effectively” (Council of Europe, 2001, p. 1). This establishment of common

reference points accompanied with coherent scales of illustrative descriptors has proved to

be a useful tool for the assessment of language competence in different national and

cultural contexts, amongst which is Vietnam:

The framework is a series of descriptions of abilities at different learning levels that

can be applied to any language. It can provide a starting point for interpreting and

comparing different language qualifications and is increasingly used as a way of

benchmarking language ability around the world. (International English Language

Testing System, 2018b)

From an educational perspective, the CEFR may be beneficial for both language

learners and teachers (Nguyen, N. T., 2017; Wu & Wu, 2017; Little, 2005). Learners can

refer to the “self-assessment statements to monitor their progress and these will provide a

set of goals for each skill at each level of their learning” (English Profile, 2015). The

following are examples of ‘Can Do’ self-assessment descriptors (Appendix G.2) provided

for B2 spoken interaction:

I can interact with a degree of fluency and spontaneity that makes regular

interaction with native speakers quite possible. I can take an active part in

discussion in familiar contexts, accounting for and sustaining my views. (Council of

Europe, 2001, pp. 26-27)

and for spoken production:


I can present clear, detailed descriptions on a wide range of subjects related to my

field of interest. I can explain a viewpoint on a topical issue giving the advantages

and disadvantages of various options. (Council of Europe, 2001, pp. 26-27)

The CEFR can provide teachers with “a much clearer picture of what learners at a

given level are capable of” and a full understanding of what to do to enhance learners’

language proficiency from one level to the next (English Profile, 2015). Nevertheless, the CEFR is a complex document, and “its aim is not that of providing a ready-made,

universal solution to the issues related to assessment” (Piccardo, 2012, pp. 51-52). In any

practical approach to a particular testing context, principles and guidance suggested by the

CEFR may make a significant contribution to quality improvement in language pedagogy

by “provid[ing] a uniform means of assessment across languages” (Yoneoka, 2014, p. 87)

and by “stimulat[ing] curriculum and examination reforms in different educational sectors”

(Martyniuk, 2008, p. 12). The MOET’s decision on the official application of the six-level

CEFR-based reference frame has helped to enhance transparency and coherence in English

proficiency standardisation through different sectors (e.g. English for young learners,

General English, Academic English, Business English, etc.) and stages (e.g. Primary,

Secondary, College/Vocational training, and University) in Vietnamese EFL education

(Appendix G.3).

The CEFR is not intended for any specific language, but serves as a common

reference tool across the world’s languages. The openness and adaptability of the CEFR have seen it received with enthusiasm in the community of language testers in many countries

where English is not necessarily a foreign or second language. The CEFR is adaptable for

use in different circumstances, and capable of further extension and refinement (Yoneoka,

2014). On the other hand, the Framework has brought about challenges in determining

score compatibility across different tests in relation to the CEFR framework, i.e. how score

results from different tests are related to one another and to what extent the results can be

considered to be equivalent (Wu & Wu, 2007). Language testers are concerned about the

adaptability of the CEFR in professional training contexts (Hiranburana et al., 2017). Policy

makers have raised questions about defining the level of communicative competence required

for the CEFR-based accreditation of another language (Montagut & Murtra, 2005). There


are concerns about the CEFR’s validity and unexpected changes to the original language

proficiency scale (Nagai & O’Dwyer, 2011).

The adoption of the CEFR-based assessment has established requirements and

challenges for EFL education in Vietnam. The Framework contributes to the alignment of

Vietnamese foreign language assessment with the world’s standards. Narrowing the gap

between oral testing in Vietnam and that in the world is a vital factor for enhancing the

quality of EFL teaching, learning, and assessment in the Vietnamese educational setting.

1.2 Research questions

My overview of the research context highlights the fact that Vietnamese education needs

more effort to improve the quality of foreign language testing in general and EFL oral

assessment in particular. Testing spoken English in Vietnam has not received adequate

consideration at secondary and high schools, or in most non-English-major classes at universities. My study aims to draw language testing practitioners’ and researchers’

attention to school-based oral assessment by seeking answers to three main research

questions:

(1) To what extent can assessing English speaking skills at Vietnamese universities

measure what it is intended to measure?

I divided this question into three sub-questions to consider associated aspects:

1a. How is oral English assessment administered across Vietnamese universities?

1b. To what extent is the test content relevant to the course objectives?

1c. Are the characteristics of the test tasks appropriate and fair to test takers?

(2) To what extent are the raters consistent in their rating process?

(3) What washback effects does the oral test have on EFL teaching and learning?

Current trends in oral language testing made me realise that there is a significant

gap between assessing speaking EFL skills in Vietnam and that in the world. Bridging the

gap is crucial to enhance the quality of EFL teaching, learning, and assessment in the

Vietnamese educational context. The purpose of achieving a deeper and more


comprehensive understanding of oral assessment in theory and practice motivated me to

conduct this research study.

Research Question (1) concerns the effectiveness of tertiary-level oral assessment and the degree to which a test can measure what it claims to measure. This

question involves three important aspects in the face-to-face speaking paradigm of an

educational test, i.e. administrative setting, content, and assessment tasks.

The practices of teaching and testing oral skills differ across Vietnamese higher education institutions as “at tertiary level, the question of what to teach is left to each particular

institution to decide” (Hoang, 2010, p. 13). The MOET’s macro-management provides

regulations on the minimum amount of knowledge, and the qualification requirements that

students are to meet after graduation for each training level of tertiary education (MOET,

2015). Each institution designed its own teaching syllabus for English speaking skills, selected its own course books, and thus adopted its own testing method. I examined variables in

test preparation and administration to identify the existing strengths and shortcomings of

each method in the practice of spoken English (SE) testing because the circumstantial settings

in which a test takes place can have significant influence on test takers’ performance

(Taylor, 2011). These variations included administrative arrangements (e.g. supervision,

security, physical conditions, emergency procedures for unplanned incidents, etc.), test

conduct (e.g. interlocutor/examiner, and candidate arrangements), and speaking test

materials (e.g. testing booklet, interlocutor outline, scoring guidelines, assessment criteria

and scoring scale). I examined these differences with reference to uniformity requirements

for school-based test administration and the potential influence on raters’ and candidates’

performance. Data for examining administrative procedures were derived from documents

and observations.

I explored the degree to which a test content’s relevance was established in the

design stage. This was an important aspect in validating achievement tests as a measure of

what students had learnt from what had been taught in the syllabus, course books,

supplementary materials, etc. (Davies, 1990, p. 6). My data serving this goal were EFL

experts’ judgements that helped to identify whether the domain of language knowledge and

skills tested could be “representative and relevant samples of whatever content or abilities

the test has been designed to measure” (Brown & Hudson, 2002, p. 213). This characteristic

makes school-based oral tests different from language proficiency testing in that the latter

does not restrict the training individuals may have received in a language, but measures the

overall ability to use that language for real-world purposes (Swender, 2003; Hughes, 2003).

The inclusion of test contents as a source of data for this research aimed to learn about how

the test designer took into account the way language and language use were incorporated in

the test construct (McNamara, 2000, p. 25). This facet of test content itself did not provide

sufficient evidence for test validation because it did not entail inferences about test takers’

speaking abilities (Bachman, 1990, p. 247), but it relied on the test designer’s decisions

about what language contents were to be sampled.

Speaking tasks play a vital role in oral assessment in that they “outline the content

and general format of the talk to be assessed, and they also provide the context for it”

(Luoma, 2004, p. 29). The oral assessment being studied was designed in accordance with

the B.A. training programme employed at each institution. Tasks created necessary contexts

for candidates’ oral performances. I compared and characterised the features of these tasks

to identify which were unique and which overlapped, and how effective they were in eliciting

intended speech samples for marking. Data to learn about speaking tasks were gathered

from document analysis (test material, course outlines), and task performance (recordings

of speech samples).

Research Question (2) aimed to examine raters’ characteristics and their rating

performance. In the scoring process, raters made judgements and decisions on scores for

candidates’ speaking. I focused on the assessment criteria and rating scales adopted across

institutions. These variables were dependent upon the development of tests with reference

to the course objectives established by each institution. I considered features of the

assessment settings associated with timing, room layouts, and environmental conditions

that had the potential to impact on the rating process and scoring validity. I analysed how

individual rater characteristics could shape the way they interpreted and applied the criteria and scale, the way they made assessments, their tendency towards leniency or severity, and the

consistency of their rating behaviour (Taylor & Galaczi, 2011, p. 175). In order to capture a

general picture of the evaluation outcomes, I analysed the test scores that candidates were


awarded at each institution. These sets of scores also helped me to learn about how

consistency was ensured across raters (inter-rater reliability).
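As a minimal sketch of one common way to quantify such consistency, the example below computes a Pearson correlation over paired scores from two raters. The data are invented and the choice of statistic is an assumption for illustration, not necessarily the procedure reported later in this thesis.

# Illustrative inter-rater consistency check: Pearson correlation between
# two raters' scores for the same candidates (invented data).
from statistics import correlation  # available in Python 3.10+

rater_1 = [6.5, 7.0, 5.5, 8.0, 6.0, 7.5, 5.0, 8.5, 6.5, 7.0]
rater_2 = [6.0, 7.5, 5.5, 8.0, 6.5, 7.0, 5.5, 8.0, 6.0, 7.5]

r = correlation(rater_1, rater_2)
print(f"Inter-rater (Pearson) correlation: {r:.2f}")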

Research Question (3) aimed to investigate the impact of oral assessment on EFL

teaching and learning. As a type of summative assessment after a course of instruction, the

oral test provided information “to help evaluate the effectiveness of programs, school

improvement goals, alignment of curriculum” (Garrison & Ehringhaus, 2007), and

adjustment of pedagogical activities in that educational context, also referred to as

“measurement-driven-instruction” (Popham, 1987). I examined this washback effect of the

test on EFL teaching and learning, and on stakeholders’ attitudes towards language

education. Data for this question came from surveys and interviews with raters and

candidates after the test event at each institution.

I studied the washback effect from the perspectives of the stakeholders directly

involved in the oral assessment. Their opinions about and experiences with the test were

validation evidence to conclude whether these effects were positive, negative, neither of

these, or both (Cheng, Watanabe, & Curtis, 2004). Examining the test’s impact in the

context of Vietnamese EFL education where it was administered would provide insights

into the consequential validity of language testing and assessment. The oral test was not the

final examination in the B.A. programme for EFL majors, but its purpose was to evaluate

students’ academic achievement after a course of training. It served as a prerequisite to

promote learners to the next level in their language study. Therefore, the impact of this test

was important in the continual cycle of teaching-learning-assessment at educational

institutions.

1.3 Significance of the research

I conducted this study project at a time when Vietnamese education had attached more

importance to the quality enhancement of EFL pedagogy than ever before. At tertiary level,

ensuring the quality of graduates’ foreign language proficiency is crucial to meet the needs

of global communication skills in the domestic labour market as well as in regional and

international exchanges (MOET, 2014; Vietnamese Government, 2017). An investigation

into the current situation of EFL testing and assessment constitutes an important

contribution to enhancing the quality of Vietnamese tertiary education.


Language assessment is a valuable tool for providing information relevant to

various kinds of research in applied linguistics. We might employ a variety of assessment

methods to understand the nature of oral language ability and how this associates with the

quality of speaking. An investigation into assessing speaking skills, therefore, may help to

identify factors that have potential influences on the rating process. An integration of data

collected from examiners and candidates should provide a fuller description of how the

adoption of school-based testing methods aligns with the established objectives in L2 education. A comparability study of different institutional testing methods will enable

language teachers and test administrators to identify strengths and weaknesses of each

method to plan, develop, and justify appropriate assessments of their own, contributing to a

construction of a common ground for quality assurance in education across tertiary

institutions.

This study aimed to provide an in-depth understanding of oral assessment in

educational contexts. Identifying and addressing difficulties in spoken English (SE) testing

will help school-based assessment to be enhanced in accuracy and fairness, which is a

legitimate expectation of not only candidates but also examiners and administrators. It will

orient students’ attitudes towards learning English-speaking skills in particular, and EFL study in general, in a more positive direction once the role of SE is fully recognised and its

assessment is carried out properly. The research results will be used to adjust the

approaches to teaching, learning, and testing SE skills in non-native English-speaking

countries like Vietnam.

Results from a study on L2 performance in testing situations will provide practical

feedback on how appropriate the language materials and methodology were for learners in

non-native English speaking countries like Vietnam. Various hypotheses in L2 acquisition

can be generated, tested, and confirmed via findings from the actual testing environment

(Shohamy, 2000). This is an important channel of information that helps to connect

coursebook development with the needs of coursebook users.

In the long run, a multidimensional study of oral assessment for English majors

should lay a crucial foundation and provide encouragement for extending the assessment of speaking skills to non-English majors, who account for the majority of students across Vietnamese


universities, and who determine the overall quality of the EFL pedagogy of the entire

tertiary education system. On a larger scale, the research outcomes should serve as a helpful

diagnosis for local testing validation, endeavouring to narrow the inherent gap between

international English language tests and school-based assessments.

1.4 Organisation of the thesis

My thesis is presented in nine chapters, of which this chapter has been the first. The other

chapters are organised in sequence as follows:

Chapter Two reviews the literature on spoken language assessment relevant to answering the research questions. This chapter discusses two central concerns in

language testing, i.e. how well the oral test can measure what it is supposed to measure

(construct validity), and how consistently the oral assessment is administered across

candidates (reliability). I address key aspects in testing speaking skills in the light of the

framework for oral test validation proposed in Weir (2005). Gaps in understanding and

implementing spoken English assessment in Vietnamese higher education are identified.

There is a focus on how answers to the research questions could contribute to filling these

gaps and improving the quality of assessing foreign language speaking skills at educational

institutions in Vietnam.

Chapter Three justifies the adoption of a mixed methods approach for this study. In

this chapter, I describe the research design in phases of multi-source data collection, data

analysis, and reporting integrated results. I detail the research instruments, including

questionnaire surveys, face-to-face interviews, test room observations, test-related

document and test score analysis. I explain the methods used for analysing each type of

data and combining the results at the end of this chapter. Ethical considerations in conducting this educational study are presented as an obligatory requirement for my involvement with

human research.

The next five chapters aim to report and analyse data collected from the research

sites to answer the research questions. Chapter Four concentrates on examining test takers’ characteristics and test administration as the physical context for oral assessment. Chapter Five is concerned with the relevance of test contents to the course objectives. Chapter Six

is devoted to analysing issues regarding the design of test tasks and their roles in eliciting


candidates’ L2 speaking performance. In Chapter Seven, I investigate the rating process

and raters’ consistency in assessing oral skills. Chapter Eight explores the impact of

oral testing on EFL teaching and learning.

Chapter Nine summarises key points in my findings to answer the research

questions. I present limitations of the study and recommendations for further research in the

field. This chapter provides suggested future implementations for different stakeholders

including raters, candidates, test developers, administrators, and policy makers to enhance

the quality of language testing in general, and oral assessment in particular.

The introductory chapter has presented the research topic and the context for my

study about oral performance assessment in Vietnamese tertiary education. At the

beginning of this chapter I presented a brief history of testing L2 speaking. I provided

information about EFL education in Vietnam as the contextual background for this study.

My presentation continued with the research questions and sub-questions that shaped the

development of the entire thesis. I discussed the significant contribution of the thesis to the

body of knowledge in language testing and assessment. This chapter ends with an overview

of all the chapters that constitute the thesis. The next chapter presents a theoretical

background of oral assessment and a critical elaboration on the literature regarding central

concerns in speaking test validation.


Chapter Two

LITERATURE REVIEW

The development of the assessment of speaking has gone hand in hand with the

emergence of language testing as a recognised sub-field of applied linguistics.

Attitudes to oral assessment have been shaped by the changing currents of research

paradigms in this field and in linguistics more generally. (Hughes, 2011, p. 81)

2.1 Introduction

In the first chapter, I presented an introduction to oral assessment as the research

topic of this thesis. I provided the contextual background and the significant contribution

the study can make. Based on the emerging issues in testing oral skills in Vietnamese

tertiary education, I formulated the research questions that the study aimed to answer. In

Chapter Two, I discuss the essential qualities of a good test for measuring learners’

language ability: effectiveness in measurement (construct validity), and consistency of

measurement (reliability). I continue to present key elements of the validation framework

(Weir, 2005) adopted as a methodological approach for my study. This section includes

theoretical perspectives and results from empirical studies in the field with which my research questions are closely associated. The chapter identifies factors that have

potential influence on the quality of constructing and administering school-based oral

assessment. This chapter ends with a discussion of the impact of language testing on

teaching and learning in educational contexts.

2.2 Key issues in designing spoken language tests

Testing and assessment constitute an important part of every language teaching and

learning process. The assessment of L2 learners’ language skills involves complex issues

including not only “the nature of language proficiency”, but also “the validity of assessment

instruments, the reliability of scores, and the manner in which the whole process may

influence the curriculum” (Elder & Wigglesworth, 1996, p. 1). Traditional testing and


assessment have focused on discrete-point items, an approach which "provides little information on the student's ability to function in actual language-use situations" (Cohen, 1994, p. 161). With

the growth of CLT, the current orientation of language testing and assessment has changed

to emphasise learners’ authentic performance of their communicative skills in the target

language (Chinda, 2009; McDonough & Chaikitmongkol, 2007). Innovations in language

teaching have created a dominant trend in perspectives on how language learners' oral performance should be assessed.

Tests today are mainly concerned with evaluating real communication in the second

language. In this communicative era of testing, we feel that the best exams are those

that combine various subskills as we do when exchanging ideas orally or in writing.

In particular, communicative tests need to measure more than isolated language

skills: They should indicate how well a person can function in his second language.

(Madsen, 1983, pp. 6-7)

The purpose of school-based tests has shifted in perspective “from using assessment

as a way to keep students in their place to using assessment as a way to help students find

their place in school and in the world community of language users” (Cohen, 1994, p. 3).

Establishing what to test and why to test it is among the first steps test developers need to take in deciding how to make a useful test.

ESL/EFL tests today serve a wide variety of purposes. A test’s purpose will

influence its content and approach to ability measurement. Determining “[t]he purpose for

which the test will be used is a normal starting point for designing test specifications”

(Alderson, Clapham, & Wall, 1995). In an educational setting, language assessment is

given for several formative or summative goals (Giri, 2010). Tests are designed to serve

three general purposes: administrative, instructional, and/or research. Each general purpose

can be further specified into several particular purposes. Table 2.1 presents a summary of

test purposes under General and Specific categories (Cohen, 1994; Davies, 1990; Madsen,

1983). These purposes of assessment are not always mutually exclusive; a single form of assessment may be employed for several purposes. An assessment given for an overall purpose, for example, may also be used for identifying areas of difficulty learners find in the language. An achievement test may also be used for encouraging learners to learn and


monitoring learners’ progress. For an instructional purpose, feedback on examinees’

performance is important in that it helps them to develop appropriate learning plans and/or

adjustments. It is normal for proficiency tests obtaining “some measure of acceptance and

institutionalizing to convert themselves into achievement tests” as a result of “the

development of a teaching programme related to the test” (Davies, 1990, p. 21). In other

words, the achievement test and the proficiency test can share some common assessment

criteria that are relevant to what has been taught in the syllabus.

Table 2.1 Purposes of language assessment

Administrative
• Overall: knowing learners' overall language ability
• Placement: placing learners in appropriate ability groups
• Certification: providing proof of language ability
• Promotion: promoting learners to a higher level
• Selection: selecting able students for a purpose

Instructional
• Motivation: encouraging learners to learn
• Diagnosis: ascertaining areas of difficulty in the language
• Prognosis: determining learners' readiness for a course
• Progress evidence: checking whether learners are making progress
• Feedback for examinee: monitoring learners' progress
• Improvement: enhancing teaching or a course
• Achievement: measuring how much learners have learnt after a course

Research
• Evaluation: reviewing materials and methods
• Experimentation: finding out more about language and its learning
• Knowledge: knowing about communication and learning strategies

Test developers need to determine the main purpose(s) of the intended assessment

“so that tests and their resulting scores are not misused or misinterpreted in ways that

negatively affect language programs and learners’ lives” (Bailey & Curtis, 2015, p. 24).

Table 2.2 presents contrasting categories of L2 testing and assessment that help to

determine various aspects of an individual test, e.g. what is to be assessed (knowledge or

performance?), whether subjectivity is to be removed (subjective or objective?), how the

assessment is to be referred to (norm- or criterion-referenced?), when the assessment is to

be made in a course of instruction (formative or summative?), whether the course

objectives are to be included (achievement or proficiency?), and so forth. Table 2.2 introduces these categories in contrasting orientations and approaches (Council

of Europe, 2001, p. 183).

Table 2.2 Contrasting categories of L2 assessment

• Knowledge assessment vs. Performance (or skills) assessment
• Subjective assessment vs. Objective assessment
• Productive assessment vs. Receptive assessment
• Language subskill assessment vs. Communication skills assessment
• Norm-referenced assessment vs. Criterion-referenced assessment
• Discrete-point assessment vs. Integrative assessment
• Formative assessment vs. Summative assessment
• Proficiency assessment vs. Achievement assessment
• Direct assessment vs. Indirect assessment
• Holistic assessment vs. Analytic assessment
• Impression judgement vs. Guided judgement

Note. There is no significance as to whether one term in a pair of categories is listed first or second.

Test categorisation is usually based on only one pair, e.g. a test is either direct or

indirect assessment, although “several labels can be applied to any one test” (Madsen,

1983, p. 9). There are crucial distinctions to be made in association with assessment

regarding test purpose, content focus, rating reference method, scoring rubric type in use,

directness of speech sample, etc. A clear understanding of the various possible assessment

types is very useful in test design since a test of one kind may not be a successful substitute

for that of another (Edutopia, 2008).

Language tests can take various forms according to the types of information they

are intended to yield (purpose), the material input used in test writing (content focus), and

the time when they are administered within or beyond a training programme (setting). Test

classification provides test writers with practical guidance in constructing a test to achieve

its purpose. Answers to a wide range of questions regarding kinds of assessment assist with

categorising test items, texts or test banks accordingly once they have been written and

tried out (Alderson, Clapham, & Wall, 1995). Table 2.3 provides a summary of different test types used in L2 assessment (Hughes, 2003; Davies, 1990; Alderson, Clapham, &

Wall, 1995).

Table 2.3 Kinds of language tests

Placement test
• Main purpose: to place learners at a stage of the teaching programme most appropriate to their abilities
• Content focus: dependent upon the identification of key features at different levels of teaching in the institution
• Example: a preliminary test administered at the beginning of a course

Progress test
• Main purpose: to determine each student's growth of knowledge and skills, enabling more reliable and valid decision making about promotion to the next study phase
• Content focus: longitudinal, repeated assessment of students' functional knowledge and/or learnt skills
• Example: a mini-test or periodical test administered to all students in a programme at the same time or at regular intervals, e.g. on completing each unit or topic

Achievement test
• Main purpose: to establish how successful learners have been in achieving the objectives of a period of instruction or training
• Content focus: directly related to the language course
• Example: end-of-course tests or end-of-semester exams administered at language schools or educational institutions

Proficiency test
• Main purpose: to measure learners' ability in the target language regardless of any training they may have had in that language
• Content focus: not based on the content or objectives of language courses which learners taking the test may have followed
• Example: TOEFL and TOEIC tests by ETS (USA); IELTS by the British Council, Cambridge Assessment English (UK) and IDP Education (Australia); Level A-B-C tests by the MOET (Vietnam)

Diagnostic test
• Main purpose: to identify learners' strengths and weaknesses for further teaching
• Content focus: broad language skills
• Example: a test on Listening, Speaking, Reading, or Writing skills

Aptitude test
• Main purpose: to predict future language learning success
• Content focus: no content (no typical syllabus for teaching aptitude) to draw on, but concerned to predict future achievement
• Example: tests for gifted student selection (e.g. for special training or contests)

My review concerns assessments conducted in pedagogical contexts. The content

for assessment refers to “a sample of what has been in the syllabus during the time under

scrutiny" (Davies, 1990, p. 20). Limiting test content to what is relevant to the syllabus is important

in defining and examining the validity of the assessment. In most testing situations, the

same content is delivered to all candidates taking a test. However, in live oral assessment,


specific test content may vary from candidate to candidate because it is impossible for all

the candidates to take the speaking test at the same time. The test developer may write

different oral test questions to use in the same test event. These questions are expected to be

relevant to the content coverage defined by the course of instruction. There is a need to be

cautious regarding the interpretation of individuals’ performance and scores produced by

tests with content variety:

Even though the content of a given test does not vary across different groups of

individuals to which it is given, the performance of these individuals may vary

considerably, and the interpretation of test scores will vary accordingly. (Bachman,

1990, p. 247)

A useful test does not only attain its intended purpose, but also has to meet many

other requirements. Test usefulness in the real world, according to Bachman and Palmer

(1996), consists of six major components, namely construct validity, reliability,

consequences, interactiveness, authenticity and practicality. To serve the purpose of this

study, I will address the notions of test construct validity, reliability, and consequences in

examining the effectiveness of oral assessment in L2 education.

2.1.1 Construct validity in oral language testing

Language tests can be classified into five types according to their purposes: achievement,

proficiency, placement, diagnostic, and aptitude tests (Brown & Abeywickrama, 2010).

This study was concerned with the school-based assessment of speaking skills as an

achievement test administered at the end of a language skills training course. Measuring

students' speaking skills is generally carried out in accordance with the course objectives and

specific contents covered in the programme.

A test construct is defined as “what is being measured by the test; those aspects of

the candidate’s underlying knowledge or skill which are the target of measurement in a

test” (McNamara, 2000, p. 137). This means that test designers need to clearly

predetermine what construct(s) the test is aimed at before it is administered. For example,

spoken language ability that many speaking tests claim to measure may encompass the

aspects summarised in Figure 2.1 (Cumming & Berwick, 1996, p. 16):

[Figure: spoken language ability branches into language competence and strategic competence. Language competence comprises grammar (syntax, morphology, vocabulary, pronunciation), discourse (rhetorical organisation, coherence, cohesion) and pragmatic competence (sensitivity to illocution); strategic competence covers interaction patterns, interaction skills, and non-verbal features of interaction.]

Figure 2.1 Model of spoken language ability in the Cambridge Assessment

This model of spoken language ability can be further developed into illustrative

scales containing more than 10 qualitative categories relevant to oral assessment if

interaction strategies are regarded as an inherent qualitative aspect of communication:

turntaking strategies, co-operating strategies, asking for clarification, fluency, flexibility,

coherence, thematic development, precision, socio-linguistic competence, general range,

vocabulary range, grammatical accuracy, vocabulary control, and phonological control

(Council of Europe, 2001, p. 193). For the question of feasibility in assessment, a selective

approach to the list of so many categories is recommended since more than four or five

categories may cause cognitive overload for raters. In any practical approach,

features need to be combined, renamed and reduced into a smaller set of assessment

criteria appropriate to the needs of the learners concerned, to the requirements of the

assessment task concerned and to the style of the pedagogic culture concerned.

(Council of Europe, 2001, p. 193)

Any selection and weighting of the components needs to be justified by a clearly

defined rationale (Galaczi & ffrench, 2011). Before participating in a test event, test takers

should be advised of the intended assessment criteria and their weighting since this


knowledge will contribute to their preparation for and performance in the test in which their

language ability is measured and scored (Weir, 2005).

Construct validity constitutes the foundational framework within which to consider

the integrity of test interpretation and use. It refers to “the extent to which we can interpret

a given test score as an indicator of the ability(ies), or construct(s), we want to measure”

(Bachman & Palmer, 1996, p. 21). In this sense a test score helps to justify inferences we

can make to a candidate’s ability (Fulcher & Davidson, 2007), e.g. whether ‘5.0’ on a

speaking test indicates ‘ability to use SE in daily communication’, and whether a decision

made based on the score is justifiable, i.e. if a candidate’s score is below ‘5.0’, then he is

not admitted to the next level programme.

The relationship between test takers’ language ability, test tasks designed to assess

it, test scores and interpretations made from them is presented in Figure 2.2 (Bachman &

Palmer, 1996, p. 22).

[Figure: the score interpretation (inferences about language ability based on the construct definition, and the domain of generalisation) is linked to the test score, which arises from the candidate's language ability and the characteristics of the test task; the links are labelled construct validity, authenticity, and interactiveness.]

Figure 2.2 Construct validity of test score interpretations

Figure 2.2 indicates that when we examine the construct validity of test scores resulting from rating a candidate's language performance, we need to consider both the construct definition from which inferences about the language ability are made and the test tasks which candidates are required to perform using their language ability(ies).

need to consider characteristics of a test task in terms of “authenticity” (the degree to which

it corresponds to the actual domain of the target language use in general) and

“interactiveness” (the extent to which it engages the candidate’s language ability).

It is commonly agreed that construct validity is the most crucial standard in test

validation (Bachman, 1990; Brown, 2000). Messick (1989) argued that construct validity in

educational measurement is a multifaceted but unified concept, including six

distinguishable aspects: content, substantive, structural, generalisability, external and

consequential aspects. I will discuss the content aspect of construct validity as test contents

are among the most important features to consider in designing and using academic

achievement tests.

2.1.2 Content aspect of construct validity

Many language tests are designed to measure candidates’ language levels, i.e. their

language proficiency, regardless of any training they may have had in that language. They

are referred to as proficiency tests, e.g. TOEFL, IELTS, TOEIC, etc. Others are given to

measure how much students have achieved after a language learning process or a specific

study program. These are categorised as achievement tests or prochievement tests (Clark,

1989) that are conducted in L2 classes on a regular basis. In many institutions, achievement

tests usually take the form of classroom-based or end-of-course tests, for which content validity is a central concern. Content validity is defined as "the degree to which the items or tasks that make up a test

accurately and adequately represent the domain in question” (Green, 1998, p. 23). Unlike

proficiency tests that aim to assess overall language competence regardless of what and

how much training candidates previously had in that language, “achievement tests are

directly related to language courses, their purpose being to establish how successful

individual students, groups of students, or the courses themselves have been in achieving

objectives” (Hughes, 2003, pp. 12-13). In educational contexts, the first thing to consider in

designing language achievement tests is to ensure that they “reflect as closely as possible

what and how our students have been taught” (Weir, 1993, p. 28).

Supporting the consideration of content aspect in achievement tests, language

testing specialists recommend “ensuring congruity between curriculum objectives and


exam” (Muñoz & Alvarez, 2010, p. 36) as a way to maintain educational goals on the right

track. Brown and Hudson (2002, p. 213) added that “content validity approaches are much

stronger and more convincing if linked to construct validity approaches”. According to

Hughes (2003), a test is supposed to have content validity if its content constitutes and

represents what it is intended to measure. Content validity in education can be achieved

“through theoretical arguments or expert judgements or both” (Brown & Hudson, 2002, p.

213).

There is evidence that testing has failed when its content validity was not ensured.

In a recent investigation, when testing lacked content validity, students tended to exert

efforts only to gain marks without referring to their course materials as a source of

knowledge (Siddiek, 2010). Achievement tests essentially focus on core syllabus

constituents to identify how much new knowledge and skills the learner has accumulated.

Testing separated from the established content would make most of the teaching

concentrate on examination techniques rather than on authentic teaching, which is

viewed as “failure in the achievement of pedagogical objectives of language education”

(Siddiek, 2010, p. 133).

2.1.3 Reliability

A good test has to be not only valid but also reliable. A reliable test

should produce “consistent and dependable” results (Brown, 2004, p. 20) which test users

can trust. Reliability constitutes “a quality of test scores”, and “has to do with the

consistency of measures across different times, test forms, raters, and other characteristics

of the measurement context” (Bachman, 1990, p. 24). Consistency in testing and

assessment demonstrates test administrators’ effort in ensuring accountability and fairness

to all the candidates taking it. There are a number of factors that can contribute to the

unreliability of a test: inconsistency in raters’ marking, in procedural practices of test

administration, in the students’ characteristics and physical conditions, and in the test itself

(Mousavi, 2002, p. 804).

Hughes (2003, p. 43) claimed that “if the scoring of a test is not reliable, then the

test results cannot be reliable either”. In other words, consistency needs to be established in

the rater’s performance to minimise measurement error so that the test can become


trustworthy in providing stable information about what it is supposed to assess. This

information is particularly important in educational contexts in which teaching, learning,

and assessment have a close relationship with each other (Cheng, 2005; Shepard, 2000).

In speaking tests, different raters may value different features of candidates’

language use differently, and hence give different ratings to the same performance of a test

taker. In such cases, it is possible to estimate the consistency in rating performances by

different raters, i.e. the inter-rater reliability of the scores (Bachman, 2004, p. 169). One

common way to do this is by quantifying the reliability of a test by means of correlation, “a

statistical indicator for the strength of relationship between variables" (Luoma, 2004, p.

182). In theory, values of a correlation coefficient1 can fluctuate between -1 and +1.

Expectations of correlation values vary for different types of language tests. For instance,

the range of a good vocabulary, grammar and reading test usually varies from .9 to .99,

whereas a test of listening comprehension is frequently in the .8 to .89 range, and speaking

tests (subjective assessments) may be in a lower range, between .7 and .79 (Lado, cited in

Hughes, 2003, p. 39). This is because:

language use is a multi-componential phenomenon, requiring interlocutors2 to

negotiate meanings, no two listeners hear the same message. This aspect of

language use is a source of bias in test scores. … if raters focus attention only on

pronunciation, grammar, fluency and comprehensibility, for example, the many

other features of the discourse will not influence them. There is mounting evidence

that this is a vain hope (Douglas, 1997, p. 22).

1 "Values (of a correlation coefficient) close to zero indicate no relationship between the two variables, while values close to +1 indicate a strong positive relationship. Thus, a score correlation value of close to 1 means that performances which are scored high in one set of ratings also receive high scores in the other set. This is desirable in reliability statistics, whereas negative values are undesirable (and also unlikely if the raters use the same rating scale). Negative values indicate an inverse relationship between the variables being compared, so that high scores in one set would correspond to low scores in the other. There is always some error in scoring, so that a perfect 1 is practically unattainable" (Luoma, 2004, p. 182).

2 Some formats of oral testing involve an interlocutor, who talks with the candidates, and an assessor, who marks their performances. "The interlocutor, whether an interviewer or a peer test candidate, becomes a variable in speaking tasks, alongside the other task characteristics" (Galaczi & ffrench, 2011, p. 166).


Rater reliability in oral assessments has been one of the most controversial topics

attracting much debate. A major difficulty derives from the fact that raters’ subjective

evaluation counts in the rating process (Restrepo & Villa, 2003; Brown, cited in Zahedi &

Shamsaee, 2012). It was found that test reliability was low when examiners did not obtain

standardised training on language testing, or have a marking scheme with clear criteria for

judging candidates (Francis, 1981). Oral tests were claimed to be “impressions of the tester

about the student's speaking ability rather than accurate objective measures of speaking

proficiency” (Seward, 1973, p. 76). Fisher’s (1981) study pointed out many cases in which

candidates were failed by different examiners while being awarded a Pass or even a

Distinction by most of the other examiners. A previous study indicated that inter-rater

reliability varied depending on types of test tasks: the highest .91 was for oral interview, .81

for a reporting test, .76 for role play, and .73 for group discussion (Shohamy, Reves, & Bejarano, 1986). Authors differ in their suggestions as to what is considered weak or strong: Dancey and Reidy (2007) suggest that a reliability coefficient from 0.1 to 0.3 is regarded as weak, 0.4 to 0.6 as moderate, and 0.7 to 0.9 as strong.
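To make the correlation-based estimate of inter-rater reliability concrete, the short sketch below is a minimal illustration only, not part of the thesis or of any cited study: the eight pairs of scores are invented, and the use of Python with SciPy is an assumption for demonstration purposes.

```python
# Minimal sketch: estimating inter-rater reliability for a speaking test by
# correlating two raters' scores for the same candidates (hypothetical data).
from scipy.stats import pearsonr, spearmanr

rater_a = [6.5, 5.0, 7.0, 4.5, 8.0, 6.0, 5.5, 7.5]  # Rater A's scores
rater_b = [6.0, 5.5, 7.0, 4.0, 7.5, 6.5, 5.0, 7.0]  # Rater B, same candidates

r, p_value = pearsonr(rater_a, rater_b)   # linear (interval-scale) agreement
rho, _ = spearmanr(rater_a, rater_b)      # rank-order agreement

print(f"Pearson r = {r:.2f} (p = {p_value:.3f}); Spearman rho = {rho:.2f}")
# By the conventions cited above, a coefficient in the .7 to .9 range would
# count as strong consistency for a subjectively scored speaking test.
```

A high coefficient of this kind shows only that the two raters rank candidates similarly; it does not by itself establish that the scoring is valid.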

Other important findings regarding rater reliability have also been reported. Raters

who rate a taped performance tend to pay greater attention to form and not be distracted by

the human qualities present in a live interview (Thompson, 1996). When raters only have

access to audio data, they have been found to underestimate the scores of more proficient

candidates (Nambiar & Goon, 1993), and judge more harshly if the oral performances are

poorly recorded (McNamara & Lumley, 1997). Although oral skill testing was studied in

different assessment contexts, implications for rater training and careful test preparation

and administration were found in most studies (Shohamy, 1983; Reed & Cohen, 2001).

Construct validity and reliability in language testing have always been fundamental

concerns among language test developers and practitioners. A useful test is required not

only to be valid in its design and administration, but also reliable in its rating and scoring

phases. What needs to be pointed out is that reliability is a necessary but not a sufficient

condition for a good test (Davies, 1990, p. 22). Supporting this argument, Bachman (1990,

p. 160) claimed that “both [validity and reliability] can be better understood by recognising

them as complementary aspects of a common concern in measurement – identifying,


estimating, and controlling the effects of factors that affect test scores” because they lead to

enhancing the overall quality of a test by (1) minimising the effects of measurement error,

and (2) maximising the effects of the language abilities we want to measure. The Cambridge English approach to test validation illustrates the central role of both of these qualities.

One practical way of moderating the influence of raters’ scoring-related

inconsistency is the use of assessment scales, together with clearly-defined scoring criteria

and performance descriptors. Studies have also shown that the use of analytic rating scales,

alongside holistic ones, has a positive effect on scoring validity (Hamp-Lyons, 1991; North,

1995; Upshur & Turner, 1995). Another issue relating to the school-based assessment

(SBA) component and the involvement of the teacher is that differences in the nature and

magnitude of teacher inputs into the SBA work of students may unfairly over-compensate

some students while unfairly penalising others (Williamson, 2017, p. 304).

2.3 Conceptual framework for validating speaking tests

The theoretical foundation for this study is the socio-cognitive framework for validating

speaking tests in Weir (2005). I selected the framework as a methodological approach for

examining the oral assessment in Vietnamese EFL education because its validation

components “fit together temporarily as well as conceptually” (Weir, 2005, p. 43). Its

temporal sequence enabled me to visualise what should happen and when it should happen

in each stage: context and cognitive validity (before the test), scoring validity (during the

test), consequential and criterion-related validity (after the test). As visually illustrated in

Figure 2.3:

The arrows indicate the principal direction(s) of any hypothesised relationships:

what has an effect on what, and the timeline runs from top to bottom: before the test

is finalised, then administered and finally what happens after the test event. (Weir,

2005, p. 43).

The connection between components in a temporal sequence allows “a unified

approach to gathering validation evidence” (Taylor, 2011, p. 27) from multiple sources. A

mixed methods design helps to achieve a sound and comprehensive understanding of the social phenomenon under research. I present the detailed rationale for choosing a mixed methods approach to address my research questions in Chapter Three.


Figure 2.3 Socio-cognitive framework for validating speaking tests (Weir, 2005, p. 46)

Another reason for choosing this framework is that its approach is in line with the

CEFR’s perspective adopted in assessing foreign language proficiency for Vietnamese EFL

education. The test taker is treated as “a social agent who needs to be able to perform

certain actions in the language” (North, 2009, p. 359). The socio-cognitive aspect of the

model is reflected in the test taker’s internal process of monitoring knowledge to produce

verbal utterances (the cognitive process), and the use of spoken language in performing

purposeful tasks is viewed as a social phenomenon rather than merely a linguistic

performance (Taylor, 2011, p. 25).


The application of this socio-cognitive framework to validating the achievement test

helped me to answer research questions formulated in Chapter One. My study focuses on

the fairness and appropriateness of test administration (Research Question 1a), the

relevance of test contents (Research Question 1b), the operation of test tasks in eliciting

speech samples for assessment (Research Question 1c), the reliability in scoring and

interpretations of test scores (Research Question 2), and impact of the test on various

stakeholders and pedagogical activities (Research Question 3).

2.4 Formats of speaking tests

Unlike testing other language skills, in which test takers are left in silence to concentrate on

completing test tasks, oral assessments normally require a face-to-face encounter in which individuals' speaking performances are elicited. Formats for speaking tests can be categorised into

three common types according to the number of candidates: one-on-one interview, paired

speaking test, and group oral assessment.

§ One-on-one interview

A traditional format for a speaking skill assessment is an oral proficiency interview.

Until the 1990s, this took the form of a one-to-one interaction between a candidate and an

interlocutor or examiner. An interview conducted for L2 assessment is defined as

“a face-to-face interaction usually between two participants”… “one of whom is an

expert (usually a native or non-native speaker of the language in which the

interview is conducted) and the other a non-native speaker or learner of the

language as a second or foreign language.” (He & Young, 1998, p. 10)

There have been a number of studies investigating the construct validity of oral

interviews (Dandonoli & Henning, 1990); interview variation and the co-construction of

speaking proficiency (Brown, 2003); and the relationship between test taker behaviour and

performance (Kunnan, 1994; Huang & Hung, 2013). Conclusions from these studies have

shown that there is a considerable divergence between two positions: some researchers

argue for the usefulness of the oral interview and its rating scale manipulation (Shohamy,

1983), whereas others strongly criticise this mode for its lack of features typical for normal


conversation in terms of turn-taking, topical nomination or negotiation, and communicative

involvement (Johnson & Tyler, 1998).

Since the 1970s, L2 assessment has witnessed the increased use of semi-direct

speaking tests (O’Loughlin, 2001). Candidates’ active speech samples are elicited by

prerecorded questions or tasks in a language laboratory and recorded on an audiocassette or

a computer for rating later “rather than through face-to-face conversation with a live

interlocutor” (Clark, 1979, p. 36). This mode of simulated interviews is supposed to be

similar to direct (live) oral interviews “in terms of content, functions, and rating scales, and

is considerably more efficient in time and cost of administering and grading the test”

(Koike, 1998, p. 69). Semi-direct tests are more uniform than direct tests because

interlocutor/interviewer variables are eliminated. All candidates answer the same questions,

perform the same task(s), within the same time constraint in a test event (Shohamy, 1994).

The adoption of semi-direct testing has raised a question regarding what happens when the

candidate has no one to talk to. In practice, candidates in semi-direct speaking tests tended

to produce more formal discourse, make self-correction extensively and use more fillers

such as uh, um, eh, etc. to fill gaps in their speech (Koike, 1998).

§ Paired speaking test

Another consistent trend across the studies of oral assessment related to a paired

format, with the possible inclusion of one or two examiners. This format allows a variety of

interactional patterns between examiner(s) and examinees, i.e. the examiner(s)-examinees,

examinee-examinee, which is claimed to reduce anxiety for the candidates since they are

interacting with their partners, or at least feeling the safety of having a partner by their side

(Saville & Hargreaves, 1999). Pairing candidates is an appropriate format for achievement

tests once students have become familiar with CLT during their previous course of instruction:

Communicative Language teaching has widely used pairwork in the classroom. The

introduction of the paired speaking test therefore brings the test into line with

classroom practice (Fulcher, 2003, p. 186).

The popularity of paired speaking tests has been stimulating serious research,

particularly into discourse analysis and the potential impact of a candidate's non-tested

factors on mutual performance. The findings of Ducasse and Brown’s (2009) investigation


reported three interaction parameters identified in the assessing process: non-verbal

interpersonal communication, interactive listening, and interactional management, which

challenge our understanding of the construct of effective interaction in paired candidate

speaking tests.

An inevitable problem is raised when candidates are paired: Is a candidate’s

speaking influenced by their partner’s proficiency level? Studies indicated different results

on this issue. Nakatsuhara (2004, p. 57) concluded that, regardless of their partners' speaking proficiency levels, "they [students] are likely to obtain rather identical

opportunities to display their communicative abilities”. Nevertheless, Norton (2005, p. 291)

found that “being paired with a candidate who has higher linguistic ability may be

beneficial for lower level candidates who are able to incorporate some of their partners’

expressions into their own speech”. To reach a deeper understanding about what was being

elicited through pair speaking tasks, it is recommended that candidate discourse be further

examined (Swain, 2001).

§ Group oral assessment

Grouping candidates is possibly a practical option for many educational institutions

dealing with the pressure of large numbers of students in a limited testing timeframe.

Although Hilsdon's (1991) evidence demonstrated positive feedback on the group oral format from both test raters and test takers, there were many concerns regarding

candidates’ authentic participation in group discussion and fairness in awarding marks. He

and Dai (2006) studied a group oral interaction in the setting of a college spoken EFL test

through discourse analysis of the recorded segments involving group discussion on a given

topic. The candidates were reported to “consider the examiners, rather than the other

candidates in the group, to be their target audience”, which resulted in their avoiding more

complex language functions such as supporting, persuading, or negotiating meaning for

discussion as expected, in favour of repeating simple functions, e.g. agreeing, disagreeing,

asking for information, etc. The authors concluded that “it seems many candidates interpret

[group] contribution in terms of quantity rather than quality” (p. 391).

Scoring group oral interactions is challenging. Candidates’ performances and scores

are influenced to a large degree by interlocutors’ characteristics and conversational


dynamics amongst group members (Van Moere, 2006). In addition, speaking tasks for

group discussion require meticulous consideration from test designers because “the fact that

a task is used for assessment makes it unlikely that participants will engage with it in the

same way that they would if they were not being assessed, no matter how much the

assessment task resembles a real-world task in other aspects” (Spence-Brown, 2001, p.

479).

2.5 Technological applications in oral assessment

Recent developments in multimedia recording have opened up a new era for EFL

computer-based testing (CBT). A wide variety of technology applications offer innovative

methods for speech analysis and assessment. These applications include webware (e.g. the

Web resource Odeo, at www.odeo.com), computer software (e.g. the Audacity recorder,

available at www.audacity.sourceforge.net), and portable hardware (e.g. the Sanako MP3

recorder), which are considered to be user-friendly and either free or affordably priced. The utilisation of these devices in oral tests is reported to "enhance accuracy and

reliability in assessing student performances” (Early & Swanson, 2008, p. 46) since their

digital voice recordings can be replayed for criteria scoring calibration and objective

grading by human raters, or directly scored by a scoring computer programme.

The benefits of incorporating webware-based asynchronous speaking with classical

oral examination and presentation were demonstrated in a study by Pop and Dredetianu

(2013). They found that this three-item model of assessment contributed to “a more

objective, error-targeted, learning-oriented and transparent evaluation of the students’

speaking ability, through the reversibility inherent in recording” (Pop & Dredetianu, 2013,

p. 109). Additionally, computerised oral testing has been shown to be more beneficial than traditional tests in reducing test takers' anxiety and boosting their creativity (Early &

Swanson, 2008), allowing time and space saving in large-scale examinations, and leading

to the development of online testing procedures across educational institutions (Fall, Adair-

Hauck, & Glisan, 2007).

More importantly, students can receive feedback from teachers or from these

technical applications for self-assessment, which helps them identify what they need to

improve about their speaking skills (Alderson, 1998). In fact, students become more


involved in practice using available software before making final recordings to submit for

evaluation (Early & Swanson, 2008). Additionally, digital voice recording enables teachers

to discriminate between phonetically similar sounds that might affect communication

(Larson, 2000). There were also findings which revealed that human raters and computer

software did not give significantly different results when rating EFL learners' language

acquisition levels (Alshahrani, 2008). Therefore, as a practical implication for CBT, there

are exciting challenges for the professional cooperation of test developers, language experts

and researchers alike to exploit such enormous advantages associated with the use of

technology for educational purposes.

In the context of global integration, Vietnam recognises a variety of international English certificates. Language learners in Vietnam have a wide choice of foreign language centres that offer preparatory courses for the international English proficiency certificates that suit their particular needs (MOET, 2016b; Thao Nguyen, n.d.). Most of these exam formats include a speaking component delivered as a one-on-one interview task (e.g. IELTS), a collaborative/paired speaking task (e.g. Cambridge FCE), or a computer-based task (e.g. TOEFL iBT). Formats for oral tests in Vietnamese tertiary education vary across institutions since each institution decides on its own testing and assessment practices

(MOET, 2007).

2.6 Factors affecting test validity and reliability

Assessing spoken language is a complicated process from theory to practice. Apart from

raters, several other factors that potentially affect the test scores in language performance

assessment include: criteria, scales for rating (scoring rubric), test tasks, candidates

themselves and other candidates involved as interlocutors in generating speaking

performances (McNamara, 1995). The following sections will present findings and debates

on these internal factors from previous studies.

2.6.1 Assessment criteria

The term “criterion” in testing and assessment can be used in two senses: (1) the standard

level (or cut-off point) to make pass-fail decisions, and (2) the benchmarks (course

objectives) on which the test design is based. In a criterion-referenced view of language

testing, tests with respect to course objectives are believed to “take on an advantage that


makes them particularly useful for language curriculum development” (Brown & Hudson,

2002, p. 49).

Assessment criteria are explicitly written and proposed to be used reliably by test

raters for performance rating. Recognising this guiding role, Alderson and Banerjee (2002)

criticised the way that rating criteria are constructed by proclaimed experts from theoretical

deduction rather than from observation and experience. They also expressed reasonable

doubts about the practicality of using generic rating scales for any task and that different

audiences might need different rating scales (Alderson, cited in Weir, 2005). To address

this problem, Richards and Chambers (1996) offered the important conclusion that

criterion-referenced scales (ones with descriptors and numeric values for each of several

general levels of performance) are more reliable than norm-referenced categorical scales

(ones with weighted criteria and numeric scales for each but no descriptors for the criteria).

2.6.2 Rating scales

Rating scales (also called scoring rubrics, or marking schemes) comprise “graded

descriptors intended to characterise different levels of ability” and function as “the

established means of guiding raters to improve their level of agreement with their

colleagues” (Green & Hawkey, 2012, p. 299).

Using a rating scale in oral assessment is essential because it “provide(s) an

operational definition of a linguistic construct” (Fulcher, 2003, p. 89). Human raters would

give different ratings to the same performance if they did not adopt a commonly agreed scoring

rubric. In this regard, Hirai and Koizumi (2013) also acknowledged that a valid and reliable

rating scale is crucial in contributing to the success of oral assessments. Theory-based

scales have been proved to work ineffectively in classroom situations; therefore,

constructing a rating scale for institutional use requires purpose-designed descriptors that conform to the teaching objectives (Turner & Upshur, 1996). More empirically

based rating scales derived from samples of test performances are encouraged to improve

the quality of diagnostic assessment for academic English (Knoch, 2007 & 2009).

There are two kinds of rating scales: holistic (global) and analytic. A holistic scale

represents “overall quality” of the speech sample according to the rater’s general perception

of a candidate’s oral performance (Fulcher, 2003), whereas an analytic scale can “provide


for a more detailed description of a performance and so can give useful feedback to the

learner” (Green & Hawkey, 2012, p. 303). The face-to-face Cambridge ESOL Speaking test

requires two raters for scoring each candidate’s performance. One acts as an interlocutor

delivering tasks to the candidate and using the holistic scale. The other acts as an assessor

observing the candidates’ oral interactions and using the analytic scale to make judgements.

Examiners switch their roles halfway through the scoring session "in order to allow both

examiners an equal opportunity to maintain their experience in both roles” (Taylor, 2011, p.

341). Examples of holistic (global) and analytic rating scales are included in Appendix G.1.
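To illustrate the contrast between the two scale types, the sketch below shows how an analytic scale's per-criterion ratings might be combined into a single weighted score. The criteria, equal weights, and band values are hypothetical assumptions for illustration and are not drawn from the thesis, Cambridge ESOL, or any official rating scale.

```python
# Hypothetical analytic rubric: four criteria with equal weights (assumption).
ANALYTIC_WEIGHTS = {
    "grammatical_accuracy": 0.25,
    "vocabulary_range": 0.25,
    "fluency": 0.25,
    "pronunciation": 0.25,
}

def composite_score(ratings):
    """Weighted combination of per-criterion band ratings (e.g. on a 0-10 scale)."""
    return sum(ANALYTIC_WEIGHTS[criterion] * band
               for criterion, band in ratings.items())

# One assessor's analytic ratings for a single candidate (illustrative values).
ratings = {"grammatical_accuracy": 6.0, "vocabulary_range": 7.0,
           "fluency": 5.5, "pronunciation": 6.5}

print(f"Analytic composite score: {composite_score(ratings):.2f}")
# A holistic scale, by contrast, records one overall impression band directly.
```

Recording the per-criterion ratings alongside the composite preserves the more detailed feedback to learners that Green and Hawkey (2012) associate with analytic scales.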

There is evidence that the customisation of two existing scales helps to establish

more reliable and valid measures for assessing speech performance in the context of story

retelling (Hirai & Koizumi, 2013). An example of success in testing speaking skills is the

design of the Performance Decision Tree (PDT) (Fulcher, Davidson, & Kemp, 2011). This

new scoring instrument addresses the shortcomings of measurement-driven scale

construction by explicitly valuing performance data from a specific communicative context,

resulting in providing richer descriptions for sounder inferences and scoring. In addition,

Richards and Chambers (1996) showed that

global criterion-referenced scales (with descriptors and numeric values for each of several

general levels of performance) are more reliable than a norm-referenced categorical scale

(one with weighted criteria and numeric scales for each but no descriptors for the criteria).

2.6.3 Test tasks

Candidates do not initiate their speaking in a testing context in the same way as they do in

everyday conversations. They need to have a task prescribed as part of the test

specifications and delivered by the examiner, to decide how they should demonstrate their

speaking skills. A task provides a necessary context for the candidate to speak and for their

speaking to be assessed. Oral assessment tasks are designed to elicit ratable speech samples

for measuring learners’ productive language skills through performance, allowing learners

to exhibit the kinds of L2 skills that may be required in a real-world context (Wigglesworth,

2008, p. 111). A language use task invites learners’ active participation and keeps them

goal-oriented in their task performance within a specific social setting. A test task is

therefore defined as, “an activity that involves individuals in using language for the


purpose of achieving a particular goal or objective in a particular situation” (Bachman &

Palmer, 1996, p. 44).

Tasks that are appropriate for both language use and language testing must include

two aspects: the individual has a clear understanding of what sort of outcome is expected to

be achieved, and the individual is aware of the criteria by which the task performance will

be evaluated (Carroll, 1993). Insufficient understanding, or comprehension problems in

interpreting task requirements may cause inappropriate responses in test takers' performance

(Pollitt & Murray, 1996).

With respect to the notion of real-world tasks, learner-centredness is taken into

account in the following definition of academic tasks:

[t]he term “task” focuses attention on three aspects of students’ work: (a) the

products students are to formulate… ; (b) the operations that are able to be used to

generate the product… ; and (c) the “givens” or resources available to students

while they are generating a product (Doyle, 1983, p. 161).

This three-part definition of educational tasks is “particularly useful with respect to

the elements involved in item specifications within a task-based performance assessment”

(Norris et al., 1998, p. 33). As indicated in Figure 2.2, test tasks are used to determine how

a test taker's language ability is performed, so they directly affect the test score given for

his/her performance. Applied linguists generally agree that “these [language use tasks] are

(1) closely associated with, or situated in specific situations, (2) goal-oriented, and (3)

involved the active participation of language users” (Bachman & Palmer, 1996, p. 44).

Much of the debate in recent studies has surrounded “the construct of interactional

competence and its operationalisation by raters” (May, 2010, p. 1). This trend can be

explained as a result of the Communicative Language Teaching (CLT) approach, in

which language learners are constantly encouraged to participate in classroom

communication with partners in pairs and groups. The following quote illustrates the

correspondence between communicative test tasks (required by raters) and spoken language

ability (performed by candidates):

Task-based performance testing is attractive as an assessment option because its

goal is to elicit language samples which measure the breadth of linguistic ability in


candidates, and because it aims to elicit samples of communicative language

(language in use) through tasks which replicate the kinds of activities which

candidates are likely to encounter in the real world. (Wigglesworth, 2008, p. 119)

Different tasks obviously elicit different ranges of responses in the target language.

In a study focusing on candidates’ performances in oral presentations, Pietila (1998, p. 258)

found that “success in one type of (test) task does not necessarily predict success in another

type of task” because their typical orality/literacy features are not identical by nature. The

decision on what interactional format and oral production tasks are to be used for assessing

speaking depends upon the established course objectives and students’ level. Table 2.4

summarises five categories of speaking performance tasks in terms of the speech sample

elicited and the degree of interaction required of candidates (Brown, 2004, pp. 141-142).

These tasks range from imitative speaking performance (such as imitation of short stretches

of spoken language) to extensive oral production in an individual long turn. A task may require anything from no interactive engagement at all to longer and more complex interpersonal exchanges.

Table 2.4 Categories of speaking assessment tasks

1. Imitative
• Elicited sample of oral ability: imitation of short stretches of spoken language
• Level of interaction: none
• Example: mimicry tasks (a word/phrase/sentence repetition)

2. Intensive
• Elicited sample of oral ability: production of short stretches of spoken language
• Level of interaction: minimal
• Example: directed response tasks, limited picture-cued tasks, etc.

3. Responsive
• Elicited sample of oral ability: production of very short exchanges
• Level of interaction: somewhat limited
• Example: question and answer, paraphrasing, giving instructions, etc.

4. Interactive
• Elicited sample of oral ability: interaction in form of transactional and/or interpersonal communication
• Level of interaction: longer and more complex
• Example: interview, role play, discussion, conversation, etc.

5. Extensive
• Elicited sample of oral ability: production of speeches in an individual long turn
• Level of interaction: highly limited or ruled out
• Example: oral presentation, picture-cued description/storytelling, retelling a story/news event, etc.

Modern language tests have witnessed the use of integrated tasks in which a

language skill is used as a stimulus for performing another skill. These tasks are designed


to achieve “uniformity in elicitation procedures” (Hughes, 2003, p. 122) and enhance

“authenticity of the tasks” (Wigglesworth, 2008, p. 118). For instance, reading and/or

listening might be used to trigger speech samples in task-based speaking tests. This is the

case in the internet-based TOEFL (iBT) speaking section, in which candidates are

required to give responses after watching and listening to video recordings. The Business English Certificate (BEC) Higher Speaking test requires each pair

to read a business-related situation before they discuss what to do in that situation.

Nevertheless, rating such integrated tasks is supposed to be more difficult since

comprehension (or the lack of it) of the language input in the stimulus tasks may affect the quality of

the targeted skill being assessed (Wigglesworth, 2008).

2.7 Washback of oral language assessment

This study aimed to explore language testing as operated at the institutional level, where

language curriculum, teaching/learning and assessment are all involved in a closed cycle

and have a cause-effect relationship with each other. Figure 2.4 indicates that “in school-

based assessment, assessment needs to be continuous and integrated naturally into every

stage of the teaching-learning cycle, not just at the end” (HKDSE English Language

Examination, 2012).

Figure 2.4 Interrelationship between assessment, teaching and learning

[Figure: a cycle linking planning (and reflecting), teaching (and observing/monitoring), learning (and recording/self-evaluation), and assessing (and feedback/reporting).]

As can be seen from the cycle in Figure 2.4, the assessing component includes

feedback on the learner’s (and presumably the candidate’s) performance and reporting the

assessment result to them. It is through assessment that the teacher (and also the rater) can

reflect on how the learner performed in the test event so that he/she can make appropriate

plans to improve teaching methods and create more positive impacts on learning activities.

The teaching and learning process then goes through assessment and rating again.

The interrelationship between assessment, teaching and learning entails potential

influence of testing further on education systems and community:

The notion of washback in language testing can be characterized in terms of impact,

and includes the potential impact on test takers and their characteristics, on teaching

and learning activities, and on educational systems and society. (Bachman &

Palmer, 1996, p. 35)

I examined the impact of language testing on educational practices in order to

recognise possible ways to enhance positive effects and reduce the negative influences that

tests may have on teaching and learning activities in EFL classes. It is argued that

assessments potentially reflect ‘systemic validity’ in such a way that they promote

development of the cognitive skills that a test is supposed to measure (Frederiksen &

Collins, cited in Messick, 1996; Zhao, 2013). In applied linguistics, the impact of a test on

the teaching and learning for that test is commonly addressed as “washback” (McNamara,

2000; Burrows, 2004).

Regarding washback effects in language testing, research findings have

demonstrated that high-stakes examinations do have a more obvious effect on teaching

materials but less on teachers’ methodology and behaviours than school-based assessment

does (Alderson & Wall, 1993; Chen, 1997). However, in another study into the washback

effect of an ESL classroom-based assessment, Burrows (2004) found that how the test

affected teachers’ methodology remarkably depended on the teacher’s personal attitudes

and beliefs. Teachers’ inadequate understanding about the nature of assessment,

unwillingness to change, or misuse of test contents have been reported to be prospective

obstacles to positive washback in the ongoing cycle of training (Wall, 1996). To facilitate

positive washback of oral assessment in EFL classrooms, students should be well-informed


of the assessment procedures, scoring scales, specified objectives, and the structure of

assessment tasks so they can focus their learning on specific goals for better language

performance (Muñoz & Álvarez, 2010).

The influence of testing on teaching and learning is present in the scenario of EFL

school-based assessment, where communicative tests have been shown to "have beneficial

washback on the teaching that led up to them, as the teaching would focus on preparing

students for the representative communicative tasks beyond the test” (McNamara, 1996, p.

23). Potentially the positive washback effect of an exam can be perceived via the following

results (Wall & Alderson, 1996, p. 199):

• Content of teaching: What teachers teach (also what students learn) is derived from

the course book contents because the text types and language use tasks therein will

be tested.

• Method of teaching: How teachers teach (also how students learn) is operated in

such a way that students are efficiently equipped with the necessary skills to be

assessed in the exam.

• Ways of assessing: The test is designed and developed in such a way that it has

samples relevant to the contents of the course book, and examiners use the criteria

set as course objectives to assess students’ performances.

2.8 Summary

The themes I have discussed in this chapter are central issues in language testing validation

towards “transparency and evidence-based proof of the value of test outcomes”

(O’Sullivan, 2011, p. 5) in alignment with educational goals. Evidence from the literature

has laid essential foundations for expanding insights into EFL oral assessment. Although

several features discussed in proficiency tests are supposed to be relevant for achievement

tests (Pino, 1989) in terms of construct validity, rater reliability, tasks, rating scales, etc.,

many other areas of school-based oral tests have not been thoroughly explored in the

literature of L2 testing and assessment in Vietnam. Like many other Asian countries where

EFL education has been crucial for national development and international integration,

Vietnam could adopt and adapt advancements in the world’s language testing to suit the


current Vietnamese context and conditions. Information from the literature helped me

identify gaps to explore in language testing research and challenges in local test use, test

interpretation, and its pedagogical implications.

Oral assessment is a multifaceted social phenomenon, research on which needs to engage various stakeholders to achieve a comprehensive understanding of the test across its different stages, from construction and development to operation and consequences. Theories and results from previous studies in the field helped me to

determine who the research participants would be, what research instruments I would need

to design and when I would use them for data collection. The literature review shaped my formulation of research questions in consideration of the Vietnamese setting and assisted

me to build relevant strategies for data analysis and presentation of results. Suitable

customisations of existing instruments and newly designed methods provided me with a

deeper insight into how I should analyse the data and best report results to answer the

research questions.

The next chapter presents the methodology adopted in my study, including a

discussion on the rationale for the empirical research, and detailed descriptions of how I

conducted the study from initial contact with potential participants to outcome presentation.


Chapter Three

METHODOLOGY

3.1 Introduction

In Chapter Two, I presented a review of literature associated with my research into oral

assessment in Vietnam. Assessing speaking skills is challenging because there are many factors affecting a candidate's performance and the rater's judgments of how well a candidate speaks. Oral assessment in pedagogical contexts requires not only accuracy and

fairness, but also a sufficient reflection of the content that language learners achieve after a

course of instruction. In a general educational cycle, assessment has a strong influence on teaching methods and students' engagement in learning.

In this chapter, I present the methodology I employed to study oral assessment in

Vietnamese tertiary education. I begin with a critical overview of the rationale for the use

of a mixed methods approach from initial procedures for data collection to data analysis

and presentation of results. After detailed descriptions of the research sites and participants,

I continue with justifications for the research instrument design and procedures

implemented in particular phases of the data collection. The chapter ends with an

elaboration of research confidence and human ethics issues.

3.2 Rationale for the research design

In order to establish an essential basis for undertaking this study, I will discuss the rationale

of the research design under two subheadings: (1) Adoption of a mixed methods approach, and (2) Using a convergent design.

3.2.1 Adoption of a mixed methods approach

My objective in conducting this study was to examine how effectively universities

in Vietnam administered English speaking tests for their students. The main purpose was to

investigate whether the assessment accurately measured what it claimed to measure, and to

what extent its consistency was maintained to ensure fairness among candidates


participating in the test. A mono-method study, i.e. either quantitative or qualitative, was

not sufficient to provide thorough insights into the research problem. I applied a mixed

methods design in collecting and analysing both quantitative and qualitative data to answer

the research questions. The purpose of blending methods was that quantitative data gave useful evidence for "theory verification" and qualitative data provided descriptive explanation for "theory generation" (Punch, 1998, p. 17).

There were many good reasons for this methodological approach. First, mixing the

two quantitative and qualitative components enabled me to produce “stronger evidence for

a conclusion through convergence and corroboration of findings” (Johnson & Christensen,

2012, p. 433). Second, the strengths of one method make up for the limitations of the other

when combined to arrive at a more accurate understanding of social phenomena in the real

world (Cresswell, 2011). In addition, the integration of data from various sources was

particularly appropriate for a multi-level analysis of issues in applied linguistics (Dörnyei,

2007, p. 45).

In the field of language assessment, by combining information from different

sources, the results can often facilitate a deeper understanding of complex phenomena

under study, most especially in the areas of validity and instrument development,

classroom-based assessment, large-scale assessments, construct definition, and rater effects,

to name a few (Moeller, 2016, p. 11). Applying mixed methods to study language testing

and assessment involved the collection and combination of quantitative (numbers oriented)

data and qualitative (text and stories oriented) data in the test development and assessment

process. This combination has applicability in several testing arenas, such as gathering both

forms of data when developing a test, examining how test takers and stakeholders view the

utility of a test, revising the rating scales for a test, and assessing the appropriate level for a

test based on an individual’s language ability. Mixed methods research is not simply

collecting both quantitative and qualitative data; it goes beyond this information gathering to bringing together, combining, or integrating both forms of data (Creswell & Zhou, 2016, p. 36).

Combining both quantitative and qualitative research methods was essential at the stages of data collection and analysis to obtain comprehensive results, as the study sought to explore and describe a multifaceted problem in language testing

and assessment. The data for the current study were collected from various data sources at

all stages of the test event – before, during, and after candidates’ oral performance for

rating. Primary sources of data were EFL teachers and students as previously presented.

Secondary sources came from official documents associated with the test and EFL experts

from outside the universities being studied. Combining data of different types gathered by different instruments served the purposes of "triangulation" and "complementarity" of the claims to be made and helped achieve the most complete findings about the research problem (McMillan & Schumacher, 2001, p. 408). For example, evidence to examine the consistency in assessing speaking skills would go beyond the test scores (in a quantitative manner), and would need to be combined with fieldnotes, descriptions, and transcripts (in a qualitative manner) derived from test room observations and interviews with test raters. In a similar approach, to learn about oral test impact on English learning, ideas collected from student group interviews (qualitative data) would help to elaborate statistical results (quantitative data)

from the questionnaire survey on candidates’ perceptions and opinions. The following

sections will describe in detail the instruments used for data collection that include the

questionnaires, group and individual interview outlines, and test room observation protocol.

3.2.2 Using a convergent design

The current study implemented a convergent design in which I collected quantitative and

qualitative data separately and treated them with equal priority as they contributed to

understanding the research problem. However, the quantitative and qualitative strands of

the study did not always occur concurrently because of the availability of a particular data

source or the possibility of convergent results emerging from integrated data sources. For

instance, textual documents about test formats and test administration had to be gathered

before the test so that a decision could be made on how the audio-recording of speaking test samples would be arranged (i.e. how many voice recorders would be needed, whether a co-

investigator was necessary if the test was going to be held in different test rooms at the

same time, etc.). EFL experts’ judgements on the relevance of test content could only be

collected after the test completion to ensure test material security. After that, I processed

and analysed these sets of data separately using appropriate analytic procedures


corresponding with each data type. Finally, I integrated the results in a thematic

presentation of the findings to answer the research questions.

It was noted that the integration of methods helped to enhance research validity once the findings were triangulated and mutually corroborated by multimethod approaches (Bryman, 2006). Moreover, combining methods is claimed to be "an efficient design"

(Creswell & Clark, 2011, p. 78) as several data sets of various types were collected at

virtually the same time. In this study, the data collection stretched over only a few months, for the purpose of comparing the same phenomenon occurring at different institutions within a single educational system.

3.3 Research setting

This research was conducted in the largest city in the South of Vietnam - Ho Chi Minh

City, formerly named and still popularly referred to as Saigon3 (Tri Thuc Tre, 2018; Lan

Tam, 2017; Tran Quan, 2015). Best-known as “a great economic, cultural, educational,

scientific and technological centre in Vietnam”, the city plays the role of an important

central point bridging regions and countries in Southeast Asia (VOV, 2015). At the present

time (2017), there are as many as 50 universities and six academies offering tertiary level

training programmes for students nationwide (www.vi.wikipedia.org). EFL is a compulsory

subject at most Vietnamese universities. However, fewer than half (19) of these

educational institutions offer Bachelor of Arts (BA) degree programmes for English majors

(Appendix J). At the other institutions, English is taught as a non-major subject among

other foreign languages such as French, Japanese, Chinese, Korean, etc. End-of-course

exams for non-English majors concentrate mostly on Grammar, Vocabulary, and Reading

Comprehension. They do not usually include speaking tests during the training programmes. Therefore, the target population of this study was English major students and EFL teachers who took part in oral assessment as part of the BA degree programmes at their universities.

3 Originating from a deserted area of land, Saigon (or Sài Gòn in Vietnamese) has become a dynamic, modern, and open city in South Vietnam with a history of more than 300 years of formation and development (Lich su Viet Nam, 2016). "Although Ho Chi Minh City (often shortened to HCMC, HCM, or HCMc in writing) is the new official name of the city, Saigon is still used daily by many Vietnamese – particularly in the south. Despite official mandates, the label "Saigon" is shorter and is used more often in daily speech" (Rodgers, 2017).

3.3.1 Research sites

I selected three institutions to be the research sites on the basis of non-probability sampling, which is very commonly used in educational research (Creswell, 2012; McMillan & Schumacher, 2001). These institutions had a large number of potential subjects who were accessible and representative of the typical characteristics of the target

population. I decided to use a combination of convenience (or availability) and purposeful

(sometimes called purposive or judgemental) sampling methods (Neuman, 2011; McMillan

& Schumacher, 2001) in preparation for the data collection stage.

Factors of data availability and approachability formed part of my decision-making

process of selecting the research sites. First, the institutions were accessible to me. I was

able to obtain permission from the Heads of the EFL Faculties to carry out the study project

after orally presenting the research plan, and with written documents approved by the

Human Research Ethics Committee (HREC) of the University of Newcastle, where I am

doing my research study. The EFL teacher and student participants’ consent was also

sought before the commencement of data collection. There were cases of institutions at

which collecting data could not proceed because their Heads did not ratify the research

plan. Second, the speaking test was available at these institutions during the time of my

fieldwork from November 2015 to March 2016 (Appendix H). Each institution had its own exam schedule, which did not overlap with the others'. Such convenience allowed me

sufficient time to set up initial contacts with prospective participants and necessary

preparation for data collection. Third, the institutions were convenient for my travel to and

fro – a factor referred to as “geographical proximity” (Dörnyei, 2003). This was the reason

why all three universities involved in the study were located in the same city in Southern

Vietnam and its neighboring suburbs, so that I could arrange transport to reach the research

sites during the peak examination season.

I did not make the selection of the sample institutions merely because of their

convenience, but in a purposeful manner. My knowledge and experience as an EFL lecturer


living and working in this city informed me that the prospective participants at the

institutions had typical characteristics that were well-suited to the research purpose. These

potential subjects were EFL teachers and students both “representative” and “informative

about the topic of interest” so they would “provide the best information to address the

purpose of the research” (McMillan & Schumacher, 2001, p. 175). I specified attributes of

the target population and asked questions to ensure that individuals belonging to a research

site had these attributes before deciding whether to include them as representative samples

(Johnson & Christensen, 2014, p. 264). For example, I considered the following issues:

‘Does the institution organise an end-of-course test to assess their students’ English

speaking ability?’ This preliminary question allowed me to determine whether the site

was appropriate for investigation.

‘Do the teachers and students at the institution have experience as test raters and test

takers in EFL speaking tests?’ I asked this question to decide who would be invited to

be research informants.

‘Does each institution apply a speaking test format different from that used in the

others?’ This information helped to anticipate whether I could collect diverse data in

relation to oral test formats.

The answers to these questions, which helped to make judgements about sample

selection, were derived from what I knew about the institutions, and from pre-existing

professional networks with individuals from these organisations.

3.3.2 Research participants

The participants recruited for this study included subjects who were directly involved in the

end-of-course examination of English speaking skills, i.e. EFL teachers as test raters and

EFL students as test takers. After the research sites were identified, those who met the

required criteria were invited to join the research study. I sent out invitations until a

sufficient number of participants was obtained (Johnson & Christensen, 2014). These

invitations included the Information Statement about the research project and the Consent

Form (Appendix A). In addition, EFL experts constituted the third group of participants to

provide judgements about the test contents used in the school-based speaking examinations.


The relevance of test contents was part of the study to estimate the extent to which each test

obtained its content validity. More descriptions of the research participants including (1)

EFL teachers, (2) EFL students, and (3) EFL experts are detailed in the following sections.

(1) EFL teachers

The first group of the research participants were EFL teachers at the selected universities.

They had experience in teaching EFL speaking skills and undertook the role of examiners

in the speaking test under investigation. In accordance with the test administration at each

institution, the EFL teachers were either raters for the classes they were in charge of, or

they swapped their classes to be raters for other classes, or the Faculty nominated EFL

teachers other than the official ones to be test raters for the classes they had not been in

charge of. This variation in rater arrangement depended on each institutional policy.

Besides the teachers acting as official raters, I invited other EFL teachers of the

Faculty with experience in teaching and assessing English speaking skills to complete the

questionnaire survey because they had similar characteristics to the official teacher raters.

They would be able to teach and/or assess speaking skills in the next semester or anytime

during the current semester if there was a requirement for substitution by the Faculty.

Therefore, their opinions from the perspective of teacher raters were taken into account for

the present study.

(2) EFL students

The second group of participants, also the largest group, were all the second-year students

majoring in the English language. After I gained permission from their teachers, I gave a

short introduction about the research project in class, and invited them to participate in the

questionnaire survey. To ensure a sufficient and consistent provision of information as

planned, I followed procedures and speaking notes to communicate with potential student

participants in these short class meetings (Appendix C.1). The students returned the signed

consent forms prior to the actual commencement of the survey.

I targeted second-year students as research samples to make sure that the selected

candidates were quite familiar with speaking tests; they had experienced sitting for oral

examinations in their first academic year at the university. Such familiarity would help


them to participate with less hesitation and provide stable and comprehensive perceptions

about what a speaking test was and how it was organised. Another reason why second-year

majors were chosen to be samples for the research, rather than students of the other years

was that oral testing of EFL was likely to be new to first-year students due to the diversity

in their social and English learning background from previous high schools. EFL majors

did not have language skills lessons, including Speaking skills, in their training

programmes for the third and fourth years. Many of them might have come from the

countryside or rural areas and/or had no experience of oral tests, since most English tests at

Vietnamese high schools, even in towns and cities, fundamentally focused on Reading,

Grammar, Vocabulary and Writing (Nguyen, 2017; Nhat, 2017).

It normally takes students from three and a half to four years to complete their

training course at a Vietnamese university. Students whose major is English can choose to

pursue a BA degree either in Pedagogy (English teaching), Studies of the English language,

Business English or Translation and Interpretation. Speaking is one of the four language

skills they learn in their first two years at university. Later in their third and fourth years, they

focus mainly on their specialised subjects, e.g. Syntax, Semantics, English teaching

methodology, etc. that serve their future career.

(3) EFL experts

The selection of EFL experts to judge the relevance of the speaking test contents followed

procedures suggested in the literature on content validity in testing and assessment

(Delgado-Rico, Carretero-Dios, & Ruch, 2012; Polit & Beck, 2006). To minimise the so-

called “halo effect” (Dörnyei, 2003, p. 13) and obtain more objective judgements on the

relevance of the test contents employed in the test, I contacted EFL experts from outside the institutions involved in the study, rather than those working at them. This variation in recruiting

research participants received supplementary HREC Approval in 2017 (Appendix A.4).

The prospective experts were expected to have the following characteristics:

• Professional qualification: holder of an M.A. in TESOL or a higher degree in EFL Education.

• Experience: at least 5 years in test design and/or rating speaking performance.

• Gender: I attempted to balance the mix of voices from male and female experts. However, male experts specialising in EFL education and/or EFL assessment to whom I had access were considerably outnumbered by female experts.

I adopted the “snowball” (also called “network”, or “chain”) sampling method when

recruiting the expert panel. Thanks to professional networking, I invited two EFL experts

with the necessary characteristics to participate in examing the speaking tests’ contents.

The initially selected experts then suggested the names of others who were appropriate for

the sample (Creswell & Clark, 2011; McMillan & Schumacher, 2001). I asked the selected

experts to recommend others who could meet the criteria. As local EFL experts knew each

other better than I, it did not take much time to find four more experts as planned. I then

contacted the experts in person to gain their consent and provide a briefing on what they

were invited to do, providing the same judgement protocol for all the experts.

The test content judgement protocol comprised three lists of test items in the form

of specific questions or test tasks gathered from the authentic test material provided by the

three institutions. Each expert individually rated each test item in terms of its relevance to

the course objectives and the contents of the course book by marking a tick (✓) to indicate whether it was (1) Highly irrelevant, (2) Irrelevant, (3) Relevant, or (4) Highly relevant. The

experts were encouraged to make further comments on level of difficulty and/or make

suggestions for revision of each item if necessary (see Appendix B.4).
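To illustrate how such 4-point relevance ratings can be summarised, the short Python sketch below computes an item-level content validity index (I-CVI) in the spirit of Polit and Beck (2006), treating ratings of 3 (Relevant) or 4 (Highly relevant) as endorsements of relevance. The code, the item labels, and the ratings are hypothetical; in the study itself, the experts' ticks were coded for analysis in SPSS, as described in Section 3.4.4.

# A minimal, hypothetical sketch of summarising expert relevance ratings as an
# item-level content validity index (I-CVI); ratings of 3 or 4 on the 4-point
# scale count as "relevant" (cf. Polit & Beck, 2006). Not the study's actual
# SPSS procedure.
from typing import Dict, List

def item_cvi(ratings: List[int]) -> float:
    """Proportion of experts rating an item 3 (Relevant) or 4 (Highly relevant)."""
    return sum(1 for r in ratings if r >= 3) / len(ratings)

# Hypothetical ratings from six experts for three test items at one institution
expert_ratings: Dict[str, List[int]] = {
    "Item 1": [4, 3, 4, 4, 3, 4],
    "Item 2": [2, 3, 3, 4, 2, 3],
    "Item 3": [1, 2, 2, 3, 2, 2],
}

for item, ratings in expert_ratings.items():
    print(f"{item}: I-CVI = {item_cvi(ratings):.2f}")

An I-CVI approaching 1.00 would indicate strong expert agreement that an item is relevant to the course objectives, while low values would flag items for the kind of revision the experts were invited to suggest.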

Data provided by participants with various roles directly or indirectly associated

with the test helped to reflect the research problem from different angles. Wherever

possible, information and reflections from participants were mixed during an overall

interpretation to examine in what ways the results were convergent or divergent. Table 3.1

shows the research samples selected for the study and the methods of data collection

including questionnaire surveys, face-to-face interviews, test room observation, speech

sample recording, and test content judgement.


Table 3.1 Sampling of participants and research instruments used in the study

Method of data collection    Participants                   N     Research instrument/device
Questionnaire survey         + EFL teachers                 35    Questionnaire for teachers
                             + EFL students                 352   Questionnaire for students
Face-to-face interview       + Individual EFL teachers      6     Interview protocol for teachers
                             + EFL students in 7 groups     27    Interview protocol for students
Observation                  Speaking test rooms            6     Test room observation scheme
Speech sample recording      EFL students                   84    Audio recorder
Content judgement            EFL experts                    6     Test content judgement protocol

3.4 Data sources

This study used multiple methods for collecting data of different types from different sources. The following sections present justifications and descriptions of the research instruments' development and the procedures for using them with different groups of participants.

I gathered information about Speaking course outlines, test formats, and

administration guidelines in order to prepare for data collection on the test day. I carried out

test room observations during the test; questionnaire surveys and face-to-face interviews

were completed with EFL teachers and students after the test; and judgements on the

relevance of the test contents were provided by EFL experts using the judgement protocol

constructed from the authentic speaking test questions collected. Depending on each

particular method of data collection, different types of data (qualitative, quantitative, or

both) were gathered to examine the key aspects in oral assessment related to the research

questions: test administration (1a), test contents (1b), test tasks (1c), raters and rating (2),

and test impact (3). A summary of how I used different data sources to explore important

areas in speaking test validation can be found in Appendix E.3.

Before collecting any official data, I carried out procedures in compliance with the research protocol approved by the Human Research Ethics Committee (HREC) of the University of Newcastle, Australia. The first point of my contact with each educational institution was the Head of the EFL Faculty, to seek consent before contacting their EFL

teachers and students.

I had a short meeting with the teachers in charge and their students in class a few

weeks before the test day. Setting up initial contacts has been proven to be “an effective

method of generating a positive climate for the administration and it also reduces the

anxiety caused by the unexpected and unknown" (Dörnyei, 2003, p. 84). These pre-data-collection steps enabled me to make positive impressions on the potential participants by

discussing my research intent and inviting the teachers together with their students to

participate in the study. I explained that by spending a little of their time contributing their

opinions and knowledge, they would help the study project aimed at enhancing the quality

of EFL teaching in general and assessing EFL speaking skills in particular. At these

meetings, I presented a brief introduction to the reason for my presence at their institutions

and answered any questions that the prospective participants raised (Appendix C.1). I

delivered copies of the Information Statement and the Consent Form to the potential

participants and allowed them about ten minutes to read the documents

(Appendices A.3-4). Without my intervention, the teachers and students returned their

Consent Forms with their signatures indicating whether or not they would participate in the

study, and whether they would participate in all or part of the items listed in the Consent

Form.

Based on the number of students who gave their consent to participating in the

study, I made necessary arrangements to begin collecting data from the students on the test

day, including observation and questionnaire survey. Only those who returned the consent

form with their signatures of agreement were contacted to complete the questionnaire. As I

was engaged with the test room observation, I asked two monitors for assistance with

delivering copies of the questionnaire to the students and collecting them immediately after

they finished their speaking session in the test room.

I sought cooperation from a teacher who had a good rapport with the staff at each

institution to assist with delivering the Information Statement and the Consent Form to

potential English speaking skill teacher participants and collecting them after one week.

This ‘network’ sampling method enabled me to know how many teacher raters had agreed


to participate prior to the test day to ensure suitable plans for data collection on that day. I

was very appreciative of the teacher raters’ cooperation for their 100% participation in the

study, which made the test room observation possible and fruitful with data collected from

“real-world” and “naturalistic settings” (Johnson & Turner, 2003, p. 313).

3.4.1 Test room observation

I included speaking test room observation as part of the research design to gather

information about the actual test administration, any verbal interactions occurring during the

test, as well as the overall atmosphere in the test room and the examination area. I attended

the speaking performance of the candidates on the testing day after having obtained

permissions from the EFL Faculty, the official examiners, and individual candidates.

Although all the test raters gave consent for me to observe their test room as a non-

participant, i.e. the whole test session took place as normal without any interference from

the researcher, not many candidates felt comfortable being observed. Only the speaking

sessions of the students who had given consent to the study were observed.

According to the information collected from initial contacts, all second-year English

majors at University A would take the speaking test on the same day and at University B on

another day in January 2016. University C organised classroom-based speaking assessment

for their second-year English majors on different days without overlapping the test days at

Universities A and B. Therefore, an assistant was recruited to observe together with me so that, at each of Universities A and B, two test rooms running concurrently could be observed on the same day.

The assistant observer was an EFL teacher colleague who had about ten years’ experience

in EFL teaching, testing, and class observation. Before the test, I discussed with the co-

observer how to record the speaking performances, take fieldnotes and fill in the

observational scheme to make sure “each category for observation is demarcated

beforehand, agreed and understood” (Scott & Morrison, 2006, p. 242). A trial using the

observational protocol (Appendix B.1) with a simulated speaking test was organised for us

both to make “a preliminary assessment of inter-observer reliability” so that necessary

amendments to the observational procedure could be made before the actual test (Cohen,

Manion, & Morrison, 2007, p. 401).
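Purely as an illustration of what such a preliminary inter-observer reliability check might look like once the trial codings are tabulated, the hypothetical Python sketch below compares two observers' categorical judgements on the same items using percent agreement and Cohen's kappa; the category labels and codings are invented, and no particular statistic is claimed for the actual trial.

# A minimal, hypothetical sketch of checking inter-observer reliability from a
# trial observation: percent agreement and Cohen's kappa for two coders'
# categorical judgements on the same scheme items.
from collections import Counter

def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' judgements on the same items."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(coder_a) | set(coder_b))
    return (observed - expected) / (1 - expected)

# Hypothetical codings of ten scheme items from the simulated speaking test
researcher = ["pair", "pair", "solo", "pair", "solo", "pair", "pair", "solo", "pair", "pair"]
assistant  = ["pair", "pair", "solo", "solo", "solo", "pair", "pair", "solo", "pair", "solo"]

agreement = sum(a == b for a, b in zip(researcher, assistant)) / len(researcher)
print(f"Percent agreement: {agreement:.2f}")
print(f"Cohen's kappa:     {cohen_kappa(researcher, assistant):.2f}")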


Observation provided me with an opportunity "to sample educational experience first-hand rather than depend on what participants say they do" (Scott & Morrison, 2006, p. 168). Live observational data gathered from a naturally occurring social situation have distinctive features that other methods cannot provide. They help the researcher to learn about the

physical environment (the physical setting), the people involved (the human setting), verbal

and non-verbal exchanges (the interactional setting), and the organisational styles (the

programme setting) being observed (Morrison, 1993, p. 80).

Quantitative and qualitative data were recorded in a 14-item observational scheme.

There were fixed options of descriptions about the test room for the observer to choose from. In addition, there were open description fields for the observer to complete on the basis of what was seen and heard during the observation process. Such "intramethod mixed

observation” allowed me to achieve the goal of “capitalizing on the strengths of both

qualitative observation and quantitative observation” (Johnson & Turner, 2003, p. 313).

Table 3.2 summarises the contents in my test room observation scheme.

Table 3.2 Summary of key points covered in the test room observation scheme

Test room fieldnotes           Contents
(1) Procedural information     Name of institution, examiners' names, total number of examinees, test room number, test time and date
(2) Oral test administration   Test format, interlocutor inclusion, timing
(3) Speaking tasks             Task requirement, time constraints, interactional patterns
(4) Test preparation           Questions or topic given in advance
(5) Rating and scoring         Assessment criteria, rating scale (scoring rubric)
(6) Break time                 Interval inclusion in a test event
(7) Influential factors        External factors affecting the test operation
(8) Test venue                 Space, location, facilities, use of mobile phones, waiting area
(9) Minute taking              Test rater's and test taker's activities
(10) Test room plan            Visual description of the test room plan

3.4.2 Questionnaire surveys

My empirical study targeted EFL teachers and students involved in an assessment event administered at different tertiary institutions at nearly the same time. I selected a questionnaire design to collect factual, behavioral and attitudinal data from respondents who were testers and test takers in this study (Dörnyei, 2003, p. 8). The questionnaire is recognised as an effective device to generate quantitative information about

a large number of participants’ behavior, attitudes, opinions, beliefs, characteristics,

expectations, knowledge, and self-classification (Neuman, 2011). In educational research,

questionnaires are described “as a safety net for objectivity, scales, breadth and

generalizability” (Scott & Morrison, 2006, p. 189).

This type of survey allowed the researcher to gather a remarkable amount of

information from large groups of respondents in a limited time frame. There was no option for the survey date other than the day when the students took the speaking test, as it

was their final class meeting. After that day, they would be enrolled in different classes in

the following semester.

The questionnaire survey could not be completed earlier than when the students finished their speaking sessions, since there were question items asking about their experience with, and attitudes towards, the completed test. The main purpose of conducting

teacher and student questionnaire surveys was to capture the overall picture of assessing

EFL speaking skills across universities in Vietnam. Data about facts, behaviour, and

attitudes were provided by the stakeholders themselves, that is, the testers and test takers involved in the testing process.

There were two questionnaires designed for two groups of participants: a five-page

questionnaire for EFL teachers (1), and a three-page questionnaire for EFL students (2).

The teacher questionnaire was written in plain English to make sure it was thoroughly

understood. The student questionnaire was translated into Vietnamese so students of all

levels could understand it clearly and accurately.

(1) Questionnaire for EFL teachers

The questionnaire delivered to EFL teachers covered five A4 pages and was written in

English. There were 60 numbered question items in total (Appendix B.2a). Applying

“sectionalizing” as “a useful technique for grouping together questions about a specific

issue” (Cohen, Manion, & Morrison, 2007, p. 338), I divided these questions into eight

sections for clarity with focus headings: Demographic information, Teaching and testing


experience, Test preparation and administration, Test tasks, Assessment criteria and rating

scale, Rating process, Test impact on teaching and learning, and Other questions. The

contents of the teacher questionnaire are summarised in Table 3.3.

Table 3.3 Summary of contents covered in the questionnaire for EFL teachers

Section Contents

(1) Demographic information

Five questions collecting basic information about the participant’s workplace, gender, age group, first language, and professional qualifications

(2) Teaching and testing experience

Seven questions gathering the respondent’s experience in EFL teaching, testing, test training, speaking assessing, and using spoken English in class

(3) Test preparation and administration

Nine questions addressing teachers’ perceptions about test preparation and administration such as the test room atmosphere, informed assessment criteria, timing, and ensuring consistency in testing

(4) Test tasks Ten questions covering aspects about test tasks such as test design, candidates’ familiarity with the test tasks, and relationship between language in the course book and task performance

(5) Assessment criteria and rating scale

Seven questions regarding teachers’ evaluation of the assessment criteria and rating scale they used in terms of their development, usefulness, and correspondence with course objectives

(6) Rating process Seven questions addressing teacher raters’ practice and behaviours in their rating process

(7) Test impact on teaching and learning

Nine questions asking the teacher raters about evidence of impacts the test had on EFL teaching and learning

(8) Other questions Eleven questions addressing raters' difficulties in assessing speaking, awareness of the purpose of the test, preferences in rating and scoring, adjustments in teaching practice, advice for candidates, and recommendations for improving the effectiveness of the current test

The questions were designed in both open and closed formats. For closed questions,

the respondents had to choose a suitable answer that best described their opinion from a set

of prescribed options (Bryman, 2012). Questions of this type did not take participants much

time to complete and enabled the researcher to code and process responses quite easily for

statistical analysis on a computer (Bryman, 2012, p. 249). On the other hand, open-ended

questions, as their name suggests, do not restrict the respondents’ options for answers but

resemble “an open invitation” to provide their answers in their own words (Cohen et al.,

2007; Creswell & Clark, 2011). The combination of both closed and open questions in such


a semi-structured questionnaire helped to establish “the agenda” of the research, and

respect “the nature of the response” (Cohen et al., 2007, p. 321).

I included clear instructions before each type of questions to make sure that the

respondents would understand exactly how they were expected to answer. This procedure

aimed to increase the credibility of the data collected. There were three groups of questions

in the teacher questionnaire: multiple-choice questions (Questions 1-7), Likert scale

questions (Questions 8-49), and multiple answer questions (Questions 50-60).

For each multiple-choice question, the respondent was asked to choose only one of

the provided options that best described his/her teaching and testing experience. The

multiple-choice section came first in the questionnaire as these questions were "reader-friendly" and "relatively straightforward" (Dörnyei & Taguchi, 2010, p. 33), which would

give the respondent an easy and comfortable start to continue with the next sections.

For each item in the Likert scale section, the respondent indicated his/her attitude

and opinion about a given statement by choosing one of the five options: (1) Strongly

disagree, (2) Disagree, (3) Neither agree nor disagree, (4) Agree, and (5) Strongly agree.

The utilisation of an attitude scale is considered to be very helpful for my research to

measure “a degree of sensitivity and differentiation of responses while still generating

numbers” (Cohen et al., 2007, p. 325).

For each of the multiple answer questions, the respondent could choose more than

one possible option and/or give his/her own opinions. Unlike the previous sections in which

respondents’ answers were limited within a set of fixed options for each question, a

combination of different types of questions in this section not only elicited clear-cut

responses but also served as “a funnelling or sorting device for subsequent questions”

(Cohen et al., 2007, pp. 322-323). This section comprised various types of questions:

dichotomous, multiple-choice, and open-ended questions. Dichotomous questions (e.g.

Question 58) required respondents to make a choice between two possible answers and

provide further information for the selected option. Multiple-choice questions (e.g.

Question 52) allowed the respondent to choose as many possible answers as they thought

appropriate and motivated them to provide short answers for the ‘Other’ category in their

own words (Cohen et al., 2007). These ideas enabled me “to trawl for the unknown and the


unexpected” (Gillham, 2008, p.34) when I developed the questionnaire. Open-ended

questions (e.g. Question 60) enabled respondents to reply in their own words and opinions.

Dörnyei (2003, p. 47) comments that "open format items can provide a far greater 'richness' than fully quantitative data. The open responses can offer graphic examples, illustrative quotes, and can also lead us to identify issues not previously anticipated".

(2) Questionnaire for EFL students

The questionnaire delivered to EFL students covered three A4 pages and was written in

Vietnamese. I arranged for the monitor of each class to assist with delivering the

questionnaire to the students right after they finished their speaking session, as they walked

out of the test room. Each student spent about 15 minutes completing the questionnaire and

returned it to the monitor. I collected all the completed questionnaires from the monitors at

the end of the speaking test event.

There were 46 question items in the student questionnaire. The questions were

grouped into six sections under the headings: EFL learning and test-taking experience,

Test administration, Test tasks, Rating and scoring, Test impact on learning, and Other

questions as summarised in Table 3.4.


Table 3.4 Summary of contents covered in the questionnaire for EFL students

Section Contents

(1) Demographic information Three questions gathering factual information about the participant: name of the university, gender, and age group

(2) Learning and test-taking experience

Seven questions addressing the respondent’s experience in EFL learning and test-taking experience, self-evaluation of EFL proficiency, physiological status, class attendance and preferences of test formats

(3) Test administration Seven questions regarding test takers' perceptions about test administration in terms of preparation, facilities, and organisation

(4) Test tasks Ten questions addressing candidates’ perceptions about and evaluation of the test tasks they performed in terms of interest, level of difficulty, timing, and association with what had been learnt from the speaking course

(5) Rating and scoring Eight questions addressing candidates’ opinions and perceptions about the test rating and scoring

(6) Test impact on learning Six questions asking the candidate about test impacts on their EFL learning in regard to spoken English improvement, learning strategies, test-taking skills, and motivation in learning English speaking skills

(7) Other questions Seven questions covering various topics such as candidate’s anxiety, factors influencing test performances, test scores, test preparations and suggestions for improving the quality of language testing

The layout and the types of questions used in the student questionnaire were similar

to those in the teacher questionnaire (Tables 3.3 and 3.4). I repeated the same topics (e.g.

test administration, test tasks, rating and scoring) asked in the questionnaires to compare

findings from the two important groups of stakeholders and to examine convergence and/or

divergence when integrating data from the two different sources.

3.4.3 Interviews

As an educational researcher, I was interested in stories behind the numbers in statistical

results. There was no better choice than having face-to-face conversations with people who

were actually involved in the testing event. Interviews with test raters and test takers were

anticipated to yield a deeper understanding about the phenomenon under study. As test

raters and test takers held different positions and played different roles in the test, the

researcher had different methods of approaching them. Test raters were supposed to make their own decisions when assessing and scoring candidates' performances. Therefore, individual interviews with test raters would best suit the research subjects and purposes, providing opportunities to probe their beliefs, attitudes, and knowledge about oral assessment. Table

3.5 presents a content summary of interview questions for EFL teachers.

Table 3.5 Summary of the EFL teacher interview protocol

Section Contents

(1) Introduction Expressing thanks for the interviewee’s participation, stating the purpose of the study and reconfirming the anonymity of participating in the current research

(2) Opinions and evaluation

Four questions addressing teachers’ self-evaluation of the practice of assessing speaking at their institution in terms of its importance, achievements, strengths and weaknesses, and giving feedback after the test

(3) Rating and scoring Five questions addressing the issues of raters’ performance in the test in regards to rating methods, difficulty in rating, consistency assurance, and factors affecting the rater’s performance

(4) Test impact on teaching

One question about possible impacts of the test that might result in some adjustments in teaching methods

(5) Recommendations and suggestions

Three questions inviting the interviewee to provide suggestions to improve the current oral assessment in terms of rater involvement, test audio-recording, and CBT application

(6) Extension One question encouraging the interviewee to add any other comments or raise any questions they might have

I arranged interview schedules at the interviewees' convenience. Meetings in person were scheduled after the test at a place and time that suited the participants best, so that they were comfortable disclosing their perspectives. The semi-structured interviews with teachers were conducted in Vietnamese, as it is the mother tongue of both the interviewer and the interviewees. Sometimes code-switching between Vietnamese and English did occur in the interviewees' responses because they felt more comfortable using some English terms or phrases which they used almost every day in their lessons at school.

Test takers were young adults who shared many common characteristics since they

were in the same class, attending the same language course and taking the same test at their

institutions. Hence, focus group interviews were organised to “encourage a variety of

viewpoints on the topic in focus for the group” (Kvale & Brinkmann, 2009, p. 150). I

played the role of a moderator in these interviews by introducing the discussion topics,


raising guiding questions, and facilitating verbal interchanges among group members. The

interview protocol for EFL students covered seven points as presented in Table 3.6.

Table 3.6 Summary of the EFL student interview protocol

Section Contents

(1) Introduction Expressing thanks for the interviewees’ participation, stating the purpose of the study and reconfirming the anonymity of participating in the current research

(2) Self-evaluation One question addressing candidates’ general self-evaluation of their speaking performance in the test

(3) Perceptions Five questions encouraging candidates to discuss their perceptions about test preparation, test administration, test usefulness, external factors affecting speaking performance

(4) Test scores Two questions gathering candidates’ opinions about the test scores they received

(5) Test impact on learning

One question about possible impacts of the test that might result in some adjustments in language learning strategies

(6) Preferences Four questions inviting candidates to talk about the oral assessment in terms of rater involvement, test formats, test audio-recording, and CBT application

(7) Extension One question encouraging the interviewee to add any other comments or raise any questions they might have

Group interviews with four representative students from each class were conducted

right after the scores had been officially released (varying from one to two weeks after the test). I employed the "snowball" sampling method again to facilitate group members sharing and discussing with one another in an open and friendly manner. For each test

room, I selected one student and asked him/her to recommend three more classmates of

his/hers to form a group of four, preferably with mixed genders if possible. It was important

that all the students gave consent to a face-to-face group interview. I emailed the list of

interview questions to the group at least one day prior to the interview so that they could

have some preparation and did not feel anxious about the questions to be asked.

When conducting focus group interviews with the students, I constantly bore in

mind how to encourage the members “to talk freely and emotionally to have candour,

richness, depth, authenticity, honesty about their experiences” (Cohen et al., 2007, p. 270)

so that a rich source of extensive data could be collected. All the interviews were conducted


in Vietnamese since the interviewer’s and interviewees’ first language was Vietnamese.

The interviews were audio-recorded and transcribed in Vietnamese versions for full

understanding and thematic analysis. I followed procedural steps (Creswell, 2012) for

conducting interviews: giving each participant a formal invitation with an Information

Statement and a Consent Form (Appendices A.2, 3, 5, 6), agreeing on an appropriate time and place for the interview, using icebreakers to encourage individuals to talk, asking

questions and recording participants’ answers, thanking informants for their contribution,

and writing down comments that helped explain the data.

The purpose of these interviews, for both examiners and candidates, was to enable

interviewees to provide ideas that went beyond the questions in the questionnaires. Probing

the interviewees was possible in an informal conversational atmosphere. However, the

interview questions were “prespecified and listed on an interview protocol, they can be

reworded as needed and are covered by the interviewer in any sequence or order” (Johnson

& Turner, 2003, pp. 305-306).

3.4.4 Expert judgements

I invited six EFL experts to participate in the study voluntarily. Besides the Information

Statement and the Consent Form (Appendices A.5b and A.6b), I provided each with the

Test Content Judgement Protocols corresponding with the course outlines, course books,

and the test contents used at the three institutions (Appendix B.4). Each expert gave

independent ratings for test questions indicating their content relevance to the course

objectives. In addition to judgements on a 4-level scale – Highly irrelevant, Not relevant,

Relevant, and Highly relevant (adapted from the scale developed by Davis, 1992), the

content experts made further comments on the wording, language focus, and degree of difficulty of the question items. I will present a detailed description of designing the content

judgement protocol in Section 5.3.

With experience in EFL testing and assessment, the content experts could also make

suggestions for revision of particular items wherever possible. The experts had one to two

weeks to complete their evaluation, and returned the documents to me for data entry.

Feedback in the form of relevance ticks was coded for SPSS statistical analysis and further

textual comments were displayed on a Microsoft Word sheet (Appendix E.4b) for inter-expert thematic comparison. Table 3.7 shows demographic information about the experts,

who were all native Vietnamese teachers of EFL and within the age group of 40 to 45.

Table 3.7 Information about EFL experts

Expert code   Gender   Professional qualification   Years of experience   Position
01            F        M.A.                         20                    Head of L2 Department
02            F        M.A.                         6                     Head of L2 Department
03            F        Ph.D.                        10                    EFL Lecturer
04            F        M.A.                         8                     EFL Lecturer
05            M        Ph.D.                        15                    Deputy Head of L2 Department
06            F        M.A.                         10                    Dean of EFL Division

These experts had an average of 11.5 years of experience working on TESOL

course design and/or language test development. Two of them held a Ph.D. in Education, and the others had an M.A. degree in TESOL. All of the EFL experts were teaching English and/or doing administrative work at tertiary institutions in Vietnam.

3.4.5 Documents

Besides the primary data sources collected from teacher and student participants by means

of questionnaires and interviews as presented above, I gathered official documents as a

source of secondary data that were used to “uncover meaning, develop understanding, and

discover insights relevant to the research problem” (Merriam, 1988, p. 118). According to

Bowen (2009), documentary materials have five functions to serve in a research

undertaking: (1) providing data on the context within which research participants operate,

(2) suggesting questions that need to be asked and situations that need to be observed as

part of the research, (3) providing supplementary research data, (4) providing a means of

tracking change and development, and (5) verifying findings or corroborating evidence

from other sources.

Documents used for this study took the form of official documents written,

recorded, or popularised by the institutions. They include course outlines, test scores, rating

scales and test administration guidelines that were classified into two subgroups:


constructed data and existing data (Johnson & Christensen, 2014). First, constructed data

were test scores because they were produced/constructed by research participants during a

research study. The scores were awarded to candidates’ speaking performances by test

raters. The sets of scores were used for statistical analysis regarding score distributions and

correlations across the universities. Second, existing data were available from course

outlines containing information about course objectives, course descriptions, teaching

schedules, assessment, and other related matters. They were used for triangulating with

findings from observational field notes on test administration and assessment criteria. In

addition, speaking test guidelines, scoring rubrics and scoring sheets were collected to

compare the test administration and assessment methods between schools.
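As a minimal illustration of the score processing mentioned above (examining score distributions across the universities), the Python sketch below computes basic descriptive statistics for three hypothetical sets of speaking scores; the figures, the 10-point scale, and the use of Python are all assumptions, since the actual scores were handled with the templates in Appendix D.1.

# A minimal, hypothetical sketch of describing speaking-score distributions per
# university; the scores and the 10-point scale are invented for illustration.
from statistics import mean, pstdev

scores = {
    "University A": [7.0, 8.5, 6.0, 9.0, 7.5, 8.0],
    "University B": [6.5, 7.0, 8.0, 5.5, 7.5, 6.0],
    "University C": [8.0, 8.5, 7.0, 9.5, 8.0, 7.5],
}

for university, s in scores.items():
    print(f"{university}: n={len(s)}, mean={mean(s):.2f}, "
          f"sd={pstdev(s):.2f}, min={min(s)}, max={max(s)}")

Correlational analyses (for example, between pairs of raters' scores) would follow the same general pattern of loading the score sheets and applying the relevant statistic.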

Despite potential benefits that documentary data can bring to research, careful

consideration was made as to how each kind of document could help to answer the research

questions, as Atkinson and Coffey (1997, p. 47) emphasise the role of learning through

documentary records:

That strong reservation does not mean that we should ignore or downgrade

documentary data. On the contrary, our recognition of their existence as social facts

alerts us to the necessity to treat them very seriously indeed. We have to approach

them for what they are and what they are used to accomplish.

In comparison with other data collection methods, approaching documents is, according to Bowen (2009, p. 31), the most effective thanks to its efficiency (documents are readily available and less time-consuming to collect), cost-effectiveness (affordable) and stability (suitable for constant reviews).

3.5 Data collection procedures

The research questions shaped the data collection procedures for my study. In general, both

objective data and subjective points of view, depending on the stage of the empirical research

cycle (Teddlie & Tashakkori, 2009, p. 88), are recommended for a mixed research

approach. I collected quantitative and qualitative data of different types from various

informants and sources relying on their presence and availability at three stages: before,

during, and after the assessment event. Table 3.8 summarises the methods I employed in a

three-stage process of collecting data for this study.


Table 3.8 Data collection methods in a three-stage process

Stage    Data collection method                  Instrument/device                                          Appendix
Before   - Initial contacts                      - Notes for initial contacts                               C.1
         - Documents                             - Document checklist                                       C.3
During   - Observation                           - Observation protocol                                     B.1
         - Audio-recording of speech samples     - Sony recorder manual                                     C.2
After    - Questionnaire surveys                 - Teacher questionnaire and student questionnaire          B.2
         - Interviews                            - Interview protocols for teachers and students            B.3
         - Test content judgements               - Test content judgements for Universities A, B, and C     B.4
         - Test scores                           - Templates for score processing                           D.1

In the following section, I describe in detail what data collection methods I used in

each stage. I include the rationale for using each method, and clarify how I manipulated

each method to collect the intended data.

Before the test

I processed preliminary information about the EFL teaching, learning and testing context

via informal talks with individuals who knew about potential research sites so that I could

make decisions on which targeted institutions were to be invited to participate in the study

project.

I collected data associated with test administration such as test formats, interlocutor

inclusion (if any), test time and venue, etc. to enable the preparation necessary for the next

phase of data collection. These strategies were very useful in preparation for data collection

prior to the oral test: knowing about the test formats helped me to adjust the questionnaires

and observational protocol to fit the kinds of data anticipated to occur; being informed of

the test time and venue enabled me to be proactive in preparing devices for sound recording

and arranging the student survey; and foreseeing the test tasks to be used helped me to have

the best selection of research sites to access the diversity of assessment methods currently


applied at local universities. Table 3.9 summarises the methods and procedure for

collecting data before the test took place.

Table 3.9 Data collection before the test

Methods | Data collection | Purposes
Initial contacts | Qualitative data about potential research sites and participants | Making the best selection of samples for the research purpose
Documents | Qualitative data about course outlines, test tasks and rating scales | Enabling necessary preparation for data collection and analysis after the test

During the test

After obtaining consent from the Heads of EFL Faculties, the raters, and candidates, I

arrived at the test venues for test room observation on the notified test date. Testing

procedures being applied for each candidate in the test room were observed and recorded in

the observational protocol. Samples of candidates’ speaking performances were audio-

recorded during this time. These two sources of data provided important evidence to

describe the test administration, examine the test consistency and authentic language outcomes

with reference to what was described in the documentary materials, and modify the

research instruments (interview protocols) to be used after the test. For example, I

attempted to seek answers to construct comprehensive understanding about the

phenomenon under study:

- Test administration: What interaction pattern occurred? Was there a break

(interval) during the test? How many examiners and candidates were involved in each oral

test session? How was the seating arrangement for the raters and test takers? etc.

- Test consistency: Was the speaking of each candidate timed equally? How were

questions delivered to elicit each candidate’s responses? Did candidates receive support

from the examiners during their speaking? etc.


- Language outcomes: What kinds of questions were used in speaking tasks? How

did candidates respond to the questions? What communication strategies did the candidates

use? etc.

I observed the test sessions in a non-participatory role and respected the proposed

schedule of the test. Only candidates who had given me consent were observed and had

their speaking performance recorded. Data collection methods used during the test event

are summarised in Table 3.10.

Table 3.10 Data collection during the test

Methods | Data collection | Purposes
Observation | Quantitative and qualitative data about test administration | Understanding test formats, interaction patterns, and the operation of test tasks
Audio-recording | Qualitative data of speaking test samples performed by raters and test takers | Examining speech patterns produced by candidates, and learning about the consistency in raters' performance

After the test

I conducted questionnaire surveys for candidates and examiners. Based on the list of

candidates agreeing to take part in the survey, I delivered a copy of the student

questionnaire to each of them to complete immediately after the speaking was finished. The

teacher questionnaire was a little more flexible. To reduce the effect of fatigue after the test

that might influence their response to the questionnaire, the raters had the option to do it at

home or at school and return the completed questionnaire after one or two weeks.

Based on the information from participant consent forms, I arranged one-to-one

interviews with teacher examiners and focus group interviews with candidates subject to

their availability. All of the interviews were audio-recorded. Interviews were an important

channel through which testers and test takers provided genuine qualitative data on their

opinions, attitudes and knowledge about the test they had been involved in.

After the test, I commenced the collection of testing material and other testing-

related documents with each Head of EFL Faculty’s approval and the teacher raters’


consent due to the security of the test contents. After gathering all the question items to put

in a judgement protocol, I invited six EFL experts to judge the relevance of test content

used in the test. I combined the quantitative and qualitative data from these experts’

judgements for the purpose of examining the content validity of the test, i.e. the extent to

which the tasks and questions used for assessment were relevant to the contents covered in

the course books of Listening-Speaking skills.
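To illustrate how such expert judgements might be aggregated, the following minimal Python sketch tallies hypothetical binary relevance ratings from six experts; the item labels, ratings, and the simple proportion-of-agreement summary are assumptions for illustration only and do not reproduce the judgement protocol actually used in this study (Appendix B.4).

```python
# Minimal sketch (invented data): tallying binary relevance judgements from six experts.
# The item labels and ratings are hypothetical; the study's actual judgement protocol
# and analysis may differ.
judgements = {
    "item_01": [1, 1, 1, 1, 0, 1],  # 1 = relevant to course content, 0 = not relevant
    "item_02": [1, 0, 1, 1, 1, 1],
    "item_03": [0, 0, 1, 1, 0, 1],
}

for item, ratings in judgements.items():
    proportion = sum(ratings) / len(ratings)  # share of experts judging the item relevant
    print(f"{item}: {proportion:.2f} of experts judged it relevant")
```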

One or two weeks after the test, I collected authentic test scores from each Faculty’s

secretary to compare candidates’ performances across institutions. Statistical analysis of

test scores served to describe the score distribution and estimate the extent to which raters

in the same test rooms agreed with each other on the score they awarded.

In sum, the three phases of data collection were designed to obtain the information

necessary for constructing a comprehensive and multi-dimensional picture of the oral test

being studied. Data collection methods used after the test event are summarised in Table

3.11.

Table 3.11 Data collection after the test

Methods | Data collection | Purposes
Questionnaire survey | Quantitative and qualitative data on candidates' opinions and reflections about the test, and on test raters' opinions about and practice in the test | Learning about test taker characteristics and raters' backgrounds; generalising and comparing testers' and test takers' perceptions of the test in which they were involved
Interview | Qualitative data on what test takers and raters think, say and do about the test (Scott & Morrison, 2006, p. 133) | Providing insights into various aspects of the test informed by stakeholders' viewpoints
Document | Quantitative data on test scores | Estimating inter-rater agreement and test score distribution
Test content judgement | Quantitative and qualitative judgements by EFL experts on the relevance of test items | Estimating the extent to which test contents were representative of the language domains covered in the course books

This section has presented how I applied a mixed methods approach in data

collection for the current study. I collected the mixed data from various sources by means


of different research instruments in three phases before, during and after the test depending

on the availability of the kind of data intended. An overview of the data collection is

summarised in the diagram presented in Appendix E.3.

3.6 Methods of data analysis

I analysed quantitative and qualitative data separately by using methods appropriate for

each type. Pertaining to the goal of this research project, the integration of quantitative and qualitative data was to simultaneously address both "confirmatory and exploratory questions and therefore verify and generate theory in the same study" (Teddlie & Tashakkori, 2009, p. 33). For example, the first research question (1a) of this study was exploratory in nature

and aimed at describing the operational practices of oral assessment across institutions. The

last question (4) was confirmatory in nature: testing and assessment do have effects on

teaching and learning. I examined this question by statistical analysis of questionnaire

surveys supported by evidence from interview data. Computer programmes played an

essential role in storing, retrieving and processing the data from both quantitative and

qualitative strands of the study.

3.6.1 Quantitative data processing

I used the SPSS Statistics software V22.0 (2016) to process and analyse quantitative data.

Most of the quantitative data of my study derived from the questionnaire surveys for EFL

students and teachers. First, I coded the respondents’ answers to closed-ended questions

into numerical values for data entry. Short answers to open-ended questions were typed and

imported into the NVivo V10 software package (2012) for qualitative data analysis, which I

describe further in the next section.
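As a simple illustration of this numeric coding step, the following Python/pandas sketch maps closed-ended answers onto numeric codes. The column names and example responses are invented; the anxiety codes follow the 1-3 scheme noted under Table 4.3, and the actual data entry for this study was carried out in SPSS.

```python
import pandas as pd

# Hypothetical raw closed-ended responses; the real data entry was done in SPSS.
raw = pd.DataFrame({
    "anxiety": ["Nervous", "Very nervous", "Not nervous at all", "Nervous"],
    "gender":  ["Female", "Male", "Female", "Female"],
})

# Map answer categories to numeric codes (the anxiety codes mirror the 1-3 scheme
# reported in Table 4.3; the gender codes are arbitrary illustrations).
anxiety_codes = {"Not nervous at all": 1, "Nervous": 2, "Very nervous": 3}
gender_codes = {"Male": 0, "Female": 1}

coded = raw.assign(
    anxiety_code=raw["anxiety"].map(anxiety_codes),
    gender_code=raw["gender"].map(gender_codes),
)
print(coded)
```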

The remaining data were the test scores that test raters awarded to candidates’ oral

performance in the test. The official test results were the agreed scores that pairs of raters gave to

each candidate, except for University C, where there was only one examiner (scorer)

making decisions on every score. Only the test scores of the students who gave consent

were used in the study.

After the data entry phase, data reduction approaches using internal consistency

reliability and factor analysis were adopted to develop multi-items scales from the


associated question sets in the questionnaires for EFL teachers and students (Williams,

Onsman, & Brown, 2010). The rationale for these approaches was that “multi-items scales

maximize the stable component that the items share and reduce the extraneous influences

unique to the individual items” (Dörnyei, 2003, p. 34).
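The internal consistency check underlying such multi-item scales can be illustrated with Cronbach's alpha. The sketch below is a minimal Python example using invented Likert-type responses; the real reliability and factor analyses for this study were conducted in SPSS.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal consistency of a multi-item scale (respondents x items matrix)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Five hypothetical respondents answering a three-item attitude scale (1-5 codes).
scale_items = np.array([
    [4, 5, 4],
    [3, 3, 4],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 5],
])
print(f"Cronbach's alpha = {cronbach_alpha(scale_items):.2f}")
```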

I conducted descriptive statistics of the data obtained from the two questionnaires to

rank categories, summarise patterns, or identify trends in the respondents’ answers about

their awareness, opinions and attitudes towards the oral assessment method in their

institutions. Depending upon the data type and information of interest, I selected the most

legitimate among the three ways: tabular, graphical, or statistical analysis for describing

quantitative data (de Vaus, 2002). Descriptive statistics provided the general information

about the distribution of the test scores obtained by the candidates (e.g. the range of the

scores) as well as other central issues in examining test scores (e.g. the mean, the mode, and

the median) across institutions.

Mean comparisons were also made to identify differences in candidates’ opinions

and test scores distributed among student groups in terms of gender, self-evaluation, class

attendance, etc. A comparison of standard deviation among institutions revealed how much

the scores of the groups varied from the mean and indicated how well the mean reflected

the scores of the test population as a whole.
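A minimal Python/pandas sketch of this kind of descriptive summary is given below; the scores and the assumed 0-10 scale are invented for illustration, and the actual descriptive statistics were produced in SPSS.

```python
import pandas as pd

# Hypothetical agreed speaking scores for consented candidates (scale assumed 0-10).
scores = pd.DataFrame({
    "university": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "score":      [7.5, 8.0, 6.5, 9.0, 8.5, 7.0, 6.0, 7.0, 8.0],
})

# Range, central tendency, and spread of the scores per institution.
summary = scores.groupby("university")["score"].agg(["min", "max", "mean", "median", "std"])
print(summary)
```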

In addition to descriptive statistics to learn about score distribution across the

institutions, correlations based on pairs of score sets informed me about the characteristics

of the scoring performances by pairs of oral test raters within the same institution (Black,

2005). The correlational results helped to confirm whether the directions of the

relationships were positive or negative as well as whether the relationships were weak,

average, or strong. In addition, test score correlations indicated whether the findings were

due to chance or whether they could reflect the common trends in a wider population

(Green, 2013).
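The following minimal Python sketch illustrates the paired-rater correlation described here, using scipy's Pearson correlation on invented score pairs; the real analysis of the authentic score sets was performed in SPSS.

```python
from scipy.stats import pearsonr

# Hypothetical scores awarded independently by the two raters in one test room.
rater_1 = [7.0, 8.5, 6.0, 9.0, 7.5, 5.5, 8.0]
rater_2 = [7.5, 8.0, 6.5, 9.0, 7.0, 6.0, 8.5]

# Pearson's r gives the direction and strength of the relationship between the two
# score sets; the p-value indicates whether the association could be due to chance.
r, p = pearsonr(rater_1, rater_2)
print(f"Inter-rater correlation: r = {r:.2f}, p = {p:.3f}")
```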

3.6.2 Qualitative data processing

I gathered qualitative data from interviews with EFL teachers and students, responses to

open-ended questions in the surveys, observational field notes and official documents in

relation to teaching and testing speaking skills.


Responses to open-ended questions in the questionnaires were treated on the basis

of the openness of each question. “Specific open questions” (Dörnyei, 2003, p. 48) were

coded with distinct labels and processed as nominal variables. For example, the “Other”

option at the end of Question 52 (Teacher questionnaire) encouraged the respondents to

specify their own answers that were assigned codes different from the available ones.

Responses to “clarification questions” or “short-answer questions” (Dörnyei, 2003, pp. 48-

49) were treated as qualitative data that would be coded to analyse emerging themes. For

example, Question 45 (Student questionnaire) invited the candidates to make suggestions

on examiners’ performance in facilitating candidates’ speaking in the test.

Transcribing audio data

There were two types of audio data collected for this study: recordings of oral test samples

and face-to-face interviews with research participants. The oral test samples were in mostly

spoken English since all the examiners delivered speaking tasks in English, and the

candidates gave oral performances in response to the tasks in English. I conducted

interviews with EFL teachers and students in Vietnamese for the reason of convenience

because we all spoke Vietnamese as our first language. Code-switching into English words

and phrases sometimes occurred in interviews with EFL teachers, but it did not affect our

communication because we had mutual understanding about EFL teaching, learning, and

assessment.

I had interview and speech sample data transcribed from an oral to a written mode

before analysis. The incorporation of speech details, linguistic features and non-verbal

information in transcribing depended on the nature of these audio materials, and “the

intended use of the transcript” (Kvale, 2007, p. 95).

I considered who would undertake the transcription and what transcription

procedure would be adopted for the two types of audio-recorded files. I replaced audio

segments containing personal identification with codes to ensure anonymity of the

participants before sending out the files for transcribing.

For the English speech samples, I used a human-generated transcription service

(https://gotranscript.com) to obtain professional verbatim transcripts by native English

transcribers. These transcripts came out with word-for-word accuracy including repetitions,


speech errors, unintelligible words, and such fillers as “um”, “hmm”, “so”, “you know”,

etc. I modified the transcripts by using an agreed upon system of symbolic notations

(Appendix I) for representing details of all utterances to provide sufficient information to

grasp not only what but also how the people were speaking at the time of recording

(Psathas, 1995, pp. 11-12). This transcription system facilitated conversation analysis (CA)

by preserving typical features of talk-in-interaction like pauses, hesitations, intonation, turn-

taking, non-vocal action, etc. (P. Atkinson, 1992). The detailed transcripts of oral language

production were essential to examine how candidates performed their English speaking

skills, and how speaking tasks generated verbal interaction between the examiner and the

candidate(s), or between paired candidates in a test session. My CA emphasised “issues of

meaning and context in interaction. It does so by linking both meaning and context to the

idea of sequence” (Heritage, 2004, p. 223).

For the Vietnamese interviews, I had two experienced Vietnamese transcribers

assist with the transcription and cross-check each other’s work. Interview analyses focused

on the meaning or content of what the interviewees said (Kvale, 2007, p. 104) rather than

how they spoke. I did not include additional features of speech such as pauses, hesitations,

intonation, word repetitions, or non-verbal information in the Vietnamese transcripts

because they did not contribute to the content analysis and thematic presentation of findings

later on. The interview transcripts in Vietnamese were less detailed than the speech sample

transcripts in English due to the consideration of time, financial and human resources

against the purpose of exploiting the qualitative data to answer the pre-formulated research

questions. To increase the practicality in data management, “the use of analysis techniques

such as thematic or content analysis (of the interviews) seeks to identify common ideas

from the data and, therefore, does not necessarily require verbatim transcripts” (Halcomb &

Davidson, 2006, p. 40). After obtaining sufficient transcripts, I continued the data

processing phase with data coding.

Coding qualitative data

My coding of the interview transcripts, open-ended responses, and speaking test samples

was completed by computer because of the large amount of textual information. Coding for

content analysis of the course outlines, assessment criteria from the three institutions and


further comments by EFL experts was performed manually as the number of these

documents was within my manageability.

Qualitative data in the form of interview transcripts were transferred into the software

package NVivo 10 for coding and analysis. The coding process of qualitative data obtained

from the interviews and open-ended responses was carried out in two phases: (1) Initial

Coding to become familiar with the “participant language, perspectives, and worldviews”

(Saldaňa, 2009, p. 48); and (2) Pattern Coding “to develop a sense of categorical, thematic,

conceptual, and/or theoretical organization” from the Initial Coding array (Saldaňa, 2009,

p. 149).

Coding qualitative data commenced after the interview transcripts were checked

against the audio recordings for accuracy, but also for ensuring the harmony of “verbatim

oral versus written style” (Kvale, 2007, p. 95). A code, as Saldana (2009, p. 3) defines, “is

most often a word or short phrase that symbolically assigns a summative, salient, essence-

capturing, and/or evocative attribute for a portion of language-based or visual data”. It may

summarise the main topic of a text segment (Descriptive Code) and be put in the

appropriate column. Alternatively, it may be taken directly from the informant’s original

words (In Vivo Code) and be placed in quotation marks.

When coding was achieved, the data were interrogated and “systematically explored

to generate meaning” (Coffey & Atkinson, 1996, p. 46). After that, data analysis was based

not merely on what content was conveyed but also how it was expressed within the texts

(Coffey & Atkinson, 1996, p. 83). In the transition from the coded data to interpretation, I

considered “comparative and relational analysis” (Bazeley, 2013, p. 228) as well as

patterns, themes, contrasts, regularities and irregularities to move toward generalisation

(Delamont, cited in Coffey & Atkinson, 1996, p. 47).

Code labels indicate what passages of data are about, so they will help to locate and

connect codes with one another in analytical stages afterwards. Coding [qualitative] data is

not trying to reach a clear-cut completion in one practical step, but it is “cyclical rather than

linear”, i.e. a progressive refinement of coded and recoded data (Saldana, 2009, p. 45). The

coding stage of this study thus was divided into two cycles:

First Coding Cycle


The objective of this initial coding level was to achieve manageable chunks of information

and organise the data into meaningful categories by attaching codes which came “more or

less directly from the informant’s words” (Coffey & Atkinson, 1996, p.36). Figure 3.1

illustrates how I reduced an extract from the interview transcript (U2.T2) by ruthlessly asking "So what?", "What's going on here?", "Why is that?", and so forth (Bazeley, 2013, p. 130).

Interview extract (U2.T2): "In my opinion, assessing students' speaking skills is very important because students might express their language ability naturally in their learning process without any pressure, but it is virtually true that psychological and other factors do have a lot of influence on them. In fact, without examinations, we cannot evaluate students' language competency. Such assessment and rating enabled us teachers to withdraw some experience in what needs improving, how we have to teach so that students can obtain better results. What has been achieved in the test makes the students themselves more confident to continue to study in the next stage. So I think it (assessing speaking) is very important."

Initial codes: 'very important'; naturally speaking in learning without pressure; psychological factors affect speaking in test; exams to evaluate students' language ability; experience for teachers from tests; what to improve; how to teach; achievement makes students confident; 'very important'

Figure 3.1 Example of a paragraph in the First Coding Cycle

Second Coding Cycle

Not all initial codes were reused “because more accurate words or phrases were discovered

for the original codes” (Saldana, 2009, p. 49). At this level, Pattern Codes were employed

as “explanatory or inferential codes, ones that identify an emergent theme, configuration, or

explanation” to answer the research questions (Saldaňa, 2009, p. 152). Appendix F.1

illustrates how my codes from the First Coding Cycle were “reorganized and reconfigured

to eventually develop a smaller and more select list of broader categories, themes, and/or

concepts” (Saldaňa, 2009, p. 149) in the Second Coding Cycle.


Some codes were merged together because of their conceptual similarity; others

were dropped for their redundancy or infrequency so as to facilitate the emergence of

themes, which have been described as “integrating, relational statement(s) derived from the

data that identifies both content and meaning” (Bazeley, 2013, p. 190). I generated several

themes surrounding particular topics in consideration of repetitive patterns or patterned

relationships between identified meaningful elements in the data. For example, candidates

mentioned “noise”, “temperature”, “physiological condition”, “anxiety”, “examiner’s

attitude”, etc. when asked about external factors that they thought might have affected their

speaking performance during the test.

Qualitative data analysis

There were discrete levels of moving from coding to thematic analysis and result

interpretation. Firstly, I displayed the coded data systematically so that they could be read

and retrieved conveniently by producing diagrams, matrices, tables and maps of the codes

(Coffey & Atkinson, 1996, p. 46). All of these were carried out manually since the number

of text pages was not too large. Secondly, the codes were “retrieved, split into

subcategories, spliced, and linked together” (Dey, cited in Coffey & Atkinson, 1996, p. 46).

After that I thoroughly explored and made sense of the data by recognising patterns,

establishing connections, and contrasting what the themes contributed. I implemented two

ways of making comparisons for further qualitative analysis (Bazeley, 2013, p. 257):

• Compare a category (represented by a code) from the perspective of two or more groups or under two or more different conditions; and

• Compare within or across cases (people, sites, incidents, etc.) that differ in some descriptive regard.

In addition, I incorporated and explored various concepts across several informants’

(or groups of informants, e.g. EFL teachers, students, and experts) responses to look for

dominant themes, and whether there were similarities or discrepancies in their stories. The

table in Appendix E.4a illustrates a comparison of candidates’ voices from different

institutions regarding their preferences towards the number of raters involved in oral

assessment.


3.7 Presenting data analysis

The presentation of research results follows the steps adapted from the socio-cognitive

framework (Weir, 2005, p. 46) for validating the oral assessment. I selected this framework

because it was fit to the purpose and the context of oral assessment administered in the

Vietnamese tertiary institutions being studied.

The framework covers key aspects of speaking test validation which my research

questions aimed to explore. Figure 3.2 summarises areas of my data analysis in relation to

test takers (characteristics, task-based performances), test raters (use of the rating scale,

marking oral performances), and the test itself (assessment tasks, administration, impact).

Figure 3.2 Framework for validating speaking tests (adapted from Weir, 2005)

The model allows the adoption of various types of data to learn about the

components of context validity (before the test), scoring validity (during the test), and

consequential validity (after the test). Diversity of data types and data sources were

essential to enable a more comprehensive understanding about the multifaceted problem of

oral assessment in L2 education. According to Weir (2005, p. 47), “the more evidence

collected on each of this framework, the more secure we can be in our claims for the

validity of a test”.

[Figure 3.2 (above) maps the validation process onto three stages. Before the test: test taker characteristics (physical/physiological, experiential) and context validity, covering task and administration (administration setting, interlocutor, task setting, task demands). During the test: scoring validity, covering rating (rater characteristics, criteria, rating scale, rating procedures, score awarding), which links the candidate's response to the score/grade. After the test: consequential validity, covering score interpretation (washback in classroom, impact on individuals and institutions).]


The framework allowed me to examine different types of validity in oral assessment

in three phases: before, during, and after the test. This division was well suited to my research

design for data collection to answer the research questions. I gathered data from relevant

sources throughout the process of oral testing from its beginning to end (Section 3.5). My

analysis is an integration of empirical evidence from educational institutions with varied

curriculum designs and learners from different backgrounds. This approach was aligned

with the purpose of the CEFR-based language assessment in Vietnam in attempting to

“specify objectives and describe achievement of the most diverse kinds in accordance with

the varying needs, characteristics and resources of learners” (Council of Europe, 2011b, p.

5).

3.8 Research ethics and compliance

My maxim governing this study was “Do no harm” to participants whether by observing,

surveying, or interviewing them (Booth, Colomb, & Williams, 2008, p. 83). Ethical

practices were constantly my primary consideration and placed at the forefront of my

concern in all stages and steps of the research process (Hesse-Biber & Leavy, 2006). The research process complied with ethical requirements from data collection through report writing to data storage.

§ Data collection: I paid constant respect to the research sites and individuals

involved through the use of informed consent and by considering their privacy during data

collection and after the research project.

Informed consent

My initial contacts were leaders of the educational institutions and Heads of the

Foreign Language Faculties in order to seek their full consent before any research was

undertaken at the universities. The type of data to be collected, and how the data were to be

gathered and used, were clearly stated in the written documents (e.g. the Information

Statements, the Consent Forms) that I consistently complied with.

I obtained access to potential teacher and student participants at each institution

after obtaining permission from the Head of the EFL Faculty. Taking part in the study was

voluntary, and non-participation did not lead to any negative consequences for the teachers,


students, or their institution (Porte, 2010, p. 100). It was the EFL teachers and students who

decided on their own whether to participate or not.

All the teachers or students went into the study with clear information about their

invited participation. They were informed in writing about the nature of the project and

their written consent to participate was obtained “to protect participants from harm and

assure their rights in the research process” (Seidman, 2013, p. 139). I sent potential research

participants written documents including the Information Statements and Consent Forms

(Appendices A.2, 3, 5, 6), prior to any data collection or direct individual contact. For

example, the Heads overseeing the English language skills teachers enabled me to have initial email

communication with prospective teacher/rater participants about the intended study to seek

their voluntary cooperation. I then made first contact in person with potential

student/candidate participants in their Listening-Speaking classes to introduce my study

purpose and data collection plans (Appendix C.1). Contacting the participants this way

aimed to ensure that they were free from any pressure to give permission for data

collection. I did not make recordings without participants' agreement (Rapley, 2007, p. 29).

The right of withdrawal from this study at any time was made clear and left to the

participants’ discretion. I greatly appreciated the privilege of conducting research in

authentic test rooms and tried my best to ensure that the teachers and students did not feel

the privilege was abused. For example, I only observed the test performances of the

students from whom I had received written consent. With reference to the list of candidates

who had consented, gathered from the first contact in class, raters helped me with the

audio-recording of speech samples so that my intervention in the oral test sessions was kept

at minimum.

Participants’ privacy

I explained to potential participants the anonymity of information they would be

asked to provide. All personal identifying details were removed during data entry, data

processing and presentation of research outcomes. Individuals were coded as numeric

variables for analysis and represented as reference codes, e.g. U1.T2 refers to the second

teacher of University A in the reporting of the results (Appendix E.1). Participants’

anonymity and confidentiality were ensured “to the very least through the non-appearance


of names” (Porte, 2010, p. 101). I treated data associated with the institutions and

participants in confidence without “disclosing the identity of the participants or indicating

from whom the data were obtained" (Wiersma & Jurs, 2005, p. 452). Only I, as a research

student, and my supervisors had access to the data collected for the purpose of the study.
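A minimal sketch of this de-identification step is shown below; the participant names are invented purely for illustration, and only the resulting reference codes (e.g. U1.T2) follow the convention described above.

```python
# Minimal sketch of the de-identification step: each participant is replaced by a
# reference code such as "U1.T2" (the second teacher at University A). The names
# below are invented purely for illustration.
teachers_by_university = {
    "U1": ["Teacher X", "Teacher Y"],
    "U2": ["Teacher Z"],
}

code_book = {}
for university, teachers in teachers_by_university.items():
    for i, name in enumerate(teachers, start=1):
        code_book[name] = f"{university}.T{i}"   # e.g. "Teacher Y" -> "U1.T2"

print(code_book)  # the code book itself would be kept securely, apart from the data
```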

§ Reporting and writing up results: I reported data with respect to the truthfulness of

the data collected, and without creating artificial data to facilitate certain predictions or

interest groups. Copies of jargon-free reports would be available to participants to

popularise the study results upon request as indicated in the Consent Forms for research

participants (Creswell, 2012).

§ Data storage: Data were both quantitative and qualitative in nature and were

stored in paper format and electronically on a CD-ROM. Complete copies of all the raw

data were secured with great care for at least five years after which they would be

destroyed.

3.9 Assuring research quality

This research study used a mixed methods design seeking to exploit the strengths of both

quantitative and qualitative data collected from authentic tests for the purpose of reflecting

a critical overview of EFL assessments across educational institutions in Vietnam. Assuring

the objectivity and accuracy of recordings was the central concern during the research

process. Care was taken to ensure that data of any type used in the study “allow[ed] fair,

unbiased comparisons among groups" (Wiersma & Jurs, 2005, p. 93) when differences

across institutions or participants were attributed to random fluctuations without any

systematic interventions. To achieve that goal, triangulation was utilised as a strategy to

enhance the validity of the data collection and the analytic claims based on research

findings. The value of this technique lies in “providing more and better evidence from

which researchers can construct meaningful propositions about the social world”

(Mathison, 1998, p. 15). In the current study, “validation-through-convergence” (Dörnyei,

2007, p. 165) was demonstrated through three main types of triangulation regarding data,

investigators, and methodology.

• Data triangulation. I gathered research data from multiple groups of informants in

different locational settings. For instance, to examine the test administration, I combined


data from observations with test raters and test takers’ responses across institutions. The

diversity of conditions was beneficial to a fruitful data collection and insightful analysis

leading to “not only convergent findings but also inconsistent and contradictory findings in

our efforts to understand the social phenomena that we study” (Mathison, 1998, p. 15).

• Investigator triangulation. I was not the sole individual to accomplish the entire

process of data collection. Routine activities were delegated to trustworthy assistants to

“minimize biases resulting from the researcher as a person” (Flick, 2002, p. 226). For

example, when the test was scheduled at different test rooms at the same time, a teacher

colleague was recruited as a co-observer in the other test room. Before the testing event

commenced, I gave practical briefings to the substitute observer on how to use the

observation sheet so that the highest level of consistency and preciseness of recording field

notes could be achieved.

• Methodological triangulation. The use of multiple methods in this study project

aimed to produce multi-faceted evidence about the local language testing. For example, to

study the reliability of the rating process, sets of scores independently given by raters were

statistically analysed (quantitative method). In addition, interview transcripts and

observation protocols were thematically synthesised (qualitative method) to learn about the

consistency in testing procedures applied to each test taker by different raters. I took

necessary steps during the investigation to ensure the credibility of the triangulated findings

by undertaking validity checks (Dörnyei, 2007, p. 60) with members and peers:

• Member checking (or respondent validation). Agreement from participants was

constantly sought not only in the data collection stage but also after research conclusions

had been reached (Bazeley, 2013, p. 408). Specifically, I checked with the teacher raters the

description information recorded in observational protocols for sufficiency and accuracy,

and with the interviewees their words and meaning as conveyed in the interview transcripts.

Especially, the transcripts of focus group interviews were sent to each group member to

confirm the authenticity of what they had said. The participants were involved in not only

commenting on the conclusions of the study but also suggesting actions to take in the report

writing stage as necessary (Crosby, cited in Merriam, 2009).


My study included reliability checks with peers (peer checking) to “clarify

interpretations, and to check for gaps and for bias” (Bazeley, 2013, p. 409). For example, I

invited Vietnamese fellow research students at the University of Newcastle to check the

accuracy and consistency of the English translations of interview transcripts with their

audio recordings. These peers, who are EFL university teachers, played an important role in

piloting the questionnaire and interview questions for the EFL teacher examiners. I

gathered their feedback on the wording and comprehensibility of the questions to refine the

research instruments for collecting more reliable data.

3.10 Summary

In this chapter, I have presented justifications for adopting a convergent mixed methods

research design to address the research questions. I described in detail procedures for

selecting research sites and sampling research participants. I clarified the development of

research instruments to collect quantitative and qualitative data together with methods for

analysing the data from multiple sources. As a research project in which the attitudes,

judgements, and interactions of human beings were questioned, I completed all the required

procedures to obtain approval for conducting the study as described in my application to the

HREC (Appendices A.1&4).

The rationale for using these methods laid a strong foundation of methodology to

establish validity in examining the research problem – measuring spoken language skills of

non-native students in an educational setting. It was the nature of the research questions

that prompted the requirements of using data from multiple sources to validate the findings.

Research results would be more logical and persuasive when displayed in both numerical

values (as evidence of a measurement tool) and textual explanation (as illustrations for

opinions and situational descriptions in a testing environment).

The next chapters present the results from the integration of both qualitative and

quantitative strands of the study to understand different aspects of the research questions.

Chapter Four focuses on test taker characteristics and test administration. I report EFL

experts’ evaluation on the relevance of speaking test items in Chapter Five. Chapter Six

aims to examine the diversity of test tasks employed across the institutions. I analyse issues

associated with raters and the rating process in Chapter Seven. My analysis in Chapter


Eight concentrates on the impact of oral testing. In the final chapter, I summarise key

results and include recommendations for further work in researching L2 testing and

assessment.


Chapter Four

RESULTS: TEST TAKER CHARACTERISTICS AND

TEST ADMINISTRATION

4.1 Introduction

In Chapter Three, I presented the methodology I adopted to examine oral assessment in

Vietnamese tertiary education. My discussion provided a rationale for the use of a mixed

methods approach to the stages of data collection, data analysis, and presentation of results.

Primary participants in my study were EFL teachers (examiners) and EFL second-year

majors (candidates) from three universities engaged in the speaking tests administered at

their institutions at the end of the first semester of the academic year 2015-2016.

I gathered data before, during, and after the examination. Before the test, I made

initial contact with the prospective participants, including both examiners and candidates,

to gain preliminary information about the test in order to prepare for collecting primary

data. During the test, I implemented test room observations and recorded speech samples of

the speaking test sessions. After the test, I presented separate surveys to the teachers and

students who had completed their performances. I conducted face-to-face interviews with

individual examiners and focus group interviews with candidates to obtain more in-depth

information about the test they had participated in. To examine the relevance of the test

content, I invited EFL experts to make judgements on the relationship between the oral test

questions, the course objectives and the contents covered in the course books.

My study was based on a socio-cognitive framework for validating speaking tests

(Weir, 2005) that allowed me to use various types of data to learn about different testing

aspects in the context of Vietnamese education (Figure 2.3). The model was appropriate to

my multi-stage data collection and useful for my presentation of study results concentrating

on areas that the research questions aimed to investigate: test takers’ oral performance, test

raters’ scoring performance, and the test itself.


In this chapter, I present the results from my analysis of test taker characteristics as

a starting point from which to examine the context in which the oral assessment operated,

i.e. how the test was administered, what was tested, and how it was tested (Figure 3.4).

These subject variables have a direct influence on the way individuals process and perform

speaking tasks that have been “constructed (by test writers) with the overall test population

and the target use situation clearly in mind” (Weir, 2005, p. 51). I then focus on discussing

the evidence to answer to the Research Question 1a: How is oral English assessment

administered across Vietnamese universities? Data used for my analysis included “open-

ended, firsthand information” (Creswell, 2005, p. 211) which I collected by observing

different people and activities in their test rooms. Further, the integration of data gathered

from multiple sources enabled the creation of a comprehensive picture of what happened at

the test venues and the stakeholders’ perceptions of the methods of test administration. In

this section, I present common features in organising oral assessment across the institutions

and specific features characterised for each. My comparison of varied procedures applied to

the same kind of test will be followed by a critical analysis of candidates’ and raters’

perceptions of the speaking test administration they had experienced. This chapter

concludes with a summary of key points associated with test taker characteristics and

school-based test administration.

4.2 Test taker characteristics

Test takers (in ESL/EFL) come to the test setting with certain personal

attributes or background characteristics that may have a critical influence

on their performance in the tests, in addition to the influence exerted by

their language abilities. (Kunnan, 1995, p. 6)

Communicative language testing may involve using all language skills, i.e. listening,

speaking, reading, and writing, with an emphasis on the “ability to communicate

functions/notions and perform tasks with language” (Brown, 2005, p. 19). Test taker

characteristics are not part of the language ability to be assessed, but are individual

attributes that might have influential effects on candidates’ test performance (Bachman &

Palmer, 2010). As shown in Figure 3.4, these characteristics may influence the way

speaking test tasks are developed, and how the candidates process information sources


(cognitive validity) to perform the test tasks as required. In practice, even for the same

question or topic, no two speaking test sessions with different test takers could unfold in exactly the same way, because:

Each test taker is likely to approach the test and the tasks it requires from a slightly

different, subjective perspective, and to adopt slightly different, subjective strategies

for completing those tasks. These differences among test takers further complicate

the tasks of designing tests and interpreting test scores. (Bachman, 1990, p. 38)

Insufficient consideration of test taker characteristics might result in biased

judgements “towards or against particular groups or individuals” (O’Sullivan & Green,

2011, p. 36). Test designers need to pay careful attention to the nature and possible

influence of test taker characteristics on L2 performance so that ratings (and/or awarded

test scores) represent the candidate’s language ability, and are not affected by test-taker

characteristics (Kunnan, 1995).

Test takers (candidates) are direct subjects whose oral performance is required for

assessment in a testing situation (see Figure 2.3). There exists a relationship between test

taker characteristics and L2 speaking test performance that affects the validity of test score

interpretations (Huang, Hung, & Hong, 2016). In this section, I will examine test taker

characteristics in three main categories: physical/physiological characteristics,

psychological characteristics, and experiential characteristics (O’Sullivan, 2002).

Physical/physiological characteristics consist of biological features, e.g. age, gender, and

physical or physiological features, e.g. short-term ailments, long-term illnesses or

disabilities. Psychological characteristics focus on oral test anxiety denoting an “emotional

state or condition characterized by feelings of tension and apprehension and heightened

autonomic nervous system activity” (Spielberger, 1972, p. 24). Experiential characteristics

relate to external influences, including candidates’ EFL education, exam preparedness, and

class attendance. All of the elements might have potential impact on oral test performance

and should be taken into account in assessing language learners’ achievement. Data for the

analysis derive from the questionnaire survey and focus group interviews with the

candidates after they finished their institutional speaking examination.

(1) Physical/physiological characteristics


Test takers were the largest group of participants in my study. They were all second-

year EFL majors who had actual experience with oral testing. My analysis focuses on the

candidate samples’ physical/physiological characteristics in terms of age groups, gender,

and other physical conditions that might affect oral production or cause speech difficulties.

Figure 4.1 Age groups of test takers taking the oral test

Figure 4.1 illustrates that most of the candidates were between 18 and 22 years old.

The 20 to 22 years age group accounted for 45.7% at University A, whereas it was 65.7%

and 88.2% at Universities B and C respectively. The 23 to 25 years or over age group was

very small at Universities B (4.4%) and C (2.7%), and University A had no students who

were over 25 years of age. Unremarkable differences in the candidates’ age groups indicate

that the young learners shared more or less similar experience in schooling, social life, and

personal interests. Most paired candidates were classmates of the same intake year.

Candidates’ close acquaintainship and similar ages facilitated their performance in

interactive speaking tasks, e.g. paired discussions, because they were of an equivalent

position, and did not feel their partner was superior to them.

As shown in Table 4.1, there was a total of 352 EFL students participating in the

study. University B had the most test takers (137). Universities A and C had fewer test

takers, 105 and 110 students respectively. The total number of female student participants

tripled that of males (75% compared with 25%). This imbalance in gender of Vietnamese


learners of English was similar across the three institutions. Recent statistical results from

the English Proficiency Index (EPI) by Education First (EF) indicate that Vietnamese

females scored higher in TOEFL iBT and spoke English better than Vietnamese males

(VTC1, 2016; Thanh Binh, 2015). A recent study of university students demonstrated the effect of test-taker gender and topic on task performance. There was evidence that certain topics were more advantageous to candidates of one gender than to those of the other. For example, 'entertainment/horse racing' and 'leisure/places to visit' favoured males more than females; conversely, 'office/computers' and 'airport meeting' brought more advantages to females than males. However, task difficulty cannot be classified simply in terms of gendered differences in interests, because tasks are more likely to exert a differential influence at the level of individuals than at the level of groups (Lumley & O'Sullivan, 2005).

Table 4.1 Test takers' gender profile

Gender | University A (F, %) | University B (F, %) | University C (F, %) | Total (F, %)
Male | 29, 27.6 | 31, 22.6 | 28, 25.5 | 88, 25
Female | 76, 72.4 | 106, 77.4 | 82, 74.5 | 264, 75
Total | 105, 100 | 137, 100 | 110, 100 | 352, 100

Note. F = frequency

Candidates were all second-year English majors. They had spent at least three

semesters of learning English Listening-Speaking skills at tertiary level. Figure 4.2 shows

that most of the candidates' self-assessments were distributed between the 'Acceptable' and 'Fairly good' categories in terms of general performance, accuracy, and fluency. The 'Fairly good' category accounted for about 30% to 40% at Universities A and C, whereas it was only about 20% at University B. In general, candidates were slightly more confident

with their accuracy than fluency when self-evaluating their English proficiency. This result

echoed the previous findings from a larger scale study on English competence of

undergraduates in HCMC, namely that teaching and learning English at tertiary level attached

importance to vocabulary, grammar, writing, and reading comprehension, without

emphasising listening and speaking skills (Vu & Nguyen, 2004).


Figure 4.2 Candidate self-evaluation of English proficiency in terms of general performance,

accuracy, and fluency

[Figure 4.2 comprises three bar charts, one per university (A, B, and C), showing the percentage of candidates who rated themselves in each category from 'Very poor' to 'Very good' for general performance, accuracy, and fluency.]

Some candidates reported in the questionnaire survey that they suffered health

problems on the test date. A number of the candidates (11.4%) admitted having some kind of long-term illness or disability that adversely affected their speaking. A further 24.7% suffered some kind of short-term ailment during the test, such as an unpredictable toothache, earache, sore throat, cold, or flu. Again, the proportion of female candidates with long-term illnesses or disabilities (7.7%) was more than twice that of male candidates (3.7%). The proportion of female candidates with short-term ailments (20.2%) was more than four times that of male candidates (4.5%).

Evidence from interviews with candidates supported these results. When I asked

whether there were some external factors that they thought may have influenced their oral

performance (Question 2, Appendix B.3b), some test takers reported that they had had to

endure some unexpected illnesses caused by the weather or the psychological impact of

examinations. The following quote is an example:

I think I was affected by quite many factors. The most influential factor to me was

my health condition. Every coming exam makes me so stressed that I have a

stomach ache. Therefore, I have been affected quite a lot by this condition in

important exams. (U1G1.S2)

My field notes from test room observations did not record any cases of long-term

disabilities in speaking test sessions. However, some students from my interviews

complained about health issues that they permanently suffered from. They felt

uncomfortable having to cope with these difficulties while making every effort to complete

their speaking sessions. A candidate reported her problem in an interview:

My speaking was also hindered by my health condition. For example, I usually feel

dizzy whenever I suddenly stand up or sit down. I find it takes me some amount of

time then. I was tired, so at that time I did not do my part very well (U3G2.S1).

My analysis of observational data revealed that no abnormal cases were reported to examiners or exam administrators on the test days. The candidates did not know that they should have informed test administrators of their situation prior to the testing event, nor


were there any precautions or guidelines for handling physical issues. Lack of attention to

the possibility of unexpected incidents in the test period led to a lack of preparation for such

situations. There were no guidelines for special occurrences or to allow for adjustments of task delivery for less able candidates. Unclear administrative regulations meant there was the potential for improvised solutions to unexpected happenings, thereby introducing the

possibility of inconsistency in test results. Literature on operating procedures for oral

testing suggests that administrative documentation should be developed and reviewed by a

team-in-charge to ensure that tests take place “in a standardised manner” (Fulcher, 2003, p.

166). For short-term ailments, test takers should be treated with “a supportive attitude”, e.g.

they are allowed to take the speaking test earlier or later than the normal order, or have a

candidate partner uninvolved in the exam to perform a paired speaking task. For long-term

disabilities, test takers may apply for special arrangements, e.g. additional time, adapted

visual material, single instead of paired format speaking tests, etc. (O’Sullivan & Green,

2011, pp. 46-47).

(2) Psychological characteristics

Psychological characteristics are inherent attributes of test takers, and more difficult

to anticipate than physical characteristics in testing situations. Testing is a cause of anxiety

to test takers (Cassady & Johnson, 2002; McDonald, 2001). Test anxiety is one of the

psychological components in language learning that "has a debilitating effect on the oral

performance of speakers of English as a second language” (Woodrow, 2006, p. 308). It has

potential impact not only on the candidates themselves, but also on the partners if in a

paired or group test format. According to my survey results, the majority of test takers admitted that they felt nervous (65.1%) or very nervous (25%) about the speaking test.

My further descriptive statistics of this situation (Table 4.2) reveal similar results across the

three institutions in that candidates who were at ease during the speaking test, i.e. ‘not

nervous at all’, always accounted for a much smaller percentage at each university in

comparison with those who were worried about the test, i.e. ‘nervous’ and ‘very nervous’.


Table 4.2 Speaking test anxiety across institutions

Anxiety level | University A (F, %) | University B (F, %) | University C (F, %) | Total (F, %)
Very nervous | 23, 21.9 | 42, 30.7 | 23, 20.9 | 88, 25
Nervous | 67, 63.8 | 85, 62 | 77, 70 | 229, 65.1
Not nervous at all | 15, 14.3 | 10, 7.3 | 10, 9.1 | 35, 9.9
Total | 105, 100 | 137, 100 | 110, 100 | 352, 100

Note. F = frequency

Candidates’ anxiety came from both internal and external factors. Test takers were

anxious about the test because of a lack of proper preparation, fear of making mistakes, or

lack of confidence and psychological stability. Many others suffered from test score

pressure, impact from the testing atmosphere (e.g. noise, test tasks, fellow candidates, etc.)

and from the examiner or the interlocutor they were engaged in face-to-face interaction

with. When I asked a group of candidates whether there were any external elements they

thought might have adversely affected their oral performance, I received the answer:

I am nervous by nature. But when at my turn in the oral test, I saw (an examiner

with) an austere face. That made me even more nervous, and forget all about what I

was going to say. (U1G1.S3)

Sharing with me her experience of speaking exams, another test taker compared

differences in examiners’ behaviour that brought about different effects, both positive and

negative, on her oral production:

For me, the examiner’s behaviour was a primary factor in enabling me to enhance

my performance. For example, in the last (speaking) exam, the male examiner had a

very comfortable attitude to me. He was a bit serious, but he created for me a

feeling that I was encouraged to respond at my best. I felt very confident because I

saw him smile all the time, so I was more inspired to speak and express more

opinions. If I meet with an examiner who is too serious, gloomy, or usually making

faces, then I lose my confidence, and I keep thinking in my mind that I am speaking

incorrectly, or doing something wrong, which makes my ideas dissipate completely

(U3G2.S1).


There was a noticeable association between students’ self-evaluation of English

proficiency and the degree of oral test anxiety. A comparison of the means of candidates’

anxiety across the categories of students’ self-evaluation, ranging from ‘Very poor’ to ‘Very good’ (Table 4.3), indicates that test-takers’ anxiety increased towards the lower end of students’ general self-judgement of their English proficiency. For test-takers who evaluated

themselves ‘Fairly good’, ‘Good’, and ‘Very good’, the total mean values out of a possible

3.0 were lower at 2.02, 1.69, and 1.5 respectively. For those whose self-evaluation was

lower at ‘Acceptable’ (accounting for the majority of the students), ‘Poor’, or ‘Very poor’,

the means were higher at 2.31, 2.64, and 2.75 respectively.

A closer look at this aspect across the three institutions reveals the same pattern: the weaker the test-takers were at English, the more anxious they were about oral English testing. This result echoes the findings of a previous study on undergraduate EFL students at different proficiency levels: test-takers with higher English language proficiency tended to be less anxious in oral test taking than those with lower proficiency (Liu, 2007).

About 80% of the respondents in the candidate survey suggested examiners should

be friendlier, and smile sometimes to reduce test takers’ anxiety. Others expected that

neutral behaviour from examiners would help them keep calm to give a better speaking task

performance. A typical example of suggestions was in a candidate’s words: “Possibly the

examiners just need to have a normal look at the candidate without having to smile, but do

not frown or shake heads” (U2G2.S1).


Table 4.3 Candidates’ oral test anxiety in relation to general self-evaluation of English

General self-evaluation              University A   University B   University C   Total
Very good       Valid                      1              1              0             2
                Missing                    0              0              0             0
                Mean                    1.00           2.00              0          1.50
Good            Valid                     18              9             21            48
                Missing                    0              0              0             0
                Mean                    1.50           1.67           1.86          1.69
Fairly good     Valid                     46             30             40           116
                Missing                    0              0              0             0
                Mean                    2.02           1.97           2.05          2.02
Average         Valid                     36             80             44           160
                Missing                    0              0              0             0
                Mean                    2.42           2.33           2.20          2.31
Poor            Valid                      3             14              5            22
                Missing                    0              0              0             0
                Mean                    2.33           2.57           3.00          2.64
Very poor       Valid                      1              3              0             4
                Missing                    0              0              0             0
                Mean                    3.00           2.67              0          2.75
Total           Valid                    105            137            110           352
                Missing                    0              0              0             0
                Mean                    2.04           2.23           2.12          2.14
Note. 1 = Not nervous at all   2 = Nervous   3 = Very nervous
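The group means in Table 4.3 are simple averages of the anxiety coding (1–3) within each self-evaluation category and institution. The following sketch, again with invented data and assumed column names rather than the study’s own analysis script, shows one way such grouped means could be computed.

    import pandas as pd

    # Illustrative only: anxiety coded 1-3 as in the table note
    # (1 = Not nervous at all, 2 = Nervous, 3 = Very nervous).
    df = pd.DataFrame({
        "university": ["A", "A", "B", "B", "C", "C"],
        "self_evaluation": ["Good", "Average", "Average", "Poor", "Good", "Average"],
        "anxiety_score": [1, 2, 2, 3, 2, 2],
    })

    # Valid N and mean anxiety per self-evaluation level and institution (cf. Table 4.3)
    summary = (df.groupby(["self_evaluation", "university"])["anxiety_score"]
                 .agg(["count", "mean"])
                 .rename(columns={"count": "Valid", "mean": "Mean"}))
    print(summary.round(2))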

Test takers’ anxiety might be caused by a lack of discipline in test administration.

Not separating candidates who had finished their test sessions from those waiting for their

turns inside or outside the test room allowed communication between them that might have

increased test anxiety for those who had not yet taken the test, as a candidate reported:

On the test day, some students who had finished their speaking session went out and

said this question was difficult, or that question was difficult. Their saying such

things put psychological pressure on me. I thought it (sitting the oral test) was


frightening. I wondered if I could answer such questions or not. I advised myself to

review (my answers to the questions). So instead of entering the test room with

confidence, I was agitated and very uncomfortable. (U3G2.S3)

Two of the institutions assigned EFL teachers who had not been directly in charge

of the Listening-Speaking classes to be oral examiners. The main purpose of the Faculties’

arrangement for this switch was to increase objectivity in scoring and re-align a responsible

attitude towards teaching and learning Speaking skills (U1.T1, U2.T3). However,

employing raters other than the familiar teachers might have caused candidates’ anxiety.

Candidates understood that raters were different in their severity. New raters usually made

the candidates curious, and prompted them to question each other prior to the test about

who the examiners were in charge of their test room. Hearing rumours that some teacher

was harsh also made test takers nervous (U1G1.S4).

(3) Experiential characteristics

I examined test takers’ experiential characteristics in terms of the English education

the candidates had received, their oral examination experience since they had started

learning English, and their amount of class attendance as their speaking test preparedness.

There was a mutual dependence among these characteristics as students’ oral testing

experience depended upon how long they had been exposed to English education. Their

English learning styles and attitude may have affected their class attendance in preparation

for the end-of-course examination.

Table 4.4 Test takers’ English education prior to the oral test

                              University A     University B     University C     Total
                               F      %         F      %         F      %         F      %
7 years or more               84     80        95     69.3      88     80        267    75.9
4 to less than 7 years        20     19        30     21.9      14     12.7       64    18.2
1 to less than 4 years         1      1        12      8.8       8      7.3       21     6.0
Total                        105    100       137    100       110    100        352   100
Note. F = frequency


These English majors came from different learning backgrounds because the tertiary

institutions were open to all high school graduates throughout the country. Overall, the

majority of the candidates (76%) had been learning English for at least 7 years (Table 4.4).

Some of them (18%) had spent between four and seven years learning English before

going to university. A very small number of students (6%) had less than 4 years of EFL

study. None of them were beginner-level students of English, i.e. having learnt English for

less than one year. Table 4.4 shows that Universities A and C had the highest percentage

(80%) of students having at least seven years of English study. University B had the highest

proportion (8.8%) of students who had studied English between one and less than four

years. In general, University B’s students had slightly shorter lengths of English-learning

time compared with their contemporaries at the other institutions.

Discrepancy in candidates’ experiential characteristics resulted from years of English

study at high schools without formal oral examinations, as discussed in Chapter One of this

thesis. Vietnam’s domestic press has remarked on the significant difference in the quality of

English language instruction across localities in Vietnam (Hoang Huong, 2015).

Vietnamese students in big cities have more exposure to English at school, and at earlier ages, than students in other regions. English-specialised schools, usually located in cities, may schedule up

to 14 periods of English per week (Nunan, 2003). Students wishing to improve English

communication skills have to join English speaking clubs and/or enrol in extra classes at

private centres for foreign languages (Nguyet Ha, 2015; X3English, 2017; Quality Training

Solutions, n.d.). The situation of university students having to join English classes at

language centres has raised a question regarding the quality of compulsory English courses

included in tertiary training programmes. The teaching of English is “an enormous waste of

time and energy of both teachers and learners” when it is sufficient in quantity but not

corresponding in quality (Vu & Nguyen, 2004).

Most of the candidates had experience with English speaking tests, at least since the

commencement of their tertiary education. Figure 4.3, showing a cross-institutional

comparison of candidates’ experience with oral testing, indicates that University A had the

most candidates (66.7%) who had taken oral tests at least four times, i.e. these students

experienced speaking tests even before going to university. University C’s students held the


second-highest proportion of test takers who had experienced oral English assessment four times or more. University B had the fewest English majors (37.2%) with that much oral testing experience. The largest group of University B’s test takers (43.8%) had taken oral tests only three times since entering university. This

group accounted for about one-fourth of candidates at the other institutions, 29.5% at

University A and 24.5% at University C.

Figure 4.3. Candidates’ experience with oral testing (4 times or more / three times / twice / once: University A 66.7%, 29.5%, 1.9%, 1.9%; University B 37.2%, 43.8%, 11.7%, 7.3%; University C 60.9%, 24.5%, 6.4%, 8.2%)

Class attendance at university level was quite flexible in that students were not

required to have full attendance (100%) during the course. One institution applied a penalty

policy that “students who are absent more than 20% of class meetings will be forbidden

from taking the final (speaking) examination” (Appendix B.4a). The other two institutions

encouraged (and obliged) class attendance not via presence check but by awarding bonus

points for active participation in classroom activities (Appendix B.4b), or giving frequent

mini-tests according to the prescribed schedule (Appendix B.4c). As can be seen from

Table 4.5, none of the institutions achieved 100% class attendance. University A obtained

the highest percentage (30.5%) of students’ class attendance of more than 90%, which was

far higher than that of Universities B and C. Most students’ class attendance was from 50%

to less than 90%. University B had the highest percentage (5.1%) of students whose class

attendance was less than 10% of the programme, whereas this category was lower at the


other institutions (only 2.9% and 2.7% at Universities A and C respectively). Missing a lesson during the course might have meant missing opportunities to practise vocabulary or language skills, as well as the teacher’s guidance on achieving the best performance in the end-of-course speaking examination.

Table 4.5 Test takers’ profile of English-speaking class attendance

Class attendance          University A     University B     University C     Total
                           F      %         F      %         F      %         F      %
90% or more               32     30.5      16     11.7      15     13.6       63    17.9
70% to <90%               37     35.2      21     15.3      27     24.5       85    24.1
50% to <70%               22     21.0      41     29.9      35     31.8       98    27.8
30% to <50%                8      7.6      25     18.2      25     22.7       58    16.5
10% to <30%                3      2.9      27     19.7       5      4.5       35     9.9
<10%                       3      2.9       7      5.1       3      2.7       13     3.7
Total                    105    100       137    100       110    100        352   100
Note. F = frequency

Supplying students with all the speaking test questions was another possible reason

for low class attendance. Students attached less importance to attending class to learn and practise core skills for the end-of-course speaking test, investing only enough effort to cope with the exams, and tended not to review materials sufficiently or prepare carefully for the test event. A test taker expressed her concern that, while she was revising the test questions before the test, many of her classmates asked “What is the exam testing today? Which questions did the teacher ask?” (U3G2.S3).

Students’ inadequate preparation for school-based assessment could affect their academic

achievement, although they may obtain a Pass score for the subject. Further aspects of

providing test questions prior to the actual testing are covered in section 4.3, which

discusses oral exam administration.

The more regular candidates’ attendance was in Speaking classes, the less anxious

they felt in their oral test across institutions, and in the entire student sample. As shown in

Table 4.6, the highest level of test anxiety was found in the groups of students who

attended only between 10% and less than 50% of the course lessons, varying from 2.33 to


2.51. Candidates tended to be less anxious when they attended more classes. For example,

the level of test anxiety was 2.08 for those attending between 50% and less than 70% of classes, and 2.04 for those attending between 70% and less than 90% of classes. Candidates attending 90% of classes or more were the least nervous (1.98) in their oral assessment.

Table 4.6 Candidates’ oral test anxiety in relation to class attendance

Class attendance              University A   University B   University C   Total
90% or more     Valid               32             16             15           63
                Missing              0              0              0            0
                Mean              1.91           2.00           2.13         1.98
70% to <90%     Valid               37             21             27           85
                Missing              0              0              0            0
                Mean              2.00           1.95           2.15         2.04
50% to <70%     Valid               22             41             35           98
                Missing              0              0              0            0
                Mean              2.09           2.17           1.97         2.08
30% to <50%     Valid                8             25             25           58
                Missing              0              0              0            0
                Mean              2.25           2.52           2.16         2.33
10% to <30%     Valid                3             27              5           35
                Missing              0              0              0            0
                Mean              2.67           2.48           2.60         2.51
less than 10%   Valid                3              7              3           13
                Missing              0              0              0            0
                Mean              2.33           2.00           2.33         2.15
Total           Valid              105            137            110          352
                Missing              0              0              0            0
                Mean              2.04           2.23           2.12         2.14
Note. 1 = Not nervous at all   2 = Nervous   3 = Very nervous
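Because the attendance bands in Table 4.6 are ordered, the attendance–anxiety pattern described above can be inspected by treating the bands as an ordered categorical variable. The sketch below uses invented data and assumed column names, and the rank correlation at the end is added purely as an illustrative check rather than an analysis reported in this thesis.

    import pandas as pd
    from scipy.stats import spearmanr

    # Illustrative only: attendance bands as in Tables 4.5/4.6, anxiety coded 1-3.
    bands = ["<10%", "10% to <30%", "30% to <50%",
             "50% to <70%", "70% to <90%", "90% or more"]
    df = pd.DataFrame({
        "attendance": pd.Categorical(
            ["90% or more", "70% to <90%", "30% to <50%", "10% to <30%", "50% to <70%"],
            categories=bands, ordered=True),
        "anxiety_score": [2, 2, 3, 3, 2],
    })

    # Mean anxiety per attendance band, from lowest to highest attendance
    print(df.groupby("attendance", observed=False)["anxiety_score"].mean())

    # Rank correlation between band order and anxiety: a negative value would
    # indicate that higher attendance goes with lower anxiety (illustrative check only)
    rho, p_value = spearmanr(df["attendance"].cat.codes, df["anxiety_score"])
    print(round(rho, 2), round(p_value, 3))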

In this section, I have presented characteristics of the candidate samples participating

in the oral test. I examined these features in terms of test takers’ physical/physiological

characteristics, psychological matters while taking the test, and experiential characteristics

which had a potential impact on their English speaking skills performance in a testing


context. Language testing researchers suggest that “for a successful testing programme it is

important both that test developers understand the nature of the test takers and that test

takers have a good appreciation of the content and purpose of the test” (O’Sullivan &

Green, 2011, p. 63).

The data about test takers presented in this section have demonstrated a type of

reliability evidence that should be given greater consideration together with rater reliability

factors (to be analysed in Chapter Seven) in test design, administration, and scoring

process. My analysis suggests that candidates varied in general English background. Most

reported that they felt they had an English level between ‘Acceptable’ and ‘Fairly good’. The

weaker candidates were at English, the more nervous they felt in the oral examination. The

majority of the candidates had been learning English for at least 7 years since secondary

high school. However, about one-fourth of them only had experience with English speaking

tests at university. Many candidates admitted that they suffered from physical issues that

they thought may have affected their oral performance. Test administrators were neither informed of these personal circumstances nor did they have special arrangements in place to handle unexpected incidents during the test event. In the next section, I discuss the results

associated with test administration across the institutions in order to answer the first

research question.

4.3 Speaking test administration across institutions

My purpose in this section is to examine the context validity of the oral test, i.e. the setting

in which the oral assessment was administered. This section aims to answer Research

Question 1a: How is oral English assessment administered across Vietnamese universities?

I consider similarities and differences, strengths as well as shortcomings in test

administration across the universities in light of factors causing inconsistencies or threats to

fairness in language assessment, which is a central concern not only among test designers

but also test users (Kunnan, 2000; Gipps & Stobart, 2009). I collected data from test rooms

to learn about the physical conditions and uniformity of administration. After the test, I

investigated the markers’ and the candidates’ perceptions of, and opinions about, how the

test was administered by conducting questionnaire surveys and direct interviews with these

stakeholders.


My field notes from observational protocols indicate that methods of oral test

administration were similar in various aspects. However, there were remarkable differences

in rating conditions and the operational practices of testing procedures. Table 4.7 shows

aspects these institutions shared in common and other specific features in their test

administration.

Table 4.7 Comparing test administration methods across institutions

Common features (all three institutions)
• School-based assessment using available facilities and human resources
• Direct assessment (live scoring)
• No audio- or video-recording; raters decided scores for individuals right after each speaking test session
• No rater training or meeting prior to the test for discussing what the assessment criteria were or how the raters would score and use the rating scales
• No feedback on candidates’ oral performance after the test

Particular features (Interlocutor; Usher; Technology – PowerPoint slides; Air-conditioned; Test questions informed in advance; Timing for each speaking session)
University A: ✓, varied        University B: ✓ ✓, varied        University C: ✓ ✓, varied

The institutions administered oral assessment on campus using the facilities and

human resources available for EFL education. Test rooms were familiar classrooms where

candidates had daily lessons during semesters. Everyday chairs, benches, and desks were

kept almost in the same position except for a slight rearrangement of the seats and desks

depending on the number of raters and candidates involved in each speaking session – four

seats for two raters and two candidates. The seating arrangement was similar across test

rooms in each institution but differed across institutions as a result of different test formats.

In general, evidence from my observations suggests that physical conditions such as light,

space, desk and chair arrangements were appropriate for a one-on-one format (University


C), and a paired format (Universities A and B). The former enabled face-to-face interaction

between the rater (also examiner/interlocutor) and one candidate. The latter facilitated

conversations within a group of two raters and two candidates. A side-by-side configuration

for paired candidates helped to encourage attention to the examiner (as a third party), and

receive equal attention from the examiner (Goodwin, 1981). When there was a distinction

in the roles of paired examiners (University B) as an interlocutor (Examiner 1) and an

assessor (Examiner 2), the room set up followed the instructions by University of

Cambridge Local Examinations Syndicate (UCLES):

The assessor should sit at a suitable distance from the table so that he/she can

clearly hear the candidates and interlocutor, but far enough away so that it is

obvious to the candidates that he/she will take no part in the interaction. The

assessor should be able to see the candidates and be visible to them. (UCLES, 1997,

pp. 6-7)

This positional arrangement focused candidates’ attention more on the interlocutor

and reduced their anxiety of having to deal with both of the examiners. However, an

unnecessary side-by-side setting of paired examiners (one of whom did not join in the

conversation) may have increased candidates’ nervousness, distracted their attention, and

confused them as they did not know which examiner would interact with them, or if both

might (University A). Figure 4.4 visualises positions of oral candidates and

interlocutors/examiners in test rooms at the three institutions. Arrows represent interactions

to both deliver and assess tasks.


University A University B University C

Legend: assessor/examiner desk

candidate interaction to deliver tasks

interlocutor/examiner Task 1’s interaction to be assessed

Task 2’s interaction to be assessed

Figure 4.4 Seat arrangements for oral assessment at different institutions

As illustrated in Figure 4.4, the seating arrangements facilitated candidates’

generation of live oral performance in different patterns of interactions. Universities A and

B involved four participants each session: one pair of examiners and one pair of candidates.

Task delivery was similar in that one of the examiners delivered tasks to candidates one

after the other. However, candidate-candidate interaction occurred only in University A’s

second task (the first was a monologic task) whereas each of University B’s speaking test

sessions enabled both interactive formats: examiner-candidate (Task 1) and candidate-

candidate (Task 2). University C involved only two participants (excluding the ‘guest’

rater) in an examiner-candidate interview format from beginning to end. Adopting the

paired format enabled a more life-like context for communicative language testing and

helped Universities A and B to assess large numbers of candidates in test rooms.

The test rooms were spacious enough to accommodate the examiner(s) and the

examinee(s) participating in the test. Because the rooms had more space than needed,

examiners in one test room allowed other candidates to wait for their turns at the back of

the test room. Test participants could hear some noise from the waiting candidates since no-

one was managing them. These classrooms were otherwise quite suitable for speaking tests

as they were located upstairs and were not affected by traffic noise or unwanted passers-by.


Each institution nominated its own EFL teachers, either official or guest, to be oral raters. This way of test administration was practical and economical. More importantly, the raters took their rating and scoring duties seriously, as these were part of their job requirements and contributed to the quality of the institution they were working for. I will discuss institutional nomination and arrangement of oral raters in detail in

Chapter Seven.

The institutions adopted a form of direct assessment that enabled “two-way

exchange tasks” and the use of “a wide range of language functions” such as describing,

comparing, explaining, analysing, discussing, giving opinions, etc. (O'Loughlin, 2001, p.

38). Raters made evaluations based on the live performance of candidates. Both raters and

candidates perceived direct speaking tests to have more advantages than disadvantages.

Responses from interviews with raters about differences between direct rating of live oral

performance and semi-direct rating of recordings indicate that the advantages of live

scoring outweigh its disadvantages. I found similar results with groups of candidate

interviewees discussing their preferences of two formats: speaking tests with a human rater

or interacting with a computer screen. Table 4.8 summarises advantages and disadvantages

of direct assessment from the raters’ and candidates’ views.

Table 4.8 Advantages and disadvantages of direct assessment

For raters
  Advantages:
  • Better quality of hearing the oral production
  • Candidates’ mouth shape helps to guess the words if some are not comprehensible
  • More mental concentration helps scoring more accurately (as candidate’s performance only happens once and cannot be replayed)
  • Can elicit speaking samples in ways that serve evaluation purposes
  • Can provide weak candidates with necessary assistance in time
  Disadvantages:
  • Candidates’ appearance and interaction may cause bias in judgements
  • More tiring as more concentration is needed
  • Affected by co-raters’ opinions

For candidates
  Advantages:
  • Better performance when speaking naturally
  • More active in communication
  • Candidates are trained with communication skills besides language skills
  • Raters’ appropriate gestures may motivate candidates to speak more
  Disadvantages:
  • Raters’ presence may make candidates nervous or stressed
  • Affected by the partner in co-constructed speaking tasks


Many raters had experience with semi-direct assessment due to having worked with

other institutions in English language testing and assessment. The following extract comes

from a rater interviewee when she compared direct and semi-direct assessment, expressing

her preference for live (direct) scoring:

Actually, marking from an audio-recording is not as good as direct marking where

initially we can see their (candidates’) facial expressions, then we can do the rating

better than scoring an audio file. Sometimes it is difficult to completely understand

what a candidate says in an audio file. However, we can guess the words a candidate

is speaking when we look at his/her mouth shape. (U2.R1)

More information about candidates’ and raters’ opinions about computer-based oral

testing and scoring is presented in the next section. Many Vietnamese university teachers

and students have had experience with this form of language assessment outside their

institutions.

None of the institutions had their speaking tests audio-recorded as evidence of

candidates’ (and examiners’) performances. Oral tests without recording could help to save

time in the preparation stage, but might include potential risks if unexpected incidents

should happen and affect the test results. Raters, either individually or in pairs, were the

ones who played the decisive role in marking candidates’ performance. No evidence was

kept except for the candidates’ signatures on the official scoring sheet certifying their

attendance at the test event. Rating without audio-recording reduced transparency, and

facilitated subjective measurement errors in oral scoring. No recordings made candidates’

queries or complaints about test scores impossible to resolve. Sound recording would help

to align oral assessment with other forms of written tests in which candidates’ performance

is retained as proof of both the candidate’s participation in the test and the rater’s

evaluation of that performance. The majority of rater participants in interviews (85%)

supported the recording of oral examinations as they believed it to be beneficial to

language teaching, learning, testing, and research. Table 4.9 summarises reasons provided

by raters for and against recording oral test in the scope of institutional assessment.


Table 4.9 Raters’ opinions for and against audio-recording oral tests

For
• Keeping evidence of test performance as for other school subjects
• Encouraging candidates to speak more clearly, and so better, as they know their voices are being recorded
• Making raters perform their rating tasks with stronger sense of responsibility
• Using speech sample recordings for rater training and oral language testing research
• Enabling score revision or error correction in scoring when necessary

Against
• Time-consuming for preparation
• Possible technical problems
• Issues with storing and safeguarding the audio-files
• Unnecessary as double rating (with two raters) was sufficient to produce reliable scores

If the recording was available after the test, some teachers could make use of it to

provide feedback to candidates when requested. A rater’s opinion in the following example

illustrates how sound recordings could be useful in giving feedback to help candidates

improve their test performance:

In my opinion, if we want to give students feedback on what they have achieved or

have not achieved, I think we should have the oral test recorded. Listening to the

recordings can help students to understand more thoroughly and recall more clearly

what they did in the test… Their lack of confidence will be realised via that

recording, so they will make necessary adjustments. (U2.T2)

Examiners did not have sufficient discussion of the assessment criteria and rating

scales prior to embarking on scoring. At one institution, raters came to meet the secretary at

the teacher lounge to collect the testing materials just minutes prior to the test time and then

went to the assigned test rooms. The starting time was not necessarily the same across the

test rooms but was flexible as a few raters arrived about 3 to 5 minutes later than the

predetermined time for the oral test to start. At another institution, there was a short

meeting (approximately 5 minutes) in which the Head of the Listening-Speaking teacher group

explained how to use the testing materials to deliver tasks to candidates in pairs, and how to

write down the awarded scores onto the personal draft scoring sheet. At the other

institution, the test was scheduled as the final lesson of each class with only one rater who

was teaching the class. The test days were different across Listening-Speaking classes.


There were no institutional meetings scheduled for raters to discuss and agree on

how the general criteria should be interpreted, how many questions each candidate would

answer, or what the examiner would do if a candidate did not understand an oral question,

etc. Teacher raters made all the decisions on timing, scoring, and reporting test results. For

this important role, training oral examiners needs to be scheduled before the start of every

speaking test rather than afterwards (Alderson, Clapham, & Wall, 1995, p. 114). A recent

study demonstrated that even though oral examiners shared the same general understanding

of the constructs and criteria to be rated, there were some differences in how they valued

the relative importance of these constructs and criteria, e.g. content knowledge.

Furthermore, the findings pointed out a notable variability in rating outcomes when the

raters assessed spoken EFL without a common rating scale (Bøhn, 2015).
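As a minimal illustration of why a shared rating scale and pre-test rater meetings matter, the sketch below shows how the consistency of two raters’ scores for the same candidates might be examined. Rater reliability itself is analysed in Chapter Seven, and the scores here are invented, so this is not a result from the present study.

    import numpy as np
    from scipy.stats import pearsonr

    # Invented scores awarded by two raters to the same five candidates
    rater_1 = np.array([7.0, 6.5, 8.0, 5.5, 7.5])
    rater_2 = np.array([6.5, 6.0, 8.5, 5.0, 7.0])

    # The correlation shows whether the raters rank candidates similarly;
    # the mean absolute difference shows how far apart their scores sit.
    r, p_value = pearsonr(rater_1, rater_2)
    mean_abs_diff = np.mean(np.abs(rater_1 - rater_2))
    print(round(r, 2), round(mean_abs_diff, 2))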

Student participants were studying at the same phase in their training programme.

However, cross-institutional uniformity in oral test administration was difficult to obtain since each university established its own course objectives and adopted different testing

methods. Remarkable dissimilarities in test administration were reflected in allocating

human resources in testing operational practices (whether the interlocutor or the usher was

included), determining the weighting of oral assessment (how much the oral test weighed in

the total score of the Listening-Speaking subject), and preparing candidates for the test

(whether candidates were informed of the test questions for preparation at home prior to the

actual test).

In order to achieve a more comprehensive evaluation, one institution adopted a

double rating format: an interlocutor performed holistic (global) rating, and an assessor

performed analytic rating (Green & Hawkey, 2012; Fulcher, 2003). The interlocutor

generated candidates’ oral performance via task delivery, and awarded global scores.

Shared rating between the interlocutor and the assessor reduced the workload and made

oral assessment more focused on particular aspects. Paired raters’ role-switching after the

first half of the examining event (usually a 3 or 3.5 hour period) “allow[ed] both examiners

an equal opportunity to maintain their experience in both roles” (Taylor, 2011, p. 341).

Table 4.10 shows the first examiner (interlocutor) performed most of the procedural steps

for assessment. University B’s arrangement of examiners was different from the others’ in


that paired examiners performed two types of judgements – one used a global scale, and the

other used an analytic scale.

Interlocutor variability had direct influences on candidates’ performances as a result

of verbal and non-verbal communication with them during the test. The interlocutor

contributed to “the co-constructed nature of interaction” in CLT, but “could become a

potential threat to a test’s validity and fairness” (Galaczi & ffrench, 2011, p. 166).

Interlocutors’ different styles in interviews, e.g. making requests for elaboration, asking

closed questions, pausing to give time for the candidate’s responses, affected the

candidate’s performance, and thus the assessor’s perception and judgement of the

candidate’s speaking ability (Brown, 2003). Further analysis of examiner arrangement for

the interlocutor’s role in different scoring methods is presented in Chapter Seven.

Table 4.10 Examiners’ performance during an oral test session

                                                             University A    University B    University C
Examiner                                                      (1)    (2)      (1)    (2)      (1)    (2*)
Checking candidates’ ID                                        ✓               ✓               ✓
Delivering tasks                                               ✓               ✓               ✓
Interacting with candidates                                    ✓               ✓               ✓
Managing/maintaining/controlling interaction in test sessions  ✓               ✓               ✓
Observing and listening to candidates’ performance             ✓      ✓        ✓      ✓        ✓      ✓
Making judgements with reference to a global scale             ✓      ✓        ✓               ✓      ✓
Making judgements with reference to an analytic scale                                 ✓
Deciding scores based on judgements                            ✓      ✓        ✓      ✓        ✓      ✓
Note. * The second examiner (University C) was invited to be a ‘guest’ rater for this study and did not interfere with the normal procedure of oral assessment at the institution. The ‘guest’ rater’s scores were for the purpose of my study only and were not calculated in students’ academic achievement.

In my observation of the testing sites, I found the inclusion of an usher at one

institution very helpful in test administration. The usher was an office staff member who


took care of candidate arrangements during the test. The usher decided the pairing of

candidates, called the paired candidates to enter the test room, kept the lounge (waiting

room) in good order, and ensured those who had finished their speaking test were directed

far away from the testing sites so they did not make noise or reveal the contents of the test

to waiting candidates. By sharing such kinds of work, the usher helped examiners to

concentrate on their rating performance.

One of the institutions delivered all the test questions to students several weeks in

advance so that students had sufficient preparation time. The purpose of providing this pre-

testing information was to encourage students to practice their speaking skills intensively

after class time, and become more confident in the test room. Knowing the content for

testing did not guarantee candidates high scores but did facilitate better structured talk in

the test. A rater from this institution shared with me his perspective:

In my opinion, learning a language involves not only the ability to speak that

language, but also to know what to say. Therefore, knowledge plays a crucial part.

There were candidates who had already studied at an international school. They

spoke very well. However, their knowledge was not enough to incorporate in their

speaking. Though their pronunciation was very good, their insufficient knowledge

prevented them from speaking cohesively. They spoke very disorderly and did not

follow any structure, or it was because their understanding was not accurate or

updated in time. (U3.T2)

Well-prepared responses might not meet communicative test-setting requirements

for real life situations. Testees’ oral performances may be products of reciting from

memory rather than real communication. The conversational language becomes less

interesting and creative. To create situations that are as life-like as possible, “the students

should encounter unpredictable language input and be put in a position where they must

produce creative language output” (Brown, 2005, p. 21).

After the test, I conducted separate questionnaire surveys for examiners and

candidates to obtain a generalisation of these stakeholders’ opinions and perceptions about

the test. I scheduled face-to-face interviews with examiner and candidate representatives to

understand more about their individual thoughts in relation to the information provided in


the surveys. The following section is a convergent report on candidates’ and raters’

perceptions of the institutional oral test administration.

4.4 Candidates’ and raters’ perceptions of the oral test administration

EFL teachers and students were the stakeholders directly involved in the institutional oral

assessment. Their opinions and perceptions played an important part in examining the

effectiveness of test administration. Data from the two separate questionnaires for test

takers and raters helped to find out the overall evaluation of how the oral test was

administered at the tertiary institutions under study. These opinions and perceptions are

reflected in responses to statements regarding administrative issues before, during and after

the test event. Illustrative comments from interviews with the examiners and candidates are

provided in the next section to support and illustrate the survey results.

4.4.1. Candidates’ perceptions of the oral test administration

In Table 4.11, I present descriptive statistical results of candidates’ perceptions of the oral

test administration in terms of the preparatory stage, overall perceptions, feedback on

performance, and attitude towards computer-assisted language testing (CALT). The seven

items from the students’ questionnaire survey received means ranging from 2.42 to 4.34,

with standard deviations fluctuating between .83 and 1.08. The highest mean score (M =

4.34) of candidates’ need for teachers’ feedback on their oral performance (item 1) suggests

that the feedback would provide useful information for candidates in their English learning.

Nearly 90% of candidates’ responses in group interviews echoed this survey result.

Feedback was deemed necessary by language learners because it could help them know

what they had done well and what they were weak at so that they could practise more to

improve (U1G2.S2). Many candidates expressed positive perceptions of feedback and

appreciation of the constructive comments they had received from examiners:

After the test, I received from the examiner feedback on my speaking performance.

I think these comments are really helpful because one cannot have self-evaluation

when he’s speaking. There should be an outsider (an examiner) listening and giving

comments, then it will be more objective. (U1G2.S4)


Even in a paired speaking test, the examiner was able to give both candidates feedback at the same time. A candidate may also find the feedback given to his/her partner useful. In this way, the effect of constructive feedback could be doubled, with both candidates benefiting from the examiner’s comments beyond the scores awarded:

My examiner gave comments right on the spot. I remember the examiner

commented on a friend’s pronunciation. The teacher suggested corrections and

recommended how to say it right. After the oral test was over, the teacher advised us

to read more, and think more about our answers. I think those are very useful for us

to have further practice (to prepare) for the next speaking test. (U1G3.S3)

Candidates did not expect raters’ comments to be long and detailed (U2G2.S3).

They needed practical feedback that was motivating and encouraging (U1G3.S4).

Candidates did not think it was good to have an oral test session where the teacher made no

comments from beginning to end, but only said thank you to signal that the testing time was

over, and then goodbye (U2G2.S2).

Table 4.11 Means and standard deviations (SD) for candidates’ perceptions of the test administration

Statements N Mean SD

(1) I need teacher feedback on my speaking performance in the test so that I can improve my speaking skills. 351 4.34 .74

(2) I had sufficient time to prepare for this test. 355 3.68 .82

(3) I was clearly informed of the assessment criteria to prepare for the test. 352 3.62 1.02

(4) The test was well administered. 352 3.62 .86

(5) I achieved the best speaking performance that I was capable of. 352 3.30 .85

(6) The atmosphere of the test room was stressful. 352 3.28 1.08

(7) I believe that computer-assisted speaking tests are more accurate than those by human raters. 352 2.42 .89

Valid N (listwise) 351

Note. 1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree
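The Mean and SD columns in Table 4.11 (and later in Table 4.13) are standard descriptive statistics over the 1–5 Likert coding given in the note. The short sketch below, using invented responses and hypothetical item names, shows how such per-item summaries could be obtained; it is not the questionnaire analysis actually performed in this study.

    import pandas as pd

    # Illustrative only: a handful of invented Likert responses coded 1-5
    # (1 = Strongly disagree ... 5 = Strongly agree); item names are hypothetical.
    items = pd.DataFrame({
        "feedback_needed": [5, 4, 5, 4, 3],
        "sufficient_prep_time": [4, 4, 3, 4, 3],
        "calt_more_accurate": [2, 3, 2, 2, 3],
    })

    # One row per questionnaire item, with N, mean, and standard deviation
    summary = items.agg(["count", "mean", "std"]).T.round(2)
    print(summary)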


The oral test was administered at an appropriate point in students’ learning time.

Responses from the student survey demonstrated a high degree of agreement (M = 3.68)

towards candidates’ having sufficient time to prepare for the test (item 2), and a similar

degree (M = 3.62) on being clearly informed of the assessment criteria to prepare for the

test (item 3). Most of the candidates entered test rooms with an understanding of the

assessment criteria so that they would have their own strategies for their speaking task

performance to gain the best scores they could. However, a high standard deviation (SD =

1.02) implied that candidates showed different levels of agreement with this statement (item 3). In other words, a noticeable proportion of candidates (14.5%) took the test without knowing by what criteria their speaking was to be assessed. When I raised the

question whether candidates knew about the oral assessment criteria prior to the test, a

candidate reported her concern about the weighting of each component in rating:

I remember the teacher once told us that the test would be organised with two

students each time, but I was not very clear about the assessment criteria, for

example, how many points the grammar would be worth, or how the sentence

structure is assessed… So I did not know which part I should focus on. (U2G1.S4)

Candidates’ concern about the weighting of assessment components was reasonable

since “weighting of different parts of a test or assessment criteria reflects the perceived

importance, or lack of importance, of that aspect of the test in relation to other tasks”

(Galaczi & ffrench, 2011, p. 127). Another student said that candidates would have

different strategies before and during the test to suit the assessment criteria if they obtained

exact information about these criteria. She also mentioned an inconsistency in test

administration at her institution:

If teachers informed us that the discussion section accounted for more marks, then

everyone would contend to speak. But if teachers said the emphasis was on

pronunciation, the candidates would have more pronunciation practice to speak

accurately. In my first year (at university), when we were going to take a test, we

received a table in which the teacher told us the percentage for each component, or

the sub-score for each, but later we did not see it any more. So we did not know

how raters would score our speaking. (U1G3.S1)
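The candidate’s point about weighting can be made concrete with a small arithmetic sketch. The criteria and weights below are entirely hypothetical (the institutions’ actual rubrics and weights were not disclosed to candidates); they simply show how shifting component weights changes a composite speaking score.

    # Hypothetical analytic sub-scores for one candidate (out of 10)
    scores = {"pronunciation": 7.0, "grammar": 6.0, "vocabulary": 6.5, "fluency": 8.0}

    # Two hypothetical weighting schemes
    equal_weights = {"pronunciation": 0.25, "grammar": 0.25, "vocabulary": 0.25, "fluency": 0.25}
    pron_heavy = {"pronunciation": 0.40, "grammar": 0.20, "vocabulary": 0.20, "fluency": 0.20}

    def composite(scores, weights):
        """Weighted sum of sub-scores; the weights are assumed to sum to 1."""
        return sum(weights[c] * scores[c] for c in weights)

    print(round(composite(scores, equal_weights), 2))  # 6.88
    print(round(composite(scores, pron_heavy), 2))     # 6.9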


The survey results indicate a fairly positive evaluation of the way the test was

organised. The statements regarding candidates’ overall perceptions of the test

administration (item 4), and the test room atmosphere (item 6) received medium mean

values (M = 3.62 and M = 3.28 respectively). However, the highest value of standard

deviation (SD = 1.08) shows that there was a significant difference in candidates’

perceptions of the test room atmosphere (item 6). Here I give examples of candidates’

contrasting ideas about whether or not the testing environment was stressful. In one

candidate’s words:

When I entered the test room, I was not worried about anything. I just tried to

express myself at my best. I think I could demonstrate about 90% of my (speaking)

ability. (U1G1.S1)

Another student in the same interview group shared with me a different perception

of her test room atmosphere. The test room formality made her too anxious to perform oral

skills at her best, and reduced her speaking to about 60 or 70% of what she was capable of

(U1G1.S4). These opinions support the argument that anxiety has a negative impact on

cognitive abilities and hinders learners from displaying what they have learnt (Chuang,

2009). Candidates told me that the interactant or the interlocutor was possibly one of the

potential causes of stress in the test room:

For my case, the surrounding atmosphere was so quiet that I could not get

accustomed to it. In other situations, it may be caused by the interactant, the person

who put questions to me. If this person caused pressure and showed displeased

facial expressions, then I would not be able to speak out (U3G2.S3).

Most of the candidates did not think computer-assisted oral testing was more

accurate than that undertaken by human raters. The mean score of candidates’ belief in the

accuracy of computer-assisted speaking tests (item 7) was close to the value of ‘disagree’

(M = 2.42). Most of the interviewed Vietnamese students (80%) had no experience with

computer-assisted oral testing, but they could imagine what it was like because they all

knew how to use a computer. Candidates tended to prefer taking oral tests with human

raters. My interviews with students provided supporting evidence for their preference:


I do not like this form of assessment very much. Though speaking to a computer is

likely less stressful, I do not find computer-assisted oral tests interesting since no

one is listening to me. Speaking needs interaction. Interacting with humans makes

speaking easier. I don’t think it’s natural to speak in front of a computer screen.

(U1G3.S2)

Candidates had different perceptions of stress when taking an oral test on the computer.

They realised both advantages and disadvantages of testing with the assistance of

technology, as in a student’s comments:

I think testing on the computer has both advantageous and disadvantageous aspects.

Its advantage is a fair rating of oral performance because everything the candidate

speaks is recorded. Its disadvantage is stress to the candidate. Speaking directly to a

human rater is more comfortable, whereas speaking to a computer makes the

candidate sound like a robot (U1G2.S4).

One candidate reported from her own experience that computer-based testing causes

her stress since “the continuously running clock made my heart jump accordingly. It caused

such a lot of pressure that I could not speak at all” (U3G1.S2). In general, students were

less in favour of computer-assisted oral testing than of testing with human raters. In Table

4.12, I summarise students’ opinions for and against computer-assisted oral testing in terms

of interaction, consistency, and comfort for test takers. Students were not certain of

measurement accuracy by computer-based testing. There were more disadvantages than

advantages of computer-assisted oral testing that the students could identify.


Table 4.12 Candidates’ comments for and against computer-assisted oral testing

Interaction
  For:
  • Speak more freely and naturally
  Against:
  • Not interesting
  • Less motivating than with human raters
  • No personal feedback
  • Communication skills are not improved
  • Passive communicator
  • One-way communication, not authentic/real life
  • Poor flexibility in response

Consistency
  For:
  • Fairer when all candidates answer the same questions
  • Quick results
  • Equal timing
  Against:
  • Computer recognised only clear and accurate pronunciation
  • Technical issues
  • Possibly affected by unfamiliarity with technology

Comfort
  For:
  • Less stressful than being assessed by human raters
  • Not affected by raters (behaviours and/or facial expressions)
  • Personal privacy (not influenced by partners)
  Against:
  • Having to manage to speak within a limited time
  • Too stressed to speak at times

I have presented candidates’ perceptions of the oral test administration. In the following section, I examine test raters’ perceptions of the oral test administration.

4.4.2 Raters’ perceptions of the oral test administration

The purpose of this section is to present oral examiners’ perceptions of physical conditions

and uniformity in test administration. Multi-dimensional responses from the test raters

helped me to obtain a more comprehensive evidence-based understanding of the oral test

under study. The questionnaire section designed to examine raters’ perceptions of the oral

test comprised eight statements. Some statements in the raters’ questionnaire were similar

to those used in the candidates’ questionnaire: the preparatory stage of test administration,

overall perceptions, feedback on candidates’ performance, and attitudes towards CALT.

In Table 4.13, I present descriptive statistics results of examiners’ perceptions of the

test with total numbers of responses, means, and standard deviations. The eight items from

the teachers’ questionnaire survey received means ranging from 2.29 to 4.66, with standard

deviations fluctuating between .59 and 1.30. Teachers’ awareness of consistency in test

administration (item 1) received the highest mean score (M = 4.66). This result indicates


that the majority of teachers attached importance to ensuring consistency across test rooms

and candidates in oral test administration. Unlike assessing the other language skills such as

listening, reading, or writing, testing oral skills with human raters takes more time because

not all candidates can sit the live speaking test at the same time. Uniformity of

administration was identified as an initial concern to ensure reliability in oral assessment

across candidates in different examination rooms.

The survey results suggest that candidates had understood what criteria the raters’

evaluation was based on prior to the test. The mean score of teachers informing students of

the assessment criteria before the test (item 2) was at the second highest position (M =

4.34). Teachers were confident with candidates’ understanding that:

the speaking rubric consisted of assessment criteria. When we compiled the test

questions, we also included the speaking rubric. If candidates demonstrate their

ability in some components as expected for that language skill, we usually award

similar scores. (U1.T1)

Raters affirmed that oral rating was a complex combination of assessing different

linguistic aspects and sub-skills associated with oral production, as one teacher rater explained: “Linguistic competence comprises vocabulary, grammar, many other skills

related to language and knowledge. Raters relied on those criteria in the rating scale to give

scores” (U2.T2). However, as presented in the previous section, there were a considerable

number of candidates who did not know by what criteria their oral performance was

assessed. They may not have realised the importance of the test to make sufficient

preparation for it. An uncertain understanding of the objectives of the test may hinder a

candidate’s best performance in the test room.


Table 4.13 Means and standard deviations (SD) for raters’ perceptions of the test administration

Statement N Mean SD

(1) Ensuring consistency in rating and scoring are very important to me. 35 4.66 .59

(2) Students were informed of the assessment criteria before the test. 35 4.34 .73

(3) Students need to learn the language material and skills outlined in the course objectives to achieve good results for the test. 35 4.31 .58

(4) There was sufficient time for students to prepare for this test. 35 4.06 .76

(5) The atmosphere of the test room was formal. 35 3.54 .82

(6) I give feedback on students’ speaking performances after the test. 35 3.31 1.30

(7) The atmosphere of the test room was stressful. 35 2.43 1.04

(8) I believed that computer-assisted speaking tests are more accurate than those by human raters. 35 2.29 .96

Valid N (listwise) 35

Note. 1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree

The mean score for the importance of the language course material and skills (item 3) was fairly high, close to the value of ‘agree’ (M = 4.31), indicating that, in raters’ views, candidates needed to follow the contents covered in the course books in order to obtain good speaking test results. I provide further analysis of the test

content in Chapter Five. Several opinions from rater interviews aligned well with this

outcome since oral test writers were EFL teachers in charge of speaking classes. Most of

the question types and test tasks were adapted from those introduced in the students’ course

books. Many of them were exactly the same as those used in class. A teacher rater, also one

of the test designers, shared with me the construction of questions used in oral assessment:

All of them (the test questions) were located somewhere in the students’ course

book. The questions were not strange at all. For example, we completely reused

questions in the Warm-up, or Lead-in, or Consolidation sections from the course

book (U2.T3).

Most of the teachers did not think that test rooms were stressful for candidates. The

statement describing overall perceptions of the atmosphere of speaking test rooms (item 7) received a low mean, between the values of ‘disagree’ and ‘neither agree nor


disagree’ (M = 2.43). However, a high standard deviation (SD = 1.04) was indicative of

varied perceptions of how stressful the test rooms were. Examples of candidates’ expressions of test anxiety appear in the test taker characteristics section (4.2 of this chapter).

Results from the teacher survey highlight a significant difference in teachers’

opinions towards whether it was advisable (or possible) to give feedback after candidates

had completed the oral test. Teachers’ feedback on students’ speaking performances after

the test (item 6) received a medium mean response (M = 3.31) but the highest standard

deviation (SD = 1.3). Some teachers did not think it necessary to give feedback on

candidates’ performance, e.g. “There is usually no feedback session. Normally students go

home as soon as they finish the exam, because the teacher has reminded what needs

reminding (in class before the test)” (U3.T2); or with another reason being the impossibility

“to give feedback to individual students as time does not allow” (U2.T3).

A rater suggested choosing the right time to give feedback to benefit candidates, or

else it may have some counter effects:

Immediate feedback right after candidates’ speaking should be avoided so as not to

affect their psychology, because they were also taking exams for other subjects.

However, I think it’s advisable to send them a feedback report later that day, so

they’ll know which part they need for further speaking practice in the next semester.

(U2.T1)

Giving feedback received good support from teacher raters. However, it was

challenging for raters who had not taught the class (e.g. Universities A and B). They could

not relate a candidate’s performance to the expected outcome of the class level. An

experienced rater expressed her support, but anticipated possible challenges of giving

feedback on candidates’ oral production in the test:

I think it would be very useful for students (if raters’ giving feedback could be

done). But it would be very difficult. It could be if the rater was the teacher of that

class, so that they could keep up with the class level. For me, I was able to give

comments and feedback for some of my classes. My students liked it, and were very

appreciative of that. (U1.T3)


Teacher respondents were not very supportive of the incorporation of technology in

oral testing for tertiary education, at least in the current conditions of Vietnam. A low mean

score for teachers’ belief in the accuracy of computer-assisted oral tests (item 8) was close to the value of ‘disagree’ (M = 2.29). Due to the demand for English education in Vietnam, teachers of English can work at several institutions at the same time. Cross-institutional

English teaching resulted in many of the teacher interviewees (about 50%) having had

experience with computer-assisted language testing (CALT). Teachers had not yet given

much thought to the implementation of oral testing by computer because currently the

conditions of facilities at most Vietnamese universities would not allow it. Although CALT

could help to solve problems of limited time and large numbers of candidates (U1.T3,

U3.T3), many teachers were suspicious about its capacity to replace human raters in

judging learners’ communicative skills via spoken language (U2.T1, U2.T3, U3.T1). EFL

teachers anticipated numerous potential challenges if CALT was to be implemented:

boredom and tiredness in rating, low quality of audio-recording affecting raters’ perception

and evaluation, more difficulty understanding candidates in the absence of facial

expressions and body language, training for oral raters, oral performance affected by

candidates’ lack of computer navigation skills, etc.

Compared with direct speaking tests (by human raters) in general, participants

recognised that computer-based testing (CBT) in the present conditions of Vietnamese

tertiary education would bring fewer advantages than disadvantages. A universal

implementation of computer-assisted oral testing across universities needs more time and

investment of finance, technology, and human resources because:

Changing technologies makes it difficult to create a system for authoring that takes

advantage of current technical knowledge. Testing programs need to have built-in

mechanisms for updating software, hardware, and technical skills of employees.

Keeping up with the realities of technology use requires substantial financial

investment. (Chapelle, 2008, p. 129)

Table 4.14 summarises teachers’ opinions for and against computer-assisted oral

testing in terms of interaction, consistency, comfort, practicality, and L2 learning.


Table 4.14 Teacher raters’ comments for and against computer-assisted oral testing

Interaction
• For: not affected by facial expressions, body language, or behaviours
• Against: communication skills were ignored; impossible to observe interactive reaction; absence of friendliness and smiling; unobservable face-to-face interaction, body language and gestures

Consistency
• For: equal timing; fair rating across candidates
• Against: sound quality (unauthentic); potential effects of computer navigation skills

Comfort
• For: playing back allows listening again for more accurate rating
• Against: tiring and boring, less lively than face-to-face interaction with candidates

Practicality
• For: handling large numbers of candidates; universal use of sound-recording devices (e.g. smartphones)
• Against: knowledge and skills of technology; required familiarity (experience); rater training; challenges for popular implementation

Language learning
• For: pronunciation self-assessment; pronunciation teaching support; encouragement for improving pronunciation
• Against: requirement for native-like pronunciation

4.5 Summary

My analysis of the data collected from surveys and interviews suggests that

successful oral test administration needs effective collaboration between teaching and

administrative staff. The starting point in administering language tests is consideration of

test taker characteristics. Testing contexts should facilitate candidates’ oral performance

and minimise measurement bias to enhance construct validity.

The most remarkable strength of these institutional test administrations was that

examiners and examinees became engaged in the oral assessment through meaningful face-

to-face interactions at testing venues. Live scoring was typical as it provided opportunities

for use of oral language in authentic situations. The practice of employing a double-rating

format gave stakeholders a sense of test score reliability. The inclusion of an interlocutor

helped to ensure consistency in task delivery and enabled a multi-aspect evaluation

framework for paired oral raters. The test was not intended to place an examination burden on students but provided an opportunity for language learners to perform the skills they had learned. In addition, the test helped to maintain a sense of compliance with communal discipline and formality in educational quality assurance. As a school-based achievement test, the oral assessment was an important link reflecting institutional EFL teaching and learning activities in that what had been taught and learnt would be tested. However, the test showed some shortcomings to be overcome in order to achieve educational goals: candidates should know the assessment criteria before the test, be able to perform free of anxiety during the test, and receive constructive feedback as useful information to improve their language learning after the test. Computer-assisted oral assessment has its own values in practice and in the literature of language testing; however, Vietnam needs more time and investment to keep pace with technological innovations in oral language testing as used by the rest of the world.

In the next chapter, I examine the content relevance of question items used to elicit

candidates’ oral performance. I present results from EFL experts’ judgements on the oral

tests’ question relevance to the predetermined objectives of each institution’s programme of

teaching Listening-Speaking skills. Further analysis makes connections between experts’

comments on the quality of oral test questions and candidates’ actual performance.

Included are the experts’ suggestions for test item revision where appropriate.


Chapter Five

RESULTS: CONTENT RELEVANCE OF SPEAKING TEST QUESTIONS

The content of the speaking test should have a strong relationship with test

construct. If a speaking test is designed to measure achievement on a

particular programme of study, we must ask to what extent the test content

provides the opportunity for the test takers to demonstrate that their ability

on the constructs of interest have improved. (Fulcher, 2003, p. 195)

5.1 Introduction

This chapter examines the content of the speaking test from the perspective of content

experts in the post-test evaluation process, to answer Research Question 1b: To what extent

was the test content relevant to the course objectives? As an achievement test used at the

end of a course of instruction, validation of the oral assessment tool is concerned with its

content as “a sample of what has been in the syllabus during the time under scrutiny”

(Davies, 1990, p. 20). The approach adopted in examining the test content validity involved

selecting a team of EFL experts to judge whether each item in the oral test was relevant to

(or congruent with) the constructs defined in the course outlines (Ozer et al., 2013;

Delgado-Rico et al., 2012; Fitzpatrick, 1983). Subjectivity of human-based judgements was

inevitable in seeking content validity evidence this way, it was, nevertheless, “crucial for

the overall validity of the test” (Farhady, 2012, p. 37). With achievement tests, it is

essential that the content of the test is derived from the course content (Flowerdew &

Miller, 2012). In other words, it should indicate “representative and relevant samples of

whatever content or abilities the test has been designed to measure” (Brown & Hudson,

2002, p. 213).

This chapter will present the four stages used in the process of examining content

validity evidence: defining the construct that the test was supposed to measure, designing

the judgement protocol from the question items used in the speaking tasks, selecting

approaches for data analysis, and reporting the experts’ judgements of items constructed on


the basis of the course books used in the speaking classes and the course objectives

established by each institution’s official course outlines. I end this chapter with a summary

of key points presented in the results section of experts’ judgements on the test content’s

relevance across institutions.

5.2 Defining test constructs

Listening-Speaking courses were part of the B.A. training syllabus for EFL majors. Data

gathered from the course outlines showed that, as EFL majors, students in their first 2 years

(four semesters) at university had to take language skills courses, i.e. Listening, Speaking,

Reading, and Writing. Although the Listening-Speaking course accounted for the same

weighting (three credits) across the three universities, the course lengths varied from 45 to

60 periods. Each period was 50 minutes long (MOET, 2007). Each course lasted between

11 and 15 weeks, including a final week allocated for end-of-course assessment (around

four to five periods).

Common to these institutions was that they adopted the integrated-skills approach,

in contrast with the segregated-skills approach, in designing the course of English listening

and speaking skills for EFL majors. Speaking skills were not taught separately but

alongside listening skills in a single academic module, namely the subject of Listening-

Speaking skills. One advantage of integrating language skills is that it:

exposes English language learners to authentic language and challenges them to

interact naturally in the language. Learners rapidly gain a true picture of the richness

and complexity of the English language as employed for communication. Moreover,

this approach stresses that English is not just an object of academic interest nor

merely a key to passing an examination; instead, English becomes a real means of

interaction and sharing among people. (Oxford, 2001)

The EFL Faculty of each institution established its own course objectives and

assessment criteria accordingly. Each institution retained the right to select, use, adapt, or

change the course book if syllabus innovation and improvement were required. Table 5.1

presents a cross-institutional comparison in terms of the English Listening-Speaking course

length, language skills textbook, course objectives, expected outcomes, and supplementary

material resources.


Table 5.1 Comparison of the English-speaking courses across the institutions

Three-credit course length
• University A: 60 periods / 12 weeks
• University B: 45 periods / 11 weeks
• University C: 45 periods / 15 weeks

Main course book of integrated skills
• University A: Skillful Listening & Speaking 4 by L. Clandfield and M. McKinnon, Macmillan Education, 2014
• University B: Skills for Success 4 - Listening and speaking by R. Freire and T. Jones, Oxford University Press, 2011
• University C: Lecture Ready 3: Strategies for Academic Listening and Speaking by L. Frazier and S. Leeming, Oxford University Press, 2013

General themes
• University A: Language for social, academic and professional purposes
• University B: Language for everyday conversations
• University C: Language skills for academic success

Course objectives
• University A: For speaking skills, students will be able to: conduct a conversation; use degrees of formality, agree and disagree in formulating a debate; support proposals; emphasise important information when making a speech by using repetition and contrastive pairs; negotiate in organising a cultural program; and add points to an argument when holding a debate.
• University B: Knowledge: students will be able to communicate on topics central to everyday life and also less popular topics. Skills: students will be able to develop their soft skills, especially critical thinking skills, self-study, and teamwork capability. Attitude: students will be aware of using language correctly and appropriately in daily communication.
• University C: The course aims to: enhance academic listening and speaking strategies; develop students’ knowledge of several academic areas (business, humanities, and science); help students become an active and confident member of classroom discussions; help students understand and use phrases and expressions in lectures; and give students the chance to lecture on a topic of their choice.

Expected outcomes
• University A: C1 (CEFR)
• University B: B2 (CEFR)
• University C: High-intermediate

Course policies
• University A: All grades are public and transparent for all students. Students must take part in a group oral presentation and/or written project. All members of a group will receive the same score; i.e. the project is assessed, and everyone receives this score.
• University B: Students are encouraged to participate in class activities. Bonus points are given to active participation during the course. Online self-study is required. Results from online assignments are counted in the end-of-course speaking test.
• University C: To ensure the maintenance of academic integrity, students are required to work independently on individual assignments, avoid plagiarism in any form, and work responsibly within a working group.

As can be seen from Table 5.1, University A’s Listening-Speaking course lasted the

longest (60 periods), and its expected outcome was Level C1 based on the Common European Framework of Reference (CEFR). Universities B and C had the same course length (45 periods each), but University B expected its students to achieve Level B2 upon course completion, whereas University C used the term “High-intermediate” to refer to the language level goal in its learning outcomes assessment (Appendix B.4c). Clear goal

setting is crucial in language education given that goals help to pinpoint what needs to be

achieved within a predetermined timeframe, provide a way to measure learners’ progress,

and improve motivation to learn more (Donato & McCormick, 1994; Oxford & Shearin,

1994).

The institutions used different course books to teach Listening-Speaking skills to

second-year English majors. All of the books had been published by world-renowned British publishers within the five years preceding data collection: Skillful Listening

& Speaking 4 by Macmillan Education (2014), Skills for Success 4 – Listening and

Speaking by Oxford University Press (2011), and Lecture Ready 3: Strategies for Academic

Listening and Speaking by Oxford University Press (2013). These theme-based course

books integrate Listening (receptive) and Speaking (productive) skills into a study

programme with a variety of themes, e.g. games, nostalgia, change, etc. (Skillful Listening

& Speaking 4); the science of food, discovery, humans and nature, etc. (Skills for Success

4); and business, media studies, psychology, etc. (Lecture Ready 3). Each unit (a

combination of lessons) covers one theme or one field of academic study.

The course books adopt different approaches to introduce new lexical resources for

listening and speaking tasks. Each unit in Lecture Ready 3 provides learners with two

reading passages to build background knowledge for listening to lectures and discussing

related issues in small groups, e.g. Unit 3 (Science) contains two reading passages entitled

“What is Homeopathy?” (page 47), and “Artificial Voices” (page 57). Skillful Listening &

Speaking 4 sometimes employs shorter reading passages as a pre-task activity to provide

learners with essential vocabulary, or has them brainstorm the topic they are about to face in listening or speaking tasks, e.g. Brainstorm before speaking (Unit 4, page 45); Read an

essay before listening (Unit 10, page 100), etc. Skills for Success 4 does not have reading


texts, but provides a vocabulary preview, grammar or pronunciation practice as language

input for listening and speaking tasks, e.g. Vocabulary exercise before listening to a radio

programme about students who take a “gap year” (Unit 7, pages 135-136); Indirect speech

review before telling a story (Unit 8, pages 164-165).

EFL Faculties of the institutions were responsible for selecting course books, and

designing and administering separate end-of-course examinations for Listening and

Speaking skills to assess students’ academic achievement after they had completed the

language skills course. For Speaking skills, Universities A and C were similar in terms of

their general objectives in that they aimed to assess students’ spoken language in an

academic environment. Universities A and B shared a common concentration on social

communication, and daily conversations. According to the official course objectives, the

speaking component at University A aimed to equip students with conversational skills, e.g.

how to manage a conversation, agree and disagree, win points in an argument, manage

conflicts by monitoring and reformulating, and so forth. University B paid attention to

developing students’ soft skills, e.g. critical thinking skills, self-study and teamwork

capability, besides communication skills for daily life topics. University C focused more on

knowledge about business, humanities, and science. University C’s English majors were

also exposed to understanding and using phrases and expressions from lectures in their

course books. Appendix B.4a-c provides additional information about the institutions’

Listening-Speaking course outlines.

5.3 Designing the content judgement protocol

Test items for this study were actual questions used in the speaking test. I collected copies

of speaking test items from the official testing material provided by the EFL Faculties after

the examinations. I grouped the test questions into three sets coded 01, 02, and 03,

respectively belonging to the three institutions A, B, and C, in order to obtain an overview

of the content and structure of the test questions.

University A designed 10 topics for oral performance corresponding to the 10

thematic units in the course book Skillful Listening & Speaking 4. Each topic contained two

pictures for individual candidates to talk about (Task 1), and two questions from which

each pair of candidates was required to choose one for their discussion (Task 2). These two


tasks of each topic shared the same theme, which made up 40 speaking test items in total

(Appendix B.4a). For example, the first four items (U1.Q1-4) belonged to Topic 1 whose

theme was “Risk” from Unit 4 of the course book (pages 37-46). However, each pair of

candidates interpreted the theme of the assigned pictures in their own ways since the task

instruction did not explicitly state what the theme of those pictures was.

University B employed 10 topics associated with the themes presented from Units 6

to 10 in the course book Skills for Success: Listening and Speaking 4. Each topic was

designed for one pair of candidates and covered three themes in two tasks. Task 1 contained

six questions on two different themes. The interlocutor used only three or four of these

questions to ask each pair of candidates at random because of the time limit. Task 2 was of

another theme with a six-prompt mind map accompanied by a discussion question. There

was a total of 70 test items in University B’s 10 sets of test questions (Appendix B.4b). For

example, the first set included two groups of questions in Task 1: 1A’s theme was “Food

and science” (U2.Q1-3), and 1B was “Success and gap years” (U2.Q4-6). Task 2 comprised

a mind map of “Ways to help reduce hyperactivity in children” and a discussion question

“Which one is the most important?” (U2.Q61). The order of University B’s test items used

in the protocol was slightly altered to follow a task-based sequence, i.e. all the test items for Task

1 came first, before those for Task 2. I made this re-arrangement of test items to ensure a

more reader-friendly layout of test items without causing any interruption in the flow of

question types.

University C designed 45 questions in total to use for a teacher-student interview

task (Appendix B.4c). The questions were based on the topics presented in the course book

Lecture Ready 3. These questions can be classified into two types: display questions – e.g.

What is the Enron scandal? (U3.Q22), What are the pros and cons of ‘neuromarketing’?

(U3.Q19) etc. – for eliciting predetermined correct responses; and referential questions –

e.g. Is multitasking good or bad, in your opinion? (U3.Q14), What type of family do you

come from? (U3.Q35), etc. – for encouraging candidates to produce meaningful and creative

language in response (Brown, 2004, p. 159). In each test session, the examiner asked the

candidate three to five questions selected at random from the list, and expanded to some

follow-up questions to elicit further elaboration or explanation when necessary. Table 5.2


summarises information about the quantity of test items and their thematic variations

employed at the tertiary institutions.

Table 5.2 Summary of test questions used for oral assessment across the institutions

• University A: 10 sets of topical questions; 40 question items. Question types: Task 1, 20 picture-cued questions; Task 2, 20 discussion questions. Variation in task themes: each set of questions was of the same theme.
• University B: 10 sets of topical questions; 70 question items. Question types: Task 1, 60 questions for oral interviews; Task 2, 10 mind-maps for discussion. Variation in task themes: in each topic, questions in Task 1 covered two themes, and Task 2 one theme.
• University C: sets of topical questions unspecified; 45 question items. Question types: 25 display questions; 20 referential questions. Variation in task themes: themes varied within those covered in the course book.

I prepared a booklet of three sets of test questions for each EFL expert to make

judgements on the degree to which the items were relevant to the language domain

established in the course outline of each institution. Consideration should be paid to the

comprehensibility, representativeness, and difficulty of each test item. The experts recorded

their judgements on a 4-point Likert response scale of (1) Highly irrelevant, (2) Not

relevant, (3) Relevant, and (4) Highly relevant, which I adapted from the scale developed

by Davis, 1992. The experts could provide further comments on particular items and/or

suggestions for revision wherever they felt appropriate.

To facilitate experts’ judgement, I included the following documents in each

booklet: copies of the three course books; copies of the three course outlines excluding

information which would identify the institutions; a description of speaking tasks and

testing procedures; instructions for the experts to follow when completing their ratings;

three sets of test items with space to indicate their (ir)relevance and add further comments

and suggestions for revision if necessary; and a form for the experts to provide

demographic information (Appendix B.4d). I met each expert in person and gave him/her


an oral briefing on the speaking test formats and my expectations for their personal

judgements.

5.4 Selecting approaches to data analysis

There are multiple approaches to calculating the content validity index (CVI) based on expert judgements (Lynn, 1986; Polit, Beck, & Owen, 2007; Wynd et al., 2003). For the purpose

of this study, I examined evidence of the test content validity at two levels: (1) the content

validity of individual test items (I-CVI), and (2) the content validity of the overall scale (S-

CVI). The I-CVI refers to the degree of content relevance judged on speaking test items

individually, and the S-CVI on the entire set of test items used at each institution (Polit &

Beck, 2006). Figure 5.1 presents the diagram showing the relationship between CVI levels.

Quantifying the experts’ responses this way dichotomised the ordinal 4-point scale

into two broader categories of ‘Relevant’ (ratings ‘3’ and ‘4’) and ‘Irrelevant’ (ratings ‘1’

and ‘2’). When there are six or more judges, it is recommended that the standard of I-CVI

needs to be .78 or higher (Lynn, 1986). That means with six judges, there should be no

more than one ‘Irrelevant’ rating because if there were two ‘Irrelevant’ ratings, then four

‘Relevant’ ratings would make the I-CVI only .67.

The I-CVI of an item was calculated as the number of experts rating that item either

‘3’ (Relevant) or ‘4’ (Highly relevant) divided by six – the total number of experts. The

calculation resulted in seven possible I-CVI levels, valued from 0 to 1.0, corresponding to the number of experts giving the ratings ‘3’ and ‘4’ (Table 5.4). For example, if an item

received ‘3’ or ‘4’ ratings from all six experts, the I-CVI of that item would be 1.0

(maximum). By contrast, if no experts rated an item ‘3’ or ‘4’, its I-CVI would be 0

(minimum). If rated Highly relevant or Relevant by five out of six experts, it would be 0.83,

etc.
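As a minimal illustration of this calculation (using a hypothetical set of six ratings rather than the study’s actual data, and an illustrative function name), the I-CVI can be sketched in Python as follows:

def i_cvi(ratings, relevance_cutoff=3):
    # Item-level CVI: proportion of experts rating the item
    # 'Relevant' (3) or 'Highly relevant' (4).
    relevant = sum(1 for r in ratings if r >= relevance_cutoff)
    return relevant / len(ratings)

# Hypothetical item: five 'Relevant'/'Highly relevant' ratings, one 'Not relevant'
print(round(i_cvi([4, 3, 4, 2, 3, 4]), 2))  # prints 0.83

Applying the same function to an item rated ‘3’ or ‘4’ by all six experts returns 1.0, and to an item with no such ratings returns 0, in line with the seven levels described above.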

Computational procedures for the S-CVI for the entire scale were carried out in terms of (2a) the average rating for all the test items on the scale across the various judges (S-CVI/Ave), and (2b) the proportion of items on the scale to which all the content experts gave ‘Relevant’ ratings of ‘3’ or ‘4’ (S-CVI/UA). As their names suggest, the S-CVI/Ave focuses on the average quality of the scale, whereas the S-CVI/UA indicates the inter-expert agreement on the content validity of that scale. I calculated the S-CVI/Ave by summing all the ‘Relevant’ ratings given by the experts combined and dividing by the total number of ratings for each scale, which is equivalent to averaging the I-CVIs across the items. Figure 5.1 presents a summary of these terms and definitions regarding content validity (Polit & Beck, 2006, p. 493).
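Continuing the illustration above (again with hypothetical item-level values rather than the study’s data), the two scale-level indices can be sketched as follows:

def s_cvi_ave(item_cvis):
    # Averaging method: mean of the I-CVIs across all items on the scale.
    return sum(item_cvis) / len(item_cvis)

def s_cvi_ua(item_cvis):
    # Universal agreement method: proportion of items rated 'Relevant'
    # ('3' or '4') by every expert, i.e. items with an I-CVI of 1.0.
    return sum(1 for cvi in item_cvis if cvi == 1.0) / len(item_cvis)

item_cvis = [1.0, 1.0, 0.83, 0.67, 1.0]   # hypothetical five-item scale
print(s_cvi_ave(item_cvis))               # 0.9
print(s_cvi_ua(item_cvis))                # 0.6

Because every item receives the same number of ratings, averaging the I-CVIs is mathematically equivalent to summing all the ‘Relevant’ ratings and dividing by the total number of ratings, which is the computation reported in Table 5.6.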

Figure 5.1 depicts the CVI as the degree to which an instrument has an appropriate sample of items for the construct being measured, branching into (1) the I-CVI, the content validity of individual items (the proportion of content experts giving each item a relevance rating of 3 or 4), and (2) the S-CVI, the content validity of the overall scale, computed either as (2a) the S-CVI/Ave, the average of the I-CVIs for all items on the scale, or (2b) the S-CVI/UA, the proportion of items on the scale that achieve a relevance rating of 3 or 4 from all the experts.

Note. I-CVI, item-level content validity index; S-CVI, scale-level content validity index; S-CVI/UA, scale-level content validity index, universal agreement calculation method; S-CVI/Ave, scale-level content validity index, averaging calculation method. Source: Polit and Beck (2006, p. 493)

Figure 5.1 Content validity index (CVI)

5.5 Relevance of test contents

In order to obtain a comprehensive view, I employed multiple data sources to examine test contents from various perspectives. The main source of data was the EFL experts’ judgements on the extent to which test questions were relevant to the course objectives and

the contents covered in the course books. Secondary sources were participants’ opinions

from interviews, oral test questions and candidates’ speech samples in real test

performance. In this section, I will discuss results from the EFL experts’ judgements on the

content relevance of speaking test items and link their opinions with other data sources of

transcripts from interviews and test performances.


5.5.1 EFL experts’ judgements on test content relevance

All the experts invited to evaluate the items completed the judgement task and returned

their work within the planned timeframe of one or two weeks after receiving the booklets of

test documents. Each of the experts contributed a total of 155 ratings on the test items of all

three institutions. Statistical results show that 132 out of 155 test items received further

comments from at least one expert besides the numeric ratings. In addition to the comments

on each item, the experts provided further feedback on level of difficulty, wording,

language focus, relevance to the course outlines and/or made suggestions for revision

wherever possible. Table 5.3 displays a statistical chart of the data collected from the six

EFL experts.

Table 5.3 Information collected from EFL experts’ content judgement protocol

• Expert 01: 155 ratings; comments on 24 test items; further remarks on the entire question sets of two institutions
• Expert 02: 155 ratings; comments on 40 test items; further remarks on the entire question sets of two institutions
• Expert 03: 155 ratings; comments on 50 test items; further remarks on the entire question sets of two institutions
• Expert 04: 155 ratings; comments on 30 test items; further remarks on the entire question sets of two institutions
• Expert 05: 155 ratings; comments on 8 test items; further remarks on the entire question sets of all three institutions (A, B, C)
• Expert 06: 155 ratings; comments on 54 test items; further remarks on the entire question sets of all three institutions (A, B, C)

As shown in Table 5.3, each expert provided additional remarks on the institutions’

entire sets of test questions besides comments made on individual test items. Experts 05

and 06 gave further comments on all three institutions’ sets of test questions. Experts’

comments focused on the relevance of test items to the course objectives and the contents

covered in each course book. Following are the judgement results by the experts at two

levels: the CVI for each test item, and the CVI for the whole scale at each institution.

CVI for individual test items (I-CVI)

As presented in the procedure for data analysis above, the I-CVI was understood as the proportion of ‘Relevant’ ratings that the content experts gave to each test item. Based on the course objectives and the contents of the course book, the experts’ ratings of ‘3’ and ‘4’ were counted as ‘Relevant’, and those of ‘1’ and ‘2’ were counted as ‘Irrelevant’. Table 5.4 presents the CVI results for test items (I-CVI) across institutions in the two categories of content validity, ‘Relevant’ and ‘Irrelevant’.

Table 5.4 Content validity index of test items (I-CVI) across institutions

‘Relevant’ items
• Rated ‘Relevant’ by 6 experts (I-CVI = 1.0): Uni. A 17 (42.5%); Uni. B 55 (78.6%); Uni. C 20 (44.4%)
• Rated ‘Relevant’ by 5 experts (I-CVI = .83): Uni. A 14 (35%); Uni. B 13 (18.6%); Uni. C 13 (28.9%)
• Subtotal*: Uni. A 31 (77.5%); Uni. B 68 (97.2%); Uni. C 33 (73.3%)

‘Irrelevant’ items
• Rated ‘Relevant’ by 4 experts (I-CVI = .67): Uni. A 4 (10%); Uni. B 2 (2.8%); Uni. C 9 (20%)
• Rated ‘Relevant’ by 3 experts (I-CVI = .5): Uni. A 5 (12.5%); Uni. B 0; Uni. C 3 (6.7%)
• Rated ‘Relevant’ by 2 experts (I-CVI = .33): none at any institution
• Rated ‘Relevant’ by 1 expert (I-CVI = .17): none at any institution
• Rated ‘Relevant’ by 0 experts (I-CVI = 0): none at any institution
• Subtotal**: Uni. A 9 (22.5%); Uni. B 2 (2.8%); Uni. C 12 (26.7%)

Total of * and **: Uni. A 40 (100%); Uni. B 70 (100%); Uni. C 45 (100%)

Table 5.4 indicates that the I-CVI level of 1.0 (all six experts gave ‘Relevant’ ratings)

was found for individual test items at all the institutions involved in the study. At this level

of absolute agreement across the experts, the majority of University B’s test questions

(78.6%) were rated ‘Relevant’. This proportion was 44.4% and 42.5% for Universities C

and A respectively. The I-CVI level of .83 (five of the six experts gave ‘Relevant’ ratings) accounted for 35% of University A’s test items. The proportion of test items

receiving the I-CVI of .83 was nearly 19% at University B, and 29% at University C.

There was a remarkable difference in the second category, items judged as having insufficient content validity (i.e. rated ‘Relevant’ by no more than four experts). University B had the fewest test items in this category (2.8%), whereas the proportion was more than one-fifth at University A (22.5%) and more than one-fourth at University C (26.7%).

No test items had an I-CVI of lower than .5, i.e. each test item received at least three

‘Relevant’ ratings from the EFL experts. However, the range of I-CVI differed across

institutions. The I-CVI of University B varied from .67 to 1.0 while that of University A

and University C was from .5 to 1.0. At the lowest level (an I-CVI of .5), there were five

items (12.5%) from University A’s speaking test items and three (6.7%) from University

C’s. In general, most of the content experts gave individual test items Relevant or Highly

relevant ratings.

Table 5.5 presents typical examples of speaking test items of various task types that

received I-CVIs of 1.0 and .83 in the speaking tests administered at the three institutions. The experts’ comments on these items’ content relevance are included.

Table 5.5 Examples of test items with I-CVIs of 1.0 and .83

University A
• I-CVI = 1.0: U1.Q22. Talk about the photo. Comment: This test item is relevant to the content covered in Unit 2 entitled “Games” (E5).
• I-CVI = .83: U1.Q36. Discuss the question: What factors contribute to human’s physical achievements, e.g. climbing Mount Everest or landing on the Moon, etc.? Comment: This test item is interesting (E6), but suitable for only those who are concerned about physical activities (E2).

University B
• I-CVI = 1.0: U2.Q16. Answer the question: Which meal is the most important of the day to you? Why? Comment: The question is adapted from a pre-listening activity on page 202 (E5), and interesting for learners (E6) as it provided an opportunity to talk about and express personal awareness of eating habits.
• I-CVI = .83: U2.Q28. Answer the question: What is the main difference between youth sports today and 150 years ago? Comment: The question is relevant as it is taken from the listening task on page 207 (E5), but it is likely that many students cannot answer because they do not know about what sport was like 150 years ago (E2).

University C
• I-CVI = 1.0: U3.Q4. Answer the question: What factors do you consider when purchasing a product? Why? Comment: The question is adapted from a pre-reading activity of Unit 1 (E5) to elicit candidates’ talk about personal interests in shopping.
• I-CVI = .83: U3.Q3. Answer the question: What is neuromarketing? Comment: This question is more suitable for testing reading comprehension skills (E2), but relevant in terms of language content as it is taken from a lecture on page 8 (E5).

CVI for scales (S-CVI)

The CVI for a 4-level scale was defined as “the proportion of items on an instrument that

achieved a rating of “3” or “4” by the content experts” (Beck & Gable, 2001, p. 209). As

indicated in Table 5.4, the proportion of test items on which all six experts agreed in giving ‘Relevant’ ratings (an I-CVI of 1.0) varied remarkably across the institutions. University B had the highest proportion of such items (78.6%), which means the S-CVI/UA was .79 for the entire scale used at University B. This proportion was much lower at University A and University C, where the S-CVI/UA was estimated to be .43 and .44 respectively.

The average of the content validity for all items on the scale of each institution (S-

CVI/Ave) was calculated by summing all the items rated “Relevant” by all the experts

combined and then dividing by the total number of ratings for the test at each institution.

Results in Table 5.6 show that University B had the highest “Relevant” ratings by all

experts for all test items (S-CVI/Ave = .96). University A ranked second (S-CVI/Ave = .85), and University C the lowest (S-CVI/Ave = .83).

Table 5.6 The average of the I-CVIs for all items on the scale

‘Relevant’ ratings by expert:
• Expert 01: Uni. A 39; Uni. B 70; Uni. C 34
• Expert 02: Uni. A 35; Uni. B 66; Uni. C 24
• Expert 03: Uni. A 35; Uni. B 61; Uni. C 42
• Expert 04: Uni. A 34; Uni. B 66; Uni. C 41
• Expert 05: Uni. A 30; Uni. B 70; Uni. C 39
• Expert 06: Uni. A 30; Uni. B 70; Uni. C 45

Number of items rated ‘Relevant’: Uni. A 203; Uni. B 403; Uni. C 225
Total number of ratings by all the six experts: Uni. A 240; Uni. B 420; Uni. C 270
S-CVI/Ave: Uni. A 0.85; Uni. B 0.96; Uni. C 0.83

As shown in Table 5.6, the S-CVI/Ave of every institution’s scale was higher than .8, meaning that, on average, more than 80% of the experts’ ratings judged the question items to be content valid; University B’s overall scale obtained the highest degree of acceptability for the S-CVI/Ave (96%). In a more liberal sense of congruence, the S-CVI/Ave should be conceptualised “as the average I-CVI value because this puts the focus on average item quality rather than on average performance by the experts” (Polit & Beck, 2006, p. 493). With this method of calculating CVIs, we can expect the S-CVI/Ave never to be lower than the S-CVI/UA of any overall scale, because the latter depends on the universal agreement of all the content experts involved in the judgement process, which may be affected by judges’ inevitable subjectivity.

The experts’ further comments on University B’s test items were in line with the

computation of content validity for the institutional overall scale, i.e. the whole set of

questions used in its speaking tasks. In a general evaluation of test content

representativeness with reference to the course content coverage, one expert remarked:

The content of the test items is highly relevant to the topics covered in the course

book. Most of the questions are taken from the speaking sections while some others

are from the listening sections (pre- and post-listening). (E5)

However, the experts acknowledged an imbalance in levels of difficulty and generality

across this institution’s oral test question items. This variation signified some inconsistency

in test design that may have affected candidates’ speaking performance. Another expert

provided additional comments:

Many questions are more difficult than the others because their required language

contents are not of the same degrees of difficulty. Some are too challenging, and

others are too general or too specific for second-year students. (E6)

This situation was echoed in candidates’ admissions that despite fairly good

speaking ability, their understanding and vocabulary range were not sufficiently broad to

give spontaneous responses to a number of questions (e.g. U2.Q41) requiring such “expert”

knowledge as damage caused by oil spills or shrinking habitat of polar bears (U2G1.S3).

The experts expressed concerns over other questions that challenged candidates with either a “too big topic” (E6), asking about the world before and after the industrial revolution (U2.Q48), or an uncommon requirement to recommend a certain kind of food or drink for someone with a cold (U2.Q60) – a “too specific” (E6) physical problem that not every student had knowledge about. After a general remark, one of the experts made a suggestion on improving the content validity of University B’s test items:

Most of the questions are suitable. However, there should be more questions

relating to real life activities which give students more interest to talk about. (E2)

The content experts made their comments independently, each drawing on their own professional skills and experience in designing EFL course syllabi and evaluating L2 achievement tests. In general, the speaking test items from the three

universities received positive comments from the experts from other institutions. Table 5.7

presents some illustrative examples of items rated “Relevant/Highly relevant” by all the

experts.

Table 5.7 Examples of items rated ‘Relevant/Highly relevant’ with positive comments by all the experts

• U1.Q23 (Topic: Change). Example: Why are some people better than others at dealing with change? Comments: “Relevant in terms of content” (E5). This item was adapted from a question in the ‘Discussion point’ section (p. 77).
• U2.Q11-Q12 (Topic: Discovery). Example: Are there any positive/negative effects of chance discoveries in our lives? Comments: “Relevant” (E6). This item was taken from a post-listening activity focusing on chance discoveries (p. 161).
• U2.Q22 (Topic: From School to Work). Example: If you could go anywhere in the world for a year, where would you go? Why? Comments: “Interesting” (E1). This item encouraged candidates’ imagination to talk about their dreams and ambitions in the transition from school to work. It was taken from a post-listening activity in the text book (p. 138).
• U2.Q27 (Topic: The Science of Food). Example: Do you avoid the food that contains additives? Why or why not? Comments: “Necessary knowledge” (E6). This item helped to raise students’ awareness to care about the food they eat every day.
• U2.Q50 (Topic: Human and Nature). Example: What steps should the government take to deal with environmental issues? Comments: “Authentic and contemporary topic” (E2). This item raised a local concern about environmental problems calling for the government’s taking action to resolve.
• U3.Q2 (Topic: Intellectual Property and the Music Industry). Example: What is your attitude to music downloading and file-sharing? Comments: “Relevant to the content covered in Chapter 4” (E5). This item was adapted from a post-reading activity in the text book (p. 43).

My further exploration into the experts’ feedback, however, revealed content-

related issues in writing questions for oral assessment. These problems varied from simple

spelling mistakes and inaccurate wording to more complex issues of question design that

possibly confused candidates when asked.

Inaccurate wording

Some words were not used in an appropriate context. This might not have caused candidates much trouble in comprehension, but it could have contributed to forming habits of inappropriate language use in word choice, collocations, prepositions, phrasal verbs, etc. Table 5.8 presents examples of items that need revising for more accurate grammar or better word selection. The examples are accompanied by the experts’ suggestions for item revision.

Table 5.8 Examples of problematic items and experts’ suggestions for revision

• U1.Q19 (Topic: Change). Example: Should people make changes even when they have no problems with their current achievements? Inaccuracy: word choice. Suggestion: “achievements” should be changed into “situations” (E3); “achievements” has a positive meaning while “problems” has a negative meaning.
• U1.Q35 (Topic: Risk). Example: Humans have achieved many huge physical feats, such as landing on the Moon or reading the South Pole, etc. Can you name some other feats? … Inaccuracy: spelling. Suggestion: “reading” needs to be changed into “reaching” (E3; E5).
• U1.Q12 (Topic: Sprawl). Example: If you were city mayor, what changes would you make to your city? Inaccuracy: authenticity. Suggestion: “City mayor” does not exist in Vietnam whereas the candidate was asked to think about their Vietnamese city (E1).
• U2.Q23 (Topic: Work and gap years). Example: Are there any disadvantages of taking a gap year off between high school and college/university? Inaccuracy: grammar. Suggestion: “a gap year off” should be changed into “a year off” or just “a gap year” (E3).
• U2.Q70 (Topic: Human and nature). Example: What should the government do to protect the animals of high extinction? Inaccuracy: grammar. Suggestion: “of high” should be changed into “from” (E3).
• U3.Q7 (Topic: Business). Example: What do you think about the statement “Poor quality usually goes with cheap price”? Inaccuracy: collocation. Suggestion: “cheap price” should be changed into “low price” (E6).

Problematic questions

Some questions were not logically designed. They exposed candidates to unnecessary challenges of reading comprehension and of making cognitive distinctions among the provided clues before they could begin to speak. Some test items included redundant clues (e.g. U2.Q64). Some questions were too general to help candidates brainstorm appropriate ideas for responses (e.g. U2.Q48). Other questions were overly theoretical and required memorisation of content knowledge to answer (e.g. U3.Q23). Table 5.9

presents examples in terms of problem types with the experts’ comments and suggestions.

Table 5.9 Examples of confusing test questions with experts’ comments and suggestions

Confusing prompts
• U2.Q64 (From school to work): Advantages and disadvantages of new models of climbing the career ladder: workers are not very loyal to the companies; companies tend to be less stable; workers gain more experience, etc. Which advantage is the most important? Comments and suggestions: “This question is very confusing. There are both advantages and disadvantages (listed in the mind map), but students are asked to address only advantages. The mind map should list either advantages or disadvantages instead of both” (E3).
• U2.Q67 (Food): Advantages and disadvantages of canned food and fast food: portable, less nutritious, time-saving, etc. Which disadvantage is the most serious? Comments and suggestions: This item is similar to the case of U2.Q64. The mind map (Appendix B.4b) should list either the advantages or the disadvantages of canned food and fast food, not both (E3).

Too theory-oriented
• U3.Q1: What is a Target market? U3.Q23: What is the Ponzi scheme? U3.Q37: How many types of intelligence are there? List them. U3.Q40: What types of line are typical in design basics? How are they used? U3.Q43: How many types of lecture language have you learnt so far? What are they? Comments and suggestions: Although a few experts rated these items as “relevant” to the course content, many others did not think they were suitable for assessing speaking skills. “These questions (designed for a final speaking test) seem simplistic” (E1). “Many questions are about the contents of the reading passages. Therefore, they check students’ reading (comprehension), and do not give chances for them to develop/perform their speaking skills. The questions should be more productive” (E2).

High cognitive demand
• U1.Q39 (Flow): “Not everything will go as you expect in your life. This is why you need to drop expectations and go with the flow of life.” Discuss this statement. Comments and suggestions: “More difficult than the other questions. Candidates need more time to think to be able to speak on this topic” (E6).
• U1.Q35 (Expanse): Humans have achieved many huge physical feats, such as landing on the Moon or reaching the South Pole, etc. Can you name some other feats? What benefits have they brought to humanity? Comments and suggestions: “Sometimes students cannot remember any (other feats). This topic is suitable for some students only” (E2); “Too much in a question” (E6).
• U2.Q48 (Discovery): How do think about the world before and after invention revolution? Comments and suggestions: Candidates did not know what aspect(s) about the world this item aimed to ask. It covered “too big a topic” for candidates to think of a proper answer (E6).
• U2.Q60 (Food): “What food or drink would you recommend to someone who has a cold?” Comments and suggestions: “Sometimes students do not have knowledge to answer” (E2). Language learners should be tested on language skills, not on knowledge of how to deal with a “too specific” health problem (E6).

Inconsistency in task input

In order to examine the consistency of oral test questions, I categorised the question items

into three types according to the task input channel and form: visual non-language, visual

language, and aural language (Table 5.3). The visual non-language type took the form of

pictures as prompts for candidates to look at and give an oral response (University A’s Task

1). The visual language type elicited candidates’ responses via written questions (University A’s Task 2) and mind maps (University B’s Task 2) to look at and read. The aural language type consisted of spoken questions delivered by raters (University B’s Task 1; and University C’s

task).

My analysis of the experts’ judgements and the test questions themselves indicated that there was remarkable inequality in degrees of difficulty across question items of the same task type. The first type took the form of pictures as prompts for long-turn responses (University A’s Task 1). The second type elicited candidates’ responses via written questions (University A’s Task 2) and mind maps (University B’s Task 2). The third type comprised spoken questions delivered by interlocutors (University B’s Task 1 and University C’s task). Picture-based material (type 1) varied in style (black and white/colour photos, cartoons, artistic/authentic) and in the challenge it posed to candidates’ talk. Selecting different pictures for each task was necessary in order to avoid the test contents being revealed to candidates who took the test later, but inconsistency in picture selection could lead to unfairness for candidates, depending on which picture they were randomly assigned.

A comparison of the length of test questions demonstrated a significant difference in

total words and content words4 among test questions of the same task type. Content words carry the core meaning of a question and therefore place the greatest potential cognitive load on candidates.

Table 5.10 presents descriptive statistics of question lengths in total words (TW) and

content words (CW) across different task types adopted at the institutions.

4 Content words (CW) are open to new members (open class). Their number in the language is not fixed because of the introduction of new lexical items. Although CWs may comprise one morpheme only, they tend to have a complex internal structure as a result of processes such as inflection, derivation, and compounding. CWs are normally longer than function words. This group of words includes nouns, lexical verbs, adjectives, and adverbs. In spoken language, CWs are usually stressed (Bell, Brenier, Gregory, Girand, & Jurafsky, 2009; Krejtz, Szarkowska, & Łogińska, 2015).

Function words (FW) are resistant to new members (closed class) and constitute a less numerous group of lexemes compared to CWs; yet, they tend to appear more often. FWs are mostly very short with a simple structure and a single morpheme only. This group of words includes parts of speech such as determiners, pronouns, primary auxiliaries, modal auxiliaries, prepositions, adverbial particles, coordinating conjunctions and subordinating conjunctions, wh-words, the existential ‘there’, the negator ‘no’, the infinitive marker ‘to’, and numerals (Bell, Brenier, Gregory, Girand, & Jurafsky, 2009; Krejtz, Szarkowska, & Łogińska, 2015).
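As a rough illustration of how such counts can be obtained (a minimal Python sketch that assumes a small, deliberately incomplete stoplist of function words; the study’s own classification followed the part-of-speech groupings described in the footnote above, so the figures reported in Table 5.10 may differ), the TW and CW counts of a single test question could be computed as follows:

import re

# Illustrative, deliberately incomplete stoplist of function words
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "if", "of", "to", "in", "on",
    "for", "with", "at", "by", "from", "is", "are", "was", "were", "be",
    "do", "does", "did", "have", "has", "had", "can", "could", "will",
    "would", "should", "what", "which", "who", "how", "why", "when",
    "you", "your", "they", "their", "it", "there", "not", "no",
}

def word_counts(question):
    # Return (total words, content words) for one test question.
    tokens = re.findall(r"[a-zA-Z']+", question.lower())
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(tokens), len(content)

# U3.Q4, used here only as an example input
print(word_counts("What factors do you consider when purchasing a product? Why?"))
# prints (10, 4) with this stoplist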


Table 5.10 Question lengths in numbers of total words (TW) and content words (CW) across different task types adopted at the institutions

Uni. A’s Task 2 (TW / CW): N valid 20 / 20; missing 0 / 0; minimum 6 / 3; maximum 32 / 15; range 26 / 12; mean 16.1 / 7.3; median 13.5 / 6.5; std. deviation 7.6 / 3.7; variance 58.4 / 13.9; sum 322 / 146; % 100 / 45.3

Uni. B’s Task 1 (TW / CW): N valid 60 / 60; missing 0 / 0; minimum 5 / 2; maximum 27 / 14; range 22 / 12; mean 13.4 / 5.8; median 12 / 5; std. deviation 4.8 / 2.5; variance 22.8 / 6.5; sum 805 / 349; % 100 / 43.4

Uni. B’s Task 2 (TW / CW): N valid 10 / 10; missing 0 / 0; minimum 26 / 16; maximum 61 / 34; range 35 / 18; mean 39.9 / 23.5; median 36.5 / 23.5; std. deviation 13.6 / 6.3; variance 184.1 / 39.6; sum 399 / 235; % 100 / 58.9

Uni. C’s Task (TW / CW): N valid 45 / 45; missing 0 / 0; minimum 3 / 1; maximum 19 / 10; range 16 / 9; mean 9.4 / 4.5; median 9 / 4; std. deviation 4.0 / 2.1; variance 15.7 / 4.2; sum 423 / 202; % 100 / 47.8

Note. University A’s Task 1 is not included in this table because it uses visual non-language input.

As shown in Table 5.10, University B’s Task 1 used the most textual input (805 words in total). University A designed the questions for Task 2 with the fewest words (322 words in total). The proportion of CWs used in test tasks ranged from 43.4% (University B’s Task 1) to 58.9% (University B’s Task 2). The proportion of CWs in University B’s discussion task was the highest because the textual prompts used in the mind maps were usually noun phrases with more CWs than FWs, e.g. “raising residents’ awareness” (U2.Q70), “high in sugar and fat” (U2.Q67), etc. The most significant disparity in the number of words across questions of University B’s Task 2 was 35 for TWs and 18 for CWs. The longest discussion question of this task had 61 words in total (U2.Q64), more than double the number of TWs of the shortest, which had only 26 words (U2.Q68, 69). A similar situation was found with CWs. The highest number of CWs used in this task type was 34 (U2.Q62), whereas the lowest was only 16 (U2.Q69) 5.

Variations in question length or in the loading of CWs in speaking test questions did not help with “discriminating between higher ability and lower ability test takers” (Fulcher & Davidson, 2007, p. 103), since the difficulty of any particular item did not apply to all the candidates (in the same test room or at the same institution). No individual test taker had to answer all the question items written for the test. This feature makes writing questions for oral tests different from writing a normal multiple-item test on Grammar or Reading comprehension, where all candidates are required to respond to all the test items available. Correlations between the responses to each oral test question and the total scores awarded to the test takers were impossible to estimate because candidates in direct testing were required to answer different test questions. The chance of a question item being repeated across candidates was low. For example, at Universities A and B, the number of candidates’ turns (approximately 20 pairs) was only double the total number of question sets (10 topics) available for each test room. If the examiner tried to assign a different topic to each pair of candidates, a speaking topic could be repeated only once. Unequal degrees of difficulty across oral test items therefore introduced inconsistency and possible unfairness in task assignment rather than functioning, as intended, to discriminate between students’ language levels.

Task requirements were the same for all candidates, but content variations could

increase the possibility of unfairness in test task assignment if a very difficult (or very easy)

question item was delivered to a certain test taker. The interlocutor (or examiner) was

supposed not to deliberately deliver any candidate an item on the test that was easier or

5 Question items with underlined CWs:

U2.Q62: “Ways for parents to help their children develop sportmanship: Helping children decide when to start to play sports, Learning about the injuries and safety in sports, Being positive towards children’s games, Modelling how kids should react to their teammates, Discouraging kids from playing just one sport, Sending children to a special school. Which way is the most important?

U2.Q69: “What should a child do to get successful in playing sports? Financial support, Motivation, Passion, Time, Physical strength, Parents’ encouragement. Which one is the most important?


more difficult than other items; however, unfairness was inevitable when all the test questions were used. The unfairness would be greater if a less-able student was randomly assigned more difficult questions, or vice versa.

Similar issues arose with other picture-cued questions for the monologic task. A

number of visual input prompts provided an ambiguous content focus that was challenging

to recognise despite the relevance of their language content (e.g. U1.Q6, Q10, Q37). The selection of cartoons may be unfair for some examinees in that “they may generate too little talk, or the humour may be difficult to express in words” (Luoma, 2004, p. 167). Pictures of

various types (photographs, line drawings, computer graphics, etc.) can be used as task

materials as they can evoke creative ideas and elicit spoken language performance from

intensive to extensive levels (Brown, 2004; Luoma, 2004). However, for effective

elicitation of desired speech samples for assessment, test designers are advised to make test

pictures clear enough so that they do not intimidate the examinees by their visual

complexity. Purposefully ambiguous pictures are a special case, as the intention is usually

to make the examinees hypothesise about the intended message. These are appropriate for

assessing hypothetical language, but there may be a need for a back-up solution to make

sure that it is not picture interpretation but the language demands of the task that make the

task difficult for some examinees (Luoma, 2004, p. 167). Table 5.11 provides illustrative

examples of picture-cued questions that require detailed consideration to interpret their

meaning. The experts’ comments on each are included.


Table 5.11 Examples of hard-to-interpret picture-cued questions with experts’ comments

• U1.Q6 (Nostalgia): “The picture is not clear enough” (E4). Young learners might not have recognised the folded pieces of paper were to mean old-fashioned letters from the past.
• U1.Q9 (Sprawl): “Not very clear for students to see the relevance to what they have learnt. The photo should indicate the past vs. the present – changes spotted” (E5). “Challenging to recognise the focused theme to talk about this artistic photo with light effects. Possible highlights: nightlife city, skyscrapers, street lights, modernisation, etc.” (E1).
• U1.Q10 (Sprawl): “Too abstract to recognise the meaning of this picture with funny figures” (E2). Students may find it challenging to guess who the characters are and why they are waiting for the city to come. It would be even harder if they did not understand the wording on the road sign. However, it is “relevant in terms of the content of Unit 5 or 8” (E5).
• U1.Q18 (Change): “Photo is hard to interpret at a glance” (E3). Candidates might not know what the picture focuses on: a variety in forms from a cocoon to a butterfly, or (the meaning of) the transformation process.
• U1.Q37 (Flow): “Difficult to understand the picture” (E2, E3). It is not clear whether the photo focuses on the canal in the middle or the (unfinished) roadwork on the two sides.

The issue of test content should be examined from such “a perspective in which

content relevance and representativeness can be considered in the light of construct

validity, rather than treating content validity as an aim and end in itself” (Fulcher, 1999, p.

234). Many of the EFL experts pointed out this issue when judging the question items used

for the institutional speaking test papers. What if a question item was relevant to the content

covered in the study programme but did not match the construct that the oral test was intended to measure – speaking skills? This was the case of University C where

approximately 44% (20 out of 45) test items were Wh-questions to elicit theoretical or

conceptual knowledge from receptive skills lessons (i.e. Reading and Listening) during the

course. These questions are categorised as display questions eliciting predetermined

answers in a responsive speaking task (see Chapter Two). However, according to the

experts, they did not “give chances for students to perform or develop their speaking skills.

Questions (for an oral test) should be more productive” (E2). With reference to the purpose

of integrating language skills as a means for interpersonal communication, one expert

commented that:

These test items are more appropriate to test students’ memorisation or knowledge

of subject contents. It is not appropriate to test speaking skills and discussion

strategies as indicated in the integrated course book. (E5)

The experts made suggestions on how to achieve both theory and practice within a course

concentrating on strategies for academic lectures. According to the course book authors,

“lecture language” refers to “the discourse markers, speech features, and lexical bundles

that lecturers across disciplines commonly use to guide students in taking in information”

(Frazier & Leeming, 2013, p. iii). One of the ways for testing students’ knowledge of

lecture language was through their actual use of the language in their lecture or presentation

(E1), rather than checking it with Wh-questions asking “what” or “how many” (e.g. U3.Q43, 45).

5.5.2 Linking experts’ opinions with other data sources

The content of test questions was very important in eliciting candidates’ meaningful speaking performance. The questions covered a wide range of themes of interest about the surrounding world. The EFL experts noted, as strong points, that the contents of the question items were “relevant”, “interesting”, “necessary”, and “authentic” (E1, 2, 3, 6). On the other hand, the experts pointed out content-related deficiencies in the test questions that could and should be improved. These comments were echoed in data from speech samples and interviewees’ opinions, which helped to support the experts’ judgements.

Two sides of content-related judgements by the experts are now discussed: (1) effective

incorporation of relevant contents in test questions, and (2) content-related shortcomings in

test question design.


(1) Effective incorporation of relevant contents in test questions

Test questions were not solely designed to examine candidates’ language ability, but also to encourage learners to acquire necessary knowledge about the contents conveyed in the

course book, i.e. language for social purposes and daily communication (Table 5.1). For

example, questions about additives contained in fast food (U2.Q27), advantages and

disadvantages of preservatives added to food (U2.Q58), and stricter laws on food quality

control (U2.Q59) aimed to draw language learners’ attention and raise essential awareness

about food hygiene and safety, which has been a public concern in Vietnam (Khanh Linh,

2018; T. A. Nguyen, 2018; Nguyen Thuy, 2013). Many questions regarding behavioural

psychology (U1.Q27, 28; U2.Q7, 44; U3.Q29, 38) were very useful for young people’s

psychological and cognitive development. Excerpt 5.1 illustrates an example of positive

thoughts extracted from a paired discussion task on the question “Is it possible to avoid

conflicts?” (U1.Q27).

Excerpt 5.1 File U1.C15-16 5:16 – 5:42

1        C15: I think that conflict is unavoidable (.) for animal, but
2        human we are something more > than just < animal. er we can
3 --->   avoid conflicts (.) % and % by taking a step back, we show
4        the others that we appreciate the relationship between the
5        two more than (.) to find out who’s right, who’s wrong.

The candidate’s response (line 3) indicated that the test question was effective not only in eliciting a good oral language sample for assessment, but also in creating an

opportunity for the candidate to express his mature thoughts and attitude towards social

problems. Regarding the educational content of the candidate’s oral performance, he

demonstrated the ability to apply academic knowledge learnt from the course book to his

response to the test question: “Avoid moral conflicts of interest and still be successful!”

(Skillful Listening & Speaking 4, p. 98). Such a way of thinking was very meaningful in the current context of an increasing trend towards violence in almost all walks of life in Vietnam (Rasanathan & Bhushan, 2011; Chi An, 2015; Quang Nhat, 2017; Dang Khoa, 2018).

Test designers took into account suitable contents that evoked young adult learners’

interests. Interesting themes such as games (U1.Q5, 21, 23, etc.), sports (U1.Q1, 2; U2.Q14, 29, 45, etc.), entertainment (U3.Q25, 29), and technology (U1.Q16; U3.Q33)

encouraged candidates to produce natural talk with joy and comfort. I found many

examples of both long turn (monologic) and interactive (dialogic) speaking tasks in candidates’ speech samples. Excerpt 5.2 is an example of a candidate’s response to an individual long turn task talking about a picture depicting a scene of boys and girls

participating in an outdoor game (U1.Q22).

Excerpt 5.2 File U1.C7-8 2:38 – 3:26

1        C8: this photo is also about games but an outside activities
2 --->   and to >I think the purpose of< this game %is% to improve the
3        er strength, the physical strength and also %the% teamwork of
4 --->   the student and er it is also a competitive spirit er but I
5 --->   think it’s good for young children instead of just er sitting
6        in front of %the% computers or smartphone all days, but er
7        (when) outside >go outside< to have fun with their friend to
8        improve the solidarity and also friendship.

This candidate’s response demonstrates her achievement in terms of topical content

that a unit from the course book targeted: Games (Unit 2, Skillful Listening & Speaking 4).

The script shows the candidate had good knowledge about the advantages of teamwork

(lines 2 and 4), and young people’s appropriate attitude towards the use of hi-tech

communication devices (lines 5-8). The response did not give a description of the picture,

but a reflection on it.

Second-year students found it quite comfortable and productive talking about topics

familiar to young people’s world. Excerpt 5.3 is an extract from an examiner-candidate

interactive task performance in which the candidate demonstrated his ability to express

what he knew about the music market and the popularity of music downloading in

Vietnam, as elicited by the examiner’s follow-up questions after the main question about

music downloading and file-sharing (U3.Q29).

Excerpt 5.3 File U3.C8 3:43 – 3:26

1  --->  IN: so do you think that downloading music for free online is
2        fair for musicians?
3        C8: uhm in my opinion, I think that uh, mm people should have
4        pay in order to download music because music, whether it’s live
5        or online, is product of er creative work by many people like
6        musicians, singer, music technician. I think good work should be
7        paid. payment will encourage people make more and more music
8        product. it’s not fair if these people are not paid.
9  --->  IN: what do you think about music downloading from the internet
10       in your country?
11       C8: I think that downloading music is very popular in vietnam.
12       music is er available on many websites for free download. people
13       don’t have (.) habit to buy original cds or dvds. I think we (.)
14       we should have policies to manage this problem er to develop
15       music production.

Although the examiner’s elaborating questions (lines 1 and 9) were not included in the official list of question items prepared for oral assessment, they were well within the

“target language use domain” (Bachman & Palmer, 1996, p. 44) of a chapter from the

Listening-Speaking course book. In that chapter, students were involved in a class

discussion about whether they agreed or disagreed with a statement “Downloading music

off the Internet without paying for it is no different from buying a used CD or copying a

friend’s CD” (Chapter 4, Lecture Ready 3, p. 34).

Candidates’ opinions from interviews suggest that most of the test contents were

similar to those practised in language class activities (U1G2, U2G2, U3G1). There were no

recorded cases of candidates who could not demonstrate any speaking ability in the test: all candidates were able to produce at least some spoken English to complete their speaking test sessions. However, to complete the test tasks well, as required, they had to make themselves familiar with the content covered in the course book being used. Familiarity with those contents resulted from active engagement in class speaking practice and from learning beyond the classroom, such as through English learning websites on the Internet, English-speaking clubs, casual conversations with foreigners, etc. When asked about

the role of Speaking lessons in equipping students with necessary content for the end-of-

course speaking examination, one candidate reported as follows:


For me, the contents from the Speaking class were a partial support (for the end-of-

course exam). The rest laid in our own effort to make use of them. There were some

questions that were not taken from the course book which exceeded our ability. We

had to enrich our knowledge by going out for more practical experience to be able

to answer such questions. (U2G2.S3)

Oral raters acknowledged and appreciated the inclusion of questions whose contents

were associated with social life, students’ lives, and science serving society (U2.T2,

U3.T1). These themes fit in well with the course objectives of language use for daily

communication, social purposes (Universities A and B), and areas of humanities

(University C). Many questions (U2.Q7, 8, 42, 49) aimed to remind the young generation

of environmental problems throughout the country, and increase awareness of the

importance of protecting the ecosystem for both the present and the future (Chi Hieu &

AnhVu, 2016; Clark, 2016; Ho, 2016; Anh Minh, 2018; Tran, 2008). By bringing real-life

issues into language testing and assessment, test designers showed that oral performance

“correspond[ed] to language use in specific domains other than the language test itself”

(Bachman & Palmer, 1996, p. 23). Test authenticity gave test takers a sense of achievement

and positive perceptions of the usefulness of their language study in that they could use the

target language for social purposes.

The experts’ judgements on test contents were echoed in my interviews with teacher

raters when the latter said that candidates could incorporate in their oral performance

vocabulary about social life that they had learnt from listening lessons in the course

(U2.T3). Apart from pronunciation, content knowledge plays a crucial role in speaking. In one teacher rater’s words:

The test helped teachers identify what students are strong and weak at. Students’

speaking informed (us) to what extent they could achieve the course objectives so that

we (teachers) will have appropriate methods to help them improve spoken English.

There were students who needed to improve their content knowledge more than their

(English) pronunciation. Learning a language, in my opinion, means not only being

able to speak that language but also knowing what to say. So knowledge is also a

very important part of making a good speaker. (U3.T2)


(2) Content-related shortcomings in the design of test questions

As the EFL experts pointed out, there existed some shortcomings and inaccuracies in the

question items across the three institutions. The occurrence of these items was not frequent,

but they did affect candidates’ performance, and therefore the raters’ evaluations and

fairness across candidates within the same institution. Evidence from speech samples

demonstrated that experts’ notes on problematic question items were in line with the

problems encountered when candidates responded to those questions, regarding the

question design or the content of the questions.

Precise wording in constructing test items is very important because it represents the

standard of language students have learnt at school and has a direct impact on candidates’

performance (Osterlind, 1998; Heaton, 1975). Students (as candidates) could imitate and

include wording errors that were not recognised and corrected prior to task delivery. The

following excerpt is from a peer interaction task where a candidate re-read a discussion

question with a grammar mistake, from a prepared text that the interlocutor had read to

them as the task instruction. In this case, the non-finite form of the verb “be” was redundant

in “makes people be successful” (U2.Q68) and should be omitted (E3) in both spoken and

written English.

Excerpt 5.4 File U2.C15-16-17 5:46 – 9:26

1        IN: thank you, now I'd like you to talk together about what
2  --->  makes people be successful in their career for about three
3        minutes, I'm sorry, five minutes >because this is a group of
4        three<, you should give your opinions, reasons, or examples.
5        after that you should decide which one is the most important,
6        first you will have some time to look at the task.
7        ((50 seconds for preparation, some Vietnamese sounds in group
8        discussion))
9  --->  IN: now, talk together.
10 --->  C17: so the question is what makes people be successful in their
11       career. what do you think about (.)
12       C15: uhm in my opinion I think non-traditional education
13       educational path and determination are two reas... are two
14 --->  causes to make people be successful in their career because er
15       with >non-traditional educational path< they didn't they didn't
16       have to follow the traditional step.
         . . . ((18 lines of transcript))
36       C15: %uhm and% I think er in in the life er people alway face
37       challenge or difficulty so if (you) they have a high
38       determination they will get over (the) difficulty and achieve
39       their goal. so for me determination is the most reason (causing)
40 --->  make people be successful in career.

As can be seen from Excerpt 5.4, after completing Task 1, the interlocutor delivered

Task 2 by reading to the candidates the instruction exactly as it was printed in the testing

material booklet. The inclusion of a grammatical inaccuracy in the interlocutor’s task

delivery (line 2) was possibly not because of his/her carelessness, but because of the design of the testing material, which he/she was not allowed to change in the role of interlocutor. After

the preparation time, the interlocutor asked the group to start their discussion performance

(line 9). Candidate 17 initiated the discussion by reading the question from the testing

material booklet (line 10). Candidate 15 re-used the phrase in her performance (line 14)

once in her first turn, and once again (line 40) in her conclusion after the partner’s turn. We

can see that three (the interlocutor, Candidates 15 and 17) out of the four speakers involved

in this scene unintentionally incorporated the inaccurate verb phrase into their speech. This

could become part of the learners’ habitual language use after the test. We cannot be certain

whether or not the co-examiner (the assessor) counted the grammatical error in scoring, as

grammar accuracy was amongst the assessment criteria (Appendix D.1). Speech samples

reveal similar occurrences for other test items with inaccurate wording that the experts

pointed out in their judgement protocol (U1.Q12; U2.Q23, 70; U3.Q7, 30).

The experts doubted the ability of the display questions (U3.Q1, 3,

22, 23, etc.) to elicit the intended oral language samples for assessment. Although questions

of this type were relevant to the course content, they elicited predetermined correct answers

(Brown, 2004), and did not create two-way interaction for participants in testing speaking

skills. They were more appropriate for reading comprehension tests than for oral

assessment (E1, E2). The experts wondered why the course book (Lecture Ready 3) was designed for learning how to communicate with peers in group discussion (Frazier & Leeming, 2013, p. iii), while the assessment description in University C’s course outline stated that the final (speaking) test involved “one interview where students will be asked

about how to organise a talk, how to use certain expressions of lecture language, or

knowledge conveyed by the chapters” (Appendix B.4c). The experts’ concerns were well

founded when I examined how these display questions worked in candidates’ oral

performance. The following examples (Excerpts 5.5-7) illustrate why the experts suggested

that test questions of this type should be revised to become more “productive” (E2) in

eliciting oral language samples from candidates.

Excerpt 5.5 File U3.C14 0:20 – 0:50

1        IN: so, can you tell me three kinds of lecture language you have
2        learned? and give examples for this.
3  --->  C14: yeah, the first one is topic lecture language, erm second
4        is er transition lecture language, and (.) last is %kind of%
5        literature language.
6        IN: ok, can you give one example?
7  --->  C14: let me start with, let’s start by

Excerpt 5.5 is an extract from a one-on-one interview task in which the examiner

asked Candidate 14 to tell him/her three types of lecture language (U3.Q43). The candidate

listed three types of lecture language as required (lines 3-5) and gave two examples (line 7)

without clarifying which type(s) the examples belonged to. The content experts suggested

lecture language be tested through candidates’ actual use in lectures or oral presentations

(E1) rather than by direct requirements to recite from “memorisation” (E5) of theoretical

knowledge. A question that requests a display of knowledge does not necessarily engage the candidate in the use of functional language (Bachman, 1990).

I found similar recitation by candidates in response to questions requiring a

demonstration of conceptual knowledge, e.g. business-related concepts. Candidates

attempted to recall some content knowledge they had learnt from the course rather than use

spoken English to express their own ideas or perform interactional competence.

Candidates’ difficulty in dealing with such display questions is illustrated in Excerpt 5.6.

Excerpt 5.6 File U3.C13 0:05 – 0:52

1        IN: what do you know about neuromarketing?
2  --->  C13: er er neuromarketing is an application of neuroscience, and
3        it use er it include the direct use of the brand er er the brand
4        image er marketing er and (.) er to deal with the (be fond) of
5        the brand and the researcher will er base on it er to evaluate
6        their (.) I think (.4) their %er% I think er (.) fitness of
7        their product.

As can be seen in Excerpt 5.6, Candidate 13’s response lasted almost 50 seconds but

contained very frequent pauses and hesitations (lines 2-7). There were signs of attempting

to find appropriate vocabulary and to remember a correct definition for ‘neuromarketing’

(U3.Q3). This candidate received an overall score of 7.7 out of 10.0, including 1.9 out of a

maximum 3.0 for the Content component. Oral performance was not better for stronger students who were asked this type of question at random. Excerpt 5.7 illustrates an example of another candidate responding to the same question as in Excerpt 5.6. The

candidate was awarded an overall score of 8.6, including 2.4 for the Content component.

Excerpt 5.7 File U3.C12 0:03 – 0:25

1        IN: okay. so what do you know about neuromarketing?
2  --->  C12: um, neuromarketing is about um (.3) brands uh (.4) affects
3        brain on marketing on and how to work brain on %this, yes. I
4  --->  think it's% very %simple%.

Candidate 12’s oral performance (lines 2-4) was not productive in response to the

examiner’s question. Her stretch of speech was quite short (about 20 seconds), featuring

long pauses of hesitation (3 to 4 seconds) and quiet talk (line 4) as a sign of lack of

confidence and/or a limited source of ideas. These examples indicate that display questions

were not very successful in creating opportunities for candidates to perform authentic

communication and might have decreased their verbal fluency if the required knowledge

happened to be in a specific content area they were not strong at. Such “exam questions”

(Searle, 1969, p. 66) are not considered to be real questions because they do not function as

genuine requests for information the questioner wants to know, but are used for examining

whether the candidate knows some information. Such content checking might cause candidates anxiety and stress, since they are not in the position of an active interactant with the examiner, but of a learner being tested on what has been taught.

The experts pointed out many cases of test questions whose ambiguous contents

were confusing to candidates. The experts argued that the main purpose of the examination

was to assess speaking skills. The meaning of the test questions should have been clear in

order for candidates to understand them right away. Candidates should have been able to use their time to demonstrate oral production ability rather than undergoing further cognitive processes to understand complicated questions before being able to respond (E4, E6). The experts’ concerns about the ambiguity of test questions were consistent with my speech sample

analysis. Excerpt 5.8 provides an example in which the candidate expressed her uncertainty

about the meaning of a question (U2.Q46) not because she did not understand its

vocabulary, but because its content was too general.

Excerpt 5.8 File U2.C15-16-17 3:43 – 4:26

1  --->  IN: ((addresses Candidate 17 by her first name)) how do you know
2        that one invention is important or not?
3  --->  C17: er one invention? er (.) I’m sorry but I mean what kind of
4        invention?
5  --->  IN: just an invention. how do you know that an invention is
6        important or not?
7  --->  C17: er the invention is important or not is depends on the
8        creator. I mean the inventor er and the ability that the product
9        or er the that invention can affect the human.

After the interlocutor read the question once (lines 1-2), Candidate 17 asked

clarification questions (lines 3-4) to make sure that she understood what kind of invention

the interlocutor was referring to by “one invention”. The candidate had good reason to ask

in this situation. Without knowing in what field the invention was made (medicine,

military, telecommunication, etc.) or who the invention was made to serve, it was hard for

the candidate to generate a satisfactory answer because an invention in some areas is

important for humankind, whereas one in another area may not be. An invention may be important for the benefit of some people, but for other people or living things it might be disadvantageous or even harmful. However, the interlocutor could not offer further explanation but instead repeated the exact question, possibly in order to keep the task delivery consistent with the interlocutor frame (line 5). The candidate seemed forced to cope with this question with a short answer, providing a general way to justify whether or not an invention is important (lines 7-9). Her response lacked further elaboration, e.g. on how the effect of an invention might be judged to be important. The main content of this question concentrated on approaches to justifying the importance of inventions, whereas the course book lesson focused on the world’s accidental inventions,

personal discoveries made by accident, and important factors in making any kind of

discovery (Unit 8: Speaking, Skills for Success 4, pp. 168-169).

5.6 Summary

In this chapter, I have discussed the relevance of test contents (content validity) through

objective judgements made by EFL experts outside the institutions involved in the study.

Statistical results of the experts’ judgements indicated that most of the test items showed

content validity evidence in accordance with the selected course books, and the expected

course objectives at each institution. However, there were discrepancies across institutions in the content validity index both at test item level and for the whole scale. The majority of the test items at University B were relevant to the content domain covered in the course book, whereas relevant items at the other universities accounted for about three-quarters, leaving approximately one-quarter irrelevant. Consistency and fairness to all the candidates were not assured for those who had to perform oral language skills elicited by such irrelevant

questions. Further investigation into the relationship between experts’ judgements and

speech samples indicated that test question design could facilitate or hinder candidates’ oral

performance with regard to both quality and quantity.
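For reference, the indices summarised above are typically computed as follows (a minimal sketch assuming the conventional item-level and scale-level content validity index definitions, using the standard labels I-CVI and S-CVI/Ave, which may differ from the terminology adopted in this study):

I-CVI (per item) = number of experts rating the item as content-relevant / total number of experts
S-CVI/Ave (whole scale) = sum of the I-CVI values across all items / number of items on the test paper

An item-level value close to 1.0 indicates near-universal expert agreement that an item is content-relevant, while the scale-level average summarises relevance across an institution’s whole set of question items.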

The experts gave different ratings for each item and helped to point out those items in the speaking tests which needed improving with regard to wording, structure, orientation, and difficulty level. On a larger scale, University B again obtained the strongest agreement from the experts on the degree to which its test as a whole was relevant to the content covered

in the training programme and the outcome objectives established in the course outline.

The next chapter focuses on examining the development of test tasks adopted for

oral assessment at the institutions. I compare the different types of speaking tasks in terms of

response formats, task purposes, time constraints for task completion, and channels through


which oral examiners and candidates communicate. I discuss the results from the

questionnaire surveys to generalise raters’ and candidates’ perceptions of the test tasks. I

illustrate the relationship between task design and oral performances via speech samples

collected during the speaking test.


Chapter Six

RESULTS: SPEAKING TEST TASKS

In the assessment of second languages, tasks are designed to measure learners’

productive language skills through performances which allow candidates to

demonstrate the kinds of language skills that may be required in a real world

context. (Wigglesworth, 2008, p. 111)

6.1 Introduction

Oral assessment involves making judgements on speaking performances within a specific time constraint. When assessing speaking skills, the rater guides and elicits test-takers’ talk through the prepared assessment task(s) he/she delivers to them (Luoma, 2004). Appropriate and sufficient elicitation of candidates’ speech samples is crucial for rating. Before making

decisions on how to rate candidates’ speaking abilities, testers need to establish what, and

how many, test tasks will be used for an assessment. Students’ task-based oral

performances reflect not only test takers’ language ability but also possible measurement

bias caused by the influence of task characteristics (Fulcher & Reiter, 2003). Test

tasks play an important role in that they shape what a candidate says and how a candidate

speaks in the oral test.

In this chapter, I concentrate on the characteristics of language tasks employed at

the three institutions under study, and the relationship between the test tasks and candidates’ actual performance. The purpose of this chapter is to answer Research Question 1c:

Were the characteristics of the test tasks appropriate and effective in eliciting sample

language for assessment? The data for my analysis derive from observational field notes,

speech samples recorded during tests, interviews, questionnaire surveys and the testing

materials provided after the exam by the Faculty of Foreign Languages of each institution. I

commence this chapter with a comparative analysis of key features in oral assessment

including response format, test purpose, time constraint, and channels of communication as

part of examining the context validity of spoken language testing administered at tertiary

level. Subsequently, I present test raters’ and candidates’ perceptions of the test tasks

employed to elicit oral language samples for assessment. The final section summarises the

main points presented in this chapter.

6.2 A comparative analysis of speaking test tasks across institutions

My study involved three tertiary institutions offering language skills courses for EFL

majors. Each institution designed its own EFL syllabus and adopted different methods of

assessing speaking skills. Although all the universities applied the form of direct face-to-

face speaking tests, their tasks had different features in terms of response format, test

purpose, channels of communication, and time constraints. The following sections will

present a comparative analysis of those characteristics based on the data I synthesised from

observational field notes and spoken English production samples recorded during the test,

questionnaire surveys, and interviews conducted after the test. By linking the designed test

tasks with candidates’ authentic oral performances and examining stakeholders’ opinions,

this chapter examines the appropriateness and effectiveness of varied test tasks in shaping

candidates’ responses to speaking task requirements.

6.2.1 Response formats

In oral testing for EFL majors, Universities A and B used two tasks, and University C used

a single task. Each task type adopted a different method to elicit speech samples from

candidates for assessment in different response formats: monologue, interview, and

discussion.

Table 6.1 synthesises the response formats used in the oral assessment at the

educational institutions involved in the study. These response formats can be grouped into two broad categories: ‘monologic’, not involving speaker interaction, and ‘dialogic’, involving verbal interaction between or among speakers (in a pair or in a group). Following these criteria, Task 1 of University A belongs to the monologic category, and the other tasks belong to the dialogic category: Task 2 of University A and Task 2 of University B allowed oral interaction between two candidates, while Task 1 of University B and Task 1 of University C took an interview format between the interlocutor/examiner (interviewer) and the candidate (interviewee). A distinctive difference between University B’s interview format and that of University C was that at University B two candidates were involved in each speaking session, while at University C it was a one-on-one interview.

Table 6.1 Response formats used in oral assessment across institutions

Institution: University A | University B | University C
No. of candidates per test session: Two | Two | One

Task 1
  University A – Individual long turn: Each candidate was given a picture to talk about. Pictures for each pair of candidates were of the same theme.
  University B – Interlocutor-candidate interview: Candidates did not know the questions in advance. The interlocutor asked each candidate two or three questions. Questions for each pair were of two themes.
  University C – Examiner-candidate interview: Candidates knew the questions in advance. The examiner asked each candidate three to five questions at random. Questions for each candidate were of various themes.

Task 2
  University A – Candidate-candidate interaction: Each pair was shown a set of two questions and asked to choose one to discuss. The two questions were of the same theme as Task 1.
  University B – Candidate-candidate interaction: Each pair was given a mind map to look at and use the prompts to discuss a related question. The theme of the mind map was different from the themes in Task 1. Candidates gave opinions, reasons, or examples to decide the final answer to the question.
  University C – (no second task)

Response format plays an important role in determining the quality as well as the

quantity of candidates’ oral production in a testing environment. A previous study

(O’Sullivan, Weir, & Saville, 2002) classified speech functions in oral tests into three main

categories: informational (e.g. describing, expressing opinions, providing personal

information, etc.), interactional (e.g. negotiating meaning, persuading, agreeing and

disagreeing, etc.), and managing the interaction (initiating, changing the topic, deciding,

etc.); these categories are useful for observing the presence or absence of particular functions in speaking task performance (Appendix F.4). Candidates engaged in monologic discourse also produce other informational functions such as expressing a preference, and justifying and speculating upon an opinion, a choice, or a life decision (Lazaraton & Frantz, 1997). In

the following sections, I present how varied response formats (examiner-candidate

interaction, candidate-candidate interaction, and monologic long turn) determined


candidates’ actual speech “with respect to a blueprint of language functions representing

the construct of spoken language ability” (O’Sullivan, Weir, & Saville, 2002, p. 33).

(1) Examiner-candidate interaction

Data from the speaking test samples show that candidates performed informational

functions more frequently in the examiner-candidate interview format than in peer

interactions. Excerpt 6.1 provides an extract from a speaking session of an examiner-

candidate interview format (University A’s Task 1). Although two candidates were sitting

together for this speaking session, they did not interact with each other in the first task.

They did so in the second task. The interlocutor read questions at random from the test

book for each candidate to answer. The interlocutor would ask prompting questions

‘What/How about you?’ if he/she would like a candidate to answer the same question as the

partner candidate had done, or ‘What do you think?’, ‘Do you agree?’ if he/she would like

to elicit a candidate’s opinions about the partner’s given response. By asking these

prompting questions from the framework provided (Appendix D.2), the interlocutor was

able to assess paired candidates independently (as one’s response did not interfere with the

other’s) and consistently (as paired candidates listened to and answered the same

questions). Further, both candidates were involved in the interview because they were

expected to give spontaneous responses to the questions addressed to either of them at any

time.


Excerpt 6.1 File U2.C9-10 0:15 – 3:07

1        IN: good morning, my name is ((introduces
2        herself)), and this is my colleague ((introduces
3        the assessor)). what’s your name?
4        C9: my name’s ((introduces herself)).
5        IN: and you are?
6        C10: I’m ((introduces herself)).
7        IN: ((repeats the two candidates’ names to confirm
8  --->  identity)) ok. first of all, we would like to ask
9        you some questions. ((addresses Candidate 9)),
10 --->  would you love to protect the environment?
11       C9: yes, I’d love…
12 --->  IN: in what ways would you do?
13 --->  C9: I don’t throw the trash away. and in… at my
14       family, we throw the… grow the plants in the
15       garden, and I save energy when I use it.
16 --->  IN: uhm huh, how about you, ((addresses C10))?
17 --->  C10: I think as a student, I want to protect my
18 --->  environment by er (.) I think the useful the most
19       useful way is to turn off the lights, or
20       electricity when we don’t use,
21       and to plant… to plant trees around our house.
22 --->  IN: ok. and ((addresses C10)), if a close friend
23       were considering taking off a year between high
24       school and college, what advice would you give your
25       friend?
26 --->  C10: I will tell him or her that taking a gap year
27       is er… good at all. it give er… it give us not only
28       experience to er (.) to take experience from
29       activity, from our real life, not just in er (.)
30       materials in school, but also we we through… during
31       the gap year, we also have a chance to er (.) to
32       know that what what our passion is. but er (.) I
33       think that (.) I will tell her that take a gap year
34       is good, but consider not to waste time (.) because
35       it is easy to waste time not doing at all.
36 --->  IN: ok, thank you. and ((addresses C9)), how
37       popular is the concept of taking a gap in Vietnam?
38 --->  C9: I think taking a gap in Vietnam is not popular
39 --->  because (.) they have er an exam (.) entrance
40       examination to university, and not (.) not much
41       people to take a year off between high school and
42       university.

[Speech functions noted in the third column, in order: asking for opinion; description; description; personal information; opinion; opinion; topic change; suggestion; elaboration; elaboration; opinion; elaboration; topic change; opinion; elaboration]


As can be seen from Excerpt 6.1, there were three speakers involved in this

speaking test scenario. The interlocutor employed elicitation questions for initiating verbal

interaction (line 8), asking opinions (lines 10, 16), asking clarification (line 12), or

changing topic (line 22). The interlocutor decided which question to ask each candidate.

For example, the interlocutor directed a two-part question (lines 10-12) to Candidate 9 in

the first round of asking questions. Then she asked Candidate 10 the same question (line

16) and continued with another question (line 22). She returned to Candidate 9 with a

different question (line 36). This way, the interlocutor was able to reduce the number of

questions being asked and get the pair of candidates involved in the conversation. Neither candidate could anticipate whether they would be asked the same or different

questions in the random sequence of the interlocutor’s available questions.

The candidates’ speech samples indicate a predominant use of informational

functions rather than interactional functions. For example, Candidates 9 and 10 performed

various informational functions in their speaking such as describing ways to protect the

environment (lines 13-15), providing personal information (line 17), expressing opinions

(lines 18-21, 38-42), giving advice (lines 26-35), and elaborating ideas (lines 39-42).

Although the oral sample represents a pattern of two-way communication (question and

answer), the interlocutor played a dominant role over the candidates. The interlocutor

controlled all of the interaction while the candidates did not have opportunities to interact

with each other or ask the interlocutor any questions. This finding was in line with previous

research on the interlocutor’s control of the conversation in interview-formatted tests

(Swain, 2001; Young & Milanovic, 1992); such control made the dialogue less natural than everyday conversation and may have caused candidates’ test anxiety.

I found a similar distribution of candidates’ speech functions in the one-on-one

interview format. Data analysis of the examiner-candidate interaction revealed that the

examiner asked for even more elaboration beyond the predetermined questions because no interlocutor outline was applied in task delivery. Excerpt 6.2 illustrates part of an interview task in which the examiner could add personal comments (lines 7, 10) and ask unplanned questions for elaboration (lines 4, 10) or meaning negotiation (line 16) with more flexibility than in Excerpt 6.1.


Excerpt 6.2 File U3.C24 2:20 – 3:2

1  --->  IN: so what kinds of intelligence do you think that you’re
2        strong at? why do you think so?
3        C24: uhm (2) maybe bodily kinesthetic.
4  --->  IN: oh, why do you think so?
5        C24: yeah, ‘cos I have played football when I’m just 7 years
6        old (.) till now
7  --->  IN: [ oh, yes.] I see.
8        C24: and I have won medal er (.) silver medal for my district
9        when I was at er (.) fifteen.
10 --->  IN: ^really? but ^now … do you follow any kind of sport?
11       C24: I’m just love to be play football, but (.) er er when at
12       (.) I’m sixteen, I had an accident when I played football. I
13       crashed and my bone cracked at my shoulder. I just now to play
14       to reduce stress or play with my friends to have fun, not to
15       do any tournament any more.
16 --->  IN: just for fun, not for medals, right?
17       C24: yeah, %(for some fun and relax after class hours)%
18       IN: I think that’s all. thank you so much!

In Excerpt 6.2, the interlocutor asked questions four times (lines 1, 4, 10, 16), of which only the first was a question prescribed in the test book (U3.Q38). The other questions (Why do you think so? Do you follow any kind of sport? Just for fun, not for medals, right?) probed for more specific and in-depth information related to the previous response. The interlocutor guided the conversation and decided where it went and when it stopped.

(2) Candidate-candidate interaction

A discussion task was introduced in the oral test at Universities A and B. As

presented in Table 6.1, the response format of this task required candidates to perform their speaking ability in peer-peer interaction. University C did not apply this format in its oral assessment. This response format enabled paired test takers to discuss a question of their own choice from the two provided (University A), or to decide the final answer to a question prompted with key words or phrases that candidates could use for discussion (University B). What the discussion tasks at the two institutions had in common was that the examiner (or interlocutor) did not intervene but let candidates manage their own communication. The examiner (or interlocutor) just played the role of delivering the task and signalling to the candidates the timing of their performance.

An analysis of recorded speech samples indicated that candidates performed a wider

range of speech functions across the three categories. As illustrated in Excerpt 6.3, being allowed to discuss freely enabled candidates to use interactional functions in addition to the informational functions found in the long turn and examiner-candidate formats.

Excerpt 6.3 File U1.C27-28 3:27 – 6:12

1  --->  C28: first I will show my opinion. only a person
2        who risks is free, I mean that if someone live in
3        er isolated area and and is surrounded by the
4  --->  nature like the sea, the ocean, or the mountain if
5        they risk themselves to get over the nature to
6        climb the mountain or er ship across the sea they
7  --->  will find the land of freedom. do you think so?
8  --->  C27: so do you agree with this statement?
9  --->  C28: er yes, ah no no
10 --->  it says only. I don’t think only, it depend in each
11       person
12       C27: uhm hum, er from my point of view of (.)
13 --->  I totally agree with you so just follow your
14 --->  argument I think that some people who are not risk
15       but they are maybe feel free but on the other side
16 --->  I think that if we’re willing to take risk so it
17 --->  bring me a lot of merit advantages maybe we can
18       escape some er tedious routine of life and we can
19 --->  er discover ability of ourselves, what about you?
20       anything else you want to say?
21       C28: in the bad situation we may that person may
22       die, and he never enjoy the freedom ((both laughed))
23       C27: ok
24 --->  C28: or he will go bankrupt get out of the market
25       and never get to again the the life
26 --->  C27: yes I also think that when you take risk you
27       have to run into a lot of problems or obstacles
28 --->  C28: so do you think free will give give us the
29       success?
30 --->  C27: ohm most of the cases I think that if you wi
31       willing to take risk so you can be success to get
32       better success than other people, I think so
33 --->  because er maybe er we can er as I mentioned
34       earlier, we can discover ability of yourself so you
35       can er a competence skills that you can learn
36       experience from …

[Speech functions noted in the third column, in order: initiating; elaboration; meaning check; asking opinion; conversation repair; disagreeing; agreeing; opinion; elaboration; opinion; giving example; asking opinion; agreeing; giving example; opinion; asking opinion; changing topic; opinion; persuading]

Excerpt 6.3 indicates that the pair of candidates was capable of co-constructing a

comprehensible discussion about the given topic (U1.Q4) despite some grammatical

inaccuracy in their oral production. There was evidence of informational functions in the speech sample, including expressing opinions (lines 14, 16, 26, 30), giving examples (lines 17, 28), and elaborating (lines 4, 15). In addition, the candidates showed their ability to interact verbally with each other using subskills like asking for opinions (lines 8, 19, 28), checking meaning (line 7), conversation repair (lines 9-11), persuading (line 33), agreeing and

disagreeing (lines 10, 13, 23). In spite of the examiner’s presence, this response format

enabled the test takers to manage the conversation in their own way. They could initiate the

discussion after reading the topic questions (line 1), and change to another related question

(line 28).

Recorded data analysis indicated that a broader range of speech functions were

present in candidates’ performance of another type of discussion task, in which candidates

in each pair were provided with a mind map of textual cues to discuss a given question

(University B). It seemed to be beneficial for candidates when they could incorporate the

words or phrases provided into their speech and add their own related opinions or

examples. For example, there was a high frequency of giving opinions (informational

function) because candidates were exposed to six cues for each question they were required

to discuss. Evidence from the transcripts shows that peer-peer interactional opportunities

were equally distributed to each participant. However, the quality and quantity of each

candidate’s contribution to the co-constructed performance varied depending on each

candidate’s level of English. Excerpt 6.4 illustrates an example of a candidate-candidate

discussion task using the prompts provided (U2.Q67): ‘Advantages and disadvantages of

canned food and fast food’. Looking at the third column, we can see that candidates’

performances included all three categories of speech functions: informational functions

(providing information, giving opinions, comparing, elaborating), interactional functions


(agreeing and disagreeing, asking for opinions, modifying, correcting utterance), and

interaction managing functions (initiating the discussion, changing).

Excerpt 6.4 File U2.C9-10 4:43 – 7:28

1  --->  IN: ok. you’ll discuss with each other in three
2        minutes. after that, you should decide which
3        disadvantage is the most serious. alright? talk
4        together.
5  --->  C9: I think disadvantage the most serious is er
6        (.) the long-time…
7  --->  IN: [oh, no]. talk together first.
8  --->  C9: disadvantage of canned food and fast food is
9        less nutrition… less nutritious because er…(.) if
10       you fry er…(.) a product like fried chicken, it’s
11       er (.) it burn a lot of nutrition like er (.) vi…
12       vitamin or something like this. and er …(.) it’s
13 --->  high in sugar and fat. it make the children or
14       make the people eat canned food and fast food
15       become fatter, and ...((3 lines of transcript))
19 --->  C10: I agree with you that… that canned food and
20       fast food is high in sugar and less nutrition, and
21       it contains harmful section. but er…(.) I also
22 --->  think that in the modern time like this, in a
23       modern world, when people don’t have enough time,
24       it is er… (.)nutrition… it is canned food and fast
25       food is a good choice because it is time-saving.
26       and it is also long-time preserved. Because it is
27       time-saving…(.) it means that the people who use
28       the fast food is perhaps to spend a lot of time to
29       prepare or to cooking, but er… can eat it easily.
30       and long-time preserved, it also convenient for
31       people who don’t have time to go to the market or
32       (.) supermarket to buy their fresh food.
33 --->  C9: and (.) I think… disadvantage of canned food
34       or fast food is portable to…(.) to (.) to people
35       to use it. they can bring it to the company, or
36       the office to use, and because they have less time
37       less time to… to enjoy their meal, and… …(.)
38 --->  IN: so which disadvantage is the most serious?
39 --->  C9: I think disadvantage er…(.) the most serious
40       is high in sugar and fat because... I said that
41       it… (.)it contain a lots of chemical can cause a
42       problem for teeth, for their health, like can er…
43       change the DNA, or can cause the cancer.
44 --->  IN: uhm… ((signals C10 to speak)).
45 --->  C10: I think the same . . .

[Speech functions noted in the third column, in order: opinion; initiating; comparing; elaboration; providing information; agreeing; comparing; opinion; modifying; elaboration; elaboration; opinion; changing; elaboration; opinion; explaining; opinion]

As illustrated in Excerpt 6.4, the interlocutor intervened in the paired discussion only where necessary, for example, when delivering the second task and giving the candidates 30 seconds to prepare for their discussion. The interlocutor signalled

the time to start, repeated the task requirement, and made sure the candidates were ready

(lines 1-4). However, Candidate 9, hurrying into an immediate answer beginning with ‘I think disadvantage the most serious is…’, did not meet the task requirement to ‘discuss with each other in three minutes’ before deciding which disadvantage is the most serious. The

interlocutor’s reminder ‘talk together first’ (line 7) was a necessary intervention to help

keep the candidate on track with the task. The candidate could repair that inappropriate

beginning by addressing one of the cues provided without having to hurry to the final

decision (lines 8-18). Again, seeing that the time allotted for the discussion was coming to

an end, but neither candidate had decided on the final answer, the interlocutor interrupted Candidate 9’s turn and asked her the question again (line 38). Candidate 10 was also given an opportunity to answer that question (line 44). Situations like this occurred repeatedly in many other excerpts because candidates were caught up in the flow of discussion or did not keep to the timing of the task. It was not always clear when a decision on the final answer should be made, whether the candidates had to come to a mutual agreement, or whether they could arrive at different final answers that they both thought reasonable. This vagueness needs to be addressed in future examinations.

Candidates tended to give more opinions when they were provided with more

textual cues for a discussion task. They took turns in giving opinions and elaborating their

own opinions from the prompts without reflecting on their partner’s opinions, e.g. agreeing

or disagreeing with the opinions, inviting the other to give opinions, checking meaning or

asking for clarification to show their involvement in the discussion. Within the entire

candidate-candidate interaction scenario (Excerpt 6.4), the two speakers gave personal

opinions at least five times in total (Candidate 9: lines 5, 33, 39; Candidate 10: lines 22, 45). There was only one instance of agreement (line 19) and none of disagreement. There was no evidence of asking for clarification or checking meaning to show interest or engagement in the conversation, and neither candidate invited their partner to give opinions on the topic. I will discuss this aspect further in the following section regarding the relationship between the

task purposes and the candidates’ actual performances.

I have examined the different response formats of the speaking tasks used at the

three institutions involved in this study to learn about the speech functions produced in

candidates’ oral performance samples. The analytical results indicated that the individual long turn and question-answer formats mostly elicited informational functions, while the discussion format provided candidates with more opportunities to perform all three functional categories. This finding supports O’Sullivan et al.’s (2002) investigation of the

distribution of speech functions across different response formats. Further, the combination

of both spoken production and spoken interaction (University A) aligned well with the

CEFR’s advocacy towards authentic oral assessment practices in the real world. An

effective English speaker needs both skills to interact with other people and express

himself/herself clearly. For example, interactional capacity for Level B2 (CEFR) includes

the ability not only to “interact with a degree of fluency and spontaneity that makes regular

interaction with native speakers quite possible”, but also to “present clear, detailed

descriptions on a wide range of subjects related to the field of interest” (Appendix G.2).

As the CEFR has been adopted as a benchmark for measuring foreign language competence in Vietnam, designers of paired oral assessment tasks should consider the influential role of selecting appropriate response formats in eliciting samples of both test takers’ productive and interactive English skills.

(3) Long turn (no interaction)

My examination of monologic discourse produced in picture-based long turns

highlighted candidates’ lack of opportunities to perform interactional functions and, instead, their frequent use of informational functions to complete this kind of task. The speech sample in Excerpt 6.5 is an example extracted from a speaking performance of Task 1 administered at University A. The excerpt illustrates a candidate’s oral performance in response to a picture prompt (U1.Q29) on the theme of ‘gathering’ (representatives from different countries gathering for an international conference), which she would go on to discuss with her partner candidate in Task 2.

Excerpt 6.5 File U1.C13-14 1:44 - 2:34

1        IN: can you start now?
2  --->  C13: er when I look at this picture I see that        (description)
3        there are many representatives of many countries
4  --->  and they come from many continents. there’s           (description)
5  --->  europe, asia, so it should be some important          (elaboration)
6  --->  meeting and (.) I see that their faces are pretty     (description)
7        tense er so maybe they’re discussing something
8  --->  that is (.) maybe they have meet some conflicts,      (inference)
9        (.) mm maybe they must have been discussed on
10 --->  something some important matters.                     (inference)

As shown in the third column of Excerpt 6.5, the candidate performed informational

functions ranging from describing what she could see in the photograph (lines 2, 4, 6) and elaborating on a mentioned idea (line 5) to making inferences about what was seen (lines 8, 10). The ability to make inferences and/or express the message conveyed by the picture was part of the oral examiner’s expectations as specified in the Speaking assessment guidelines (Appendix D.1). A rater from this institution acknowledged that candidates might perceive a picture in different ways. The ability to interpret picture-based messages

served to differentiate candidates’ levels:

In this section [picture-based long turn], the examiner does not need to say

anything. The candidate just looks at a picture and speaks out the message of the

picture. What matters is people’s views are different. What one person can see is at

times totally different from what others can though they look at the same thing. So

this feature is to test candidates’ knowledge. There were candidates who could

interpret some messages, while there were many others who could not tell the

message, which helped to differentiate examinees’ levels. (U1FG.T1)

The assessment guidelines (Appendix D.1), however, did not clarify how to

evaluate the correctness of an interpretation of the picture message. The following quote is extracted from a focus group interview after the test, in which a rater confirmed that, in scoring, the rater took into account the thematic association between the first task (picture-based long turn) and the second task (candidate-candidate discussion), as well as the candidates’ ability to interpret a clear message from the picture:

… it (interpreting the picture message) is associated with the discussion questions

that follow. If the candidate can recognise the topic of the picture, then they will be

able to go through the discussion question smoothly. But for some students who

cannot recognise it, that may affect their scores. (U1FG.T3)

There are several examples of diversity in interpreting the message intended for

each picture. Excerpt 6.6 gives an example of a monologue on the picture prompt (U1.Q30)

which shared the same topic of ‘Gathering’ as the prompt in Excerpt 6.5 (U1.Q29) but conveyed a

different message.

Excerpt 6.6 File U1.C13-14 0:15 - 3:07

1        C14: this picture got me thinking about (.)
2  --->  pressure, (great) pressure like especially people's   (inference)
3        stress of test exam=examination and (.) nowadays
4  --->  more and more students get stressed out of exams      (description)
5  --->  because (.) maybe the the (.) range of not (4) the    (inference)
6  --->  amount of knowledge they have to obtain every day
7        and for the test and uh (.) the way they study,
8        they always um waiting for waiting till the de-
9  --->  deadline to prepare to study for %(the)% test.        (description)
10 --->  that's the reason why students always get stressed    (summarising)
11       out before (.) take an exam.

As illustrated in Excerpt 6.6, Candidate 14 was able to produce quite smooth utterances to perform some informational functions of describing (lines 4, 9), making inferences (lines 2, 5), and summarising her presented ideas (line 10). However, the main theme she recognised was ‘pressure’, whereas the test designer’s intended topic was ‘gathering’ or ‘teamwork’, to match the topic of the follow-up discussion questions (U1.Q31: What aspects of teamwork do you find the most challenging? and U1.Q32: ‘None of us is as smart as all of us’. What do you think?). This case raises the question of whether the candidate should be considered to have failed to recognise the theme of the picture, or whether the picture selection failed to elicit the intended response. The rating scale did not offer any criteria for assessing the relevance of a picture interpretation in response to a picture-based task. These concerns were echoed in an EFL expert’s words: “How can teachers assess a good description (of a picture) which is not related to the next discussion?” (E6).

6.2.2 Task purposes

Written tests usually allow candidates more time to think about task requirements before

the task performance. Oral examinees have very limited time to think once the test task has

been delivered. Therefore, a quick and unambiguous understanding of a task’s purpose is

crucial for the test taker to complete the task as expected. This condition depends on how

task designers communicate with task performers via the task wording. Galaczi and ffrench

(2011, p. 124) argued that “a clear, precise purpose will facilitate goal setting and

monitoring – two key cognitive strategies in language processing – all will potentially

enhance performance”. Non-native speakers of a language might face communicative stress

where the interlocutor is more communicatively competent, more knowledgeable about the

subject-matter, and where the communicative task is not clearly structured “but focuses on

reasons for actions rather than actions themselves” (Nunan, 1989, p. 111). Written task

instructions need to make sure “the level of input language is well within the lexical and

structural range of the perspective level of the candidate, so that even weak candidates can

understand the requirements of the task and attempt it” (Galaczi & ffrench, 2011, p. 125).

Each type of speaking task has its own purpose and requirements. Task writers

decide what task purpose(s) to apply in an oral test, i.e. what candidates are required to do

with language, and how to write instructions so that test takers can understand them clearly and

exactly. Consequently, the test can measure what it is intended to measure. Table 6.2

describes the underlying purposes of oral assessment tasks employed across the three

institutions. The description relies on my analysis of oral testing documents, test room

observations, speech samples, and interviews with examiners and examinees.

Table 6.2 Description of underlying purposes of oral test tasks

University A (paired format)
Task 1 (Picture-cued):
- to engage in a monologic long turn (1 to 2 minutes)
- to use the target language to express what the candidate saw in and/or perceived from a picture
- to talk about a human activity, a natural phenomenon, a place, a person, objects, etc. depicted in the picture.
Task 2 (Discussion):
- to interact with the other candidate to discuss a question or an issue associated with the topic in Task 1
- to exchange opinions and perspectives about a saying or a problem mentioned in Task 1
- to demonstrate understanding and knowledge about an area related to the topic in Task 1.

University B (paired format)
Task 1 (Interview):
- to respond to questions the interlocutor selected randomly from the question list and delivered verbally
- to give opinions and viewpoints about daily life, interests, likes and dislikes, etc.
- to demonstrate social understanding about the modern world.
Task 2 (Discussion):
- to interact with the other candidate using everyday language
- to work together towards a negotiated agreement at the end of the task
- to express and justify opinions by providing examples and reasons for or against the prompts provided.

University C (one-on-one format)
Task (Interview):
- to respond to questions the interlocutor selected randomly from the question list, and delivered verbally
- to talk about views, opinions, personal experience, etc.
- to demonstrate knowledge and/or understanding about particular terms or topics covered in the course book
- to present theoretical understanding about lecture language covered in the course book.

Test takers at Universities A and B were required to perform two tasks, and those at

University C performed only one single task (Table 6.2). A comparison of speaking task

characteristics shows that there were notable differences in these language tasks used in

oral production tests across the institutions.

As indicated in Table 6.2, the speaking tasks varied in terms of interactional

relationships between participants in each test session, e.g. long turn (no interaction),

interview (interaction between a candidate with the interlocutor or examiner), and

discussion (interaction between two candidates).

These tasks had language output requirements varying from monologue

performance (picture-cued description), responsive speaking (question-and-answer), to

interactive speaking (interview and discussion). Depending on the type of task, candidates

were paired or took the test individually. The adoption of the paired format not only aimed

at eliciting “more varied patterns of interaction during the examination” (Saville &

Hargreaves, 1999), but also at “making efficient use of testing and scoring time in live

tests”. The second reason was very practical for Universities A and B since they had large

classes (35 to 40 students per class).

Table 6.3 Summary of characteristics of the test tasks across institutions

Characteristics | Uni. A, Task 1 | Uni. A, Task 2 | Uni. B, Task 1 | Uni. B, Task 2 | Uni. C, Task
Task type | Picture-cued | Discussion | Interview | Discussion | Interview
Planning time | 30 seconds | 30-50 seconds | none | 30 seconds | none
Response time | 1-2 minutes | 4-5 minutes | 1-1.5 minutes | 3 minutes | 3.5-5 minutes
Format | individual | paired | individual | paired | individual
Task orientation | guided to open | open | open | guided | open
Interactional relationship | none | two-way | one-way | two-way | one-way
Goal orientation | convergent | divergent | divergent | guided | guided
Interlocutor status | high | same | high | same | high
Familiarity with the interlocutor | low | high | quite low | high | high
Topics | social and academic life | social and academic life | daily life | daily life | business, modern life
Situations | variable | variable | variable | variable | variable
Task input | visual (pictures) | written (questions to be read by the candidates) | aural (questions to be asked by the examiner/interlocutor) | graphical (mind maps with textual prompts) | aural (questions to be asked by the examiner/interlocutor)

The orientation of each test task was either ‘convergent’ towards a common goal of

communication, or ‘guided’ by prompts or clues from which the test-taker chose items to

talk about. The language tasks covered a wide range of topics, such as social and academic

life, work, nature, the business world, etc., depending on the topics covered in the course

book used at each institution. On average, each candidate had between 3.5 and 5 minutes to

demonstrate his/her oral language competence in the test room. The time allotted for the

preparation and performance of each task was calculated on the basis of sound recordings

and observational protocols. When the task required long stretches of performance, the

candidate was given preparation time (from 30 seconds to 1 minute); otherwise, no

preparation time was allowed for short utterances (Task 1, University B) or responses to

known questions (University C’s task). Table 6.3 summarises characteristics of the

speaking tasks applied in oral assessment across the three institutions.

Each task type was designed with its own purpose, delivered via the diversity of

words and phrases expressing task requirements, or “rubric words” (Galaczi & ffrench,

2011, p. 125). The wording utilised in the oral test at Universities A and B was much more

diverse than that at University C. Some rubric words required the candidate to demonstrate

a certain kind of oral performance for rating (e.g. talk, discuss, give opinions, decide, etc.),

whereas other words required some non-verbal action as part of the task completion, but

not for rating (e.g. look, think, choose, wait, etc.). Test takers needed to listen or read

instructions carefully to recognise the main rubric words indicating the task purposes. Table

6.4 presents rubric words and phrases indicating task purposes that the

examiners/interlocutor used to deliver tasks in the oral examinations administered at three

universities. I drew these rubric words and phrases from the official testing materials for

examiners (Appendices B.4a-c, D.2) and recordings of test samples.

Table 6.4 Rubric words and phrases indicating task purposes

University A (paired)
Task 1 (Individual long turn): … look at the photo … think about it … talk about it … wait for your turn
Task 2 (Candidate-candidate interaction): … choose (one of the topics) … discuss the question … … talk …

University B (paired)
Task 1 (Interlocutor-candidate question and answer, using the questions prepared in advance): … We’d like to ask you some questions …
Task 2 (Candidate-candidate interaction): … look at the mind map … talk together about … … give opinions, reasons, or examples … discuss with each other … decide which one …

University C (one-on-one)
Task (Examiner-candidate question and answer): … I’d like to ask (you about) … we’ll talk about … what do you know about … can you tell me
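The distinction drawn above between rubric words that elicit performance to be rated and those that call only for non-verbal actions could, in principle, be checked automatically against task instructions. The following Python sketch is a minimal illustration only: the two word lists are taken from the examples cited in this section, and the sample instruction is a hypothetical paraphrase of University B's Task 2 rubric rather than the wording of any institution's actual test book.

```python
import re

# Rubric verbs named in this section: the first set signals oral performance to
# be rated, the second set signals non-verbal actions that are not rated.
RATED = {"talk", "discuss", "give opinions", "give reasons", "give examples", "decide"}
NOT_RATED = {"look", "think", "choose", "wait"}

def classify_rubric_words(instruction):
    """Return the rated and non-rated rubric words found in an instruction string."""
    text = instruction.lower()
    def found(words):
        return sorted(w for w in words if re.search(r"\b" + re.escape(w) + r"\b", text))
    return {"rated": found(RATED), "not_rated": found(NOT_RATED)}

# Hypothetical instruction echoing University B's Task 2 rubric words.
example = ("Look at the mind map, talk together about the question, give opinions, "
           "reasons, or examples, and decide which one is the most important.")
print(classify_rubric_words(example))
# -> {'rated': ['decide', 'give opinions', 'talk'], 'not_rated': ['look']}
```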

The rubric words and phrases did not cause difficulty to second-year English majors

because they were well within the scope of their expected linguistic ability, i.e. High-

intermediate, Level B2 of CEFR, or equivalent (Appendix G.3). However, as pointed out in

several studies about the influence of task wording (Dudley-Evans, 1988; Moore &

Morton, 1999), candidates may interpret some words differently, which can result in

different levels of speaking performance. These authors illustrated with an example of the

word “discuss”, which can lead to various interpretations in academic writing tasks if there is
no further specification. This issue was echoed in the tasks of the oral examinations that
this study explored. The word “discuss” was used in the candidate-candidate

interaction task at both Universities A and B. More frequently, the word “talk” was used in

almost all of the task types listed in Table 6.4, except for Task 1 of University B. In the

following sections, I analyse further the extent to which candidates’ oral performance

aligned with the task purposes. I have organised the speaking tasks (Table 6.4) into three

categories according to the main purpose of each task type: (1) “Talk” in the individual

long turn task, (2) “Talk” and “discuss” in the candidate-candidate interaction task, and (3)

“Talk” and “answer the questions” in the examiner-candidate interaction task.

(1) “Talk” in the individual long turn task (University A’s Task 1)

This task requires each individual candidate to perform a long turn monologue within 1 to 2

minutes. As described in Table 6.1, the examiner showed each candidate a picture to talk

about after having one-minute planning time. Neither the examiners nor the candidate

partner intervened in a candidate’s talk since no interaction was required in the first task.

Excerpt 6.7 is an example of a candidate’s speaking sample in response to a picture-cued

question in the long turn task (U1.Q2).

Excerpt 6.7 File U1.C1-2 2:07 - 3:24

1        C2: my name is ((Candidate 2 introduces herself)) %(ja)% ahh
2        while looking at this picture, I see ahh for the whole
3  --->  background it is ( ) snow (.) ahh maybe this is a snowy
4        mountain and the man er wears the er (light board) yeah, er
5        he’s playing an adventurous sport it’s like climbing the snowy
6  --->  mountain in a very erm dangerous condition. erm er >I can see<
7  --->  er he’s really brave and er enthusiastic while looking at the
8        maybe the camera scope to er and er this is an er adventurous
9  --->  sport that er most of the er the youth now are like (.) quite
10       exite (.) quite excited on those kinds of sport.

Excerpt 6.7 illustrates a typical monologic performance of a picture-cued task in

which the candidate used spoken English in descriptive discourse to communicate what she

could see, and what she thought of it. Despite a limit in fluency with repeated hesitations,

the candidate could make the listener understand what there was in the picture. She could

mention at least seven details depicted in the picture: the snowy background, a snowy

mountain, a man wearing something, playing a dangerous sport, climbing the mountain, a

very dangerous condition, looking at a camera scope (lines 3-9). The candidate included her

reflections on what she saw, e.g. the man is playing sport in such dangerous conditions

(line 6), he is brave and enthusiastic (lines 7-8). The candidate closed her talk with an

expression of her social knowledge about young people’s interest in this kind of sport (line

9).

Many other monologic presentations did not exploit much of what was featured in

the pictures. Candidates used the picture prompts provided as inspiration to talk about
their deductions from, or reflections on, what they saw in the picture, rather than its actual
content. Excerpt 6.8 illustrates another example of a long turn performance generated by
deductive reasoning rather than by what the artistic picture displayed through its images and
colours (U1.Q18).

Excerpt 6.8 File U1.C3-4 1:58 - 2:45

1  --->  C4: the picture describe the (.) the development of er (.)
2        er a an kind of insect from the early stage of (.) a
3        cocoon to the lighter one like a butterfly, and (.) this
4  --->  is like a (.) an evolution this provokes a thought (.)
5  --->  about human being, some people they (.) they choose to
6        develop > into something new, something greater, something
7  --->  better < but so some some choose to to be any cocoon to
8        have a safe (.) safe life, they don’t (.) they don’t well
9        they are not willing to take risk to find to discover new
10       things. (.) that’s all.

In Excerpt 6.8, the candidate used some key words denoting items he focused on in

the picture – ‘a kind of insect’, ‘a cocoon’, and ‘a butterfly’ – to describe what he saw (lines

1-3). He set a link from the theme ‘change’ conveyed in the photo to a social phenomenon

in human life (lines 4-5): some people are keen to make changes in their lives (lines 5-7),

while others are content with their current situations as they think trying new things is risky

(lines 7-10). The sample shows some signs of hesitancy (pauses and self-correction), but

demonstrates evidence of high-level vocabulary such as ‘cocoon’, ‘evolution’, ‘provoke’,

‘thought’ (noun), etc. Candidate 4 in Excerpt 6.8 did not, or was not able to, describe the

assigned picture in the way that Candidate 2 in Excerpt 6.7 did because it would require

more “cognitive complexity” (Luoma, 2004, p. 167) to talk about the six stages of the

gradual change from a cocoon to a butterfly. Candidate 4 must have understood that it was

a mini-talk for an EFL speaking exam, and examiners would not expect it to be a detailed

presentation like in a biology lesson. There was more doubt about the appropriateness in

the selection of picture than the candidate’s ability to interpret the visual prompt to generate

oral production. This case raises a question regarding consistency and authenticity in

choosing pictures for picture-based tasks. Luoma (2004, p. 167) suggested an adequate trial

of pictures prior to using them in order not to “intimidate the examinees by their visual

complexity”. Candidates might not see a picture in the way the test designer and/or the

examiner anticipated they would.

Other pictures were not effective in prompting clues from which the candidate could generate
ideas, but instead required the candidate to have specific background knowledge to be able to

talk about them. Excerpt 6.9 provides an example of a monologic performance in which

background knowledge was essential for the candidate to complete the task. The
picture (U1.Q13) selected for the candidate to discuss turned out to elicit what the

candidate knew about the person in it.

Excerpt 6.9 File U1.C11-12 0:59 - 2:16

1        C11: uhmm the picture show a man which is bill gates, one
2  --->  of the most successful, if not say that the most
3  --->  successful man in the world, and he's a (.) the chairman
4        of the microsoft windows companies and (.) from (.) ah as
5        my concern, he is um, before that he is, um (.) um s-
6  --->  student (.) student at harvard university, but they (.) he
7        dropped it, um, the school and started his business in
8        such early year, young age. um, till now he's um, still
9  --->  stay (.) has stayed the most successful man uh, in the
10       world and the (.) one of the most earlier bill (.)
11       billionaire in the world and beside his um businessman (.)
12 --->  um business field um in daily life, I think that his uh
13       very like, sympathetic man. he is um has a charity
14       organisation with his wife.

In Excerpt 6.9, the candidate demonstrated good knowledge about the character

named ‘Bill Gates’ in the given picture. She could talk about the man’s reputation (lines 2-

3), work (lines 3-4), schooling (lines 6-7), achievement (lines 9-11), family (line 14), and

give some personal thoughts about his characteristics (lines 12-14). None of these pieces of
information was generated from what the candidate saw in the picture, except for the
name ‘Microsoft Windows’ appearing in the background behind the character. Again, the

actual picture was too dull to function as a stimulus for the descriptions associated with it.

Unintentionally, this question was testing candidates’ knowledge rather than EFL speaking

skills. This candidate’s partner gave a similar knowledge-based performance (Appendix

F.3a) when looking at the picture of a character who she recognised was ‘Steve Jobs’

holding an Apple iPhone in his hand. It would have been unfair for candidates who did not

have knowledge of these characters but were required to talk about them because the examiner
randomly selected the pictures available in the test book. Presenting knowledge about

‘technology legacies’ (the title of an integrated Listening-Speaking lesson from the course

book) was not the construct this task claimed to measure.

My analysis of speech samples showed that candidates’ performance often
failed to match the task purpose. The language input was at times too ambiguous to be interpreted in
only one way. I learnt that the responsibility for ensuring clarity of task purpose lies with
the task writers, whose decisive role is to choose test material (e.g. pictures, content) and
give instructions (e.g. rubric words) appropriately. These examples support Shaw and

Weir’s (2007) argument that candidates’ clear and precise understanding of the

performance task purpose will facilitate the rating process and enhance reliability in

scoring. The test book for oral examiners should include additional information and

clarification of the intended purpose(s) underlying each test task as rubric words and

phrases may be interpreted differently by the rater and candidates.

(2) “Talk” and “discuss” in the candidate-candidate interaction task (Universities A

and B’s Task 2)

Two of the three institutions involved in the study included a paired format as part of their

oral assessment. Table 6.1 indicates that paired candidates did not interact with each other

throughout their speaking sessions, and the candidate-candidate response format was

applied only in the second task. Both the institutions provided similar instructions for this

discussion task: students had to look at the task, read its content, prepare, and talk. At

University A, each pair of candidates had to choose one of the two questions that
the examiner showed them. Each pair had between 30 seconds and 1 minute to prepare for a

3-to-4-minute discussion. At University B, each pair of candidates looked at a mind map

with six textual prompts associated with the same theme to answer a discussion question.

The interlocutor gave each pair 30 seconds to prepare for the question, and 3 minutes to

discuss the topic question with the suggested word cues. The examiner or interlocutor did

not participate in the candidate-candidate discussion, but let the candidates manage and

decide their own interaction.

A turn-by-turn analysis of the speech sample transcript revealed that there was

evidence of more natural conversation when the test takers were prompted by a discussion

question than a mind map. Examining how candidates ‘talked’ and/or ‘discussed’ in these

interactions, I noticed “aspects of collaboration and negotiation in the construction of the

emerging topic” (Johnson & Taylor, 1998, p. 34). The previous speaker laid the ground for

or yielded to the next speaker’s turn to talk either by asking for an opinion, checking

understanding, modifying, agreeing or disagreeing, etc. The outcomes of the discussion

were unpredictable because it was not guided by any prompts or the examiner’s

intervention. For example, at the beginning of Excerpt 6.3, Candidate 28 initiated the

discussion by showing what he understood about the statement ‘Only a person who risks is

free’ (lines 1-7) and then checking the meaning with Candidate 27 (line 7). However,

playing an equal role, Candidate 27 did not give an immediate answer to the question but

asked Candidate 28 for his opinion (line 8). Candidate 28 gave a hesitant agreement ‘er yes’

but made an immediate clarification with the detail ‘only’ (line 9). Candidate 28’s

conversational repair (lines 9-10) was an invitation for Candidate 27 to express her total

agreement (line 14) and add another aspect associated with the topic being discussed: risk

taking may bring a possible escape from the tedious routine of life (line 17). At the end of

her turn, Candidate 27 did not forget to ask her partner for an opinion (lines 19-20).

Excerpt 6.10 illustrates another example of Candidates 3 and 4’s successful task-

based interaction in about 3 minutes (University A). In this scenario, the pair was

discussing whether people should make changes even when they have no problems with

their current achievements (U1.Q19). After the planning time for both, Candidate 3

initiated the discussion.

Excerpt 6.10 File U1.C3-4 3:39 - 7:00

1  --->  C3: uhm about the question here, in my opinion when we are in ( )
2        we are satisfied with our current achievements, I think we
3        shouldn't make any changes er because that if we are good with what
4        we are happy so er maybe changes can take you into another risk and
5        lose everything, and also that can bring you down when you r have
6        your achievement
7  --->  C4: I understand your point, its I think it is similar to the
8        former picture you talked about.
9        C3: %yes%
10 --->  C4: because you are in the current path and and you are afraid of
11       the death if you make changes, but I think I think er inferred from
12       my picture you have to be a butterfly, you have to make changes
13       even if there is no problems (.) %because%
14 --->  C3: [yeah] I see your point here but
15       er if we have the current achievement it means we are succeeding in
16       something, so that we are good we don't have any death end or
17       something like that, that we have to change so that I think that if
18       we just kept with the current achievements here then we can be good
19       and without any changes.
20 --->  C4: you see many factories they make changes every day because they
21       want to catch up with the modern life. they don’t want to be left
22       behind (.) because people make change every day, they have rival
23       companies, like the rival factories they have to make change to
24       better themselves to to become something newer, to to to the the to
25       the conSUMers.
26 --->  C3: but also but that also means that we have to take a risk (and)
27       we have to face a lot of (.) different situations like it would be
28       better and no one will argue about that, but if it is worse than,
29       it will cause a big loss for the company or for that one so that
30       (.) in my opinion maintain your your current achievement is the
31       best.
32 --->  C4: I understand but (.) making change doesn’t mean that you are
33       careless, you are reckless to get into the change, but you have to
34       calculate the (catalyst) before you making change
35 --->  C3: but we can’t we can’t make change for better if we can't
36       calculate it carefully, but for some some people they don't they
37       could just have some achievement and they don't want to like they
38       don't want to risk themselves for something like that, and so that
39       it is not a good way or it is good if they have achievement.
40 --->  C4: and I believe that depends on individual’s perspectives and (.)
41       lifestyle, but to me I I want to make change every day to better
42       myself even if I am in the ( ) current situation.

Both the candidates in Excerpt 6.10 presented their viewpoints about the question,

and elaborated on their reasoning with supporting ideas and examples. Although the

candidates did not arrive at a mutual agreement on the issue, each expressed convincing

arguments supporting their opinions. For example, Candidate 3 initiated the conversation

by expressing her opinion of not supporting making changes when one is satisfied with

one’s current achievement for the reasons of safety (lines 1-6), and Candidate 4 expressed

his understanding of her opinion when relating to the former picture that Candidate 3 had

talked about (lines 7-8).

The paired discussion indicated a joint performance in which there was interactive

relationship between the speakers’ turns. Candidate 3’s low-tone backchannel ‘yes’ (line 9)

was understood by Candidate 4 as a continuer signaling comprehension rather than an

interruption to talk. Therefore, Candidate 4 continued to elaborate his point that a natural

change could be made ‘even if there is no problem’ (lines 10-13). As Candidate 4 seemed

to finish his ideas with a short pause and a quiet indicator of discontinued elaboration (line

13), Candidate 3 expressed her empathy with her partner’s opinion but implied a slight

disagreement by saying she thought the current achievements are still good without

unnecessary changes (lines 15-19). Candidate 4’s turn came when he took factories’

continual changes as an example to ‘catch up with the modern life, better themselves and

become something newer to consumers’ (lines 20-25). Based on Candidate 4’s reasoning,

Candidate 3 mentioned that changes might not always be good but potentially risky for

companies (lines 26-31). Candidate 4’s response to this concern was that making changes

does not mean recklessness. A successful change needs careful calculation (lines 32-34).

Candidate 3 reasoned that people do not want to risk themselves if they already have

achievements that they are content with. She did not support making changes unless they
are made with care (lines 35-39). The discussion closed with Candidate 4’s expression of

maintaining his belief that whether to make changes ‘depends on individual perspectives’

(lines 40-42). Despite his willingness to make change every day, Candidate 4 respected his

partner’s preference for staying safe if the current situation is good.

The two interactants in Excerpt 6.10 proved their skillful use of expressions to

soften their disagreement in the discussion by saying ‘I understand but…’, ‘I see your point

here but…’ before a counter argument. Their discussion shows a fairly good balance of

each candidate’s contribution to “the co-construction of discourse” (May, 2010, p. 1). Each

candidate took five turns, with an average turn length of 52 words for Candidate 3

and 40 words for Candidate 4. In general, both candidates demonstrated an attempt to

ensure the topical coherence throughout the flow of their discussion.
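The turn counts and average turn lengths reported for Excerpt 6.10 can be derived mechanically once a transcript has been segmented into turns. The following Python sketch is a minimal illustration, assuming the transcript is stored as a list of (speaker, utterance) pairs; the two sample turns are abbreviated stand-ins for the full excerpt, so the printed figures are illustrative only and will not reproduce the 52- and 40-word averages reported above.

```python
from collections import defaultdict

def turn_statistics(turns):
    """Count turns and average words per turn for each speaker.

    `turns` is a list of (speaker, utterance) pairs, one pair per turn.
    """
    turn_counts = defaultdict(int)
    word_counts = defaultdict(int)
    for speaker, utterance in turns:
        turn_counts[speaker] += 1
        word_counts[speaker] += len(utterance.split())
    return {speaker: {"turns": turn_counts[speaker],
                      "avg_words_per_turn": round(word_counts[speaker] / turn_counts[speaker], 1)}
            for speaker in turn_counts}

# Abbreviated stand-ins for two turns from Excerpt 6.10 (illustrative only).
transcript = [
    ("C3", "uhm about the question here, in my opinion we shouldn't make any changes"),
    ("C4", "I understand your point, but you have to be a butterfly, you have to make changes"),
]
print(turn_statistics(transcript))
# -> {'C3': {'turns': 1, 'avg_words_per_turn': 13.0}, 'C4': {'turns': 1, 'avg_words_per_turn': 16.0}}
```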

University B’s interactional task adopted another form of delivery procedure in that

the candidate-candidate interaction was prompted by the clues (words, phrases, or

sentences) provided in a mind map. Rubric words in this task specified the oral skills that

candidates were required to perform: talk, discuss, give opinions, (give) reasons, (give)

examples, and decide (Appendix D.2). Conversation analysis of the speech samples

revealed that there was less coherence in test takers’ discussion when they were exposed to

many textual prompts provided via a visual channel. Candidates tended to cover as many

prompts in their talk as possible but showed quite weak reflection on and connection with

what their partner had mentioned. Test takers were prone to taking turns to talk about

different aspects of the issue on the basis of the prompts suggested in the mind map.

Excerpt 6.11 illustrates the situation of turn-taking in a discussion about ‘what makes

people successful in their career’ (U2.Q68). This test session had three candidates because

there was an odd number of candidates on the student list for this test room.

Excerpt 6.11 File U2.C15-16-17 8:49 – 10:25

1  --->  C16: I think that the most important thing to make people
2        successful in their career is they need to have softskill
3        because you know today there are many company that require
4        er their employees need to have some some skills like
5        talking in the public place, and er, for example if two
6        people they have the same experien, but there’s one have a
7        better softskill I think that they have a more benefits in
8        getting the job
9  --->  C15: and I think er in in the life er people alway face
10       challenge or difficulty so if (you) they have a high
11       determination they will get over their difficulty and
12       achieve their goal. So for me determination is the most
14       reason causing make people be successful in career.
15 --->  C17: but in my opinions, communications is, communication
16       skill is one of er the character to make career the
17       successful person, even though you are a person who has a lot
18       knowledge about some (???) but er if you don't have er a good
19       communication skill to communicate with people or with your
20       partner and it is not I mean good for your career.

In Excerpt 6.11, Candidate 16 expressed her opinion beginning with ‘I think’ to talk

about soft skills as the most important thing to make people successful in their career (lines

1-2). Candidate 15 added her opinion beginning with ‘and I think’ again to support

‘determination’ as the most important reason (lines 9-14). Candidate 17 talked about her

choice beginning with ‘but in my opinion’ without referring to any reason why she did not

agree with the other two; she was in favour of ‘communication skills’ (lines 15-20), another

prompt provided in the mind map.

A similar situation happened in Excerpt 6.4 when the candidates incorporated

available clues in their speaking. Candidate 10, for example, mentioned some

disadvantages of canned food and fast food by using the prompts ‘high in sugar’, ‘less

nutrition’, and ‘harmful’ (lines 11-21). On the other hand, she was able to elaborate on fast

food’s advantages in that it was ‘time-saving’, and ‘long-time preserved’ (lines 24-32).

Candidate 9 used the prompt ‘portable’ to talk about a disadvantage of the issue raised in

the mind map. There was no signal of interaction in the transition from Candidate 10 back

to Candidate 9 (lines 32-33). Candidates could provide information (lines 11-15), give

opinions (lines 22, 39), and modify ideas (lines 24, 27), but they seemed to be confused

about the purpose of the central question, i.e. ‘decide’ which item in the set of prompts

provided is the most important or the most serious. In the paired speaking test, the

interlocutor had to intervene with the repeated question ‘which disadvantage is the most

serious?’ (line 38) to elicit what judgement the candidates were required to arrive at by

reasoning in this task.

The main question in the mind-map discussion, which was virtually the same

(‘should decide which one is the most…?’) across topics, was not clear for test takers in

their decision-making task: Which candidate should make the decision? Would it have to

be a personal or mutual decision? What if the candidates had different opinions and could

not come to a common decision? Were candidates required to cover all the prompts

provided in their discussion before arriving at a final decision? In Excerpt 6.4, the pair of

test takers were able to cover all the prompts provided while in Excerpt F.3d (Appendix

F.3), another pair of test takers used only half (three) of the given prompts in their

discussion about ‘ways for parents to help their children develop sportsmanship’ (U2.Q62).

Candidates 26 and 27 took turns to talk about the prompts ‘sending children to a special

school’ (lines 4-5), ‘helping children decide when they want to play sport’ (lines 12-13),

and finally ‘learning about the injuries and safety in sports’ (lines 23-24). However, the
candidates had not decided which one was the most important by the time the interlocutor
signaled that time was up and they had to finish their discussion.
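The prompt-coverage contrast noted above (all prompts covered by one pair, only three of six by another) lends itself to a simple automatic check against the mind map. The following Python sketch is a minimal illustration using naive substring matching; the prompt list is a hypothetical stand-in echoing the fast-food mind map discussed earlier, not the wording of University B's actual test materials, and exact-wording matching would of course miss paraphrased prompts.

```python
def prompt_coverage(prompts, utterances):
    """Report which mind-map prompts were mentioned anywhere in a discussion.

    Naive substring matching: a prompt counts as covered only if its exact
    wording appears somewhere in the candidates' speech.
    """
    spoken = " ".join(utterances).lower()
    mentioned = [p for p in prompts if p.lower() in spoken]
    return {
        "mentioned": mentioned,
        "missed": [p for p in prompts if p not in mentioned],
        "coverage": f"{len(mentioned)}/{len(prompts)}",
    }

# Hypothetical prompts echoing the fast-food mind map discussed above.
prompts = ["high in sugar", "less nutrition", "harmful",
           "time-saving", "long-time preserved", "portable"]
utterances = [
    "canned food is high in sugar and has less nutrition so it is harmful",
    "but fast food is time-saving for busy people",
]
print(prompt_coverage(prompts, utterances))
# -> 4/6 prompts mentioned; 'long-time preserved' and 'portable' are missed
```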

(3) “Talk” and “answer the questions” in the examiner-candidate interview task

(University B’s Task 1 and University C’s task)

This type of interactional task required individual candidates to listen to the interlocutor (or

the examiner) carefully and give spontaneous responses to each question. An examiner-

candidate interaction task was the only task applied at University C. At University B,

this task came before the candidate-candidate discussion section with the purpose of

“allow[ing] candidates at all levels to complete at least some of the initial tasks/items”

(Galaczi & ffrench, 2011, p. 133). Most of the questions were developed on the themes

covered in the course books. These questions can be categorised into five types (Lowe,

1988) according to the structural complexity and the type of response they aimed to elicit.

As can be seen in Table 6.5, the most popular form was information questions, i.e.

questions beginning with Wh-words such as What, Who, Where, How, etc. Some questions

designed to elicit hypothetical or supported opinion responses were found at University B,

whereas University C did not have these types of questions.

Table 6.5 Elicitation question types used for interview tasks

Question types | University B | University C | Examples
(1) Yes/No question | 23 | 6 | Is fast food good for people’s health?
(2) Choice question | 4 | 1 | Which situation do you think is more serious: the shrinking habitat of polar bears or the damage done by oil spills?
(3) Information exchange | 46 | 47 | How popular is the concept of taking a gap year in Vietnam?
(4) Hypothetical question | 4 | 0 | If you could go anywhere in the world for a year, where would you go?
(5) Supported opinion question | 5 | 0 | Why do you (not) avoid the food that contains additives?
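The five-way categorisation in Table 6.5 was done manually against Lowe's (1988) types, but a rough first pass could be automated from surface features of each question. The following Python sketch is a heuristic illustration only, tested here against the example questions from Table 6.5; borderline items would still need manual coding.

```python
import re

def classify_question(question):
    """Roughly assign an elicitation question to one of the five types in Table 6.5.

    Surface-pattern heuristics only; they are not a substitute for manual coding.
    """
    q = question.strip().lower()
    if q.startswith("if "):
        return "hypothetical question"
    if q.startswith("why do you"):
        return "supported opinion question"
    if q.startswith("which") and " or " in q:
        return "choice question"
    if re.match(r"^(is|are|do|does|did|can|could|will|would|have|has|should)\b", q):
        return "yes/no question"
    return "information exchange"

# The example questions from Table 6.5.
examples = [
    "Is fast food good for people's health?",
    "Which situation do you think is more serious: the shrinking habitat of "
    "polar bears or the damage done by oil spills?",
    "How popular is the concept of taking a gap year in Vietnam?",
    "If you could go anywhere in the world for a year, where would you go?",
    "Why do you (not) avoid the food that contains additives?",
]
for q in examples:
    print(f"{classify_question(q)}: {q}")
```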

In an interview task, the examiner/interlocutor randomly asked candidates questions

predetermined from the test book. The questions were either independent (e.g. ‘What does

losing teach children about life?’), or combined in a sequence (e.g. ‘Did you play sports or

games as a child? Did you enjoy them? Why or why not?’). Not only did predetermined

series of questions help the examiner structure the interview (O’Sullivan, 2012) to elicit

intended speech samples, but they also facilitated candidates’ elaboration of their response

if they did not have many ideas to answer a certain question. Recorded data revealed that

University B’s interlocutors followed both the questions and the interlocutor outline

designed for each topic in the test book exactly, whereas University C’s examiners were

much more flexible in asking questions. Depending on the candidate’s response, the

examiner could ask follow-up questions, for clarification or elaboration, that were not

included in the test book. Flexibility in asking questions (see, for example, Excerpt 6.2) might
have encouraged candidates to talk, but it simultaneously created inconsistency in task
delivery across students in the same test room when one candidate had more opportunities
than others to demonstrate his/her oral production ability. Strict adherence to the script, on the other

hand, restricted the candidate from receiving the interviewer’s assistance if the question

happened to be challenging. Excerpt 6.12 gives an example of an interview situation in

which a candidate needed some extra clues to understand the question, but the interlocutor

did not give further explanation.

Excerpt 6.12 File U2.C15-16-17 3:38 – 4:25

1  --->  IN: ((addresses Candidate 17)) how do you know that one
2        invention is important or not?
3  --->  C17: (.) er one invention? er (3) I’m sorry but er I mean
4        what=what kind of invention?
5  --->  IN: just an invention. how do you know that an invention is
6        important not?
7  --->  C17: er the invention is important or not is depends on the
8        creator. I mean the inventor er and the ability that the
9        product or er the that invention can affect the human.

After hearing the interlocutor ask the question (lines 1-2), Candidate 17 expressed

her uncertainty in understanding what kind of invention it referred to and wanted the

interlocutor to clarify the question (lines 3-4). It was reasonable to ask this question

because inventions can be categorised into three kinds (Wikipedia, 2017): scientific-

technological, sociopolitical, and humanistic (or cultural). Different kinds of inventions

play different roles in human life. It was challenging to provide a concrete answer to such a

general question. However, the interlocutor did not give an explanation but simply repeated

the question, implying by the word ‘just’ that no further explanation would be given (lines 5-
6). The candidate showed a little hesitancy before giving the answer that it depended upon
who the inventor was and what the invention was (lines 7-9). Limited to just two

sentences, the candidate’s answer was not elaborated further, possibly because she was

uncertain about her understanding of the key word of the question (‘invention’).

Simple but necessary assistance from the interviewer in task delivery may give

candidates an opportunity to complete their oral performance, so that a more
adequate speech sample can be elicited for assessment. The extract in Excerpt 6.13 is an

example illustrating the role of the interviewer in facilitating the candidate’s answer to a

question whose meaning she was not very certain about.

Excerpt 6.13 File U3.C12 0:28 – 1:21

1  --->  IN: so what factors, in your opinion, do you consider when
2        you're purchasing a product and why?
3  --->  C12: purchasing a ^product?
4  --->  IN: purchasing or buying product.
5  --->  C12: um, when I buy a product I care about price, about the
6        packaging, about the quality of this. um I believe that uh (.)
7        when you really wanna buy something you have to, you have to
8        care first about the price. because it’s very expensive I would
9        not buy it anymore. so if uh price okay, I will check about the
10       quality. if the quality is is simi=is similarity with the uh
11       the price (.) I will say, okay with this. and about the
12       packaging is not, uh, really important, but, uh, when I see
13       something cute I really wanna, sh (.) I really wanna buy it.
14       I'm a girl ((chuckles)) yeah. I really like it.

Candidate 12 seemed not to understand the meaning of ‘purchase’. She repeated the

phrase she heard in the examiner’s question (lines 1-2) to ask for clarification (line 3). The

examiner gave the synonym ‘buying’ as an alternative for ‘purchasing’ in this context (line

4). Certain that she had understood correctly, the candidate was able to produce a well-

elaborated answer to the question (lines 5-14). Not only did the candidate give a straight

response mentioning three things she cared about when shopping (price, packaging, and

quality), she also explained the reasons for her personal habit.

The purpose of this task was for candidates to respond to the questions as soon as

they heard them. Rubric words used to elicit speech samples were quite similar between the

two institutions: ‘I’d/We’d like to ask you…’ (see Table 6.4). Without an interlocutor

frame, the examiners at University C used various ways to deliver the task, but they all

referred to the same purpose of eliciting candidates’ immediate oral responses to questions

about viewpoints, interests, knowledge about the modern world, etc. Nevertheless,

candidates’ answers demonstrated considerable differences in how they processed the questions they
received and in the degree to which they elaborated their ideas to meet the requirements

of each aural question. Excerpt 6.14 illustrates an example of candidates answering the

same question about their favourite food without elaboration.

Excerpt 6.14 File U2.C5-6 0:05 – 0:44

1        IN: uhm first of all we'd like to ask you some questions.
2  --->  ((addresses Candidate 5)) what are your favourite foods?
3  --->  C5: er my favourite food is some (.) fresh fruit er
4        IN: [uhm hm]
5  --->  C5: I like food without too much oil.
6        IN: [uhm]
7        and ((addresses Candidate 6))?
8  --->  C6: my name er my favourite food is er grape (.) and veget
9        vegetable,
10       IN: [uhm]
11 --->  C6: and fruit er (2)
12 --->  IN: [uhm hm] ok. and er ((addresses C6 again)) do
13       you avoid the food that contains additives?

In Excerpt 6.14, Candidate 5 could understand the interlocutor’s question (line 2)

and made prompt responses (lines 3, 5). The interlocutor showed her engagement in the

candidate’s answer (line 6), and directed the same question to Candidate 6 (line 7) without

having to repeat that question. Candidate 6 listed some kinds of food she liked (lines 8, 11)

and was thinking of how to continue when the interlocutor interrupted with ‘ok’ and asked

her another question (lines 12-13). Both Candidates 5 and 6 were able to give appropriate

answers with enough information as asked, but nothing more.

At both institutions (Universities B and C), other candidates understood that the interview
task required further explanation and/or elaboration in addition to a suitable response to each

question. Conversation analysis of oral performance recordings revealed that most test

takers knew how to expand their answers with additional explanation and supporting ideas

although the interlocutor’s instruction for the task may not have explicitly mentioned that

requirement. The extract in Excerpt 6.15 demonstrates how Candidate 4 responded to a

question that the interlocutor had asked her partner already in the previous minute, ‘what

does being successful mean to you?’.

Excerpt 6.15 File U2.C3-4 1:04 – 1:38

1  --->  IN: what about you ((addresses Candidate 4))?
2  --->  C4: uhmm to me ahh when I ahh I set a goal for myself
3        IN: [mm]
4  --->  C4: and I try my best to reach a goal, I mean I feel
5  --->  successful. uhm you know uhm some people nowsadays they making
6        serious ((some noise)) (with belief that) to be a director or
7        to have a high social position is a (.) is successful.
8        IN: [mm]
9  --->  C4: but I think that I %(should)% better to just simplify
10       everything > so that my life would be easier <
11       IN: ok thank you.

Candidate 4 showed effective engagement in the interview when she could

understand what the interlocutor meant by asking ‘what about you?’ (line 1). Her response

was sufficient: she said that reaching a goal she had set brought her a sense of success. In

addition, she could demonstrate her social knowledge about what success means for most

people at the present time (lines 5-7). She expressed her opinion that setting simple goals to

achieve would make life easier in every situation (lines 9-10).

At both Universities B and C, the face-to-face interview format did not follow a
sequence of questions divided into the four stages suggested in the literature on
structured oral interview testing: Warm-up, Level Check, Probes, and Wind-Down (Clark
& Clifford, 1988; Valette, 1992). Whether the interview task involved verbal interaction
between two or three participants, it took “the pattern of question-answer-question-answer”
under the tester’s dominant control and lacked features of “natural conversation” (Johnson
& Taylor, 1998, p. 37).

6.2.3 Time constraints

One of the variations in task setting across the tertiary institutions was time constraints on

candidates’ oral production. This task specification included both the time for candidates to

plan prior to each task (pre-task planning time), and the time available for candidates’

response to speaking task requirements (response time). This section addresses issues

related to time constraints in the oral assessment conditions that might affect candidates’

speaking performance, and the uniformity in task delivery across candidates in a testing

event.

Table 6.3 shows that response time varied according to each task type, and not all

the tasks allowed planning time. University A’s paired speaking test sessions lasted
between 7 and 9 minutes in total, and University B’s between 7 and 8 minutes. Each test session
for an individual candidate at University C took between 3.5 and 5 minutes. The picture-

based task and discussion task allowed test takers some planning time while the interview

task did not. Whether tasks should involve time for preparation, and how planned and

unplanned tasks should be combined in an oral test have been controversial issues in
L2 speaking task design. In order to bring authenticity to oral performance,

assessing speaking skills should include tasks set under both planning and non-planning

conditions (Skehan, 1998). It should be noted, however, that “the majority of ‘real-world’ tasks

involve spontaneous speaking with no planning opportunities” (Galaczi & ffrench, 2011, p.

135).

There were differences in timing across the institutions because each adopted its own

self-designed task(s) with different specifications on time constraints. Timing was more

likely to fit the intended time frame when the interlocutor followed the prescribed step-by-

step interlocutor outline (Appendix D.2), and more flexible if candidates had the freedom to decide

when to finish their task performance (University A), or if the examiner wished to elicit

more speech samples from test takers with interview questions for further elaboration

(University C). Discrepancies in timing also arose because many raters did not use a timer during
the test. Table 6.6 presents variations in task timing across the institutions.

Table 6.6 Time constraints in minutes for speaking tasks across institutions

Time constraint | Uni. A, Task 1 | Uni. A, Task 2 | Uni. B, Task 1 | Uni. B, Task 2 | Uni. C, Task
No. candidates each turn | Single | Paired | Single | Paired | Single
Pre-task planning time: Min | 0.5 | 0.6 | 0 | 0.5 | 0
Pre-task planning time: Max | 1.1 | 1.0 | 0 | 0.6 | 0
Pre-task planning time: Variance | 0.6 | 0.4 | 0 | 0.1 | 0
Pre-task planning time: Mean | 0.8 | 0.8 | 0 | 0.55 | 0
Response time: Min | 0.8 | 3.6 | *1.2 | 3.0 | *3.5
Response time: Max | 1.5 | 4.6 | *1.8 | 3.6 | *5.2
Response time: Variance | 0.7 | 1.0 | *0.6 | 0.6 | *1.7
Response time: Mean | 1.15 | 4.1 | *1.5 | 3.3 | *4.35

Note. Speaking sessions observed at University A: N = 36; University B: N = 32; University C: N = 33.
* including interlocutor’s/examiner’s question time

As can be seen from Table 6.6, University C’s interview task lasted the longest in

comparison with the other task types. Its variance (1.7 minutes), the highest of all the tasks, implied that
the time constraints varied most markedly across candidates, ranging from 3.5 to 5.2 minutes

per candidate. This variance was much higher than that of the same type of task adopted at

University B (0.6 minute) since University B’s interlocutors had predetermined questions

and did not use unplanned probing questions. As in University B’s interview task,

University C’s time constraint was not merely candidate talking time, but included the

interlocutor/examiner’s question time. The shortest oral performance (0.8 minute) occurred in the
monologic task, when candidates did not make full use of the allowable time (between 1 and

2 minutes) and completed their individual performance in less than 1 minute.
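The summary figures in Table 6.6 can be reproduced from the shortest and longest observed sessions alone: in every column the reported 'Variance' equals the maximum minus the minimum (i.e. the range), and the reported 'Mean' equals the midpoint of the two. The following Python sketch simply makes that relationship explicit for University C's interview task; it is an illustration of how the table's values relate to one another, not a recalculation from the full set of per-session timings.

```python
def timing_summary(min_minutes, max_minutes):
    """Reproduce the 'Variance' (range) and 'Mean' (midpoint) figures of Table 6.6
    from the shortest and longest observed timings, in minutes."""
    return {
        "min": min_minutes,
        "max": max_minutes,
        "variance": round(max_minutes - min_minutes, 2),    # range, as reported
        "mean": round((min_minutes + max_minutes) / 2, 2),  # midpoint, as reported
    }

# University C's interview task (response time, including the examiner's question time).
print(timing_summary(3.5, 5.2))
# -> {'min': 3.5, 'max': 5.2, 'variance': 1.7, 'mean': 4.35}
```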

The absence of strict timing in oral assessment, on the one hand, allowed examiners the
flexibility to manage the completion of the testing event for the whole test room by extending
or reducing candidates’ response time (and planning time) as desired. On the other hand,

inconsistency in timing for each speaking session may have been a threat to test fairness

and reliability when some candidates had more time and opportunities to demonstrate their

language ability than others. Acceptable variations in timing should be discussed and

agreed to by all the raters involved in the rating process prior to the test event.

Previous studies on planning time for speaking tasks have indicated that there was a

tentative relationship between pre-task planning and fluency. However, planning before

oral performance was found to enhance grammatical accuracy and discourse management

only for candidates with high proficiency levels when the cognitive demand of the speaking

task was high (Wigglesworth, 1997). Evidence from a mixed methods study revealed that

candidates performing monologic tasks did not have remarkable benefits from planning

time in terms of the quality of oral performance and test scores (Iwashita, McNamara, &

Elder, 2001). This was the case with University A’s picture-based task, in which each candidate
was allowed up to 1 minute to observe a picture, generate descriptive ideas, and
interpret the message conveyed in the picture.

Candidates’ oral production seemed to depend more upon the suggestiveness of the

picture than upon the length of planning time. Analysis of the recorded data
demonstrated that how, and for how long, candidates could perform their monologue was affected by
the complexity of the picture assigned to them (i.e. the channel of task input) and by their L2
competence, rather than by how much time was allotted for their planning. This result was in

line with what the EFL experts pointed out: there existed inequality in selecting pictures

that affected candidates’ language production skills (see Chapter Five). Some pictures were

more difficult to talk about than others. For example, there were not many details from

which candidates could generate ideas in the picture depicting the scene of a city at night

(U1.Q9). Although they were given the same planning time as those with other pictures,

candidates could find very few ideas to describe that picture. Examples of candidates’

talking about the same picture suggest that less able candidates did not benefit much from

their pre-task planning for a long turn compared with stronger candidates.

Excerpt 6.16 File U1.C5-6 2:01 – 3:05

1  --->  C5: uhm there are many skyscrapers, and there are many er
2        vehicle on the road. it seems that there may be heavy traffic,
3  --->  and many lights too. umm this picture remind me of the modern
4        er modern life with er er with modern building and devices.
5        It’s very different from in the past where there are many
6        cottage and not many light as today. we can see that er er city
7        develop er ra- rapidly. and nowaday leave behind er (4) leave
8        behind cottage re forest or %> something like that <%

Within an approximately 1-minute long turn, what Candidate 5 (Excerpt 6.16) could

see emerge from the picture was skyscrapers, many vehicles in motion (heavy traffic), and

streetlights (lines 1-3). The rest of her monologue expressed the theme this candidate
interpreted from the picture. The average scores awarded for Candidate 5’s live performance
on this task were 8.0 by Rater 1 and 7.0 by Rater 2. Excerpt 6.17

demonstrates another candidate’s difficulty in completing this task with the same picture

prompt.

Excerpt 6.17 File U1.C33-34 0:43 – 2:19

1  --->  C33: at first glance I can see a lot of tall building, with a
2  --->  lot of storeys. the (3) there are lots of lights % in the
3        picture % uhm maybe it’s a (.14)
4        IN: is that all?
5        C33: (12) the (.) there are a lot of (.) lights on the road.
6  --->  maybe it’s a (.) it’s the movement of something which is very
7        fast.

Given almost the same amount of planning time, Candidate 33 was not able to

exploit much of the response time available to her. This candidate’s ideas were repeated
(line 5), and her monologic performance showed many signs of hesitancy and
very long pauses of up to 14 seconds (lines 3, 5, 6) before she could mention details similar to
those Candidate 5, a stronger speaker, could produce in Excerpt 6.16, i.e. tall buildings (a lot of

storeys), lights, and something moving very fast. For this live performance, Candidate 33

was awarded the scores 4.5 and 5.0 by Rater 1 and Rater 2 respectively. It seems that

Candidate 33’s lower scores compared with Candidate 5’s were not caused by the pre-task

planning time, but mostly by the prompt difficulty, the candidate’s understanding of the

task purpose (Table 6.2), or the candidate’s English speaking ability itself.
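The hesitation evidence cited for Excerpt 6.17 (micro-pauses and timed pauses of up to 14 seconds) can also be extracted from CA-style transcripts automatically. The following Python sketch is a minimal illustration, assuming that timed pauses are transcribed as bracketed numbers such as (3), (12) or (.14) and micro-pauses as (.), which matches the convention visible in the excerpts above; the two sample lines are taken from Excerpt 6.17.

```python
import re

TIMED_PAUSE = re.compile(r"\(\.?(\d+)\)")   # e.g. (3), (12), (.14) as in the excerpts
MICRO_PAUSE = re.compile(r"\(\.\)")         # the CA micro-pause symbol (.)

def pause_profile(lines):
    """Summarise timed pauses (in seconds) and micro-pauses in transcript lines."""
    timed = [int(n) for line in lines for n in TIMED_PAUSE.findall(line)]
    micro = sum(len(MICRO_PAUSE.findall(line)) for line in lines)
    return {
        "timed_pauses_seconds": timed,
        "longest_pause": max(timed, default=0),
        "micro_pauses": micro,
    }

# Lines 3 and 5 of Excerpt 6.17 (Candidate 33).
sample = [
    "picture % uhm maybe it's a (.14)",
    "(12) the (.) there are a lot of (.) lights on the road.",
]
print(pause_profile(sample))
# -> {'timed_pauses_seconds': [14, 12], 'longest_pause': 14, 'micro_pauses': 2}
```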

The inclusion of pre-task planning time for the paired tasks should be examined to

make sure that the time allowance would be beneficial for test takers to plan together for a

co-constructed performance. Each candidate had to spend two periods of silent planning

time for Task 1 in a paired speaking session (University A). One candidate keept quiet and

did nothing while the other candidate was planning for his/her own long turn in silence. The

same procedure was repeated when the speaker’s role was switched between the two

candidates. The other candidate’s presence in this paired format seemed to be redundant

because there was no candidate-candidate interaction during this task. If each candidate

used up to 1 minute for pre-task planning, and 2 minutes for oral task response (as they

were allowed), then the other candidate’s waiting time would be up to 3 minutes, which might increase

pressure and test anxiety for candidates. A recent study suggested that the provision of

planning time had a positive influence on both the quality and quantity of oral language

production. One-minute planning time was sufficient to lead to significant differences in L2

learners’ speaking test performance (Li, Chen, & Sun, 2015).

Both the institutions that adopted a discussion task in oral assessment allowed

candidates to have planning time before responding to the task. The planning time was

essential for an interactional task because candidates needed preliminary negotiations with

partners to generate ideas prior to their speaking. At University A, each pair of candidates

was given 1 minute to read two questions and choose the one they preferred.

The two candidates may have spent more time deciding which question to discuss together

if their choices had been different. After that, the candidates could share the outline of what

they were going to talk about or allocate their roles in the discussion. The candidates were

not obliged to use spoken English in their planning time. Planning time at University B
was shorter than that at University A: each pair was given 30 seconds to look at the mind

map. The candidates needed time to read and negotiate with their partners the meaning of

the six prompts in the mind map. They might draft what to say about each prompt provided

and/or plan their decision on the discussion question. As at University A, candidates at

University B could speak Vietnamese in a low tone in their pre-task planning time if they

felt more comfortable doing so. However, unlike at University A, the candidates did not have a
choice of task: there was only one compulsory mind map for each pair to use as prompts.

It was impossible to learn from the test scores how candidates’ interactional skills were
rated, either because interactive performance was not included as an

assessment criterion (University A), or because the discussion task was scored in

combination with the interview task (University B). Therefore, this study’s exploration of how
pre-task planning influenced paired oral performance was limited in terms of

test scores. Conversation analysis, however, raised concerns that candidates might not have

exploited the planning time to prepare for collaborative performance but instead

demonstrated “solo versus solo” interaction (Nitta & Nakatsuhara, 2014, p. 149) as

illustrated in Excerpt 6.11.

The response time for each task could be flexible depending on the examiner’s

timing (Table 6.6). Candidates’ response time could be extended, or
cut short, if the examiner thought a sufficient speech sample had been elicited. Because this was an

institutional assessment, the examiner could shorten or lengthen the time constraint

according to the time available for the test event. In a large test room, the total time

constraint for each speaking session tended to be shortened to complete the rating process

on time. The range of timing for interview tasks varied more than for discussion tasks

because the examiner/interlocutor was more active in timing by reducing or increasing the

number of questions to ask. Nevertheless, asking unplanned questions was synonymous to

introducing inconsistencies into task delivery, and so might have affected candidates’

performance and raters’ marking. To ensure candidates have sufficient time for oral

production in a speaking test, Galaczi and ffrench (2011, pp. 136-137) recommended that

“the time allocation for each task is always checked and trialled during the test

development stage before the task type is finalised for inclusion in a live exam”.

6.2.4 Channels of communication

All the speaking tasks adopted at the three institutions took the direct face-to-face format,

which allowed reciprocal interaction between those participants involved in the test session.

Unlike computer-based testing (semi-direct format) where test takers receive language

input from a recorded device, the direct channel enables co-constructed performance to be


assessed. However, the delivery of tasks was different and could be categorised into three

types of communication channels: aural (linguistic input from examiner to be heard by the

candidate), written (textual input to be read by the candidate), and visual (drawings or

photos to be seen by the candidate). Table 6.7 presents the task input through the various

channels, some of which overlapped across the institutions.

Table 6.7 Channels of communication in EFL oral assessment across institutions

Task 1
- University A: Long turn (monologue). Aural: examiner delivered the prompt. Visual (pictorial): cartoons or photographs with/without artistic effects.
- University B: Interview (interlocutor-candidate interaction). Aural: interlocutor delivered questions.
- University C: Interview (examiner-candidate interaction). Aural: interviewer delivered questions (planned) and, where necessary, asked some more probing questions (unplanned).

Task 2
- University A: Discussion (candidate-candidate interaction). Aural: examiner delivered the instruction. Visual: two written discussion questions for candidates to negotiate and choose one.
- University B: Discussion (candidate-candidate interaction). Aural: interlocutor delivered the instruction. Visual: a mind map comprising six prompts associated with one theme, plus a discussion question written above the mind map.

Examiners played the role of task deliverers but did not always participate in the co-construction of task performance. University C's examiners helped shape candidates' oral production from the beginning to the end of the test session. The examiner decided which questions would be asked, how far candidates' responses would be elaborated, the speech rate, attitude and behaviour displayed, the questioning strategies used to facilitate the interview, and so on. University B's interlocutor's contribution to candidates' oral performance was confined to the first task, in which the interlocutor delivered aural questions to candidates. In this task, the interlocutor's role was similar to that of University C's examiners, except with regard to the flexibility of asking elaboration questions to assist candidates who had difficulty understanding the questions.

University A's examiners' roles as interactional contributors were blurred from beginning to end because candidates performed a monologic long turn in Task 1 and were involved in a peer discussion in Task 2. University A's examiners did not show any notable influence on the delivery of the task input because candidates received the task prompts via a visual channel (picture-based prompts in Task 1) and a written channel (text-based questions in Task 2). In this case, task designers played an important role in the quality and quantity of the linguistic (or non-linguistic) input, which had a direct influence on candidates' comprehension of the task prior to their oral performance.

Analysis of the recorded speech samples from actual performances elicited by visual prompts (picture-based) and written prompts (text-based) indicated that these different input channels resulted in notable differences in candidates' output in terms of fluency, lexical and syntactic complexity, and topical diversity. First, picture prompts caused more

hesitation and reduced fluency in the flow of candidates’ performance. Disfluency in oral

production could be characterised by hesitations consisting of filled (with noises like ‘erm’)

or unfilled (silence) pauses, repeating words or syllables, changing words, correcting the

use of cohesive devices, and changing the grammatical structure of predictable utterances

(Fulcher, 1996). Excerpt 6.18 illustrates an extract from a picture-cued performance

showing a frequent lack of fluency.

Excerpt 6.18 File U1.C17-18 0:55 – 1:56

1        C17: … in this picture this man is attempting er rock climbing.
2  --->  It's a very dangerous sport in that er it er it contains many
3  --->  risks and er I can I can infer that this man is a risk taker,
4  --->  and he er gains excitement from er take from er taking risks er
5  --->  this sport is er this sport can er been er posh can potentially
6        be deadly as you could die from er the falling rock, from the
7  --->  cold and from the lack of er nourishment in a ve er in a long
8        time er however he seems to enjoy it a lot, and er he seems
9        that he is professional er as he can climb without help from
10       other people, and er the view from the from his point of view
11       is beautiful as well, and er I s… I suppose that er this man is
12 --->  enjoying this sport, and he love its he loves it because of its
13       excitement and its er view er in spite of all the danger er
14 --->  danger exposes.


Although Candidate 17's flow of speech continued without silent pauses, there were a large number of hesitations, such as pauses filled with noises (like 'er', 'erm'), in almost every line. These noises occurred when the candidate was thinking of appropriate terms to fit what was depicted in the picture, e.g. 'gain excitement' (line 4), 'potentially' (line 5), 'nourishment' (line 7), 'exposes' (line 14). Words, phrases, and syllables were repeated (lines 2, 3, 5). The candidate changed 'take' into 'taking' (line 4), and 'love its' into 'loves it' (line 12). Uncertainty about the grammatical structure after 'in spite of' made the test

taker repeat the word ‘danger’ before adding the verb ‘exposes’ (line 14).

The same candidate, however, did not show much hesitation in performing the

discussion task prompted by a written question. As Excerpt 6.19 points out, his speech was

more fluent with almost no difficulty both in questioning (lines 1-6) and responding to his

partner’s argument (lines 14-18).

Excerpt 6.19 File U1.C17-18 4:39 – 5:05; 5:45 – 6:15

1  --->  C17: . . . and how much do you think it's enough. let's take a
2        look at a certain situations, supposing that you are in er in a
3        dangerous situation, and your company needs you to make a quick
4        decision but it could (actually) ruin your company (.) so
5        should you take a risk or should you consider?
6        . . . ((8 lines of transcript))
14 --->  C17: on that I agree. however %I'd like to add that% beside
15       taking risk unnecessarily, we perhaps have to er take a minute
16       and think about it. ah for example, we should think about the
17       possible consequences for our actions, and er perhaps we should
18       make a list ( ) before we make a decision.

Second, written prompts facilitated more lexical and syntactic complexity than picture-based prompts. Reading a discussion question could bring about more vocabulary and more opportunities to use cohesive devices to link that vocabulary together. Visual input seemed to limit test takers' breadth of thinking when they had to generate appropriate words from their lexical range or think of a message to fit what they could see in a picture. Excerpt 6.8 gave an example of a candidate's vocabulary being restricted to the change of an insect and its link with human lifestyle and behaviours. In this candidate's long turn, simple utterance structures were found rather than complex ones. For

example, ‘this is like an evolution’ (line 4), ‘this provokes a thought about human being’

(lines 4-5), ‘some people choose to develop into something new, something greater’ (lines

5-6), ‘some choose to have a safe life’ (lines 7-8), ‘they are not willing to take risk to

discover new things’ (lines 9-10).

In contrast, conversation analysis of the speech samples indicated that test takers used a broader range of vocabulary and more complicated structures in the discussion task prompted by a written question. A two-turn exchange from Excerpt 6.10 (lines 16, 29) illustrates an interactive discussion performance by the same candidate from Excerpt 6.8 and his partner. There were fewer indicators of hesitation (silent pauses), and the interactants used much more complex structures connected by 'if' (lines 16, 20), 'so that' (lines 18, 20), 'because' (lines 24, 25), and 'then' (line 21). Comparing Candidate 4's performance in Excerpts 6.8 and 6.10 again, I can see that the speakers were more fluent in oral production with a written

prompt than with a visual prompt. However, raters’ evaluation of fluency in candidates’

oral performance was not reflected in the scoring sheets as the raters used a holistic scale

for task-by-task rating.

Third, pictorial prompts opened up more possible topics and ideas than

written ones. Candidates’ creativity was maximised when they performed the monologic

task without the examiner’s intervention. Excerpt 6.4 presents an example in which the

candidate did not interpret the visual prompt in the way intended. A comparison of

students’ speech samples revealed the possibility for various opinions and topics to emerge

from the same picture. For example, Candidate 6 in Excerpt F.3b (see Appendix F.3) related the cartoon (U1.Q10) to a life story in which many people want better things to come to them but are too lazy to take any action. She supposed the message of the cartoon was

that no improvement will happen to those who just sit and wait. On the other hand,

Candidate 34 in Excerpt F.3c interpreted the meaning of the same picture as

industrialisation causing many people to leave the countryside for more opportunities in

cities. She expressed a concern that rural areas would no longer be inhabited in the future.

Differences in language input resulted in variations in the challenges that candidates had to handle during their oral performances, even though the input was all written. Candidates who were required to choose a question to discuss (University A) had to generate vocabulary and ideas on their own to perform the discussion task, whereas candidates who were provided with a mind map (University B) could re-use or incorporate key words or phrases from the prompts of the mind map to perform the assigned task. In analysing Excerpt 6.6, for example, it can be seen that both candidates incorporated available phrases from the mind map of Topic 7: 'less nutrition' (line 9), 'time-saving' (line 27), 'long-time preserved' (line 30), 'portable' (line 34), 'high in sugar and fat' (lines 20, 40).

6.3 Raters’ and candidates’ perceptions of the test tasks

I have analysed in the previous sections the relationship between the test tasks adopted at

the institutions and candidates’ actual task performances. By comparing how each task type

worked in eliciting relevant speech samples for assessment, I have partly answered the

research questions regarding the effectiveness of the particular form of oral language

testing applied at each university. In the following section, I continue with a presentation of

teachers’ and students’ perceptions of the test tasks they had experienced in the end-of-

course speaking examination. Data for this section derive from separate questionnaire

surveys for EFL teachers and students and face-to-face interviews conducted with

individual teacher raters and groups of candidates after the test event.

6.3.1 Teachers’ perceptions of the test tasks

After the completion of the institutional speaking test, I invited the EFL teacher raters

involved in teaching and/or rating English majors’ oral skills to complete a questionnaire

regarding their rating performance and perceptions about the test. Table 6.8 presents the

results of descriptive statistics on teachers’ opinions about the test tasks.

Table 6.8 Means and standard deviations (SD) for EFL teachers' perceptions of test tasks

(1) Students had opportunities to use the language they had learnt in this course to perform the test task. (N = 35, Mean = 4.31, SD = .63)
(2) The test task engaged students by using spoken English interactively. (N = 35, Mean = 4.26, SD = .61)
(3) The test task was designed on the basis of the course objectives. (N = 35, Mean = 4.20, SD = .68)
(4) The information required to complete the speaking task was within the course syllabus. (N = 35, Mean = 4.14, SD = .55)
(5) The test task was highly structured. (N = 35, Mean = 3.97, SD = .51)
(6) The speaking task helped to elicit adequate students' speech samples for evaluating their speaking ability. (N = 35, Mean = 3.91, SD = .74)
(7) The test task was authentic. (N = 35, Mean = 3.74, SD = .74)
(8) Students were already familiar with this kind of test task. (N = 35, Mean = 3.63, SD = 1.00)
Valid N (listwise) = 35
Note. 1 = Strongly disagree, 2 = Disagree, 3 = Neither agree nor disagree, 4 = Agree, 5 = Strongly agree

The means of the eight items indicating raters’ agreement with the statements

ranged from 3.63 to 4.31. Item (1) received the highest mean (M = 4.31) suggesting that

test tasks provided candidates with opportunities to use the language they had learnt from

the course. Item (2) indicated that candidates were highly engaged in using spoken English interactively (M = 4.26) with the examiners and their candidate partners. Most of the teachers agreed with Item (3) that the test tasks were designed on the basis of the course objectives (M = 4.20). An experienced rater commented that the oral test requirements at her institution covered key aspects of the course objectives:

Students have to be able to use spoken English to persuade people to agree with a

viewpoint. Students need to know how to disagree with partners, and how to get

involved in a debate with each other. Learners need to pay attention to appropriate

phrases or formal language to use in counter argument with partners. For the

picture-based task, the candidate is required to describe a picture from his/her

viewpoint. There might be different ways to describe the picture depending on

his/her viewpoint, and he/she needs to recognise what the theme, or the topic of that

picture is. (U1.T1)


When asked to evaluate the extent to which candidates achieved the course

objectives via their oral test performance, a teacher rater estimated that successful

candidates could achieve about 70% of the prescribed objectives, whereas less able

candidates did not. She explained that:

The majority of the latter group did not perform well in three areas: first, language

content; second, language knowledge; and third, language functions. Because

students did not pay attention to or got involved in classroom activities, they were

not familiar with that content. When in the test, the entire test contents were from

the course book. Students did not present well. Secondly, regarding language

knowledge, e.g. grammar, vocabulary, students were not accustomed to the content,

so they could not think of related terms. Grammar was not accurate. Students tried

to present their speaking skills but with the structural competence that did not meet

expectation… Some candidates had very good pronunciations. Some others did not

pay attention to or practise their pronunciation. In addition, once candidates did not

grasp the content, or their language knowledge was not good enough, then how the

candidates used (spoken) language or turn taking skills was even worse. (U2.T1)

These comments were in line with the survey result for Item (4) that all the

information required to complete the speaking tasks was within the course syllabus (M =

4.14). This teacher emphasised that “the group (of candidates) achieving 70% (of the

course objectives) were those who regularly attended class activities, and they could

achieve all the three (language aspects)” (U2.T1).

Teachers' responses showed a lower level of agreement (M = 3.74) on task authenticity, the concern of Item (7). This result may be explained by the EFL experts' judgement that some parts of the test contents (at all the institutions) were rated as having low relevance to the course book contents and the course objectives. Some test questions concerned issues that Vietnamese students rarely encountered, e.g. playing chess (U1.Q21) and the effects of climate change on polar bears (U2.Q40). Others were display questions that learners would hardly ever have to answer in real life, e.g. 'What is the Enron scandal?' (U3.Q22), 'Give examples of phrases or expressions to show contrast' (U3.Q42).

The lowest agreement from the teacher respondents (M = 3.63) for Item (8)

indicated that not all the raters perceived that candidates were familiar with speaking tasks

of these types. A number of candidates might not have had enough preparation for the

demands of a particular task possibly because they did not regularly attend class, or the

teacher in charge did not familiarise their students with the task types they would be

required to perform. Some teachers decided not to specify task requirements because they

aimed to make the test a little challenging for the students as they were English majors. The

teachers supposed that students at that level should know what an oral test for that level

would be like (U1.T1). This viewpoint finds support in Weir's (2005, p. 55) observation that "if the test

lends itself to test taking strategies that enhance performance without a concomitant rise in

the ability being tested then there must be some concern”.

6.3.2 Candidates’ perceptions of the test tasks

I invited candidates to complete a questionnaire regarding their attitude and opinions about

the oral test they had participated in. The test takers were given the questionnaire in

Vietnamese to make sure that they were clear about what was being asked. Table 6.9

presents the results of descriptive statistics on candidates’ perceptions of the speaking tasks

they had performed in the test rooms.

Table 6.9 Means and standard deviations (SD) for EFL students' perceptions of test tasks

(1) I needed to attend class regularly to be able to do the task well. (N = 352, Mean = 3.84, SD = .83)
(2) The test task was related to what I had learnt in class. (N = 352, Mean = 3.80, SD = .83)
(3) I had an opportunity to use the language I had learnt in the listening-speaking course to perform the test task. (N = 352, Mean = 3.72, SD = .81)
(4) I understood what I was required to do in the speaking task. (N = 352, Mean = 3.68, SD = .75)
(5) The test task was practical. (N = 352, Mean = 3.63, SD = .83)
(6) The speaking task evoked my interest. (N = 352, Mean = 3.51, SD = .78)
(7) I had enough time to demonstrate my speaking ability as the test task required. (N = 352, Mean = 3.26, SD = .84)
(8) The test task was too difficult for me. (N = 352, Mean = 2.90, SD = .81)
Valid N (listwise) = 352
Note. 1 = Strongly disagree, 2 = Disagree, 3 = Neither agree nor disagree, 4 = Agree, 5 = Strongly agree

The means of eight items indicating candidates’ agreement with the statements

ranged from 2.90 to 3.84. Item (1) received the highest mean (M = 3.84) which echoed the

results found in the teachers’ questionnaire that candidates needed to attend class regularly

to be able to perform the speaking tasks well. Item (2) received the second highest mean (M

= 3.80) indicating that the test contents accorded with what had been taught in the speaking

skills lessons. However, my document analysis indicates that the institutions incorporated

the course contents into the test contents at different levels. University A reused the themes

covered in the course book but no questions or visual prompts were exact copies from the

course book. University B reused and adapted many aural questions and written prompts

from the course book. University C reused many questions and designed some new ones to

test content knowledge from the course books. The question list was made known to

students at one institution, but unknown at the other two.

Item (3) demonstrated a fairly strong agreement (M = 3.72) that candidates had an

opportunity to use the language they had learnt in the listening-speaking course to perform

the test tasks. Data from interviews with candidates confirmed that the topics they were

required to talk about resembled those learnt in class (U1G2.S2). Another student expressed

a different perception regarding the lexical range he could employ in his speaking test session:

This examination was actually an opportunity for me to demonstrate what I had

learnt, but the vocabulary used in the oral test had very little association with what

was learnt in class. I chiefly relied on my personal experience of what I spoke. I

spoke as I could remember about it. I told what I know about it. (U2G2.S4)

Item (7) did not receive a very high level of candidates’ agreement (M = 3.26). This

result shows that not many candidates thought they had enough time to demonstrate their

speaking ability as the test tasks required. Students expressed regret that they did not have

enough time to generate as many ideas for speaking as they could.

As soon as I completed my speaking session, I regretted why I thought of it that

way and talked about it like that. The time was too short. I was not able to think

thoroughly to tell as much as I could. (U1G1.S1)

Sharing the candidates' regret, a rater told me that this was due to the large number of candidates scheduled in each test room. Raters had to limit the time for each candidate's

performance to ensure the speaking test event finished within the intended time frame

(U1.T3).

Item (8) received the lowest mean (M = 2.90), which shows that the oral test was not very challenging for most of the English majors. Descriptive statistics of the test scores demonstrated that the proportions of candidates who received below-average scores (less than 5.0) for the speaking component were quite small: 3.5% at University A (the lowest), and 11.7% and 9% at the other two institutions. The overall proportion of below-average scores across the three institutions was less than 10%.

6.4 Summary

In Chapter Six, I have presented the characteristics of the different assessment tasks that the

institutions adopted to initiate candidates’ oral production in the context of a school-based

end-of-course examination. The integration of data from documents and test room

observation revealed that there were similarities and differences in test task design across

institutions in terms of interactional relationship between participants in each speaking

session that enabled varied speech patterns: monologic long turn, candidate-candidate, and

candidate-interlocutor. In general, test tasks provided opportunities for test takers to display

their oral ability applying familiar skills they had practised in their Speaking classes.

However, inconsistencies were found in the timing of speaking sessions and in task delivery across candidates, resulting from the lack of an interlocutor outline for some examiners and from varied interpretations of task instructions, which led to discrepancies in the speech patterns elicited from candidates performing the same kind of task. Diversity in factors affecting

candidates’ task performance during the limited time for a speaking session contributed to

challenges for the rater’s judgement and scoring. Many of these factors were not relevant to

the constructs being measured in the oral test.

My next chapter is a deeper exploration into the oral rating process. This is the most

complicated but decisive stage in oral assessment when the rater carried out important

analysis to decide an appropriate score for each candidate. Subjectivity may be

introduced into raters’ judgement and interfere with the candidates’ test results and the test

validity. I examine the test score distribution across institutions and estimate the extent to

which pairs of raters agreed with each other’s rating (inter-rater reliability) when they rated


the same candidates’ speaking performances. Convergent results are presented from the

integration of data from the questionnaire surveys and interviews with test raters and test

takers to gain insights into stakeholders’ perceptions of the practices of oral assessment

they were involved in.


Chapter Seven

RESULTS: RATER CHARACTERISTICS AND RATING ORAL SKILLS

7.1 Introduction

In Chapter Six, I presented a critical analysis of characteristics of the test tasks used in oral

assessment across the tertiary institutions involved in my study. The tasks differed in their

response formats, contents, input (channel of communication), and time constraints.

Variations in task types contributed to shaping candidates' oral performance and the speech patterns elicited for assessment, and therefore influenced raters' perceptions,

judgements, and marking of candidates’ oral language ability.

Raters played a decisive role in direct (live) oral assessment in that they structured

the speaking sessions, delivered test tasks, managed interaction with/amongst candidates,

observed candidates’ language performance, and undertook complicated analysis of the

quality of oral performance in order to decide appropriate scores for each student. While

physical conditions for oral assessment require particular investment and administration

(e.g. noise-free room, waiting area separated from test rooms, suitably-sized desk/table and

sufficient number of chairs for all test participants), “the human resources are even more

challenging” (Fulcher & Davidson, 2007, p. 124). Speaking tests are classified as

“subjective assessment” because they involve “the judgement of the quality of a

performance” (Council of Europe, 2001, p. 188) by human raters, which is different from

judgement based on the number of correct answers (the quantity) in measuring linguistic

competence, e.g. multiple-choice test.

In this chapter, I continue to explore evidence of scoring validity as reflected by

rater characteristics and the process of rating oral performance. The data used for my

analysis derive from the questionnaire survey and face-to-face interviews with

representative raters after the test day. Test-related documents and observational field notes

provided sources of essential information for me to learn about the rating and scoring

process. I collected test scores to examine the test score distribution to confirm whether the


test outcome and the candidates’ performance met the test score users’ expectations (R.

Green, 2013). Correlations of test score sets provided by paired raters helped me to estimate

the extent to which pairs of raters agreed on their ratings for the same candidates of the

same test rooms (inter-rater reliability). This chapter begins with a presentation about the

characteristics of oral raters as key subjects whose performance, including interaction with

the candidates, led to the outcomes of the oral assessment. In language testing, “practical

matters are also important for theoretical reasons” because oral test scores are “in some way

affected by the interlocutor (also the teacher rater in my study) and the nature of the

interaction” (Fulcher, 2003, p. 138). I present results regarding the operational practices of

rating and scoring applied at the institutions participating in my study. The purpose of this

section is to answer Research Question 2: To what extent are the oral raters consistent in

their rating procedures? My presentation focuses on the assessment criteria described in the

test guidelines and the scoring rubrics (rating scales) used to measure candidates' speaking

ability. My results section further explores the test score distribution and estimates inter-rater reliability in paired scoring of performances at each institution. I

integrate test raters’ and candidates’ perceptions and opinions with my statistical analysis to

obtain multi-dimensional insights into the oral rating and scoring process across the

institutions. I close this chapter with a summary of key points from the results presented.
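To make the inter-rater reliability estimate concrete for readers less familiar with correlational statistics, the short Python sketch below computes a Pearson correlation between two raters' score sets for the same candidates. It is a minimal illustration of the general technique only: the function and the score values are hypothetical and are not the actual data or software used in this study.

from math import sqrt

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length lists of scores
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical speaking scores (0-10 scale, 0.5 increments) awarded by a pair of
# raters to the same eight candidates in one test room
rater_1 = [6.0, 7.5, 5.0, 8.0, 6.5, 9.0, 4.5, 7.0]
rater_2 = [6.5, 7.0, 5.5, 8.5, 6.0, 8.5, 5.0, 7.5]

print(round(pearson(rater_1, rater_2), 2))  # about 0.94: a high level of agreement

A coefficient close to +1 indicates that the two raters ranked candidates very similarly, although it does not by itself show that they awarded identical scores.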

7.2 Test rater characteristics

Thirty-five EFL teachers participated in my survey. Participants were teacher raters in the

speaking test, or teachers in charge of Speaking classes with experience in oral

examinations as administered at the three institutions. Eleven of them (31.4%) were male

and 24 (68.6%) were female teachers. The majority of the teacher participants were

Vietnamese (94.3%). The others were foreign teachers (5.7%): one was a native English

speaker, and the other was a native French speaker. All of the teachers had obtained

professional qualifications in TESOL at Master's level or higher: 91.4% held an M.A. in TESOL and 8.6% held a doctorate. Most of the teachers were

aged between 31 and 40 years. Table 7.1 presents age profiles of the teachers in three

groups with none below 25 years.

Table 7.1 Age profile of EFL teacher participants in the survey

Age group        University A      University B      University C      Total
                 Freq.    %        Freq.    %        Freq.    %        Freq.    %
From 25 to 30    4        33.3     2        18.2     2        16.7     8        22.9
From 31 to 40    7        58.3     6        54.5     8        66.7     21       60.0
Above 40         1        8.3      3        27.3     2        16.7     6        17.1
Total            12       100      11       100      12       100      35       100

In general, most of the teachers and students shared the same first language

(Vietnamese). This can explain why English was not the only language used in Speaking

classes. Teachers may switch to Vietnamese when necessary. Figure 7.1 illustrates a

statistical result from the survey regarding the frequency of EFL teachers using spoken

English in their English classes. The majority of the teachers (68.6%) estimated that they

used spoken English most of the class time. Less than one-fifth of the teachers (17.1%)

responded that they always used spoken English in class. A very small number of the

participants reported that they seldom spoke English in class (5.7%) or used spoken English

about half of the class time (2.9%).

Figure 7.1 EFL teachers using spoken English in Speaking classes

[Bar chart of rate of agreement (%): Seldom 5.7, About half the time 2.9, Often 5.7, Very frequently 68.6, Always 17.1]

Using spoken English was completely appropriate in English Listening-Speaking

classes for EFL majors. Although the official course outlines did not mention the

mandatory use of 100% English, communication in these classes was expected to be in

English as often as possible. Some of the Listening-Speaking classes were taught by

native English-speaking teachers. Depending on the class level, Vietnamese teachers might switch to Vietnamese to clarify some teaching points.

The vast majority of the teachers had EFL teaching experience of 5 years or more

(85.7%). A small number (11.4%) had at least 3 but fewer than 5 years' experience of teaching English. There was only one teacher (2.9%) with fewer than 3 years' experience.

Teaching English speaking skills and acting as oral test raters were amongst the

teachers’ most frequent activities at tertiary level education. Thus, most of the teachers

(85.7%) responded that they had acted as raters in oral assessment more than five times.

Only one of them (2.9%) was new to this activity. The others (11.4%) had acted as oral raters between two and five times in their English teaching careers.

Oral raters inevitably bring their subjectivity into the rating process, which raises

concerns about their decision-making processes. That is why rater training and

standardisation for oral examiners are crucial to reduce rater variability and to ensure test

consistency (Taylor & Galaczi, 2011). Statistical results show that the teachers had received

varied amounts of training in oral assessment. The largest group of teachers (45.7%) reported that

they had been trained for rating spoken language for one month or less. Some of them

(17.1%) had more than one but less than six months’ training. However, the same number

of teachers (17.1%) admitted that they did not have any training at all. A slightly higher

percentage (20%) responded that they had received 6 months’ training or more.

When I examined the extent to which the teacher raters were satisfied with their

regular rating practice, I found that the proportion of raters’ complete satisfaction with their

rating was quite low. Most of the teacher raters responded that they were not completely

satisfied with their rating and scoring for the candidates’ speaking performance, but were

instead “fairly satisfied” (80%), “not very satisfied” (11.4%), and even “unsatisfied”

(2.9%). Only a modest number of the teachers indicated that they were “completely

satisfied' with their rating of learners' speaking ability. These respondents may have been amongst the very few teachers who shared with me in personal interviews that they worked casually as Cambridge ESOL Speaking Examiners and had received professional training in oral assessment.

7.3 Rating and scoring

The speaking test under study was an end-of-course exam for English majors who had

completed an integrated Listening-Speaking skills course as a compulsory subject in their

training programme. The main purpose of the test was to evaluate students’ academic

achievement after a course of instruction. Each institution applied its own assessment method and score calculation policies, which varied in the weighting of each component, the number of tests administered during the course, and the mandatory conditions for achieving a Pass

credit for this subject. Table 7.2 summarises the weighting of the Listening and Speaking

components in assessment and marking schemes adopted at the three universities.

Table 7.2 Assessment weighting of the Listening and Speaking components

University A
- Mid-term: Listening 15%, Speaking 15% (subtotal 30%)
- End-of-course: Listening 35%, Speaking 35% (subtotal 70%)
- Total: 100%
- Score calculation policies: passing grade 5/10 points; neither the Listening nor the Speaking component may be below 5.0 points; minimum class attendance (80%) is required.

University B
- Mid-term: Class engagement 5%, Online exercise 5% (subtotal 10%)
- End-of-course: Listening 50%, Speaking 40% (subtotal 90%)
- Total: 100%
- Score calculation policies: passing grade 5/10 points.

University C
- Mid-term: Listening 15%, Speaking Mini-test 1 15%, Speaking Mini-test 2 25% (subtotal 55%)
- End-of-course: Listening 22.5%, Speaking 22.5% (subtotal 45%)
- Total: 100%
- Score calculation policies: passing grade 5/10 points.

Note. All three institutions adopted a 10-point grading scale for calculating component scores (of each test paper) and total scores (of both Listening and Speaking components).
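To illustrate how such weightings combine into a course grade, the sketch below applies University A's scheme from Table 7.2 to hypothetical component scores on the 10-point scale. The function name, the chosen scores, and the way the pass conditions are checked are illustrative assumptions, not a documented institutional procedure.

def university_a_final(mid_listening, mid_speaking, end_listening, end_speaking):
    # Weighting taken from Table 7.2: mid-term Listening/Speaking 15% each,
    # end-of-course Listening/Speaking 35% each (all scores out of 10)
    final = (0.15 * mid_listening + 0.15 * mid_speaking
             + 0.35 * end_listening + 0.35 * end_speaking)
    # One possible reading of the pass conditions: final grade of at least 5/10
    # and neither end-of-course component below 5.0 (attendance check omitted)
    passed = final >= 5.0 and end_listening >= 5.0 and end_speaking >= 5.0
    return round(final, 1), passed

print(university_a_final(6.0, 7.0, 6.0, 7.0))  # -> (6.5, True) for these made-up scores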

Test development across the institutions highlighted remarkable differences in

assessment criteria, rating scale design, and procedures applied in marking English

speaking skills. These aspects are discussed in the following sections.


7.3.1 Oral assessment criteria

Prior to testing, assessment criteria were handed out to raters (Appendix D.1a-c). These

assessment criteria were predetermined at the beginning of the course, and included in the

course outline disseminated to teachers and students. The set of assessment criteria at each

institution was based on the course book in current use, from which a rating scale was developed for rating and scoring. Each institution selected its own course book, and the criteria for assessing speaking differed across institutions to suit the needs and purposes of assessment at each school. However, Table 7.3 shows some common criteria across the institutions for assessing speaking skills, such as pronunciation,

vocabulary (lexical resource), and grammar accuracy.

Table 7.3 Assessment criteria across institutions

Assessment criteria                University A    University B    University C
Pronunciation                      ✓               ✓               ✓
Vocabulary/Lexical resource        ✓               ✓               ✓
Grammar accuracy                   ✓               ✓               ✓
Fluency                            ✓               ✓               ✓
Interactional skill                ✓               ✓
Coherence/Discourse management     ✓               ✓
Lecture language                                                   ✓

In general, speaking skills were assessed with regard to the principal features of oral

production (Figure 2.1): Pronunciation, Vocabulary, Grammar, Fluency, Coherence, and

Interactional skills. The difference in the assessment criteria among these institutions was

the unequal attention paid to the use of interactional skills and cohesive devices (discourse

management). Two of the institutions included these features in their assessment whereas

the other did not. Only raters at one institution paid attention to test takers’ use of lecture

language as it was included in the description of assessment tasks (Appendix D.1c).

The weighting of each component in the assessment criteria was not explicitly stated, nor was it clear which criteria applied to each test task (or to both) when more than one task was used in a speaking test session. University A intended to measure

the six criteria (Pronunciation, Vocabulary, Grammar, Coherence, Interaction, and

Fluency), but did not specify how much each criterion counted towards the overall score.

The assessment criteria did not specify which components or aspects were more important

than others. University C established a different weighting for each component, specifically

Pronunciation 20%, Fluency 20%, Vocabulary and Grammar range 30%, understanding

and use of lecture language 30% in the total score. In contrast, University B had the same

weighting of 25% for each of the four components of assessment: Grammar and

vocabulary, Discourse management, Pronunciation, and Interactive communication

between pairs of candidates. Nevertheless, it was not specified which criteria the raters used

for each task.

The inadequate description of the assessment criteria made it very challenging for raters to ensure consistency in their rating and scoring. For example, interactional skills

assessment could not be applied in University A’s first task as this picture description task

required each candidate to perform a monologue without any interaction with his/her

partner or examiner. Relying on the criteria of interactive communication when rating

University B’s first task would be unfair for candidates because this question-and-answer

task involved only one-way responses from individual candidates – the interlocutor (also

the rater) asked, and the candidate answered.

7.3.2 Rating scales

Unlike objective scoring, in which examiners base their ratings on a provided answer key to decide whether an answer is correct, assessing speaking skills depends upon subjective judgement: the rater judges a candidate's performance with reference to a rating scale to

decide scores. In this sense, the rating scale served as a practical device that every rater

needed to understand clearly so as to ensure consistency in scoring the same test and

thereby ensuring that meaningful inferences can be made from the scores awarded. This

section explores what rating scales were used at the institutions and raters’ reflections on

using these devices.

University A adopted a double-rating format in oral assessment, i.e. each candidate

was independently rated by two examiners. These pairs of examiners used the same rating

scale for their scoring. The rating scale had six bands (Appendix D.1) with a minimum of


zero (no points) and a maximum of 10 points. The scoring scheme accorded with the

common 10-point scale applied to the majority of other university subjects. However, only

even-number scores (2, 4, 6, etc.) were demonstrated on the scale. The odd-number scores

(1, 3, 5, etc.) were left out. The rater first decided which band best described the

candidate’s overall performance. A candidate was awarded an even-number score (e.g. 6.0)

if his/her performance (almost) completely fitted with the descriptors of that band score.

Depending on the extent to which the candidate’s performance fitted with the band

descriptors, the rater made his/her own subjective decision to award an intermediate score (e.g. 5.0 or 5.5). Table 7.4 presents a summary of descriptors for the six band

scores from University A’s rating scale.

Table 7.4 Features to be assessed and scored in a rating scale (band scores 0, 2, 4, 6, 8, 10; descriptors run from the lowest to the highest band)

- Overall performance: not enough language to assess; very weak; weak; satisfactory; good; very good
- Message: difficult to understand; sometimes difficult to understand; mostly clear; clear; clear and complete
- Grammar and vocabulary: limited use; appropriate use; wide range
- Errors: frequent basic errors; frequent errors; some errors without impeding communication; few errors; few
- Coherence devices: some appropriate; mostly appropriate; appropriate
- Pronunciation: weak; rarely prevents understanding; facilitates understanding; helps understanding easily

A candidate’s oral performance may present (or the examiner may perceive)

different levels in various aspects of the rating scale. For example, a speaking sample may

reach a high band in grammar and vocabulary accuracy, but fall into a low band for clarity

of the message conveyed. However, there was no uniformity or agreement among raters on


how to deal with this issue. Further, a discussion task was included in the oral test, but the

assessments of “communication” and “interactional strategies” were not considered in

developing the rating scale. The products of the rating process were not sub-scores for the

components described in the rating scale, but two separate scores for two tasks. The final

score was the average of these two scores, which did not show how the rater actually rated each component. Although such a score may reflect the most important aspects of speaking ability,

“this type of (holistic) scoring is problematic because it does not take into account the

constructs that make up speaking” (Fulcher, 2003, p. 90).

University B adopted a different rating method using a 10-point scale. Two raters

were assigned to each test room. One of the raters, acting as the interlocutor, used a holistic (global) scale to make a general evaluation of each candidate's speaking ability based on the given tasks. The other rater, acting as an assessor and sitting a little further away from the interlocutor and candidates, used an analytic scale consisting of ratings on Grammar and vocabulary, Discourse management, Pronunciation, and Interactive communication. The holistic scale the interlocutor used to rate candidates' overall performance comprised five band scores, as shown in Table 7.5. As with University A's scoring method, the

interlocutor decided the extent to which a candidate’s oral performance fitted with the

descriptor of each band and awarded an appropriate score, e.g. 6.0, 6.5, or 7.0 (within the

third band).

Table 7.5 Example of a holistic rating scale for the interlocutor

- ≤ 4: Communication: below the 4-5 band. Utterances: separate words, no development.
- 4-5: Communication: frequent hesitation. Utterances: very short.
- 6-7: Communication: hesitation, ability to handle everyday situations. Utterances: longer utterances, but not able to use complex language.
- 8-9: Communication: some hesitation, familiar topics. Utterances: extended discourse.
- 10: Communication and utterances above the 8-9 band.

The analytical scale used at University B considered a candidate’s speaking session

in four equal components for rating (2.5 points each). Although the scale could provide a

more detailed evaluation of a candidate’s strengths and weaknesses, it did not help the scale

user to decide specific scores for each component (see Appendix D.1). For example, the

maximum score for Grammar and vocabulary was 2.5 points, but the scale did not specify how grammar and vocabulary were weighted within that component, or whether grammar was more important than vocabulary or vice versa. The terms "good" (describing the degree of control of simple grammatical forms) and "appropriate" (describing the range of vocabulary) left room for subjectivity because there were no specifications of what was to be rated as "good" or "appropriate". It was also not transparent whether a "good" and/or "appropriate" candidate would be awarded 0.5, 1.0 or 1.5 points for the category of Grammar and vocabulary in the analytic scoring rubric. Each rater estimated and made decisions by themselves.

Raters at University C did not have a specific rating scale to guide the rating

process. Each test room had one examiner who was also the teacher in charge of speaking

lessons. Although the raters used the assessment criteria as defined in the course outline,

each made a judgement of test takers' oral performance mostly by relying on their own estimation of the relevant attributes and their personal testing experience.

None of the institutions' testing guidelines provided clear instructions on how raters

should use the rating scales. There was no meeting prior to the oral test for raters to

understand what the assessment criteria meant, and whether the candidate’s performance

should be rated task by task or if the same criteria would be applied to all tasks. Even at

institutions where both holistic and analytical criteria were used, my findings from test

room observation, test scores and interviews with raters indicate that there was not a mutual

agreement between pairs of raters whether to compare raters’ scores after each candidate

finished their session or at the end of the test day. It was not apparent whether each pair of

raters adjusted their scores to accord with the other’s after discussion, or if both raters

would agree on the final score by averaging the two sets of scores.

7.3.3 Oral rating process

The practice of rating oral performance at all the institutions was based on direct

(live) speech samples. The rater elicited candidates’ speaking samples via test tasks, and


decided scores on the spot because no audio recording was made for later assessments.

Raters had to evaluate many aspects at once and arrive at a final score for each

candidate immediately after each speaking session. In the next section, I present raters’

awareness of the problems associated with oral assessment that they may have brought into

the marking process. I then outline the difficulties for examiners at each institution and the

issue of consistency in assessing speaking skills supported by evidence collected from the

questionnaire surveys, interviews and observation hours in test rooms.

7.3.3.1 Raters’ awareness of oral assessment

Rater arrangement varied across the institutions. Raters (examiners) involved in the oral

assessment were EFL teachers at the Faculty of foreign languages who had experience in

teaching and testing speaking skills. At one institution, each teacher played the role of

examiner and scorer for the speaking class(es) he/she had been in charge of. At another

school, all the teachers of speaking classes took part in the test as raters together with other

teachers. There was a switch among the teachers so that a teacher was not the interlocutor

or assessor of the students in his/her own class. The other institution did not assign the

teachers in charge to be involved in the test, but invited other teachers in the EFL Faculty to

be examiners.

When asked about the purpose of oral testing for English major students, all the

teachers agreed that this was a multi-purpose test (Table 7.6). The main purposes identified

were to measure the degree to which students had successfully learnt the material and skills covered in the course (80%), to measure students' spoken language proficiency (77.1%), and to decide whether students should be promoted to the next level of study (62.9%). More than half of the respondents (60%) also thought the test aimed to determine students' strengths and weaknesses in their learning process.

Table 7.6 Purposes of the oral test for EFL majors

Purposes of oral assessment (The test was designed to…)    F    %
(1) measure students' spoken language proficiency    27    77.1
(2) measure the degree to which students have successfully learnt the material and skills covered in the course    28    80
(3) decide whether they should be promoted to the next level of study    22    62.9
(4) determine the students' strengths and weaknesses in their learning process    21    60
(5) learn whether the current curriculum fits with the students' English levels    13    37.1
(6) identify students' levels of English to put them into the right English class    5    14.3
(7) identify students that have aptitudes for English learning    5    14.3
(8) decide whether they are qualified to study abroad    4    11.4
(9) compare the university students' English levels with those at other universities    3    8.6
Note. F = frequency

Oral assessment was a regular school-based practice upon the completion of each

course. Despite yearly adjustments in the test contents and methods to suit the course

objectives, all the interviewed teachers asserted the legitimate importance of testing

speaking skills in EFL pedagogy. The reasons were not merely because speaking is one of

the four essential language skills for English majors that need to receive equal attention

(U2.T1, U3.T1), but in a broader sense,

In regard of educational aspect, I find testing (speaking skills) is quite important in

that, first, we (teachers) can know where our students’ spoken English level is.

Second, we can evaluate whether our teaching process is effective or not. Further, it

reveals useful clues to what aspects in the current syllabus need improving. (U1.T2)

In line with these opinions, an experienced teacher rater shared her personal

viewpoint regarding the purpose of the test for different stakeholders:

I think it [assessing speaking] is very important. It is an opportunity for those

involved (in the test) to review the process of teaching and learning. Especially for

the learner, they can evaluate their own study, how they have learnt, what skills and

knowledge they have achieved. For the teacher, they can revise their teaching

practice, how the syllabus works, whether the course book is good. And for the


faculty, they can review their professional management, whether it has met the

requirements of the standardised outcome quality for particular language skills

levels. (U1.T3)

7.3.3.2 Difficulties for examiners in the rating process

Most of the raters perceived challenges in oral rating. Findings from the survey indicate

that the majority of the raters found oral assessment somewhat difficult (57.1%) or fairly difficult (31.4%). Some of them (5.7%) even found oral rating very difficult. This result helps explain why the majority of raters did not express complete satisfaction

with their rating. Figure 7.2 shows raters’ observations about challenges they faced in oral

rating.

Figure 7.2 Challenges for raters in oral rating
[Bar chart of rate of agreement (%): Students' English level 40, Class size 71.4, Noisy testing environment 28.6, Inadequate time 54.7, Lack of training 48.6, Insufficient aids and facilities 14.3, Other 2.9]

As can be seen from Figure 7.2, class size, time pressure and lack of training were the three most problematic factors. Rating too many candidates (30-35 students in each

test room) might result in insufficient samples of candidates’ speaking being elicited or

inaccurate judgements being made within such a limited amount of time (from 2 to 2½

hours). A lack of training also contributed to making oral rating a difficult task for many raters.

Almost half (45.7%) responded that they had received just one month or less of formal


training in oral assessment. A number of them (17.1%) did not even have any training at all

before sitting as raters for oral tests.

The above results were echoed by raters’ opinions when interviewed. Many teacher

raters said that they had difficulties completing the oral rating within the period of time

allotted. At two institutions, the average number of candidates in each test room was

between 30 and 35, and the raters had to finish everything from checking candidates’

identity, delivering tasks, listening to candidates’ responses, comparing scores with the

other rater (in the case of double rating), to completing the final scoring sheet and returning

it to the examination board. Candidates each had an average of 4 minutes to perform their

speaking tasks. One rater reported that she had to be an oral examiner for the whole

morning. Then in the afternoon, after 1 hour for the listening test, she and her colleagues

had to be raters again, which was quite “physically challenging” (U1.T3).

Time pressure not only affected the raters’ physical condition, but also impacted the

quality of oral assessment. When I asked about difficulties in rating, an experienced teacher commented:

The time was not enough for the whole class. Each candidate had to speak less

because big classes are very common at Vietnamese universities. The smallest class

has at least several tens of students, so the oral test is limited to some degree. The

examination would be administered more carefully, and the rating would be more

accurate if the number of students could be smaller. (U3.T1)

To achieve rating objectivity, test administrators at one institution did not let the

teacher in charge of a speaking class take the role of the rater for his/her own class, but

invited other teachers to be the raters. This arrangement gave a sense of fairness to all the candidates because they were assessed by raters who were not teaching them, so teachers' possible prejudice towards particular students in class was minimised. Further, teachers might take more responsibility for their teaching as their students were to be examined by other teacher raters. The test also provided teacher raters with a good opportunity to familiarise themselves with the EFL students' levels in the Faculty. However, this arrangement caused considerable trouble for some raters who knew neither the candidates nor the speaking

skills programme they had learnt. The following extract from an interview with a teacher

rater illustrates this difficulty:

The biggest challenge for me was that I was not the teacher-in-charge [of a

Speaking class] and did not teach that program, so I did not know what the course

objectives were. When rating, I had to associate it with my personal experience,

which was too far out of date. I do not know what the student’s current level is, so I

just estimate… I obviously know such oral performance for 3B [the level being

tested] was so good, so I gave a maximum score, whereas my co-rater was more

severe in scoring (U1.T3).

7.4 Raters’ consistency in oral rating

All the raters were Vietnamese EFL teachers who had experience in teaching and assessing

oral skills in the EFL Faculty of each university. However, institutions had different ways

of assigning teachers to the rater positions. No meetings were scheduled prior to the test

time for the raters to discuss and reach a shared understanding of the criteria and agree on methods for oral

testing. Therefore, variations in rating were inevitable across examiners, even in the same

institution using the same testing material and scoring rubric. The following sections

present the ways raters performed rating and scoring, what aspects in candidates’ speaking

the rater assessed, what factors were not rated but could help candidates to score bonus

points, and whether or not the familiarity between the rater and candidate had an influence

on scoring. The data for analysis in this section were gathered from test room observation,

the EFL teacher questionnaire survey and interviews with individual teacher raters.

7.4.1 Scoring methods

Scoring and calculating scores revealed remarkable similarities and differences across

institutions. All used the format of face-to-face oral testing. By this method, the rater

assessed candidates’ live performances in the test room and the test scores were decided

immediately after the speaking test. Candidates were called into the test room either in pairs

(Universities A and B) or as individuals (University C). The number of raters and their

roles in the test room were different across the institutions.


University A adopted a double-rating format with two raters using the same rating

scale and the same scoring method. The two negotiated with each other to decide who would act as the interlocutor delivering the two tasks to each test taker. It was also the interlocutor

that checked each candidate’s identity and randomly selected a set of test tasks for each pair

of candidates. Each rater listened to the candidate speaking and used the 0-10 scoring rubric

with 0.5 band increment (e.g. 5.0, 5.5, 6.0, and so on) to record scores for each task

independently onto two separate copies of the scoring sheet (Appendix D.1). Each rater

awarded a candidate three scores: one score for each test task, and one averaged from the

task scores. At the end of the test, the two raters compared their sets of scores for all

candidates. Both raters in the test room had an equal role in scoring. The final scores

recorded on the official scoring sheet for each candidate were averaged from the scores of

each examiner. The scoring sheet did not reflect sub-scores for each rating category (e.g.

Pronunciation, Vocabulary, Coherence, etc.) but a holistic score for each test task.

University B applied a double-rating format with two raters using different rating

scales for different scoring methods. There were two raters in each test room. One rater

played the role of an interlocutor performing similar steps to University A: checking

candidates’ identities and choosing and delivering test tasks to pairs of candidates. The

other, playing the role of an assessor, did not interact with candidates but listened to their

oral performances and scored. The assessment sheet contained two parts: one for the

assessor (Examiner 1) to give four detailed sub-scores based on the categories described in

the rating scale, e.g. Grammar and Vocabulary (2.5 points), Discourse management (2.5

points), Pronunciation (2.5 points), Interactive communication (2.5 points); and the other

for the interlocutor (Examiner 2) to award each candidate a single score based on holistic

judgement (see Appendix G.1). The final score for each candidate was the average of the

score given by the interlocutor and the total of the scores awarded by the assessor. All

applied a 0-10 band scale, with a 0.5 band increment.
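
To make these two calculation rules explicit, the following minimal Python sketch is offered as an illustration only: the function names are my own, and the final snap back to the 0.5 band is an assumption, since the thesis does not state how averaged results were rounded.

    def university_a_final(task_scores_rater1, task_scores_rater2):
        # University A (as described above): each rater averages his/her two
        # task scores; the official score is the mean of the two raters' averages.
        avg_rater1 = sum(task_scores_rater1) / len(task_scores_rater1)
        avg_rater2 = sum(task_scores_rater2) / len(task_scores_rater2)
        return (avg_rater1 + avg_rater2) / 2

    def university_b_final(assessor_subscores, interlocutor_holistic):
        # University B: the assessor's four sub-scores (maximum 2.5 each) are
        # summed, then averaged with the interlocutor's holistic score (0-10).
        return (sum(assessor_subscores) + interlocutor_holistic) / 2

    def to_half_band(score):
        # Assumed step: snap the averaged result back to the 0.5 band used on
        # the scoring sheets; the rounding rule itself is not documented.
        return round(score * 2) / 2

    # Hypothetical candidates (not the study's data)
    print(to_half_band(university_a_final([7.0, 8.0], [6.0, 7.0])))      # 7.0
    print(to_half_band(university_b_final([2.0, 1.5, 2.0, 2.0], 8.0)))   # 8.0
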

University C employed a single-rating format with only one rater deciding scores

for all the candidates in each test room. As mentioned in the test administration section

(Chapter Four), there was only one examiner in each test room at University C. The

examiner in each test room was the teacher of each speaking class being assessed.


Examiners played all the roles of interlocutor, assessor, and scorer in order to complete the

rating task. The scoring sheet (Appendix D.1) contained ratings for the four categories

which were weighted differently: Pronunciation and intonation (20%), Fluency (20%),

Understanding and use of lecture language (30%), Content including vocabulary and

grammar range (30%). However, there were no descriptors to help raters to judge and score

candidates’ responsive performances in reference to these categories. The teacher raters

made their own estimations and decisions on the scores to be given. In the two test rooms observed, scoring was between 0 (minimum) and 10 (maximum), which was quite similar to the other institutions. However, one rater applied a 0.5 band increment, whereas

the other recorded the speaking score with a 0.1 band increment (e.g. 5.0, 5.1, 5.2, and so

on).
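
Read this way, University C's scoring sheet implies a weighted combination of the four category ratings. The sketch below is my own reconstruction under that assumption: the thesis confirms only the weights, not the exact combination mechanism, and the dictionary keys are shortened labels of my own.

    # Assumed reading of University C's sheet: each of the four categories is
    # rated on 0-10 and combined with the stated weights.
    WEIGHTS = {
        "pronunciation_intonation": 0.20,
        "fluency": 0.20,
        "lecture_language": 0.30,
        "content_vocab_grammar": 0.30,
    }

    def weighted_oral_score(category_ratings, increment=0.5):
        # Combine the category ratings and snap the result to the rater's
        # preferred band increment (0.5 or 0.1 were observed in the test rooms).
        raw = sum(WEIGHTS[c] * category_ratings[c] for c in WEIGHTS)
        return round(round(raw / increment) * increment, 2)

    ratings = {"pronunciation_intonation": 7.0, "fluency": 8.0,
               "lecture_language": 6.0, "content_vocab_grammar": 7.0}
    print(weighted_oral_score(ratings, increment=0.5))  # 7.0
    print(weighted_oral_score(ratings, increment=0.1))  # 6.9
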

I have analysed how oral rating methods differed across the tertiary institutions.

Differences in examiner arrangement and scoring procedures were unavoidable variables affecting the effectiveness of school-based oral assessment. The lack of rater training and of well-defined descriptors for the rating scales made the process of marking oral skills even more challenging for large-sized classes. Insufficient testing guidelines and a lack of consensus among raters in understanding the assessment criteria affected the consistency of raters' judgements within the same institution. In the following sections, I explore how testers and testees perceived the rating and scoring of oral skills.

7.4.2 Aspects of rating in oral assessment

A central question in scoring validation is what aspects the rater actually pays attention to

when assessing and marking candidates' speaking performance. This is information that many candidates want to know, and should be permitted to know, prior to the examination so that they can adopt effective strategies in their language study and test preparation.

Rater participants’ responses in the survey indicate that pronunciation, vocabulary,

and fluency in spoken language were the most important categories that the overwhelming

majority of examiners paid attention to in the testing environment. Among those aspects,

fluency in oral performance received unanimous support (100% agreed) as speaking skills

played an integral part in the quality of the EFL majors’ outcomes. The raters all expected


tertiary graduates to be able to speak the target language fluently to serve their future

careers. In addition, most of the speaking raters (80%) agreed that good intonation was also

an essential aspect of pronunciation that Vietnamese students needed if they wanted to get

high scores in the oral test.

Contrary to candidates' belief that grammar was an important criterion in speaking assessment, which made them very anxious about making grammatical mistakes in the oral test (see 6.2.3), raters paid less attention to grammatical accuracy (74.3%) than to content (77.1%) and the use of cohesive devices in speaking (80%). The rater respondents added no other

comments to this survey question. Figure 7.3 illustrates a comparison of the raters’

attention to different aspects in face-to-face speaking assessment.

Figure 7.3 Aspects in candidates' oral performance the rater paid attention to

Most of the interviewed raters were aware of adhering to the predetermined criteria

for oral assessment. The following quotes extracted from interviews with individual raters

help to support results from the questionnaire survey and the above-mentioned document

analysis regarding to oral rating.

I had to follow the assessment criteria the (EFL) Faculty established. First,

pronunciation. Second, lexical range and accuracy. Third, grammar range and

accuracy. Finally, candidates’ ideas and (idea) organisation. (U1.T2)

[Figure 7.3, bar chart: rate of agreement (%) with each rated aspect — Pronunciation 97.1; Intonation 80; Grammar accuracy 74.3; Vocabulary 85.7; Fluency 100; Appropriate body language 40; Coherence 80; Interactional competence 82.9; Content 77.1]


Raters found the descriptors included in the scoring rubric useful, as they would know what

aspects of candidates’ speaking were to be rated and scored:

While rating candidates’ oral performance, I relied on the descriptors in the scoring

rubric provided by the group leader of listening-speaking teachers. The descriptive

table comprised such categories as vocabulary and grammar, discourse

management, pronunciation, interaction, and global achievement. (U2.T1)

Other raters relied on the assessment criteria established by their institution, which they were not allowed to change:

I relied on the criteria for rating provided by the school. The first criterion is

pronunciation. Pronunciation includes intonation and stress – word stress and

sentence stress. The second is idea development. The third is accuracy in

vocabulary and grammar. And the fourth I’d like to tell you is a focused criterion,

that is lecture language, that requires students to know expressions lecturers

commonly use. (U3.T1)

As shown in Figure 7.3, candidates’ use of body language in the oral test received

the least consideration from raters compared to the other aspects. This was not among the

raters' main concerns because most of the second-year students were mature enough to have appropriate postures and behaviours in non-verbal communication, and inappropriate body language was rarely found in the raters' experience (U1.T2, U3.T3). Further, some teachers reminded their classes to demonstrate suitable behaviour prior to the test.

I reminded the students that they needed to be properly-dressed on the test day, pay

attention to their body language, and the way they were sitting when face-to-face for

oral tests. Besides teaching English language, I would also guide communication

skills, English-using skills in real life. (U3.T1)

Many raters affirmed that candidates’ facial expressions or body language might

have distracted their attention for the first moments when the examinees entered the test

room, but these factors did not affect their scoring at all (U1.T2, U2.T1, U2.T3). However,

for other raters, inappropriate behaviour could result in some minus points. When asked

whether or not candidates’ appearance or body language had impacted rating, a rater said:


Some students had the habit of shaking hands and legs like sitting at home, or as

comfortably as in a café. In such cases the teacher reminded gently so that the

student’s gestures would be more formal again. Or while taking the test, some

candidates shook their heads left and right, or swayed on the chair. There were so

many situations like that. If the rater gave reminders again and again, there would

be some slight deduction in marking, usually between 0.1 and 0.5 per cent of the

score. (U3.T2)

Appropriate use of body language could also be a positive way to earn candidates

bonus points. Such flexibility in rating outside the scoring rubric could be explained by some raters' belief that "it looks more natural to use body language while

talking. In the future in real communication, it will help students communicate more

confidently” (U1.T3). Awarding bonus points to candidates for using appropriate types of

non-verbal communication is analysed in more detail in the next section.

7.4.3 Giving bonus points

Although not explicitly stated in any of the scoring rubrics or test guidelines, bonus points

awarded represented the rater’s recognition of some positive aspects of a candidate’s

performance associated with the speaking skills being tested. Results from the

questionnaire survey indicated that the majority of raters appreciated candidates speaking

with a clear voice. This was the first condition necessary for the rater to obtain sufficient

speaking samples to assess, since many candidates spoke too quietly, whether because of their naturally soft voices, lack of confidence, or nervousness during the oral test. If a candidate's

voice was not clear enough, then it was difficult for the rater to make a proper assessment.

Live speaking tests enabled raters to have more insights into non-verbal

communication including facial expressions, (hand) gestures, eye contact, posture, tone of

voice, etc. More than half of the raters (57.1%) tended to award bonus points for

candidates’ appropriate use of these wordless signals while speaking as they were obvious

indicators of the ability to communicate effectively. This viewpoint received strong support

from many raters in that “unlike interaction with the computer screen, face-to-face

communication needs to be accompanied with gestures, manners, facial expressions, and

the like” (U1.T3). Raters’ bringing these perspectives into their practice of oral scoring was


in line with experts’ advice on communication skills for maintaining and developing

successful everyday relationships, both personally and professionally (Doyle, 2017; Segal et al., 2017). Table 7.7 presents situations in which oral raters tended to give the candidate

bonus points in addition to those for his/her oral performance.

Table 7.7 Raters’ tendency of giving bonus points

Bonus points given to Frequency %

Clear voice 29 82.9

Punctuality 3 8.6

Good behaviour 10 28.6

Good appearance 0 0

Being smartly-dressed 0 0

Good listening comprehension 19 54.3

Non-verbal communication 20 57.1

Other 3 8.6

More than half of the rater respondents (54.3%) admitted that they were ready to

offer bonus points to candidates who demonstrated good listening skills (Table 7.7). Two

reasons for this are that speaking and listening skills were integrated into one course, and that a good communicator, in the theory of language competence, not only needs to understand what another speaker is saying but must also make good decisions about when to respond (Buck, 2001).

Raters possibly added bonus points when they found some complementary factor in the candidate's speaking performance, e.g. a clear voice, native-like intonation, or quick responses. Awarding bonus points was not included in the institutional scoring guidelines but came from raters' own decisions. Raters admitted their subjectivity in scoring (TQ8). "Some teachers did not use the scoring rubrics at all" (TQ5) but relied on spontaneous personal estimation.

Table 7.7 indicates that no raters expressed a preference for a candidate’s good-

looking appearance or being smartly-dressed because these aspects were deemed personal

traits. Raters were aware that this language test was carried out in an educational setting


and for academic purposes. Test takers were language learners, not candidates for a job

interview. However, one rater advised that “dressing appropriately is a type of non-verbal

communication, students should be aware of what they wear to communicate through their

outer appearance” (TQ19). My observational field notes show that there were no cases of badly-dressed candidates attending the tests, which is in accordance with a

rater’s comment:

Most of the students dressed a little casually, or in a relatively comfortable way. I

have never seen any students who dressed slovenly to the test room. (U1.T2)

Rater respondents recalled several circumstances in which candidates’ speaking

performance was worth considering for bonus points. For example, candidates who

demonstrated “uncommon lexical resources” (TQ23) beyond the course book, or superior

to the level of the average learner, usually received additional marks from the rater, as did test takers who showed good knowledge of the topic they were talking about (TQ24).

More importantly, many raters sought evidence of candidates’ confidence in speaking

during the test (U2.T2, U3.T3). Unlike speaking practice in class, which occurs more naturally and free from the pressure of being assessed, speaking in a testing context requires

examinees to exhibit not just oral skills but the psychological factors necessary to be able to

perform test tasks as expected. Confidence in performing speaking skills was an aspect that

raters usually considered awarding bonus points for. Below is what one rater appreciated

and shared with me in an interview after the test:

Speaking with an attitude of confidence is important in oral tests. If a candidate is

confident, then though his/her performance is not good, people are able to

understand what he/she wants to express. The candidate’s confidence keeps

himself/herself calm so he/she can speak better. (U2.T2)

7.4.4 Familiarity between the rater and candidates

My investigation into oral test administration highlighted different ways the institutions’

human resources departments nominated examiners. One institution nominated teachers

who had not taught the listening/speaking course for second year students to be oral test

raters in that semester. Another institution nominated the teachers teaching the course to be


raters but not for the classes they had taught. The other let the teachers-in-charge be raters

for the classes they were teaching.

The degree of familiarity between raters and candidates varied across institutions

from low to very high. Two of the institutions rotated teacher raters among classes in an attempt to minimise possible familiarity between raters and candidates

within test rooms. It was impossible to assert that the scorers and test takers were

completely unknown to each other because all the raters were EFL teachers working in

those institutions, either officially or as guest teachers. If they were not teaching the

candidates that semester, they might have taught them already in previous semesters.

To achieve consistency in rating, oral raters adhered to the assessment criteria and the rating scales provided, in which there were no descriptors associated with candidates' performance in class or anywhere else the raters might have known them from before. If a rater happened to score a candidate he/she was familiar with, familiarity with the examinee's speaking ability might have interfered with the rating. Results from the survey demonstrate that the frequency of

rating affected by candidates’ performance in class varied from ‘Never’ to ‘Very often’.

Almost half of the raters (48.6%) admitted that their rating was sometimes influenced by

candidates’ class performance. Table 7.8 presents a joint display of raters’ agreement on

how often their rating was influenced by their familiarity with candidates’ class

performance and examples of further comments made on each category of frequency.


Table 7.8 Oral rating affected by candidates’ performance in class

Category F Percent Comments

Always 0 0% -

Very often 3 8.6% In-class performance was important; test time was too short to judge; learning English is a process; a small oral test cannot tell students’ speaking ability, etc.

Usually 6 17.1% Speaking in class could tell how well a student can perform in exam; long-term evaluation was more reliable than short-term; students’ speaking ability needs time to improve, etc.

Sometimes 17 48.6% In-class performance better reflected language competence; students might not be well-prepared for the test; students’ attitude to learning and their friends/teachers was important; test time was not long enough; acquaintanceship with a candidate makes the rater form a certain grade or score to fit the candidate without even listening to his/her performance in the test, etc.

Rarely 4 11.4% Tried to rate independently of other factors; I want to be objective and fair in rating, etc.

Never 5 14.3% Just paid attention to test task performance; performance at different times may vary, etc.

Total 35 100% 23 comments made in total

Note. F = frequency

Familiarity may not have been arranged purposefully by the institutional test

administration, nor was it the raters’ and candidates’ intention, but it did exist in school-

based assessment. The following rater's quote demonstrates that familiarity did intervene in normal rating, in this case through score adjustment. The score awarded would have been

different if the rater had not related the candidate’s performance to what he/she already

knew about the candidate’s ability.

If I happen to be a rater for a class I had taught, it is true that my rating is affected at

some degree by what I have known about the class speaking ability during the

course. For example, I have a good impression of a candidate’s ability, but for some

reason possibly a health problem, or a difficult topic, or something else. Usually I

know his ability is very good at 9 points, but here he can perform just at 7 points.

That is not right. So I can adjust it (his score) to 8 points. (U1.T2)


Another rater from the same institution reported a case in which familiarity helped her confidently decide a score for a candidate's speaking. However, her scoring did not correlate well with that of her co-rater, who was not familiar with the candidate. In her words:

I admit that familiarity did affect rating. It was impossible not to. There was a case

in which familiarity (with a candidate) affected my rating and scoring, as I’ve told

you, the fellow rater gave a severe mark, but I gave a maximum of 9.5 or 10. As I

said, the degree of familiarity did not mean that I was flexible to give a favourable

score to that student, but because I understand her language ability. I must admit

that when I gave a 10, I myself was confident that was because I know her so well,

and where her ability was. (U1.T3)

Comments and stories from raters suggest that a variety of potential biases in

measurement may interfere with direct speaking tests. These influential factors derived from the design of the rating scales, the training of raters, and the scoring procedures determined for the test. Raters played the most decisive role in the rating process, which may therefore be affected by numerous subjective factors including misinterpretation of the assessment criteria, personal preferences, the performance of co-raters, and familiarity with candidates. In

the next section, I present survey results on test raters’ and candidates’ perceptions about

the practice of rating and scoring which they had experienced.

7.5 Test raters’ and candidates’ perceptions about the practice of rating and scoring

Part of the questionnaires for raters and test takers was designed under the subheading

‘Rating and scoring’ to find out the stakeholders’ overall opinions and perceptions about

the operational practices of assessment and marking at their institutions, which directly

involved examiners and examinees. The following sections present results from the data

integrated from the surveys, interviews, and document analysis.

7.5.1 Test raters’ perceptions of the rating process

Table 7.9 presents descriptive statistics of test raters' perceptions of the rating process. The seven items received means ranging from 2.80 to 4.57. The highest mean

value (M = 4.57) from 35 raters’ responses suggests that ensuring consistency in scoring

was very important to raters (item 1). More than half (62.8%) even said they strongly


agreed. This is in line with the principles of language testing that an effective oral test

needs to produce consistent results at different times, across different raters, and different

test takers (Bachman, 1990; Brown, 2004).

The rater respondents agreed that they concentrated on the task without disturbance

or distraction during the rating process (M = 3.69). This result reflects the seriousness of

rating performance although the oral test under study was not a high-stakes examination. It

was an institutional exam and its results were for internal use only. This positive attitude

was supported by a general tendency of raters to keep the scoring process consistent from the beginning until the end of the test. A low mean (M = 2.80) indicates that not many

raters (31.4%) agreed that their rating performance was more consistent at the beginning

rather than the end of the test day (item 7). However, interviewed raters admitted that in

practice, ensuring consistency in marking was challenging. To achieve absolute consistency

in marking was impossible (U2.T2). Individual raters’ consistency in marking could be

estimated to a relative degree, approximately 60-70% (U1.T1, U2.T1, U2.T3, U3.T1,

U3.T2). The reasons for this inconsistency varied across raters, e.g. a rater’s different

interpretation of the descriptors in the rating scale, fatigue caused by a long list of

candidates, personal impressions of a candidate’s swift wit, raters’ unequal interest in

candidates’ topics, raters’ support in giving hints when candidates had difficulty

understanding a test question, etc.

Few raters (M = 3.14) found it difficult to decide scores awarded to those who were

close to the Pass/Fail boundary (item 4). The raters understood that the scores for speaking

skills would be counted together with those for listening skills to decide whether a student

would be promoted to the next level. At the time of scoring a candidate’s speaking

performance, the rater could not determine whether a candidate had passed or failed the

course. A candidate may not receive a good score for speaking skills, but if he/she

performed well in the listening test, he/she would be able pass the course. One of the

institutions combined the listening and speaking test results, but set a pass condition that

neither of the components (listening or speaking) was allowed to be lower than 50%.

Otherwise, the student would fail the entire listening-speaking course and would have to enrol in the course again. In general, candidates reported that the speaking test scores affected them


a great deal personally in terms of anxiety and emotional tension (56.8%), and also their

overall academic result (81.8%). Most affected of all was students' motivation to learn English (82.4%). The impacts of test scores in particular, and of the current oral assessment in general, are discussed further in the next chapter.
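
Returning to the pass condition reported for one of the institutions, the rule amounts to a simple conjunction of thresholds. The following sketch is illustrative only: the text specifies the 50% floor for each component but not how the combined course grade itself is calculated.

    def passes_listening_speaking(listening_score, speaking_score, max_score=10.0):
        # Pass rule as reported for one institution: the listening and speaking
        # results are combined, but neither component may fall below 50% of its
        # maximum; otherwise the whole listening-speaking course is failed and
        # must be taken again.
        threshold = 0.5 * max_score
        return listening_score >= threshold and speaking_score >= threshold

    print(passes_listening_speaking(7.0, 4.5))  # False: speaking below 50%
    print(passes_listening_speaking(6.0, 5.0))  # True
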

In line with features identified as indicators of fluency in spoken language testing

(Ellis & Yuan, 2005; Gan, 2016), the second highest mean (M = 3.83) suggests that most

raters estimated the degree of fluency by considering the amount of repair (reformulations),

and hesitation (pauses) in students’ speech. Table 7.9 presents the means and standard

deviations for raters’ perceptions of the oral rating and scoring.

Table 7.9 Means and standard deviations (SD) for raters’ perceptions of the oral rating and scoring

Statement N Mean SD

(1) Ensuring consistency in scoring was very important to me. 35 4.57 .70

(2) I was able to work without any disturbance or distraction during the rating process. 35 3.69 .87

(3) The amount of repair and hesitation in students' speech was counted in my evaluation of their fluency. 35 3.83 .77

(4) I found it difficult to decide the score awarded to those who were close to the Pass/Fail boundary 35 3.14 1.03

(5) Candidates’ non-verbal behaviours affected my judgement. 35 2.94 1.14

(6) I focused on the number of errors candidates made per stretch of speech 35 2.91 .98

(7) I tended to do the rating more consistently at the beginning than at the end of the test. 35 2.80 .96

Valid N (listwise) 35

Note. 1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree

A focus on the number of errors that a candidate made per stretch of speech did not attract strong agreement (M = 2.91) as a way of measuring accuracy, although accuracy in oral production has been represented in the literature by the proportion of error-free clauses (Skehan & Foster, 1997; Iwashita, Brown, McNamara, & O'Hagan, 2008). This scoring tendency can be explained by tolerant attitudes towards learners' errors and/or the learners' limited foreign language capacity. A rater


shared her personal viewpoints after the test when asked about whether she had an error-

focused or achievement-oriented tendency in marking oral performance:

I usually seek good points for scoring rather than looking for weaknesses for

subtracting marks. Candidates were not so good for me (as test rater) to count minus

points. Normally for students at that level, I am trying to find plus points, but not

minus points. (U3.T1)

The statement that raters' judgement was affected by candidates' non-verbal behaviours (item 5) received a low mean (M = 2.94), which suggests that most raters paid attention to the constructs being assessed; it also received the highest standard deviation (SD = 1.14). Not many raters took examinees' wordless behaviours into consideration, but there were marked differences in raters' perceptions of whether or not the appropriateness (or inappropriateness) of candidates' non-verbal signals was an issue that could affect their test scores.

Further analysis of raters’ tendency in giving bonus points for appropriate use of non-verbal

communication has been presented in previous sections of this chapter.

7.5.2 Candidates’ perceptions of the rating and scoring

I investigated candidates' perceptions of the rating and scoring process through seven statements included in the EFL students' questionnaire. Table 7.10 shows the results of the

descriptive statistics of candidates’ responses. The items received mean values ranging

from 2.84 to 3.82. In general, candidates had positive perceptions of the practice of rating

and scoring adopted at their institutions.


Table 7.10 Means and standard deviations (SD) for candidates’ perceptions of the oral rating and scoring

Statement N Mean SD

(1) The judgement should be done independently by two raters. 352 3.82 .94

(2) The rater was fair in scoring. 352 3.75 .78

(3) The speaking session should be audio-recorded in case reconsideration will be needed later. 352 3.63 .93

(4) The rater was consistent in asking questions. 352 3.61 .78

(5) I would like my classmate(s) to take part in assessing my speaking performance. 352 3.11 1.09

(6) My teacher was generous in his/her scoring. 352 3.02 .76

(7) My speaking performance in the test should be used as a solo determiner of my score, not including my speaking performance in class.

351 2.84 1.12

Valid N (listwise) 351

Note. 1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree

The first statement, with the highest mean value in this group of questions (M = 3.82),

indicates that test takers were in favour of having paired raters in each test room. The

inclusion of two independent raters in oral assessment has been widely supported in theory

(Bachman, 1990; Taylor & Galaczi, 2011), and in previous research to enhance inter-rater

consistency once the raters were trained properly and regularly (Restrepo & Villa, 2003).

When interviewed about a preference of whether each candidate’s speaking session should

be rated by one or two raters, most of the candidate interviewees said they thought having

paired raters was better for scoring. The following are comments and reasons the

candidates provided in group interviews after the test. Candidates understood that human

raters were different in perceiving their oral production ability:

In my opinion, there should be two raters. Each has his/her own perceptions, so will

look at particular features of a candidate. I think two raters will be more diverse,

and have multi-dimensional and multi-faceted evaluation. Then it (the assessment)

will be more accurate, and I can learn more experience. (U1G1.S1)

They believed double-rating helped to increase objectivity. In a candidate’s words:


I prefer being assessed by two raters because it is more objective. A single rater’s

personal opinions are usually not very accurate. (U2G1.S3)

Other candidates were more confident with the results by double-rating than single rating:

For me, I like having many examiners involved in the oral rating. If unluckily one

rater lowers my score, then the other will explain the good aspects in my

performance that he/she can see. With two raters, my speaking performance will be

analysed more carefully than with just one. (U3G2.S3)

Candidates who supported having only one rater make judgements provided different reasons for their preferences. Some believed a single rater would take more care of test takers' comfort, enhancing their performance. Extracts illustrating

preferences of oral assessment with one rater are presented below:

I mean there should be only one rater because my ability is not very high. It is

impossible for me to handle two examiners at a time. (U2G2.S3)

Other candidates trusted single rating. These candidates thought they would be more

stressed if they had to perform in the presence of more than one rater:

I think only one rater is fine. When the rater is listening to us, he/she can make

notes of what we are saying. They can perceive where we are good and weak at and

record those points. It is not necessary to have two raters. The more raters, the more

pressure on candidates. I do not like that. (U2G1.S1)

A few candidate interviewees did not care much about the number of raters, but rather about what raters did in the test room to facilitate students' speaking.

The following extract from a group interview is an example.

In my opinion, it does not matter how many examiners participate in the

assessment. What I think important is the examiner needs to have appropriate

manners and attitude. In the test room, there should not be one (examiner) standing

at this side, another one standing at that side, another one looking at the candidate

from the other side. I do not like that way. Raters should sit down at a table and talk

like a normal person. Then the candidate would feel more comfortable and perform

better in the test. (U3G2.S1)


My analysis demonstrated that the advantages of having two raters outweigh the disadvantages, and remarkably outweigh the advantages of having a single rater. Table 7.11 synthesises candidates' preferences for and against having one or two raters assess their

oral performance. More interviewed candidates expressed a preference for their speaking

sessions to be assessed by two raters rather than a single one in each test room.

Table 7.11 Candidates’ opinions for and against the number of raters in each test room

One single rater — For (F): less stressful (2); time-saving (1); manageable (1). Against (F): sometimes inaccurate (3); subjective (4); risky for candidates' scores (2).

Paired raters — For (F): diverse judgement (2); dimensional comments (1); multi-faceted evaluation (3); more objective (6); more accurate (5); more carefully analysed (3); harmonising score discrepancy (2); more chances for personal appreciation (2). Against (F): more pressure (2); unnecessary (1); more questions to handle if both ask (1).

Note. F = frequency (shown in brackets after each reason)

Assigning two raters for each test room would cause difficulty for administration

since more human resources would be needed (U1.T3). An experienced rater emphasised

that independent rating should be done for each candidate’s performance by two raters

(U2.T1).

7.6 Test score analysis

In this section, I present a statistical analysis of the scores that students were awarded in

their oral test. This is not a complete description of the entire sets of speaking test scores of

all the candidates at the three institutions involved in my study, but an analysis of the

sample scores of the students who gave me consent to use their scores for the study

purpose. My analysis aimed to learn about the distribution of the test scores. It is important

to know about this because it serves to confirm whether or not the test and the candidates’


performance met the test score users’ expectations (Green, 2013). The statistical score

analysis goes further into examining the relationship between sets of scores awarded by

different raters in the same institutions. Correlating test scores helped to describe the extent

to which paired raters of oral assessment agreed with each other when rating the same

candidates’ performances (inter-rater reliability) (Bachman, 2004; Gwet, 2012). The results

contribute to understanding how consistent the raters were in their rating process (Research Question 2). Scores were part of my study of oral test reliability as they counted towards students' academic achievement and had an impact on their motivation to study English, as discussed in the next chapter.

7.6.1 Distribution of test scores

I received a total of 314 students’ consent to my using their scores for this study (89.2% of

the entire sample of student participants). I analysed the test scores per institution rather

than per test room because the student participants were scattered in different test rooms.

Some test rooms had a number of participants varying between 15 and 20 while others had

fewer than five. Table 7.12 displays descriptive statistics of the official oral test scores

across the three institutions in terms of the mean, the highest score (maximum), the lowest

score (minimum), range, variance and standard deviation. Each candidate’s official score

was the average of the total scores awarded by two raters at Universities A and B. At

University C, the scoring was different where only one rater decided an official score for

each candidate.

Table 7.12 Descriptive statistics of oral test scores across the institutions

University N Mean Min. Max. Range Variance Std. Deviation

A 86 7.44 3.5 10.0 6.5 1.43 1.20

B 128 6.57 2.0 10.0 8.0 1.98 1.41

C 100 6.91 3.6 9.3 5.7 1.79 1.34

3 institutions 314 6.92 2.0 10.0 8.0 1.88 1.37
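
The statistics reported in Table 7.12 can be reproduced from a list of raw scores with standard formulas; the following minimal Python sketch is illustrative only and uses made-up scores rather than the study's data.

    import statistics

    def describe(scores):
        # Descriptive statistics of one institution's official oral test scores,
        # as reported in Table 7.12 (sample variance and standard deviation).
        return {
            "N": len(scores),
            "Mean": round(statistics.mean(scores), 2),
            "Min": min(scores),
            "Max": max(scores),
            "Range": round(max(scores) - min(scores), 2),
            "Variance": round(statistics.variance(scores), 2),
            "SD": round(statistics.stdev(scores), 2),
        }

    # Hypothetical scores (not the study's data)
    print(describe([6.5, 7.0, 8.5, 5.0, 9.0, 7.5, 6.0]))
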

In general, there was not a significant difference in the mean scores across the

institutions: University A (N = 86) obtained the highest mean (7.44); University C (N =

100) had the second highest (6.91); and the lowest mean was at University B (6.57).


University B (N = 128) had the widest range of scores (8.0), varying from 2.0 to 10.0. The

other universities’ ranges in scores were 5.7 (University C) and 6.5 (University A).

Differences in the range of scores demonstrate a remarkable discrepancy in candidates’

speaking ability according to the teacher raters' evaluation at each institution. For EFL majors, this variation needs to be taken into serious consideration because graduates are expected to achieve Level C1 of the CEFR (Appendix G.3) under Vietnamese education policies. Mixed-ability classes require appropriate approaches to groups of students of

different levels, e.g. differentiated curriculum, teachers’ pedagogical practices (Hallam &

Ireson, 2005).

Table 7.13 presents a cross-institutional comparison of the test scores in five

categories: ‘Good’ (Grade A, from 8.5 to 10), ‘Fairly good’ (Grade B, from 7.0 to below

8.5), ‘Fair’ (Grade C, from 5.5 to below 7.0), ‘Weak’ (Grade D, from 4.0 to below 5.5), and

‘Very weak’ (Grade F, below 4.0). Under this classification scheme of tertiary students’

academic achievement6, the letter grades A, B, C, D are ‘Pass’, and the grade F is ‘Fail’

(MOET, 2007). The highest proportion of scores came from University B (N = 128).

University C was second (N = 100), and University A was the third (N = 86) of the scores

used for my statistical analysis.

Table 7.13 Comparing categories of test scores across institutions

Grade/score

Uni. A Uni. B Uni. C Total

N % N % N % N %

A. Good (8.5-10) 19 22.1 9 7.1 16 16 44 14

B. Fairly good (7-8.4) 48 55.8 53 41.4 36 36 137 43.6

C. Fair (5.5-6.9) 14 16.3 41 32 30 30 85 27.1

D. Weak (4-5.4) 4 4.6 20 15.6 17 17 41 13.1

F. Very weak (<4) 1 1.2 5 3.9 1 1 7 2.1

Total 86 100 128 100 100 100 314 100

6 The conversion into letter grading is applied once students have all the numerical scores awarded for a unit of study (referred to as a subject for short), including mid-term tests, class engagement, attitude in class discussions, mini-projects, and the end-of-course examination. The total numerical score of a unit of study (or module) is rounded to one decimal place. The score of the end-of-course exam is mandatory and should account for no less than 50% of the total mark for each unit (MOET, 2007).
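
The MOET (2007) conversion described in this footnote, and used to build Table 7.13, maps the rounded 0-10 course score onto letter grades. A minimal sketch of that mapping (my own illustration of the grade boundaries quoted above) follows.

    def letter_grade(total_score):
        # MOET (2007) conversion used for Table 7.13: the rounded 0-10 course
        # score maps onto letter grades; A-D count as 'Pass', F as 'Fail'.
        score = round(total_score, 1)
        if score >= 8.5:
            return "A"  # Good
        if score >= 7.0:
            return "B"  # Fairly good
        if score >= 5.5:
            return "C"  # Fair
        if score >= 4.0:
            return "D"  # Weak (still a pass)
        return "F"      # Very weak (fail)

    print([letter_grade(s) for s in (9.1, 7.0, 5.4, 3.9)])  # ['A', 'B', 'D', 'F']
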


As shown in Table 7.13, the largest proportion of candidates (43.6%) achieved 'Fairly good' scores (Grade B). University A had the highest number of candidates awarded 'Good'

scores (Grade A) of the three institutions, accounting for 22.1% of the total number of the

scores collected from this institution; while University B had a lower rate (7.1%) compared

with the total rate (14%) for all the institutions.

The highest proportion of candidates achieving 'Fairly good' scores within their institution came from University A (55.8%). The proportion of candidates achieving 'Fairly good' scores

was lower at University B (41.4%). University C’s ‘Fairly good’ scores (36%) accounted

for a lower rate compared with the general rate (43.6%) for this category.

Universities B and C had similar numbers of candidates who achieved ‘Fair’ scores

(from 5.5 to below 7.0), accounting for 32% and 30% respectively. These results were

higher than the general rate of this score category (27.1%). The ‘Fair’ scores at University

A accounted for less than 17%, which was lower than the general rate.

The number of candidates classified as 'Weak' (scoring from 4 to less than 5.5) was similar at Universities B and C (15.6% and 17% respectively), whereas it was very small at

the other institution (4.6%). The rate of ‘Fail’ students was the highest at University B

(3.9%), but nearly the same at the other institutions (approximately 1%).

My comparison of the oral test scores across the institutions shows that University

A had the highest results with the most candidates obtaining Grade A (‘Good’), and Grade

B (‘Fairly good’). After that came University C, and University B had the lowest test

results of the three institutions. This comparative result was aligned with the expected

outcomes by each institution: Level C1 by University A, Level B2 by University B, and

High-intermediate Level by University C (Section 5.2).

7.6.2 Inter-rater reliability in direct scoring between pairs of raters in the same

institution

This section examines the level of agreement between raters (inter-rater reliability)

when scoring live speaking performances. As presented in Section 7.3, one pair of raters

performed the rating and scoring in each test room of Universities A and B. University C

assigned only one official rater for each test room. To examine the degree of raters’


agreement in scoring, I invited a co-rater to score University C’s oral test together with the

official rater using the same assessment criteria and rating scale (see Chapter Three). The test scores awarded by the guest rater at University C were used for the purpose of the study only and were not counted towards candidates' official scores. The total number of scores used for

correlation (N = 241) was lower than that used in the previous section of the test score

distribution (N = 314) at the three institutions. This was because double-rating at University

C (for the purpose of my study only) was not applied to all the student participants there,

but limited to those who gave me consent to have their oral test sessions observed and

audio-recorded (Appendix A.3c).

My analysis of the degree of inter-rater reliability at each institution revealed only a slight difference between raters at University A and those at University B. However, University C showed a lower level of consistency than the other two institutions. The Pearson correlation of the 128 pairs of scores awarded by pairs of raters at University B was the highest, at .79. Figure 7.4 illustrates a comparison of raters' performance across the

universities. In general, all three scatterplot diagrams illustrate a positive relationship

between pairs of oral raters. The first two plots indicate a strong relationship between pairs

of raters at University A and University B, whose correlations are between 0.7 and 0.9. The

third plot indicates a moderate relationship between the paired raters at University C, whose

correlation is within the range from 0.4 to 0.6.


Figure 7.4 Inter-rater agreement in oral test scoring across the institutions

The correlation coefficient of test scores awarded by University B’s raters had a

higher value (.80) than that of University A mentioned above. The relationship between

paired raters’ scores at University C was less strong than those at Universities A and B (the

third plot in Figure 7.4). Unlike those from Universities A and B, the scores awarded by

pairs of University C’s raters were not clustered around but scattered away from the line of

regression. The R2 Linear statistics shown in the plots tells us that there was 75% and 80%

shared variance between the two sets of scores in Universities A and B respectively,

whereas this value in University C was only 54%. This result indicates that there is more

overlap in the relationship between pairs of Universities A and B’s raters than in University

C’s raters. In other words, the level of agreement between raters at University C was

weaker than that at the other two universities.
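
The inter-rater statistics reported here, Pearson's r between two raters' score sets and the R2 'shared variance' shown on the scatterplots, can be computed as in the following minimal sketch (hypothetical scores, not the study's data).

    import statistics

    def pearson_r(x, y):
        # Pearson correlation between two raters' scores for the same candidates.
        mean_x, mean_y = statistics.mean(x), statistics.mean(y)
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sx = sum((a - mean_x) ** 2 for a in x) ** 0.5
        sy = sum((b - mean_y) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    # Hypothetical paired scores from two raters for six candidates
    rater_1 = [7.0, 6.5, 8.0, 5.5, 9.0, 6.0]
    rater_2 = [7.5, 6.0, 8.5, 5.0, 8.5, 6.5]
    r = pearson_r(rater_1, rater_2)
    print(round(r, 3), round(r ** 2, 3))  # r and the R-squared (shared variance)
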


I have presented my analysis of inter-rater reliability in direct (live) scoring between pairs of raters in the same test rooms. In the following section, I continue to compare test reliability via scores produced by different raters on another occasion of rating: semi-direct scoring based on the audio recordings collected from the test rooms. Four raters from each institution performed the re-scoring of the speech samples (Appendix A.5a). I removed segments revealing candidates' identities from the audio files and replaced them with a voice-over identifying each candidate, e.g. "candidate one", "candidate two", etc. (Appendix F.2) to ensure raters' objectivity in awarding scores. After that, the speech samples of each institution were transferred onto a CD for the raters' marking. In this way, a rater might have re-scored some (but not all) of the candidates' speaking that he/she had rated in the first round of live scoring.

7.6.3 Inter-rater reliability in semi-direct scoring across raters of the same institution

Examining the correlation of test scores produced from two rounds of scoring (the first on live oral performances, and the second on audio recordings) helped to estimate the

consistency in scoring among different raters of the same institution (inter-rater reliability).

The results aim to answer the second research question regarding the extent to which oral

test raters were consistent with each other in rating and scoring. I analysed oral rating on

sound recordings from test rooms to anticipate what the scores would have been like if

raters from the same institutions marked the same speaking performance on another

occasion. This section aimed to examine the degree of agreement across raters adopting the

same speaking task(s) and rating scale to assess the same candidates’ oral performances.

I estimated the level of rater agreement in scoring by computing the correlation

between four sets of scores by different raters at the same institution. These raters had served as oral test raters in the first round of scoring. In the second round, the raters marked candidates' audio-recorded speaking performances independently, using the same assessment criteria and scoring rubrics as they had in the first round (Appendix D.4). The analytical results of the sets of test scores that the four raters of each institution awarded in the second round are now presented. Table 7.14 shows the correlation of the test

scores by four raters at University A.


Table 7.14 Correlation of test scores by University A's raters (Pearson correlations; N = 28 for each pair)

Total Rater 1A – Total Rater 2A: .459* (Sig. 2-tailed = .014)
Total Rater 1A – Total Rater 3A: .604** (Sig. 2-tailed = .001)
Total Rater 1A – Total Rater 4A: .776** (Sig. 2-tailed = .000)
Total Rater 2A – Total Rater 3A: .798** (Sig. 2-tailed = .000)
Total Rater 2A – Total Rater 4A: .679** (Sig. 2-tailed = .000)
Total Rater 3A – Total Rater 4A: .787** (Sig. 2-tailed = .000)

*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).

As shown in Table 7.14, the Pearson correlation of the test scores ranged from

0.459 (very weak) to 0.798 (strong). In examining each pair of raters, I found that the

weakest relationship was between Raters 1A and 2A (.459), and the strongest was between

Raters 2A and 3A (.798). Rater 1A’s scores did not show a strong correlation with Rater

3A's either (.604). This result suggests that Rater 1A might have applied the rating scale in a slightly different way, or his/her interpretation of candidates' performances might have differed from that of the colleague raters. It may be useful to review his/her scores in case more training is required for oral examiners using the scale.
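
Extending the same calculation pairwise across the four raters yields the lower-triangle matrices shown in Tables 7.14, 7.16 and 7.18; the following minimal sketch (hypothetical scores, not the study's data) illustrates the procedure.

    from itertools import combinations
    import statistics  # statistics.correlation requires Python 3.10+

    # Hypothetical second-round scores by four raters for five candidates
    scores = {
        "Rater 1": [7.0, 6.5, 8.0, 5.5, 9.0],
        "Rater 2": [7.5, 6.0, 8.5, 5.0, 8.5],
        "Rater 3": [6.5, 6.5, 8.0, 6.0, 9.5],
        "Rater 4": [7.0, 7.0, 7.5, 6.0, 8.5],
    }

    # Lower-triangle Pearson matrix, laid out as in Tables 7.14, 7.16 and 7.18
    for a, b in combinations(scores, 2):
        print(f"{a} vs {b}: r = {statistics.correlation(scores[a], scores[b]):.3f}")
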

I continued to calculate the minimum, maximum, mean, and standard deviation in

the ratings of each rater at University A. Table 7.15 presents descriptive statistics of

University A’s raters’ scores in semi-direct scoring.


Table 7.15 Descriptive statistics of University A’s raters’ scores in the second time of scoring

N Mean Min. Max. Range Std. Deviation

Total Rater 1A 28 7.36 5.0 10.0 5.0 1.47

Total Rater 2A 28 7.38 5.5 8.5 3.0 .824

Total Rater 3A 28 7.46 5.5 9.0 3.5 .838

Total Rater 4A 28 7.50 6.0 9.0 3.0 .828

Valid N (listwise) 28

As can be seen in Table 7.15, at University A there was not a significant difference

in the mean across raters, which varied between 7.36 and 7.5. Rater 1A awarded the widest

range of scores, varying from 5.0 to 10.0. This was the only rater who gave the maximum

score (10 points) and the lowest score (5 points). Rater 4A was the most lenient and stable

in marking with the highest mean (M = 7.5) and the narrowest range of scores (3.0), and the

second lowest standard deviation (SD = .828) in the group. This result suggests that there

was inconsistency in the rating team. Individual raters’ leniency or harshness more or less

affected the test scores when they performed their oral rating individually. The raters did

not use the rating scale in the same way, or they interpreted the assessment criteria

differently. Table 7.16 displays the correlation of the test scores by the four raters at

University B.


Table 7.16 Correlation of test scores by University B's raters (Pearson correlations; N = 27 for each pair)

Total Rater 1B – Total Rater 2B: .729** (Sig. 2-tailed = .000)
Total Rater 1B – Total Rater 3B: .748** (Sig. 2-tailed = .000)
Total Rater 1B – Total Rater 4B: .508** (Sig. 2-tailed = .007)
Total Rater 2B – Total Rater 3B: .590** (Sig. 2-tailed = .001)
Total Rater 2B – Total Rater 4B: .522** (Sig. 2-tailed = .005)
Total Rater 3B – Total Rater 4B: .581** (Sig. 2-tailed = .001)

*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).

As shown in Table 7.16, the Pearson correlation of the test scores ranged from

0.508 (moderate) to 0.748 (strong). In examining each pair of raters, I found that scores by

Rater 1B had a moderate relationship with those of Rater 4B (0.508), but a strong

correlation with scores by Raters 2B and 3B (0.729 and 0.748 respectively). Rater 2B

showed a low level of agreement in scoring with both Raters 3B and 4B (0.590 and 0.522

respectively). The agreement in scoring between Raters 3B and 4B showed a low level of consistency (0.581). The highest correlation (0.748) was found between Raters 1B and 3B.

I continued to calculate the minimum, maximum, mean, and standard deviation in

the ratings of each rater at University B. The results are summarised in Table 7.17.

Table 7.17 Descriptive statistics of University B’s raters’ scores in the second time of scoring

N Mean Min. Max. Range Std. Deviation

Total Rater 1B 27 6.63 2.5 9.0 6.5 1.45

Total Rater 2B 27 6.07 3.0 8.0 5.0 1.03

Total Rater 3B 27 6.26 2.0 9.25 7.25 1.55

Total Rater 4B 27 7.24 6.0 8.6 2.6 .81

Valid N (listwise) 27


Table 7.17 shows that the mean at University B varied between 6.07 and 7.24. Rater

3B awarded the widest range of scores, ranging from 2 to 9.25. Rater 3B was the only rater

who gave the highest score (9.25 points) and the lowest score (2 points). Rater 4B was the

most lenient and stable in marking, which was demonstrated by the highest mean (M =

7.24), and the lowest standard deviation (SD = .81). Rater 2B was the most severe of the

four raters (M = 6.07) but with a lower variation compared with Raters 1B and 3B. This

result suggests that rater training could enhance raters’ agreement in scoring and reduce

variations in raters’ scores. Table 7.18 demonstrates the correlation of the test scores by the

four raters at University C.

Table 7.18 Correlation of test scores by University C's raters (Pearson correlations; N = 27 for each pair)

Total Rater 1C – Total Rater 2C: .796** (Sig. 2-tailed = .000)
Total Rater 1C – Total Rater 3C: .759** (Sig. 2-tailed = .000)
Total Rater 1C – Total Rater 4C: .759** (Sig. 2-tailed = .000)
Total Rater 2C – Total Rater 3C: .655** (Sig. 2-tailed = .000)
Total Rater 2C – Total Rater 4C: .645** (Sig. 2-tailed = .000)
Total Rater 3C – Total Rater 4C: .672** (Sig. 2-tailed = .000)

*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).

As presented in Table 7.18, the Pearson correlation of the test scores ranged from

0.645 (moderate) to 0.796 (strong). In examining each pair of raters, I found that scores by

Rater 1C had a stronger positive relationship with those of Rater 2C (0.796) than with those

of Raters 3C and 4C (0.759). Rater 2C showed a low level of agreement in scoring with

both Raters 3C and 4C (0.655 and 0.645 respectively). The degree of agreement in scoring


between Raters 3C and 4C was also at a low level of consistency (0.672). I present the

ratings of each rater at University C in terms of the minimum, maximum, mean, and

standard deviation in Table 7.19.

Table 7.19 Descriptive statistics of University C’s raters’ scores in the second time of scoring

N Mean Min. Max. Range Std. Deviation

Total Rater 1C 27 6.09 4.0 9.5 5.5 1.44

Total Rater 2C 27 7.06 5.5 9.1 3.6 .99

Total Rater 3C 27 7.33 6.8 8.7 1.9 .64

Total Rater 4C 27 8.18 6.2 10.0 3.8 .92

Valid N (listwise) 27

Table 7.19 shows that the means of the four raters' scores at University C varied between 6.09 and 8.18. Rater 1C awarded the widest range of scores, ranging from 4.0 to 9.5; this rater's standard deviation of 1.44 was also the highest of the group. Rater 1C was the most severe rater, with the lowest mean and the lowest awarded score (Min = 4.0) of the four. Rater 4C was the most lenient and stable in scoring, as demonstrated by the highest mean (M = 8.18) and a fairly low standard deviation (SD = .92). Rater 4C was also the only rater in the group who awarded candidates the maximum score (10 points). Rater 3C's scores varied

the least as the standard deviation was very low (SD = .638). This result highlights a need

for rater training to enhance inter-rater agreement in oral assessment.

7.7 Summary

In Chapter Seven, I have examined the scoring validity of the oral test administered at three

tertiary institutions. My analysis highlights remarkable differences and similarities in the

procedures and the outcomes of assessing EFL speaking skills across Vietnamese

universities. Oral raters were EFL teachers of the institutions who had the necessary understanding of the students' English level and the teaching content, as well as practical experience with language testing. Live double-rating enabled life-like interaction and helped to

increase test score reliability. The correlation demonstrates different degrees of positive

relationships among test scores by raters at the same institution. Test scores show higher


degrees of raters’ agreement in the direct than the semi-direct mode for speaking

assessment. Differences between the two modes of marking were how raters perceived

candidates’ oral performances together with facial expressions or of body language. Paired

raters in live oral rating had an opportunity to inform each other their intended scores

or/and discuss with each other differences in their marking to reach agreed scores. Rating

scales as assessment tools did not yield an expected effectiveness that facilitated raters’

performance. Individual raters seemed to rely on personal testing experience to arrive at

particular scores since test administrators failed to organise a rater meeting before the test

to ensure all raters of the institution had a mutual understanding of the constructs being

tested, and an agreed interpretation of candidates’ oral performance with reference to a

clearly-defined rating scale.

In the next chapter, I explore the impact of the test in terms of its consequences on

teaching and learning. Examining the consequential validity of the test helps to build greater awareness of the importance of assessment in the cycle of language pedagogy and thereby to foster the construction of more effective tests.


Chapter Eight

RESULTS: IMPACT OF ORAL TESTING ON EFL TEACHING AND LEARNING

8.1 Introduction

In the previous chapters, I analysed the extent to which the speaking test aligned with the

conventions of oral assessment discussed in language testing theory and empirical studies. I

employed Weir’s validation model (2005) for investigating various types of evidence to

support validity arguments. My analysis concentrated on the context validity of the oral

test, i.e. how speaking skills were assessed (test administration), what was assessed (test

content), and what measures were adopted for assessment (test tasks). I explored the

scoring validity by examining the oral rating process in terms of consistency and fairness in

measuring the quality of higher education.

Assessment in an educational context is not separate from teaching and learning.

They constitute a continual formation process in which assessment provides useful

information to review teaching effectiveness and promote further learning. School-based

assessments “cannot mark the end of learning”, but they “must be followed by high-quality,

corrective instruction designed to remedy whatever learning errors the assessment

identified” (Guskey, 2003). In this chapter, I address the test’s impact on EFL pedagogy in

Vietnamese universities. The impact of a test is referred to as “washback” in language

testing and assessment (Burrows, 2004; Cheng, 1997; Muñoz & Álvarez, 2010). Effects of

assessment on teaching and learning have been part of test validity (consequential validity)

where washback is considered “one form of testing consequences that needs to be weighted

in evaluating validity” (Messick, 1996, p. 243). Washback, as a commonly discussed topic,

“exemplifies the intricate relationship between testing and teaching and learning” (Cheng,

2005, p. 35) that is associated with “a whole set of variables that interact in the educational

process” (Shohamy, 1993, p. 2).


In this section I examine washback effects and the resulting consequential validity

of the EFL oral examination. The speaking test was administered at the end of every

semester, year after year, in the educational institutions. The end-of-semester assessment

was intended to inform what students did (or did not do) well in the test so that necessary

changes in teaching and learning oral skills could be made for better pedagogical results in

the future. Information from administering the oral test would demonstrate what needed to

be retained or adjusted to improve the quality of oral test administration next time. Several

empirical studies have investigated the influences of testing in different teaching and

learning contexts. These studies examined the impact of testing on teachers and teaching

(Saif, 2006; Burrows, 2004), students and learning (Watanabe, 2001; Andrews, Fullilove,

& Wong, 2002), learners’ behaviours towards test preparation (Stoneman, 2005; Read &

Hayes, 2003), and teaching materials (Yu & Tung, 2005).

Washback effects of testing can be seen from both a backward and a forward

direction in time. For the backward direction, I examined the influence of the test on the

teachers’ and learners’ preparation activities (A. Green, 2013; Brown, 2004). For the

forward direction, I explored its impact on future adaptations and modifications in teaching

and learning as a consequence of the test (Pearson, 1988). I utilised data collected from the

questionnaire surveys, interviews, and recordings of speech samples to answer Research

Question 4: What washback effects does the test have on teaching and learning EFL

speaking skills? I have organised this chapter into two parts: the test’s impact from

candidates’ perspectives, and the test’s impact on EFL teaching from raters’ views and

experience. The chapter ends with a summary of how oral assessment has had influential

effects on EFL pedagogy at the tertiary institutions involved in my study.

8.2 Impact of the oral test from candidates’ perspectives

My presentation in this section focuses on integrated results from the EFL students’ survey

and focus group interviews with representative informants in the survey. Where

appropriate, I illustrate these thematic results with candidates’ oral language production in

response to the assessment tasks. My analysis addresses candidates’ perceptions of the test’s

impact in terms of test scores, learning activities in preparation for the test, and desired

adjustments in learning strategies after the test.


8.2.1 Impact of test scores

The most immediate impact of the oral test on candidates was from the test scores, which

were produced right after the test event and reported to candidates either on the same day or

a few weeks later. The issues of transparency and privacy in reporting test scores were

treated differently across institutions and inconsistently within institutions. In one test

room, the raters announced the test scores to all the candidates of that test room right after

the last candidate had finished. In other test rooms, scores were handed to the

administrative body for processing before they were officially released to individuals or

classes. One institution let the common test score report be known to all the class members,

while the others allowed only individual students to have access to their scores.

Test scores affected students in different ways, varying from their personal

emotions (self-image, anxiety, family) to study at school (overall academic result,

motivation to learn English) and other social issues (teacher-student relationship, job

opportunities). Figure 8.1 shows candidates’ agreement on various aspects affected by the

oral test scores. Outcomes from the student survey highlight that the most notable impact of test scores is on students’ motivation for English learning (82.4%).

Figure 8.1 Impact of test scores on candidates (rate of agreement, %: self-image 27.8; motivation for English learning 82.4; teacher-student relationship 4.8; anxiety and emotional tension 56.8; future job opportunities 44; overall academic result 81.8; reprimand from family 14.2; other 2.8)

Evidence from interviews highlights the relative importance of test scores for EFL

learners. In spite of not being a decisive factor in the success of students’ English learning,

oral test scores did play a motivational role. Learners became motivated not always because

they received high scores from lenient raters, but often because they believed the marking

was fair and the scores indicated where they were in their learning process. The following

extract from a candidate’s response demonstrates that test results (via scores) provided

essential motivation for his L2 study:

Learning without assessment makes students stand still and cannot improve. There

needs to be a test for students to know that we have some results after a period of

study. Despite not having a decisive importance, test results would be something

realistic that reflects students’ ability. We learners need to know where we are in

our study. That would be very useful for us. (U1G1.S1)

The student’s perceptions about the importance of scores were in line with the

policies of score calculation for the listening-speaking subject applied across the

institutions. As a component in an integrated skills subject, the score for the speaking skills

of each student was combined with that of the listening skills to make a total score on

completion of the course. The total score determined whether an individual would be

recognised to pass the course or not (Table 7.2). A rater explained this situation as follows:

… if a student was not lucky in the Speaking component, but he/she got a good

mark in the Listening component, then he/she could get a Pass grade for the total

score of this subject. It was understandable because his/her Listening score was

better. (U2.T1)

Courses in language skills (Listening, Speaking, Reading, and Writing) received unequal weighting in the curriculum designed for EFL majors. Reading and Writing were

treated as separate skills subjects. Students were awarded separate scores for these two

subjects. In contrast, Listening and Speaking skills were combined into one subject (see

Chapter Five). The Listening and Speaking component scores were added together and

reported as a whole. The total subject score recorded in an individual academic profile was

calculated with a different portion of the Speaking component across the institutions –

accounting for 40%, 50%, and 62.5% at Universities B, A, and C respectively (see Table


7.2). This score combination made students wonder how important the role of the Speaking

score was in their received scores. Students were not able to make self-assessment on their

progress in each skill, or which task they had performed better. The following extract

demonstrates candidates’ expectations for transparency in the scores reported to them:

We took two tests (Listening and Speaking) separately, but we know just a total

score as the final outcome. We did not know what aspects we were strong or weak

at. We did not know what the listening scores were, what the speaking scores were.

Furthermore, the oral test was divided into two parts: in one part, the examiner

asked us questions; in the other part, two candidates talked together. We did not

know in which part we performed better. (U2G1.S2)

Candidates from the other institutions had similar perceptions that combined scores

did not reflect learners’ oral production skills in the test. The combination of two test scores

provided an assessment which completed students’ academic profiles rather than “be[ing]

part of an ongoing effort to help students to learn” (Guskey, 2003, p. 10). Treating test

scores this way, with or without intention, made assessment “a one-shot, do-or-die

experience for students” (Guskey, 2003, p. 10). Questionable scores resulted in increasing

pressure on learners who were only given a vague understanding about what the scores

meant and to what extent their performance met the course objectives in order to target a

new level of competence. The following quote provides a typical example illustrating

candidates’ perception of receiving combined Listening-Speaking scores:

Actually, we were neither informed of the rating scale nor the separate scores for the

Speaking component later. They [the official scores] include the Listening scores, so

it is difficult to know what the scores tell us about our speaking ability. (U1G2.S1)

Most of the surveyed candidates (81.8%) reported that the test scores impacted their

overall academic results. Both the Listening and Speaking components contributed to the general evaluation that students received for the subject. Since the institutions applied a 10-point scoring scheme for academic achievement, the Pass score was 5.0 or higher. The difference between 4.9 and 5.0 (if the 0.1 increment was applied), or between 4.5 and 5.0 (if the 0.5 increment was applied), was small in numeric terms, but it made a clear distinction between a Pass and a Fail status for a student’s Listening-Speaking subject as a whole. Bias in oral marking might affect cases of candidates whose

scores were close to the Pass/Fail boundary. If raters had known how much an oral test

score would affect the Pass/Fail status of a candidate, they might have made necessary

adjustments at the time of oral scoring. However, it was impossible for speaking test raters

to make careful consideration (or reconsideration) of these cases because they did not know

candidates’ scores for the listening component at the time of oral scoring. Inaccuracy in

rating could therefore turn a Pass candidate into a Fail, or vice versa. Oral test ratings counted towards many educational decisions. If a student’s achievement result was not satisfactory, he or she would have to re-enrol in the course and pay the subject tuition fee again. Low scores would

influence the overall academic result, scholarship competition, graduation recognition, and

potential future career opportunities.
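To make the arithmetic behind these Pass/Fail boundary cases concrete, the short Python sketch below illustrates how a combined Listening-Speaking score could be formed under the Speaking weights and the 5.0 Pass threshold reported in this chapter. It is a hypothetical illustration only: the weights and the threshold come from the text, while the function names and the example scores are my own assumptions rather than any institution’s actual procedure.

# Hypothetical sketch of the combined Listening-Speaking score discussed in
# Section 8.2.1. The Speaking weights (0.40, 0.50, 0.625) and the 5.0 Pass
# threshold come from the text; all other details are assumed for illustration.

SPEAKING_WEIGHT = {"University B": 0.40, "University A": 0.50, "University C": 0.625}
PASS_THRESHOLD = 5.0

def combined_score(speaking: float, listening: float, university: str) -> float:
    """Weight the two component scores and round to one decimal place."""
    w = SPEAKING_WEIGHT[university]
    return round(w * speaking + (1 - w) * listening, 1)

# A half-point difference in the oral score can flip the Pass/Fail outcome:
for speaking in (4.5, 5.0):
    total = combined_score(speaking, listening=5.0, university="University B")
    status = "Pass" if total >= PASS_THRESHOLD else "Fail"
    print(f"Speaking {speaking}: total {total} -> {status}")

Under these assumed scores, a candidate with a Listening score of 5.0 would fail the subject with an oral score of 4.5 but pass with 5.0 at University B, which is exactly the kind of boundary case raters could not see at the time of oral scoring.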

8.2.2 Learning activities candidates found useful for the oral test

Participating students realised that a variety of learning activities provided useful

preparation for their speaking performance in the test room. The end-of-course speaking

test created a good opportunity for students to improve their speaking skills via “conscious practice” (Hansen, Nohria, & Tierney, 1999, p. 55) in knowledge and skills management. Knowing what good English-speaking skills should look like would encourage students to practise the skills they wanted to be good at (Babauta, 2012). These activities could be organised by language teachers, or by candidates themselves in class, or after class as extra-curricular activities, e.g. English-speaking clubs, colloquial conversations with native English speakers, English contests of eloquence, and so on. Candidates perceived the effectiveness of these activities to different degrees. Many of the learning activities were good practice to get involved in to build the speaking skills necessary

for the oral test (Figure 8.2).


Figure 8.2 Useful learning activities for candidates before the test (rate of agreement, %: oral presentation 67.9; paired speaking practice 65.1; group discussion 39.8; joining English-speaking clubs 31.5; mock test taking 21.9; other 4.5)

As shown in Figure 8.2, most of the students (67.9%) found oral presentation an

effective type of speaking practice in preparation for the oral test. Classroom-based oral

presentations in front of a large audience of peers could help students build self-confidence,

an important element in the success of oral performance. My results echoed findings from

an empirical study on oral performance-based assessment for undergraduate English

majors. That study indicated a positive, significant correlation between candidates’ general self-confidence and their scores in an oral achievement test: the more self-confident students were, the higher they scored in the speaking test (Al-Hebaish, 2012).

Frequent oral practice with a peer helped students build interpersonal

communication skills, which candidates found very useful in the direct speaking test

format. Paired speaking practice received the second highest rate of agreement (65.1%)

from the student sample (Figure 8.2). In an oral test, candidates not only interacted with

human raters but also with a partner, in pairs. However, fewer students (about 40%) found

participating in a group discussion useful in preparing for the oral test. This result may be explained by the fact that group discussion was not included as a test format in the oral assessment, or that the organisation of group discussion was not effective, so candidates did not learn much from this type of activity. Ineffectiveness of groupwork in speaking lessons


could be due to large classes (between 35 and 40 students per class), students’ passiveness

or their unwillingness to participate. A student commented on her class’s inactive attitude

towards participating in cooperative learning, “even though the teacher implored

insistently, many of my classmates did not want to speak in groups” (U2G2.S4).

8.2.3 Candidates’ perceptions of the test impact on EFL learning

I examined the overall perceptions and opinions of the test takers across the three

institutions by computing descriptive statistics for the quantitative data collected from the

questionnaire survey. In this section, I present the main themes emerging from my

statistical analysis and support them with words from interviews. Table 8.1 demonstrates

the means and standard deviations for the set of six items asking the informants to indicate

their agreement on various aspects of the test impact on English learning.

Table 8.1 Means and standard deviations (SD) for EFL students’ perceptions of the test’s impact on learning

Statements N Mean SD
(1) The test helped me identify what I need to improve about my spoken English. 352 4.03 .74
(2) The test helped me build effective learning strategies to improve my English-speaking ability. 352 3.76 .83
(3) I have learnt useful test-taking skills from this test. 352 3.60 .84
(4) The test has made me more confident in speaking English. 352 3.45 .89
(5) I revised the lessons I had learnt from the English-speaking course to prepare for the test. 352 3.74 .83
(6) This test is one of the most important motivations for me to improve English speaking skills. 352 3.92 .76
Valid N (listwise) 352

Note. 1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree
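For readers interested in how figures of this kind are produced, the brief Python sketch below shows one way the item means and standard deviations in Table 8.1 could be computed from 5-point Likert responses. It is a minimal illustration under assumed data: the 1–5 scale and the sample size of 352 come from the table, while the column names and the toy responses are hypothetical.

# Minimal sketch of the descriptive statistics reported in Table 8.1, assuming
# responses (1 = Strongly disagree ... 5 = Strongly agree) are stored with one
# row per respondent and one column per questionnaire item. The column names
# and the toy data below are hypothetical, not the study's actual dataset.
import pandas as pd

responses = pd.DataFrame({
    "item1_identify_improvements": [4, 5, 4, 3, 4],
    "item2_learning_strategies":   [4, 4, 3, 4, 3],
})

# pandas reports the sample standard deviation (ddof = 1), as SPSS does.
summary = responses.agg(["count", "mean", "std"]).T.round(2)
print(summary)

Applied item by item to the full set of 352 responses, the same computation yields the columns corresponding to N, Mean, and SD in Table 8.1 (and, with the teacher data, in Table 8.3).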

As shown in Table 8.1, candidates’ agreement with the statements ranged from 3.45

to 4.03. Candidates expressed the strongest agreement (M = 4.03) that the test provided

them with an understanding of where their English level was and what deficiencies in their

speaking skills they needed to improve (item 1). From candidates’ practical experience in


the test, many realised that pronunciation and vocabulary were two aspects they needed to

improve for better speaking:

After the test event, there came a determination in my mind that I need to learn

more vocabulary, need to learn how to pronounce correctly. In general, [I need to

learn] many other things. (U1G2.S3)

Other candidates could identify that their current knowledge of topics like business

and media was not yet sufficient:

After the test, I realised that I lacked knowledge about business, some things about

media, then I think I will study more. I will try to learn more about those areas.

(U3G2.S1)

The test motivated students in that it marked their achievement to continue with the

next higher-level stage of English study or let them realise what they needed to improve to

enhance oral performance. Doing well in oral assessment was one of the students’ most

important motivations to improve English speaking skills (item 6), which received the

second highest agreement from candidates (M = 3.92). This result aligned with students’

perceptions of the influence of test scores (Section 8.2.1), which were direct effects of the

rating process. Candidates performing well found the test a rewarding experience when

their academic progress was acknowledged by the number of university credits they could

earn. Candidates who did not perform as well as expected learned which gaps in their English ability they needed to fill, or whether they needed to learn more about test-taking skills to keep pace with other students.

Evidence from interviews demonstrates that students were not confident about the accuracy of their self-evaluation, which was usually higher than their actual ability. Professional rating gave candidates a more reliable assessment. When I asked about the test

impact on her study, a candidate commented as follows:

It [the oral test] helps us know what we need to improve because we usually expect

a lot, we think we will probably reach a level in our expectation, but after taking the

test, we just know that we have not achieved that goal and we will need to try more.

(U1G3.S1)


The role of the test to help students build effective strategies for improving oral

skills (item 2) received a strong agreement from the respondents (M = 3.76). Practical

experience with spoken English in a testing environment enabled candidates to realise what

they needed to improve and why they had to improve it. In the long run, this awareness was

useful in building their own strategies to take appropriate action towards improving their

language competence and communication competence. The candidate in the following

example revealed that, after identifying a weakness in structuring ideas in her talk, she would adopt another English learning strategy that she believed would be more effective:

After that English test, I realised that it would not be sufficient if I had only good

pronunciation. I need more knowledge to perform good speaking ability. For

example, my oral performance was a little disorderly that day. My ideas were not

clearly structured, so I felt that the teacher was slightly confused when listening to

me. I just realised after the test event that I need to have another practice strategy to

speak better: I need a clearer arrangement of ideas in my mind when speaking. I

find the oral test very useful for me. (U1G2.S1)

Strategies for learning oral skills varied from student to student. Each had his/her

own strengths, weaknesses, purposes, and interests in learning spoken English. These

personal attributes are not stable but may change over time in different periods of language

learning. Candidates were aware that learning strategies should be changed constantly to

suit personal needs and the teaching and learning process occurring in class (U1G1.S2).

Table 8.2 contains a summary of strategies which candidates across the institutions

identified to better their English-speaking skills.


Table 8.2 Candidates’ desired strategies to improve English speaking skills

Strategies for language competence – Purposes
Learn more vocabulary (U1G2.S3, U2G2.S4) – Express ideas more easily
Practise pronunciation (U1G2.S3, U2G2.S4) – Facilitate oral communication
Use more complex sentences (U1G3.S2) – Increase coherence in speech

Strategies for communication competence – Purposes
Practise more speaking skills (U1G3.S3) – Reduce anxiety, increase fluency
Enrich social knowledge (U1G2.S2) – More informative and convincing
Learn how to start a discussion (U1G2.S4) – Get involved more easily
Structure speech more coherently with supporting ideas (U2G2.S4) – Facilitate listening comprehension and become more convincing in argument
Practise presentation skills (U3G1.S2) – Useful for future work
Gain more knowledge about business and media (U3G2.S1) – Know more about the modern world to prepare for future work

Building learning strategies is a key element in determining how students will learn

and improve their language skills. Well-planned strategies contribute to students’ success in

language study. When students can select appropriate strategies for them, “these strategies

become a useful toolkit for active, conscious, and purposeful self-regulation of learning”

(Oxford, 2003).

Taking the oral test could be practical experience for candidates not only to apply

their test-taking skills, but also to learn additional skills useful for future examinations.

Learning useful test-taking skills from the test (item 3) received a medium mean of

candidates’ agreement (M = 3.60). Beyond the main goal of performing language skills to the best of one’s ability, test-taking skills also mattered; the following extract from a candidate in a focus group illustrates how such skills could help her in oral tests:

For me, the test helped to build effective skills for test taking. For example, I used to

have no experience when taking oral tests at school. I used to give such equivocal

answers, like answering to pass the time… But now, I have more experience.

Gradually I find that I can handle examiners’ questions more easily. I mean I try to

make the most of the testing time to express all the ideas I have before stopping. I

no longer wish for the time for my speaking to pass quickly. (U2G2.S2)


Raters’ experience with oral rating revealed that there was a notable difference

between oral language skills and oral test-taking skills. Good oral performance in a test setting required more than existing oral language ability, e.g. test preparation skills (predicting questions, doing oral rehearsal, overcoming test anxiety). It also required skills for

maintaining communication with examiners and candidate partners, including attentive

listening, eye contact, turn taking, using body language, opening and closing a test session,

and so forth (Flemming, 2017; Study guides and strategies, n.d.). It was not always certain

that a student with good oral language competence would perform speaking skills well in

an oral test. Raters reported that many candidates showed such nervousness that they could

not speak fluently (U1.T3, U3.T1). A rater in the following example expressed his

sympathy for a speaking session of a candidate who displayed very good English

pronunciation but lacked content preparation.

There was a candidate who pronounced English very well, but she might not have

prepared for the test carefully enough, so the content of the candidate’s speaking

was not good. I am very sympathetic for that case. (U3.T2)

Lack of test-taking skills may result in apparent disadvantages for candidates. An

example extracted from the test audio-recordings in an interview task illustrates how the

candidate (C23) was not well-equipped with effective test-taking skills prior to the test. The

candidate did not have an appropriate response as she had difficulty understanding spoken questions.

Excerpt 8.1 File U2.C22-23 0:32 – 1:17

1       IN: ((interlocutor addresses C23 by her middle name and first
2  ---> name)), what is the main difference between youth sports today
3       and one hundred and fifty years ago?
4  ---> C23: %again%
5  ---> IN: what is the main difference between youth sports today and
6       one hundred and fifty years ago?
7       C23: (.5)
8  ---> IN: ok, next question. %mm% what personality traits do children
9       gain from sports participation?
10 ---> C23: %mm% (.10)
11 ---> IN: %can you answer the question?% ok, next. what does losing
12      teach children about life?

In Excerpt 8.1, the interlocutor (IN) addressed the candidate by her names to

indicate that it was her turn to listen to and answer the first question (lines 2-3). The

candidate did not understand the spoken question and wanted the interlocutor to repeat it by

saying ‘again’ with a low voice (line 4). Though she asked quietly with only a single word

instead of making an appropriate request for repetition, the interlocutor could understand

and repeated the question immediately (lines 5-6). However, after waiting for a moment

(five seconds) and seeing that the candidate had no answer, the interlocutor moved on to

ask the second question (lines 8-9). The candidate did not ask the interlocutor to repeat the

question although she was not able to give an answer within ten seconds (line 10), twice as

long as the pause after the first question. The interlocutor may have wanted to give the

candidate some time to think, and then simply continued by asking the third question (lines 11-12). Because she did not ask appropriately for the question to be repeated, the candidate did not have an opportunity to hear the second question twice, as she did with the first question.

Excerpt 8.2, extracted from the sound recording in the same test room, illustrates a

similar situation in which the candidate (C27) did not understand the interlocutor’s question,

but the consequence was different from that in Excerpt 8.1.

Excerpt 8.2 File U2.C26-27 0:43 - 1:25

1       IN: now, ((interlocutor addresses C27 by his middle name and
2  ---> first name)), what kinds of things do you or other people you
3       know do to protect the environment?
4  ---> C27: (.) mm uhm I think %uhm mm sorry please I can’t hear
5       could you repeat the question?%
6  ---> IN: what kinds of things do you or other people you know do to
7       protect the environment?
8  ---> C27: (5.) mm mm I think this is an the endangered animal or
9       plan er that should protected.

As demonstrated in Excerpt 8.2, the interlocutor (IN) read the candidate (C27) a

question from the test booklet (lines 2-3). The candidate appeared not to understand the

content of the question. After some hesitation, the candidate made a polite request for the

interlocutor to repeat (lines 4-5). Although his request was quietly spoken, the interlocutor

could hear him and was willing to repeat the question right away (lines 6-7). The


candidate’s effective test-taking skills brought him an opportunity to hear the question

again and he could produce a short utterance (lines 8-9) in response to the question-and-

answer task.

8.3 Impact of the oral test on teaching from teacher raters’ perspectives

In this section, I analyse the impact of the oral test on teaching English speaking skills. I

examine these effects from two directions: changes teachers are likely to make in their

teaching after experience with the test to improve students’ speaking ability, and types of

activities involved in the speaking course to prepare learners for the end-of-course oral

exam. This section ends with results from the teacher survey to investigate the teacher

raters’ overall perceptions of test impacts on many aspects of teaching and rating speaking

skills within the educational context of Vietnam.

8.3.1 Major desired changes in teaching speaking skills

The oral test provided useful information to inform pedagogical changes in addition to its

function as an evaluative tool to measure learners’ academic achievement. This power of

testing could explain why the institutions assigned their own EFL teachers as oral test

raters to assess their students’ oral language performance, rather than outside teachers or

testing experts. Experience with students in an authentic test would help these teachers

update information about the educational needs and quality at the institutions. What

teachers learnt from the test would encourage them to make changes in teaching and

learning to meet the institutional requirements.

Results from my survey show a variety of changes EFL teachers would like to make

in their teaching of oral language skills to English majors (Figure 8.3).


Figure 8.3 Major desired changes in teaching speaking skills (rate of agreement, %: more emphasis on pronunciation 51.4; more group discussion 71.4; more authentic language tasks 82.9; adopting new teaching methods 42.9; encouraging more students’ participation in class 74.3; using mock tests 42.9)

As illustrated in Figure 8.3, most of the teachers (82.9%) were concerned about

increasing task authenticity in their oral language teaching. Task authenticity can be

considered through four schools of thought: a genuine purpose, engagement, classroom

interaction, and real-world targets (Guariento & Morley, 2001). The communicative

approach in L2 teaching and testing involves incorporating authentic materials and real-life

circumstances to generate meaningful communication (Morrow, 2018; Banciu & Jirechie,

2012). The need to modify classroom activities with authentic speaking tasks derived from the learners’ needs and from some inappropriateness of the course book contents. The following example was extracted from an interview with a rater responding to my question about whether she would make any adjustments in her teaching to improve students’ speaking skills.

I think yes. That is because the course book content is slightly wordy. There are

some lessons that are not very authentic. It is impossible to improve the content

because it was from the course book. If I have a chance to teach the course again, I

think I will organise more authentic speaking activities for students. These extra

activities will be based on the content covered in the course book so that students

can apply the knowledge they have learnt. Learning English is not only for using in

the classroom, but also for using it outside. (U3.T1)


Other changes in teaching oral skills that received high levels of agreement from

teachers were encouraging more students’ participation in speaking class activities in

general (74.3%), and in group work discussions in particular (71.4%). Classroom practice

would provide learners with an effective English-speaking environment to interact with

peers under the teacher’s guidance. Many teachers supported these changes probably

because of students’ passivity in speaking classes (U2G2.S3), or anxiety about making

mistakes (U1G3.S4). Teachers would organise more group discussion to facilitate more

interactions among group members and help learners become more confident when using

English with many people. The teacher in the following example was very determined to adopt group work in the classroom although she found this activity slightly time-consuming:

We usually give topics and divide the class into small groups, normally around three

to five students each… Each student in a group has to speak out their points of view,

not only one good student in each group. I find that it is very effective, but a little

time-consuming, so I think I will foster this kind of activity to enhance students’

speaking ability (U1.T1).

8.3.2 Teaching activities to prepare learners for the oral test

Washback effects of the test on teaching speaking skills are demonstrated not only through changes perceived after the test but also through teaching activities conducted prior to the test to equip learners with sufficient skills for their test performance. In other words, testing could influence the teaching and learning activities that occurred before the test event.

The results shown in Figure 8.4 demonstrate the overall degrees of factors affecting

the practice of teaching oral skills at the institutions involved in the study. As can be seen

from the bar chart, the percentages of these factors ranged between 14.3% and 80%.

According to EFL teachers’ responses in the survey, teachers’ teaching experience and

beliefs were the most important factors that determined the methods of, and approaches to,

teaching speaking skills.


Figure 8.4 Factors affecting teaching oral skills (rate of agreement, %: professional training 62.9; teaching experience and belief 80; coursebook 48.6; teaching syllabus 62.9; past experience as a language learner 60; learners’ expectations 71.4; peers’ expectations 20; social expectations 14.3; end-of-course exam 54.3; international EFL exams 54.3)

Learners’ expectations were the second most important factor that affected oral

skills teaching methods. A large number of teachers (71.4%) agreed that learners’

expectations played a more important role than teachers’ professional training and teaching

syllabus, both of which received a lower rate of agreement (62.9%). This result could be

due to the increase in adopting learner-centredness in EFL pedagogy in Vietnam in recent

years (Dang, 2006). The factor of learners’ expectations was regarded as more important

than the selected course book and the teaching syllabus design so that the outcomes of the

education process would meet the needs of the learner and labour market requirements. The

role of teachers’ practical experience was regarded as more important than that of professional training in teaching oral skills. Teachers believed they needed to teach themselves and learn from practical experience. That the teaching syllabus did not receive the highest percentage implied that teachers could be flexible in applying the teaching syllabus in their classes.

The teacher could modify or adjust the teaching syllabus to suit the learners’ needs in some

particular conditions, as illustrated in the following quote from a teacher interview:


There is a course book accompanying the listening-speaking course. However, it is

inadequate to adhere to the course book only. The teacher usually has to modify

with updated knowledge about contemporary topics in social life so that students

can have more information to think about. The course book provides just part of

what is needed. (U2.T2)

Teachers tended to include supplementary materials beyond the language skills and resources the course book offered. Teachers spent most of the class time broadening students’ language knowledge and skills that were not necessarily tested. During the course, only just over half of the teachers (54.3%) oriented their speaking classes to international EFL tests or school-based exams. It would not be sufficient for English majors if teachers taught and

helped students master only the types of speaking tasks used in the end-of-course test. Each

student was allowed only four to five minutes to perform in the oral test. Such a limited

amount of time could not provide candidates a full opportunity to perform the wide range

of speaking skills that they might have been capable of (U1G1.S2, U2G2.S3).

Neither did teaching oral skills at the institutions target preparing students for

international EFL examinations because these exams had purposes different from school-

based assessment. International EFL examinations aim to evaluate candidates’ language

proficiency at the time of testing, regardless of what content they have learned, whereas the

main purpose of school-based assessments is to measure candidates’ language skills learnt

from a course of study (as their achievement). It could be for these reasons that teaching

speaking skills at the universities was not much driven by the goals of international EFL

tests. However, as presented in Chapter Six, institutional oral test designers could adapt and

employ some types of test tasks and scoring methods used in international English language

examinations for assessing speaking skills at their institutions.

Teachers adopted various types of teaching activities to engage students in English

speaking classes. Figure 8.5 presents activities that teachers found useful in preparing

students for the oral test. As can be seen from the bar chart, task-oriented activities received

the highest agreement (85.7%) from teacher respondents in terms of usefulness for end-of-course oral assessment. The use of authentic materials and group discussion in speaking classes also received high agreement (80% and 74.3% respectively), which was in line with the discussion in Section 8.3.1. This convergent result suggests that teachers would continue to

reinforce the inclusion of authentic materials and small group discussion in their teaching

later on after the test.

Figure 8.5 Useful activities to prepare students for the oral test (rate of agreement, %: language games 31.4; task-oriented activities 85.7; exposure to various media 62.9; authentic materials 80; group discussion 74.3; teacher-student interactions 54.3; English-speaking practice after class 65.7)

Teachers (65.7%) agreed that English-speaking practice after class was useful. A

probable explanation for this result is that the class time was limited and not usually

sufficient for thorough practice of all the new language materials being learnt. English

speaking is a skill that learners need to practise regularly in and after class. Organising

speaking activities for students to participate in after class would be part of teachers’ plans

to promote using English for communication among learners. However, teachers needed to consider types of extra-curricular activities that suited young adult learners’ interests, as well as suitable venues, as most of the learners were studying on a campus quite far from

downtown. A teacher shared with me her experience in getting students engaged with the

target language via communication with English-speaking foreigners in Vietnam. The

verbal interaction could take the form of an informal interview through which students

could practise English speaking skills and compare interviewees’ answers to class topics

from the perspectives of people from different cultures, different levels of knowledge, and


different levels of English competence. She found this type of after-class activity very

useful and it increased students’ interest in using and improving their spoken English

(U2.T1).

8.3.3 Teachers’ perceptions of the test impacts on teaching EFL speaking skills

With a similar approach to the test takers’ questionnaire survey, I employed descriptive

statistics, including frequencies, percentages, means, and standard deviations, to examine

teachers’ overall perceptions and opinions of the test impact on teaching. Following are the

main results supported by examples from individual interviews with teacher raters.

Table 8.3 presents the means and standard deviations (SD) for the set of eight items

which asked the teacher respondents to indicate their agreement on various aspects of the

test’s impact on spoken English teaching.

Table 8.3 Means and standard deviations (SD) for EFL teachers’ perceptions of the test’s impact on teaching and learning speaking skills

Statements N Mean SD
(1) The test helped me identify the strengths and weaknesses in students’ English-speaking skills. 35 4.37 .55
(2) The test helped me identify what test-taking skills my students need for class activities. 35 4.17 .62
(3) I exploited the content of the current course book to teach students speaking skills. 35 3.71 .96
(4) I regularly got students involved in speaking activities in class. 35 4.40 .55
(5) I used tasks similar to those of the test to help students practise speaking English in class. 35 4.00 .84
(6) Before the test, I gave students full details of all aspects of the test tasks, e.g. goals, contents, format, assessment criteria, etc. 35 4.14 .69
(7) Before the test, I spent time in class discussing with students various topics so they would be familiar with the test format, vocabulary, and structures. 35 4.09 .66
(8) Before the test, students were informed of all the questions they would be asked in the test. 35 2.31 1.35
Valid N (listwise) 35

Note. 1 = Strongly disagree 2 = Disagree 3 = Neither agree nor disagree 4 = Agree 5 = Strongly agree


As can be seen from Table 8.3, the means showing teachers’ agreement with the

statements ranged from 2.31 to 4.40. Seven out of the eight items received high means

suggesting that the teachers strongly agreed with most of the statements. Getting students

involved in speaking class activities (item 4) received the highest mean of teachers’

agreement (M = 4.40). This result was supported by the amount of spoken English the

teachers used in class. Most of the teachers (85.7%) reported that their instructions were at

least 80% in English. A number of teachers (17.1%) used only English in classroom

activities. High exposure to spoken English not only gave students a sense of attending an

English-speaking course, but also motivated or, to some extent, obliged learners to use

spoken English in class communication. This practice created good habits and helped

students to form language skills essential for the oral test. Teachers believed that they

needed to continue promoting students’ involvement in speaking activities because the

students were not confident in using spoken English. The students might have good knowledge of the English language, but a lack of confidence prevented them from demonstrating their potential speaking skills. The following example from a rater’s

interview illustrates why and how she continued to think of activities to encourage

students’ classroom engagement:

The quality of candidates’ speaking performances in my test room was not as

satisfactory as the last time… I could withdraw lots of experience. First, candidates

were not confident. In order to help candidates be more confident, I would make

them engage more in discussion activities, spend more time speaking into a

microphone in front of the class to demonstrate their language skills as much as

possible. Familiarity with speaking in front of such a large audience will make them

confident in oral exams. The majority of candidates whose speaking was not fluent

had very good knowledge, very good language competence, did not express well, or

fluently, or had pauses, was time-consuming. In general, there was too much silent

time, or some reason why they could not express all the ideas they had. They had

many ideas, but they were too nervous. I think I will have to get them to speak

[English], discuss, and practice much more than before so that they can become

confident. (U2.T2)


The evaluative function of the test helping teachers identify the strengths and

weaknesses in students’ English-speaking skills (item 1) received the second highest

support from EFL teachers (M = 4.37). More thorough understanding about students would

help teachers to make the necessary adjustments in their teaching approach in general and

speaking skills teaching methods in particular. For example, practical experience with the

rating task made a teacher realise that many students walked into the test room without

being sufficiently equipped with test-taking tips and skills to earn high scores. The

following extract was the rater’s response to my question about whether she would make some changes in her teaching of oral skills after the test:

I think yes. After the test event, I could learn from test takers and the teacher in

charge of the (speaking) class. I will teach students the procedures for each test

session for example, what the questions are like, what sentence patterns candidates

should use to answer. And which way to respond to that type of test format or test

question in order to earn points. (U1.T3)

In the following example, an experienced rater commented on students’ strengths

and weaknesses in speaking skills and mentioned that she would pay more attention to this shortcoming when teaching oral skills to Vietnamese learners.

Students are only good at answering questions. They are weak in justifying their

own opinions. For example, when the teacher raises an argument, students do not

evaluate or comment on that argument, but merely present their own views. Second-

year majors are expected to give opinions about some viewpoint. I will focus more

on that issue. Normally candidates are good at answering other people questions,

but they have not been able to argue against people’s opinions. Students just argue

in simple ways without reasoning. The so-called critical thinking is not good

enough. Vietnamese students lack that ability. (U1.T1)

Excerpt 8.3 illustrates the absence of what the rater mentioned as “critical thinking”

in candidates’ oral performance. The candidate (C9) did not elaborate on her responses to the extent the rater expected for an interview task:


Excerpt 8.3 File U2.C9-10 0:01 – 0:35

1       IN: first of all, we would like to ask you some questions.
2       ((interlocutor addresses C9 by her first name)), would you
3  ---> love to protect the environment?
4  ---> C9: yes, I’d love to.
5  ---> IN: in what ways would you do?
6  ---> C9: I don’t threw the trash away, and: in at at my family,
7       throw the (.) grow the plants on the garden, and I save energy
8       when I use it.
9  ---> IN: uhm huhm, how about you?
10      ((interlocutor addresses the other candidate and asks the same
11      question))

As illustrated in Excerpt 8.3, the candidate (C9) gave direct responses to the

interlocutor (IN)’s questions without sufficient elaboration on the ideas provided. When the

interlocutor raised the first question in the form of a Yes/No question (line 3), the candidate

just gave a short answer of affirmation (line 4) without explaining why she would like to

protect the environment or what would happen if the environment is not protected. The

interlocutor continued to ask a follow-up question (line 5) as prepared in the testing

material booklet. The candidate responded by listing three things she did to contribute to

protecting the environment (lines 6-8) without further elaboration, for example how the

things she did could help with protecting the environment, or why she needed to lend a

hand to protect the environment, or what if the environment was not protected, etc. The

interlocutor passed the same question on to the student partner without asking Candidate 9

for further information (line 9).

The problem was similarly repeated in a more challenging task, the paired

discussion task, which came after the interview task. Candidates did not contribute opinions

and arguments in a critical way, but simply took turns to add ideas around the topic.

Excerpt 8.4 illustrates a typical example of candidates’ inadequate interaction and critical thinking in paired discussion tasks.


Excerpt 8.4 File U2.C11-12 3:34 – 4:54

1  ---> C11: okay, today we talk about >short term and long term
2       environmental impact of oil spill<
3  ---> C12: mm hmm. so: uhm I think uhm
4  ---> C11: [when] oil spill is oil is (.) oil’s trapped in the
5  ---> water, the beach is: (.) either (.5)
6  ---> C12: it make water is polluted because oil oil spill (when) they
7       on the surface, so maybe some fish and uh species they uh uh
8       they
9  ---> C11: [they] will die and
10 ---> C12: [yeah]
11 ---> C11: we have limited limited about
12 ---> C12: [yeah] and animals die or become uh become ill when they
13      uh maybe uh tch breathe or eat some food but it affected by uh
14      um oil
15 ---> C11: fish, uh fish uhm (.4) have have oil. we eat we eat fish
16      and we=we get sick because of that.
17 ---> C12: yeah, and it's hard for fisher, uh, fisheries because some
18      (.) because uh, maybe if the fish ...

Candidate 11 re-read the discussion question from the testing material booklet (lines

1-2) quite quickly to raise the topic again after 30 seconds’ preparation time together with

Candidate 12. Candidate 12 was thinking how to start the discussion (line 3) when

Candidate 11 inserted an utterance to explain what oil spill was (line 4) using one of the

prompts provided (U2.Q63). When Candidate 11 continued talking about the beach,

perhaps she was trying to use ‘either… or…’ (line 5) but paused for quite a long time (five

seconds), so Candidate 12 interrupted using another prompt to talk about the impact of oil

spills on water (line 6). When Candidate 12 seemed to have difficulty talking about the

impact on fish and other species (lines 7-8), there was an overlap when Candidate 11

interrupted to complete her partner’s unfinished sentence with an utterance indicating the

consequence of oil spills on the water surface (line 9). Candidate 12 showed agreement

right away in the middle of Candidate 11’s utterance (line 10). While Candidate 11 was

intending to talk about some other aspect (line 11), Candidate 12 interrupted with another

prompt about the impact on animals’ health when they ate food affected by oil (lines 12-

14). Candidate 11 took a turn to add a similar impact on human health when people ate fish contaminated by oil spills (lines 15-16). Candidate 12 agreed and took a turn to use another


prompt to talk about the impact on fisheries (line 17). The discussion was not finished there

but continued with a similar tone.

We can see that there was a relatively equal co-construction by both candidates, with ten turns altogether (five turns each) in 1 minute and 20 seconds – only about eight seconds per turn. The time for each turn was too short for a sufficient and convincing argument to

be produced. What the candidates were able to do was take turns to use the prompts

provided to explain how each prompt was linked with the topic question. The candidates

made almost no argument or further explanation about what short-term or long-term

impacts on the environment caused by oil spills were. The transcript might show some

signs of interruptions and overlapping talk (lines 4, 6, 10, and 12); however, these could be

interpreted as the candidates’ showing interest in their co-constructed oral performance, or

as their effort to fill pauses in the flow of their talk, or as an attempt to incorporate every

prompt provided to perform a 3-minute discussion task. Such interpretations should be

considered when deciding the score for each candidate.

Raters from different institutions had a general consensus about students’ weaknesses in expressing personal viewpoints with supporting arguments for or against an issue.

In the following example, a rater commented that many candidates performed a “turn

taking” talk rather than a discussion and explained the circumstance. In an oral examiner’s

words:

What our Vietnamese students did not usually do well in ‘mind map’ was that they

performed it like ‘turn taking’. For example, a student says, ‘I like eating ice-

cream’, then the other respond, ‘I like drinking Pepsi’. And the two people

continued until the time was over, or they got bored. It would drive the story to

something more interesting or a lot more fun if they could ask and answer why they

liked Pepsi, or why they liked that kind of ice-cream but disliked this kind. That is

our (Vietnamese students’) shortcoming until we have grown up. When talking with

Westerners, you can notice that they are very clear in their view points, whereas we

Vietnamese are normally unassertive, so do not dare to say we disagree with

someone else’s view point. I suppose it is something of Vietnamese culture. In our

culture, I see people tend not to express disagreement frankly. (U2.T3)


When I asked the same teacher whether she would make any changes or

improvements in her teaching to help students improve their English-speaking skill if she

were assigned to teach the course again, the teacher reported as follows:

… I have made changes since the last semester. That means I helped students express

their viewpoints and explain why. In the speaking class, I clearly told my students

that I had my own points of view. They should not be afraid of disagreeing with my

opinions or their classmates. I encouraged my students to express their personal

perspectives with explanation. That (skill) is very necessary when speaking on a

‘mind map’ task. (U2.T3)

Having students express their personal opinions with supporting ideas was a

speaking activity the teacher frequently applied in her class. She knew that her students

would have to perform this task as part of the end-of-course exam. According to her

teaching experience, Vietnamese students did not have the habit of expressing outright

disagreement with others’ viewpoints, which was disadvantageous for them when

performing a discussion task.

Table 8.4 summarises the changes in teaching perceived as necessary by the teachers as a consequence of information gathered from the oral test. These changes range from classroom activities and textbook adjustments to extra-curricular activities after class.


Table 8.4 Teachers’ changes and adjustments as a consequence of the oral test

Classroom activities – Purposes
Organise more groupwork (U1.T1) – Increase students’ talk time and build confidence in speaking
Set up interactional situations as a stimulus for oral production (U1.T2) – Practise problem-solving skills and quick responses to social situations
Teach speaking skills in class with more responsibility (U2.T1) – Make sure students can perform well when being rated by other raters
Incorporate technological applications in planning lessons (U3.T3) – Suit young adult learners’ interests
Inform testing procedures, techniques to answer different types of questions (U1.T3) – Facilitate students’ performance in test rooms, and meet raters’ expectation to score higher
Encourage students to express opinions and defend personal perspectives with evidence and supporting arguments (U2.T3) – Practise critical thinking and bring open-mindedness to the speaking class

Textbook adjustments – Purposes
Adjust the length of, or the approach to exploit, topics introduced in the textbook (U3.T2) – Fit lessons with the timeframe allowed and arouse learners’ interest
Design more class speaking activities based on textbook contents (U3.T1) – Overcome limitations of the textbook, increase authenticity of the available teaching materials

Extra-curricular activities – Purposes
Have outdoor activities for students to socialise with and interview foreigners about the topics learnt in class (U2.T1) – Put language into use and discover new ideas from authentic social interactions
Encourage students to join English-speaking clubs within and out of the institution (U1.T1) – Build more confidence and motivation in learning speaking skills

As presented in Table 8.4, most of the desired changes and adjustments teachers

intended to make concerned teaching and learning activities in language classrooms. The teachers would

organise more interactional activities in stimulating situations to engage learners in quick

response practice (U1.T2). The adoption of more groupwork (U1.T1) would create more

opportunities for students’ speaking practice, increase students’ talk time in class, and build

their self-confidence in face-to-face conversations with others. Working with peers,

students could practise real-world oral language skills, e.g. problem-solving skills,

decision-making skills, etc., which were limited in whole-class activities. Peer interactions

provide a useful rehearsal of candidates interacting with each other to produce “language

that is appropriate to exchanges between equals, which may well be called for in the test


specifications” (Hughes, 2003, p. 121). However, a considerable disadvantage of grouping

candidates for an oral test is that “with larger numbers (of group members) the chance of a

diffident candidate failing to show their ability increases” (Hughes, 2003, p. 121).

Teachers found that they needed to have their students more engaged in practising

interactional skills (U2.T3). Most of the students could complete interactive speaking tasks

with the examiner or fellow candidates. Nevertheless, candidates’ oral performance

revealed weaknesses in defending personal perspectives with evidence and supporting

arguments. Elicited samples of interactional skills were restricted to expressing opinions,

agreement and disagreement but lacked other skills such as requesting information and

opinions, attempting to persuade others, or justifying other speakers’ statements or opinions

(Bygate, 1987; O’Sullivan, Weir, & Saville, 2002). In discussion tasks, candidates usually

had difficulty in interaction management skills, i.e. when and how to come to a decision as

required (Task 2, University B), or end an interaction within the time constraint allowed

(Task 2, University A). Despite being informed of the time limit at the beginning of task

delivery (Table 6.3), most of the paired candidates performing interactive tasks could not

wrap up exchanges on their own or bring a sense of closure to their discussions. Candidates’ discussions were usually still in progress when the interlocutor/examiner signalled

or asked them to stop. Excerpt 8.5 illustrates the last minute of an interactive task in which

the paired candidates were discussing the question “What aspects of teamwork do you find

the most challenging?” (U1.Q31).

Excerpt 8.5 File U1.C13-14 7:25 – 8:27

1  ---> C13: also time schedule because, uh, the team were uh (.) work in
2            a team were and we have deadline to meet. so, uh, it's important
3            for our every member of the team to try try the best to, uh, meet
4            the deadline. yeah. because sometimes when I work in a group, I
5            sometimes just like leave the work onto the last minute. yes. so
6            it's pretty like a pressure to catch with the deadline.
7  ---> C14: it's just been character-characteristic and % also you brought
8            it because %, uh (.)
9  ---> IN:  can you speak it up? speak it up.
10 ---> C14: um (.) I think characteristic also a vital factor because, uh,
11           every single people have (.) they are always run. so, a hope attain
12           needs, uh, us. we need, uh, supportive, every specific
13           characteristic to make the whole team become a great team, become
14           more fit to work.
15 ---> IN:  ok, so it’s the end of the discussion. thank you.

In Excerpt 8.5, Candidate 13 was talking about ‘time schedule’ as another aspect of teamwork that she thought important (lines 1-6). She had not concluded whether, in her opinion, it was the most challenging aspect. Candidate 14 took her turn to add her viewpoint, but with hesitation and in such a low voice (lines 7-8) that the interlocutor asked her to speak more loudly (line 9). Candidate 14 was elaborating and explaining her different opinion about ‘characteristics (of team members)’ (lines 10-14) when the interlocutor interrupted with an indication that the time for this task performance was over (line 15). He finished the speaking session by thanking the candidates even though they had not reached a conclusion about whose opinion sounded more convincing or which aspect seemed the most challenging, as the task required. In response to this weakness in candidates’ discussion task performance, teachers expressed a desire to place greater emphasis on equipping students with interactional skills and interaction management skills in Speaking lessons (U1.T1, U2.T3).

Teachers noted the importance of informing candidates of the testing procedures,

and useful techniques to answer different types of questions (U1.T3). The purposes of these

test preparations were to familiarise candidates with test room procedures, and to enhance

candidates’ oral performance to meet raters’ expectations. Evidence from the student survey and interviews indicated that not all candidates had a clear understanding of

what aspects of their speaking were tested, or how much each aspect weighed in the

possible score they could obtain (see Chapter Four). According to the Standards for

Educational and Psychological Testing (AERA/APA/NCME, 1999), the higher the stakes of a test for test takers, the more important it is that candidates are provided with information about the scoring criteria so that they can produce the most appropriate responses possible. Providing this information, however, requires closer collaboration between test designers and the teachers in charge of Listening-Speaking classes.

Information obtained from the research suggests some requirements for course book

adjustments to suit Vietnamese learners, and to enhance the quality of oral skills

instruction. Teachers sought more appropriate approaches to introduce and develop topics


covered in the course books that Vietnamese students were not very familiar with. For

example, when learning the topic ‘Intelligent Machines’, students would find it boring and difficult to grasp ideas about mechanics. Instead, the teacher could employ visual aids or video clips to introduce the intended theme, or modify the lesson with information about smart computers, which are popular with young people. The purpose is for students to see that the topic is universal and not as difficult as they think. Students would then understand topical lessons better and faster, and be more interested in attending class (U3.T2).

In addition to in-class changes, teachers recognised that extra-curricular activities after class were very useful for improving oral skills, e.g. participating in English-speaking clubs (U1.T1), socialising with native English speakers, and conducting interviews with foreigners about the topics discussed in class (U2.T1). Class meetings were

scheduled only once a week, which was insufficient for regular practice of speaking skills.

Teachers realised that to achieve better performance in oral tests, students should be

engaged in using spoken English in real life. These activities help students build more

confidence in L2 communication, become more motivated in their L2 study, and learn more

about the culture of English-speaking countries through authentic social interactions with

people from those countries.

8.3.4 Implementing a new method of assessing speaking skills

The institutional listening-speaking curriculum design, including the course book, course objectives, test format, and test questions, may vary from time to time. The course book usually takes the longest to change. The test of speaking skills is administered at the

end of every semester. The test questions may vary from semester to semester. The purpose

of this change is to ensure the security of the test content as there may be some candidates

retaking the test. The speaking test format changes in accordance with the course book, and

the course objectives. Further, the EFL Faculty may change the oral test when there is a need to enhance the quality of testing. For example, one of the institutions had

shifted from a one-on-one speaking test format with a single examiner to a paired format

with two examiners, as a result of practical experience from a previous classroom-based

oral assessment which contained many shortcomings in rating and assuring fairness. The


following quote from a teacher rater of this institution reports problems with the old

method of classroom-based speaking tests:

Previously the final test for speaking skills did not take the form of gathering all the

students of the same course on the same day. There was not an arrangement for two

raters in each test room. There was only one rater who was the teacher in charge of

that listening-speaking class. I think such a way of assessment is not objective. It

easily became unfair when a teacher taught those students, and then tested them.

(U2.T2)

Teachers’ leniency or empathy did not motivate students to try their best in studying speaking skills or to realise the importance of assessment. In the words of another

teacher rater of the same institution:

Students feel that oral assessment is not challenging. Very often the teacher was

empathetic to let candidates pass. Even I myself used to be too tolerant in scoring.

That is to say, I was not very strict. (U2.T3)

Another teacher from the same institution reported that the new format for oral assessment had just been applied for the last two semesters. One of its purposes was to familiarise students with English examinations for international certificates, the current versions of which required candidates to perform in all four language skills. Whether students were learning English for work, study overseas, or any other goal, familiarity with oral assessment would be very beneficial when they needed international certification of their English proficiency (U2.T1). The change to the oral testing method helped university students become familiar with the way oral performance is assessed around the world. This change received strong support from the teaching staff and laid important foundations for expanding oral assessment into non-English major training programmes, which had never included formal speaking tests before.

I find it true that the new method of oral assessment is a little more tiring. The exam

with more examiners is obviously more formal. This is the second time my

institution administered the oral test with double ratings, which was much better for

students. I hear that the faculty is going to apply this form of testing for non-majors


of English, i.e. students of other majors will be gathered for the oral test on the same

day. As I said, I myself, a human rater cannot be very accurate. But when there is

another one, the raters marking together had to be more considerate. Then fairness

was more assured. I think this is a good form of oral testing. (U2.T3)

8.4 Summary

In this chapter, I have presented the impacts of the institutional test on teaching and

learning English speaking skills. The test impacts on teaching were viewed from teacher

raters’ perspectives, and those on learning were perceived from student candidates’

perspectives. The oral assessment had positive influences on EFL pedagogy in that it

helped teachers and students identify which areas in teaching and learning oral skills needed to be improved. The test motivated teachers to teach the contents covered in the course book

to prepare students for the test and encouraged students to exploit the course book for

lexical resources and speaking skills that the course targeted. The oral test created positive

changes to the way English majors’ speaking skills were assessed, which enhanced fairness

and accuracy. However, the oral assessment also caused some negative impacts because of

non-alignment in test administration and scoring. The combined exam score of listening

and speaking skills did not reflect students’ ability and the constructs being measured.

Inadequate information about the assessment criteria hindered students’ best preparation to meet raters’ expectations in the assessment of oral production skills.

The next chapter summarises key points of my research results regarding important

areas in speaking test validation: test administration, test content, tasks, scoring, and the

impact of the test. Limitations of the study and recommendations for future research are

included, as are implications for EFL education. Finally, conclusions about the findings of

my research are presented.


Chapter Nine

SUMMARY AND CONCLUSIONS

9.1 Introduction

In my study, I examined the operational practices of EFL oral assessment in Vietnamese

universities. There is a general consensus in the literature that language testing in Vietnam

does not receive sufficient attention compared with that paid to innovations and research in

EFL language teaching methodology (L. A. Pham, 2017; Vu, 2007; Le & Roger, 2009).

The overlap of current policies and the lack of a specific implementation route for

standardised assessment of English language competency are challenges for the educational

quality assurance of tertiary education (Phuong Chinh, 2017). In Vietnamese tertiary

education, oral assessment is not compulsory for non-English majors. Most of these

students take a written English examination when they have finished a general English training programme or completed an English for specific purposes (ESP) course at university.

However, oral tests are compulsory for English majors since the curriculum includes

speaking skills, either as a separate subject or integrated with listening skills. My study

explored the institutional assessment of English majors’ speaking skills in the form of

achievement tests administered on completion of the integrated Listening-Speaking subject.

Testing in a language teaching programme needs to ensure its effectiveness in accurately measuring what it intends to measure, its consistency in producing reliable results across individuals, and its content relevance to the course of instruction (Bachman & Palmer,

1996; Fulcher, 1997a; Siddiek, 2010; Muñoz & Álvarez, 2010). Overall, my research

results indicate remarkable discrepancies in the methods of oral assessment across

Vietnamese universities. The selection of each method determined candidates’ responses to tasks, examiners’ decisions on candidates’ scores, and the inferences made from test scores.

This final chapter begins with a review of key results from my study on oral

assessment with reference to my research questions. I summarise my findings related to the

testing context (from Chapter Four to Chapter Six), the rating process (Chapter Seven), and


the impact of oral assessment (Chapter Eight). I discuss my study’s implications for

Vietnamese tertiary EFL education after presenting limitations and recommendations for

future research. My conclusions close the chapter with reflections on the entire study’s

results in relation to current challenges in language testing and assessment in Vietnam.

9.2 Summary of research results

I conducted an empirical investigation into the testing and assessment of EFL speaking

skills administered in Vietnamese universities. I collected data in a temporal sequence of

three stages, before, during, and after the examination, to seek answers to core areas in oral

testing. My research participants included 35 EFL teachers (examiners) and 352 EFL

students (candidates) engaged in the achievement tests of oral skills at their institutions. I

invited six EFL experts to contribute their judgements on the relevance of the test content to the

course objectives. The richness of data collected from observations, surveys, and interviews

enabled me to explore multiple aspects of school-based oral assessment. The rationale for

my adoption of a mixed methods design is that “the comprehensive nature of mixed

methods research allows for many questions to benefit substantially from the incorporation

of both types of data” (Hustad & McElwee, 2016, p. 308). The integration of various types

of evidence (documents, stakeholders’ opinions, field notes, speech samples, etc.) helped to

secure my interpretations of specific aspects of oral testing. For example, I obtained

convergent results about test administration by combining observational data with

contrasting raters’ and candidates’ perceptions about how the test was administered at their

institutions. Integrated data from the questionnaire surveys, test materials, speech sample,

and experts’ judgements provided convergent themes about the relationship between test

contents and course objectives (Appendix E.3).

This section summarises the main results of the study in response to the three

research questions: 1) positing factors affecting oral performance in the testing context

(contextual validity); 2) identifying inconsistency in rating and scoring as sources of bias in

language ability measurement (scoring validity); and 3) exploring the washback effects

(consequential validity) of oral assessment on EFL education in Vietnamese universities.

My test method evaluation concentrated on the setting before candidates’ oral performance,

including the conditions established for oral testing, the contents covered in the test, and the


speaking tasks employed to elicit candidates’ samples of oral production for assessment.

My investigation went further into the rating process during candidates’ live oral

performance to learn about raters’ consistency in their marking. I considered the

consequence of oral testing in L2 pedagogy after the test to inform future changes for

improving school-based oral assessment and associated systems. A short and concise

response to the research questions provides a comprehensive picture of spoken language

testing, from its construction and operational practices to its pedagogical impact.

9.2.1 Issues in test administration affecting test fairness and candidates’ speaking

performance

I gathered onsite oral test administration data of different types from multiple sources. In

this section, I summarise emerging issues in test administration that had potential

influences on candidates’ oral performance and the fairness in assessing EFL speaking

skills. Results suggest necessary changes in administrative work to ensure language testing

quality. More attention should be paid to exam environment, examiner arrangement,

candidate characteristics, test formats, and speaking session timing.

(1) Influence of the test-taking environment

Test room observations indicated that test administrators did not ensure physical conditions

favourable for spoken language production and assessment, e.g. necessary silence and

appropriate temperature (Sections 4.2.1, 4.3). All the test rooms were classrooms used for

everyday lessons. They were not language laboratories or rooms specially designed for oral

tests. Examiners made slight seating adjustments on the basis of the available facilities to facilitate face-to-face interaction between examiners and

candidates. The examiners assigned to each test room were responsible for controlling the

noise level in their own test rooms. Noise did not come from outside the campus, but from the candidates themselves. There was no large, comfortable waiting room for the candidates of each test room, as recommended by Alderson et al. (1995). Neither were

there sufficient staff to ensure a silent testing environment or to provide assistance with

arranging candidates’ test-taking turns for each test room. In one institution there was one

usher to try to manage candidates for all four test rooms at the same time, approximately

220 students in total. Noise was unavoidable at many test sessions, where candidates who were waiting for their turn, or who had finished their speaking sessions, congregated around the test rooms. Failure to control the noise from these groups affected the candidates performing their test tasks inside the test rooms.

Uncontrolled noise indicated a lack of formality, which could threaten the security of test materials: candidates who had finished the test could convey the test content to others.

Survey, interview, and observational data revealed that many candidates suffered from the

noise from both inside and outside the test rooms. The context validity of the test was low, as poor local test conditions have a negative impact on test scores (Fulcher, 2003). Essential

silence is required to ensure fairness and to facilitate the best performance of candidates

and examiners. Noise not only affected the candidates but the examiners as well. Examiners

needed to be focused to make accurate judgements on and score the candidates’

performances. Serious consideration was not given to the noise issue in administering the

oral examination. It may cause psychological pressure on candidates performing their

speaking skills if they feel that they are not fully heard by the examiner(s).

Lack of guidelines for test administration led to unequal timing for each speaking

session. There was a discrepancy in timing for candidates within the same test room.

Students at the end of student lists usually had shorter speaking sessions than those taking

the test earlier. Differences in timing across raters and test rooms were noted when some

test rooms with a similar number of registered candidates finished the test event almost half

an hour earlier than others.

Candidates were not informed prior to the test how much time their speaking

session would take. Usually, the time was decided by the examiners or the interlocutor, if

there was one. Candidates did not know how much time was allowed for their discussion,

for example. Frequently, candidates had not come to a final conclusion in their discussion

when the interlocutor interrupted them because the time set for the pair was over, or the

interlocutor thought they had done enough.

(2) Role of the interlocutor in task delivery and scoring

The inclusion of an interlocutor in each test room demonstrated a sense of formality in

exam administration and helped to enhance reliability in scoring. However, the role of the

interlocutor was not clearly specified, which affected consistency in task delivery, and


speech sample elicitation for assessment (Section 4.3). In a double-rating format, for

example, one examiner played the role of an interlocutor delivering tasks to candidates at

one institution. The interlocutor’s role was different from that of his/her co-examiner in that there

was an interlocutor outline of what he/she would say to candidates. He/she assessed the

candidate’s overall performance using a holistic rating scale. The other examiner as an

assessor used an analytic scale for his/her rating without any interaction with candidates

(Appendix D.1b). At another institution, the interlocutor’s role was flexible and depended on negotiation between the paired raters. No interlocutor outline or separate scoring sheet

for overall performance was available for the examiners’ use. The difference was that one

of the examiners could deliver tasks to students, and ask questions if necessary. The other

did not ask anything. Both used the same kind of scoring sheet and scale to record their

own ratings (Appendix D.1a). As presented in the data analysis, not using an interlocutor

outline inevitably led to inconsistency in assessment procedures, task delivery, and the asking of eliciting questions, particularly in the cases of weak candidates who needed extra questions

to produce sufficient oral samples for rating, or candidates with health problems.

The presence of an interlocutor in a paired rating helped to build the formality of

test administration, since each candidate’s speaking was judged by two examiners.

Candidates might not know what kind of rating scale each examiner was using, or the role

of each examiner, but the presence of two examiners made the candidates feel more

confident about the results they would receive. They knew that the result would be agreed

upon by two examiners, not arbitrarily decided by just one of them. If there was one

examiner who was more severe than the other, then the candidate’s performance would be

defended and balanced in scoring by the other. Obviously, face validity, from the candidates’ point of view, is higher when speaking ability is assessed by two raters.

(3) Inadequate consideration of variations of test taker characteristics

Test administrators did not consider variations in the characteristics of candidates entering

the oral exam. Candidates differed in their levels of language proficiency, test anxiety, and knowledge of the assessment criteria (Section 4.2). Many candidates reported that they were not

informed of the assessment criteria. Not having sufficient information about the test may have resulted from students’ irregular class attendance, or from the teacher not giving the class enough information. Candidates could not prepare satisfactorily or concentrate on the spoken language aspects being tested to achieve the best scores they could. Not understanding the criteria may have been one cause of candidates’ anxiety, as they were not certain what was being assessed. Candidates may have been afraid of

making grammatical mistakes whereas raters may not have paid much attention to grammar

accuracy. Candidates tended to use simple words for everyday conversation whereas the

rater may have expected them to use the language they had learnt during the course.

Candidates may have tried to speak as much as possible in a paired discussion while the

rater appreciated their skills of maintaining interaction and turn-taking in a co-constructed

performance with the candidate partner. When there was little time for speaking as there

were too many candidates, the raters may have cut down the speaking time of each

candidate without prior notice. In some cases, raters may have paid more attention to the

quality rather than the quantity of the candidate’s speaking. The rater may have stopped a

candidate speaking when they thought there was a sufficient speech sample for their rating, even though the time allowed for that candidate was not as long as for others (U2.C22-23; U3.C27).

Test anxiety did have a negative impact on some students’ speaking performance.

Test anxiety is unavoidable in testing, especially in face-to-face examinations (Phillips,

1992; Young, 1986). Nearly half of the candidates surveyed admitted that they were

nervous or very worried before and during the test event. Causes of candidates’ anxiety

varied from personal characteristics (lack of confidence, unfamiliarity of speaking in a test

setting, lack of interpersonal communication skills), language knowledge (weaknesses in

vocabulary, pronunciation, grammar), to external factors (partner, rater, noise, stressful

atmosphere of the test room). My interviews with candidates revealed that test anxiety

hindered the generation of ideas and affected oral production.

Variations in candidates’ speaking ability, familiarity with oral assessment, and

physical conditions are factors that need more careful attention when pairing students.

Paired candidates did not receive adequate consideration from test administrators or test

raters who had the right to put candidates into pairs for assessment. Pairing was done at

random according to the order of candidates in the student list. Evidence from interviews

and speech samples revealed that candidates tended to speak more with partners they had


had experience with in speaking practice in class, i.e. being paired with a familiar classmate

was an advantage to candidates. A candidate’s interpretation of a picture may also have been influenced by his/her partner’s previous description of another picture, since the monologic task was performed in the presence of the partner.

9.2.2 Test content relevance and inequality in test questions’ degree of difficulty

The oral test administered at Vietnamese universities reflected the content covered in the

course book used at each institution (Section 5.5.1). Most of the test questions

demonstrated relevance to the course objectives. However, course content was exploited

differently when designing oral test questions. At one institution, the test questions

replicated the topics from the course book, but their wording was totally different. At

another institution, most of the test questions were taken straight from the course book

while the remaining portion was developed from the themes in the course with which

students were familiar. The other institution informed students of the questions they would

be asked in the test. These questions were of two kinds: display questions and referential

questions for an examiner-candidate interview task (Section 5.3). My results show that

candidates gave better oral performance on questions referring to their personal experience

or opinions (referential questions). In contrast, they had difficulty responding to knowledge-checking questions (display questions) and produced very little meaningful conversation

with the examiner.

Informing candidates of the list of questions to be asked could have two effects on

candidates’ learning attitudes and oral test performance. Access to the questions helped

students concentrate on good preparation for the oral test. However, candidates did not

seem to participate in natural English speaking, but recited memorised answers from the question list to handle the one-on-one interview task, particularly in the

cases of display questions (University C). Further, the list of questions provided prior to the

test date could discourage students’ class attendance and have negative washback effects on

language study. Students might concentrate on practising the questions and invest less time

in other topics covered in the course books. I will review the washback effects of oral

testing in Section 9.2.5.


Most of the speaking test content derived from the themes in the course books. Test

designers exploited the course book contents in several ways: adopting the themes to design

new questions, reusing available questions, and checking information and/or knowledge

covered in the course books. Evaluating and selecting materials for speaking skills needs to

“ensure that materials provide not only linguistic support but also opportunities for

meaning to be engaged, as well as space for learners’ cultural and affective values to

operate in the learning process” (Bao, 2013, p. 420). To encourage more real-world talk,

the content of the materials should facilitate students’ connections to their own situations

and experiences in meaningful and interesting ways (Bao, 2013, p. 423).

There was evidence that the oral test questions within the same task varied in degree

of difficulty. These variations might have added more or less complexity and challenges to

candidates’ task performance, and so affected the test scores they were awarded. Differences

in the degree of difficulty of test questions were caused by question length, wording, topic,

and inconsistent selection of visual prompts for speaking tasks (see Section 5.5). Test

designers may not have paid attention to this aspect in test development. To anticipate the

kind of language being elicited, and the accuracy and comprehensiveness of test questions,

administrators could run an informal trial or “pilot testing” (Alderson et al., 1995, p. 74) on a small group of colleagues or students at similar levels who would not need to take the test formally. Such pilots would provide valuable information about the ease and/or complexity of the test questions before real administration.

9.2.3 Diversity in test tasks elicited different speech patterns for assessment

Differences in task design and response formats determined the candidates’ responses in

task performance. Two of the universities constructed a task to assess candidates’

interaction skills with peers. The other did not offer candidates any opportunity to perform

spoken English with partners. All the latter required its candidates to do was to passively

answer questions raised by the examiner. Interaction occurred in examiner-candidate

interviews but the examiner decided the questions to ask, the degree of elaboration in the

answers, the change of speaking topics, and so forth (Section 5.1.1).

Candidates at University A used visual inputs that afforded them an opportunity to

demonstrate oral language production as part of the CEFR-based assessment. This task


type enabled assessment of a wide range of spoken language aspects such as vocabulary,

grammar, discourse management, etc. (Luoma, 2004). The other institutions did not have

this kind of task. Two of the universities tested speaking skills in pairs. This format is

closely linked to pair work in CLT, with which Vietnamese students would have been

familiar in English classes since secondary school. Paired testing helps candidates reduce

test anxiety (Berkoff, 1985), and “enables a complementary approach” for rating in that

“the assessor observes the interaction while the interlocutor is directly implicated in it”

(Weir & Taylor, 2011, p. 305).

Because of the time constraints, the interview format limited candidates’ speech to

question-answer tasks that aimed to elicit full responses to various topics (Section 6.1.3).

Although the interview lengths at one institution (offering only one task) were longer than

those at another (offering two tasks), both lacked typical features of a structured interview, e.g. warming up, level check, probing, and closing (Johnson & Tyler, 1998;

Clark & Clifford, 1998). The interlocutor/examiner asked candidates predetermined

questions. However, varied elicitation techniques across examiners made interviews

different in terms of length, topic coverage, and degree of difficulty, even though the same

assessment criteria were applied at each institution.

My analysis of qualitative data gathered from observational field notes and course

outlines indicated a diversity of tasks. The three institutions adopted a direct (live) format

of oral assessment. Each designed and utilised its own test tasks, which enabled varying levels of candidate interaction and different patterns of elicited speech samples. University A

offered two tasks: Task 1 required each candidate to speak extensively (monologic) with

highly limited interaction with the rater and their partner; Task 2 provided an opportunity

for each candidate, in pairs, to present his/her oral ability in a discussion task (dialogic).

University B’s assessment also had two tasks: the first task required individual candidates

to produce meaningful language in response to referential questions. University B’s second

task was ‘Discussion’ but with procedures different from those of University A. As presented

in Table 6.1, each pair of University B’s candidates was given a mind map with word cues

to discuss a related topic, while University A’s candidates made a choice between the two

discussion topics provided. University C adopted a responsive speaking task with a list of


questions to ask candidates in a one-on-one interview. Variations in testing methods

determined candidates’ spoken language samples and required appropriate rating methods

for examiners’ judgements. However, raters were not provided with effective devices for

accurate and consistent rating, e.g. clearly-defined assessment criteria, scoring guidelines,

rating scales, samples for oral performance interpretation, etc.

9.2.4 Inconsistency in rating and scoring spoken language performance

My test score analysis indicated that there was a discrepancy in raters’ agreement when

both raters scored the same (pair of) candidates. The degree of agreement was higher when

raters used a more specific rating scale than when they used an overly general one (see Chapter Seven).

The level of agreement was lower when paired raters scored independently. When pairs of

raters informed each other of the scores after each performance, their agreement tended to

be higher. Informing each other of the scores right after each test session was a method that

raters applied to prevent their scoring from varying too much from the other’s in double

rating practice. However, this scoring method might potentially lead to bias if one rater’s decision is influenced by the other’s. The advantages of sharing scores were that agreement between raters was achieved, and that overly lenient or overly severe scoring could be avoided, compared with single scoring.
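To make the notion of rater agreement used in this section concrete, the short sketch below (in Python) is a minimal, hypothetical illustration only: the candidate codes and scores are invented, and the choice of the mean absolute score difference and the Pearson correlation as agreement indices is my own assumption for illustration, not the score analysis procedure reported in Chapter Seven.

# Hypothetical illustration: quantifying agreement between two raters who
# scored the same candidates. All candidate codes and scores are invented.
from statistics import mean
from math import sqrt

rater_1 = {"C01": 7.5, "C02": 6.0, "C03": 8.0, "C04": 5.5}
rater_2 = {"C01": 7.0, "C02": 6.5, "C03": 8.0, "C04": 4.5}

pairs = [(rater_1[c], rater_2[c]) for c in rater_1]

# Mean absolute difference: how far apart the two raters' scores are on average.
mean_abs_diff = mean(abs(a - b) for a, b in pairs)

# Pearson correlation: do the two raters rank the candidates similarly?
xs, ys = zip(*pairs)
mx, my = mean(xs), mean(ys)
cov = sum((x - mx) * (y - my) for x, y in pairs)
corr = cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

print(f"Mean absolute difference: {mean_abs_diff:.2f}")
print(f"Pearson correlation: {corr:.2f}")

On such measures, a lower mean absolute difference and a higher correlation both indicate closer agreement; computing them after independent scoring avoids the risk, noted above, of one rater’s decision being influenced by the other’s during the session.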

Factors that affected scoring validity could be traced back to the rating scales that

the scorer used, rater training, and raters’ familiarity with candidates. As presented in

Chapter Four, rating without a clearly-defined scale, or with ambiguous scale descriptors, was a source of measurement error. The descriptors were either too general for the rater to decide which score a speaking sample fitted (if a holistic scale was used), or did not clearly address specific components of a candidate’s performance. Further, raters did not receive sufficient

training on oral rating. There was no certainty of raters’ accurate understanding of the

assessment criteria and descriptors for each level in the scale since there was no meeting or

training prior to the test date. This meant that raters could not reach agreement in

understanding the described features representative of a particular score, and interpreting

students’ performance in reference to the established assessment criteria. Variation in the interpretation and application of rating scales during the rating process indicates potential sources of inexactness that may threaten the soundness of testing and assessment (Mathews,

1985).

Raters’ familiarity (or unfamiliarity) with candidates may have influenced their

scoring (Section 7.7). Raters’ acquaintanceship with candidates varied remarkably across

institutions. Raters who knew candidates may have taken their prior knowledge of the candidate’s speaking ability into account in their rating. Some raters who did not know the candidates’ syllabus may have expected candidates to perform differently from the course objectives. However, the assignment of teachers who were unfamiliar with candidates should bring objectivity into their ratings. Objective assessment makes teachers and students more responsible for

their teaching and learning in class.

9.2.5 Impact of oral testing on EFL teaching and learning

I analysed washback effects of the oral test on teaching and learning English speaking skills

from the perspectives of candidates and teacher raters participating in the test event

(Chapter Eight). My analysis revealed that the oral assessment had impacts not only on the

EFL pedagogical activities to prepare for the test, but also on educational adjustments and

improvements induced from the test. In this summary section, I review positive and

negative washback effects perceived by teachers and students from both directions:

teaching and learning before the test to accomplish good speaking performance in the test,

and teaching and learning after the test to improve speaking performance.

9.2.5.1 Positive washback

The oral test helped teachers and students to identify what needed improving in their

pedagogical methods and encouraged appropriate changes.

Most students’ speaking practice in classroom activities was not assessed and scored. The

teacher might not have paid close attention to every class member when he/she was

monitoring the whole class. In class, the teacher was not able to recognise all the problematic

areas in spoken English among the students he/she was in charge of. Students could

demonstrate their English-speaking ability and interact with peers in a relaxed manner, as in

everyday conversations. They did not put themselves under pressure to achieve their best

oral performance. However, speaking English in an oral test was different. Students tended


to demonstrate their best speaking ability to obtain the highest scores possible when

assessed according to the benchmarks established by the institution.

The school-based test was an important event at which the teacher rater could gain the fullest picture of their students’ spoken English. Examiners needed to analyse each

candidate’s speech sample carefully to decide a rating that would reflect the student’s oral

ability in reference to the criteria being assessed. It was from the assessment that the raters

would get overall information about the students’ English level, as well as their strengths

and weaknesses. Students would estimate how well they performed the test tasks as

required. Many of them compared their English ability with other students studying the

same programme when they were paired in a discussion task.

Teachers and students, from their own perspectives, could identify and work to overcome problems associated with teaching and learning English speaking skills. For example,

students realised that they needed to improve their pronunciation because they found that

their partners could not understand their spoken English. Many others did not have

difficulty with English pronunciation, but in expressing their thoughts on test topics that

they were not strong at, e.g. for the topic “advantages and disadvantages of the new model

of climbing the career ladder”. Despite their familiarity with the topic, candidates found it a real challenge to discuss it with their partner if they did not know the meaning of the key

words provided in the language input such as “flexible”, “loyal”, “stable”, “retire”, “gain

experience”, and so on (U2.Q64). Dealing with this kind of task made students realise that

they had certain gaps in their English lexical resource. The test was an opportunity for

students to self-evaluate whether they had grasped the meaning and usage of new

vocabulary as part of the course requirements.

Other students did not have problems with their knowledge of language, but had

difficulty expressing their thoughts appropriately. For example, a paired discussion task

required not only correct pronunciation and topical vocabulary, but also effective oral skills

in agreement, disagreement, explanation, asking opinions, etc. guided by a written question

or a mind map (Table 6.1). Deficiencies in discussion skills could not be attributed to the

training programmes. All the course books provide learners with essential speaking skills

for discussion that could be used for either paired or group discussion tasks. The paired task


included in the test enabled evaluation of the extent to which learners were able to perform

speaking skills and discussion strategies presented in the course books, such as agreeing

and disagreeing, using persuasive language, changing the topic, adding to a speaker’s

comments to become an active conversation partner, etc. (see Appendix D.3). It was via

this test that teachers and students could find out which skills students performed well and

which they needed to practise more. After the test, students would have their own learning

strategies to improve the skills they needed, or the teacher would adjust their methods of

teaching to enhance students’ performance of these skills if they taught the same programme

again.

Many test questions also required candidates’ general knowledge to be able to give

a task-based oral performance. Language knowledge alone was not sufficient to answer such

questions as “What are the pros and cons of neuromarketing?” (U3.Q19). Besides having to

know what “neuromarketing” was, candidates needed to know what its pros and cons were.

The oral test was measuring not only linguistic competence but also general knowledge. At

another institution, knowledge requirements in the question-and-answer task were similar,

for example “Preservatives are added to food to keep it fresh for a longer period of time. Do

you think the advantages outweigh the disadvantages? Why or why not?” (U2.Q58).

Although the sentence context provided a hint (something added to food to keep it fresh) to

understand the term “preservatives”, candidates needed to have definite knowledge about

the functions and effects of preservatives in food to be able to answer this question with

confidence. All these pieces of information were available in the course books that the

institutions used to teach the integrated listening-speaking course. The test gave both

teachers and students an opportunity to evaluate whether students had grasped the targeted

knowledge and were able to incorporate the knowledge into their performance of spoken

English.

Oral testing helped teachers recognise some students’ weaknesses in subskills such as

critical thinking, an essential element in the paired discussion task. Although critical

thinking was included in every unit of the course books, students did not seem to be able to

employ many of these important subskills in their oral performance. Critical thinking, an

important factor for personal success and for social development, has been popular in Western


countries for a long time. However, in Vietnam, critical thinking is still quite a new and

abstract concept (Learning about critical thinking, 2017). A possible explanation for this

shortcoming is that critical thinking has been limited at three levels: family education,

school education, and social life (Le, 2014). At school, critical pedagogy is “very different

in comparison with the popular teaching methods in Vietnam, i.e. the teacher reads – the

learner writes, the teacher questions – the learner answers. Learners’ raising questions is

possible, but restricted on content, time, and space” (D. K. Nguyen, 2017).

Vietnamese educators have supported the creation of an educational environment in

which young people can learn how to express their opinions. For example, the teacher

should listen to learners, encourage them to constantly widen their knowledge, allow them

to criticise teaching materials, and be open to learners’ different viewpoints (Xuan Phuong,

2016). In the context of teaching and learning spoken English, teachers have made positive

changes in their speaking lessons not only to increase student talking time (U1.T1), but also

to encourage learners to express personal ideas, even if those ideas contradict the teachers’ own (U2.T3).

The oral test motivated teachers to teach the contents covered in the course book to

prepare students for the test, and encouraged students to exploit the course book for

lexical resources and speaking skills.

Each institution established its own training programme and selected its own course

book for the listening-speaking course. Evidence from the EFL experts’ judgement and

EFL teachers’ opinions revealed that most of the test contents originated from the course

books (Section 5.5). Test designers were the teachers in charge of the listening-speaking

classes, since they were the ones who were using the books and best understood the

students’ English-speaking levels. Test designers of each institution adapted original

materials from the course books to compile the official test for their institution. The

contents of the oral test covered all the themes in the course book. The lexical resource

came from listening passages or reading texts (receptive skills) to be used in speaking tasks

(productive skills). For example, the discussion question “Should people have a perfect

memory to remember everything that has happened? Why (not)?” (U1.Q7) derived from

the theme “Nostalgia” (University A); the question “Which meal is the most important of


the day to you? Why?” (U2.Q16) was taken exactly from a post-listening group discussion

activity conducted after students had listened to a conversation between two people eating

lunch (University B); or the question “What group of people dispute with file-sharing?

Why?” (U3.Q32) emerged from an excerpt entitled “Intellectual Property and the Music

Business” (University C), and so on. The test contents scattered throughout the course book

encouraged teachers to pay equal attention to all the topics, and discouraged them from emphasising some topics while ignoring others. Students who regularly participated in speaking class activities would benefit more than those who were usually absent from the lessons. These benefits should stimulate students to attend speaking classes regularly.

Most of the test contents originated from the course books being used. However,

this content implementation did not mean that the course book was the only source of

language materials or activities that teachers could use to involve students in English-speaking activities during and after class time. Teachers used additional appropriate materials, online practice accompanying the course books, or supplementary sections to each unit that allowed room for teachers’ creativity. For example, the following wrap-up

activity aimed to encourage students’ self-study to broaden knowledge and maximise

English speaking time in class. This activity, extracted from the course book Lecture Ready

3 (Frazier & Leeming, 2013, p. 66), comes at the end of the lesson unit entitled “Intelligent

Machines”:

Think of a robot or computer from a science fiction movie or novel that you find interesting.

Bring a picture of it or a passage in which it is introduced to class and describe the robot or

computer to your classmates. Include the following information in your description:

• What is it capable of?

• Would it pass the Turing Test?

• Is it helpful or harmful to humans?

• If it were possible to create such a machine today, do you think it should or should

not be created? Why or why not?

Figure 9.1. Example of a wrap-up activity for a speaking skills lesson


In order to maintain this positive washback effect, the specific contents of the test

questions used would have to be changed for every examination because of the need for test

content security. Otherwise, teachers might use the test questions for test preparation, or

students who failed the oral test might focus on a limited number of the test questions for

their second attempt.

The oral test created positive changes to the way of assessing English majors’

speaking skills, which enhanced fairness and accuracy.

Adopting a two-examiner rating scheme in the assessment method helped to increase the

degree of agreement between raters and decrease bias in scoring. Two of the three

institutions arranged two raters for each test room: one institution had had pairs of raters for

a long time; the other had just applied it for the second time, having found that the new

method of testing made scoring more reliable when the scores were averaged, or there was

a discussion between raters to arrive at the final score for each candidate. As such, errors in

rating and scoring could be minimised when the severity or leniency of either of the raters

could be balanced. Although students might have felt more stressed when facing two

examiners, most of them understood that their speaking performance was rated twice, and

were reassured about the fairness of the evaluation they received. The ratings by two raters

were more objective than those done by only one.

The double-rating format helped narrow the distance between institutional tests and

international tests. The test format of paired interaction established a link between testing

and communicative language teaching that has become popular in Vietnam since the early

2000s. The practice of switching raters among classes has helped to decrease the influence

of acquaintanceship between raters and candidates that could cause errors in measurement.

It has made teachers in charge of speaking classes more responsible for the quality of their

teaching, and has made students take the test more seriously than classroom-based assessment in which their own listening-speaking teacher acted as the rater in the end-of-course exam.

9.2.5.2 Negative washback

The combined score of listening and speaking did not reflect students’ ability and the

constructs being measured.


The speaking test scores were not reported separately but combined with those from the

listening skills test to make the final scores for the listening-speaking course that students

had taken. Under this way of reporting, students were not able to know their scores for each component. The ambiguity in test score interpretation overshadowed the significance of the oral test, which required far more resources to design and administer than the listening test. The power of oral testing was lessened when students participated in the test without knowing which aspect(s) of the rating scale they performed well on or needed to improve. Students saw the test as a compulsory requirement to fulfil to achieve credits for the subject. They did not find the assessment a useful device

“provid[ing] formative feedback” (Bachman & Palmer, 2010, p. 27) for making decisions

about learning oral skills, e.g. what areas to pay more attention to, how to build better

learning strategies, or where to go in the next phase of study. As such the educational role

of school-based assessment could not be fully brought into play.

Reporting combined scores of listening and speaking skills together did not help

learners to recognise which specific skill areas in their English communication needed

improving. Institutions set a Pass/Fail grade for assessing English majors’ achievement in

their training programme. However, institutions calculated the Listening/Speaking scores in

different ways (Table 7.2). Not specifying a Pass/Fail boundary for an individual

component (of Listening or Speaking score) may make students concentrate more on one

skill but less on the other. It was challenging for score users to interpret the learner’s ability

for each of the listening and speaking skills when the scores were combined into one. For example,

an overall “average” score does not assure that the learner has an average language

competence for both listening and speaking components, but may be the result of averaging

a good listening score and a poor speaking score, or vice versa. A previous study showed

that there is very little relationship between listening and speaking scores of young adult

learners (Celik & Yavuz, 2015). The main reason was that test takers’ anxiety in the

speaking exam was higher than in the listening exam, or that there may have been a mismatch between the difficulty levels of the listening and speaking components, which negatively affected the correlation between the two variables.
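As a minimal, hypothetical numeric illustration of the averaging problem described above (the component scores and the equal weighting below are invented for illustration and are not taken from any of the institutions’ actual calculation methods in Table 7.2):

# Hypothetical illustration: a combined Listening-Speaking score can mask a
# large gap between the two component skills. All numbers are invented.
listening_score = 8.5   # strong listening performance
speaking_score = 3.5    # weak speaking performance
combined = (listening_score + speaking_score) / 2   # simple unweighted average
print(combined)   # 6.0 reads as an "average" learner despite the 5-point gap

A single reported figure of this kind gives score users no way to tell which component drove the result, which is precisely the interpretation problem discussed above.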


Inadequate information about the assessment criteria hindered students’ best

preparation to meet raters’ expectations of oral performance.

Although students were informed of the test format they had to follow, and of the number and kind of tasks they had to perform, they were not informed of the assessment criteria which raters would use to judge their speaking performance. Learning and

testing experience helped students guess that oral examiners usually paid attention to

candidates’ pronunciation and vocabulary. They were not certain whether or not grammar

accuracy accounted for an important part of rating and scoring, or what expectations they

would have to meet for different types of response formats: monologue in a picture-cued

task, immediate personal responses in a question-and-answer task, and prepared interactive

responses in a discussion task. For example, candidates in the picture description task were

not clear about the extent to which they had to describe each picture (general, detailed, or both), whether they were allowed to ask for a change of picture, or whether accuracy of the

description was required (what if a candidate did not understand the meaning, or gave an

incorrect interpretation of the meaning conveyed in the picture?). On performing language

skills in a testing context, “familiarity (with criteria for assessment) will naturally affect

candidates’ planning and monitoring of cognitive processes involved in task completion”

(Galaczi & ffrench, 2011, p. 131).

More consideration should be given to the transparency of oral assessment criteria,

scoring rubrics, and test results (scores) to all students prior to the test day so that

candidates would have sufficient time to prepare because “only when students understand

the assessment criteria and how they are applied to the oral language they produce can they

actually take responsibility for their own learning” (HKDSE English Language

Examination, 2012). Transparency in testing and exams contributes to establishing fairness

and eliminating fraud in the Vietnamese education system (Quy Hien & Nguyen Dung,

2014; H. Lam, 2018).

Not knowing about the assessment criteria was a probable reason why a large

number of students were less keen on the paired speaking format than the one-on-one

format. Students were concerned that their performance would be influenced by their

partners, e.g. partners’ higher level of spoken English, partners’ unbalanced turn taking,


partners’ unexpected questions, etc. Students were more familiar with the traditional one-

on-one interview than taking an oral test in pairs. They did not understand that the purpose

of pairing candidates was to elicit from test takers a demonstration of interactional skills, one

of the criteria to be assessed. Lack of information about this requirement may have made

students more anxious when being paired with another candidate, especially if they had not

met each other before, or it meant that they did not take the opportunity to demonstrate

their interactional skills, but merely took turns to present personal ideas. Incorrect

interpretation of the purposes of pairing also made candidates unable to create an effective

co-constructed performance with their partner; they may have become too timid to interact (if they recognised that the fellow candidate had better skills) or too eager to show off their ability (if their partner was less able). Neither situation led to productive co-construction of

oral performance.

The oral test reflected the teaching guides and identified necessary changes for the

future. On the one hand, the speaking test was beneficial to teachers and students in that it

directed oral language education in a positive direction. The test, on the other hand, pointed

out negative impacts on teaching and learning that called for improvement from teachers,

learners, and test administrators. The washback of a test on pedagogical practices should be

understood as the impacts of testing itself that:

are only circumstantial with respect to test validity in that a poor test may be

associated with positive effects and a good test with negative effects because of other

things that are done and not done in the education system. (Messick, 1996, p. 242)

In the next section, I present limitations of the study, and my recommendations for

future research in the field of L2 testing and assessment.

9.3 Limitations of the study and recommendations for future research

This study has brought the reader a comprehensive insight into the practice of oral

assessment in the context of Vietnamese tertiary education. As an immersion into the EFL

teaching and learning community in order to understand how university students’ English-

speaking ability was assessed, I had the ambition to capture and inscribe the complex

process of testing spoken language to explain whether or not the tests were effective after

all the effort that teachers and students had invested in the processes of teaching and

learning. Besides the notable results I achieved in this research study, for many objective

and subjective reasons, there exist some limitations that call for ongoing studies in the

future.

The focus subjects were restricted to EFL majors only as most non-English majors

at Vietnamese universities did not take speaking tests. Oral assessment for non-English

majors was optional in the English syllabus and determined by the teacher-in-charge.

English examinations for these classes usually focus on grammar accuracy, vocabulary

range, and reading comprehension. My initial research design was ambitious in that the

study would include both English majors and non-English majors so that I could make a

comparison between the two groups of students in terms of assessment criteria, test tasks,

content, and the time devoted for testing speaking skills. Preliminary contacts with

Vietnamese colleagues from different institutions made me decide to limit the research

samples to English major students only as I was advised that such data collection would be

impossible because oral tests were not popular amongst non-English majors at Vietnamese

universities. This situation may be because most Vietnamese universities require their

graduates to obtain regular TOEIC certification that includes only two receptive language

skills, Listening and Reading (C. Phan, 2014; Language Link Academic, 2017). A study

with English majors would establish the necessary theoretical and practical foundations for extending oral test administration to non-English majors in the future, who account for a much larger proportion of university students in Vietnam. Language tests for EFL majors need to operate effectively because many of these students become EFL teachers after graduation, and their teaching practice is significantly shaped by their own experience of language education (C. D. Nguyen, 2017). The exclusion of oral examinations in L2

pedagogy may lead to potential “adverse washback effect” because “teachers may simply

not teach certain important skills if they are not in the test” (Weir, 2005, p. 18). Official

inclusion of oral assessment in the EFL training for non-majors is an exigent requirement

for tertiary institutions to prepare students with essential communication skills for global

integration (Vu & Nguyen, 2004; Tuong Han, 2017), and fits with the national innovation

process of assessing tertiary students’ English language towards building learners’ capacity

rather than knowledge-focused examinations (L. A. Pham, 2017). The current situation

would benefit from larger scale research in the future into the area that this study has not

explored: How are EFL testing and assessment administered for non-majors? What is the

explanation for many tertiary institutions not assessing non-English majors’ oral skills? Do

non-English major graduates’ speaking skills meet recruiters’ (or employers’) requirements

about workplace communication in English?

My study focused on only three institutions because of the limited timeframe and

financial support. Data collection in authentic test rooms was not usual in the local

conditions. The presence of an outsider observing the examiners’ and examinees’ activities

in an oral test room had never happened before. It was understandable that three other invited institutions did not give consent to participate in the research; they were worried about test security, or administrative procedures posed barriers that required a long time for approval. Collecting data of various kinds from multiple sources needed to be done within

the limited time of examination season at universities – between early December 2015 and

the middle of January 2016. It was challenging to manage time for travel and searching for

assistance with the observation and questionnaire surveys, as presented in Chapter Three.

The generalisability of the findings would have increased if I had recruited a larger number

of participants at more institutions. However, this would have necessitated earlier

arrangements with intended institutions beforehand, and more human resources to assist

with simultaneous data collection.

Statistical results of the test scores and speaking samples did not cover all the

student participants but were restricted to those who gave consent to their test scores being

used and having their speaking sessions audio-recorded. It is noted that the EFL Faculties

(or the Training Department of each institution), not the students, decided the access to the

official test scoring sheets. However, I took care to exclude the scores of those students who were not willing to let their oral scores be used, even though I was provided with full scoring sheets by the EFL Faculties. At one institution, the examination scores were circulated among class members, i.e. every student knew their classmates' scores.

Nevertheless, only the scores and speaking samples of the students whose consent I

obtained went into data analysis, in compliance with the HREC's regulations for human data

collection. In addition, about one-third of the samples were not able to be used for the study

because of poor recording quality caused by noise, or inappropriate placement of the

recorder. Statistical results about test timing and speaking sample analysis, for example,

derived from the recordings that were of at least acceptable quality when replayed on a

normal device such as a computer or a CD player. Whether audio-recording of a speaking

test event is for scoring, research, or evidence storing, I recommend it should be done with

greater consideration regarding the test room setting (Is it noisy?), position of the recorder

(Is it placed at an appropriate position for best recording without affecting candidates’

performance?), and recorder operation (Who operates the recorder during the test: the interlocutor, the assessor, or the researcher?). For more secure recording, I suggest two

recorders be used at the same time in case technical errors occur, or some speakers might

have soft voices, or be sitting too far away from one recorder.

My investigation into the test’s impact was mostly based on teachers’ and

candidates’ responses gathered from the questionnaires and interviews, not on empirical or

observational evidence, because the time frame for data collection did not permit it.

Changes and influences in educational settings do not usually take place immediately but

over time in a complicated manner out of even test designers’ control (Spolsky, 1994). It

would be interesting to explore washback effects of oral assessment on various aspects in

EFL education such as the English curriculum, teaching materials, assessment and

evaluation methods, teachers’ attitudes towards learning strategies and activities, etc.

(Cheng, 2005). For example, researchers could compare how candidates' learning activities and behaviours after the test (in the next semester or school year) differ from those before the test (in the previous semester or school year). An investigation into

reasons for using teaching materials supplementary to a course book, or even the

replacement of a course book in use, based on information from the tests, would yield more

legitimate considerations into the syllabus design of educational institutions.

My analysis attempted to explore how the rater decided a score for each speaking

task, or for each component in the rating process. However, what I could do was make

statistical comparisons of scores awarded by two raters, scores for different tasks, and sub-

scores for the assessment criteria indicated in the rating scales between the first and the

second time of scoring. I could not dwell much on how the rater perceived the candidates’

oral performances and interpreted their speaking ability with reference to descriptors in the

rating scale, if any. These judgements and calculations occurred in the rater’s mind and

depended on his/her estimation. In the authentic rating and scoring process, raters did not

make any comments or give reasons for the scores they awarded as they were not required

or did not have time to do so. As a researcher, I was not able to intervene in these

institutional rating practices. If raters’ comments on and/or reasons for the given scores

could have been collected, they would have been informative sources to learn further about

what aspects in test takers’ spoken language production most raters agreed on as well as

what the causes of discrepancies were in marking. The topic of consistency in rating and

scoring would warrant an ongoing study to compare a control group (examiners whose

performance was held in ordinary conditions) with an experimental group (other examiners

who received standardised training and adopted a refined rating scale). Data from these two groups would reveal the differences between raters trained in oral assessment and ordinary EFL teachers nominated to be raters. An experimental study would help to confirm

whether or not rater training and clearly defined rating scales are effective for achievement

tests in the current situation of EFL education in Vietnam.
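
As a simple illustration of the statistical comparisons mentioned above, the sketch below computes two basic indicators of inter-rater consistency, the mean absolute difference between two raters' marks and the Pearson correlation between them, using hypothetical paired scores rather than the scores collected in this study.

    # Hypothetical marks awarded by two raters to the same ten candidates,
    # invented for illustration; they are not the scores collected in this study.
    rater_1 = [6.5, 7.0, 5.5, 8.0, 4.5, 7.5, 6.0, 5.0, 8.5, 6.5]
    rater_2 = [6.0, 7.5, 5.0, 7.5, 5.5, 7.0, 6.5, 4.5, 8.0, 7.0]
    n = len(rater_1)

    # Mean absolute difference: how far apart, on average, the two marks are.
    mad = sum(abs(a - b) for a, b in zip(rater_1, rater_2)) / n
    print(f"Mean absolute difference = {mad:.2f} bands")

    # Pearson correlation: do the two raters rank the candidates similarly?
    mean_1, mean_2 = sum(rater_1) / n, sum(rater_2) / n
    cov = sum((a - mean_1) * (b - mean_2) for a, b in zip(rater_1, rater_2))
    var_1 = sum((a - mean_1) ** 2 for a in rater_1)
    var_2 = sum((b - mean_2) ** 2 for b in rater_2)
    print(f"Inter-rater Pearson r = {cov / (var_1 * var_2) ** 0.5:.2f}")

In practice, such indicators would be computed separately for each task and each criterion, and complemented by the raters' own comments where these are available.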

At the present time, many native English teachers come to work in cooperation

programmes at Vietnamese institutions (VTC, 2015; Hong Hanh, 2013). This condition

enables opportunities to study whether there are differences between native English raters

and Vietnamese raters when assessing and scoring English speaking samples of Vietnamese

EFL learners. Research into native English speakers’ perceptions of EFL pedagogy in

Vietnam would produce useful information to consider the ways Vietnamese teachers teach

Vietnamese students the English language in general and English-speaking skills in

particular.

My study project concentrated on validating speaking tests for EFL students in

Vietnamese universities. The experience has been a rewarding lesson that will benefit

future research associated with language testing and assessment of the other skills, namely

Writing, Listening and Reading Comprehension at different CEFR levels. Prospective

researchers could find the limitations in my study as inspiration for designing and

strengthening forthcoming associated projects with other groups of participants in other

research settings.

In the next section, I discuss the educational and professional implications drawn

from my research.

9.4 Implications

The EFL oral assessment in my study was designed to measure the English-speaking ability

of Vietnamese university students at the same educational stage: they were in the middle

of their second year after having completed an English-speaking course. The comparability

study across institutions not only contributes to a more comprehensive understanding about

the local practices of L2 assessment but also demonstrates the benefits of using a mixed

methods research design in examining multiple facets of oral language assessment. My

discussion in this section points out the implications of testing speaking skills for different

stakeholders including test administrators, test designers, raters/scorers, test takers, policy

makers, and language testing researchers. The implications are pertinent at the present time

to enhance the quality of graduate outcomes in the Vietnamese context of international

cooperation and communication.

9.4.1 Implications for speaking test administrators

Administrative settings for school-based speaking tests should be in alignment with basic

requirements for operational testing practices to facilitate candidates’ oral production

performance and minimise any physical conditions unfavourable for speaking during the

test. Uniformity in administration gives an indication of the test’s importance and

consolidates the sense of fairness in examinations.

All test takers should be apprised of the test content, tasks, assessment criteria, and

scoring rubrics at the commencement of a course (Louma, 2004; Galaczi & ffrench, 2011).

Understanding these requirements will render them more confident during the test and

enable better strategies for preparation. If each teacher-in-charge only gives short notice in

class a few days prior to the test date, information may be varied and not consistent across

speaking classes. Knowing the content (speaking topics), and tasks in advance would

motivate students to use the course book, and explore other sources to enrich their

knowledge about the topics to be assessed. Familiarity with the tasks helps candidates with

“goal-setting and monitoring” (Weir, 2005, p. 58) in language processing to perform

speaking tasks. Being informed of what kinds of tasks there are in the test should encourage

students’ class attendance given that the teacher will keep students engaged in more

speaking activities.

In the current conditions of Vietnamese tertiary institutions’ facilities, everyday

classrooms can be used for speaking tests if standardised rooms are not available. It is

important to establish a quiet testing atmosphere so that the rater can fully hear candidates’

speaking, and the candidates are not distracted by any noise. To enable raters’ complete

concentration on rating, the inclusion of an usher can help with management of candidates,

arranging them to enter the test room in an appropriate order, and making sure candidates

whose speaking sessions have finished do not linger around the exam site. If the oral test is

operating in several test rooms at the same time, there needs to be a supervisor “responsible

for overseeing the general conduct of the examination sessions” (Taylor, 2011, p. 339).

Preparation for organising a speaking test needs to ensure that “basic requirements for the

speaking test are separate rooms with adequate lighting and ventilation, and the room

should be checked for temperature and cleanliness” (Taylor, 2011, p. 340).

Test takers should receive examiners’ constructive feedback on their speaking

performance. These comments are essential for students to know what needs to be improved in their speaking skills and how to improve it. The feedback will help students

develop better language learning strategies and build up more effective test-taking

techniques. According to Louma (2004, p. 174), “effective feedback may concentrate on

weaknesses more than strengths, but it can be helpful when coupled with concrete

descriptions about better performance". A recent study demonstrated that students want not only positive feedback that motivates them in their language study, but also negative feedback that points out what they can improve in their language skills in the future (Pirhonen, 2016).

Audio-recording should be carried out so as to keep evidence of examiners’ and

candidates’ performances in the speaking test. Not only would candidates try their best in

speaking, examiners would perform with more responsibility, as they would be aware that

their voices are being recorded. Research has indicated that there might be concerns for the

potential impact of recording equipment on interaction in the test room. Once the recording

has become a normal aspect of the test setting, however, it does not interfere with

interactants’ performance (Kendon, 1979). Recordings of oral assessment need to be

retained for administrative purposes in case there are requirements for rerating (e.g.

candidates’ complaints, mistakes in score input, mutual agreement between examiners,

etc.). Audio-files of speaking test samples would be fruitful sources of data for research to

learn about authentic speaking task performance, interaction patterns or consistency in task

delivery. Secure storage and access to audio-files require specific regulations and

managerial policy (University of York, n.d.) to prevent threats to privacy in the present era

of advanced cyber technology.

The practice of speaking test administration reveals the extent to which an

educational institution is concerned about the importance of assuring the EFL training

quality. Context validity will be strengthened if all these oral language testing practices are

assured within an administrative framework used across EFL classes and institutions.

9.4.2 Implications for speaking test designers

Like testing the other language skills in educational contexts, the central concern in

assessing speaking is whether the test can achieve its prescribed goals, i.e. measuring

students’ language achievement with reference to the course objectives (Henning, 1987;

Brown, 2005). This requirement calls for more clearly stated specifications of what a test is

designed to test. Therefore, test designers need to pay greater attention to the test format

and the wording of specific questions because they determine candidates’ responses to the

speaking task. In the sense of learning-oriented assessment, candidates should be familiar

with the test format and understand exactly what the tasks require them to do. Candidates’

preferences for particular test formats (e.g. one-on-one interview, paired speaking test,

group assessment, etc.) should be considered as they would perform better in one format

but may be less comfortable with another.

The current scoring methods require more clearly defined assessment criteria and

rating scales by which assessors apply their judgement. Oral test designers should keep in

mind that uncertainty and ambiguity in employing the assessment criteria and rating scales

are sources of potential bias in the rating and scoring phase. Prompts (e.g. pictures, word

cues) used for speaking tasks should provide topical support and be within the lexical range

of the test takers at whom the test is targeted (Fulcher, 2003, p. 75). For example, the photo

of old-fashioned letters (U1.Q6) provided the test taker some hints to demonstrate certain

linguistic aspects in an extended turn when he/she talked about the topic “Nostalgia” using

the Simple Past tense (grammatical range) and related vocabulary such as “childhood”,

“hand-written letters”, “funny drawings”, etc. (lexical range). The candidate may also

mention that this form of written communication is becoming less and less popular due to

the growth of e-mail and online social networks (sociolinguistic knowledge).

A candidate is not required to respond to all the question items available in the test

material booklet. It is important that differences in the degree of difficulty of test items be

kept to a minimum so that candidates’ oral performance is not affected by good (or bad)

luck during the test, and neither therefore, are the scores they are awarded. Variations in the

linguistic input (length and complexity), the channel of task delivery (visual or audio), and

the candidate’s familiarity with certain speaking topics may contribute to the hardship level

of the topic assigned to him/her.

9.4.3. Implications for oral test raters and scorers

Scoring speaking performances is the most complicated part of the assessment process because human raters, even in the same testing session, do not have the same perception of the same candidate's speaking performance. Thus, a sound understanding of the test

purpose and descriptors in the rating scale is very important so that consistency in scoring

can be assured across raters. To achieve consistent judgements when interpreting the

candidate’s oral performance with reference to the descriptors, Vietnamese raters need

training as a key element to enhance inter-rater reliability in assessing speaking (Stahl &

Lunz, 1991). Training can help raters clarify understanding of the assessment criteria, and

reduce random bias in their rating (Weigle, 1998). One of the institutions involved in this

study paired a novice rater with a trained rater so that the experienced rater could coach the

novice one. Double rating is a way to avoid the possible influence that a single rater may

have on score awarding because the severity (or leniency) of a rater’s scoring cannot be

totally eliminated in spite of training (Lumley & McNamara, 1995; Weigle, 1998). Most

studies about reliability in assessment and evaluation support the use of at least two

independent raters to enhance scoring consistency (Wang, 2009; Partington, 1994).

Rating scales need to be reconstructed with more rater-friendly descriptors and take

into consideration the constructs being assessed. Examiners with different roles (e.g.

interlocutor and assessor) in the rating process should use different rating scales. A holistic

scale is for the interlocutor’s use, and an analytic scale for the assessor. The combination of

both scales for two examiners is reasonable in that the interlocutor is the one having direct

contact with candidates when delivering tasks and maintaining the speaking test sessions.

The interlocutor can perceive the overall performance better than the assessor but cannot

pay sufficient attention to specific features in the test taker's oral performance. Conversely,

the assessor observes the conversation between the interlocutor and candidates and can

make better, more detailed judgement when using an analytic scale. Information from the

analytic scale can be used to provide candidates with feedback after the test. Whatever type

of scale is intended to be used, calibration of the scale on a collection of multi-level

samples of genuine oral performances is needed to ensure the scale works well before it is

put into any future use (Hughes, 2003, p. 106).

Vietnamese raters should have appropriate training in oral assessment to ensure that

they understand the test purpose and descriptors in the rating scale before embarking on

scoring. It is recommended that speaking tests be audio-recorded to retain evidence of

candidates’ and raters’ performance as is done for tests of other subjects in which

performance is recorded on paper. Audio-recording is feasible in the current condition of

universities in Vietnam. Once audio-recording has become normal in oral test rooms, it will

not cause any anxiety to candidates, but will contribute to both raters' and candidates' performance because they will understand that their speaking is being recorded. Further, these

recordings can be used as a rich data source for institutional research on ESL/EFL speech

patterns. However, before oral test audio-recording can be put into practice, clear regulations need to be issued on the conduct of recording and on the storage and use of audio files, with respect to research ethics concerning access to information that might affect speakers' privacy and institutional reputation.

9.4.4 Implications for oral test takers

Results from my study indicate that candidates who participated in the speaking exam had

quite good preparation for what was to be tested, but an unclear understanding about how

their oral skills would be rated. Knowledge of the predetermined assessment criteria is as

important as exam preparedness so that candidates can have effective strategies for their

task-based performances. Candidates need to know that oral production in a testing context

is not like oral practice in Speaking lessons. Within a few limited moments, candidates' speaking demonstrations have to meet the rater's requirements for specified oral ability, i.e. the criteria for assessment and the constructs being measured.

Evidence from test room observations and test raters’ opinions revealed that

confidence is crucial to performing well in spoken English. Confidence helps test takers to be

more attentive and productive in generating and constructing ideas for speaking. Staying

confident during the test contributes to test takers’ fluency and coherence, two important

assessment criteria that raters pay attention to in oral rating. Lack of confidence, or

increased test anxiety, does not give the rater the impression of a strong communicator, and hence affects the rater's judgement and scoring.

Test takers need to be self-equipped with test-taking techniques that are not

mentioned in any Speaking course books. Even though my research participants were

English majors, there were test room situations in which oral performance breakdowns

occurred because candidates did not react promptly and appropriately. For example, test

takers need to know what kinds of questions they are allowed to (or should) ask the

examiner in the test room; what they should do if they cannot understand the examiner’s

questions; how they can cooperate with the candidate partner to co-construct their

performance in a paired speaking test; how they manage time to complete a discussion task

using various interactional skills, and so on.

9.4.5 Implications for educational policy makers

My study results highlight the variety of oral testing methods across tertiary institutions in Vietnam. Some assessment practices lack the necessary quality and therefore undermine the quality of EFL teaching and learning. Inconsistent quality of language testing across

institutions in general, and oral assessment in particular, will continue if there are not

sufficient and effective policies for test operational practices from the MOET. The

promulgation of the CEFR-based evaluation framework for assessing L2 proficiency in

Vietnam (MOET, 2014) is a preliminary approach to enhancing and standardising the

national quality of EFL education. However, the CEFR's objective "is not that of providing a ready-

made, universal solution to the issues related to assessment” (Piccardo, 2012, pp. 51-52).

In order for the CEFR-based assessment and accreditation to be effective in local

institutions, more detailed guidelines are needed in terms of course book selection,

weighting of the speaking skills in the curriculum, speaking task construction, criteria for

assessment, development of the rating scale, rater training, physical setting for oral

assessment, etc. Each of these elements has the potential to influence all the others in the

operational process of oral assessment.

On the basis of the MOET’s general policies, Vietnamese universities make their

own decisions in teaching material selection and learning outcome evaluation. These

decisions rely on the institutional training objectives, the university students’ general level

of English, and the availability of human resources. A shortage of educational quality management might result in careless test administration, irresponsibility in judgement, bias in ability measurement, and unfairness in educational decisions. It is

crucial for institutional policy makers to establish official regulations on the benchmarks of oral language competence to be achieved at each level, the number of raters involved in oral assessment, the quality of rater training, the response formats required for assessment with reference to the overall direction of L2 training, etc. Well-conceived guidelines allow

assessment practitioners to make sound decisions between the complex dichotomies of

what is right and wrong, fair and unfair, effective and ineffective in particular education

contexts.

Innovations in teaching materials and methodology cannot be separated from

reformations in testing and assessment. Oral assessment should be required as a regular

school-based practice when Vietnamese students start learning English. Learning English

for several years without oral testing, from secondary through to high school, might contribute to students' speaking test anxiety, reduce their motivation to learn oral skills, and weaken their awareness of the importance of speaking in real-life communication. Oral assessment is currently

postponed until students enter university, and applied mostly to EFL majors – the minority

of university students. Concerns about objectivity and transparency in scoring have been

the primary barrier preventing practitioners from taking up the challenge of implementing oral assessment in Vietnamese high-stakes examinations. Policy requirements are needed to promote a process of awareness building and operational refinement, and eventually to enhance the quality of the whole system.

9.4.6 Implications for language testing researchers

Mixed methods research on language testing and assessment helps to achieve a better

evidence-based understanding of the topic under study. As a subfield of research and

practice in Applied Linguistics, language testing allows a variety of methodological tools

for data collection including surveys, interviews, prompted responses, verbal reports, etc.

(Gass & Mackey, 2007). I processed the qualitative strand and quantitative strand of this

study using different methods and approaches. My combination of the two methodologies

in result analysis “together yield[ed] better, richer information” to answer “issue-focused”

and “hybrid research questions” (Guetterman & Salamoura, 2016, p. 158). My research

questions required an integration of data from different sources to address issues in testing

and assessment. For example, I examined the convergent results emerging from my

comparison of the quantitative results from the questionnaire surveys on examiners and

examinees regarding test administration (Research Question 1a) with the qualitative

findings from test room observation, and face-to-face interviews with the research

participants about the same topic. To examine whether the test contents matched the course

objectives (Research Question 1b), my analysis was not merely based on the documents of

the contents employed in the oral test, but also combined with EFL content experts’

judgements and candidates’ actual performances of test tasks developed on the contents.

Researchers utilising a mixed methods research design need a combination of

“insider and outsider perspectives through different strands of their study” (Vidaković &

Robinson, 2016, p. 181) to support conclusions about the issue investigated. I took on the

role of an observer (outsider) for gathering questionnaire and observational data, while I

explored the perspectives of testers and test takers (insiders) at the institutions. My study

integrated the voices of these participants to enhance understanding and practices which

can benefit the community of EFL teachers and learners involved in oral assessment. My

adoption of mixed methods added practical value to examining different stakeholders’

awareness and behaviours when they did not have many opportunities to express themselves

and listen to each other. The examiner-candidate relationship was blurred when a test was

administered only for candidates to complete their test performance as required, and for

examiners to fulfil their duty of rating and submitting scores to the test administrator. My

study created a sense of openness to an assessment context in which the raters’

expectations, test takers’ feedback, experts’ evaluation, etc. were synthesised to find out

possible solutions to the practical problems in Vietnam’s higher education measurement of

quality.

9.5 Conclusions

I have examined the operation of oral assessment for EFL majors at three institutions in a

southern city of Vietnam. I collected data for this study from multiple sources including test

raters, candidates, content experts, and documents. I adopted a convergent mixed methods

approach in data collection and analysis. The study concentrated on validating the oral test

in terms of test administration, content, scoring, and impacts on teaching and learning.

Testing and assessing oral skills is a frequent practice for English majors every

semester at all the universities involved. The oral test under study was a kind of

achievement test administered at the end of an instructional course. Each institution

designed their own teaching syllabus and applied assessment methods that suited their

needs. A cross-institutional comparison of these oral testing methods demonstrated several

positive signs in assessing speaking skills which were aligned with the literature discussed

in the field. First, scoring performed by pairs of raters (Universities A and B) helped to

reduce bias in judgement and made the test results more reliable. Face validity increased

when most of the raters and the candidates showed a positive perception of paired rating.

Second, the discussion task (Universities A and B) enabled the testing of authentic

interactional skills (linguistic performance) in addition to knowing about the language

(linguistic competence) and had a positive washback effect on teaching and learning in that

teachers organised communicative activities, e.g. pair work and group work, in EFL

speaking classes to prepare students for the end-of-term test. Third, assigning one of the

paired examiners to be an interlocutor (University B) helped to ensure consistency in

speaking task delivery and marking when there was a clear division of the interlocutor’s

and the assessor’s roles in a test event. The use of an interlocutor outline helped to maintain

uniformity and fairness in impromptu task delivery to different candidates from beginning

to end. Fourth, the inclusion of an usher (University B) proved to be helpful in maintaining

the quietness of the examination site and managing candidates’ turns to enter the test

rooms. It would be distracting for raters to hold all the responsibilities of marking, calling

candidates, checking candidates’ identity, and keeping candidates in the waiting lounge in

good order. Finally, the oral test represented a quite satisfactory sample of the contents

taken from the syllabus (Universities A, B and C) as achievement testing should do. That

the test contents relied on what students had learnt during the course motivated students’

class attendance and use of the course book to prepare for the test. However, letting

candidates know all the questions before the test would encourage them to learn answers by heart and recite prepared scripts during task performance. Speaking from memorisation reduced test takers' creativity and their ability to react promptly in real-life verbal communication.

Although all three institutions delivered training programmes for the same degree (a B.A.), there were significant variations in their testing methods regarding test tasks, rating scales, and examiner-examinee acquaintanceship. The design of

elicitation techniques which enabled candidates to perform their speaking ability via oral

production tasks varied from extensive oral presentation (monologue) to transactional and

interpersonal exchanges between the examiner/interlocutor and the candidate, or between

two candidates. However, the rating scales were either too general, lacking sufficient descriptors of the typical features of a particular band score; did not clearly define which descriptors were used for scoring each task; or did not specify the differences between adjacent scores at 0.5 or 1.0 band increments, i.e. how a 4.0, a 4.5, and a 5.0 describe different performances. This distinction is especially important when the cut-off score is 5.0 and decides the candidate's Pass/Fail status, as in the scoring scheme applied at University A.

My results suggest the need for more clearly-defined specifications in oral test

construction and development. Overly simple descriptors in the rating scales may make the scoring process easier but less accurate. They are less stressful for the rater in marking but may lead to measurement error and unfairness in assessment. Test

scores do not indicate which type of speaking task a candidate performed better on, or which aspects of the assessment criteria a candidate was stronger or weaker in.

Lack of feedback on candidates’ achievement and weaknesses after testing made student

candidates perceive the oral examination, and many similar tests, as a duty to fulfil rather

than an opportunity to perform spoken English confidently and receive constructive

feedback from EFL teacher raters to improve their speaking skills. This shortcoming in test

administration contributed to greater oral test anxiety among candidates, who felt that their English was being assessed in face-to-face communication with examiners.

My study revealed institutional obstacles that hinder the effective operation of

language assessment. These obstacles included the large class sizes that did not allow the

rater to elicit adequate speaking samples from each candidate or to ensure equal timing for

every candidate. There was a lack of rater training, or at least of an internal meeting for oral raters prior to the test, through which all the raters at the same institution could obtain a thorough understanding of the assessment criteria and reach communal agreement on the descriptors in the rating scale. Inadequate rater consensus resulted in observed inconsistency in rater behaviour and task delivery, and in potential inconsistency in using the rating scales to decide marks, which could not be observed in this study.

Test scores did not give a clear reflection of students’ speaking ability to motivate

them in their study of English. Students did not benefit from score reports to learn from

their deficiencies to improve oral skills. These obstacles could be overcome once there is a

full recognition of the role of language testing and assessment. Changes need time and

effective cooperation of all stakeholders involved in the testing process. Language

assessment is not merely a popular device for measuring academic achievement; it also provides usable information for innovations in language teaching and learning.

My study project was coming into its final stage of completion when Vietnam

issued a refreshed policy to adjust the strategies in EFL education for the 2017-2025 period

(Vietnamese Government, 2017) emphasising the importance of speaking skills and the role

of quality assurance in foreign language pedagogy. The policy has created necessary

impetus for practising and researching a field of social science that is quite young in

Vietnam – language testing and assessment. This policy, on the other hand, presented

language testers and researchers with many challenges in meeting the increasingly demanding requirements of language learners and in keeping pace with global trends in second language testing in general, and in oral assessment in particular.

Within this scope of investigation, my study has taken the very first steps into

assessing EFL speaking skills of university students in Vietnam. As a researcher, I served as a bridge representing the voices of those who administered the language tests (examiners) and those who took them (examinees). Via this study I have learnt that

examiners had many valuable ideas to discuss with colleagues regarding their beliefs and

practices in language testing. Candidates proposed various suggestions on their school-

based oral assessment. I brought these opinions together and presented them in the light of

the literature and previous research to enhance mutual understanding amongst stakeholders,

which helps to identify existing problems and bring about resolutions. For example, oral

raters expected candidates to speak with confidence and express their own stance towards

an issue in an interactional discourse rather than taking turns to contribute a separate talk

about different aspects of the issue in a discussion task. Candidates expected to receive

raters’ feedback on the test performance to improve their speaking skills whereas test

administrators and raters did not think it important or regard it as an essential procedure for

an educational test.

Oral assessment is more complex than testing other language skills or language

components (e.g. reading comprehension, grammar, vocabulary) in that subjectivity is

inevitable in human raters’ judgements. This is the reason why assessing speaking skills

requires well-trained raters to minimise measurement bias. Direct oral rating is even more

challenging than rating creative writing because the rater cannot rewind a candidate's oral performance, whereas assessing writing allows the rater to reread a candidate's written script. Oral test validity and reliability depend not only on raters but also on many

other factors, from test design to test administration and scoring. L2 pedagogy in Vietnam

needs the collaboration of language teachers, test writers, raters, policy makers and

researchers to establish more innovative testing methods which will meet strategic goals in

EFL education, bringing Vietnamese evaluations closer to the current world trends in

language testing and assessment.

REFERENCES

Alpha History. (2018). Europeans in Vietnam. Retrieved from

https://alphahistory.com/vietnamwar/europeans-in-vietnam/

Alderson, J. C. (1998). Developments in language testing and assessment, with specific reference to

information and technology. Forum for Modern Language Studies, 34(2), 195-206.

Alderson, J. C. (1991). Language testing in the 1990s: How far have we come? How much further have we to

go? In S. Anivan (Ed.), Current developments in language testing (Vol. 25, pp. 1–26). Singapore:

SEAMEO Regional Language Centre.

Alderson, J. C., & Wall, D. (1993). Does washback exist? Applied Linguistics, 14, 115-129.

Alderson, J. C., & Banerjee, J. (2002). Language testing and assessment. Language Teaching (Part 2), 35, 79-

113.

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge:

Cambridge University Press.

Al-Hebaish, S. (2012). The correlation between general self-confidence and academic achievement in the oral

presentation course. Theory and Practice in Language Studies, 2(1), 60-65. Retrieved from

https://search.proquest.com/docview/1348130463?accountid=10499

Alshahrani, A. (2008). Rapid Profile as an alternative ESL placement test. Annual Review of Education,

Communication & Language Sciences, 5, 29-50.

American Educational and Research Association/American Psychological Association/National Council for

Measurement in Education. (1999). Standards for educational and psychological testing. Washington,

DC: American Educational and Research Association/American Psychological Association/National

Council for Measurement in Education.

Andrews, S., Fullilove, J., & Wong, Y. (2002). Targeting washback—a case-study. System, 30(2), 207-223.

Angiolillo, P. (1947). Armed forces foreign language teaching. New York: Vanni.

Anh Tu. (2013). Sinh viên thiếu khả năng thực hành tiếng Anh [Students lack ability to practise English].

Retrieved from http://www.baomoi.com/Sinh-vien-thieu-kha-nang-thuc-hanh-tieng-

Anh/59/12342188.epi

Anh Thu. (2017). Chương trình tiếng Anh hệ 10 năm tại Hà Tĩnh triển khai chậm, chưa đồng bộ [Slow and

unsynchronous initiation of the 10-year English programme in Ha Tinh]. Retrieved from

https://baomoi.com/chuong-trinh-tieng-anh-he-10-nam-tai-ha-tinh-trien-khai-cham-chua-dong-

bo/c/23146350.epi

Anh Minh. (2018). Nguy cơ ô nhiễm ở hai dự án bô-xít Tây Nguyên [Potential risks for environmental

pollution from two bauxite projects in Central Land of Vietnam]. Retrieved from

https://kinhdoanh.vnexpress.net/tin-tuc/vi-mo/nguy-co-o-nhiem-o-hai-du-an-bo-xit-tay-nguyen-

3717643.html

Antecol, H., Cobb-Clark, D. A., & Trejo, S. J. (2004). Selective immigration policy in Australia, Canada, and the United States. Brussels Economic Review, 47(1), 57-76.

Atkinson, P. (1992). Understanding ethnographic texts (Vol. 25). Newbury Park, CA: Sage Publications, Inc.

Atkinson, P. A. & Coffey, A. (1997). Analysing documentary realities. In D. Silverman (Ed.), Qualitative

research: Theory, method and practice (pp. 45–62). London: Sage.

Atkinson, P., & Coffey, A. (2004). Analysing documentary realities. In D. Silverman (Ed.), Qualitative

research: Theory, method and practice (2nd ed., pp. 56-75). London: SAGE Publications Inc.

Australian Council for Educational Research. (n.d.). Progressive achievement. Retrieved August 2, 2018 from

https://www.acer.org/pat

Babauta, L. (2012). Life as a conscious practice. Retrieved from https://zenhabits.net/conscious/

Bachman, L. F. (1990). Fundamental considerations in language testing. New York: Oxford University Press.

Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge University

Press.

Bachman, L.F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful

language tests. Hong Kong: Oxford University Press.

Bachman, L. F., & Palmer, A. (2010). Language assessment in practice: Developing language assessments

and justifying their use in the real world. Oxford: Oxford University Press.

Bailey, K. M., & Curtis, A. (2015). Learning about language assessment: Dilemmas, decisions, and

directions. USA: National Geographic Learning.

Banciu, V., & Jireghie, A. (2012). Communicative language teaching. The Public Administration and Social

Policies Review, 1(8), 94-98.

Bao, D. (2013). Developing materials for the speaking skill. In B. Tomlinson (Ed.), Developing materials for

language teaching (2nd ed., pp. 407-428). New York: Bloomsbury Academic.

Barnwell, D. P. (1996). A history of foreign language testing in the United States: From its beginnings to the

present. USA: Bilingual Press.

Bazeley, P. (2013). Qualitative data analysis: Practical strategies. London: Sage Publications Ltd.

Be, N., & Crabbe, D. (1999). The design and use of English language textbooks in Vietnamese secondary

schools. In Fourth International Conference on Language and Development. Hanoi, Vietnam.

Retrieved from http://www.languages.ait.ac.th/hanoi_proceedings/crabbe.htm.

Bell, A., Brenier, J., Gregory, M., Girand, C., & Jurafsky, D. (2009). Predictability effects on durations of

content and function words in conversational English. Journal of Memory and Language, 60(1), 92–

111.

Berkoff, N. A. (1985). Testing oral proficiency: A new approach. In E. P. Lee (Ed.), New directions in

language testing (pp. 93-100). Oxford: Pergamon Institute of English.

Bianco, J. L. (2001). Viet Nam: Quoc Ngu, colonialism and language policy. Language planning and

language policy: East Asian perspectives, 159-206.

Black, T. R. (2005). Doing quantitative research in the social sciences: An integrated approach to research

design, measurement and statistics. London: SAGE Publications Ltd.

Bøhn, H. (2015). Assessing spoken EFL without a common rating scale: Norwegian EFL teachers’

conceptions of construct. Sage Open, 5(4), 2158244015621956.

Booth, W. C., Colomb, G. G., & Williams, J. M. (2008). The craft of research. Chicago, Il: The University of

Chicago Press.

Bowen, G. A. (2009). Document analysis as a qualitative research method. Qualitative Research

Journal, 9(2), 27-40.

Brereton, J. L. (1944). The case for examinations: An account of their place in education with some proposals

for their reform. Cambridge: Cambridge University Press.

Brown, J. D. (2000). Questions and answers about language testing statistics: What is construct validity?

JALT Testing and Evaluation SIG Newsletter, 4(2), 7-10.

Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing,

20(1), 1-25.

Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English language assessment.

New York: McGraw-Hill ESL/ELT.

Brown, J. D., & Hudson, T. (2002). Criterion-referenced language testing. Cambridge: Cambridge University

Press.

Brown, H. D. (2004). Language assessment: Principles and classroom practices. New York: Pearson

Education, Inc.

Brown, H. D., & Abeywickrama, P. (2010). Language assessment: Principles and classroom practices. White

Plains, NY: Pearson Longman.

Bygate, M. (1987). Speaking. Oxford: Oxford University Press.

Bryman, A. (2006). Integrating quantitative and qualitative research: How is it done? Qualitative Research,

6(1), 97-113.

Bryman, A. (2012). Social research methods (4th ed.). New York: Oxford University Press Inc.

Buck, G. (2001). Assessing listening. Cambridge: Cambridge University Press.

Bui, T. M. H. (2006). Teaching speaking skills at a Vietnamese university and recommendations for using

CMC. Asian AFL Journal, 14(2).

Burrows, C. (2004). Washback in classroom-based assessment: A study of the washback effect in the

Australian adult migrant English program. In L. Cheng & Y. Watanabe (Eds.), Washback in

language. Marwah, NJ: Lawrence Erlbaum Associates Inc.

Carroll, B. (1980). Testing communicative performance. Oxford: Pergamon.

Carroll, J. B. (1983). Psychometric theory and language testing. In J. W. Oller, Jr. (Ed.), Issues in language testing research (pp. 80-107). Rowley, MA: Newbury House.

Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge: Cambridge

University Press.

Cassady, J. C., & Johnson, R. E. (2002). Cognitive test anxiety and academic performance. Contemporary

educational psychology, 27(2), 270-295.

Cegala, D. J., & Waldron, V. R. (1992). A study of the relationship between communicative performance and

conversation participants’ thoughts. Communication Studies, 43(2), 105-123.

Celik, O., & Yavuz, F. (2015). The relationship between speaking grades and listening grades of university

level preparatory students. Procedia-Social and Behavioral Sciences, 197, 2137-2140.

Chapelle, C. A. (1999). Validity in language assessment. Annual Review of Applied Linguistics, 19, 254-272.

Chapelle, C. A., & Douglas, D. (2006). Assessing language through computer technology. Cambridge:

Cambridge University Press.

Chapelle, C. A. (2008). Utilizing technology in language assessment. In E. Shohamy & N. H. Hornberger (Eds.), Language testing and assessment (2nd ed., Vol. 7, pp. 23-134). Netherlands: Springer Science+Business Media, LLC.

Cheng, L. (1997). How does washback influence teaching? Implications for Hong Kong. Language and

Education, 11(1), 38-54.

Cheng, L. (2005). Changing language teaching through language testing: A washback study. In M. Milanovic

& C. Weir (Series Eds.), Studies in language testing Vol. 21. Cambridge: Cambridge University

Press.

Cheng, L., Watanabe, Y., & Curtis, A. (2004). Washback in language testing: Research contexts and

methods. Marwah, NJ: Lawrence Erlbaum Associates Inc.

Chi An. (2015). 58 percent of married women in Vietnam are victims of domestic violence: Report. Retrieved

from http://www.thanhniennews.com/society/58-percent-of-married-women-in-vietnam-are-victims-

of-domestic-violence-report-54062.html

Chi Hieu., & AnhVu. (2016). Gây cá chết hàng loạt ở biển miền Trung: Formosa nhận lỗi [Causing mass

death of fish on Central coast: Formosa takes responsibility]. Retrieved from

https://thanhnien.vn/thoi-su/gay-ca-chet-hang-loat-o-bien-mien-trung-formosa-nhan-loi-718748.html

Chinda, B. (2009). Professional development in language testing and assessment: A case study of supporting

change in assessment practice in in-service EFL teachers in Thailand. Unpublished PhD Thesis, The

University of Nottingham, UK.

Chuang, Y. Y. (2009). A study of college EFL students’ affective reactions and attitudes toward types of

performance-based oral tests. Journal of Educational Research, 43(2), 55-80.

Clark, H. (2016). Mass fish deaths in Vietnam highlight the country’s press freedom problem. Retrieved from

https://www.huffingtonpost.com/helen-clark1/fish-deaths-vietnam-press-freedom_b_10744496.html

Clark, H. H. (1979). Responding to indirect speech acts. Cognitive psychology, 11(4), 430-477.

Clark, J. (1989). Multipurpose language tests: Is a conceptual and operational synthesis possible? Language

teaching, testing, and technology. Washington, D.C.: Georgetown University Press.

Clark, J. L. D., & Clifford, R. T. (1988). The FSI/ILR/ACTFL proficiency scales and testing techniques:

Development, current status, and needed research. Studies in second language acquisition 10, 129-

147.

Clayman, S. E., & Gill, V. T. (2009). Conversation analysis. In M. Hardy & A. Bryman (Eds.). The handbook

of data analysis (pp. 589-606). London: Sage Publications Ltd.

Coffey, A., & Atkinson, P. (1996). Making sense of qualitative data: Complementary research strategies.

Thousand Oaks, CA: Sage, 26-53.

Cohen, A. D. (1994). Assessing language ability in the classroom (2nd ed.). Boston, MA: Heinle & Heinle

Publishers.

Cohen, L., Manion, L., & Morrison, K. (2007). Research methods in education (6th ed.). New York:

Routledge.

Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching,

assessment. Cambridge: Cambridge University Press.

Council of Europe. (2011a). Self-assessment grid - Table 2 (CEFR 3.3): Common Reference levels. Retrieved

from https://rm.coe.int/CoERMPublicCommonSearchServices/DisplayDCTMContent?

documentId=090000168045bb52

Creswell, J. W. (2003). Research design: Qualitative, quantitative, and mixed methods (2nd ed.). Thousand

Oaks, CA: Sage Publications, Inc.

Creswell, J. W. (2005). Educational research: Planning, conducting, and evaluating quantitative and

qualitative research (2nd ed.). Upper Saddle River, NJ: Pearson.

Creswell, J. W. (2012). Educational research: Planning, conducting, and evaluating quantitative and

qualitative research (4th ed.). Boston, MA: Pearson Education, Inc.

Creswell, J. W., & Clark, V. L. P. (2011). Designing and conducting mixed methods research (2nd ed).

Thousand Oaks, CA: SAGE Publications, Inc.

Creswell, J. W., & Zhou, Y. (2016). What is mixed methods research? In A. J. Moeller, J. W. Creswell, & N.

Saville (Eds.), Second language assessment and mixed methods research. Cambridge: Cambridge

University Press.

Cumming, A., & Berwick, R. (Eds.) (1996). Validation in language testing. Great Britain: The Cromwell

Press.

Dalby, S. M., Rubenstone, S., & Weir, E. H. (1996). The international student's guide to going to college in

America: How to choose colleges and universities in the United States, how to apply, how to fit in.

New York: Arco.

Dancey, C. P., & Reidy, J. (2007). Statistics without maths for psychology. Harlow, UK: Prentice Hall.

Dandonoli, P., & Henning, G. (1990). An investigation of the construct validity of the ACTFL Proficiency

guidelines and oral interview procedure. Foreign Language Annals, 23(1), 11-22.

Dang, Q. A. (2009). Recent higher education reform in Vietnam: The role of the World Bank. Retrieved from

www.dpu.dk/…/om-dpu_institutter_institut-for-paedagogik.

Dang, T. H. (2004). ELT at tertiary level in Vietnam: Historical overview and assessment of current policies

and practices. Doctoral dissertation, La Trobe University, Australia.

Dang, V. H. (2006). Learner-centeredness and EFL instruction in Vietnam: A case study. International

Education Journal, 7(4), 598-610.

Dang Khoa. (2018). Bạo lực gia tăng ở V-League 2018 [Violence is increasing in 2018 V-League]. Retrieved

from https://thethao.thanhnien.vn/bong-da-viet-nam/bao-luc-gia-tang-o-vleague-2018-85475.html


Davies, A. (Ed.). (1968). Language testing symposium: A psycholinguistic approach. Oxford: Oxford

University Press.

Davies, A. (1990). Principles of language testing. Oxford: T. J. Press Ltd.

Davis, L. L. (1992). Instrument review: Getting the most from a panel of experts. Applied Nursing Research,

5(4), 194-197.

Decker, W. C. (1925). Oral and aural tests as integral parts of the Regents examination. Modern Language

Journal, 10, 369-371.

Delgado-Rico, E., Carretero-Dios, H., & Ruch, W. (2012). Content validity evidences in test development: An

applied perspective. International journal of clinical and health psychology, 12(3), 449.

de Vaus, D. A. (2002). Surveys in social research (5th ed.). Crows Nest, Australia: Allen & Unwin.

Dieu Linh. (2018). Quy chế tuyển sinh đại học 2018: Những điểm mới cần biết [2018 regulations for

university admission: New terms to know]. Retrieved from https://baomoi.com/quy-che-tuyen-sinh-

dai-hoc-2018-nhung-diem-moi-can-biet/c/26691301.epi

Distinguishing English programmes in secondary education. (2018, January 10). Retrieved from

http://cth.edu.vn/phan-biet-cac-chuong-trinh-tieng-anh-pho-thong-he-3-nam-he-7-nam-va-he-10-

nam/

Do, H. A. (n.d.). Hội Truyền bá chữ Quốc ngữ [Association for Quoc ngu Dissemination]. Retrieved from

http://www.archives.gov.vn/Pages/Tin%20chi%20tiết.aspx?itemid=196&listId=c2d480fb-e285-

4961-b9cd-b018b58b22d0&ws=content

Do, H. T. (2006). The role of English in Vietnam’s foreign language policy: A brief history. Paper presented

at the 19th annual EA Education conference. Retrieved from http://www.worldwide.rs/role-english-

vietnams-foreign-language-policy-brief-history/

Donato, R., & McCormick, D. (1994). A sociocultural perspective on language learning strategies: The role of

mediation. The Modern Language Journal, 78(4), 453-464.

Douglas, D. (1997). Testing speaking ability in academic contexts: Theoretical considerations. Princeton, NJ:

Educational Testing Service.

Doyle, W. (1983). Academic work. Review of Educational Research, 53(2), 159-199.

Dörnyei, Z. (2003). Questionnaires in second language research: Construction, administration, and

processing. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Dörnyei, Z., & Taguchi, T. (2010). Questionnaires in second language research: Construction,

administration, and processing. Marceline, MO: Walsworth Publishing Company.

Dörnyei, Z. (2007). Research methods in applied linguistics. Oxford: Oxford University Press.


Ducasse, A. M., & Brown, A. (2009). Assessing paired orals: Raters' orientation to interaction. Language

Testing, 26(3), 423-443.

Early, P., & Swanson, P. B. (2008). Technology for oral assessment. In C. M. Cherry & C. Wilkerson (Eds.),

Dimension (pp. 39-48). Valdosta, GA: SCOLT Publications.

Edumax. (2008). Chuẩn Cambridge – Khung tham chiếu Châu Âu [Cambridge standard – The European

framework for reference]. Retrieved from https://edumax.edu.vn/khung-tham-chieu-chau-au-va-cac-

quy-doi.html

Edutopia. (2008). Why is assessment important? Retrieved from https://www.edutopia.org/assessment-guide-

importance

Elder, C., & Wigglesworth, G. (1996). Perspectives on the testing cycle: Setting the scene. Australian Review

of Applied Linguistics Series S, 13, 1-12.

Ellis, R., & Yuan, F. (2005). The effects of careful within-task planning on oral and written task performance.

In R. Ellis (Ed.), Planning and task performance in a second language (pp. 193–218). Amsterdam:

John Benjamins.

English Profile (2015). Why is the CEFR useful for teachers and learners? Retrieved from

http://www.englishprofile.org/the-cefr/cefr-for-teachers-learners

Everson, P., & Hines, S. (2010). How ETS scores the TOEIC® speaking and writing responses. Compendium

study. Retrieved from https://www.ets.org/Media/Research/pdf/TC-10-08.pdf

Fall, T., Adair-Hauck, B., & Glisan, E. (2007). Assessing students' oral proficiency: A case for online testing.

Foreign Language Annals, 40(3), 377-406.

Farhady, H. (2012). Principles of language assessment. In C. Coombe, P. Davidson, B. O'Sullivan, & S.

Stoynoff (Eds.), The Cambridge guide to second language assessment. Cambridge: Cambridge

University Press.

Fisher, M. (1981). To pass or not to pass: An investigation into oral examination procedures. Spoken English,

14, 109-116.

Fitzpatrick, A. R. (1983). The meaning of content validity. Applied psychological measurement, 7(1), 3-13.

Flick, U. (2002). An introduction to qualitative research. London: Sage Publications Ltd.

Fleming, G. (2017). Preparing for an oral exam. Retrieved from https://www.thoughtco.com/preparing-for-

an-oral-exam-1857439

Fox, T. C. (2003). Vietnam's 400-year Catholic history. Retrieved from

http://www.nationalcatholicreporter.org/todaystake/tt042403.htm


Francis, J. C. (1981). The reliability of two methods of marking oral tests in modern language examinations.

British Journal of Language Teaching, 19, 15-23.

Fries, C. C. (1952). The structure of English: An introduction to the construction of English sentences. New

York: Harcourt Brace.

Fulcher, G. (1996). Does thick description lead to smart tests? A data-based approach to rating scale

construction. Language Testing, 13(2), 208-238.

Fulcher, G. (1997a). An English language placement test: Issues in reliability and validity. Language

testing, 14(2), 113-139.

Fulcher, G. (1997b). The testing of speaking in a second language. In C. Clapham & D. Corson (Eds.),

Encyclopedia of language and education, Vol. 7: Language testing and assessment (pp. 75-85).

Amsterdam: Kluwer Academic Publishers.

Fulcher, G. (1999). Assessment in English for academic purposes: Putting content validity in its place.

Applied Linguistics, 20(2), 221-236.

Fulcher, G. (2003). Testing second language speaking. London: Longman Pearson Education.

Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book.

Abingdon, UK: Routledge.

Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development for speaking tests:

Performance decision trees. Language Testing, 28(1), 5-29.

Fulcher, G., & Reiter, R. M. (2003). Task difficulty in speaking tests. Language Testing, 20(3), 321-344.

Frazier, L., & Leeming, S. (2003). Lecture Ready 3: Strategies for academic listening, note-taking and

discussion. Oxford: Oxford University Press.

Galaczi, E., & ffrench, A. (2011). Context validity. In M. Milanovic & C. Weir (Series Eds.) & L. Taylor

(Vol. Ed.), Studies in Language Testing: Vol. 30. Examining speaking: Research and practice in

assessing second language speaking. Cambridge: Cambridge University Press.

Gan, Z. (2010). Interaction in group oral assessment: A case study of higher-and lower-scoring

students. Language Testing, 27(4), 585-602.

Garrison, C., & Ehringhaus, M. (2007). Formative and summative assessments in the classroom. Retrieved

from http://ccti.colfinder.org/sites/default/files/formative_and_summative_assessment_in_

the_classroom.pdf

Gass, S. M., & Mackey, A. (2007). Input, interaction, and output in second language acquisition. In B.

VanPatten & J. Williams (Eds.), Theories in second language acquisition: An introduction (pp.175-

199). Mahwah, NJ: Lawrence Erlbaum.


Gillham, B. (2008). Developing a questionnaire (2nd ed.). London: Continuum.

Gipps, C., & Stobart, G. (2009). Fairness in assessment. Educational assessment in the 21st century, 105-118.

Giri, R. A. (2010). Language testing: Then and now. Journal of NELTA, 8(1), 49-67.

GPA English. (2017). Tìm hiểu về critical thinking – Tư duy phản biện [Learning about critical thinking].

Retrieved October 3, 2017 from http://gpaenglish.edu.vn/tin-tuc-su-kien/tim-hieu-ve-critical-

thinking-tu-duy-phan-bien-180.htm

Goodwin, C. (1981). Conversation organization: Interaction between speakers and hearers. New York:

Academic Press.

Green, A. (1998). Verbal protocol analysis in language testing research. Cambridge: Cambridge University

Press.

Green, R. (2013). Statistical analyses for language testers. Basingstoke, UK: Palgrave Macmillan.

Green, A. (2013). Washback in language assessment. International Journal of English Studies, 13(2), 39-51.

Green, A., & Hawkey, R. (2012). Marking assessments: Rating scales and rubrics. In C. Coombe, P.

Davidson, B. O’Sullivan, & S. Stoynoff (Eds.), The Cambridge guide to second language assessment

(pp. 299-306). New York: Cambridge University Press.

Guariento, W., & Morley, J. (2001). Text and task authenticity in the EFL classroom. ELT Journal, 55(4),

347-353.

Guskey, T. R. (2003). How classroom assessments improve learning. Educational, School, and Counseling

Psychology Faculty Publications, 60(5), 6-11.

Gwet, K. L. (2012). Handbook of inter-rater reliability: The definitive guide to measuring the extent of

agreement among multiple raters (3rd ed.). Gaithersberg, MD: Advanced Analytics, LLC.

Halcomb, E. J., & Davidson, P. M. (2006). Is verbatim transcription of interview data always necessary?

Applied Nursing Research, 19(1), 38-42.

Hallam, S., & Ireson, J. (2005). Secondary school teachers' pedagogic practices when teaching mixed and

structured ability classes. Research Papers in Education, 20(1), 3-24.

Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons (Ed.), Assessing second

language writing in academic contexts (pp. 241-276). Norwood, NJ: Ablex Publishing.

Hanh My (2012). Guidelines for designing English lesson plans [Hướng dẫn thiết kế giáo án môn tiếng Anh].

Retrieved from https://tailieu.vn/doc/huong-dan-thiet-ke-giao-an-mon-tieng-anh-1118951.html

Hansen, M. T., Nohria, N., & Tierney, T. (1999). What’s your strategy for managing knowledge? Retrieved

from


https://eclass.uoa.gr/modules/document/file.php/ECON197/Papers%20Strategy/Hansen%201999%2

0Whats%20your%20strategy%20for%20managing%20knowledge.pdf

Hawthorne, L. (1997). The political dimension of English language testing in Australia. Language

Testing, 14(3), 248-260.

He, L., & Dai, Y. (2006). A corpus-based investigation into the validity of the CET-SET group discussion.

Language Testing, 23(3), 370-401.

He, A. W., & Young, R. (1998). Language proficiency interviews: A discourse approach. In K. d. Bot & T.

Huebner (Series Eds.), & R. Young & A. W. He (Eds.). Studies in Bilingualism: Vol. 14. Talking and

testing: Discourse approaches to the assessment of oral proficiency (pp. 1-24). The Netherlands:

John Benjamins Publishing Co.

Heaton, J. B. (1975). Writing English language tests: A practical guide for teachers of English as a second or

foreign language. London: Longman Publishing Group.

Hesse-Biber, S. N., & Leavy, P. (2006). The practice of qualitative research. Thousand Oaks, CA: Sage.

Henning, G. (1987). A guide to language testing: Development, evaluation, research. Cambridge, MA:

Newbury House Publishers.

Heyden, P. M. (1920). Experience with oral examiners in modern languages. Modern Language Journal, 5,

87-92.

Hilsdon, J. (1991). The group oral exam: Advantages and limitations. In J. C. Alderson & B. North (Eds.),

Language testing in the 1990s: The communicative legacy (pp.189-197). London: Modern English

Publications and the British Council.

Hirai, A., & Koizumi, R. (2013). Validation of empirically derived rating scales for a story retelling speaking

test. Language Assessment Quarterly, 10(4), 398-422.

Hiranburana, K., Subphadoongchone, P., Tangkiengsirisin, S., Phoocharoensil, S., Gainey, J., Thogsngsri, J., .

. . Taylor, P. (2017). A framework of reference for English language education in Thailand (FRELE-

TH) – based on the CEFR, the Thai experience. LEARN Journal, 10(2), 90-119.

HKDSE English Language Examination. (2012). Assessment, teaching and learning: From principles to

practice. Retrieved June 20, 2017 from

http://www.hkeaa.edu.hk/DocLibrary/SBA/HKDSE/Eng_DVD/atl_interrelationship.html

Hoang, V. V. (2010). The current situation and issues of the teaching of English in Vietnam. Retrieved from

http://www.ritsumei.ac.jp/acd/re/k-rsc/lcs/kiyou/pdf_22-1/RitsIILCS_22.1pp.7-18_HOANG.pdf

Hoang, V. V. (2013). The role of English in the internationalization of higher education in Vietnam.

Retrieved from https://js.vnu.edu.vn/FS/article/view/1082/1050


Hoang, V. V. (2016). Renovation in curriculum design and textbook development: An effective solution to

improving the quality of English teaching in Vietnamese schools in the context of integration and

globalization. VNU Journal of Science: Education Research, 32(4), 9-20.

Hoang, X. V. (2006). Bạch thư chữ Quốc Ngữ [Learning about the history of Quoc Ngu]. San Jose, CA:

Vietnamese Cultural Association.

Hoang Huong. (2018). Mỏi mệt với 2 chương trình tiếng Anh [Tiredness from two English programmes].

Retrieved from http://vietnammoi.vn/moi-met-voi-2-chuong-trinh-tieng-anh-71490.html

Hong Hanh. (2013). Giáo viên nước ngoài đến Việt Nam dạy Ngoại ngữ: Lương 6 triệu đồng/tháng [Foreign

teachers come to Vietnam to teach English: Salary of VND 6 million per month]. Retrieved from

https://dantri.com.vn/giao-duc-khuyen-hoc/giao-vien-nuoc-ngoai-den-viet-nam-day-ngoai-ngu-

luong-6-trieu-dongthang-1358906625.htm

Howard, F. (1980). Testing communicative proficiency in French as a second language: A search for

procedures. Canadian Modern Language Review, 36(2), 272-280.

Huang, H.-T. D., & Hung, S.-T. A. (2013). Comparing the effects of test anxiety on independent and

integrated speaking test performance. TESOL Quarterly, 47(2), 244-269.

Huang, H. T. D., Hung, S. T. A., & Hong, H. T. V. (2016). Test-taker characteristics and integrated speaking

test performance: A path-analytic study. Language Assessment Quarterly, 13(4), 283-301.

Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge: Cambridge University Press.

Hughes, R. (2011). Teaching and researching speaking (2nd ed.). London: Pearson Education Limited.

Huy Lan, & Lan Anh. (2010). Chương trình ngoại ngữ 10 năm: Khởi động lúng túng? [The 10-year foreign

language programme: Troubled initiation?]. Retrieved from https://nld.com.vn/giao-duc-khoa-

hoc/chuong-trinh-ngoai-ngu-10-nam-khoi-dong-lung-tung--2010081711321136.htm

Huyen Nguyen. (2018). Kỳ thi THPT quốc gia 2018: Đổi mới để hạn chế tối đa tiêu cực [The 2018 High

School Graduation Examination: Innovations for maximum restriction of negative occurrences].

Retrieved from https://laodong.vn/giao-duc/ky-thi-thpt-quoc-gia-2018-doi-moi-de-han-che-toi-da-

tieu-cuc-613639.ldo

Hustad, A., & McElwee, S. (2016). Ideas to action: A framework for the design and implementation of a

mixed methods study. In A. J. Moeller, J. W. Creswell, & N. Saville (Eds.), Second language

assessment and mixed methods research (pp. 299 – 324). Cambridge: Cambridge University Press.

Hutchinson, T., & Hutchinson, U. G. (1997). Textbook as agent of change. In T. Hedge, & N. Whitney (Eds.).

Power, pedagogy and practice (pp. 307-323). Oxford: Oxford University Press.

Hutchinson, T., & E. Torres (1994). The textbook as agent of change. ELT Journal, 48(4), 315-328.


Hwang, C. C. (2005). Effective EFL education through popular authentic materials. Asian EFL Journal, 7(1),

90-101.

International English Language Testing System. (2018a). Which IELTS test is right for me? Retrieved from

https://www.ielts.org/about-the-test/two-types-of-ielts-test

International English Language Testing System. (2018b). Common European Framework. Retrieved from

https://www.ielts.org/ielts-for-organisations/common-european-framework

Ingram, D. E. (1985). Assessing proficiency: An overview of some aspects of testing. In K. Hyltenstam & M.

Pienemann (Eds.), Modelling and assessing second language acquisition (pp. 215-276). San Diego,

CA: College Hill Press.

Ingram, D. E. (1996). The ASLPR: Its origins and current developments. Paper presented at the NLLIA Language

Expo '96, Brisbane, Queensland, Australia, July 19-21, 1996. Retrieved from

https://eric.ed.gov/?id=ED402735

Iwashita, N., McNamara, T., & Elder, C. (2001). Can we predict task difficulty in an oral proficiency test?

Exploring the potential of an information‐processing approach to task design. Language

learning, 51(3), 401-436.

Iwashita, N., Brown, A., McNamara, T., & O’Hagan, S. (2008). Assessed levels of second language speaking

proficiency: How distinct? Applied Linguistics, 29, 29–49.

Jenkins, J. (2006). Current perspectives on teaching world Englishes and English as a lingua franca. TESOL

Quarterly, 40(1), 157-181.

Johnson, B., & Christensen, L. (2014). Educational research: Quantitative, qualitative, and mixed

approaches (5th ed.). Thousand Oaks, C.A.: SAGE Publications, Inc.

Johnson, B., & Turner, L. A. (2003). Data collection strategies in mixed methods research. In A. Tashakkori

& C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (pp. 297-319).

Thousand Oaks, CA: Sage Publications.

Johnson, M., & Tyler, A. (1998). Re-analyzing the OPI: How does it look like natural conversation? In K. d.

Bot & T. Huebner (Series Eds.), & R. Young & A. W. He (Eds.). Studies in Bilingualism: Vol. 14.

Talking and testing: Discourse approaches to the assessment of oral proficiency (pp. 27-51). The

Netherlands: John Benjamins Publishing Co.

Kam, H. W. (2002). English language teaching in East Asia today: An overview. Asia Pacific Journal of

Education, 22(2), 1-22.

Keitges, D. J. (1982). Language proficiency interview testing: An overview. JALT JOURNAL, 4, 17-45.

Kelly, G. P. (1977). Colonial schools in Vietnam, 1918 to 1938. In Proceedings of the Meeting of the French

Colonial Historical Society (Vol. 2, pp. 96-106). The USA: Michigan State University Press.


Kendon, A. (1979). Some theoretical and methodological aspects of the use of film in the study of social

interaction. In G. P. Ginsberg (Ed.), Emerging strategies in social psychological research (pp. 67-91).

New York: John Wiley.

Kendon, A. (1994). Do gestures communicate? A review. Research on language and social

interaction, 27(3), 175-200.

Kenh Tuyen Sinh. (2014). Hàng ngàn sinh viên yếu về kỹ năng, thiếu về định hướng [Thousands of students

are weak in skills and lack orientation]. Retrieved from http://kenhtuyensinh.vn/hang-ngan-sinh-

vien-yeu-ve-ky-nang-thieu-ve-dinh-huong

Khanh Linh. (2018). Thực phẩm bẩn tràn lan: Trách nhiệm thuộc về ai? [The widespread of insanitary food:

Who takes responsibility?]. Retrieved from http://thoibaotaichinhvietnam.vn/pages/xa-hoi/2018-04-

05/thuc-pham-ban-tran-lan-trach-nhiem-thuoc-ve-ai-55813.aspx

Knoch, U. (2007). Diagnostic writing assessment: The development and validation of a rating scale (Doctoral

dissertation, ResearchSpace@ Auckland).

Knoch, U. (2009). Diagnostic writing assessment: The development and validation of a rating scale (Vol. 17).

Peter Lang.

Koike, D. A. (1998). What happens when there’s no one to talk to? Spanish foreign language discourse in

simulated oral proficiency interviews. In K. d. Bot & T. Huebner (Series Eds.), & R. Young & A. W.

He (Eds.). Studies in Bilingualism: Vol. 14. Talking and testing: Discourse approaches to the

assessment of oral proficiency (pp. 69-98). The Netherlands: John Benjamins Publishing Co.

Krejtz, I., Szarkowska, A., & Łogińska, M. (2015). Reading function and content words in subtitled

videos. Journal of Deaf Studies and Deaf Education, 21(2), 222-232.

Kronberger, N., & Wagner, W. (2000). Key words in context: Statistical analysis of text features. In M. W.

Bauer & G. Gaskell (Eds.), Qualitative researching with text, image and sound. A practical

handbook (pp. 299-317). London: Sage Publications Ltd.

Kunnan, A. (1994). Modelling relationships among some test-taker characteristics and performance on EFL

tests: An approach to construct validation. Language Testing, 11(3), 225-252.

Kunnan, A. J. (1995). Test taker characteristics and test performance: A structural modeling study (Vol. 2).

Cambridge: Cambridge University Press.

Kunnan, A. J. (2000). Fairness and validation in language assessment: Selected papers from the 19th

Language Testing Research Colloquium, Orlando, Florida (Vol. 9). Cambridge: Cambridge

University Press.

Kvale, S. (2007). Doing interviews. London: SAGE Publications Ltd.


Kvale, S., & Brinkmann, S. (2009). Interviews: Learning the craft of qualitative interviewing (2nd ed.).

Thousand Oaks, CA: SAGE Publications, Inc.

Lam, H. (2018). Gian lận thi cử 'chấn động', giáo dục Việt Nam về đâu? [‘Shocking’ fraud in the national

examination, where is Vietnam’s education going?]. Retrieved from https://baomoi.com/gian-lan-thi-

cu-chan-dong-giao-duc-viet-nam-ve-dau/c/26936037.epi

Lam, T. L. (2011). The impact of Vietnam’s globalization on national education policies and teacher training

programs for teachers of English as an international language: A case study of the University of

Pedagogy in Ho Chi Minh City. (Doctoral dissertation), Alliant International University, San Diego,

CA.

Lam, T. L. H., & Albright, J. (2018). Vietnamese foreign language policy in higher education: A barometer to

social changes. In J. Albright (Ed.), English Tertiary Education in Vietnam (pp. 19-33). London:

Routledge.

Lan Tam. (2017). Tại sao cứ gọi Sài Gòn mà không phải Hồ Chí Minh [Why is it still called Sai Gon but not

Ho Chi Minh]. Retrieved from https://giadinhviet.com.vn/van-hoa-giao-duc/tai-sao-cu-goi-sai-gon-

ma-khong-phai-ho-chi-minh-297.html

Language Link Academic. (2017). 8 kĩ năng cơ bản nhất cho bài thi TOEIC [The 8 most fundamental

techniques for the TOEIC test]. Retrieved from https://llv.edu.vn/vi/goc-chuyen-gia-toeic/8-ki-nang-

co-ban-nhat-cho-bai-thi-toeic

Language Testing Service. (2003). Linking classroom assessment with student learning. Retrieved from

https://www.ets.org/Media/Tests/TOEFL_Institutional_Testing_Program/ELLM2002.pdf

Larson, J. W. (2000). Testing oral language skills via the computer. Calico Journal, 18(1), 53-66.

Lazaraton, A., & Frantz, R. (1997). An analysis of the relationship between task features and candidate output

for the Revised FCE Speaking Examination. Report prepared for the EFL Division, University of

Cambridge Local Examinations Syndicate, Cambridge, UK.

Le, Q. K. (2015). Tiếng Anh – Chìa khóa giúp bạn thành công! [English – Key to your success!]. Retrieved

from https://www.careerlink.vn/en/careertools/career-advice/tieng-anh-–-chia-khoa-giup-ban-thanh-

cong

Le, S. (2011). Teaching English in Vietnam: Improving the provision in the private sector. Ph. D. Thesis,

University of Victoria, Australia.

Le, T. T., & Chen, S. (2018). Globalisation and Vietnamese foreign language education. In J. Albright (Ed.),

English Tertiary Education in Vietnam (pp. 34-45). Australia: Routledge.

Le, T. T. P. (2005). The quality of teaching and testing of English specific purpose program in the USSH.

M.A. Thesis. Vietnam: University of Social Sciences and Humanities.


Le, V. (2016). Quá nửa sinh viên tốt nghiệp kém tiếng Anh [More than half of graduates are weak at English].

Retrieved from https://baotintuc.vn/giao-duc/qua-nua-sinh-vien-tot-nghiep-kem-tieng-anh-

20160506225914927.htm

Le, V. C., & Barnard, R. (2009). Curricular innovation behind closed classroom doors: A Vietnam case study.

Prospect, 24(2), 20-33.

Le, V. M. (2016). Tính hiếu học của người Việt Nam [Vietnamese people’s fondness for learning]. Vietnam

Journal of Social Sciences, 4(101), 103-108.

Le, V. P. (2014). Hội Truyền bá Quốc ngữ và tác động của nó đến xã hội Việt Nam 1938-1945 [Association

for Quoc ngu Dissemination and its effects upon Vietnamese society 1938-1945]. Unpublished

doctoral dissertation, Hanoi University of Social Sciences and Humanities, Vietnam.

Le, V. P. (2017). Sự phổ biến chữ Quốc ngữ trên Đông Dương Tạp chí và Nam Phong Tạp chí [The

popularity of Quoc ngu in Dong Duong Magazine and Nam Phong Magazine]. Retrieved from

http://nghiencuuquocte.org/2017/12/24/su-pho-bien-chu-quoc-ngu-tren-dong-duong-tap-chi-va-nam-

phong-tap-chi/

LeCompte, M. D. (2009). Trends in research on teaching: An historical and critical overview. In L. J. Saha

& A. G. Dworkin (Eds.), International handbook of research on teachers and teaching (pp. 25-60).

New York: Springer Science + Business Media LLC.

Le Huyen. (2018). Công bố lịch thi THPT quốc gia 2018 [Public announcement of the 2018 national high-

school graduation examination]. Retrieved from http://vietnamnet.vn/vn/giao-duc/tuyen-sinh/cong-

bo-lich-thi-thpt-quoc-gia-2018-436122.html

Li, L., Chen, J., & Sun, L. (2015). The effects of different lengths of pretask planning time on L2 learners'

oral test performance. TESOL Quarterly, 49(1), 38-66.

Lich, N., Luan, M., Anh, H., & Thanh, B. (2014). Chứng chỉ A, B, C... tụt dốc [Level A-B-C Certificates are…

slipping down]. Retrieved from https://thanhnien.vn/giao-duc/chung-chi-a-b-c-tut-doc-501977.html

Lich Su Viet Nam. (2016). Nguồn gốc tên gọi Sài Gòn ngày nay [The origin of the name Saigon today].

Retrieved from https://lichsunuocvietnam.com/nguon-goc-ten-goi-sai-gon/

Little, D. (2005). The Common European Framework and the European Language Portfolio: Involving

learners and their judgements in the assessment process. Language Testing, 22(3), 321-336.

Liu, M. (2007). Language anxiety in EFL testing situations. ITL-International Journal of Applied Linguistics,

153(1), 53-75.

Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.

Lowe, P. (1988). ILR handbook on oral interview testing. Washington, DC: Defense Language

Institute/Foreign Service Institute Oral Interview Project.


Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training.

Language Testing, 12(1), 54-71.

Lumley, T., & O’Sullivan, B. (2005). The effect of test-taker gender, audience and topic on task performance

in tape-mediated assessment of speaking. Language Testing, 22(4), 415-437.

Lundeberg, O. K. (1929). Recent developments in audition-speech tests. Modern Language Journal, 14, 193-

202.

Lynn, M. R. (1986). Determination and quantification of content validity. Nursing Research, 35(6), 382-386.

MacKinnon, A., & Hao, L. V. (2016). The sociocultural context of higher education in Vietnam: A case for

collaborative learning in physics courses. International Journal of Educational Studies, 1(3), 145-

161.

Madsen, H. S. (1983). Techniques in testing. Hong Kong: Oxford University Press.

Mathews, J. C. (1985). Examinations: A commentary. London: George Allen and Unwin.

Mathison, S. (1998). Why triangulate? Educational Researcher, 17(2), 13-17.

May, L. (2010). Developing speaking assessment tasks to reflect the ‘social turn’ in language testing.

University of Sydney Papers in TESOL, 5, 1-30.

May, L. (2011). Interactional competence in a paired speaking test: Features salient to raters. Language

Assessment Quarterly, 8(2), 127-145, doi: 10.1080/15434303.2011.565845

Mart, C. T. (2013). The audio-lingual method: An easy way of achieving speech. International Journal of

Academic Research in Business and Social Sciences, 3(12), 63.

Martyniuk, W. (2008). Relating language examinations to the Council of Europe’s Common European

Framework of Reference for Languages (CEFR). In L. Taylor & C. J. Weir (Eds.), Multilingualism

and assessment: Achieving transparency, assuring quality, sustaining, diversity – Proceedings of the

ALTE Berlin Conference May 2005 (pp. 9-20). Cambridge: Cambridge University Press.

McDonald, A. S. (2001). The prevalence and effects of test anxiety in school children. Educational

psychology, 21(1), 89-101.

McDonough, K., & Chaikitmongkol, W. (2007). Teachers' and learners' reactions to a task‐based EFL course

in Thailand. TESOL Quarterly, 41(1), 107-132.

McGarrell, H. M. (1981). Language testing: A historical perspective. Medium, 6(4), 17-40.

McMillan, J. H., & Schumacher, S. (2001). Research in education: A conceptual introduction (5th ed.). New

York.: Addison Wesley Longman, Inc.

McNamara, T. F. (1995). Modelling performance: Opening Pandora’s Box. Applied Linguistics, 16(2), 159-

175.


McNamara, T. F. (1996). Measuring second language performance. New York: Addison Wesley Longman.

McNamara, T. (2005). 21st century shibboleth: Language tests, identity and intergroup conflict. Language

Policy, 4(4), 351-370.

McNamara, T., & Eades, D. (2004). Linguistic identification of refugee claimants: a flawed test? Presentation

in Invited Symposium ‘Enforcing citizenship policy through language tests’. American Association

for Applied Linguistics Annual Meeting, Portland, OR, May.

McNamara, T. F., & Lumley, T. (1997). The effect of interlocutor and assessment mode variables in overseas

assessments of speaking skills in occupational settings. Language Testing, 14(2), 140-56.

McNamara, T. F. (2000). Language testing. Oxford: Oxford University Press.

Merriam, S. B. (1988). Case study research in education: A qualitative approach. San Francisco: Jossey-

Bass.

Merriam, S. B. (2009). Qualitative research. A guide to design and implementation. San Francisco: Jossey-

Bass.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp.13-103). New York:

Macmillan.

Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241-256.

Minh Nhat. (2017). Dạy học tiếng Anh ở trường phổ thông: Nhiều bất cập, yếu kém [Teaching and learning

English at high schools: Lots of inadequacy and weaknesses]. Retrieved from

https://news.zing.vn/day-hoc-tieng-anh-o-truong-pho-thong-nhieu-bat-cap-yeu-kem-post708487.html

Moc Tra. (2017). Sinh viên ‘đuối’ việc vì không nói được tiếng Anh [Students are ‘exhausted’ from work

because of not having sufficient ability to speak English]. Retrieved from http://news.zing.vn/sinh-

vien-duoi-viec-vi-khong-noi-duoc-tieng-anh-post739052.html

Moeller, A. J. (2016). The confluence of language assessment and mixed methods. In A. J. Moeller, J. W.

Creswell, & N. Saville (Eds.), Second language assessment and mixed methods research.

Cambridge: Cambridge University Press.

MOET. (2007). Quyết định số 43/2007/QĐ-BGD&ĐT về việc “Ban hành quy chế đào tạo đại học cao đẳng

hệ chính quy theo hệ thống tín chỉ” [Decision no. 43/2007/QĐ-BGD&ĐT, on “Issuance of

regulations on training full-time students at universities and colleges following the credit-based

system”]. Retrieved from

http://www.chinhphu.vn/portal/page/portal/chinhphu/hethongvanban?class_id=1&_page=277&mode

=detail&document_id=38739

MOET. (2010). Quyết định số 3321/QĐ-BGDĐT về việc ban hành Chương trình thí điểm Tiếng Anh Tiểu học

[Decision no. 3321/QĐ-BGDĐT on issuance of the pilot English programme for primary schools].


Retrieved from http://ngoainguquocgia.moet.gov.vn/van-ban/van-ban-phap-

quy/id/91/moduleId/491/control/Open

MOET. (2014). Thông tư số 01/2014/TT-BGDĐT về việc “Ban hành khung năng lực ngoại ngữ 6 bậc dùng

cho Việt Nam” [Circular no. 01/2014/TT-BGDDT on “Issuance of the six-level framework of

reference for foreign language proficiency adopted in Vietnam”]. Retrieved from

http://vbpl.vn/bogiaoducdaotao/Pages/vbpq-van-ban-goc.aspx?ItemID=37680

MOET. (2015). Thông tư số 07/2015/TT-BGDĐT về việc “Ban hành quy định về khối lượng kiến thức tối

thiểu, yêu cầu về năng lực mà người học đạt được sau khi tốt nghiệp đối với mỗi trình độ đào tạo

của giáo dục đại học và quy trình xây dựng, thẩm định, ban hành chương trình đào tạo trình độ đại

học, thạc sĩ, tiến sĩ” [Circular no. 07/2015/TT-BGDDT on “Issuance of regulations about the

minimum amount of knowledge, the qualification requirements that students are to meet upon

graduation for each training level of tertiary education, and the process of constructing, appraising,

issuing curricula for training at bachelor, master, doctorate levels”]. Retrieved from

http://vbpl.vn/bogiaoducdaotao/Pages/vbpq-van-ban-goc.aspx?ItemID=66762

MOET. (2016a). Tóm lược lịch sử phát triển giáo dục và đào tạo Việt Nam [A brief history of the

development of Vietnamese education and training]. Retrieved from https://moet.gov.vn/gioi-

thieu/lich-su-phat-trien/Pages/default.aspx?ItemID=4089

MOET. (2016b). Công văn số 3755/BGDĐT-GDTX về việc quy đổi chứng chỉ ngoại ngữ, tin học [Document

no. 3755/BGDĐT-GDTX on the equivalency of foreign languages and informatics]. Retrieved from

https://moet.gov.vn/van-ban/vbdh/Pages/chi-tiet-van-ban.aspx?ItemID=1985#content_1

Montagut, M., & Murtra, P. (2005). The Common European Framework of Reference for Languages and the

revision of the higher level Catalan language test. In L. Taylor & C. J. Weir (Eds.), Multilingualism

and assessment: Achieving transparency, assuring quality, sustaining, diversity – Proceedings of the

ALTE Berlin Conference May 2005 (pp. 117-129). Cambridge: Cambridge University Press.

Morrison, K. R. B. (1993). Planning and accomplishing school-centred evaluation. Dereham, U.K.: Peter

Francis.

Morrow, K. (1979). Communicative language testing: Revolution or evolution? In C. J. Brumfit & K. Johnson

(Eds.), The communicative approach to language teaching. Oxford: Oxford University Press.

Morrow, C. K. (2018). Communicative language testing. The TESOL Encyclopedia of English Language

Teaching, 1-7.

Mousavi, S. A. (Ed.) (2002). An encyclopaedic dictionary of language testing. Taiwan: Tung Hua Book

Company.

Muñoz, A., & Álvarez, M. E. (2010). Washback of an oral assessment system in the EFL classroom.

Language Testing January, 27(1), 33-49. doi:10.1177/0265532209347148


Murphy, J. M., & Baker, A. A. (2015). History of ESL pronunciation teaching. In M. Reed & J. M. Levis

(Eds.), The handbook of English pronunciation (pp. 36-65). Oxford: John Wiley & Sons, Inc.

My Ha. (2016). Dạy tiếng Anh chương trình 10 năm: Còn yếu và thiếu [Teaching the ten-year English

programme: Still weak and insufficient]. Retrieved from https://baomoi.com/day-tieng-anh-chuong-

trinh-10-nam-con-yeu-va-thieu/c/20679050.epi

Nagai, N., & O’Dwyer, F. (2011). The actual and potential impacts of the CEFR on language education in

Japan. Synergies Europe, 6, 141-152.

Nambiar, M. K., & Goon, C. (1993). Assessment of oral skills: A comparison of scores obtained through

audio recordings to those obtained through face-to-face evaluation. RELC Journal, 24(1), 15-31.

Nakatsuhara, F. (2004). An investigation into conversational styles in paired speaking test. M.A. dissertation,

University of Essex, UK.

NCELTR. (2004). About NCELTR. Macquarie University. Retrieved from www.anzgroup.com.ar/downfile/21

Neuman, L. W. (2011). Social research methods: Qualitative and quantitative approaches. New York:

Pearson Education, Inc.

Ngoc, K. M., & Iwashita, N. (2012). A comparison of learners' and teachers' attitudes toward

communicative language teaching at two universities in Vietnam. University of Sydney Papers in

TESOL, 7.

Ngoc Ha, & Tran Huynh. (2014). Chứng chỉ ngoại ngữ A, B, C đã lạc hậu [Level A-B-C Certificates have

become out of date]. Retrieved from http://ndh.vn/chung-chi-ngoai-ngu-a-b-c-da-lac-hau-

20141009084751156p125c133.news

Nguyen, D. C. (2009). Teaching grammar to learners [Dạy ngữ pháp cho học sinh]. Retrieved from

https://sites.google.com/site/eltsite/Home/tap-huan-giao-vien-teacher-training/phuong-phap-day-hoc-

mon-tieng-anh-thcs-elt-methodology---middle-schools/daynguphapchohocsinh

Nguyen, C. D. (2017). Connections between learning and teaching: EFL teachers’ reflective practice.

Pedagogies, 12(3), 237-255.

Nguyen, D. K. (2017). Tại sao tư duy phản biện trong giáo dục rất khó thực hiện ở Việt Nam? [Why is it very

challenging to implement critical thinking in Vietnamese education?]. Retrieved from

https://baomoi.com/tai-sao-tu-duy-phan-bien-trong-giao-duc-rat-kho-thuc-hien-o-viet-

nam/c/22314055.epi

Nguyen, H. H. (2017). Đông Kinh Nghĩa Thục: Cuộc cách mạng giáo dục đầu tiên ở Việt Nam [Dong Kinh

Nghia Thuc: The first educational revolution in Vietnam]. Retrieved from

http://nghiencuuquocte.org/2017/07/24/dong-kinh-nghia-thuc-cach-mang-giao-duc/


Nguyen, N. T. (2017). Thirty years of English language and English education in Vietnam: Current reflections

on English as the most important foreign language in Vietnam, and key issues for English education

in the Vietnamese context. English Today, 33(1), 33-35.

Nguyen, T. A. (2018). Thực phẩm chứa phụ gia và chất bảo quản nguy hại thế nào? [How harmful are

additives and preservatives?]. Retrieved from http://suckhoedoisong.vn/thuc-pham-chua-phu-gia-va-

chat-bao-quan-nguy-hai-the-nao-n129137.html

Nguyen, T. G. (2006). Chính sách ngôn ngữ ở Việt Nam qua các thời kì lịch sử [Language policies in

Vietnam via periods of history]. Retrieved from http://ngonngu.net/index.php?m=print&p=172

Nguyen, T. H. (2017). Đề án ngoại ngữ quốc gia: ‘Nguy cơ thất bại được báo trước’ [The national foreign

language scheme: ‘A foreseen risk of failure’]. Retrieved from https://news.zing.vn/de-an-ngoai-ngu-

quoc-gia-nguy-co-that-bai-duoc-bao-truoc-post711914.html

Nguyen, X. Q. (2016). 3 thách thức trong dạy và học tiếng Anh ở Việt Nam [3 challenges in teaching and

learning English in Vietnam]. Retrieved from https://vnexpress.net/tin-tuc/giao-duc/hoc-tieng-anh/3-

thach-thuc-trong-day-va-hoc-tieng-anh-o-viet-nam-3490098.html

Nguyen, V. S. (2016). Thế kỷ của văn học Quốc ngữ, thế kỷ XX [The century of Quoc Ngu literature, the

twentieth century]. Retrieved from https://vietbao.com/a255131/the-ky-cua-van-hoc-quoc-ngu-the-

ky-xx

Nguyen Thuy. (2013). Tràn lan hàng Trung Quốc chứa chất độc gây ung thư [Widespread Chinese goods

containing toxic substances causing cancer]. Retrieved from https://www.tienphong.vn/the-gioi/tran-

lan-hang-trung-quoc-chua-chat-doc-gay-ung-thu-637917.tpo

Nguyet Ha. (2017). Dạy và học khó khăn vì... trò giỏi tiếng Anh hơn thầy [Difficulties in teaching and

learning because… students are better at English than teachers]. Retrieved from

https://news.zing.vn/day-va-hoc-kho-khan-vi-tro-gioi-tieng-anh-hon-thay-post607757.html

Nhu Lich, & Ha Anh. (2014). Chứng chỉ ngoại ngữ A, B, C… tụt dốc: Tồn tại hay không tồn tại? [The A, B,

C foreign language certificates… have lagged behind: To be or not to be?]. Retrieved from

https://thanhnien.vn/giao-duc/chung-chi-ngoai-ngu-a-b-c-tut-doc-ton-tai-hay-khong-ton-tai-

502140.html

Nitta, R., & Nakatsuhara, F. (2014). A multifaceted approach to investigating pre-task planning effects on

paired oral test performance. Language Testing, 31(2), 147-175.

Norris, J. M., Brown, J. D., Hudson, T., & Yoshioka, J. (1998). Designing second language performance

assessments. Hawaii: Second Language Teaching & Curriculum Center, University of Hawaii -

Manoa.

Norton, J. (2005). The paired format in the Cambridge Speaking Tests. ELT Journal, 59(4), 287-297.


North, B. (1995). The development of a common framework scale of descriptors of language proficiency

based on a theory of measurement. System, 23(4), 445-465.

North, B. (2009). The educational and social impact of the CEFR in Europe and beyond: A preliminary

overview. Language testing matters: Investigating the wider social and educational impact of

assessment, 357-378.

Nunan, D. (1989). Designing tasks for the communicative classroom. Cambridge: Cambridge University

Press.

Nunan, D. (2003). The impact of English as a global language on educational policies and practices in the

Asia‐Pacific Region. TESOL Quarterly, 37(4), 589-613.

O'Loughlin, K. J. (2001). The equivalence of direct and semi-direct speaking tests. Cambridge: Cambridge

University Press.

Osterlind, S. J. (1998). Constructing test items: Evaluation in education and human services (Vol. 47). The

Netherlands: Springer.

O’Sullivan, B. (2002). Learner acquaintanceship and oral proficiency test pair-task performance. Language

Testing, 19(3), 277-295.

O’Sullivan, B. (2011). Introduction. In B. O’Sullivan (Ed.), Language testing: Theories and practices (pp. 1-

12). New York: Palgrave Macmillan.

O’Sullivan, B. (2012). A brief history of language testing. In C. Christensen, D. Peter, O. S. Barry, & S.

Stephen (Eds.), The Cambridge guide to second language assessment (pp. 9-19). New York:

Cambridge University Press.

O’Sullivan, B., & Green, A. (2011). Test taker characteristics. In L. Taylor (Ed.), Examining speaking.

Cambridge: Cambridge University Press.

O’Sullivan, B., Weir, C. J., & Saville, N. (2002). Using observation checklists to validate speaking-test

tasks. Language testing, 19(1), 33-56.

Oxford, R. (2001). Integrated skills in the ESL/EFL classroom. Retrieved from

http://resourcesforteflteachers.pbworks.com/f/Integrated%20Skills%20in%20the%20ESL%20EFL%

20Classroom.pdf

Oxford, R. L. (2003). Language learning styles and strategies: Concepts and relationships. IRAL, 41(4), 271-

278.

Oxford, R., & Shearin, J. (1994). Language learning motivation: Expanding the theoretical framework. The

Modern Language Journal, 78(1), 12-28.


Ozer, I., Fitzgerald, S. M., Sulbaran, E., & Garvey, D. (2014). Reliability and content validity of an English as

a Foreign Language (EFL) grade-level test for Turkish primary grade students. Procedia-Social and

Behavioral Sciences, 112, 924-929.

Partington, J. (1994). Double‐marking students' work. Assessment & Evaluation in Higher Education, 19(1),

57-60.

Pham, H. H. (2005). Imported communicative language teaching: Implications for local teachers. English

Teaching Forum, 43(4), 2-9.

Pham, H. H. (2007). Communicative language teaching: Unity within diversity. ELT Journal, 61(3), 193-201.

Pham, K. (2017). Street Cred: Alexandre de Rhodes and the birth of chữ Quốc ngữ. Retrieved from

https://saigoneer.com/saigon-people/9498-street-cred-alexandre-de-rhodes-and-the-birth-of-chữ-

quốc-ngữ

Pham, L. A. (2017). Đổi mới trong kiểm tra đánh giá: Lộ trình và thách thức [Innovations in testing and

assessment: Itinerary and challenges]. Retrieved from https://dean2020.edu.vn/vi/news/Tin-tuc/doi-

moi-trong-kiem-tra-danh-gia-lo-trinh-va-thach-thuc-409.html

Pham, M. H. (Ed.) (1991). Education in Vietnam 1945-1991. Hanoi: MOET.

Phan, L. H. (2013). Issues surrounding English, the internationalisation of higher education and national

cultural identity in Asia: A focus on Japan. Critical Studies in Education, 54(2), 160-175.

Phan, C. (2014). Vấn nạn về tiêu chuẩn tiếng Anh cho sinh viên tốt nghiệp Đại học [Challenges in EFL

standardisation for university graduates]. Retrieved from http://cep.com.vn/index.php/Van-nan-ve-

tieu-chuan-tieng-Anh-cho-sinh-vien-tot-nghiep-Dai-hoc-544.html

Phillips, E. M. (1992). The effects of language anxiety on students' oral test performance and attitudes. The

Modern Language Journal, 76(1), 14-26.

Phung, V. M. (2015). Tầm quan trọng của thứ ngôn ngữ toàn cầu [The importance of a global language].

Retrieved from http://kenh14.vn/hoc-duong/tam-quan-trong-cua-thu-ngon-ngu-toan-cau-

2015012808478584.chn

Phuong Chinh. (2017). 85% sinh viên chưa đạt chuẩn trình độ tiếng Anh [85% of university students have not

reached the standardised level of English language proficiency]. Retrieved from

http://www.sggp.org.vn/85-sinh-vien-chua-dat-chuan-trinh-do-tieng-anh-456676.html

Piccardo, E. (2012). Multidimensionality of assessment in the Common European Framework of Reference

for languages (CEFR). OLBI Working Papers, 4, 37-54.

Pietila, P. (1998). Advanced foreign language learners in action: A look at two speaking tasks. Retrieved

from http://search.proquest.com/docview/62246966?accountid=10499


Pino, B. G. (1989). Prochievement testing of speaking. Foreign Language Annals, 22(5), 487-496.

Pirhonen, N. (2016). Students' perceptions about the use of oral feedback in EFL classrooms. Unpublished

master’s thesis. Finland: University of Jyväskylä, Department of Languages. Retrieved from

https://jyx.jyu.fi/bitstream/handle/123456789/49927/URN:NBN:fi:jyu-

201605252709.pdf?sequence=1

Polit, D. F., & Beck, C. T. (2006). The content validity index: Are you sure you know what's being reported?

Critique and recommendations. Research in Nursing & Health, 29(5), 489-497.

Polit, D. F., Beck, C. T., & Owen, S. V. (2007). Is the CVI an acceptable indicator of content validity?

Appraisal and recommendations. Research in Nursing & Health, 30(4), 459-467.

Pollitt, A., & Murray, N. L. (1996). What raters really pay attention to. In M. Milanovic & N. Saville (Eds.),

Studies in language testing (Vol. 3, pp. 74-91). Cambridge: Cambridge University Press.

Pop, A., & Dredetianu, M. (2013). Assessment of speaking in English for specific purposes (ESP) including a

voice tool component. Studia Universitatis Petru Maior - Philologia, 15, 105-110.

Popham, W. J. (1987). The merits of measurement-driven instruction. The Phi Delta Kappan, 68(9), 679-682.

Porte, G. K. (2010). Appraising research in second language learning: A practical approach to critical

analysis of quantitative research (Vol. 28). Philadelphia, USA: John Benjamins Publishing

Company.

Psathas, G. (1995). Conversation analysis: The study of talk-in-interaction. Thousand Oaks, CA: Sage

Publications Inc.

Punch, K. F. (1997). Introduction to social research: Quantitative and qualitative approaches. Thousand

Oaks, CA: Sage.

Qandme. (2015). Survey on English study in Vietnam [Khảo sát về việc học tiếng Anh ở Việt Nam]. Retrieved

from https://qandme.net/vi/baibaocao/Khao-sat-ve-viec-hoc-tieng-Anh-o-Viet-Nam.html

Quality Training Solutions. (2017). 5 địa điểm trong thành phố tha hồ cho bạn nói tiếng Anh [5 venues in the

city for you to speak English freely]. Retrieved from https://www.qts.edu.vn/news/5-dia-diem-trong-

thanh-pho-cho-ban-tha-ho-noi-tieng-anh-293

Quang Nhat. (2017). Bạo lực học đường ngày càng đáng sợ [School violence is more and more terrifying].

Retrieved from https://nld.com.vn/thoi-su-trong-nuoc/bao-luc-hoc-duong-ngay-cang-dang-so-

20170209220257913.htm

Quynh Trang. (2017). Bộ Giáo dục sửa đề án ngoại ngữ 2020 [MOET changes the 2020 Foreign Language

Project]. Retrieved from https://vnexpress.net/giao-duc/bo-giao-duc-sua-de-an-ngoai-ngu-2020-

3631023.html


Quy Hien, & Nguyen Dung. (2014). Minh bạch và trung thực là vấn đề số một của thi cử [Transparency and

honesty are of top concern in examinations]. Retrieved from https://dantri.com.vn/giao-duc-khuyen-

hoc/minh-bach-va-trung-thuc-la-van-de-so-mot-cua-thi-cu-1406532521.htm

Radchenko, S. (2018). Why were the Russians in Vietnam? Retrieved from

https://www.nytimes.com/2018/03/27/opinion/russians-vietnam-war.html

Rapley, T. (2007). Doing conversation, discourse and document analysis. Cambridge: Sage Publications.

Read, J., & Hayes, B. (2003). The impact of IELTS on preparation for academic study in New Zealand. In R.

Tulloh (Ed.), International English Language Testing System (IELTS) Research Reports 2003:

Volume 4, 153, IELTS Australia, Canberra.

Reed, D. J., & Cohen, A. D. (2001). Revisiting raters and ratings in oral language assessment. In M.

Milanovic (Series Ed.) & C. Elder, A. Brown, E. Grove, K. Hill, N. Iwashita, L. Lumley, F.

McNamara, & K. O’Loughlin (Eds.), Studies in Language Testing: Vol. 11. Experimenting with

uncertainly: Essays in honour of Alan Davies (pp. 82-96). Cambridge: Cambridge University Press.

Restrepo, A. P. M., & Villa, M. E. A. (2003). Estimating the validity and reliability of an oral assessment

instrument. Revista Universidad EAFIT, 39(132), 65-75.

Richards, B., & Chambers, F. (1996). Reliability and validity in the GCSE oral examination. Language

Learning Journal, 14, 28-34.

Richards, J. C. (2006). Communicative language teaching today. New York: Cambridge University Press.

Roach, J. O. (1945). Some problems of oral examinations in modern languages. An experimental approach

based on the Cambridge examinations in English for Foreign Students. University of Cambridge

Examinations Syndicate: Internal report circulated to oral examiners and local representatives for

these examinations.

Rodgers, G. (2017). Where is Saigon? And should you say "Ho Chi Minh City" or "Saigon"? Retrieved from

https://www.tripsavvy.com/where-is-saigon-1458405

Rodgers, T. S. (2001). Language teaching methodology. ERIC Issue Paper. Retrieved from

https://eric.ed.gov/?id=ED459628

Saif, S. (2006). Aiming for positive washback: A case study of international teaching assistants. Language

Testing, 23(1), 1-34.

Saldaňa, J. (2009). The coding manual for qualitative researchers. London: SAGE Publications Ltd.

Saville, N., & Hargreaves, P. (1999). Assessing speaking in the revised FCE. English Language Testing

Journal, 53(1), 42-51.


Scott, D., & Morrison, M. (2006). Key ideas in educational research. London: Continuum International

Publishing Group.

Searle, J. R. (1969). Speech acts: An essay in the philosophy of language. Cambridge: Cambridge University

Press.

Seidman, I. (2013). Interviewing as qualitative research. New York: Teachers College Press.

Seward, B. H. (1973). Measuring oral production in EFL. English Language Teaching Journal, 28(1), 76-80.

Shaw, S. D., & Weir, C. J. (2007). Examining writing: Research and practice in assessing second language

writing. Studies in Language Testing 26. Cambridge: UCLES/Cambridge University Press.

Shepard, L. A. (2000). The role of assessment in a learning culture. Educational Researcher, 29(7), 4-14.

Shohamy, E. (1983). Rater reliability of the oral interview speaking test. Foreign Language Annals, 16(3),

219-222.

Shohamy, E. (1993). The power of tests: The impact of language tests on teaching and learning.

Washington, DC: National Foreign Language Centre Occasional Papers.

Shohamy, E. (1994). The validity of direct versus semi-direct oral tests. Language Testing, 11, 99-123.

Shohamy, E. (2000). The relationship between language testing and second language acquisition,

revisited. System, 28(4), 541-553.

Shohamy, E., Reves, T., & Benjarano, Y. (1986). Introducing a new comprehensive test of oral proficiency.

English Language Teaching Journal, 40(3), 212-220.

Siddiek, A. (2010). The impact of test content validity on language teaching and learning. Asian Social

Science, 6(12), 133-143.

Skehan, P., & Foster, P. (1997). Task type and task processing conditions as influences on foreign language

performance. Language Teaching Research, 1(3), 185–211.

Skinner, B. F. (1957). Verbal behavior. New York: Appleton-Century-Crofts.

Smith-Khan, L. (2015, December 9). Discrimination by any other name: Language tests and racist migration

policy in Australia. Retrieved from http://www.languageonthemove.com/discrimination-by-any-

other-name-language-tests-and-racist-migration-policy-in-australia/

Son Tra. (2017). Những lợi ích khi sở hữu chứng chỉ IELTS [Benefits from obtaining an IELTS Certificate].

Retrieved from https://news.zing.vn/nhung-loi-ich-khi-so-huu-chung-chi-ielts-post743339.html

Spence-Brown, R. (2001). The eye of the beholder: Authenticity in embedded assessment task. Language

Testing, 18(4), 463-481.


Spielberger, C. D. (1972). Anxiety as an emotional state. In C. D. Spielberger (Ed.), Anxiety: Current trends

in theory and research (pp. 23-49). New York: Academic Press.

Spolsky, B. (1978). Approaches to language testing. Arlington, VA: Center for Applied Linguistics.

Spolsky, B. (1995). Measured words. Oxford: Oxford University Press.

Stahl, J. A., & Lunz, M. E. (1991). Judge performance reports: Media and message. Paper presented at the

annual meeting of the American Educational Research Association, San Francisco, CA.

Stoneman, B. (2005). An impact study of an exit English test for university graduates in Hong Kong:

Investigating whether the status of a test affects students’ test preparation activities. Unpublished

PhD thesis. Hong Kong Polytechnic University.

Study Guides and Strategies. (n.d.). Preparing for and taking oral exams. Retrieved July 10, 2017 from

http://www.studygs.net/oralexams.htm

Swain, M. (2001). Examining dialogue: Another approach to content specification and to validating

inferences drawn from test scores. Language Testing, 18(3), 275-302.

Swender, E. (2003). Oral proficiency testing in the real world: Answers to frequently asked questions.

Foreign Language Annals, 36(4), 520-526.

Taylor, L. (2011). Introduction. In L. Taylor (Ed.), Examining speaking: Research and practice in assessing

second language speaking (pp. 1-35). Cambridge: Cambridge University Press.

Taylor, L., & Galaczi, E. (2011). Scoring validity. In L. Taylor (Ed.), Examining speaking: Research and

practice in assessing second language speaking (pp. 171-233). Cambridge: Cambridge University

Press.

Tayeb, Y. A., Aziz, M. S. A., Ismail, K., & Khan, A. B. M. A. (2014). The washback effect of the General

Secondary English Examination (GSEE) on teaching and learning. GEMA Online Journal of

Language Studies, 14(3), 83-103.

Teddlie, C., & Tashakkori, A. (2009). Foundations of mixed methods research: Integrating quantitative and

qualitative approaches in the social and behavioral sciences. Thousand Oaks, CA: SAGE

Publications Inc.

Thanh Binh. (2015). 10 lý do bạn cần học tiếng Anh [10 reasons for your English learning]. Retrieved from

https://vnexpress.net/tin-tuc/giao-duc/hoc-tieng-anh/10-ly-do-ban-can-hoc-tieng-anh-3272046.html

Thanh Ha. (2008). Vì sao sinh viên ra trường không nói được tiếng Anh? [Why cannot graduates speak

English?]. Retrieved from https://tuoitre.vn/vi-sao-sinh-vien-ra-truong-khong-noi-duoc-tieng-anh-

291136.htm


Thanh Tam. (2016). Khung năng lực ngoại ngữ 6 bậc của Việt Nam thế nào [How is the 6-level foreign

language proficiency framework applied in Vietnam]. Retrieved from https://vnexpress.net/tin-tuc/

giao-duc/tu-van/khung-nang-luc-ngoai-ngu-6-bac-cua-viet-nam-the-nao-3472850.html

Thanh Tam, & Phuong Hoa. (2006). Thi trắc nghiệm 100% có thể là 'bước lùi' của môn tiếng Anh [100%

multiple-choice examination is possibly a ‘backward step’ of the English subject]. Retrieved from

https://vnexpress.net/tin-tuc/giao-duc/thi-trac-nghiem-100-co-the-la-buoc-lui-cua-mon-tieng-anh-

3465561.html

Thao Nguyen. (n.d.). Các loại chứng chỉ tiếng Anh hiện nay [English language certificates of today].

Retrieved from https://kenhtuyensinh.vn/cac-loai-chung-chi-tieng-anh-hien-nay

The Dan. (2017). Chứng chỉ TOEIC ngày càng cần thiết với sinh viên [The TOEIC Certificate is becoming

more and more essential for students]. Retrieved from https://vnexpress.net/tin-tuc/giao-duc/hoc-

tieng-anh/chung-chi-toeic-ngay-cang-can-thiet-voi-sinh-vien-3571965.html

Thompson, I. (1996). Assessing foreign language skills: Data from Russian. Modern Language Journal,

80(1), 47-65.

Ton, N. N. H., & Pham, H. H. (2010). Vietnamese teachers’ and students’ perceptions of global

English. Language Education in Asia, 1(1), 48-61.

Tran, T. P. H. (2009). Franco-Vietnamese schools and the transition from Confucian to a new kind of

intellectuals in the colonial context of Tonkin. Paper presented at the Harvard Graduate Students

Conference on East Asia, USA. http://www.harvard-yenching.org/sites/harvard-

yenching.org/files/TRAN%20Thi%20Phuong%20Hoa_Franco%20Vietnamese%20schools2.pdf

Truong, A. T., & Storch, N. (2007). Investigating group planning in preparation for oral presentations in an

EFL class in Vietnam. RELC Journal, 38(1), 104-124.

Tran, H. H. (2008). Bauxite mining threatens Central Highlands. Retrieved from

http://vietnamnews.vn/environment/ 181686/bauxite-mining-threatens-central-

highlands.html#UfWPM5LRJmJ0FWoX.97

Tran Quan. (2015). ‘Sài Gòn vẫn hát’: Những trăn trở rất đời với nhạc xưa [‘Sai Gon still sings’: Very earthy

thoughts about the old-time music]. Retrieved from http://news.zing.vn/sai-gon-van-hat-nhung-tran-

tro-rat-doi-voi-nhac-xua-post599724.html

Tran, T. T. T. (2010). Cách soạn giáo án theo 5 bước truyền thống [How to design a traditional five-step

lesson plan]. Retrieved from https://giaoan.violet.vn/present/show/entry_id/4158296

Tri Thuc Tre. (2018). Chuyện của những người trẻ “vào Sài Gòn” [Stories of young people “going into Sai

Gon”]. Retrieved from http://kenh14.vn/chuyen-cua-nhung-nguoi-tre-vao-sai-gon-

20171201002400189.chn

Tsushima, R. (2015). Methodological diversity in language assessment research: The role of mixed methods

in classroom-based language assessment studies. International Journal of Qualitative

Methods, 14(2), 104-121.

Tuoi Tre News. (2013). English teaching in Vietnam: Teacher ‘re-education’. Retrieved from

http://tuoitrenews.vn/education/8231/english-teaching-in-vietnam-teacher-reeducation

Tuong Han. (2017). Tiếng Anh ở đại học – 'nỗi ám ảnh' hay cơ hội? [English in tertiary education –

‘obsession’ or opportunity?]. Retrieved from https://tuoitre.vn/tieng-anh-o-dai-hoc-noi-am-anh-hay-

co-hoi-1306392.htm

Turner, C. E., & Upshur, J.A. (1996). Developing rating scales for the assessment of second language

performance. In G. Wigglesworth & C. Elder (Eds.), The language testing cycle: From inception to

washback (pp. 55-79). Melbourne, Australia: Applied Linguistics Association of Australia.

UCLES. (1997). Key English test: Instructions to oral examiners March 1997 – December 1997. Cambridge:

University of Cambridge Local Examinations Syndicate.

University of York. (n.d.). Appendix 2: Policy on the audio-recording of oral examinations for research

degrees. Retrieved from https://www.york.ac.uk/research/graduate-school/support/policies-

documents/recording-oral-examinations/

Upshur, J.A., &Turner, C.E. (1995). Constructing rating scales for second language tests. English Language

Teaching Journal, 49, 3-12.

Valette, R. M. (1992). A look at foreign language testing in the secondary schools: The state of the art in

1990. Georgetown University Round Table on Languages and Linguistics (GURT) 1991: Linguistics

and Language Pedagogy: The State of the Art, 432.

Van Moere, A. (2006). Validity evidence in a university group oral test. Language Testing, 23(4), 411-440.

Verma, G. K., & Mallick, K. (1999). Researching education: Perspectives and techniques. Great Britain:

Biddles Ltd.

Vidaković, I., & Robinson, M. (2016). A community-based participatory approach to test development: The

International Legal English Certificate. In A. J. Moeller, J. W. Creswell, & N. Saville (Eds.), Second

language assessment and mixed methods research (pp. 177-207). Cambridge: Cambridge University

Press.

Viet Vision Travel. (n.d.). From the first day of missionary work to 1884. Retrieved from

https://www.vietvisiontravel.com/post/from-the-first-day-of-missionary-work-to-1884/

Vietnamese Government. (2008). Quyết định số 1400/QĐ-TTG của Thủ tướng Chính phủ: Về việc phê duyệt

Đề án “Dạy và học ngoại ngữ trong hệ thống giáo dục quốc dân giai đoạn 2008 – 2020” [Decision

no. 1400/QĐ-TTg by the Prime Minister: Approval on the Project “Teaching and learning foreign

languages in the national education system for the period from 2008 to 2020”]. Retrieved from

http://www.chinhphu.vn/portal/page/portal/chinhphu/hethongvanban?class_id=1&_page=18&mode=

detail&document_id=78437

Vietnamese Government. (2017). Quyết định số 2080/QĐ-TTg của Thủ tướng Chính phủ: Phê duyệt điều

chỉnh, bổ sung Đề án dạy và học ngoại ngữ trong hệ thống giáo dục quốc dân giai đoạn 2017-2025

[Decision no. 2080/QĐ-TTg by the Prime Minister: Approval on adjustments, supplements to the

foreign language teaching and learning in the national education system for the period from 2017 to

2025]. Retrieved from

http://vanban.chinhphu.vn/portal/page/portal/chinhphu/hethongvanban?class_id=2&_page=1&mode

=detail&document_id=192343

VOV. (2015). Ho Chi Minh City - 40 years of development. Retrieved from http://vovworld.vn/en-US/40-

years-of-national-reunification/ho-chi-minh-city-40-years-of-development-331106.vov

VTC. (2015). Tại sao giáo viên nước ngoài đua nhau sang Việt Nam dạy tiếng Anh? [Why do many foreign

teachers go to Vietnam to teach English?]. Retrieved from https://baomoi.com/tai-sao-giao-vien-

nuoc-ngoai-dua-nhau-sang-viet-nam-day-tieng-anh/c/15774571.epi

VTC1. (2016). Tiếng Anh của người Việt: Nữ nói giỏi hơn nam! (Vietnamese people’s English: Females

speak better than males!). Retrieved from https://www.youtube.com/watch?v=QYhCrOnXgeI

Vu, P. A. (2018). Nation building and language in education policy. In J. Albright (Ed.), English Tertiary

Education in Vietnam (pp. 28-39). Australia: Routledge.

Vu, T. P. A. (2007). Học tiếng Anh 10 năm trong trường không sử dụng được: Kiểm tra đánh giá đang là

khâu yếu nhất [Learning English at school for ten years, students cannot use the language: Testing

and assessment is the weakest component]. Viet Bao. Retrieved from http://vietbao.vn/Giao-

duc/Hoc-tieng-Anh-10-nam-trong-truong-khong-su-dung-duoc-Kiem-tra-danh-gia-dang-la-khau-

yeu-nhat/40224569/202/

Vu, T. P. A., & Nguyen, B. H. (2004). Năng lực tiếng Anh của sinh viên các trường đại học trên địa bàn

TPHCM trước yêu cầu của nền kinh tế tri thức: thực trạng và giải pháp [English competence of

tertiary students in HCMC: Current situation and solutions]. Vietnam: HCMC University of Science,

Vietnam National University.

Vyas, M. A., & Patel, Y. L. (2009). Teaching English as a second language: A new pedagogy for a new

century. PHI Learning Pvt. Ltd.

Walia, D. N. (2012). Traditional teaching methods vs. CLT: A study. Frontiers of Language and Teaching,

3(1), 125-131.

Wall, D. (1996). Introducing new tests into traditional systems: Insights from general education and from

innovation theory. Language Testing, 13(3), 334-354.

Wall, D., & Alderson, J. C. (1996). Examining washback: The Sri Lankan impact study. In A. Cumming & R.

Berwick (Eds.), Validation in language testing (pp. 194-221). Clevedon, UK: Multilingual Matters.

Waltz, C., Strickland, O. L., & Lenz, E. (Eds.). (2010). Measurement in nursing and health research. New

York: Springer Publishing Company.

Wang, P. (2009). The inter-rater reliability in scoring composition. English Language Teaching, 2(3), 39.

Watanabe, Y. (1996). Does grammar translation come from the entrance examination? Preliminary findings

from classroom-based research. Language Testing, 13(3), 318-333.

Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263-287.

Weir, C. (1988). The specification, realisation and validation of an English language proficiency test. In A.

Hughes (Ed.), Testing English for university study, ELT Document 127. London: Modern English

Publications/The British Council.

Weir, C. J. (1993). Understanding and developing language tests. New York: Prentice Hall.

Weir, C. J. (2005). Language testing and validation: An evidence-based approach. New York: Palgrave

Macmillan.

Weir, C., & Taylor, L. (2011). Conclusions and recommendations. In L. Taylor (Ed.), Examining speaking:

Research and practice in assessing second language speaking (pp. 293-313). Cambridge: Cambridge

University Press.

Welch, A. R. (2010). Internationalisation of Vietnamese higher education: Retrospect and prospect. In G.

Harman, M. Hayden, & T. Nghi Pham (Eds.), Reforming higher education in Vietnam (pp. 197–213).

Dordrecht, Netherlands: Springer.

Wiersma, W. (2000). Research methods in education: An introduction. Boston, MA: Allyn & Bacon.

Wiersma, W., & Jurs, S. G. (2005). Research methods in education: An introduction (8th ed.). New York:

Pearson Education, Inc.

Wiggins, G. (1994). Toward more authentic assessment of language performances. In C. Hancock

(Ed.), Teaching, testing, and assessment: Making the connection (pp. 69-85). Lincolnwood, IL:

National Textbook Company.

Wigglesworth, G. (1997). An investigation of planning time and proficiency level on oral test discourse.

Language Testing, 14(1), 85-106.

Wigglesworth, G. (2008). Task and performance-based assessment. In E. Shohamy & N. H. Hornberger

(Eds.), Encyclopedia of Language and Education: Vol. 7. Language Testing and Assessment (2nd

ed., pp. 111-122). New York: Springer Science+Business Media LLC.

Williams, B., Onsman, A., & Brown, T. (2010). Exploratory factor analysis: A five-step guide for

novices. Australasian Journal of Paramedicine, 8(3), 1-13.

Williamson, C. (2017). Teachers’ role in school-based assessment as part of public examinations. US-China

Education Review, 7(6), 301-307.

Woodrow, L. (2006). Anxiety and speaking English as a second language. RELC Journal, 37(3), 308-328.

World Bank. (2017). Improving assessment of student achievements in Vietnam. Retrieved from

https://www.worldbank.org/en/news/feature/2017/09/28/improving-assessment-of-student-

achievements-in-vietnam

Wu, J. R., & Wu, R. Y. (2007). Using the CEFR in Taiwan: The perspective of a local examination board.

The Language Training and Testing Center Annual Report, 56, 1-20.

Wright, S. (2002). Language education and foreign relations in Vietnam. In J. W. Tollefson (Ed.), Language

policies in education: Critical issues (pp. 225-244). London: Lawrence Erlbaum.

Wynd, C. A., Schmidt, B., & Schaefer, M. A. (2003). Two quantitative approaches for estimating content

validity. Western Journal of Nursing Research, 25(5), 508-518.

X3English. (2017). Top 23 Câu lạc bộ tiếng Anh hoạt động tích cực nhất Hà Nội và TP Hồ Chí Minh. (Top

23 most active English-speaking clubs in Hanoi and HCMC). Retrieved from

https://x3english.com/cau-lac-bo-tieng-anh/

Xuan Phuong. (2016). Cần thúc đẩy tư duy phản biện của giới trẻ Việt [The Vietnamese youth’s critical

thinking needs fostering]. Retrieved October 3, 2017 from https://thanhnien.vn/gioi-tre/can-thuc-day-

tu-duy-phan-bien-cua-gioi-tre-viet-755297.html

Yoneoka, J. (2011). From CEFR to CAFR: Place for a Common Asian Framework of Reference for

Languages in the East Asian Business World? Asian Englishes, 14(2), 86-91.

Young, D. J. (1986). The relationship between anxiety and foreign language oral proficiency ratings. Foreign

Language Annals, 19(5), 439-445.

Young, R., & Milanovic, M. (1992). Discourse variation in oral proficiency interviews. Studies in Second

Language Acquisition, 14(4), 403-424.

Yu, G. K. H., & Tung, R. H. C. (2005). The washback effects of JCEEEs in the past fifty years.

In Proceedings of 22nd Conference on English Teaching and Learning (pp. 379-403), Norman

University, Taipei, Taiwan.

Zahedi, K., & Shamsaee, S. (2012). Viability of construct validity of the speaking modules of international

language examinations (IELTS vs. TOEFL iBT): Evidence from Iranian test-takers. Educational

Assessment, Evaluation & Accountability, 24(3), 263-277.

Zhao, Y., Lei, J., Li, G., He, M. F., Okano, K., Megahed, N., . . . Ramanathan, H. (Eds.). (2011). Handbook of

Asian education: A cultural perspective. London: Routledge.

Zhao, Z. (2013). Diagnosing the English-speaking ability of college students in China – Validation of the

Diagnostic College English Speaking Test. RELC Journal, 44(3), 341-359.

APPENDICES

Appendix A: ETHICS APPROVAL DOCUMENTS

A.1 HREC Approval (17/11/2015)

A.2a Information Statement for Head of EFL Faculty

FACULTY OF EDUCATION AND ARTS Professor Ronald Laura School of Education Faculty of Education and Arts The University of Newcastle University Drive, Callaghan, NSW 2308, Australia Phone: (+61) 2 4921 5942 Email: [email protected]

Information Statement for the research project:

Oral assessment of English as a foreign language (EFL) in tertiary education in Viet Nam:

An investigation into construct validity and reliability

(For Head of EFL Faculty)

Document version 2, dated 01/11/2015

Your university is invited to participate in the research project identified above which is being conducted by Nam Lam, a PhD candidate from the School of Education, Faculty of Education and Arts at the University of Newcastle. The research is part of Nam Lam’s studies at the University of Newcastle, supervised by Professor Ronald Laura from the School of Education, Faculty of Education and Arts, and co-supervised by A/Professor Neil Morpeth from Centre for English Language and Foundation Studies. Why is the research being done? The purpose of the research is to examine the oral assessment method(s) applied in EFL classrooms at universities with an attempt to find out the strengths and the shortcomings of these methods so that practical implications can be made to improve the quality of language testing in general and assessing EFL speaking skill in particular, which is currently still an under researched area in local tertiary education. Who can participate in the research? Selected second-year English major students in EFL speaking classes at the university can participate in this study. Students whose major is not English language, or TESOL, are not eligible to participate in this research project. Any teacher in charge of EFL speaking classes at the university who wishes to volunteer is eligible to participate in the study.

What choice do you have? Participation in this research is entirely a matter of your choice. Only those people who give their informed consent will be included in the project. Whether or not you decide to participate, your decision will not disadvantage you or your university. If you do decide to participate, you may withdraw from the project at any time without giving a reason and have the option of withdrawing any data which identifies you. What would you be asked to do? You as the responsible Head of the EFL Faculty will be requested to provide consent for your teachers and students to be approached, and for the selected teachers and students to participate in the study, and be aware of the study’s purposes, progress and results. Specifically, the researcher is seeking your consent to attend the last 15 minutes of the next meeting of your EFL teachers and to address EFL students in the last 10 minutes of their English-speaking lessons, to provide information on the research project, and hand out the research recruitment material. For the teachers of EFL speaking classes. They will be asked to:

1. Complete a questionnaire about their educational background, test administration practices, personal opinions and awareness about oral assessment of EFL;

Then two of these teachers will be selected at random, and they will be asked to:

2. Participate in an interview and have it recorded. The interview is about their

reflections of and attitudes towards testing spoken English administered in their

EFL classes;

3. Have their performances in the oral EFL test observed and audio-recorded; and

4. Supply their marking schemes and the results of the end-of-course speaking test.

For the students of EFL speaking classes of the two selected teachers, they will be asked to:

1. Complete a questionnaire about their educational background, test-taking

experience, their awareness of and personal opinions about oral assessment of

EFL;

2. Allow the researchers to observe and audio-record their oral test, and obtain their

marking sheet and test results; and

If they wish to volunteer for the next phase of the study, they may be randomly selected to:

3. Participate in a group interview with three other classmates and have it audio-

recorded. The interview is about their reflections of and attitudes towards testing

spoken English administered in their EFL class.

As an expert with responsibility for the EFL curriculum, you are invited to review the test contents in terms of their relevance to the course objectives (see the judgement form attached). The test content form and specific course objectives will be delivered to you after the test finishes.

All participation is voluntary, and the research will proceed only when the selected teachers and students agree to participate.

How much time will it take? The student questionnaire should take around 15 minutes to complete after the students finish their speaking test session, i.e. they will use their own time. The teacher questionnaire should take around 20 minutes to complete after the speaking test. The teachers might use their own time to do it at school or later at home. The observations of two tested classes will be taken over the full period of the testing session but will not disrupt normal class routines. Two interviews with teachers will take about 45 minutes each, and two interviews with groups of four students from each class might take about 60 minutes each. These interviews will occur outside normal class time at a time agreed with the participants. What are the risks and benefits of participating? We cannot promise you, your teachers or students any direct benefits from participating in this research. However, the results of this research project will contribute to building the knowledge and policy base to support change in Vietnamese universities including your university. Any risks for staff or students have been assessed as very minimal. How will your privacy be protected? Any information collected by the researchers which might identify you will be stored securely and only accessed by the researchers in de-identifiable form. All personal information about your university such as your students and teachers' names, their classes, and the university will be coded so it will be impossible to identify your university from the information your students and teachers provided. The information will only be accessed by the researcher unless you consent otherwise, except as required by law. Data will be stored in a locked cabinet at the researcher’s office in Vietnam while he is there. After he returns to Newcastle from his fieldwork in Vietnam, the data will be retained in a locked cabinet in the Chief Investigator’s Office at the University of Newcastle for at least five years after which they will be destroyed. How will the information collected be used? The data will be reported and presented in a thesis to be submitted for Nam Lam’s PhD degree. Parts of the findings may also be reported by the researchers at academic conferences and scholarly papers. Individual participants and your university will not be identified in any reports arising from the project. Your teachers and students will be able to review the recording to edit or erase their contribution. You will personally receive a summary of the findings if you would like to have it provided. What do you need to do to participate? Please read this Information Statement and be sure you understand its contents before you consent to participate. If there is anything you do not understand, or you have questions, feel free to contact the researcher. If you would like to have your teachers and students participate in this research, please complete the attached Consent Form and return it into the secure box labelled ‘Consent Forms for Nam Lam’s research’ and provided in the EFL Faculty office. Further information If you would like further information, please contact Nam Lam by email: [email protected] or mobile phone (+84) 901832***. Thank you for considering this invitation.

[Signature] Professor Ronald Laura Project Principal Supervisor [Signature] A/Professor Neil Morpeth Project Co-Supervisor [Signature] Nam Lam Research student Complaints about this research

This project has been approved by the University’s Human Research Ethics Committee, Approval No. H- 2015-0366 Should you have concerns about your rights as a participant in this research, or you have a complaint about the manner in which the research is conducted, it may be given to the researcher at 104 Nguyen Van Troi Street, Ho Chi Minh City, Vietnam. Office phone: 00-84-8-39970592, mobile phone: 00-84-901832*** Email: [email protected], or, if an independent person is preferred, to the Human Research Ethics Officer, Research Office, The Chancellery, The University of Newcastle, University Drive, Callaghan NSW 2308, Australia, telephone (02) 49216333, email [email protected].

A.2b Information Statement for EFL teachers

FACULTY OF EDUCATION AND ARTS Prof. Ronald Laura School of Education Faculty of Education and Arts The University of Newcastle University Drive, Callaghan, NSW 2308, Australia Phone: (+61) 2 4921 5942 Email: [email protected]

Information Statement for the research project:

Oral assessment of English as a foreign language (EFL) in tertiary education

in Vietnam: An investigation into construct validity and reliability

(For EFL teachers)

Document version 2, dated 01/11/2015

You are invited to participate in the research project identified above which is being conducted by Nam Lam, a PhD candidate from the School of Education, Faculty of Education and Arts at the University of Newcastle. The research is part of Nam Lam’s studies at the University of Newcastle, supervised by Professor Ronald Laura from the School of Education, Faculty of Education and Arts, and co-supervised by A/Professor Neil Morpeth from Centre for English Language and Foundation Studies. Why is the research being done? The purpose of the research is to examine the oral assessment method(s) applied in EFL classrooms at universities with an attempt to find out the strengths and the shortcomings of these methods so that practical implications can be made to improve the quality of language testing in general and assessing EFL speaking skill in particular, which is currently still an under researched area in local tertiary education. Who can participate in the research? Selected second-year English major students in EFL speaking classes of the university will be invited to participate in the study. Students whose major is not English language, or TESOL, are not eligible to participate in this research project. EFL teachers in charge of speaking classes of the university who wish to volunteer can also participate in the study. What choice do you have? Participation in this research is entirely your choice. Only those people who give their informed consent will be included in the project. Whether or not you decide to participate, your decision will not disadvantage you or your university. If you do decide to participate, you may withdraw from the project until such time as the completed questionnaire is returned without having to give a reason and have the option of withdrawing any data which identifies you.

What would you be asked to do? If you agree to participate, you will be asked to:

1. complete a questionnaire about your educational background, rating experience, your personal opinions and awareness about oral assessment of EFL;

If you wish to volunteer in the next phase of this study, you will be asked to: 2. have an EFL speaking test in which you are an examiner observed and audio-recorded. The researcher will sit silently at the corner of the test-room without any interruption or interference into your behavior, your communication with the examinees, or your rating and scoring during the test. 3. participate in a one-on-one interview and have it audio-recorded. The interview is about your reflections of and attitudes towards testing spoken English administered at your university. 4. supply your marking schemes and the test results at the end of the test. 5. If you are Head of the group of the speaking skill teachers, who is responsible for the EFL curriculum, you will also be invited to judge the test contents in terms of its relevance to the course objectives (see the judgement form attached). The completed form and specific course objectives will be delivered to you after the test finishes.

How much time will it take?

The questionnaire should take no more than 20 minutes to complete after the test. The one-on-one interview will be scheduled after the test results are announced and should take about 45 minutes.

What are the risks and benefits of participating? There is no promise of individual benefits from your participation in this research. However, the results of this research project will contribute to the improvement of EFL education in Vietnamese universities including your university in the future. The study will provide you, as a test-rater, with an opportunity to confidentially comment on the content and procedures used in oral EFL testing. There are almost no foreseeable risks associated with your participation in this research. Any risks for you and your students have been assessed as very minimal. How will your privacy be protected? Any information collected by the researchers which might identify you will be transcribed, coded, and stored securely by the researcher in de-identifiable form. All your personal information such as your name, your class, and your university will be coded so it will be impossible to identify you from the information you provide. The information will only be accessed by the researcher unless you consent otherwise, except as required by law. Data will be stored in a locked cabinet at the researcher’s office in Vietnam while he is there. After he returns to Newcastle from his fieldwork in Vietnam, the data will be retained in a locked cabinet in the Chief Investigator’s Office at the University of Newcastle for at least five years after which they will be destroyed. How will the information collected be used? The data will be reported and presented in a thesis to be submitted for Nam Lam’s PhD degree. The data will also be reported in scholarly publications and conference presentations. Individual participants will not be identified in any reports or publications arising from the project. You can request to review the audio-recording to edit or erase any part of your contribution. You will personally receive a summary of the findings if you request to have that provided.

What do you need to do to participate? Please read this Information Statement and be sure you understand its contents before you consent to participate. If there is anything you do not understand, or if you have any questions, please feel free to contact the researcher. If you agree to participate in this research, please complete the attached Consent Form and return it into the secure box labelled ‘Consent Forms for Nam Lam’s research’ and provided in the EFL Faculty office. If applicable, the researcher will then contact you to complete the questionnaire survey and arrange test room observation and an interview at a convenient time for you following the completion of the speaking test. Further information If you would like further information, please contact Nam Lam by email: [email protected] or mobile phone (+84) 901832***. Thank you for considering this invitation. [Signature] Professor Ronald Laura Project Principal Supervisor [Signature] A/Professor Neil Morpeth Project Co-Supervisor [Signature] Nam Lam Research student Complaints about this research

This project has been approved by the University’s Human Research Ethics Committee, Approval No. H-2015-0366. Should you have concerns about your rights as a participant in this research, or you have a complaint about the manner in which the research is conducted, it may be given to the researcher at 104 Nguyen Van Troi Street, Ho Chi Minh City, Vietnam. Office phone: 00-84-8-39970592, mobile phone: 00-84-901832*** Email: [email protected], or, if an independent person is preferred, to the Human Research Ethics Officer, Research Office, The Chancellery, The University of Newcastle, University Drive, Callaghan NSW 2308, Australia, telephone (02) 49216333, email [email protected].

A.2c Information Statement for EFL students (*)

FACULTY OF EDUCATION AND ARTS Professor Ronald Laura School of Education Faculty of Education and Arts The University of Newcastle University Drive, Callaghan, NSW 2308, Australia Phone: (+61) 2 4921 5942 Email: [email protected]

Information Statement for the research project:

Oral assessment of English as a foreign language (EFL) in tertiary education in Vietnam:

An investigation into construct validity and reliability

(For EFL students)

Document version 2, dated 01/11/2015

You are invited to participate in the research project identified above which is being conducted by Nam Lam, a PhD candidate from the School of Education, Faculty of Education and Arts at the University of Newcastle. The research is part of Nam Lam’s studies at the University of Newcastle, supervised by Professor Ronald Laura from the School of Education, Faculty of Education and Arts, and co-supervised by A/Professor Neil Morpeth from Centre for English Language and Foundation Studies. Why is the research being done? The purpose of the research is to examine the testing methods used for oral assessment of EFL at universities to find out about the strengths and the shortcomings of these methods to help improve the quality of language testing in general and EFL speaking skill in particular. Who can participate in the research? Selected second-year English major students in EFL speaking classes of the university will be invited to participate in the study. Students whose major is not English language, or TESOL, are not eligible to participate in this research project. EFL teachers in charge of speaking classes of the university who wish to volunteer can also participate in the study. What choice do you have? Participation in this research is entirely your choice. Only those people who give their informed consent will be included in the project. Whether or not you decide to participate, your decision will not disadvantage you, your studies or your university.

If you do decide to participate, you may withdraw from the project until such time as the completed questionnaire is returned without having to give a reason and have the option of withdrawing any data which identifies you. What would you be asked to do? If you agree to participate, you will be asked to:

1. complete a questionnaire about your educational background, test-taking experience, your awareness of and personal opinions about oral assessment of EFL; 2. have your speaking test performance observed and audio-recorded. 3. have the result of your English-speaking test provided to the researcher.

If you wish to volunteer in the next phase of the study, you will be asked to: 4. participate in a group interview with other EFL students and have it audio-recorded. The interview is about your reflections of and attitudes towards testing spoken English administered at your university. Participants are requested to maintain the confidentiality of the group discussion and not divulge any specific content to outside parties.

How much time will it take?

The questionnaire should take 15 minutes to complete following the test. The group interview with the researcher and three other students from your class will be scheduled right after the test results are announced and should take about 60 minutes. What are the risks and benefits of participating? There is no promise of individual benefits from your participation in this research. However, as an expression of thanks and recognition of voluntary contribution, participants will receive a small gift of their choice, which is either a red envelope of ‘lucky money’ or an Australian souvenir. More importantly, the results of this research project will contribute to the improvement of EFL education in Vietnamese universities, including your university in the future. The study will provide you, as a test-taker, with an opportunity to confidentially comment on the content and procedures used in oral EFL testing. There are almost no foreseeable risks associated with your participation in this research. How will your privacy be protected? Any information collected by the researchers which might identify you will be transcribed, coded, and stored securely by the researcher in de-identifiable form. All your personal information such as your name, your class, and your university will be coded so it will be impossible to identify you from the information you provide. The information will only be accessed by the researcher unless you consent otherwise, except as required by law. Data will be stored in a locked cabinet at the researcher’s office in Vietnam while he is there. After he returns to Newcastle from his fieldwork in Vietnam, the data will be retained in a locked cabinet in the Chief Investigator’s Office at the University of Newcastle for at least five years after which they will be destroyed. How will the information collected be used? The data will be reported and presented in a thesis to be submitted for Nam Lam’s PhD degree. The data will also be reported in scholarly publications and conference presentations. Individual participants will not be identified in any reports or publications arising from the project.

You can request to review the audio-recording to edit or erase any part of your contribution. You will personally receive a summary of the findings if you request to have that provided. What do you need to do to participate? Please read this Information Statement and be sure you understand its contents before you consent to participate. If there is anything you do not understand, or if you have any questions, please feel free to contact the researcher. If you agree to participate in this research, please complete the attached Consent Form and return it into the secure box labelled ‘Consent Forms for Nam Lam’s research’ and provided in the EFL Faculty office. If applicable, the researcher will then contact you to complete the questionnaire and arrange a convenient time for the interview following the completion of your speaking test. Further information If you would like further information, please contact Nam Lam by email: [email protected] or mobile phone (+84) 901832***. Thank you for considering this invitation. [Signature] Professor Ronald Laura Project Principal Supervisor [Signature] A/Professor Neil Morpeth Project Co-Supervisor [Signature] Nam Lam Research student Complaints about this research

This project has been approved by the University’s Human Research Ethics Committee, Approval No. H-2015-0366. Should you have concerns about your rights as a participant in this research, or you have a complaint about the manner in which the research is conducted, it may be given to the researcher at 104 Nguyen Van Troi Street, Ho Chi Minh City, Vietnam. Office phone: 00-84-8-39970592, mobile phone: 00-84-901832*** Email: [email protected], or, if an independent person is preferred, to the Human Research Ethics Officer, Research Office, The Chancellery, The University of Newcastle, University Drive, Callaghan NSW 2308, Australia, telephone (02) 49216333, email [email protected].

KHOA GIÁO DỤC VÀ NGHỆ THUẬT Giáo sư Ronald Laura Ngành Giáo Dục Khoa Giáo dục và Nghệ thuật Trường Đại học Newcastle University Drive, Callaghan, NSW 2308, Australia Điện thoại: (+61) 2 4921 5942 Email: [email protected] Bảng thông tin về đề tài:

Nghiên cứu hiệu quả và độ tin cậy của việc đánh giá kỹ năng nói tiếng Anh của sinh viên đại học ở Việt Nam

(Dành cho Sinh viên tiếng Anh) Giáo sư Ronald Laura Lâm Thành Nam Hướng dẫn đề tài Nghiên cứu sinh

Phiên bản 2, ngày 01/11/2015

Bạn được mời tham gia đề tài nghiên cứu trên được tiến hành bởi Lâm Thành Nam, nghiên cứu sinh của Ngành Giáo dục, Khoa Giáo dục và Nghệ thuật thuộc Trường Đại học Newcastle. Nghiên cứu này là một phần trong chương trình học của Lâm Thành Nam tại trường Đại học Newcastle, được hướng dẫn bởi GS. Ronald Laura, Ngành Giáo dục/ Khoa Giáo dục và Nghệ thuật, và đồng hướng dẫn bởi PGS. Neil Morpeth thuộc Trung tâm Giáo dục Cơ bản và Ngôn ngữ Anh. Vì sao nghiên cứu này được tiến hành? Mục đích của nghiên cứu này nhằm tìm hiểu các phương pháp đánh giá kỹ năng nói tiếng Anh ở các lớp Đại học nhằm tìm ra ưu khuyết điểm của mỗi phương pháp giúp nâng cao chất lượng việc kiểm tra ngôn ngữ nói chung và đánh giá kỹ năng nói tiếng Anh nói riêng. Những ai có thể tham gia vào nghiên cứu này? Những sinh viên đại học năm thứ hai thuộc chuyên ngành tiếng Anh ở các lớp được chọn sẽ được mời tham gia cung cấp số liệu cho nghiên cứu này. Những sinh viên không thuộc chuyên ngành tiếng Anh, hoặc Phương pháp Giảng dạy tiếng Anh sẽ không tham gia vào nghiên cứu này. Các giảng viên đại học đã và đang giảng dạy kỹ năng nói tiếng Anh mong muốn và tự nguyện hợp tác đều có thể tham gia vào dự án nghiên cứu này. Bạn có những lựa chọn nào? Việc tham gia vào nghiên cứu này hoàn toàn là lựa chọn của bạn. Chỉ những ai đồng ý ký tên vào bảng thoả thuận mới chính thức tham gia vào dự án này. Cho dù bạn quyết định tham gia hay không thì quyết định của bạn cũng sẽ không gây bất lợi nào cho bản thân bạn, việc học của bạn, hay trường mà bạn đang theo học.

Một khi bạn quyết định tham gia, bạn vẫn có thể rút tên khỏi đề án trước khi nộp lại bản câu hỏi mà không cần phải thông báo lý do. Bạn cũng có quyền rút lại những thông tin cá nhân đã cung cấp. Bạn sẽ được yêu cầu làm gì? Nếu bạn đồng ý tham gia, bạn sẽ được yêu cầu: 1. trả lời một bảng câu hỏi khảo sát về quá trình học tập của bạn, kinh nghiệm thi cử, những hiểu biết và ý kiến của riêng cá nhân bạn về hình thức kiểm tra nói tiếng Anh 2. cho phép phần thi nói của bạn được quan sát và ghi âm 3. kết quả bài thi nói của bạn được cung cấp cho nghiên cứu sinh Nếu bạn có nhã ý muốn tình nguyện tham gia ở giai đoạn tiếp theo của dự án này, bạn sẽ được yêu cầu: 4. tham gia vào một buổi phỏng vấn nhóm có thu âm cùng một số sinh viên khác trong lớp. Nội dung của buổi phỏng vấn sẽ hỏi về những đánh giá cũng như thái độ của bạn đối với việc kiểm tra kỹ năng nói tiếng Anh được thực hiện trong trường của bạn. Những người tham gia cuộc phỏng vấn nhóm được yêu cầu giữ bí mật nội dung thảo luận và không tiết lộ bất cứ chi tiết nào ra bên ngoài. Bạn sẽ mất bao nhiêu thời gian? Bạn sẽ mất khoảng 15 phút để hoàn thành bảng câu hỏi khảo sát. Buổi phỏng vấn nhóm cùng 3 sinh viên khác trong lớp sẽ được tổ chức sau khi kết quả thi được công bố và diễn ra trong khoảng 60 phút. Thời gian và địa điểm do nhóm thỏa thuận. Những rủi ro và lợi ích nào khi bạn tham gia? Không thể hứa hẹn bất kỳ lợi ích cá nhân nào cho bạn khi tham gia vào dự án nghiên cứu. Tuy nhiên, một chút quà nhỏ sẽ gửi đến bạn như lời cảm ơn và ghi nhận những đóng góp của bạn. Quan trọng hơn, đây sẽ là một cơ hội tốt để bạn đóng góp ý kiến của mình với tư cách là một thí sinh về cách thức và nội dung kiểm tra nói tiếng Anh, nhằm nâng cao tính khách quan trong việc đánh giá giáo dục góp phần cải thiện chất lượng dạy và học ngoại ngữ ở Việt Nam, bao gồm cả trường mà bạn đang theo học. Hầu như không có bất cứ rủi ro nào khi tham gia vào nghiên cứu này. Quyền riêng tư của bạn sẽ được bảo mật như thế nào? Tất cả những thông tin bạn cung cấp sẽ được chuyển thành văn bản, mã hóa và bảo quản cẩn thận. Các thông tin cá nhân như tên riêng, lớp học và trường của bạn cũng sẽ được mã hoá, vì thế bạn sẽ không thể bị nhận diện qua những thông tin bạn cung cấp. Chỉ có nghiên cứu sinh mới được truy nhập vào thông tin này trừ khi bạn có thoả thuận khác, hoặc do yêu cầu của pháp luật. Dữ liệu sẽ được giữ trong tủ có khoá tại văn phòng của nghiên cứu sinh ở Việt Nam trong thời gian nguời này ở Việt Nam. Sau khi nghiên cứu sinh trở về Newcastle, dữ liệu sẽ được lưu giữ trong tủ có khoá trong văn phòng của Giáo sư hướng dẫn đề tài tại Trường đại học Newcastle trong khoảng thời gian ít nhất là 5 năm, sau đó chúng sẽ được tiêu hủy. Thông tin thu thập sẽ được sử dụng như thế nào? Dữ liệu sẽ được trình bày trong luận văn tiến sĩ của Lâm Thành Nam, trong các báo cáo chuyên đề và hội thảo khoa học. Những cá nhân tham gia sẽ không bị nhận diện trong bất cứ báo cáo nào liên quan đến nghiên cứu này.

Bảng thu âm cũng sẽ được gởi cho bạn nếu bạn cần thay đổi hoặc hủy bỏ các thông tin đã cung cấp. Bảng tóm tắt kết quả nghiên cứu sẽ được gửi đến cá nhân bạn nếu bạn có yêu cầu. Bạn cần làm gì để tham gia? Xin vui lòng đọc kỹ bảng thông tin này và chắc chắn rằng bạn hiểu rõ nội dung trước khi thoả thuận tham gia. Nếu có bất cứ điều gì bạn không hiểu, hoặc có thắc mắc xin vui lòng liên hệ với nghiên cứu sinh. Nếu bạn đồng ý tham gia, xin vui lòng ký tên vào bảng thoả thuận đính kèm và gửi vào hộp an toàn có ghi nhãn ‘Consent Forms for Nam Lam’s research’ và được đặt ở văn phòng Khoa Ngoại Ngữ. Nghiên cứu sinh sẽ mời bạn trả lời bảng câu hỏi khảo sát sau khi bạn hoàn tất buổi thi và sẽ liên hệ với bạn để sắp xếp thời gian thuận tiên cho buổi phỏng vấn. Thông tin khác Nếu bạn cần thêm thông tin gì khác, vui lòng liên hệ Lâm Thành Nam qua email: [email protected] hoặc số điện thoại di động (+84) 901832***. Cảm ơn bạn đã quan tâm đến lời mời này. [ký tên] Giáo sư Ronald Laura Hướng dẫn chính [ký tên] PGS. Neil Morpeth Đồng hướng dẫn đề tài [ký tên] Lâm Thành Nam Nghiên cứu sinh Ý kiến phản hồi về nghiên cứu này Đề án này đã được Hội đồng nghiên cứu khoa học của trường thông qua, với mã số hồ sơ: H-2015-0366. Nếu bạn quan tâm đến quyền lợi của bạn khi tham gia vào nghiên cứu này, hoặc khiếu nại về cách thức tiến hành nghiên cứu, bạn có thể liên hệ với nghiên cứu sinh tại địa chỉ 104 đường Nguyễn Văn Trỗi, TP. Hồ Chí Minh, Việt Nam. Điện thoại văn phòng 00-84-8-39970592, di động: 00-84-901832***, email [email protected], hoặc nếu cần một cá nhân độc lập, bạn có thể liên hệ Phòng Nghiên cứu Khoa học, Trường Đại học Newcastle, University Drive, Callaghan NSW 2308, Australia, điện thoại 00-61-2- 49216333, email: [email protected].

A.3a Consent form for Head of EFL Faculty

FACULTY OF EDUCATION AND ARTS Professor Ronald Laura School of Education Faculty of Education and Arts The University of Newcastle University Drive, Callaghan, NSW 2308, Australia Phone: (+61) 2 4921 5942 Email: [email protected]

CONSENT FORM FOR HEAD OF EFL FACULTY

Research Project:

Oral assessment of English as a foreign language (EFL) in tertiary education in Vietnam:

An investigation into construct validity and reliability

Professor Ronald Laura Nam Lam

Project Supervisor Research Student

Document version 2, dated 01/11/2015

I have the authority to consent to this research taking place on behalf of my University.

I agree for the researcher to attend the EFL teachers’ monthly meeting and address EFL students at

the start of their class, to discuss the research and disseminate the recruitment materials.

I agree for the researcher to observe and audio-record EFL Teaching within two test room classes.

I give permission for EFL teachers within my faculty to release information to the researcher concerning

their attitudes and EFL speaking assessment practices.

I agree to the researcher being supplied with the course syllabuses, EFL teacher’s marking schemes

and the results of the end-of-course speaking test.

I agree to complete the forms of the test content judgement in terms of its relevance to the course

objectives (document attached).

I understand that the project will be conducted as described in the Information Statement, a copy of

which I have retained.

I understand that all personal information will remain confidential to the researchers.

I understand that the University will not be identifiable in this research.

I understand I can withdraw consent for my university to participate in the study at any time and do

not have to give any reason for withdrawing.

I have had the opportunity to have questions answered to my satisfaction.

I would like to receive a summary of the findings. ☐ Yes ☐ No

Print name: ...................................................................................................................................

Name of University: ......................................................................................................................

Contact details: Phone: ................................... Email: ......................................................

Signature: .......................................................................... Date: ...... / ...... / 2015

A.3b Consent Form for EFL teachers

FACULTY OF EDUCATION AND ARTS Professor Ronald Laura School of Education Faculty of Education and Arts The University of Newcastle University Drive, Callaghan, NSW 2308, Australia Phone: (+61) 2 4921 5942 Email: [email protected]

CONSENT FORM FOR EFL TEACHERS

Research Project:

Oral assessment of English as a foreign language (EFL) in tertiary education

in Vietnam: An investigation into construct validity and reliability

Professor Ronald Laura Nam Lam

Project Supervisor Research Student Document Version 2, dated 01/11/2015

I agree to participate in the above research project and give my consent freely.

I understand that the project will be conducted as described in the Information Statement, a copy of

which I have retained.

I understand I can withdraw from the project until such time as the completed questionnaire is

returned and do not have to give any reason for withdrawing.

I consent to (please tick):

☐ complete the questionnaire ☐ Yes ☐ No

☐ have my EFL speaking test session observed and audio-recorded ☐ Yes ☐ No

☐ participate in an interview and have it audio-recorded ☐ Yes ☐ No

☐ supply my marking schemes and test results ☐ Yes ☐ No

I understand that my personal information will remain confidential to the researchers.

I will have the chance to review the recording of the interview.

I have had the opportunity to have questions answered to my satisfaction.

I would like to receive a summary of the findings. ☐ Yes ☐ No

Print name: ...............................................................................................................................

Signature: ..................................................................................... Date: ...... / ...... / 2016

Please provide contact details below for the interview arrangement if you would like

to participate.

Phone number: ........................................... Email address: ...................................................

A.3c Consent Form for EFL students (*)

FACULTY OF EDUCATION AND ARTS Professor Ronald Laura School of Education Faculty of Education and Arts The University of Newcastle University Drive, Callaghan, NSW 2308, Australia Phone: (+61) 2 4921 5942 Email: [email protected]

CONSENT FORM FOR EFL STUDENTS

Research Project:

Oral assessment of English as a foreign language (EFL) in tertiary education in Vietnam:

An investigation into construct validity and reliability

Professor Ronald Laura Nam Lam Project Supervisor Research Student

Document Version 2, dated 01/11/2015

I agree to participate in the above research project and give my consent freely.

I understand that the project will be conducted as described in the Information Statement, a copy of

which I have retained.

I understand I can withdraw from the project until such time as the completed questionnaire is

returned and do not have to give any reason for withdrawing.

I consent to (please tick):

☐ have my speaking test performance observed and audio-recorded. ☐ Yes ☐ No

☐ complete the questionnaire. ☐ Yes ☐ No

☐ have the result of my oral test provided to the researcher. ☐ Yes ☐ No

☐ participate in an interview and have it audio-recorded. ☐ Yes ☐ No

I agree to maintain the confidentiality of the group discussion and not divulge any specific content

to outside parties.

I understand that my personal information will remain confidential to the researcher.

I understand that I can withdraw my consent immediately before the test if I change my mind.

I have had the opportunity to have questions answered to my satisfaction.

I would like to receive a summary of the findings. ☐ Yes ☐ No

Print name: ...............................................................................................................................

Signature: ..................................................................................... Date: ...... / ...... / 2015

Please provide contact details below for the interview arrangement if you would like

to participate.

Phone number: ............................................. Email address: ...................................................

KHOA GIÁO DỤC VÀ NGHỆ THUẬT

Giáo sư Ronald Laura Chuyên ngành Giáo Dục Khoa Giáo dục và Nghệ thuật Trường Đại học Newcastle University Drive, Callaghan, NSW 2308, Australia Phone: (+61) 2 4921 5942 Email: [email protected]

BẢNG THỎA THUẬN DÀNH CHO SINH VIÊN

Tên đề tài:

Nghiên cứu hiệu quả và độ tin cậy của việc đánh giá kỹ năng nói tiếng Anh

của sinh viên đại học ở Việt Nam

Giáo sư Ronald Laura Lâm Thành Nam

Hướng dẫn chính Nghiên cứu sinh

Phiên bản 2, ngày 01/11/2015

Tôi đồng ý tham gia vào đề tài nghiên cứu trên và thoả thuận độc lập.

Tôi biết rằng đề tài trên sẽ được tiến hành như được miêu tả trong bảng thông tin về đề tài mà tôi đã

được nhận một bản sao.

Tôi biết rằng tôi có thể rút khỏi đề tài này trước khi gửi lại bảng câu hỏi khảo sát và không phải nêu

bất cứ lý do gì cho việc này.

Tôi đồng ý (xin vui lòng đánh dấu vào ô lựa chọn):

☐ phần thi nói của tôi được quan sát và ghi âm ☐ Có ☐ Không

☐ hoàn tất một bảng câu hỏi khảo sát ☐ Có ☐ Không

☐ kết quả bài thi nói của tôi được cung cấp cho nghiên cứu sinh ☐ Có ☐ Không

☐ tham gia phỏng vấn trong nhóm có ghi âm ☐ Có ☐ Không

Tôi đồng ý giữ bí mật nội dung cuộc phỏng vấn của nhóm và không tiết lộ bất cứ chi tiết nào cho người

ngoài nhóm.

Tôi biết rằng thông tin cá nhân của tôi sẽ được các nghiên cứu viên giữ bí mật.

Tôi hiểu rằng tôi có quyền rút lại quyết định tham gia của mình ngay trước buổi thi nếu tôi đổi ý.

Tôi đã có cơ hội được nhận câu trả lời thỏa đáng cho các câu hỏi của tôi.

Tôi muốn nhận một bản tóm tắt kết quả nghiên cứu. ☐ Có ☐ Không

Họ và tên: ..............................................................................................................................

Ký tên: ..................................................................................... Ngày: ...... / ...... / 2015

Xin vui lòng cung cấp thông tin liên lạc để sắp xếp buổi phỏng vấn nếu bạn có nhã ý muốn tham gia.

Điện thoại: ................................................. Địa chỉ e-mail: ..................................................

A.4 Notification of HREC Expedited Approval of a protocol variation (22/02/2017)

A.5a Information Statement for EFL teacher raters

FACULTY OF EDUCATION AND ARTS Professor Jim Albright School of Education Faculty of Education and Arts The University of Newcastle University Drive, Callaghan, NSW 2308, Australia Phone: (+61) 1 4921 5901 Email: [email protected]

Information Statement for the research project:

Oral assessment of English as a foreign language (EFL) in tertiary education

in Vietnam: An investigation into construct validity and reliability

(For EFL teacher raters) Document version 1, dated 30/10/2016

You are invited to participate in the research project identified above which is being conducted by Nam Lam, a PhD candidate from the School of Education, Faculty of Education and Arts at the University of Newcastle. The research is part of Nam Lam’s studies at the University of Newcastle, supervised by Professor Jim Albright, and co-supervised by Professor John Fischetti and Professor Greg Preston from the School of Education, Faculty of Education and Arts. Why is the research being done? The purpose of the research is to examine the oral assessment method(s) applied in EFL classrooms at universities with an attempt to find out the strengths and the shortcomings of these methods so that practical implications can be made to improve the quality of language testing in general and assessing EFL speaking skill in particular, which is currently still an under researched area in local tertiary education. Who can participate in the research? Selected second-year English major students in EFL speaking classes of the university will be invited to participate in the study. Students whose major is not English language, or TESOL, are not eligible to participate in this research project. EFL teachers in charge of speaking classes of the university who wish to volunteer can also participate in the study. What choice do you have? Participation in this research is entirely your choice. Only those people who give their informed consent will be included in the project. Whether or not you decide to participate, your decision will not disadvantage you or your university. If you do decide to participate, you may withdraw from the project until such time as your completed scoring sheet is returned to the researcher without having to give a reason and have the option of withdrawing any data which identifies you.

What would you be asked to do? If you agree to participate, you will be asked to:

1. independently mark the candidates’ speaking samples that were audio-recorded in Stage 1 by completing a scoring sheet as you did in Stage 1. The candidates belonged to test rooms other than the test room in which you were a rater.

If you wish to participate further, you will be asked to: 2. independently mark the speaking samples of the candidates who were in your test room; and 3. participate in a group discussion with other raters about your experience in re-rating the candidates’ speaking performance from audio-recordings. The discussion will be audio-recorded.

How much time will it take? The marking should take about 60 to 90 minutes. The group discussion will be scheduled based on the participants’ agreement and should take about 45 minutes. What are the risks and benefits of participating? The reimbursement for the time you spent marking the speaking samples will be AUD$30, and for a group discussion it will be a souvenir gift as a best wish for the New Year. The study will provide you, as a test-rater, with an opportunity to share professional ideas with colleagues about oral assessment. Furthermore, the results of this research project will contribute to the improvement of EFL education in Vietnamese universities including your university in the future. There are almost no foreseeable risks associated with your participation in this research. Any risks for you and your students have been assessed as very minimal. How will your privacy be protected? Any information collected by the researchers which might identify you will be transcribed, coded, and stored securely by the researcher in de-identifiable form. All your personal information such as your name, your class, and your university will be coded so it will be impossible to identify you from the information you provide. The information will only be accessed by the researcher unless you consent otherwise, except as required by law. Data will be stored in a locked cabinet at the researcher’s office in Vietnam while Nam is there. After he returns to Newcastle from his fieldwork in Vietnam, the data will be retained in a locked cabinet in the Chief Investigator’s Office at the University of Newcastle for at least five years after which they will be destroyed. How will the information collected be used? The data will be reported and presented in a thesis to be submitted for Nam Lam’s PhD degree. The data will also be reported in scholarly publications and conference presentations. Individual participants will not be identified in any reports or publications arising from the project. You will personally receive a summary of the findings if you request to have that provided. What do you need to do to participate? Please read this Information Statement and be sure you understand its contents before you consent to participate. If there is anything you do not understand, or if you have any questions, please feel free to contact the researcher. If you agree to participate in this research, please complete the attached Consent Form and return it into the secure box labelled ‘Consent Forms for Nam Lam’s research’ and provided in the EFL Faculty office. If applicable, the researcher will then contact to hand you the speaking files stored as a USB or CD and a scoring sheet. The group discussion will be arranged at a convenient time for you and the other members.

Further information If you would like further information, please contact Nam Lam by email: [email protected] or mobile phone (+84) 918 678 ***. Thank you for considering this invitation. [Signature] Professor Jim Albright Project Principal Supervisor [Signature] Professor John Fischetti Project Co-Supervisor [Signature] Professor Greg Preston Project Co-Supervisor [Signature] Nam Lam Research student Complaints about this research

This project has been approved by the University’s Human Research Ethics Committee, Approval No. H-2015-0366. Should you have concerns about your rights as a participant in this research, or you have a complaint about the manner in which the research is conducted, it may be given to the researcher at 104 Nguyen Van Troi Street, Ho Chi Minh City, Vietnam. Office phone: 00-84-8-39970***, mobile phone: 00-84-918678*** Email: [email protected], or, if an independent person is preferred, to the Human Research Ethics Officer, Research Office, The Chancellery, The University of Newcastle, University Drive, Callaghan NSW 2308, Australia, telephone (02) 49216333, email [email protected].

413

A.5b Information Statement for EFL experts FACULTY OF EDUCATION AND ARTS Professor Jim Albright School of Education Faculty of Education and Arts The University of Newcastle University Drive, Callaghan, NSW 2308, Australia Phone: (+61) 1 4921 5901 Email: [email protected]

Information Statement for the research project:

Oral assessment of English as a foreign language (EFL) in tertiary education

in Vietnam: An investigation into construct validity and reliability

(For EFL experts)

Document version 1, dated 30/10/2016

You are invited to participate in the research project identified above, which is being conducted by Nam Lam, a PhD candidate from the School of Education, Faculty of Education and Arts at the University of Newcastle. The research is part of Nam Lam's studies at the University of Newcastle, supervised by Professor Jim Albright and co-supervised by Professor John Fischetti and Professor Greg Preston from the School of Education, Faculty of Education and Arts.

Why is the research being done?

The purpose of the research is to examine the oral assessment method(s) applied in EFL classrooms at universities in an attempt to identify the strengths and shortcomings of these methods, so that practical implications can be drawn to improve the quality of language testing in general, and the assessment of EFL speaking skills in particular, which is currently still an under-researched area in local tertiary education.

Who can participate in the research?

Selected second-year English major students in EFL speaking classes of the university will be invited to participate in the study. Students whose major is not English language, or TESOL, are not eligible to participate in this research project. EFL teachers in charge of speaking classes of the university who wish to volunteer can also participate in the study.

What choice do you have?

Participation in this research is entirely your choice. Only those people who give their informed consent will be included in the project. Whether or not you decide to participate, your decision will not disadvantage you or your university. If you do decide to participate, you may withdraw from the project at any time until your completed scoring sheet is returned to the researcher, without having to give a reason, and you have the option of withdrawing any data which identifies you.

414

What would you be asked to do? If you agree to participate, you will be asked to:

1. make judgements on the relevance of three sets of speaking test items based on the course books and the course outlines provided.

If you wish to participate further, you will be asked to: 2. participate in a group discussion with other experts about your judgements on the speaking test contents. The discussion will be audio-recorded.

How much time will it take?

The completion of test content judgements should take about 60 to 90 minutes. The group discussion will be scheduled at a time agreed by the participants and should take about 45 minutes.

What are the risks and benefits of participating?

The reimbursement for the time you spend completing the forms provided will be AUD$50, and for the group discussion you will receive a souvenir gift as a best wish for the New Year. The study will provide you, as an expert, with an opportunity to share professional ideas with colleagues about oral assessment. Furthermore, the results of this research project will contribute to the improvement of EFL education in Vietnamese universities, including your university, in the future.

There are almost no foreseeable risks associated with your participation in this research. Any risks for you and your students have been assessed as very minimal.

How will your privacy be protected?

Any information collected by the researchers which might identify you will be transcribed, coded, and stored securely by the researcher in de-identified form. All your personal information, such as your name, your class, and your university, will be coded so that it will be impossible to identify you from the information you provide. The information will only be accessed by the researcher unless you consent otherwise, except as required by law.

Data will be stored in a locked cabinet at the researcher's office in Vietnam while Nam is there. After he returns to Newcastle from his fieldwork in Vietnam, the data will be retained in a locked cabinet in the Chief Investigator's office at the University of Newcastle for at least five years, after which they will be destroyed.

How will the information collected be used?

The data will be reported and presented in a thesis to be submitted for Nam Lam's PhD degree. The data will also be reported in scholarly publications and conference presentations. Individual participants will not be identified in any reports or publications arising from the project. You will personally receive a summary of the findings if you request to have it provided.

What do you need to do to participate?

Please read this Information Statement and be sure you understand its contents before you consent to participate. If there is anything you do not understand, or if you have any questions, please feel free to contact the researcher.

If you agree to participate in this research, please complete the attached Consent Form and return it to the secure box labelled 'Consent Forms for Nam Lam's research' provided in the EFL Faculty office. If applicable, the researcher will then contact you to hand you the course books, the course outlines and a judgement sheet. The group discussion will be arranged at a convenient time for you and the other members.

415

Further information

If you would like further information, please contact Nam Lam by email: [email protected] or mobile phone (+84) 918 678 ***.

Thank you for considering this invitation.

[Signature] Professor Jim Albright, Project Principal Supervisor
[Signature] Professor John Fischetti, Project Co-Supervisor
[Signature] Professor Greg Preston, Project Co-Supervisor
[Signature] Nam Lam, Research student

Complaints about this research

This project has been approved by the University's Human Research Ethics Committee, Approval No. H-2015-0366. Should you have concerns about your rights as a participant in this research, or a complaint about the manner in which the research is conducted, it may be given to the researcher at 104 Nguyen Van Troi Street, Ho Chi Minh City, Vietnam (office phone: 00-84-8-39970592, mobile phone: 00-84-918678***, email: [email protected]) or, if an independent person is preferred, to the Human Research Ethics Officer, Research Office, The Chancellery, The University of Newcastle, University Drive, Callaghan NSW 2308, Australia, telephone (02) 4921 6333, email [email protected].

416

A.6a Consent form for EFL teacher raters

FACULTY OF EDUCATION AND ARTS Professor Jim Albright School of Education Faculty of Education and Arts The University of Newcastle University Drive, Callaghan, NSW 2308, Australia Phone: (+61) 1 4921 5901 Email: [email protected]

CONSENT FORM FOR EFL TEACHER RATERS

Research Project:

Oral assessment of English as a foreign language (EFL) in tertiary education

in Vietnam: An investigation into construct validity and reliability

Professor Jim Albright Nam Lam

Project Principal Supervisor Research Student

Document version 1, dated 15/10/2016

I agree to participate in the above research project and give my consent freely.

I understand that the project will be conducted as described in the Information Statement, a copy of which I have retained.

I understand that I can withdraw from the project until such time as the completed marking scheme and the questionnaire are returned, and that I do not have to give any reason for withdrawing.

I consent to (please tick):

• completing the scoring sheet for the speaking test samples of candidates other than those in my test room   ¨ Yes ¨ No

• completing the scoring sheet for the speaking test samples of candidates previously in my test room   ¨ Yes ¨ No

• attending a group discussion regarding my marking and issues associated with the speaking test   ¨ Yes ¨ No

I understand that my personal information will remain confidential to the researchers.

I have had the opportunity to have questions answered to my satisfaction.

I would like to receive a summary of the findings ¨ Yes ¨ No

417

Print name: ...........................................................................................................................

Signature: ................................................................................. Date: ...... / ...... / 2017

Please provide your contact details below for the interview arrangement if you would like to participate.

Phone number: ............................................. Email address: ...............................................

418

A.6b Consent form for EFL experts

FACULTY OF EDUCATION AND ARTS Professor Jim Albright School of Education Faculty of Education and Arts The University of Newcastle University Drive, Callaghan, NSW 2308, Australia Phone: (+61) 1 4921 5901 Email: [email protected]

Information Statement for the research project:

Oral assessment of English as a foreign language (EFL) in tertiary education

in Vietnam: An investigation into construct validity and reliability

(For EFL experts)

Document version 1, dated 30/10/2016

You are invited to participate in the research project identified above which is being conducted by Nam Lam, a PhD candidate from the School of Education, Faculty of Education and Arts at the University of Newcastle.

The research is part of Nam Lam’s studies at the University of Newcastle, supervised by Professor Jim Albright, and co-supervised by Professor John Fischetti and Professor Greg Preston from the School of Education, Faculty of Education and Arts.

Why is the research being done?

The purpose of the research is to examine the oral assessment method(s) applied in EFL classrooms at universities in an attempt to identify the strengths and shortcomings of these methods, so that practical implications can be drawn to improve the quality of language testing in general, and the assessment of EFL speaking skills in particular, which is currently still an under-researched area in local tertiary education.

Who can participate in the research?

Selected second-year English major students in EFL speaking classes of the university will be invited to participate in the study. Students whose major is not English language, or TESOL, are not eligible to participate in this research project.

EFL teachers in charge of speaking classes of the university who wish to volunteer can also participate in the study.

What choice do you have?

Participation in this research is entirely your choice. Only those people who give their informed consent will be included in the project. Whether or not you decide to participate, your decision will not disadvantage you or your university.

If you do decide to participate, you may withdraw from the project at any time until your completed scoring sheet is returned to the researcher, without having to give a reason, and you have the option of withdrawing any data which identifies you.

419

What would you be asked to do?

If you agree to participate, you will be asked to:

1. make judgements on the relevance of three sets of speaking test items based on the course books and the course outlines provided.

If you wish to participate further, you will be asked to:

2. participate in a group discussion with other experts about your judgements on the speaking test contents. The discussion will be audio-recorded.

How much time will it take?

The completion of test content judgements should take about 60 to 90 minutes. The group discussion will be scheduled based on the participants’ agreement and should take about 45 minutes.

What are the risks and benefits of participating?

The reimbursement for the time you spent completing the forms provided will be AUD$50, and for a group discussion it will be a souvenir gift as a best wish for the New Year. The study will provide you, as an expert, with an opportunity to share professional ideas with colleagues about oral assessment. Furthermore, the results of this research project will contribute to the improvement of EFL education in Vietnamese universities including your university in the future.

There are almost no foreseeable risks associated with your participation in this research. Any risks for you and your students have been assessed as very minimal.

How will your privacy be protected?

Any information collected by the researchers which might identify you will be transcribed, coded, and stored securely by the researcher in de-identified form. All your personal information, such as your name, your class, and your university, will be coded so that it will be impossible to identify you from the information you provide. The information will only be accessed by the researcher unless you consent otherwise, except as required by law.

Data will be stored in a locked cabinet at the researcher’s office in Vietnam while Nam is there. After he returns to Newcastle from his fieldwork in Vietnam, the data will be retained in a locked cabinet in the Chief Investigator’s Office at the University of Newcastle for at least five years after which they will be destroyed.

How will the information collected be used?

The data will be reported and presented in a thesis to be submitted for Nam Lam’s PhD degree. The data will also be reported in scholarly publications and conference presentations. Individual participants will not be identified in any reports or publications arising from the project.

You will personally receive a summary of the findings if you request to have that provided.

What do you need to do to participate?

Please read this Information Statement and be sure you understand its contents before you consent to participate. If there is anything you do not understand, or if you have any questions, please feel free to contact the researcher.

If you agree to participate in this research, please complete the attached Consent Form and return it to the secure box labelled ‘Consent Forms for Nam Lam’s research’ provided in the EFL Faculty office. If applicable, the researcher will then contact you to hand you the course books, the course outlines and a judgement sheet. The group discussion will be arranged at a convenient time for you and the other members.

Further information

If you would like further information, please contact Nam Lam by email: [email protected] or mobile phone (+84) 918 678 ***.

Thank you for considering this invitation.

420

[Signature]

Professor Jim Albright Project Principal Supervisor

[Signature]

Professor John Fischetti Project Co-Supervisor

[Signature]

Professor Greg Preston Project Co-Supervisor

[Signature]

Nam Lam Research student

Complaints about this research

This project has been approved by the University’s Human Research Ethics Committee, Approval No. H-2015-0366.

Should you have concerns about your rights as a participant in this research, or a complaint about the manner in which the research is conducted, it may be given to the researcher at 104 Nguyen Van Troi Street, Ho Chi Minh City, Vietnam.

Office phone: 00-84-8-39970592, mobile phone: 00-84-918678***

Email: [email protected]

or, if an independent person is preferred, to the Human Research Ethics Officer, Research Office, The Chancellery, The University of Newcastle, University Drive, Callaghan NSW 2308, Australia, telephone (02) 49216333, email [email protected].

421

A.7 Verification of translated documents

28 August 2015

Human Research Ethics Committee The Chancellery The University of Newcastle University Drive, Callaghan NSW 2308, Australia Telephone +61 2 4921 6333 Email [email protected]

To Whom It May Concern:

Investigator’s name: Nam Lam

Title of Research Project:

Oral assessment of English as a foreign language (EFL) in tertiary education in Vietnam:

An investigation into construct validity and reliability

Language of original documents: English

Language to be translated into: Vietnamese

Documents to be verified for faithful translation:

1. Consent form for students

2. Information statement for students

3. Interview questions for teachers

4. Interview questions for students

5. Questionnaire for students

I am Thi Loan Lam, MA in Education and Training for Development. I am a lecturer at Khanh Hoa University (Nha Trang City, Vietnam) and currently a PhD student (Education) at the University of Newcastle, Australia.

I, the undersigned, verify that all translated materials related to the above named study reflect the intent and spirit of the English texts.

Please do not hesitate to contact me should you require any further information.

My contact details:

Mobile: +61 416 860 ***, Email: [email protected].

Yours sincerely,

[Signature]

Thi Loan Lam

422

Appendix B: DATA COLLECTION INSTRUMENTS

B.1 Test room observation protocol Document version 2, dated 01/11/2015

Examiner 1: (name) …………… U2-R1    Examiner 2: U2-R2
Examinees’ attendance: 43/48    Date: 22/01/2016    Room: 10x
Time: From 7.30 AM to 11.30 AM    University: B

1. What test format is used?
¨ Personal oral interview
þ Paired speaking test. How are the candidates paired? Candidates whose names were next to each other in the candidate list of each test room were usually paired with each other at random. Pairing was flexible.
¨ Group discussion. How are examinees grouped? ....................................................................................................
¨ Other .....................................................................................

2. Is an interlocutor included?
¨ No.
þ Yes. Is s/he known or unknown to the examinees? Maybe. Two examiners were present in the test room at a time. One took the role of an interlocutor, who delivered test tasks to candidates and made an overall judgement using a holistic rating scale. The other was an assessor, who made detailed judgements using an analytic scale. Neither of the examiners was the candidates’ teacher in charge of their Listening/Speaking class. Acquaintanceship between the testers and the testees was quite low.

3. Is each speaking performance timed equally?
þ Yes. How long for each individual/pair/group? 6-7 minutes
¨ No.

4. What test task(s) is/are used?
Task 1: Question-and-answer (2-3 minutes). The interlocutor asked each candidate 1-2 questions from the testing materials and the candidates gave immediate responses.
Task 2: Discussion (3-4 minutes). The interlocutor showed a mind map to each pair of candidates. They had 30 seconds to prepare for the discussion question and talked for about 3 minutes.

5. What interactional patterns occur during the test? ¨ Assessor – candidate þ Interlocutor – candidate þ Candidate – candidate ¨ Assessor – interlocutor

6. Are specific questions or topics given to examinees for preparation in advance (at home) prior to the test?
þ No. Candidates did not know the specific content of the questions until they entered the test room. Task 1: spontaneous responses, no time for preparation; Task 2: about 30 seconds allowed for each pair to prepare for a discussion.
¨ Yes. How long are they allowed to prepare?

7. Are assessment criteria explicitly explained to examinees? þ No.

¨ Yes. When? ........................................................................................................ How? .........................................................................................................

423

8. Does the examiner use a rating scale? ¨ No. þ Yes. 9. Is there a break (interval) during the test? þ No.

¨ Yes. From .................. to ................... (For how long? .................. minutes.) 10. Do any external factors affect the test operation? þ No.

¨ Yes. What factors? ¨ Noise ¨ Temperature ¨ Seating arrangement ¨ Technical error ¨ Candidates’ health problems

¨ Others .................................. 11. Test venue þ The test room is large enough to accommodate 4 people each turn, including 1 assessor, 1 interlocutor, and 2 candidates þ The test room is situated away from noise and disturbance þ The room is equipped with a suitably sized table and sufficient number of chairs ¨ Many tests are held simultaneously in a shared room.

Candidates bring mobile phones into the test room. þ Candidates taking their speaking test are separated from those who are waiting for their turns. þ The waiting area contains enough chairs for candidates to sit in reasonable comfort while they are waiting to take their test.

12. How does the speaking test take place? (Chronological order goes from top to bottom)

Examinees’ activities | Examiner’s activities

All candidates wait outside until an usher calls them to take turns entering the test room in pairs. | Each test room has two examiners. The usher decides the pairing: two candidates whose names are next to each other in the list are usually paired together.

Candidates go into the test room, greet the examiners, take seats, tell the interlocutor their names and sign the student list. | One examiner (the interlocutor) invites the candidates to sit down, asks their names, and hands them the student list to sign to confirm attendance.

Candidates give spontaneous responses right after each question they hear. | The interlocutor starts the speaking performance test with Task 1 by asking each individual candidate 2-3 questions from the test book.

Each pair has 30 seconds to prepare for their discussion, and 3 minutes to discuss. | In Task 2, the interlocutor shows candidates a mind map with prompts related to a topic and a discussion question.

Candidates say thank you and goodbye to the examiners and walk out of the test room. | A speaking test session for each pair of candidates finishes when the interlocutor tells them the time is over and says thank you and goodbye.

424

13. Test room plan

[Diagram: seating plan of the test room, showing the main door, teacher desk, seats for the two candidates, the interlocutor, the assessor, and the observer.]

14. Examinees’ attendance

No. | Name (*) | Gender (M/F) | Individual/Pair/Group | Time (from... to...) | (A) (B) (C) (D) (E) (F) (G) | Note
1. | An-Bich | F-F | p | 8.05-8.12 | x x x | Topic 2
2. | Huy-Vi | M-F | p | 8.14-8.22 | x x x x x | Topic 3
3. | Lan-Cuc | F-F | p | 8.23-8.31 | x x x x | Topic 7
4. | … | | | | |
5. | | | | | |
6. | | | | | |
7. | | | | | |

(*) Names are aliases for this example.

Check list:
(A) The candidate is given a “fresh start”.
(B) The interlocutor is friendly towards the candidate.
(C) The candidate has difficulty understanding the test task.
(D) The candidate is encouraged to make a second attempt to express what they want to say.
(E) The candidate shows good behaviour.
(F) The candidate is given a relaxed and comfortable feeling after the test.
(G) The examiner is seen to make notes on the candidate’s performance during their speaking turn.

15. Other notes ……………………………..…………………………………………………………………… ……………………………..…………………………………………………………………… ……………………………..…………………………………………………………………… ……………………………..……………………………………………………………………


425

B.2a Questionnaire for EFL teachers

Research project:

Oral assessment of English as a foreign language (EFL) in tertiary education in Vietnam: An investigation into construct validity and reliability

QUESTIONNAIRE FOR EFL TEACHERS (TEST RATERS)

Document version 2, dated 01/11/2015 Completing this questionnaire should require no more than 20 minutes. Please give your responses as accurately as you can. Personal information The University where you are teaching: ……………………………………………………….. Gender: ¨ Male ¨ Female Age group: ¨ Less than 25 ¨ From 25 to 30 ¨ From 31 to 40 ¨ Above 40 First language: ¨ Vietnamese ¨ English ¨ Other …………………… Highest professional qualification:

¨ BA ¨ MA ¨ PhD ¨ Other …………… Questions 1-7: Tick (ü) the answer you choose. TEACHING AND TESTING EXPERIENCE 1. How long have you been teaching English?

¨ This is my first year. ¨ More than 1 year but less than 3 years. ¨ 3 years but less than 5 years. ¨ 5 years or more.

2. How many times have you acted as a speaking test rater, including this time? ¨ Once. ¨ Two or three times. ¨ Four or five times. ¨ More than five times.

3. How much training in oral assessment have you received? ¨ None. ¨ 1 month or less. ¨ More than 1 month but less than 6 months. ¨ 6 months or more.

4. Do you find the assessment of EFL speaking difficult? ¨ Not difficult at all. ¨ A little difficult. ¨ Fairly difficult. ¨ Very difficult.

5. How satisfied are you with your rating of the students’ English-speaking skills? ¨ I am not satisfied.

¨ I am not very satisfied. ¨ I am fairly satisfied. ¨ I am completely satisfied. 6. How often do you use spoken English in your Speaking class(es)?

¨ Seldom (less than 50%). ¨ About half the time (approximately 50%). ¨ Often (over 50% but less than 80%). ¨ Very frequently (80% or over but not 100%). ¨ Always (100%).

426

7. How are the English-speaking lessons arranged in your school? ¨ Separated as listening, reading or writing, etc. ¨ Integrated skills ¨ According to your own arrangements

Questions 8-49: For each item, choose only one number that best describes your opinion about the last end-of-course test of spoken English you experienced as a test rater.

1 = Strongly disagree; 2 = Disagree; 3 = Neither agree nor disagree; 4 = Agree; 5 = Strongly agree

TEST PREPARATION AND ADMINISTRATION

8. Students were informed of the assessment criteria before the test. 1 2 3 4 5
9. Each student’s speaking was timed equally. 1 2 3 4 5
10. The atmosphere of the test room was stressful. 1 2 3 4 5
11. The atmosphere of the test room was formal. 1 2 3 4 5
12. There was sufficient time for students to prepare for this test. 1 2 3 4 5
13. Students need to learn the language material and skills outlined in the course objectives to get good results for this test. 1 2 3 4 5
14. I believe that computer-assisted speaking tests are more accurate and reliable than human raters. 1 2 3 4 5
15. Ensuring consistency in administering the test is very important to me. 1 2 3 4 5
16. I give feedback on students’ speaking performances after the test. 1 2 3 4 5

TEST TASK

17. The information required to complete the speaking task was within the course syllabus. 1 2 3 4 5
18. Students were already familiar with this kind of test task. 1 2 3 4 5
19. The test task was authentic. 1 2 3 4 5
20. The test task was highly structured. 1 2 3 4 5
21. The speaking task gave me an adequate opportunity to evaluate students’ English-speaking ability. 1 2 3 4 5
22. Students had opportunities to use the language they had learnt in this course to perform the test task. 1 2 3 4 5
23. The test task was designed on the basis of the course objectives. 1 2 3 4 5
24. The test task engaged students by using spoken English interactively. 1 2 3 4 5
25. Students needed to attend class regularly to be able to do the task well. 1 2 3 4 5
26. There was sufficient time available to complete the task. 1 2 3 4 5
27. There was sufficient planning time available for each task. 1 2 3 4 5

ASSESSMENT CRITERIA & RATING SCALE

28. I used a rating scale with clearly-defined descriptors when rating the students’ performances. 1 2 3 4 5
29. The criteria for rating were clear to me as a rater. 1 2 3 4 5
30. The components in the rating scale were sufficient to make fair judgements of candidates’ performances. 1 2 3 4 5
31. The assessment criteria of the rating scale corresponded well with the course objectives. 1 2 3 4 5
32. Students were rated by the quality rather than the quantity of their speaking. 1 2 3 4 5
33. The current rating scale was useful to me when assessing students’ speaking abilities. 1 2 3 4 5
34. I used the same rating scale to score candidates’ speaking performances. 1 2 3 4 5

427

RATING PROCESS

35. I was able to work without any disturbance or distraction during the rating process. 1 2 3 4 5
36. Ensuring consistency in scoring students’ performance is very important to me. 1 2 3 4 5
37. Candidates’ non-verbal behaviours affected my judgement of their speaking. 1 2 3 4 5
38. I focused on the number of errors candidates made per stretch of their speech. 1 2 3 4 5
39. I found it difficult to decide the score awarded to those who were close to the Pass/Fail boundary. 1 2 3 4 5
40. I tended to do the rating more consistently at the beginning than towards the end of the test. 1 2 3 4 5

TEST IMPACT ON TEACHING & LEARNING

41. The test helped me identify the strengths and weaknesses of students’ English-speaking skills. 1 2 3 4 5
42. The test helped me identify what test-taking skills my students need for class activities. 1 2 3 4 5
43. I exploited the content of the current course book to teach students speaking skills. 1 2 3 4 5
44. I regularly got students involved in speaking activities in class. 1 2 3 4 5
45. I used the course objectives as the criteria to assess students’ speaking. 1 2 3 4 5
46. I used tasks similar to those of the test to help students practise speaking English in class. 1 2 3 4 5
47. Before the test, I gave students full details of all aspects of the test tasks, e.g. goals, contents, format, assessment criteria, etc. 1 2 3 4 5
48. Before the test, I spent time in class discussing various topics with students so they would be familiar with the test structures, vocabulary, and format. 1 2 3 4 5
49. Before the test, students were informed of all the questions they would be asked in the test. 1 2 3 4 5

OTHER QUESTIONS Questions 50-60: Tick (ü) the answer(s) you choose and/or give your own opinions. 50. What aspects of assessing EFL speaking do you find difficult? (More than one answer is possible.)

¨ Students’ current English level ¨ Class size ¨ Noisy testing environment ¨ Inadequate time – having to make too many judgements in a limited time frame ¨ Lack of training ¨ Inadequate aids and facilities for testing ¨ Other: …………………………………………………………

51. What do you think the purpose of this test is? (More than one answer is possible.) – The test was designed to

¨ measure students’ spoken language proficiency ¨ measure the degree to which students have successfully learnt the language material and skills covered in the course ¨ identify students’ levels of English to put them into the right English class ¨ identify students who have aptitudes for English learning ¨ decide whether they should be promoted to the next level of study ¨ decide whether they are qualified to study abroad ¨ determine the students’ strengths and weaknesses in their learning process ¨ learn whether the current curriculum fits with the students’ English levels

428

¨ compare the university students’ English levels with those at other universities ¨ Other: …………………………………………………………

52. What aspects in the students’ English-speaking ability did you pay attention to when rating their speaking performances? (More than one answer is possible.)

¨ Pronunciation ¨ Intonation ¨ Grammar accuracy ¨ Vocabulary ¨ Fluency ¨ Appropriate body language ¨ Coherence ¨ Interactional competence ¨ Appearance and dressing ¨ Content ¨ Other: …………………………….……………………………..

53. What major changes are you likely to make in your teaching to help your students improve their performance in their speaking tests?

¨ Putting more emphasis on pronunciation ¨ Putting more stress on role play and group discussion ¨ Putting less emphasis on pronunciation ¨ Employing language tasks that are more authentic and closer to real life ¨ Adopting new teaching methods ¨ Encouraging more student participation in class ¨ Organizing mock tests in preparation for actual ones ¨ Other: …………………………………………………………

54. What types of activities do you think should be involved in this course to prepare students for the end-of-course test?

¨ Language games ¨ Task-oriented activities ¨ Exposure to various media ¨ Authentic teaching and learning materials ¨ Role play and group discussion ¨ Student interactions with the teacher ¨ English-speaking practice outside the language classroom ¨ Other: …………………………………………………………

55. What are the factors that most influence your teaching of this course? ¨ Professional training ¨ Teaching experience and belief ¨ Coursebook ¨ Teaching syllabus ¨ Past experience as a language learner ¨ Learners’ expectations ¨ Peers’ expectations ¨ Social expectations ¨ End-of-course exam ¨ International EFL exams ¨ Other: …………………………………………………………

56. Do you tend to give bonus points to a candidate in the test if he/she … ? ¨ has a clear voice

¨ is punctual ¨ has good behaviour ¨ has a good appearance ¨ is smartly-dressed ¨ has good listening comprehension ¨ Other: …………………………………………………………

57. How often was your rating of students’ speaking affected by their performances in class? ¨ Never ¨ Rarely ¨ Sometimes ¨ Usually ¨ Very often ¨ Always

429

Please explain why (not): …………………………………………………………….…………………………………… …………………………………………………………….…………………………………… …………………………………………………………….……………………………………

58. Did you experience any unexpected occurrence that might have affected your rating during the speaking test?

¨ No ¨ Yes. What were they? ¨ Noise ¨ Temperature ¨ Technical error ¨ Health problems

¨ Other: ………………… 59. What do you think students should do to perform better in this test?

…………………………………………………………….…………………………………… …………………………………………………………….…………………………………… …………………………………………………………….…………………………………… …………………………………………………………….…………………………………… …………………………………………………………….……………………………………

60. What changes would you recommend being made about this test so that it will become more effective?

…………………………………………………………….…………………………………… …………………………………………………………….…………………………………… …………………………………………………………….…………………………………… …………………………………………………………….…………………………………… …………………………………………………………….……………………………………

Thank you for your cooperation! __________________________________________________ Acknowledgements This questionnaire is adapted from the questionnaires introduced in: • Cheng, L. (2005). Changing language teaching through language testing: A washback study,

Studies in Language Testing 21, UK: Cambridge University Press. • Fulcher, G. (2003). Testing second language speaking, Harlow: Longman/Pearson Education Ltd. • Taylor, L., Milanovic, M., & Weir, C. J. (Eds.) (2011). Examining Speaking: Research and

practice in assessing second language speaking, Studies in Language Testing 30, UK: Cambridge University Press.

• Weir, C. J. (2005). Language testing and validation: An evidence-based approach, Basingstoke: Palgrave Macmillan.

430

B.2b Questionnaire for EFL students (*)

Research project:

Oral assessment of English as a foreign language (EFL) in tertiary education in Vietnam: An investigation into construct validity and reliability

QUESTIONNAIRE FOR EFL STUDENTS (TEST TAKERS)

Document version 2, dated 01/11/2015

Completing this questionnaire should require no more than 15 minutes. Please give your responses as accurately as you can. Personal information The University where you are studying: …..……………………………………………………….. Gender: ¨ Male ¨ Female First language: ¨ Vietnamese ¨ Other …………………… Age group: ¨ Less than 20 ¨ From 20 to less than 23

¨ From 23 to less than 25 ¨ Above 25 Questions 1-7: Tick (ü) the answer you choose. LEARNING AND TEST-TAKING EXPERIENCE 1. How long have you been learning English?

¨ One year or less ¨ Over 1 but less than 4 years ¨ 4 but less than 7 years ¨ 7 years or more

2. How many times have you taken an English-speaking test? ¨ Once ¨ Twice ¨ Three times ¨ Four times or more

3. How would you rate your proficiency in spoken English? Generally: ¨ Very poor ¨ Poor ¨ Average ¨ Fairly good ¨ Good ¨ Very good Accuracy: ¨ Very poor ¨ Poor ¨ Average ¨ Fairly good ¨ Good ¨ Very good Fluency: ¨ Very poor ¨ Poor ¨ Average ¨ Fairly good ¨ Good ¨ Very good 4. Were you suffering any short-term ailments during the test? (e.g. unpredictable toothache, earache, sore throat, cold or flu…) ¨ Yes ¨ No 5. Do you have any long-term illness or disability that adversely affected your speaking? (e.g. stammer, lisp, deformity of the mouth/throat, problem with hearing/seeing…) ¨ Yes ¨ No 6. How regularly did you attend the speaking lessons?

¨ Less than 10% ¨ 50% to less than 70% ¨ 10% to less than 30% ¨ 70% to less than 90% ¨ 30% to less than 50% ¨ 90% or more

7. How would you rate your preference for these oral test formats? (Number from 1 to 3: 1 = Like least; 2 = Like; 3 = Like most)

¨ Personal oral interview ¨ Paired speaking test ¨ Group oral assessment ¨ Other test format: …………………………………………

431

Questions 8-39: For each item, choose only one number that best describes your opinion about the last end-of-course test of spoken English you experienced as a test taker.

1 = Strongly disagree; 2 = Disagree; 3 = Neither agree nor disagree; 4 = Agree; 5 = Strongly agree

TEST PREPARATION AND ADMINISTRATION

8. I had enough time to prepare for this test. 1 2 3 4 5
9. The test was well administered. 1 2 3 4 5
10. The atmosphere of the test room was stressful. 1 2 3 4 5
11. I was clearly informed of the assessment criteria to prepare for the test. 1 2 3 4 5
12. I need the teacher’s feedback on my speaking performance in the test, so I can improve my speaking skills. 1 2 3 4 5
13. I believe that computer-assisted speaking tests are more accurate and reliable than those by human raters. 1 2 3 4 5
14. I achieved the best speaking performance that I am capable of. 1 2 3 4 5

TEST TASK

15. The speaking task evoked my interest. 1 2 3 4 5
16. The test task was too difficult for me. 1 2 3 4 5
17. I understood what I was supposed to do in the speaking task. 1 2 3 4 5
18. I believe that the speaking task provided me with an adequate opportunity to demonstrate my ability to speak English. 1 2 3 4 5
19. I had enough time to demonstrate my speaking ability as the test task required. 1 2 3 4 5
20. The test task was related to what I had learnt in class. 1 2 3 4 5
21. I needed to attend class regularly to be able to do the task well. 1 2 3 4 5
22. The test task was practical. 1 2 3 4 5
23. The test task was boring. 1 2 3 4 5
24. I had opportunities to use the language I had learnt in this course to perform the test task. 1 2 3 4 5

RATING AND SCORING

25. The rater used the course objectives as the criteria to assess my speaking. 1 2 3 4 5
26. The rater was consistent in his/her manner of asking questions. 1 2 3 4 5
27. The rater was fair in scoring. 1 2 3 4 5
28. The judgement should be done independently by two raters. 1 2 3 4 5
29. The rating scale made it difficult for candidates to get high scores. 1 2 3 4 5
30. My teacher examiner was generous in his/her scoring. 1 2 3 4 5
31. I would like my classmate(s) to take part in assessing my speaking performance. 1 2 3 4 5
32. The speaking session should be audio-recorded in case reconsideration is needed later. 1 2 3 4 5
33. My speaking performance in the test should be the sole determiner of my score, not including my speaking performance in class. 1 2 3 4 5

432

TEST IMPACT ON LEARNING

34. The test helped me identify what I need to improve about my spoken English. 1 2 3 4 5
35. The test helped me build more effective learning strategies to improve my English-speaking skills. 1 2 3 4 5
36. I have learnt useful test-taking skills from this test. 1 2 3 4 5
37. The test has made me more confident in speaking English. 1 2 3 4 5
38. I revised the lessons I had learnt from the English-speaking course to prepare for the test. 1 2 3 4 5
39. This test is one of the most important motivations for me to try to improve my English-speaking skills. 1 2 3 4 5

OTHER QUESTIONS Questions 40-46: Tick (ü) the answer(s) you choose and/or provide your own opinion. 40. Did you feel nervous during the test?

¨ No, not at all. ¨ Yes, I was nervous. ¨ Yes, I was very nervous.

If yes, what made you nervous? ………………………………………………………………. …………………………………………………………….…………………………………… …………………………………………………………….…………………………………… 41. You would have done better in the test if you had …

¨ been rated by another rater ¨ done another kind of test task ¨ taken the test at another time ¨ taken the test at another place ¨ been assessed individually ¨ been assessed in pairs (with another partner) ¨ been assessed in groups (with other classmates) ¨ Other: …………………………………………

42. Did you experience any unexpected occurrence that you think might have affected your performance during the speaking test? ¨ No ¨ Yes If yes, what do you think it was? ¨ Noise ¨ Temperature ¨ Seating arrangement ¨ Technical error ¨ Other: ………………………………………… 43. In what aspects do you think you will be affected by the test score?

¨ Self-image ¨ Motivation to learn English ¨ Teacher-student relationship ¨ Anxiety and emotional tension ¨ Future job opportunities ¨ Overall academic result ¨ Reprimand from family ¨ Other: …………………………………………..

44. Which of the following activities had you done before the test that you found useful in your speaking performance?

¨ Oral presentation ¨ Paired speaking practice ¨ Role play and group discussion ¨ Joining an English-speaking club ¨ Mock test taking ¨ Other: …………………………………………..

433

45. What do you think the examiner should do (could have done) to help you perform better in this test?

…………………………………………………………….…………………………………………… …………………………………………………………….…………………………………………… …………………………………………………………….…………………………………………… …………………………………………………………….…………………………………………… …………………………………………………………….……………………………………………

46. What changes to the oral assessment of EFL would you recommend so as to make it more effective? …………………………………………………………….…………………………………………… …………………………………………………………….…………………………………………… …………………………………………………………….…………………………………………… …………………………………………………………….…………………………………………… …………………………………………………………….……………………………………………

Thank you for your cooperation! _____________________________________ Acknowledgements This questionnaire is adapted from the questionnaires introduced in: • Cheng, L. (2005). Changing language teaching through language testing: A washback study,

Studies in Language Testing 21, UK: Cambridge University Press. • Fulcher, G. (2003). Testing second language speaking, Harlow: Longman/Pearson Education Ltd. • Taylor, L., Milanovic, M., & Weir, C. J. (Eds.) (2011). Examining Speaking: Research and

practice in assessing second language speaking, Studies in Language Testing 30, UK: Cambridge University Press.

• Weir, C. J. (2005). Language testing and validation: An evidence-based approach, Basingstoke: Palgrave Macmillan.

434

Vietnamese translation Tên đề tài: Nghiên cứu hiệu quả và độ tin cậy của việc đánh giá kỹ năng nói tiếng Anh của sinh viên đại học ở Việt Nam

BẢNG CÂU HỎI DÀNH CHO SINH VIÊN (THÍ SINH) Phiên bản 2, ngày 01/11/2015

Hoàn thành bảng câu hỏi này mất khoảng 15 phút. Xin bạn vui lòng trả lời các câu hỏi càng chính xác như bạn nghĩ càng tốt. Thông tin cá nhân Trường Đại học: …………………………………………………………………………………… Giới tính: ¨ Nam ¨ Nữ Tiếng mẹ đẻ: ¨ Tiếng Việt ¨ Ngôn ngữ khác………………………… Độ tuổi: ¨ Nhỏ hơn 20 tuổi ¨ Từ 20 đến dưới 23 tuổi

¨ Từ 23 đến dưới 25 tuổi ¨ Từ 25 tuổi trở lên Câu hỏi 1-7: Đánh dấu (ü) ở câu trả lời bạn chọn. KINH NGHIỆM HỌC TẬP VÀ THI CỬ 1. Bạn học tiếng Anh được bao lâu?

¨ Chưa quá một năm ¨ Hơn 1 năm nhưng ít hơn 4 năm ¨ Từ 4 nhưng ít hơn 7 năm ¨ Từ 7 năm trở lên

2. Bạn tham gia kỳ thi nói tiếng Anh được mấy lần rồi? ¨ Một lần ¨ Hai lần ¨ Ba lần ¨ Từ bốn lần trở lên

3. Bạn tự đánh giá khả năng nói tiếng Anh của mình ở mức nào? Nhìn chung: ¨ Rất tệ ¨ Tệ ¨ Trung bình ¨ Khá ¨ Tốt ¨ Rất tốt Độ chính xác: ¨ Rất tệ ¨ Tệ ¨ Trung bình ¨ Khá. ¨ Tốt ¨ Rất tốt Độ lưu loát: ¨ Rất tệ ¨ Tệ ¨ Trung bình ¨ Khá ¨ Tốt ¨ Rất tốt 4. Trong lúc kiểm tra, bạn có gặp những vấn đề gì về sức khỏe nhất thời không? (chẳng hạn như đau răng, đau tai, đau họng, cảm lạnh, cảm cúm, v.v.)

¨ Có ¨ Không 5. Bạn có những chứng bệnh thường xuyên hay khuyết tật nào gây ảnh hưởng bất lợi cho phần thi nói của bạn không? (chẳng hạn như tật nói lắp/cà lăm, nói ngọng, khuyết tật ở miệng/họng, khả năng nghe/nhìn hạn chế, v.v.)

¨ Có ¨ Không 6. Bạn ước lượng việc thường xuyên tham gia lớp học kỹ năng nói của mình là bao nhiêu?

¨ Ít hơn 10% ¨ Từ 50% đến dưới 70% ¨ Từ 10% đến dưới 30% ¨ Từ 70% đến dưới 90% ¨ Từ 30% đến dưới 50% ¨ Từ 90% trở lên

7. Vui lòng cho biết mức độ bạn thích những hình thức kiểm tra nói sau đây: (Đánh số từ 1 đến 3: 1= Ít thích nhất, 2= Thích, 3= Thích nhất)

¨ Phỏng vấn cá nhân ¨ Kiểm tra nói theo cặp ¨ Kiểm tra nói theo nhóm ¨ Khác: …………………………….

435

Câu hỏi 8-39: Chỉ chọn một con số miêu tả đúng nhất ý kiến của bạn cho mỗi câu sau đây.

1 = Rất phản đối; 2 = Phản đối; 3 = Không phản đối cũng không đồng ý; 4 = Đồng ý; 5 = Rất đồng ý

CÁCH TỔ CHỨC THI

8 Tôi có đủ thời gian để chuẩn bị cho buổi thi này. 1 2 3 4 5 9 Buổi thi được tổ chức chu đáo. 1 2 3 4 510 Bầu không khí tại phòng thi thật căng thẳng. 1 2 3 4 5 11 Tôi được thông báo rõ ràng về các tiêu chí đánh giá để chuẩn bị cho buổi thi. 1 2 3 4 5 12 Tôi cần những nhận xét góp ý của giám khảo về phần thi của tôi, để tôi có thể

cải thiện kỹ năng nói của mình. 1 2 3 4 5

13 Tôi cho rằng kiểm tra nói trên máy tính sẽ cho kết quả chính xác hơn giám khảo chấm.

1 2 3 4 5

14 Tôi đã thể hiện khả năng nói của mình ở mức tốt nhất. 1 2 3 4 5 YÊU CẦU CỦA PHẦN THI NÓI

15 Yêu cầu của phần thi nói thú vị. 1 2 3 4 5 16 Yêu cầu của phần thi nói quá khó đối với tôi. 1 2 3 4 517 Tôi hiểu rõ yêu cầu đặt ra trong phần thi nói. 1 2 3 4 518 Tôi cho rằng yêu cầu của phần thi nói đã cho tôi đầy đủ cơ hội để thể hiện khả

năng nói tiếng Anh của mình. 1 2 3 4 5

19 Tôi có đủ thời gian để thể hiện khả năng nói tiếng Anh của mình theo như yêu cầu đặt ra.

1 2 3 4 5

20 Yêu cầu của phần thi nói liên quan đến những gì tôi đã học trong lớp. 1 2 3 4 521 Tôi cần tham gia lớp học đều đặn để có thể áp dụng vào thực hiện tốt yêu cầu

của phần thi nói. 1 2 3 4 5

22 Yêu cầu của phần thi nói mang tính thực tế. 1 2 3 4 523 Yêu cầu của phần thi nói thật tẻ nhạt. 1 2 3 4 524 Tôi có cơ hội sử dụng ngôn ngữ đã học trong lớp để thực hiện những yêu cầu

đặt ra trong phần thi nói. 1 2 3 4 5

ĐÁNH GIÁ VÀ CHO ĐIỂM

25 Giám khảo đánh giá tôi dựa trên những mục tiêu đề ra trong đề cương môn học. 1 2 3 4 526 Giám khảo có sự nhất quán trong cách đặt câu hỏi cho thí sinh. 1 2 3 4 5 27 Giám khảo công bằng trong việc cho điểm. 1 2 3 4 528 Việc đánh giá nên do hai giám khảo thực hiện độc lập. 1 2 3 4 5 29 Thí sinh khó đạt được điểm cao với thang điểm hiện tại. 1 2 3 4 5 30 Giám khảo cho điểm rộng rãi. 1 2 3 4 531 Tôi thích bạn bè trong lớp tham gia vào việc đánh giá khả năng nói của mình. 1 2 3 4 532 Phần thi nói cần được ghi âm để có khi cần xem xét lại. 1 2 3 4 5 33 Điểm số cho phần thi nói tiếng Anh của tôi nên do việc thể hiện khả năng nói

trong buổi thi quyết định, không nên tính đến việc thể hiện trong lớp học kỹ năng nói.

1 2 3 4 5

436

ẢNH HƯỞNG CỦA PHẦN THI NÓI ĐỐI VỚI VIỆC HỌC

34 Buổi thi giúp tôi nhận ra những điểm tôi cần cải thiện để có thể nói tiếng Anh tốt hơn.

1 2 3 4 5

35 Buổi thi giúp tôi xây dựng những chiến thuật học tập kỹ năng nói tiếng Anh hiệu quả hơn.

1 2 3 4 5

36 Tôi học được những kỹ năng hữu ích cho thi cử từ buổi thi này. 1 2 3 4 5 37 Buổi thi giúp tôi tự tin hơn khi nói tiếng Anh. 1 2 3 4 538 Tôi ôn lại những kiến thức về ngôn ngữ trong suốt khóa học nói để chuẩn bị

cho buổi kiểm tra này. 1 2 3 4 5

39 Buổi kiểm tra này là một trong những động lực quan trọng để tôi chú ý trau dồi kỹ năng nói tiếng Anh của mình.

1 2 3 4 5

MỘT SỐ CÂU HỎI KHÁC Câu hỏi 40-46: Chọn một hoặc nhiều câu trả lời và/hoặc cho ý kiến của riêng bạn. 40. Bạn có cảm thấy lo sợ trong khi kiểm tra không?

¨ Không có lo sợ gì hết ¨ Có lo sợ (lo lắng) ¨ Rất lo sợ Nếu có, điều gì làm bạn lo sợ? ……………………………………………..…………………… …………………………………………………………………………………………………… ……………………………………………………………………………………………………

41. Bạn sẽ thể hiện phần thi nói của mình tốt hơn nếu bạn … ¨ được một giám khảo khác đánh giá ¨ thực hiện một loại yêu cầu khác ¨ tham gia kỳ thi vào một thời điểm khác ¨ kỳ thi tổ chức ở một địa điểm khác ¨ được đánh giá riêng lẻ một mình ¨ được đánh giá theo cặp (với một thí sinh khác) ¨ được đánh giá theo nhóm (với ít nhất hai thí sinh khác) ¨ Câu trả lời khác: ……………………………………………….

42. Bạn có gặp phải sự việc nào xảy ra ngoài ý muốn mà bạn cho là có thể ảnh hưởng đến phần thi nói của bạn không?

¨ Không ¨ Có Nếu có, bạn nghĩ điều đó là gì? ¨ Tiếng ồn ¨ Nhiệt độ ¨ Cách bố trí chỗ ngồi ¨ Sự cố kỹ thuật ¨ Câu trả lời khác: ……………………………………………….

43. Điểm số có ảnh hưởng đến bạn không, khi xét đến những mặt sau đây? ¨ Hình ảnh cá nhân (Thể diện) ¨ Động cơ học tiếng Anh ¨ Mối quan hệ thầy-trò ¨ Lo lắng, căng thẳng ¨ Cơ hội nghề nghiệp tương lai ¨ Kết quả học tập chung ¨ Khiển trách từ phía gia đình ¨ Câu trả lời khác: ……………………………………………….

44. Những hoạt động nào sau đây bạn đã thực hiện trước khi thi mà bạn thấy hữu ích cho phần thi nói của mình?

¨ Thuyết trình (miệng) ¨ Luyện tập nói theo cặp ¨ Tập đóng vai và thảo luận nhóm ¨ Tham gia câu lạc bộ nói tiếng Anh ¨ Tham gia kiểm tra thử ¨ Câu trả lời khác: …………………………………………………

437

45. Theo bạn giám khảo nên làm gì để có thể giúp bạn thể hiện tốt hơn trong phần thi nói? …………………………………………………………………………………………………… …………………………………………………………………………………………………… …………………………………………………………………………………………………… …………………………………………………………………………………………………… ……………………………………………………………………………………………………

46. Bạn có kiến nghị gì để việc kiểm tra nói tiếng Anh trở nên hiệu quả hơn? …………………………………………………………………………………………………… …………………………………………………………………………………………………… …………………………………………………………………………………………………… …………………………………………………………………………………………………… ……………………………………………………………………………………………………

Cảm ơn sự hợp tác của bạn! _______________________________________ Ghi chú Bảng câu hỏi này được thiết kế dựa theo những bảng câu hỏi từ các nguồn sau đây: • Cheng, L. (2005). Changing language teaching through language testing: A washback study,

Studies in Language Testing 21, UK: Cambridge University Press. • Fulcher, G. (2003). Testing second language speaking, Harlow: Longman/Pearson Education Ltd. • Taylor, L., Milanovic, M., & Weir, C. J. (Eds.) (2011). Examining Speaking: Research and

practice in assessing second language speaking, Studies in Language Testing 30, UK: Cambridge University Press.

• Weir, C. J. (2005). Language testing and validation: An evidence-based approach, Basingstoke: Palgrave Macmillan.

438

B.3a Interview protocol for EFL teachers (*)

Research project: Oral assessment of English as a foreign language (EFL) in tertiary education in Vietnam: An investigation into construct validity and reliability

INTERVIEW QUESTIONS Document version 2, dated 01/11/2015

For examiners (EFL teachers) Interviews will be conducted in participants’ first language – Vietnamese. Interview with the individual examiners will not exceed 45 minutes each. Interviewee’s name: ………………………………………………. Gender: M/F University: ………………………………………………………… Contact: Email: …………………………………………… Ph: ………..……….. Interview date: …… / …… / 2016 Time: From …………..….. to …….……… Venue/Room: ..…………………………………………………………………………………… Introductory statement Thank you for agreeing to work with me on this project. The purpose of this project is to learn about the English-speaking test that you have just given this semester. I will ask you some questions to obtain your opinions and perceptions about this test. The answers you provide will be used for this research study only, and not for any other purposes. With your permission, I will record all of our interviews (using an MP3 recorder) to ensure accuracy and coverage in understanding your responses. Please answer the questions as completely and accurately as you can. 1. In your opinion, how important is the English-speaking test in the training process of EFL majors? 2. How were the speaking tasks you used in the oral EFL test constructed? 3. What method(s) did you use to score examinees’ oral performances? 4. Did you find any difficulty in your rating process? If yes, what were these difficulties? 5. To what extent do you think you could judge different candidates using the same criteria as others? 6. Were there any other factors that might have affected your judgment of the candidates’ oral performances? (e.g. candidate’s body language, their appearance, their order in the test, your familiarity with them, your fatigue, etc.) 7. Based on the students’ performances in this test, what course objectives do you think have been achieved? (e.g. language knowledge, language functions in use, cross-cultural awareness, etc.) 8. Do you think you will make any changes or improvements in your teaching to help students improve their English-speaking skill if you are assigned to teach this course again?

- If yes, why? What changes or improvements would you like to make? - If no, why not?

9. In your opinion, what are the strengths and weaknesses of oral language testing this way? 10. Have you ever given a computer-assisted oral test at your university?

- If yes, what do you see as the advantages and disadvantages of this testing method? - If no, why not? Do you expect this testing method to be used at your university in the future?

439

11. Do you think it is necessary to give candidates feedback and comments on their speaking performances after the test day? Why (not)? 12. In your opinion, is having two independent raters involved in the scoring process advisable and feasible for your institution? Why (not)? 13. In your opinion, should speaking tests be (audio-)recorded? Why (not)? 14. Do you have any other opinions about the speaking test you have just given?

Thank you for your cooperation!

440

Vietnamese translation

CÂU HỎI PHỎNG VẤN DÀNH CHO GIÁM KHẢO (GIÁO VIÊN)

Cảm ơn quý Thầy/Cô đã đồng ý tham gia vào chương trình nghiên cứu này. Mục đích của dự án là nhằm tìm hiểu về phần kiểm tra nói mà quý Thầy/Cô đã cho trong học kỳ này. Người phỏng vấn sẽ hỏi quý Thầy/Cô một số câu hỏi liên quan đến những cảm nhận và ý kiến của quý Thầy/Cô về bài kiểm tra này. Những thông tin quý Thầy/Cô cung cấp sẽ chỉ dùng cho việc nghiên cứu, và hoàn toàn không vì bất kỳ mục tiêu nào khác. Với sự cho phép của quý Thầy/Cô, toàn bộ cuộc phỏng vấn sẽ được ghi âm (bằng thiết bị MP3) để đảm bảo việc hiểu một cách đúng đắn và đầy đủ những câu trả lời. Xin Thầy/Cô vui lòng trả lời các câu hỏi một cách trọn vẹn và chính xác như những gì Thầy/Cô muốn cung cấp. 1. Theo Thầy/Cô việc kiểm tra đánh giá khả năng nói tiếng Anh có tầm quan trọng như thế nào trong quá trình đào tạo sinh viên chuyên ngữ? 2. Những yêu cầu của phần kiểm tra nói (đề thi) được xây dựng như thế nào? 3. Thầy/Cô đã cho điểm phần trình bày của thí sinh bằng phương pháp nào? 4. Thầy/Cô có gặp khó khăn gì trong quá trình cho điểm của mình không? Nếu có, đó là những khó khăn gì? 5. Thầy/Cô nhận thấy mình đã có thể đánh giá các thí sinh với cùng tiêu chí nhất quán ở mức độ nào? 6. Có thể có những yếu tố khác ảnh hưởng đến sự đánh giá của Thầy/Cô đối với phần trình bày của các thí sinh không? (Chẳng hạn như diện mạo thí sinh, ngôn ngữ hình thể, mức độ quen biết, trật tự trong danh sách, sự mệt mỏi của giám khảo khi càng về cuối buổi thi,...) 7. Những điểm nào liên quan tới mục tiêu môn học mà Thầy/Cô cho rằng sinh viên đã đạt được thể hiện qua phần thi nói của các em? (Chẳng hạn như về kiến thức ngôn ngữ, kỹ năng sử dụng ngôn ngữ, hiểu biết về văn hóa...) 8. Thầy/Cô có nghĩ rằng mình sẽ có những thay đổi hay cải thiện nào đó nhằm giúp sinh viên nâng cao khả năng nói tiếng Anh nếu Thầy/Cô sẽ dạy lại chương trình này trong tương lai?

- Nếu có, tại sao? Những thay đổi hay cải thiện đó là gì? - Nếu không, tại sao không?

9. Theo nhận định của Thầy/Cô, điểm được và chưa được của hình thức kiểm tra nói theo cách đã áp dụng vừa qua là gì? 10. Thầy/Cô có bao giờ tổ chức cho sinh viên thi nói trên máy tính chưa? - Nếu có, những thuận lợi và bất lợi của phương pháp kiểm tra này là gì?

- Nếu chưa, Thầy/Cô có mong cách thức kiểm tra này sẽ được thực hiện trong tương lai ở trường nơi Thầy/Cô đang công tác không? Vì sao (không)? 11. Theo Thầy/Cô, có cần cho thí sinh những nhận xét về phần thi nói của các em sau khi thi không? Tại sao (không)? 12. Theo ý kiến của Thầy/Cô, việc đánh giá cho điểm do hai giám khảo thực hiện độc lập có nên và có khả thi ở trường của Thầy/Cô không? Tại sao (không)? 13. Theo ý kiến của Thầy/Cô, có nên ghi âm lại buổi thi nói không? Tại sao (không)? 14. Thầy/Cô có ý kiến nào khác về bài kiểm tra nói vừa qua nữa không?

Cảm ơn sự cộng tác và giúp đỡ của quý Thầy/Cô!

441

B.3b Interview protocol for EFL students (*)

Research project: Oral assessment of English as a foreign language (EFL) in tertiary education in Vietnam: An investigation into construct validity and reliability

INTERVIEW QUESTIONS Document version 2, dated 01/11/2015

For candidates (EFL students) Interviews will be conducted in participants’ first language (Vietnamese). Each group interview with four interviewees will take about 60 minutes. Interview date: …… / …… / 20….. Time: from …….………to …….….…… University: ……………………………………………………………... Class: ……………… Venue/Room: .……………………………………………………………………………………

No. | Name | Gender (M/F) | Phone | Email
1.
2.
3.
4.

Introductory statement Thank you for agreeing to work with me on this project. The purpose of this project is to learn about the English-speaking test that you have just taken this semester. I will ask you some questions to get your opinions and perceptions about this test. The answers you provide will be used for this research study only, and not for any other purposes. With your permission, I will record all of our interviews (using an MP3 recorder) to ensure accuracy and coverage in understanding your responses. Please answer the questions as completely and accurately as you can. 1. Do you think you gave the best speaking performance you are capable of? 2. Do you think there were any external factors (e.g. noise, temperature, health problems, examiner's attitude, etc.) that might have adversely affected your oral performance? 3. Do you think the test was an opportunity for you to demonstrate what you had learnt in this course? 4. Did the speaking class prepare you well for this test? If yes, in what way(s)? If no, why not? 5. Were you clearly informed of the assessment criteria and the testing procedure before the test? If yes, how was this information helpful to you in preparation for the test? If no, what difficulties did this cause you? 6. Did you receive feedback from the examiner on your oral test performance? If yes, how was the feedback helpful to you? If no, do you think feedback is needed after the test? 7. Do you think the test helped you build more effective strategies for your EFL learning, particularly for improving your English-speaking skill? If yes, in what way(s) did the test help you? If no, why not?


8. Do you think the test score reflects your English-speaking ability? Why (not)? 9. Do you think the score you received is fair? Please give your reasons. 10. Have you ever taken a computer-assisted oral test at your university? If yes, how useful do you think it is/was? 11. Which format of speaking test do you like best: personal oral interview, paired speaking test, group oral assessment, or any other format? Why? 12. Do you prefer your speaking performance to be rated by one or two raters? Why? 13. Do you think candidates’ speaking performances should be (audio-)recorded in case they might need reconsidering later? Why (not)? 14. Do you have any other opinions about the speaking test you have taken?

Thank you all for your cooperation!


Vietnamese translation

CÂU HỎI PHỎNG VẤN

DÀNH CHO THÍ SINH (SINH VIÊN) Cảm ơn các bạn đã đồng ý tham gia vào chương trình nghiên cứu này. Mục đích của dự án là nhằm tìm hiểu về phần kiểm tra nói mà các bạn đã tham gia trong học kỳ này. Người phỏng vấn sẽ hỏi các bạn một số câu hỏi liên quan đến những cảm nhận và ý kiến của các bạn về bài kiểm tra này. Những thông tin các bạn cung cấp sẽ chỉ dùng cho việc nghiên cứu, và hoàn toàn không vì bất kỳ mục tiêu nào khác. Với sự cho phép của các bạn, toàn bộ cuộc phỏng vấn sẽ được ghi âm (bằng thiết bị MP3) để đảm bảo việc hiểu chính xác và đầy đủ thông tin các bạn cung cấp. Mong các bạn vui lòng trả lời các câu hỏi một cách trọn vẹn và chính xác nhất. 1. Bạn có nghĩ là trong kỳ thi cuối khoá vừa qua, bạn đã có cơ hội thể hiện tối đa khả năng nói tiếng Anh của mình chưa? 2. Có yếu tố bên ngoài nào ảnh hưởng đến phần thi nói của bạn không (ví dụ: tiếng ồn, nhiệt độ, điều kiện sức khoẻ, cách cư xử của giám khảo…)? 3. Bạn có nghĩ bài kiểm tra là dịp để bạn thể hiện những gì đã học trong khoá này không? 4. Lớp học kỹ năng nói có giúp bạn trang bị tốt cho kỳ thi này không? 5. Bạn có được thông tin rõ ràng về cách thức kiểm tra và những tiêu chí đánh giá trước kỳ kiểm tra hay không? Nếu có, thông tin này hữu ích đối với bạn như thế nào? Nếu không, (việc thiếu thông tin) điều này có gây khó khăn gì cho bạn không? 6. Sau ngày thi, bạn có nhận được những nhận xét từ giám khảo cho phần thi nói của bạn không? Nếu có, những nhận xét đó có hữu ích đối với bạn không? Nếu không, bạn có nghĩ những lời nhận xét góp ý là cần thiết không? 7. Bạn có cho rằng bài kiểm tra này đã giúp bạn xây dựng được những chiến thuật học tiếng Anh hiệu quả hơn không? Cụ thể là đối với kỹ năng nói tiếng Anh? Nếu có, việc kiểm tra đã giúp bạn như thế nào? Nếu không, tại sao không? 8. Theo bạn thấy điểm số các bạn nhận được có phản ánh (đúng) khả năng nói tiếng Anh của bạn không? Tại sao (không)? 9. Bạn có nghĩ điểm số bạn nhận được là công bằng không? Vui lòng cho biết lý do của bạn. 10. Bạn có bao giờ tham gia kỳ thi nói trên máy tính chưa? Nếu có, hình thức này hiệu quả như thế nào, theo bạn thấy? 11. Bạn thích hình thức kiểm tra nói nào nhất: phỏng vấn cá nhân, thi theo cặp, theo nhóm, hay một hình thức nào khác? Tại sao? 12. Bạn có nghĩ phần kiểm tra nói nên do một hay hai giám khảo đánh giá không? Tại sao (không)? 13. Bạn có nghĩ buổi thi nói nên được ghi âm phòng khi cần xem xét lại sau này không? Tại sao (không)? 14. Bạn có ý kiến nào khác về bài kiểm tra nói tiếng Anh vừa qua không?

Cảm ơn sự hợp tác và giúp đỡ của các bạn!


B.4 Documents for EFL experts

B.4a Test content judgement protocol for EFL experts: University A

Research project:

Oral assessment of English as a foreign language (EFL) in tertiary education

in Vietnam: An investigation into construct validity and reliability

No. Test Content Code: 01

Expert Code:

JUDGEMENTS ON THE RELEVANCE OF TEST CONTENTS

Document version 2, dated 01/11/2015

The items of language skills and knowledge in the list below are from an EFL speaking test for second-year English majors. As an expert with qualifications and experience in TESOL and/or EFL curriculum, you are invited to rate each test item in terms of its relevance to the course objectives and the contents of the course book (documents attached) by marking a tick (✓) to indicate whether it is (1) Highly irrelevant, (2) Not relevant, (3) Relevant, or (4) Highly relevant. Please give further comments on the level of difficulty and/or make suggestions for revision of each item if necessary.

YOUR FURTHER COMMENTS (on level of difficulty, wording, language focus, etc.) and/or SUGGESTIONS for revision of particular items wherever possible: …………………………………………………………………………………………….………………… …………………………………………………………………………………………….………………… …………………………………………………………………………………………….………………… …………………………………………………………………………………………….………………… …………………………………………………………………………………………….…………………

No. | Language skill/knowledge assessed in the test | (1) Highly irrelevant | (2) Not relevant | (3) Relevant | (4) Highly relevant | Further comments and/or suggestions for revision
1  Topic 1.1
2  Topic 1.2
3  Topic 1.3
4  Topic 1.4
5  Topic 2.1
6  …
…  …


B.4b Test content judgement protocol for EFL experts: University B Research project:

Oral assessment of English as a foreign language (EFL) in tertiary education

in Vietnam: An investigation into construct validity and reliability

No. Test Content Code: 02

Expert Code:

JUDGEMENTS ON THE RELEVANCE OF TEST CONTENTS

Document version 2, dated 01/11/2015

The items of language skills and knowledge in the list below are from an EFL speaking test for second-year English majors. As an expert with qualifications and experience in TESOL and/or EFL curriculum, you are invited to rate each test item in terms of its relevance to the course objectives and the contents of the course book (documents attached) by marking a tick (✓) to indicate whether it is (1) Highly irrelevant, (2) Not relevant, (3) Relevant, or (4) Highly relevant. Please give further comments on the level of difficulty and/or make suggestions for revision of each item if necessary.

YOUR FURTHER COMMENTS (on level of difficulty, wording, language focus, etc.) and/or SUGGESTIONS for revision of particular items wherever possible: …………………………………………………………………………………………….………………… …………………………………………………………………………………………….………………… …………………………………………………………………………………………….………………… …………………………………………………………………………………………….…………………

No. | Task 1: Answer the questions individually | (1) Highly irrelevant | (2) Not relevant | (3) Relevant | (4) Highly relevant | Further comments and/or suggestions
1A. FOOD & SCIENCE: items 1, 2, 3
1B. SUCCESS & GAP YEARS: items 4, …

No. | Task 2: Look at the mind map and discuss with your partner. Then, decide | (1) Highly irrelevant | (2) Not relevant | (3) Relevant | (4) Highly relevant | Further comments and/or suggestions
61  … which one is the most important?


B.4c Test content judgement protocol for EFL experts: University C

Research project:

Oral assessment of English as a foreign language (EFL) in tertiary education

in Vietnam: An investigation into construct validity and reliability

No. Test Content Code: 03

Expert Code:

JUDGEMENTS ON THE RELEVANCE OF TEST CONTENTS

Document version 2, dated 01/11/2015

The items of language skills and knowledge in the list below are from an EFL speaking test for second-year English majors. As an expert with qualifications and experience in TESOL and/or EFL curriculum, you are invited to rate each test item in terms of its relevance to the course objectives and the contents of the course book (documents attached) by marking a tick (✓) to indicate whether it is (1) Highly irrelevant, (2) Not relevant, (3) Relevant, or (4) Highly relevant. Please give further comments on the level of difficulty and/or make suggestions for revision of each item if necessary.

YOUR FURTHER COMMENTS (on level of difficulty, wording, language focus, etc.) and/or SUGGESTIONS for revision of particular items wherever possible: …………………………………………………………………………………………….………………… …………………………………………………………………………………………….………………… …………………………………………………………………………………………….………………… …………………………………………………………………………………………….………………… …………………………………………………………………………………………….………………… …………………………………………………………………………………………….………………… …………………………………………………………………………………………….………………… …………………………………………………………………………………………….………………… …………………………………………………………………………………………….………………… …………………………………………………………………………………………….…………………

No. | Language skills/knowledge assessed in the test | (1) Highly irrelevant | (2) Not relevant | (3) Relevant | (4) Highly relevant | Further comments and/or suggestions for revision
GENERAL KNOWLEDGE FOCUS: items 1, 2, 3, 4, …
LECTURE LANGUAGE FOCUS: items 41, 42, …


B.4d Demographic information about EFL experts

No. Test Content Code:

Expert Code:

Research project:

Oral assessment of English as a foreign language (EFL) in tertiary education in Vietnam: An investigation into construct validity and reliability

INFORMATION ABOUT EFL CONTENT EXPERT

Please provide some information about you as an EFL expert judging the relevance of the

speaking test contents.

1. The university/educational institution where you are working:

……………………………………………………………………………………………………

2. Gender: ☐ Male ☐ Female

3. Age group: ☐ Less than 25 ☐ From 25 to less than 30 ☐ From 30 to less than 35 ☐ From 35 to less than 40 ☐ From 40 to less than 45 ☐ 45 or above

4. First language: ☐ Vietnamese ☐ English ☐ Other (Please specify) ……………………

5. Highest professional qualification:
☐ BA in …………….………………………….. (your major/field of study)
☐ MA in …………….…………………………..
☐ PhD in …………….………………………….
☐ Other …………….……………………………

6. Experience in EFL/ESL curriculum, testing and assessment: ……… years

7. Current position in your Department/Faculty:

…………………………………………………………………………………………………...

…………………………………………………………………………………………………...

Thank you for your cooperation!


B.5 Focus group discussion protocol for EFL teacher raters (*)

RATERS’ DISCUSSION PROTOCOL

1. Your perceptions and/or comments on rating EFL speaking skills at your institution (in

terms of test design, test administration, test usefulness, test impact on EFL teaching and

learning, etc.)

2. How you rate candidates' EFL speaking ability (i.e., how you interpret a candidate's speaking performance, how you decide the score you give to a candidate, whether your rating of a candidate's performance is affected by the performances of candidates who come before him/her or who are paired with him/her, etc.)

3. Comparing the first rating (face-to-face) with the second rating (on audio-recordings) in

terms of effectiveness, consistency, difficulty, feasibility, satisfaction, etc.

4. Your preference: marking speaking tests on your own or with another rater? Do you think

pairs of raters should have a short discussion/negotiation (right after each candidate’s speaking

performance or at the end of the speaking session) to come to an agreement on the final score

awarded to him/her? Why (not)?

5. Your suggestions (if any) to enhance the quality of the administration of EFL speaking tests

and/or the rating of EFL speaking skills at your university


DISCUSSION QUESTIONS

Topic 1: Your perceptions and/or comments on rating EFL speaking skills at your institution

- What do you think of the speaking test in which you acted as an examiner/rater?

- Please give your comments on test design, test administration, test usefulness, test impact on

the teaching and learning of EFL speaking skills at your institution, etc.

Topic 2: The method you use to rate candidates’ EFL speaking ability

- How do you interpret a candidate’s speaking performance?

- How do you decide the score you give to a candidate?

- Is your rating of a candidate's performance affected by the performances of candidates who come before him/her or who are paired with him/her, etc.?

- What do you do if a candidate cannot answer a question?

- What do you do if one candidate tends to be dominant in the discussion with their partner in a

paired speaking test?

Topic 3: Comparing the first rating (face-to-face) with the second rating (based on audio-

recordings)

- How consistent do you find yourself in each form of rating?

- How difficult do you find rating candidates' speaking skills face-to-face in comparison with rating based on audio-recordings?

- How satisfied are you with each form of rating?

- How effective do you find each form of rating? …

Topic 4: Your preference in marking speaking tests

- Do you prefer marking speaking tests on your own or with another rater?

- Do you think pairs of raters should have a short discussion/negotiation (right after each

candidate’s speaking performance or at the end of the speaking session) to come to an

agreement on the final score awarded to him/her? Why (not)?

Topic 5: Your suggestions to enhance the quality of assessing EFL speaking

- Do you have any suggestions to improve the administration of EFL speaking tests and/or the

rating of EFL speaking skills at your university?


NỘI DUNG THẢO LUẬN VỚI GIÁM KHẢO

1. Cảm nhận và/hoặc ý kiến nhận xét của Thầy/Cô đối với việc kiểm tra đánh giá

kỹ năng nói tiếng Anh của sinh viên chuyên ngữ ở trường nơi Thầy/Cô đang công tác (về

cách thiết kế đề thi, tổ chức thi, sự hữu ích, ảnh hưởng của việc kiểm tra đối với việc dạy

và học tiếng Anh, v.v.)

2. Cách thức Thầy/Cô đánh giá kỹ năng nói của thí sinh (chẳng hạn, cách hiểu việc

trình bày phần thi nói của thí sinh, quyết định điểm số cho thí sinh, ảnh hưởng phần thi nói

của thí sinh này đến việc đánh giá của Thầy/Cô cho thí sinh khác, v.v.)

3. So sánh việc chấm điểm đánh giá lần đầu (đối mặt với thí sinh) và lần sau (qua

các đoạn ghi âm) khi xét về tính hiệu quả, tính nhất quán, sự khó khăn khi chấm, tính khả

thi, mức độ hài lòng, v.v.)

4. Thầy/Cô thích hình thức nào hơn: tự mình đánh giá và quyết định điểm số cho thí

sinh hay cùng làm việc với 1 giám khảo khác? Thầy/Cô nghĩ cặp giám khảo cùng chấm có

nên thảo luận trao đổi ngắn với nhau để cùng đi đến sự đồng thuận về điểm số cho thí sinh

hay không? Tại sao (không)? Nếu có thì nên thảo luận lúc nào, ngay sau mỗi phần thi của

thí sinh hay cuối buổi thi?

5. Đề xuất/Kiến nghị của Thầy/Cô (nếu có) để việc kiểm tra đánh giá kỹ năng nói

tiếng Anh của trường ngày càng hiệu quả hơn


Appendix C: Additional resources for data collection

C.1 Procedure and speaking notes for initial contact with potential participants

Rationale for preparing the initial contacts • Prospective participants need to have a clear understanding of the study before they decide to participate;

• It is impossible to gather classes and raters to inform them of the research project on the oral test day because of limited space and time; teachers and students are not normally in a good mood to listen to research information right before an exam;

• A brief oral presentation needs to be sufficiently informative and consistent across the intended

classes.

Purposes of initial contact • Inform students of the study

• Invite and encourage voluntary participation

• Make a friendly impression and create trustworthiness

• Answer prospective participants’ questions regarding the study

• Estimate number of participants for each activity (questionnaire survey, audio-recording, using test

scores, and individual/group interviews) to make necessary arrangements prior to the test days

(observation assistant, photocopies of documents, gifts, etc.)

Procedure and contents • Greetings

• Approval from the EFL Faculty of the institution and the teacher in charge

• Purpose of the initial meeting: inviting participation, explaining the reasons for selecting the class, answering further questions from potential participants

• Self-introduction: lecturer of English, research student, research topic, research approval and support

from the Vietnamese government and the UoN

• Significance of the study

• Necessity of voluntary participation

• Explain what participants will do to contribute to the study

• Hand out copies of documents (Information Statement and Consent Form) and give time for students

to read the information provided

• Answer students’ further questions

• Collect the signed (or unsigned) Consent Form from the students

• Seek the monitor’s assistance with delivering and collecting students’ completed questionnaires

• Say thank you and goodbye


C.2 Instructions for raters’ speech sample recording in test rooms

How to use the Sony voice recorder:

1. Switch POWER on.
2. Press [REC/PAUSE] to start/pause recording for a test session.
3. Press [STOP] to finish a test session.

Note:

The symbol [REC] will be on the screen when

the recording is in progress.

Please note that the screen turns black (standby mode) after a few minutes if you do not press any key; however, the recording is still in progress.

Please ALWAYS include in the recording the moment when you check

candidates’ ID so as to facilitate voice recognition later.

Thank you for your cooperation!


C.3 Checklist of documents to be collected

I used this checklist to manage the collection of documents associated with testing and oral assessment. Except for the course books, which were available in bookstores, these internal documents were provided either by the Head of the Listening-Speaking division or by the Secretary of the EFL Faculty. For each kind of document in the list, the box (☐) was ticked when the document was fully collected, and "Pages" was recorded as the total number of pages of that document.

No. | Documents | Provider/Source | Uni. A: Done, Pages | Uni. B: Done, Pages | Uni. C: Done, Pages
1. Course book (copied)  ☐  ☐  ☐
2. Course outline  ☐  ☐  ☐
3. Assessment criteria  ☐  ☐  ☐
4. Rating scale  ☐  ☐  ☐
5. Student lists  ☐  ☐  ☐
6. Scoring sheet (Draft)  ☐  ☐  ☐
7. Test scores (Official)  ☐  ☐  ☐
8. Test administration guidelines  ☐  ☐  ☐
9. Testing material/Test questions  ☐  ☐  ☐
10. Curriculum design guidelines  ☐  ☐  ☐
11. Special consideration form  ☐  ☐  ☐
12. Other: ……………………  ☐  ☐  ☐
Total
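Where it is convenient to keep the collection status in electronic form as well, the same checklist can be mirrored in a small script so that missing documents per institution are listed automatically. The sketch below is an illustration only, not part of the study's actual procedure; the document names follow the checklist above, while the example page counts are hypothetical placeholders.

# Minimal sketch of the C.3 checklist kept as data, so that missing documents per
# institution can be listed automatically. Illustrative only; the page counts
# entered below are hypothetical placeholders, not data from the study.

DOCUMENTS = [
    "Course book (copied)", "Course outline", "Assessment criteria", "Rating scale",
    "Student lists", "Scoring sheet (draft)", "Test scores (official)",
    "Test administration guidelines", "Testing material / test questions",
    "Curriculum design guidelines", "Special consideration form", "Other",
]

# collected[university][document] = total number of pages (None = not yet collected)
collected = {uni: {doc: None for doc in DOCUMENTS} for uni in ("Uni. A", "Uni. B", "Uni. C")}
collected["Uni. A"]["Course outline"] = 4   # placeholder entry
collected["Uni. A"]["Rating scale"] = 1     # placeholder entry

for uni, docs in collected.items():
    missing = [doc for doc, pages in docs.items() if pages is None]
    print(f"{uni}: {len(DOCUMENTS) - len(missing)}/{len(DOCUMENTS)} collected; missing: {', '.join(missing) or 'none'}")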


C.4 Interview transcription template

(with a partial sample transcript of a student group interview)

File name: U2G1 Institution: ………………………………………………

Date: 02/03/2016 Time: from 11.45 to 12.45

Venue: on campus Address: ………….……………………………...………

Voices in this interview transcript: 5

• IN (Interviewer)

• S1 (Student 1, fem.) ……………… • S3 (Student 3, fem.) ………………

• S2 (Student 2, fem.) ……………… • S4 (Student 4, mal.) ………………

Timing | Transcript

44:30 44:45 44:53 45.25 45:47 46:10 46:21

IN: Em thích hình thức kiểm tra nói nào nhất, phỏng vấn cá nhân, chia theo cặp, theo nhóm hay là một hình thức nào khác và cho biết lí do? [S4].

S4: Em thích phỏng vấn cá nhân, nói với giám khảo thì tuy có sợ thiệt nhưng mà nó chắc cú hơn là nói với một người bạn mà mình không biết như thế nào. Nếu hên thì gặp một người rất là giỏi, có thể giúp mình đỡ một phần nào đó hoặc nếu không may thì… hai đứa rất khó nói.

IN: Còn các bạn khác thì thế nào? S2: Em thì có hai ý thầy, hoặc là cá nhân hoặc là một nhóm nhỏ khoảng bốn bạn như vầy

nè thầy. IN: [S2] không thích theo cặp như vừa rồi? S2: Dạ, tại vì ví dụ như bạn [S4] nói, gặp trúng bạn tốt, một bạn giỏi hơn hoặc là dở hơn

mình cũng không biết được. Mà ví dụ như trong nhóm khoảng ba bốn bạn thì sẽ có bạn này bạn kia, ví dụ như có một ý mà mình chưa kịp nghĩ ra thì tự nhiên có một bạn nào đó bổ sung cho mình, thì mình có thể là có ý để mà nói hơn. Nên em không thích thi theo cặp.

S3: Em thì em thích là vô một lúc hai bạn, nhưng mà thích hình thức phỏng vấn cá nhân nhất, tại vì giám khảo là người lắng nghe mình, mình nói gì cũng được hết, giám khảo chỉ hỏi thôi, mình hoàn toàn bộc lộ ý cá nhân của mình. Thành ra thì, đúng là em giống ý bạn [S4] đó, nhưng mà vô (phòng thi) thì vô hai bạn đi cho đỡ run (cười) … Ngồi kế bạn mình.

IN: Vô hai bạn một lượt rồi thì bạn kia làm gì? S3: Nghĩa là phỏng vấn từng người, thay phiên nhau qua lại hỏi. IN: Còn bạn kia ngồi kế bên mình hay là sẽ thi sau? S3: Đúng rồi thầy, ngồi kế bên. Nhưng mà câu hỏi có thể giống hoặc là khác nhau, em thi

với bạn [S2] nè. Hồi đó em thi hai lần rồi, nói chung là em thích hình thức cá nhân nhất.

IN: Giám khảo cũng là thầy cô trong trường mà đâu có gì phải lo lắng? S3: Theo cặp thì em may nên gặp được bạn thân nó hiểu ý, nói qua nói lại vô tư luôn, mà

không hên thì buồn lắm (cười). IN: Còn [S1]? S1: Em thì em thích thi cá nhân hơn, mình nói với giám khảo, giám khảo hiểu mình là tốt

quá rồi. Em không thích thi theo cặp là tại vì… kiểu là mấy bạn nói đó, không có tương tác tốt với mình. Còn một vấn đề là, thi theo cặp cũng được đi nhưng mà đừng có chấm theo quán tính là hai người đó nói qua nói lại, ai nói được nhiều hơn thì người đó cao điểm hơn, hoặc là chấm theo cặp đó, cặp đó nói tốt thì điểm cao còn nói dở không cần biết đứa nào nói tốt nói dở gì rồi chấm hai đứa điểm thấp hết.


Appendix D: Examination documents

D.1a Rating scales and scoring sheets: University A Assessment criteria

1. Fluency, coherence, and interaction 2. Grammatical range and accuracy 3. Lexical resource 4. Pronunciation

Rating scale

10

• Very good response • Message clear and complete • Wide range of appropriate grammar and vocabulary with few serious errors • Appropriate linking phrases used • Speed, rhythm, and pronunciation of sounds help the listener to understand easily • Appropriate length

8

• Good response • Message clear • Range of appropriate grammar and vocabulary with few serious errors • Appropriate linking phrases mostly used • Speed, rhythm, and pronunciation of sounds help the listener to understand • Appropriate length

6

• Satisfactory response • Message mostly clear • Range of grammar and vocabulary with some errors but these do not impede

communication • Some use of appropriate linking words and phrases • Speed, rhythm, and pronunciation of sounds rarely prevent the listener from

understanding • Appropriate length

4 • Weak response • Message difficult to understand at times • Limited range of grammar and vocabulary with frequent errors

2 • Very weak response • Message difficult to understand • Frequent basic errors in grammar and vocabulary • Weak pronunciation (some parts are unclear or difficult to follow)

0 • Not enough language to assess

Scoring sheet

FINAL EXAM IN LANGUAGE SKILLS 3B - SPEAKING SECTION Exam date: …………………….. Class: ……….

Examiner 1: …………………………………………………………. (signature and full name) Examiner 2: …………………………………………………………. (signature and full name)

No. | Student name | Task 1 (10 points) | Task 2 (10 points) | Average (10 points)
1.
2.
3.
… …


D.1b Rating scales and scoring sheets: University B Assessment criteria

1. Communication 2. Coherence 3. Fluency 4. Pronunciation 5. Lexical range and grammatical accuracy

Rating scales

Examiner 1: INTERLOCUTOR (The examiner interviewing candidates)

Examiner 2: ASSESSOR (The examiner listening to the interviews between the interlocutor and candidates)


Scoring sheet

ASSESSMENT SHEET FOR EXAMINERS

SPEAKING TEST 4 Date: …….…. Room: ……

No. | Student's name | Examiner 1 (ASSESSOR): GV (2.5), DM (2.5), P (2.5), IC (2.5), Total (10) | Examiner 2 (INTERLOCUTOR): Total (10) | Average TOTAL (10 points)
1
2
3
… …

Note: GV: Grammar and Vocabulary DM: Discourse management P: Pronunciation IC: Interactive communication

D.1c Rating scales and scoring sheets: University C Assessment criteria and rating scale

1. Understanding and use of lecture language (LL): 30% 2. Content, including vocabulary and grammar range (CT): 30% 3. Pronunciation and intonation (PI): 20% 4. Fluency (FL): 20%

Scoring sheet

END-OF-COURSE SPEAKING TEST

Class: …………… Date: …….……... Room: ……

No. Student’s name LL (3.0) CT (3.0) PI (2.0) FL (2.0) Total (10pts)

1

2

3
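The three institutions above aggregate criterion scores differently: University A averages two 10-point tasks; University B's assessor awards four 2.5-point criteria while the interlocutor gives a global mark out of 10; University C enters four weighted criteria (30/30/20/20% of 10 points). The sketch below illustrates these aggregations; it is not the institutions' own software, and the assumption that University B's final mark is the mean of the assessor's total and the interlocutor's mark follows the 'Average' column of the sheet rather than any explicit rule stated in the documents.

def uni_a_final(task1: float, task2: float) -> float:
    """University A: two tasks, each marked out of 10; the 'Average' column is their mean."""
    return (task1 + task2) / 2


def uni_b_final(assessor: dict, interlocutor_total: float) -> float:
    """University B: the assessor awards GV, DM, P and IC (2.5 points each, summing to 10);
    assumed here (see lead-in) that the final mark averages the assessor's total with the
    interlocutor's global mark out of 10."""
    assessor_total = sum(assessor[c] for c in ("GV", "DM", "P", "IC"))
    return (assessor_total + interlocutor_total) / 2


def uni_c_final(ll: float, ct: float, pi: float, fl: float) -> float:
    """University C: criteria weighted 30/30/20/20% of 10 points, i.e. entered directly
    as LL (3.0), CT (3.0), PI (2.0) and FL (2.0) and summed."""
    return ll + ct + pi + fl


# Hypothetical candidates, for illustration only.
print(uni_a_final(7.0, 8.0))                                          # 7.5
print(uni_b_final({"GV": 2.0, "DM": 1.5, "P": 2.0, "IC": 2.5}, 7.5))  # 7.75
print(uni_c_final(2.5, 2.0, 1.5, 1.5))                                # 7.5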


D.2 Example of testing material for oral examiners (Interlocutor outline/frame)

Modified task types from FCE Speaking Exam of Cambridge ESOL


D.3 Speaking skills and discussion strategies introduced in course books

Institution: University A | University B | University C
Course book title: Skillful Listening and Speaking 4 | Q: Skills for Success 4 | Lecture Ready 3
Publisher, year: Macmillan Education, 2014 | Oxford University Press, 2011 | Oxford University Press, 2013

Speaking skills and discussion strategies

University A (Skillful Listening and Speaking 4):
• Agreeing and disagreeing – degrees of formality
• Identifying sources of information
• Managing conversation
• Supporting proposals
• Emphasizing important information – repetition and contrastive pairs
• Negotiating
• Adding points to an argument
• Softening criticism
• Managing conflict – reformulating and monitoring

University B (Q: Skills for Success 4):
• Express interest during a conversation to encourage the speaker to continue
• Use rising intonation to indicate attitudes and purposes
• Take notes for a discussion
• Change the topic and move a conversation into a comfortable area
• Talk about real and unreal conditions to speculate about choices
• Use questions to maintain listener interest
• Use direct and indirect quotations to report information from sources
• Use persuasive language to encourage positive attitudes towards your positions
• Use reduced forms of pronouns and verbs to achieve a proper tone
• Add to a speaker's comments to become an active conversation partner
• Use thought groups to segment sentences into understandable pieces

University C (Lecture Ready 3):
• Express ideas during a discussion
• Ask for clarification and elaboration during a discussion
• Give opinion and ask for the opinions of others during a discussion
• Express interest and ask for elaboration during a discussion
• Agree and disagree during a discussion
• Learn to compromise and reach a consensus during a discussion
• Expand on your own ideas during a discussion
• Keep the discussion on topic
• Indicate to others when you are preparing to speak or pausing to collect your thoughts
• Support your ideas by paraphrasing and quoting others.


D.4 Scoring sheets for semi-direct assessment

SCORING SHEET FOR ORAL EXAMINER Rater’s full name: …………………………………………….. Rater code: ………..

Columns: Pair | Topic | Candidate code | Gender | Task 1 (10 pts) | Task 2 (10 pts) | Average (10 pts) | Comments on the quality of the recording (P / OK / G) | Notes/Further comments on the candidate's speaking performance

Pair 1) Topic 1: C01.U1 (F), C02.U1 (F)
Pair 2) Topic 5: C03.U1 (F), C04.U1 (M)
Pair 3) Topic 3: C05.U1 (F), C06.U1 (F)
Pair 4) Topic 6: C07.U1 (M), C08.U1 (F)
Pair 5)
Pair 6)

Note.
Assessment criteria: • Fluency and coherence • Grammatical range and accuracy • Lexical resource • Pronunciation
Gender: M: Male, F: Female
Comments on the quality of the recordings: P: Poor, OK: Acceptable, G: Good

University A


SCORING SHEET FOR ORAL EXAMINER

Rater’s full name: …………………………………………….. Rater code: ………..

Columns: Pair | Candidate code | Gender | GV (2.5) | DM (2.5) | P (2.5) | IC (2.5) | TOTAL | Comments on the quality of the recording (P / OK / G) | Notes/Further comments on the candidate's oral performance

Pair 1): C01.U2 (F), C02.U2 (F)
Pair 2): C03.U2 (F), C04.U2 (F)
Pair 3): C05.U2 (F), C06.U2 (F)
Pair 4): C07.U2 (F), C08.U2 (M)
Pair 5)
Pair 6)

Note. Criteria for assessment: • GV: Grammar and Vocabulary (2.5 pts) • DM: Discourse management (2.5 pts) • P: Pronunciation (2.5 pts) • IC: Interactive communication (2.5 pts)
Gender: M: Male, F: Female
Comments on the quality of the recordings: P: Poor, OK: Acceptable, G: Good

University B


SCORING SHEET FOR ORAL EXAMINER

Rater’s full name: …………………………………………….. Rater code: ………..

Columns: No. | Candidate code | Gender | LL (30%) | CT (30%) | PI (20%) | FL (20%) | Total | Comments on the quality of the recording (P / OK / G) | Notes/Further comments on the candidate's speaking performance

1  C01.U3 (F)
2  C02.U3 (F)
3  C03.U3 (M)
4  C04.U3 (F)
5  C05.U3 (F)
6  C06.U3 (F)
7  C07.U3 (M)
8  C08.U3 (M)

Notes. Criteria for assessment: • LL: Understanding and use of lecture language: 30% • CT: Content (vocabulary, grammar range): 30% • PI: Pronunciation, intonation: 20% • FL: Fluency: 20%
Gender: M: Male, F: Female
Comments on the quality of the recordings: P: Poor, OK: Acceptable, G: Good

University C


Appendix E: Data management

E.1 Coding scheme of participants and data sources

The information provided in this table helps me to identify the data sources and research participants involved in the study.

CODE | EXAMPLE | MEANING | RANGE
Teachers from interviews: U1.T1 = University A, Teacher 1 (T1 to T3); U2.T1 = University B, Teacher 1 (T1 to T3); U3.T1 = University C, Teacher 1 (T1 to T3)
Teacher raters from test scoring samples: U1.R1 = University A, Rater 1 (R1 to R4); U2.R1 = University B, Rater 1 (R1 to R4); U3.R1 = University C, Rater 1 (R1 to R4)
Responses to open-ended questions from the questionnaire surveys: TQ1 = response from Teacher 1's questionnaire (TQ1 to TQ35); SQ1 = response from Student 1's questionnaire (SQ1 to SQ352)
Students from group interviews: U1G1.S1 = University A, Group 1, Student 1 (G1 to G3, S1 to S4); U2G1.S1 = University B, Group 1, Student 1 (G1 to G2, S1 to S4); U3G1.S1 = University C, Group 1, Student 1 (G1 to G2, S1 to S3)
Student candidates from test samples: U1.C1 = University A, Candidate 1 (C1 to C34); U2.C1 = University B, Candidate 1 (C1 to C25); U3.C1 = University C, Candidate 1 (C1 to C27)
Question items for oral tests: U1.Q1 = University A, test question item 1 (Q1 to Q40); U2.Q1 = University B, test question item 1 (Q1 to Q70); U3.Q1 = University C, test question item 1 (Q1 to Q45)
EFL content experts: E1 = Expert 1 (E1 to E6)
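When the same scheme is used to label audio files, transcripts, and spreadsheet rows, a small validator helps keep the codes consistent. The sketch below mirrors the code patterns in the table; it is an illustration of the scheme, not a tool used in the original study.

import re

# Minimal sketch: validate and unpack the participant/data-source codes defined in E.1
# (e.g. U1.T2, U2.R3, U1G1.S4, U3.C17, U2.Q62, TQ12, SQ301, E5). Illustration only.
PATTERNS = {
    "teacher (interview)":        re.compile(r"^U([123])\.T(\d+)$"),
    "teacher rater":              re.compile(r"^U([123])\.R(\d+)$"),
    "student (group interview)":  re.compile(r"^U([123])G(\d+)\.S(\d+)$"),
    "student candidate":          re.compile(r"^U([123])\.C(\d+)$"),
    "test question item":         re.compile(r"^U([123])\.Q(\d+)$"),
    "teacher questionnaire item": re.compile(r"^TQ(\d+)$"),
    "student questionnaire item": re.compile(r"^SQ(\d+)$"),
    "EFL content expert":         re.compile(r"^E(\d+)$"),
}
UNIVERSITIES = {"1": "University A", "2": "University B", "3": "University C"}

def describe(code: str) -> str:
    """Return a readable description of a code, or flag it as not matching the scheme."""
    for source, pattern in PATTERNS.items():
        match = pattern.match(code)
        if match:
            uni = UNIVERSITIES[match.group(1)] if code.startswith("U") else None
            return f"{code}: {source}" + (f" ({uni})" if uni else "")
    return f"{code}: not a recognised code"

for example in ("U1.T1", "U2G1.S3", "U3.C27", "SQ15", "X9"):
    print(describe(example))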


E.2 Template for managing recorded speech samples

This template assisted me in managing almost 50 speech samples of oral test sessions collected from the three institutions involved in the study. Each session was recorded in one MP3 file. Mapping the recorded data this way enabled me to cross-reference three sources of data: test materials, audio-recordings, and the transcripts. The table facilitated checking the accuracy of the transcripts, recognising salient features or themes, and comparing candidates' performances on the same kind of task and across tasks, both within the same test room and across test rooms.

Test room | File code | Candidates | Topic | Timing | Contents of test tasks in sequence | Comments and notes

UNIVERSITY A

Test room U1.Rm1 | File U1.C3-4 | Candidates: C3 (f) ……………, C4 (m) …………… | Topic 5: Change
• 0:00-1:52  Task 1: C3, Picture 5.1 (Q17) "Time for plan B". Comments: task delivery: only 1 minute of response time; descriptive: a guy standing, high mountain, thinking about having another plan; interpretative (message): cannot follow the current path, to see another result --> make changes for better paths to solve problems.
• 1:53-3:39  Task 1: C4, Picture 5.2 (Q18) Transformation from a cocoon to a butterfly. Comments: descriptive: development from a cocoon to a butterfly, an evolution; interpretative (message): some people choose to develop into something new and better, others prefer a safe life without changes; wide range of vocabulary.
• 3:40-7:00  Task 2: C3 & C4, Discussion topic 5.3 (Q19) Should people make changes even when they have no problems with their current achievements? Comments: C3 does not support changes because of risks; C4: natural needs depending on individuals' perspectives; good balance of turn-taking; C3: use of too complex sentences.

Test room U1.Rm2 | File U1.C11-12 | Candidates: C11 (f) ……………, C12 (m) …………… | Topic 4: Legacy
• 1:00-2:16  Task 1: C11, Picture 4.1 (Q13) Bill Gates (president of Microsoft Corporation). Comments: inconsistent task delivery; most of the monologue is knowledge about the character (company, schooling, achievement, family, personality); some hesitations.
• 2:17-4:10  Task 1: C12, Picture 4.2 (Q14) Steve Jobs (CEO of Apple Inc.). Comments: very little description comes from the photo; most is knowledge about the character (interest, famous saying, life) and personal reflection; a lot of hesitations.
• 4:11-7:57  Task 2: C11 & C12, Discussion topic 4.4 (Q16) New inventions bring new things and new problems. What can be good about such inventions and what can be done to minimize the downsides? Comments: co-constructed: inventions of the light bulb, smartphone; positive: enjoy more time at night, stay in touch with family, entertainment; negative: waste of time, overuse of smartphone loosens family relationships, lack of care for children; good balance of turn-taking; rater's intervention to end the discussion.

Test room U1.Rm2 | File U1.C27-28 | Candidates: C27 (f) ……………, C28 (m) …………… | Topic 1: Risk
• 0:00-1:30  Task 1: C27, Picture 1.1 (Q1) A wave surfer. Comments: descriptive: dangerous sport; interpretative (message): social knowledge and personal reflection; most of the performance is beyond the photo.
• 1:31-2:50  Task 1: C28, Picture 1.2 (Q2) A rock climber. Comments: descriptive: mountain climbing, dangerous sport; interpretative (message): social knowledge and personal interpretation; performance is mostly beyond the photo prompt.
• 2:51-6:20  Task 2: C27 & C28, Discussion topic 1.4 (Q4) "Only a person who risks is free." Do you agree with the statement? Comments: good co-constructed performance; wide range of speech functions (agreeing/disagreeing, giving examples, asking opinion, elaboration, meaning check, persuading, changing topic, conversational repair, etc.).

UNIVERSITY B

Test room U2.Rm1 | File U2.C9-10 | Candidates: C9 (f) ……………, C10 (f) …………… | Test topic 7
• 0:00-2:50  Task 1: Paired interview. C9 answers Q42 (Human & nature); C10 answers Q42 again; C10 answers Q37 (Gap year); C9 answers Q38 (Gap year). Comments: each candidate answers 2 questions; interlocutor (IN) plays a dominant role in the interview and controls the topics; no interaction between paired candidates; C10 speaks less fluently but produces a longer turn than C9.
• 2:51-7:58  Task 2: Discussion topic 7 (Q67). Theme: advantages and disadvantages of canned food and fast food. Prompts: portable (+), less nutritious (+/-), time-saving (+), exposed to harmful substances (-), long-time preserved (+/-), high in sugar and fat (-). Question: Which disadvantage is the most serious? Comments: candidates are first required to talk about pros (+) and cons (-) of canned food and fast food and have to distinguish between them; some prompts might be interpreted as both, depending on whom they are advantageous/disadvantageous to; no raters' intervention during the discussion.

Test room U2.Rm2 | File U2.C18-19 | Candidates: C18 (f) ……………, C19 (f) …………… | Test topic 2
• 0:00-2:55  Task 1: Paired interview. C18 answers Q8 (Human & nature); C19 answers Q10 (Discovery). Comments: each candidate answers 1 question; interlocutor plays a dominant role in the interview and controls the topics; no interaction between paired candidates; C19 speaks more fluently and uses more complex structures than C18.
• 2:56-6:20  Task 2: Discussion topic 2 (Q62). Theme: ways for parents to help their children develop sportsmanship. Prompts: sending children to a special school, discouraging kids from playing just one sport, modelling how kids should react to their teammates, etc. Question: Which way is the most important? Comments: candidates take turns to talk about the prompts provided (100% agreed); no raters' intervention during the discussion; candidates have not made a decision on the final question when the interlocutor stops them.

Test room U2.Rm2 | File U2.C26-27 | Candidates: C26 (f) ……………, C27 (f) …………… | Test topic 2
• 0:00-2:40  Task 1: Paired interview. C26 answers Q7 (Human and nature); C27 answers Q8 (Human and nature); C27 answers Q10 (Discovery); C26 answers Q11 (Discovery). Comments: each candidate answers 2 questions on different topics; interlocutor plays a dominant role in the interview and controls the topics; no interaction between paired candidates; C26 speaks more fluently; C27 asks the interlocutor to repeat Q8.
• 2:41-7:30  Task 2: Discussion topic 2 (Q62). Theme: ways for parents to help their children develop sportsmanship. Prompts: sending children to a special school, discouraging kids from playing just one sport, modelling how kids should react to their teammates, etc. Question: Which way is the most important? Comments: candidates incorporate some prompts (3) in their discussion (100% agreed); C26 asks C27 questions to initiate the discussion and elaborate an answer; C27's performance shows less fluency, but she speaks better than in Task 1; no raters' intervention during the discussion; candidates have not decided on the final question when the interlocutor has to stop them.

UNIVERSITY C

Test room U3.Rm1 | File U3.C8 | Candidate: C8 (m) ………… | Varied topics
• 0:00-5:50  Task: One-on-one interview. Questions in sequence: Q26-Q29-Q30-Q32-Q20-Q21; besides the predetermined questions, the examiner asks questions for clarification or comprehension check. Comments: candidate answers 6 questions on different topics; examiner plays a dominant role in the interview and controls the topics; candidate shows good listening comprehensibility; candidate is quite confident and fluent; good social knowledge.

Test room U3.Rm2 | File U3.C13 | Candidate: C13 (f) ………… | Varied topics
• 0:00-3:37  Task: One-on-one interview. Questions in sequence: Q3-Q28-Q4-Q9; the examiner includes some explanation or additional words in the predetermined questions. Comments: candidate answers 4 questions on different topics; examiner plays a dominant role in the interview and controls the topics; confident and fluent, very little hesitation; evidence of using cohesive devices.

Test room U3.Rm2 | File U3.C15 | Candidate: C15 (f) ………… | Varied topics
• 0:00-4:25  Task: One-on-one interview. Questions in sequence: Q44-Q1-Q17-Q4-Q28-Q29-Q30; the examiner includes some explanation or additional words in the predetermined questions. Comments: candidate answers 7 questions on different topics; examiner plays a dominant role in the interview and controls the topics; candidate shows some hesitation; candidate asks the examiner to repeat questions.
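Where the template above is also kept electronically alongside the MP3 files, a simple record structure is sufficient. The sketch below mirrors the template's columns with one record per recorded session; the field names are this sketch's own, and the sample entry abbreviates the first University A row purely for illustration.

from dataclasses import dataclass, field

# Minimal sketch of one record in the E.2 template: one MP3 file per test session,
# broken into timed task segments with comments. Field names are this sketch's own.

@dataclass
class TaskSegment:
    start: str              # e.g. "0:00"
    end: str                # e.g. "1:52"
    task: str               # task label and prompt, in test sequence
    notes: list[str] = field(default_factory=list)

@dataclass
class SpeechSample:
    test_room: str          # e.g. "U1.Rm1"
    file_code: str          # MP3 file code, e.g. "U1.C3-4"
    candidates: list[str]   # candidate codes heard in the recording
    topic: str
    segments: list[TaskSegment] = field(default_factory=list)

# Abbreviated example based on the first University A entry in the template above.
sample = SpeechSample(
    test_room="U1.Rm1",
    file_code="U1.C3-4",
    candidates=["U1.C3", "U1.C4"],
    topic="Topic 5: Change",
    segments=[
        TaskSegment("0:00", "1:52", "Task 1: C3, Picture 5.1 (Q17) 'Time for plan B'",
                    ["Only 1 minute of response time"]),
        TaskSegment("3:40", "7:00", "Task 2: C3 & C4, Discussion topic 5.3 (Q19)",
                    ["Good balance of turn-taking"]),
    ],
)

for segment in sample.segments:
    print(f"{sample.file_code} {segment.start}-{segment.end}: {segment.task}")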


E.3 Overview of multiple data sources used to examine key aspects in speaking test validation

This table helped me to locate appropriate data sources and integrate relevant data types to answer questions about particular topics of interest.

Key topics of interest | Data sources and data types, in column order: EFL student questionnaire (QUAN, QUAL); EFL teacher questionnaire (QUAN, QUAL); Test room observation (QUAN, QUAL); Speech samples (QUAL); Teacher interviews (QUAL); Student interviews (QUAL); Documents (QUAL); EFL content experts (QUAN, QUAL)

Test administration: ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Test contents: ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Test tasks: ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Raters and rating: ✓ ✓ ✓ ✓ ✓ ✓
Test impact: ✓ ✓ ✓ ✓ ✓ ✓ ✓

Note. QUAN=quantitative data QUAL=qualitative data


E.4 Tabulated qualitative data

(a) EFL students’ responses to an interview question

Interview question no. 12: Do you prefer your speaking performance to be rated by one or two raters? Why?

University A
• Preference for two raters: Multi-dimensional evaluation [G1.S1] [G1.S3]; Multi-dimensional feedback [G1.S2]; If one (examiner) looks too 'cold', then looking at the other makes me calmer [G1.S4]

University B
• Preference for one rater: More raters, more stressful [G1.S1]; Hard for one candidate to deal with two raters simultaneously [G2.S3]
• Preference for two raters: More raters, more objective; lessens prejudice of a rater, if any [G1.S2] [G1.S3]; More accurate in rating [G1.S3]; Fairer assessment, indulgence for rater-candidate familiarity was reduced [G2.S1]; Raters' different preferences helped to balance the strengths and weaknesses in a candidate's performance [G2.S4]

University C
• Preference for one rater: One rater sitting there as a listener rather than a judge can make objective assessment [G2.S2]
• Preference for two raters: Fairer [G1.S2]; Co-rating makes a sense of justice, but more stressful if there is one rater observing [G1.S3]
• Further comments: One or two does not matter, self-evaluation is more important [G1.S1] [G2.S1]; Raters' respect to candidates makes them want to talk more [G2.S3]

Total: one rater, 3 turns of response; two raters, 11 turns of response; further comments, 3 turns of response


(b) EFL experts’ comments and suggestions on test items

Expert ratings, by test item (experts #01 to #06, in order; see Note below)

U2.Q18: (3); (2) "What if ss do not watch cooking shows"; (3); (4); (4) "course book p.124"; (3)
U2.Q19: (3); (3); (3); (3); (3) "pre-listening task p.202"; (3)
U2.Q21: (3); (2) "Sometimes ss cannot answer b/c they don't have children"; (3); (4); (4) "course book p.216"; (4) "interesting for learners"
U2.Q23: (3); (3); (2) "see notes: should be 'take a year off/take a gap year'"; (4); (3); (3)
U2.Q25: (3); (3); (4); (3); (4) "course book p.123"; (3)
U2.Q26: (3); (4); (2) "should add more questions such as 'why?'"; (3); (3)

Further comments and suggestions on Uni. B's test:
• Most of the questions are suitable. However, there should be more questions relating to real-life activities, which give students more interest to talk about.
• Yes/No questions should be followed by Wh-questions, e.g. what, why, to encourage students to elaborate on their answers.
• The content of the test items is highly relevant to the topics covered in the course book. Most of the questions are taken from the speaking section while some others are from the listening section (pre- and post-listening). In regards to the course syllabus, there is no clear identification of the language outcome. … It is suggested that the language outcome should be developed and that the development of detailed assessment criteria should be taken into account to facilitate the assessment as well as to assure the reliability and validity of the speaking exam.
• Some questions are more difficult than the others because the required answers are not at the same level of difficulty. Some are too general, and some too specific.

Note. The numbers in brackets indicate the experts' judgements of each test item's content relevance: (1) Highly irrelevant, (2) Not relevant, (3) Relevant, and (4) Highly relevant.
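A common way to summarise expert judgements on a four-point relevance scale such as this is an item-level content validity index, the proportion of experts who rate an item 3 (Relevant) or 4 (Highly relevant). The sketch below applies that standard calculation to the ratings tabulated above; it is offered as an illustration and is not the analysis reported in the thesis.

# Minimal sketch: item-level content validity index (I-CVI) for the expert
# relevance ratings tabulated in (b) above, i.e. the proportion of experts
# rating an item 3 'Relevant' or 4 'Highly relevant'. Illustration only.

ratings = {
    # item: ratings from experts #01-#06, in the order shown in the table
    "U2.Q18": [3, 2, 3, 4, 4, 3],
    "U2.Q19": [3, 3, 3, 3, 3, 3],
    "U2.Q21": [3, 2, 3, 4, 4, 4],
    "U2.Q23": [3, 3, 2, 4, 3, 3],
    "U2.Q25": [3, 3, 4, 3, 4, 3],
    "U2.Q26": [3, 4, 2, 3, 3],   # one expert's rating is not shown in the table
}

for item, scores in ratings.items():
    relevant = sum(1 for score in scores if score >= 3)
    i_cvi = relevant / len(scores)
    print(f"{item}: I-CVI = {i_cvi:.2f} ({relevant}/{len(scores)} experts rated it relevant)")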


Appendix F: Examples of preliminary data processing

F.1 Coding a rater interview transcript in two cycles

Original text: "In my opinion, assessing students' speaking skills is very important because students might express their language ability naturally in their learning process without any pressure, but it is virtually true that psychological and other factors do have influence a lot on them. In fact, without examinations, we cannot evaluate students' language competency. Such assessment and rating enabled us teachers to withdraw some experience in what needs improving, how we have to teach er so that students can obtain better results. What has been achieved in the test make the students themselves more confident to continue to study in the next stage. Learning without assessment is like having no ideas about how one is. So I think it (assessing speaking) is very important." [U2.T2]

First coding cycle: 'very important'; naturally speaking in learning process; without pressure; psychological factors affecting oral production in testing; exams to evaluate students' language ability; practical experience for teachers from tests; what to improve; how to teach for better results; achievement makes students confident; confidence in multi-stage study; learning without assessment

Second coding cycle: 'very important'; importance; difference between speaking in class and speaking in a test; psychological impact; learners' evaluation; experience for teachers; adjustment; methodology; achievement as motivation; continual learning process; importance
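Keeping the two coding cycles as data makes it easy to collapse first-cycle codes into second-cycle categories and to tally category frequencies across interviews. The sketch below reuses a few of the codes from the example above; the mapping is illustrative and is not the study's full codebook.

from collections import Counter

# Minimal sketch of two-cycle coding as data: first-cycle codes (close to the
# interview wording) are mapped onto broader second-cycle categories, and the
# category frequencies can then be tallied for an interview or a whole data set.

second_cycle = {
    "'very important'": "importance",
    "naturally speaking in learning process": "difference between speaking in class and in a test",
    "without pressure": "difference between speaking in class and in a test",
    "psychological factors affecting oral production in testing": "psychological impact",
    "exams to evaluate students' language ability": "learners' evaluation",
    "practical experience for teachers from tests": "experience for teachers",
    "how to teach for better results": "methodology",
    "achievement makes students confident": "achievement as motivation",
    "learning without assessment": "importance",
}

def categorise(first_cycle_codes: list[str]) -> Counter:
    """Collapse a list of first-cycle codes into second-cycle category counts."""
    return Counter(second_cycle.get(code, "uncategorised") for code in first_cycle_codes)

# Codes identified in the U2.T2 excerpt above (a subset, for illustration).
codes_u2_t2 = [
    "'very important'",
    "psychological factors affecting oral production in testing",
    "practical experience for teachers from tests",
    "achievement makes students confident",
    "learning without assessment",
]
print(categorise(codes_u2_t2))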


F.2 Rough transcript of a speech sample


F.3 Additional excerpts of task-based speaking performances

EXAMPLES OF PICTURE-BASED LONG TURN TASK

Excerpt F.3a File U1.C11-12 2:16 - 4:10

1 2 ---> 3 4 ---> 5 ---> 6 7 8 ---> 9 10 11 12 13 14 ---> 15 16 17 18 19 20 21 ---> 22

C12: um, this picture (.) this picture depicted, um, a one (.) one of the most fa fa uh, famous person (.) uh, famous person in the world. he -- yeah, his name is steve job. and um, um, when his youth true story in stanford university, um, um, then he shot (.) shot off in the middle and he started to learn calligraphy and then he, uh, he -- and then his patient to -- to cal (.) calligraphy um, begin, and then he, um, he -- he -- and then he start way to, um, to -- to invent the technology. Um, and as you can see in the picture, that is he-- uh, choco -- ice, um, apple. iphone um, uh, doesn't make, um, that is uh, the most of, uh, the most of breathtaking, uh, technology in the world as in that's many millions of people in the world that uses. and, I-I also impressed by his sentence about the apple is, um, not only the (.) the outer appearance of iphone, but also the every detail inside. uh, its the device need to be, uh, beautiful and that his care -- he take care of every detail, minor detail. um, he also -- and now he, uh, he die (.) he just (.) he have die in ah, last year maybe. I'm not sure about it, but he has die but uh, his, um, uh, his rep (.) reputations also are still -- um, still -- is still % (exit) % until now.

Excerpt F.3a illustrates a picture-based long turn in which the candidate was required to talk about a picture showing the portrait of a male character whom she recognised as Steve Jobs. Despite the lack of fluency, with continuous pauses, hesitations, and lexical and structural repair, most of the candidate's talk about the man revealed that she had quite good knowledge about Steve Jobs. The candidate mentioned his world-renowned status (lines 2-3), education (line 4), hobby (lines 5-6), work (line 8), famous saying (lines 14-16), and long-lasting reputation (lines 21-22). She could also express her reflections on the man's achievements (lines 14, 16). The only detail from the picture that the candidate included in her long turn was the Apple iPhone the man is holding. The picture prompt did not provide the candidate with useful hints that she could incorporate into her talk; what it could do was remind her of what she knew about the character so that she could use that knowledge to generate ideas and complete the speaking task.


Excerpt F.3b – File U1.C5-6 3:10 – 5:23

1 2 3 4 5 ---> 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ---> 20 21 22

C6: (3:10) about this picture, >at the first time I see it< I think it’s just about life. this is a (.) there is a modern city >over the picture over the way< and there’s a (.) like a village with (the) poor people the poor and (.) this is this is this is sentence >we’re waiting for the city to come to us< some people they want to have a a better life so they go and they find, and they go to (the) modern city, but some some of them >they don’t they don’t< go anywhere they just stand still and (.) wait (.) something a happen to them, like they don’t want to, >they want< something better but they don’t (.) they are so lazy to find %yeah% and to find and you see this is the (.) the like >how far is it< from the poor city to the modern city. first of all this is fifty kilometre, and then they lower thirty kilometre, but they still don’t go. they just (.) stand still er their area, and even though (.) the er er %the log the% length of ( ) er like lower about ten kilometre, and they still don’t go. it’s mean they so lazy, they want something better but they don’t go anywhere. they just want sit and (.) >something better come with them< as it’s very bad. and (.) this picture also teach me that >if you want something< you need to do (.) need to do ( ) that’s don’t just sit and wait. Nothing happen to you. (5.23)

Excerpt F.3c – File U1.C33-34 2:52 – 4:00

1 ---> 2 3 4 5 6 7 8 9 10

C34: so in this picture you can see that in one side there’s a construction of a new built country >er city< and >on the other hand< is a small (.) house which wrote on that is home home sweet home. all the citizen from the country move to (.) the city to live, and ( ) closer and closer. it’s represents for the industrials er area of the world. peoples leaving (.) the country their homeland to live in the er city to find opportunities for their life as well as their future career er er in the future there might not have anybody living in the rural area any more.


EXAMPLE OF MINDMAP-BASED DISCUSSION TASK

Excerpt F.3d File U2.C26-27 3:45 – 7:08

1 2 3 4 ---> 5 6 7 8 9 10 11 12 ---> 14 15 16 17 18 19 20 21 22 23 ---> 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39

C26: Can you tell me the ways parents can help %their children develop good sports?% C27: yeah I think the most important thing er when parents help their children develop good sportsmanship is sending children to special school C26: why do you think so? C27: er in this school uhm children have (.2) the (.3) can show (.) their talent, can ( ) tch C26: I think uh spe- special school have a good condition for students uh, that they can have uh the best uh the best chance to-to-to study. C27: I think so. moreover helping children decide when they want to play sport uhm is a the another way to help children develop skill. um, when um parents uh can have children decide the sport they like and uh all little children to pursue the best and I think children can be success. C26: yeah, I think so, because uh if the children like=like this uh, like this uh sport and they have passion they can like they can play it very well. C27: mm-hmm, um (.) C26: I think it’s in about (.2) this one. C27: um, between it’s: also significant WAY when parent learning about the injuries and safety in sport. um, they can um support their children uh when- when children playing and-and have accident- yes, accident, injury. uh, for example, they-they have uh broken leg. C26: mm-hmm. C27: um, so uhm C26: um-um I think because the play sport maybe uh, dangerous or safety also they (.) because they um (.) um uh, the parent uh should let their- uh support their children learn about the injury. also, it is all that they can protect themselves. (.6) %I think is% this is uh he (.) helping to decide whether they want to play sport either best fac factor C27: I think so. so that uh (.) that-the-the (large) part for children to uh develop their skill. IN: thank you, can I have the handout please? that’s the end of the test. goodbye.


EXAMPLE OF INTERVIEW TASK

Excerpt F.3e – File U3.C15 0:00 - 4:25

1 2 3 4 5 ---> 6 7 8 9 10 11 12 13 14 15 16 ---> 17 18 19 20 21 22 23 24 25 26 27 28 29 30 ---> 31 32 33 34 35 36 37 38

IN: g’morning. C15: good morning. IN: so, are you ready? C15: yeah. IN: so, can you tell me three types of lecture language you have learned? and give me one example for me. C15: uh, translation-translation language? uh, translation language, little-little language. uh, and uh, close [unintelligible 00:24] language into topic language. IN: mm-hmm. C15: uh, the (.) uh, 'cause in fact language is, um, uh, the reason why I are (.) the reason I talk today and, uh, that's (.2) that reason (.) little-little problem is, um, something. I just know that. IN: so, what do you know about target market? C15: the target markets is, uh, end of the market th-that the brochure, the little want to (.) need to-need to their product to the customer. IN: mm-hmm. so how about in store service? C15: in store service, um, I can (.) uh, black (.2) I can, uh, repeat after that? Can you answer, uh (.) IN: yeah, in store some place means the stuff that they make in the (.) C15: in the store? IN: yeah, yeah. okay. they implement in the store and check, um (.) so what do you want to test or make sure or want to know? C15: uh, want to test some their (.) want to know their customer? IN: customer. okay. um, so what factors do you consider when choosing a product? C15: uh, I will, uh, consi (.) when I wen=when I buy a product, I, uh, I think about their what h-how is look. ho-how does work? Uh, the smell, the taste, and how soft it is and, uh, how-how it make feel (.) how it make=I feel. IN: mm-hmm. Yeah. So can you tell me, uh, nowaday, we have a copyright laws in many fields like, um, music- C15: yes.


39 40 41 ---> 42 43 44 45 46 47 48 49 ---> 50 51 52 53 54 55 56 57 58 59 60 61 62 ---> 63 64 65 66 67 68 69 70 71

IN: -book, film, um, paintings, for example. but why are copyright laws unclear until now? C15: because some people are don't-- they think that copyright law is not important so much. because they can earn a lot of money from the other things such as advertising for something, so they-they want to share their product to the customer to make their-make their name more popular. IN: and what's your idea about music free downloading and file sharing? C15: what's I (.) you can repeat your ans-- your question? IN: what's your idea about music free downloading? C15: uh, as I-as I said earlier, they want their know-- they want their customer, they want their file load down. load their music so they share (.) they have a file sharing and, uh, their customer when they listen their-- uh, they buy their, uh-- they listen their product-- IN: hmm. C15: they feel good and they will share to the others and their name will popular (.) more popular. IN: so how can the singers or the artist can get money back? because they have to invest time and money for their product, for their work. C15: uh, they can make a-advertise (.) uh, advertising. the-the customer or the fans will like to see their face on the TV, also the concert. they can create the concert to earn their money back. and, uh, uh, join in the advertising to get more money. IN: mm-hmm. C15: they have a lots of way to do (.) to get the money so they don't worry about their product to share with the audience. IN: mm-hmm. okay, thank you so much.

As can be seen from Excerpt F.3e, the topics changed continually during the interview, and the candidate had to keep up with them. The entire interview comprised five topics: types of lecture language (lines 5-9), knowledge about the target market (line 16), factors to consider in choosing a product (lines 30-35), reasons why copyright laws remain unclear (lines 41-46), and ways for artists to earn income from their online products (lines 62-63).


Appendix F.4 Occurrence of language functions in speech samples

Checklist columns: Language functions | Uni.A (Task 1, Task 2) | Uni.B (Task 1, Task 2) | Uni.C (Task) | Example (University, candidate)

Informational functions

Providing personal information
Past (x x x x): "When I was a child, er I used to be a very quiet person." (U3.C8)
Present (x x x x x): "When I buy a product, I care about price." (U3.C12)
Future (x x): "They'll find the land of freedom." (U1.C28)

Expressing opinions (x x x x x): "In my opinion, when we are satisfied with our achievement…" (U1.C3)
Elaborating (x x x x x): "It also convenient for people who don't have time to go to the market or supermarket to buy their fresh food." (U2.C10)
Justifying opinions (x x x x x): "I think taking a gap (year) in Vietnam is not popular because they have er an entrance examination to university." (U2.C9)
Comparing (x x x x x): "The whole team become a great team, become more fit to work." (U1.C14)
Speculating (x x x x): "the copyright laws unclear maybe most in Vietnam and some Asian some… in Asian uh country." (U3.C12)
Describing (x x x): "He was also standing with, uh, two hands spreading out." (U1.C24)
Sequence of events (x): "First of all this is fifty kilometre, and then they lower thirty kilometre, but they still don't go." (U1.C6)
Scene (x): "It's a very high mountains and clouds." (U1.C23)
Summarising (x x): "In conclusion, in order to be a successful person we have to have er a lot of characters." (U2.C17)
Suggesting (x x): "We should just use it uhm uh like a method…" (U1.C11)
Expressing preferences (x x x x): "To me I want to make change every day to better myself even if I am in the current situation." (U1.C4)

Interactional functions

Agreeing (x x): "Yes, I agree with you that smart phone, uh, beside its advantages, it also has some disadvantages…" (U1.C11)
Disagreeing (x x x): "Uhm no, I don't think so." (U3.C4)
Modifying (x x x x x): "I just now to play to reduce stress or play with my friends to have fun, not to do any tournament any more." (U3.C24)
Asking for opinions (x x): "How about you? Are there just soft skills?" (U2.C17)
Persuading (x x x): "It give us not only experience… from our real life, but… during the gap year, we also have a chance to know that what our passion is." (U2.C10)
Asking for information (x x): "Do you understand what does it mean 'free'?" (U1.C27)
Conversational repair (x x x x): "I want to tell you a story… when I was young, uh when I was a child, I used to be a very quiet person." (U3.C8)
Negotiating meaning (x x): "Making change doesn't mean that you are careless, (or) you are reckless to get into the change." (U1.C4)
Check meaning (x x x): "What do you mean by 'warranty'?" (U3.R2)
Understanding (x x): "I understand your point." (U1.C4)
Common ground (x x x): "I also think that when you take risk you have to run into a lot of problems or obstacles." (U1.C27)
Asking clarification (x x x): "Control us?" (U1.C12)
Respond to required clarification (x x x): "Don't, do not let them control us." (U1.C11)
Correcting utterance (x x x x x): "They don't think about their themselves in the future." (U3.C12)

Managing interaction

Initiating (x x): "First I will show my opinion." (U1.C28)
Changing (x x): "Thank you. Now I'd like you to talk together about…" (U2.R2)
Reciprocating (x x x x): "Ahh, myself I pick up trash in the street and ahh put put them into the basket." (U2.C4)
Deciding (x x): "For you which?" (U2.C16) – "Soft skill." (U2.C15-17)

Note: (x) indicates that the function in the observation checklist (O'Sullivan, Weir, & Saville, 2002) was identified as occurring in the transcripts of oral performances.
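For readers who wish to replicate this kind of checklist coding, the short Python sketch below illustrates one possible way of tallying how often each function is marked across a set of coded transcripts. It is an illustration only, not the study's actual instrument: the annotation format, transcript codes and counting logic are assumptions made for the example.

```python
# Minimal sketch (not the study's coding tool): tallying how often each
# language function from the observation checklist is marked as occurring
# in a set of hypothetical transcript annotations.
from collections import Counter

# Hypothetical annotations: (transcript/candidate code, function identified).
annotations = [
    ("U3.C8", "Providing personal information (past)"),
    ("U3.C12", "Speculating"),
    ("U1.C11", "Agreeing"),
    ("U1.C11", "Suggesting"),
    ("U2.C17", "Asking for opinions"),
]

tally = Counter(function for _, function in annotations)

for function, count in tally.most_common():
    print(f"{function}: {count}")
```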


Appendix G: Sample rating scales for assessing speaking skills

G.1a Example of a holistic (global) rating scale: the interview assessment scale (Carroll, 1980)

Band

9 Expert speaker. Speaks with authority on a variety of topics. Can initiate, expand and develop a theme.

8 Very good non-native speaker. Maintains effectively his own part of a discussion. Initiates, maintains and elaborates as necessary. Reveals humour where needed and responds to attitudinal tones.

7 Good speaker. Presents case clearly and logically and can develop the dialogue coherently and constructively. Rather less flexible and fluent than Band 8 performer but can respond to main changes of tone or topic. Some hesitation and repetition due to a measure of language restriction but interacts effectively.

6 Competent speaker. Is able to maintain theme of dialogue, to follow topic switches and to use and appreciate main attitude markers. Stumbles and hesitates at times but is reasonably fluent otherwise. Some errors and inappropriate language but these will not impede exchange of views. Shows some independence in discussion with ability to initiate.

5 Modest speaker. Although gist of dialogue is relevant and can be basically understood, there are noticeable deficiencies in mastery of language patterns and style. Needs to ask for repetition or clarification and similarly to be asked for them. Lacks flexibility and initiative. The interviewer often has to speak rather deliberately. Copes but not with great style or interest.

4 Marginal speaker. Can maintain dialogue but in a rather passive manner, rarely taking initiative or guiding the discussion. Has difficulty in following English at normal speed; lacks fluency and probably accuracy in speaking. The dialogue is therefore neither easy nor flowing. Nevertheless, gives the impression that he is in touch with the gist of the dialogue even if not wholly master of it. Marked L1 accent.

3 Extremely limited speaker. Dialogue is a drawn-out affair punctuated with hesitations and misunderstandings. Only catches part of normal speech and unable to produce continuous and accurate discourse. Basic merit is just hanging on to discussion gist, without making major contribution to it.

2 Intermittent speaker. No working facility; occasional, sporadic communication.

1 Non-speaker. Not able to understand and/or speak.


G.1b Example of an analytic rating scale: the Foreign Service Institute (FSI) analytic rating scale for language proficiency interview testing (Keitges, 1982)

Accent
1. Pronunciation frequently unintelligible.
2. Frequent gross errors and a very heavy accent make understanding difficult, require frequent repetition.
3. "Foreign accent" requires concentrated listening and mispronunciations lead to occasional misunderstanding and apparent errors in grammar or vocabulary.
4. Marked "foreign accent" and occasional mispronunciations that do not interfere with understanding.
5. No conspicuous mispronunciations but would not be taken for a native speaker.
6. Native pronunciation, with no trace of "foreign accent."

Grammar
1. Grammar almost entirely inaccurate except in stock phrases.
2. Constant errors showing control of very few major patterns and frequently preventing communication.
3. Frequent errors showing some major patterns uncontrolled and causing occasional irritation and misunderstanding.
4. Occasional errors showing imperfect control of some patterns but no weakness that causes misunderstanding.
5. Few errors, with no patterns of failure.
6. No more than two errors during the interview.

Vocabulary
1. Vocabulary inadequate for even the simplest conversation.
2. Vocabulary limited to basic personal and survival areas (time, food, transportation, family, etc.).
3. Choice of words sometimes inaccurate, limitations of vocabulary prevent discussion of some common professional and social topics.
4. Professional vocabulary adequate to discuss special interests; general vocabulary permits discussion of any nontechnical subject with some circumlocutions.
5. Professional vocabulary broad and precise; general vocabulary adequate to cope with complex practical problems and varied social situations.
6. Vocabulary apparently as accurate and extensive as that of an educated native speaker.

Fluency
1. Speech is so halting and fragmentary that conversation is virtually impossible.
2. Speech is very slow and uneven except for short or routine sentences.
3. Speech is frequently hesitant and jerky; sentences may be left uncompleted.
4. Speech is occasionally hesitant, with some unevenness caused by rephrasing and groping for words.
5. Speech is effortless and smooth, but perceptibly non-native in speed and evenness.
6. Speech on all professional and general topics as effortless and smooth as a native speaker's.

Comprehension
1. Understands too little for the simplest type of conversation.
2. Understands only slow, very simple speech on common social and touristic topics; requires constant repetition and rephrasing.
3. Understands careful, somewhat simplified speech directed to him/her, with considerable repetition and rephrasing.
4. Understands quite well normal educated speech directed to him/her, but requires occasional repetition or rephrasing.
5. Understands everything in normal educated conversation except for very colloquial or low-frequency items or exceptionally rapid or slurred speech.
6. Understands everything in both formal and colloquial speech to be expected of an educated native speaker.
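Purely as an illustration of how an analytic rating of this kind might be recorded, the sketch below stores a candidate's five sub-scores (each 1 to 6, as in the scale above) and reports a simple unweighted total. The candidate data and the decision to use a raw sum are assumptions for the example; the operational FSI weighting and conversion tables are not reproduced in this appendix.

```python
# Illustrative sketch only: recording an analytic rating on the five FSI
# categories (each scored 1-6) and reporting a simple unweighted total.
# Operational FSI scoring uses weighted conversion tables not shown here.

CATEGORIES = ("Accent", "Grammar", "Vocabulary", "Fluency", "Comprehension")

def total_score(ratings: dict) -> int:
    """Check that every category has a 1-6 rating and return the raw sum."""
    for category in CATEGORIES:
        value = ratings.get(category)
        if value is None or not 1 <= value <= 6:
            raise ValueError(f"{category} must be rated 1-6, got {value}")
    return sum(ratings[c] for c in CATEGORIES)

# Hypothetical candidate profile.
candidate = {"Accent": 3, "Grammar": 4, "Vocabulary": 4,
             "Fluency": 3, "Comprehension": 5}
print(total_score(candidate))  # -> 19 out of a possible 30
```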


G.2 Common Reference Levels proposed by the CEFR

G.2a Common Reference Levels: qualitative aspects of spoken language use (Council of Europe, 2001, pp. 28–29)

C2
Range: Shows great flexibility reformulating ideas in differing linguistic forms to convey finer shades of meaning precisely, to give emphasis, to differentiate and to eliminate ambiguity. Also has a good command of idiomatic expressions and colloquialisms.
Accuracy: Maintains consistent grammatical control of complex language, even while attention is otherwise engaged (e.g., in forward planning, in monitoring others' reactions).
Fluency: Can express him/herself spontaneously at length with a natural colloquial flow, avoiding or backtracking around any difficulty so smoothly that the interlocutor is hardly aware of it.
Interaction: Can interact with ease and skill, picking up and using non-verbal and intonational cues apparently effortlessly. Can interweave his/her contribution into the joint discourse with fully natural turn-taking, referencing, allusion making, etc.
Coherence: Can create coherent and cohesive discourse making full and appropriate use of a variety of organisational patterns and a wide range of connectors and other cohesive devices.

C1
Range: Has a good command of a broad range of language, allowing him/her to select a formulation to express him/herself clearly in an appropriate style on a wide range of general, academic, professional or leisure topics without having to restrict what he/she wants to say.
Accuracy: Consistently maintains a high degree of grammatical accuracy; errors are rare, difficult to spot and generally corrected when they do occur.
Fluency: Can express him/herself fluently and spontaneously, almost effortlessly. Only a conceptually difficult subject can hinder a natural, smooth flow of language.
Interaction: Can select a suitable phrase from a readily available range of discourse functions to preface his remarks in order to get or to keep the floor and to relate his/her own contributions skilfully to those of other speakers.
Coherence: Can produce clear, smoothly flowing, well-structured speech, showing controlled use of organisational patterns, connectors and cohesive devices.

B2
Range: Has a sufficient range of language to be able to give clear descriptions, express viewpoints on most general topics, without much conspicuous searching for words, using some complex sentence forms to do so.
Accuracy: Shows a relatively high degree of grammatical control. Does not make errors which cause misunderstanding and can correct most of his/her mistakes.
Fluency: Can produce stretches of language with a fairly even tempo; although he/she can be hesitant as he or she searches for patterns and expressions, there are few noticeably long pauses.
Interaction: Can initiate discourse, take his/her turn when appropriate and end conversation when he/she needs to, though he/she may not always do this elegantly. Can help the discussion along on familiar ground confirming comprehension, inviting others in, etc.
Coherence: Can use a limited number of cohesive devices to link his/her utterances into clear, coherent discourse, though there may be some "jumpiness" in a long contribution.

B1
Range: Has enough language to get by, with sufficient vocabulary to express him/herself with some hesitation and circumlocutions on topics such as family, hobbies and interests, work, travel, and current events.
Accuracy: Uses reasonably accurately a repertoire of frequently used "routines" and patterns associated with more predictable situations.
Fluency: Can keep going comprehensibly, even though pausing for grammatical and lexical planning and repair is very evident, especially in longer stretches of free production.
Interaction: Can initiate, maintain and close simple face-to-face conversation on topics that are familiar or of personal interest. Can repeat back part of what someone has said to confirm mutual understanding.
Coherence: Can link a series of shorter, discrete simple elements into a connected, linear sequence of points.

A2
Range: Uses basic sentence patterns with memorised phrases, groups of a few words and formulae in order to communicate limited information in simple everyday situations.
Accuracy: Uses some simple structures correctly, but still systematically makes basic mistakes.
Fluency: Can make him/herself understood in very short utterances, even though pauses, false starts and reformulation are very evident.
Interaction: Can answer questions and respond to simple statements. Can indicate when he/she is following but is rarely able to understand enough to keep conversation going of his/her own accord.
Coherence: Can link groups of words with simple connectors like "and", "but" and "because".

A1
Range: Has a very basic repertoire of words and simple phrases related to personal details and particular concrete situations.
Accuracy: Shows only limited control of a few simple grammatical structures and sentence patterns in a memorised repertoire.
Fluency: Can manage very short, isolated, mainly pre-packaged utterances, with much pausing to search for expressions, to articulate less familiar words, and to repair communication.
Interaction: Can ask and answer questions about personal details. Can interact in a simple way but communication is totally dependent on repetition, rephrasing and repair.
Coherence: Can link words or groups of words with very basic linear connectors like "and" or "then".

G.2b Common Reference Levels: global scale for speaking skills (Council of Europe, 2001, p. 24)

Proficient User
C2: Can express him/herself spontaneously, very fluently and precisely, differentiating finer shades of meaning even in more complex situations.
C1: Can express him/herself fluently and spontaneously without much obvious searching for expressions. Can produce clear, well-structured, detailed text on complex subjects, showing controlled use of organisational patterns, connectors and cohesive devices. May rarely make errors.

Independent User
B2: Can produce clear, detailed text on a wide range of subjects and explain a viewpoint on a topical issue giving the advantages and disadvantages of various options. Does not make errors which cause misunderstanding.
B1: Can produce simple connected text on topics which are familiar, or of personal interest. Can describe experiences and events, dreams, hopes and ambitions and briefly give reasons and explanations for opinions and plans. Pauses for grammatical and lexical planning.

Basic User
A2: Can communicate in simple and routine tasks requiring a simple and direct exchange of information on familiar and routine matters. Can describe in simple terms aspects of his/her background, immediate environment and matters in areas of immediate need, although pauses and false starts are evident. Uses short utterances and simple structures correctly, but may systematically make errors that create misunderstanding.
A1: Can use familiar everyday expressions and very basic phrases aimed at the satisfaction of needs of a concrete type. Can introduce him/herself and others and can ask and answer questions about personal details such as where he/she lives, people he/she knows and things he/she has. Can interact in a simple way provided the other person talks slowly and clearly and is prepared to help. Pauses a lot for grammatical and lexical planning.


G.2c Common Reference Levels: self-assessment grid for speaking skills (Council of Europe, 2001, pp. 26-27)

Proficient User

C2
Spoken production: I can present a clear, smoothly flowing description or argument in a style appropriate to the context and with an effective logical structure which helps the recipient to notice and remember significant points.
Spoken interaction: I can take part effortlessly in any conversation or discussion and have a good familiarity with idiomatic expressions and colloquialisms. I can express myself fluently and convey finer shades of meaning precisely. If I do have a problem I can backtrack and restructure around the difficulty so smoothly that other people are hardly aware of it.

C1
Spoken production: I can present clear, detailed descriptions of complex subjects integrating sub-themes, developing particular points and rounding off with an appropriate conclusion.
Spoken interaction: I can express myself fluently and spontaneously without much obvious searching for expressions. I can use language flexibly and effectively for social and professional purposes. I can formulate ideas and opinions with precision and relate my contribution skilfully to those of other speakers.

Independent User

B2
Spoken production: I can present clear, detailed descriptions on a wide range of subjects related to my field of interest. I can explain a viewpoint on a topical issue giving the advantages and disadvantages of various options.
Spoken interaction: I can interact with a degree of fluency and spontaneity that makes regular interaction with native speakers quite possible. I can take an active part in discussion in familiar contexts, accounting for and sustaining my views.

B1
Spoken production: I can connect phrases in a simple way in order to describe experiences and events, my dreams, hopes and ambitions. I can briefly give reasons and explanations for opinions and plans. I can narrate a story or relate the plot of a book or film and describe my reactions.
Spoken interaction: I can deal with most situations likely to arise whilst travelling in an area where the language is spoken. I can enter unprepared into conversation on topics that are familiar, of personal interest or pertinent to everyday life (e.g. family, hobbies, work, travel and current events).

Basic User

A2
Spoken production: I can use a series of phrases and sentences to describe in simple terms my family and other people, living conditions, my educational background and my present or most recent job.
Spoken interaction: I can communicate in simple and routine tasks requiring a simple and direct exchange of information on familiar topics and activities. I can handle very short social exchanges, even though I can't usually understand enough to keep the conversation going myself.

A1
Spoken production: I can use simple phrases and sentences to describe where I live and people I know.
Spoken interaction: I can interact in a simple way provided the other person is prepared to repeat or rephrase things at a slower rate of speech and help me formulate what I'm trying to say. I can ask and answer simple questions in areas of immediate need or on very familiar topics.


G.3 The CEFR-based English Competence Framework adopted in Vietnam (Edumax, 2008; IELTS, 2018b)

THE FOREIGN LANGUAGE PROFICIENCY FRAMEWORK APPLIED IN VIETNAM, BASED ON THE 2008 – 2020 "TEACHING AND LEARNING FOREIGN LANGUAGES IN THE NATIONAL EDUCATION SYSTEM" SCHEME

Language Competence

Qualification required by the MoET (Vietnam)

Equivalent international assessment programmes

Language competence framework: Global descriptions

English for young learners (YALE)

General English

General English

Business English

Academic English

Proficient User

C2: Mastery (Proficiency)

Level 6

Proficiency (CPE)

IELTS (bands 9, 8.5, 8, 7.5, 7, 6.5, 6, 5.5, 5, 4.5, 4)

Can understand with ease virtually everything heard or read. Can summarise information from different spoken and written sources, reconstructing arguments and accounts in a coherent presentation. Can express him/herself spontaneously, very fluently and precisely, differentiating finer shades of meaning even in more complex situations.

C1: Effective Operational Proficiency (Advanced)

Level 5: English major university graduate

Advanced (CAE)

Business Higher (BEC)

Can understand a wide range of demanding, longer texts, and recognise implicit meaning. Can express him/herself fluently and spontaneously without much obvious searching for expressions. Can use language flexibly and effectively for social, academic and professional purposes. Can produce clear, well-structured, detailed text on complex subjects, showing controlled use of organisational patterns, connectors and cohesive devices.

Independent User

B2: Vantage (Upper-intermediate)

Level 4: English major college graduate

First for Schools (FCE for Schools)

First (FCE)

Business Vantage (BEC)

Can understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation. Can interact with a degree of fluency and spontaneity that makes regular interaction with native speakers quite possible without strain for either party. Can produce clear, detailed text on a wide range of subjects and explain a viewpoint on a topical issue giving the advantages and disadvantages of various options.

B1: Threshold (Intermediate)

Level 3: Non-English major university graduate, vocational high school and upper secondary school graduate

Preliminary for Schools (PET for schools)

Preliminary (PET)

Business Preliminary (BEC)

Can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. Can deal with most situations likely to arise whilst travelling in an area where the language is spoken. Can produce simple connected text on topics which are familiar or of personal interest. Can describe experiences and events, dreams, hopes and ambitions and briefly give reasons and explanations for opinions and plans.

Basic User

A2: Waystage (Elementary)

Level 2: Vocational training graduate and lower secondary school graduate

Young Learners Flyers (YLE Flyers)

Key for Schools (KET for Schools)

Key (KET)

Can understand sentences and frequently used expressions related to areas of most immediate relevance (e.g. very basic personal and family information, shopping, local geography, employment). Can communicate in simple and routine tasks requiring a simple and direct exchange of information on familiar and routine matters. Can describe in simple terms aspects of his/her background, immediate environment and matters in areas of immediate need.

A1: Breakthrough (Beginner)

Level 1: Primary school graduate

Young Learners Movers (YLE Movers)

Can understand and use familiar everyday expressions and very basic phrases aimed at the satisfaction of needs of a concrete type. Can introduce him/herself and others and can ask and answer questions about personal details such as where he/she lives, people he/she knows and things he/she has. Can interact in a simple way provided the other person talks slowly and clearly and is prepared to help.
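Where institutions need to report a MoET six-level outcome alongside its CEFR label, the correspondence shown in the table above (Level 6 = C2 down to Level 1 = A1) can be expressed as a simple lookup. The sketch below is illustrative only: the dictionary mirrors the table, while the function and variable names are arbitrary choices for the example.

```python
# Lookup mirroring the framework table: MoET levels 1-6 mapped to CEFR levels.
MOET_TO_CEFR = {
    6: "C2 (Mastery / Proficiency)",
    5: "C1 (Effective Operational Proficiency / Advanced)",
    4: "B2 (Vantage / Upper-intermediate)",
    3: "B1 (Threshold / Intermediate)",
    2: "A2 (Waystage / Elementary)",
    1: "A1 (Breakthrough / Beginner)",
}

def cefr_label(moet_level: int) -> str:
    """Return the CEFR level equivalent to a MoET level (1-6)."""
    try:
        return MOET_TO_CEFR[moet_level]
    except KeyError:
        raise ValueError("MoET levels run from 1 to 6") from None

# Level 3 is required of non-English major university graduates -> B1.
print(cefr_label(3))
```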


KHUNG NĂNG LỰC NGOẠI NGỮ ÁP DỤNG TẠI VIỆT NAM THEO ĐỀ ÁN “DẠY VÀ HỌC NGOẠI NGỮ TRONG HỆ THỐNG GIÁO DỤC QUỐC DÂN 2008 – 2020”

Năng lực sử dụng ngôn ngữ

Khung theo chuẩn Châu Âu (CEFR)

Trình độ theo yêu cầu của Bộ Giáo dục & Đào tạo (Việt Nam)

Chương trình khảo sát quốc tế tương đương

Khung năng lực ngoại ngữ: Mô tả tổng quát

Tiếng Anh dành cho Thiếu nhi (YALE)

Tiếng Anh Tổng quát

Tiếng Anh Tổng quát

Tiếng Anh Thương mại

Tiếng Anh Học thuật

Thành thạo

C2: Thành thạo

Bậc 6

Proficiency (CPE)

IELTS (9, 8.5, 8, 7.5, 7, 6.5, 6, 5.5, 5, 4.5, 4)

Có thể hiểu một cách dễ dàng hầu hết văn nói và viết. Có thể tóm tắt các nguồn thông tin nói hoặc viết, sắp xếp lại thông tin và trình bày lại một cách logic. Có thể diễn đạt tức thì, rất trôi chảy và chính xác, phân biệt được các ý nghĩa tinh tế khác nhau trong các tình huống phức tạp.

C1: Cao cấp

Bậc 5: Tốt nghiệp Đại học chuyên ngữ

Advanced (CAE)

Business Higher (BEC)

Có thể hiểu và nhận biết được hàm ý của các văn bản dài với phạm vi rộng. Có thể diễn đạt trôi chảy, tức thì, không gặp khó khăn trong việc tìm từ ngữ diễn đạt. Có thể sử dụng ngôn ngữ linh hoạt và hiệu quả phục vụ các mục đích xã hội, học thuật và chuyên môn. Có thể viết rõ ràng, chặt chẽ, chi tiết về các chủ đề phức tạp, thể hiện được khả năng tổ chức văn bản, sử dụng tốt từ ngữ nối câu và các công cụ liên kết.

Độc lập

B2: Trung cao cấp

Bậc 4: Tốt nghiệp Cao đẳng chuyên ngữ

First for Schools (FCE for Schools)

First (FCE)

Business Vantage (BEC)

Có thể hiểu ý chính của một văn bản phức tạp về các chủ đề cụ thể và trừu tượng, kể cả những trao đổi kỹ thuật thuộc lĩnh vực chuyên môn của bản thân. Có thể giao tiếp ở mức độ trôi chảy, tự nhiên với người bản ngữ. Có thể viết được các văn bản rõ ràng, chi tiết với nhiều chủ đề khác nhau và có thể giải thích quan điểm của mình về một vấn đề, nêu ra được những ưu điểm, nhược điểm của các phương án lựa chọn khác nhau.

B1: Trung cấp

Bậc 3: Tốt nghiệp Cao đẳng và Đại học không chuyên ngữ, Trung cấp chuyên nghiệp và Trung học phổ thông

Preliminary for Schools (PET for schools)

Preliminary (PET)

Business Preliminary (BEC)

Có thể hiểu được các ý chính của một đoạn văn hay bài phát biểu chuẩn mực, rõ ràng về các chủ đề quen thuộc trong công việc, trường học, giải trí, v.v. Có thể xử lý hầu hết các tình huống xảy ra khi đến khu vực có sử dụng ngôn ngữ đó. Có thể viết đoạn văn đơn giản liên quan đến các chủ đề quen thuộc hoặc cá nhân quan tâm. Có thể mô tả được những kinh nghiệm, sự kiện, giấc mơ, hy vọng, hoài bão và có thể trình bày ngắn gọn các lý do, giải thích ý kiến và kế hoạch của mình.

Cơ bản

A2: Sơ cấp

Bậc 2: Tốt nghiệp trường nghề và Trung học cơ sở

Young Learners Flyers (YLE Flyers)

Key for Schools (KET for Schools)

Key (KET)

Có thể hiểu được các câu và cấu trúc được sử dụng thường xuyên liên quan đến nhu cầu giao tiếp cơ bản (như các thông tin về gia đình, bản thân, đi mua hàng, hỏi đường, việc làm). Có thể trao đổi thông tin về những chủ đề đơn giản, quen thuộc hằng ngày. Có thể mô tả đơn giản về bản thân, môi trường xung quanh và những vấn đề thuộc nhu cầu thiết yếu.

A1: Căn bản

Bậc 1: Tốt nghiệp tiểu học

Young Learners Movers (YLE Movers)

Có thể hiểu, sử dụng các cấu trúc quen thuộc thường nhật; các từ ngữ cơ bản đáp ứng nhu cầu giao tiếp cụ thể. Có thể tự giới thiệu bản thân và người khác; có thể trả lời những thông tin về bản thân như nơi sinh sống, người thân/bạn bè v.v… Có thể giao tiếp đơn giản nếu người đối thoại nói chậm, rõ ràng và sẵn sàng hợp tác giúp đỡ.


Appendix H: Flowchart of procedures for data collection and data analysis

Flowchart of procedures for data collection and data analysis using a convergent mixed method design for the study on assessing EFL speaking skills (adapted from Creswell & Clark, 2011, p. 79, and Tsushima, 2015, p. 112). The flowchart covers four phases: Mar 2015 – Nov 2015, Dec 2015 – Jan 2016, Jan 2016 – Mar 2016, and Apr 2016 – Aug 2018.

Data collection (before, during and after the test): preparatory steps (designing and validating research instruments, initial contacts with prospective institutions, recording devices); QUAN data: test scores; QUAN and QUAL data: test room observations, questionnaire surveys for raters and test takers, and judgements on the relevance of test contents; QUAL data: interviews with individual test raters and focus groups of test takers, audio-recordings of speaking performance, and documents (course outlines, test tasks, rating scales, and test contents).

Data analysis: synthesising qualitative data; applying the protocol for judgements on test contents; adjusting interview protocols; connecting QUAN and QUAL data; integrating QUAN and QUAL data; and interpreting the merged results. The analysis addresses context validity (RQ1a test administration, RQ1b test contents, RQ1c test tasks), scoring validity (RQ2 rating consistency), and consequential validity (RQ3 washback effects).


Appendix I: Transcription notation symbols

(Adapted and modified from Atkinson and Heritage, 1984)

1. Unfilled pauses or gaps: periods of silence, timed in seconds. Micro-pauses (less than 1 second) are symbolised (.); longer pauses or gaps appear as a time within parentheses, e.g. (.5) represents a 5-second pause.

2. Colon (:): a lengthened sound or syllable; more colons prolong the stretch.

3. Dash (–): a cut-off, usually a glottal stop.

4. Equal sign (=): a latched utterance, no interval between utterances.

5. Percent signs (% %): quiet talk between the percent signs.

6. Brackets ([ ]): overlapping talk, where utterances start and/or end simultaneously.

7. Parentheses ( ): transcription doubt or uncertainty; words within parentheses are uncertain.

8. Double parentheses (( )): words within double parentheses describe non-vocal action or details of the scene, e.g. coughs, telephone rings.

9. Arrow (--->): a feature of interest to the analyst.

10. Inward arrows (> <): the talk speeds up; outward arrows (< >): the talk slows down.

11. Ellipsis (. . .): turns or part of a turn has been omitted.

12. Underlining or CAPS: a word or SOund is emphasised.

13. Italics: Vietnamese words.

14. hah, huh, heh: laughter, depending on the sounds produced.

15. uhm, er, mm-hmm: hesitation or filler words, depending on the sounds produced.

16. tch: a tongue click.

17. Punctuation marks: markers of intonation rather than clausal structure; a period (.) is falling intonation, a question mark (?) is rising intonation, a comma (,) is continuing intonation, and an exclamation mark (!) is animated intonation.
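When transcripts marked up with these symbols are processed by machine, some of the conventions can be extracted with simple pattern matching. The sketch below counts filler tokens (item 15), timed pauses (item 1) and micro-pauses in one line of transcript; it is an illustration only, and the example utterance is invented rather than drawn from the study's data.

```python
# Illustrative sketch: counting three notation features defined above -
# fillers such as "uhm", "er", "mm-hmm" (item 15), timed pauses written as
# a number in parentheses, e.g. (.5), and micro-pauses "(.)" (item 1).
import re

line = "uhm (.) I think er (.5) it depend on the: teacher, mm-hmm"

fillers = re.findall(r"\b(?:uhm|er|mm-hmm)\b", line)
timed_pauses = re.findall(r"\((?:\d+(?:\.\d+)?|\.\d+)\)", line)  # e.g. "(.5)"
micro_pauses = re.findall(r"\(\.\)", line)                       # "(.)"

print(len(fillers), "fillers;", len(timed_pauses), "timed pauses;",
      len(micro_pauses), "micro-pauses")
```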


Appendix J: List of tertiary institutions

This list includes universities in HCMC (Vietnam) offering training programmes for EFL majors (updated 01/2017).

No. | Name of the university (in English / in Vietnamese) | Type of organisation (Public / Private)

1. Banking University (Đại học (ĐH) Ngân Hàng) ✓
2. HCMC University of Social Sciences and Humanities (ĐH Khoa Học Xã hội và Nhân Văn TPHCM) ✓
3. Van Lang University (ĐH Văn Lang) ✓
4. Hoa Sen University (ĐH Hoa Sen) ✓
5. The Saigon International University (ĐH Quốc Tế Sài Gòn) ✓
6. HCMC Open University (ĐH Mở TPHCM) ✓
7. Van Hien University (ĐH Văn Hiến) ✓
8. Nguyen Tat Thanh University (ĐH Nguyễn Tất Thành) ✓
9. Sai Gon University (ĐH Sài Gòn) ✓
10. University of Pedagogy (ĐH Sư Phạm) ✓
11. Gia Dinh Information Technology University (ĐH Công Nghệ Thông Tin Gia Định) ✓
12. Ton Duc Thang University (ĐH Tôn Đức Thắng) ✓
13. Foreign Languages - Information Technology (ĐH Ngoại Ngữ - Tin Học) ✓
14. Hong Bang International University (ĐH Quốc tế Hồng Bàng) ✓
15. University of Technology (ĐH Công Nghệ) ✓
16. Industry University (ĐH Công Nghiệp) ✓
17. Hung Vuong University (ĐH Hùng Vương) ✓
18. University of Agriculture and Forestry (ĐH Nông Lâm) ✓
19. University of Technology and Education (ĐH Sư Phạm Kỹ thuật) ✓

Source: http://kenhtuyensinh.vn/truong-dai-hoc-tai-tp-ho-chi-minh


Appendix K: Original quotes in Vietnamese

K.1 Quotes from the Vietnamese press and literature

Page in thesis

CHAPTER ONE: INTRODUCTION

33 … hoạt động dạy học được tổ chức thông qua môi trường giao tiếp đa dạng, phong phú với các hoạt động tương tác (trò chơi, bài hát, kể chuyện, câu đố, vẽ tranh...), dưới các hình thức hoạt động cá nhân, theo cặp và nhóm. (Huy Lan & Lan Anh, 2010)

Source:

https://nld.com.vn/giao-duc-khoa-hoc/chuong-trinh-ngoai-ngu-10-nam-khoi-dong-lung-tung--2010081711321136.htm

37

Sinh viên sau khi ra trường đáp ứng yêu cầu kỹ năng tiếng Anh của người sử dụng chỉ khoảng 49%, có tới 18,9% sinh viên không đáp ứng được và 31,8% sinh viên cần đào tạo thêm. Điều đó có nghĩa, hơn nửa số sinh viên sau khi ra trường không đáp ứng đủ yêu cầu về kỹ năng tiếng Anh. (V. Le, 2016)

Source:

https://baotintuc.vn/giao-duc/qua-nua-sinh-vien-tot-nghiep-kem-tieng-anh-20160506225914927.htm

38

Việc đổi mới thi Ngoại ngữ đang như vòng luẩn quẩn. Trước năm 2006, học sinh làm bài thi đại học môn tiếng Anh gồm 20% trắc nghiệm và 80% tự luận. Từ năm 2006 trở đi, môn tiếng Anh thi 100% trắc nghiệm. Sau đó, dạng trắc nghiệm bị phê phán là không đánh giá đúng thực lực học sinh, xa rời kỹ năng thực hành và đề lại chuyển về 80% trắc nghiệm và 20% tự luận trong kỳ thi THPT quốc gia 2015. Thay đổi được 2 năm, kỳ thi 2017 lại dự định chuyển môn tiếng Anh về 100% trắc nghiệm. (Thanh Tam & Phuong Hoa, 2006)

Source:

https://vnexpress.net/tin-tuc/giao-duc/thi-trac-nghiem-100-co-the-la-buoc-lui-cua-mon-tieng-anh-3465561.html

41

Mặc dù lệ phí thi VNU-EPT rất thấp so với các chứng chỉ quốc tế khác nhưng rất ít sinh viên chịu đăng ký dự thi. Đa số các em đều cho rằng, doanh nghiệp không hề sử dụng dạng chứng chỉ này. Tôi cho rằng, để áp dụng hiệu quả VNU-EPT, chúng ta cần tạo được uy tín của chứng chỉ này đối với xã hội, đặc biệt là các nhà tuyển dụng. (Phuong Chinh, 2017)

Source:

http://www.sggp.org.vn/85-sinh-vien-chua-dat-chuan-trinh-do-tieng-anh-456676.html

41

Vấn đề cốt lõi của việc nâng cao hiệu quả giảng dạy ngoại ngữ trên thế giới là làm sao tích hợp được ba thành tố cơ bản và quan trọng nhất của quá trình dạy và học, đó là giảng dạy, học tập, và kiểm tra đánh giá. Riêng đối với Việt Nam, kiểm tra đánh giá vẫn đang là khâu yếu nhất và vì thế cần có sự quan tâm nhiều nhất. (Vu, 2007)

Source:

http://vietbao.vn/Giao-duc/Hoc-tieng-Anh-10-nam-trong-truong-khong-su-dung-duoc-Kiem-tra-danh-gia-dang-la-khau-yeu-nhat/40224569/202/


K.2 Quotes from Vietnamese interviews with research participants

Page in thesis

Chapter Four: TEST TAKER CHARACTERISTICS AND TEST ADMINISTRATION

125 Em nghĩ là em bị ảnh hưởng bởi khá nhiều yếu tố. Yếu tố ảnh hưởng tới em nhiều nhất là điều kiện sức khỏe. Tại vì em tới mỗi kì thi là bị căng thẳng quá nên bị đau bụng. Cho nên từ đó tới giờ những kì thi lớn bị ảnh hưởng khá là nhiều. (U1G1.S2)

149 Em thì cũng không thích hình thức này lắm. Tại vì mặc dù là nói với máy tính thì có lẽ mình bớt căng thẳng hơn, nhưng mà em thấy nói với máy thì không cảm thấy thích thú tại vì kiểu như không có ai lắng nghe mình Khi nói là mình phải tương tác. Mình tương tác với người thật thì sẽ dễ nói hơn là mình cứ nói trước màn hình máy tính như vậy em nghĩ nó không có tự nhiên. (U1G3.S2)

149 em nghĩ thì trên máy tính nó vừa có mặt lợi vừa có mặt hại. Mặt lời thì cái phần mình nói sẽ được chấm một cách công bằng, bởi mình nói gì thì máy đều ghi âm lại hết. Mặt hại là nó sẽ làm cho mình hơi căng thẳng, tại vì khi mình nói chuyện trực tiếp với người giám khảo thì em nghĩ nó sẽ thoải mái hơn, còn nói trên máy tính như kiểu mình là robot vậy. (U1G2.S4)

152 Toàn bộ nó lẫn quẫn trong sách đó. Nó không có gì lạ hết á, ví dụ như những cái câu mà Warm-up, hoặc là Lead-in hoặc là Consolidation trong sách đó là mình xài lại hết. (U2.T3)

153 Em nghĩ nếu được (giáo viên cho nhận xét) thì rất là tốt cho sinh viên. Nhưng mà trên thực tế khó làm lắm anh. Có thể được với điều kiện là giáo viên dạy lớp thì họ theo sát sinh viên thì họ biết trình độ sinh viên thế nào, thì họ feedback. Giống như có những lúc em dạy thì cũng có lớp em làm được chuyện đó (việc cho nhận xét), người học trò cũng thích được feedback rất là cảm ơn. (U1.T3)

Page in thesis

Chapter Five: CONTENT RELEVANCE OF SPEAKING TEST QUESTIONS

184 Dạ đối với em thì lớp học kỹ năng nói cũng giống như là hỗ trợ thôi, phần còn lại là do bản thân mình. Có những câu hỏi không có trong sách thì nó vượt quá tầm của tụi em. Cần phải trao dồi kiến thức thêm, rồi đi ra ngoài thì mới có thể trả lời mấy câu hỏi này được. (U2G2.S3)

186 Bài kiểm tra giúp giáo viên nhận ra sinh viên mình mạnh và yếu ở những điểm nào. Qua việc các em nói mình có thể biết các em đạt được mục tiêu môn học đến mức độ nào để chúng ta (giáo viên) sẽ có những phương pháp phù hợp giúp sinh viên cải thiện kỹ năng nói. Nhiều em cần bổ sung kiến thức liên quan tới nội dung khi nói hơn là khả năng phát âm (tiếng Anh) các em đã có được. Theo em nghĩ, việc học một ngoại ngữ không chỉ là có thể nói được ngôn ngữ đó mà còn cần phải biết mình nói cái gì. Cho nên kiến thức cũng là một phần quan trọng để nói tốt. (U3.T2)

Page in thesis

Chapter Six: SPEAKING TEST TASKS

236 Sinh viên bắt buột phải biết cách thuyết phục bạn đồng ý với quan điểm của mình, và mình phải biết cách disagree với bạn của mình, biết cách debate cách tranh luận với nhau. Sinh viên cần chú ý những (cái) phrases và phải dùng formal language như thế nào khi mà anh disagree mà anh phản biện lại cái ý của bạn mình thì các em phải đạt đến những mục tiêu đó. Trong phần mô tả hình thì các em phải nói được (cái) main points khi mà các em nhìn vào vấn đề gì đó. Ví dụ khi anh nhìn vào bức hình đó anh có thể mô tả bức hình đó theo cái viewpoint của anh. Một cái hình có thể mô tả khác nhau nhưng anh có cái quan điểm của anh mà anh phải nhìn ra cái theme, cái chủ đề của cái bức tranh đó là như thế nào. (U1.T1)

237 Đa số các thí sinh làm không tốt, không tốt cả ba phần: thứ nhứt là nội dung ngôn ngữ (language content), thứ hai là kiến thức ngôn ngữ (language knowledge), thứ ba là chức năng ngôn ngữ (language functions). Bởi vì trong quá trình học các em không có lưu tâm đến, không có get involed vô những hoạt động mà trong lớp thì các em không quen với content đó, không có familiar với những nội dung đó. Khi ra thi thì toàn bộ nội dung nằm trong sách hết, thì cái đó các em không trình bày tốt. Thứ hai là kiến thức ngôn ngữ, ví dụ như văn phạm từ vựng, vì em không quen với nội dung nên em không có nghĩ ra cái từ mà liên quan, và văn phạm thì cũng lấp bắp rồi cũng cố trình bày thôi, chứ thật ra thì những cái cấu trúc thì không được tăng ở cái mức tối đa như mong muốn. Còn về phát âm thì cái này cũng mang tính chất cố hữu thôi, một số em phát âm rất tốt, một số em thì không hề chú trọng và luyện tập cái cách phát âm của mình. Còn chức năng ngôn ngữ thì nếu nội dung không nắm được, kiến thức ngôn ngữ không tốt thì cái việc turn-taking hay là các em sử dụng ngôn ngữ như thế nào thì lại càng không tốt nữa. (U2.T1)

239 Kỳ thi này đúng là cơ hội để mình thể hiện những gì mình học được nhưng mà từ vựng trong bài thi nói này nó ít liên quan đến mấy từ vựng mình đã học được, chủ yếu là dựa vào kinh nghiệm cá nhân mình biết về cái đó thôi. Mình nhớ được cái gì thì mình nói cái đó, mình biết cái gì mình nói cái đó. (U2G2.S4)

239 Những lúc thi xong em hay hối hận tại sao lúc đó mình lại nghĩ và nói cái đó. Có một khoảng thời gian ngắn quá, mình không kịp tư duy hết được, nói hết được những gì trong khả năng của mình. (U1G1.S1)

Page in thesis

Chapter Seven: RATER CHARACTERISTICS AND RATING ORAL SKILLS

269 Em thường hay kiếm điểm cộng chứ không kiếm điểm yếu để trừ. Học viên không quá giỏi để mình kiếm điểm mình trừ. Thường thì sinh viên ở mức độ đó thì mình đi kiếm điểm cộng chứ không kiếm điểm trừ. (U3.T1)

270 Theo em thì nên do hai giám khảo đánh giá. Mỗi người có một cảm nhận riêng nên sẽ nhìn ra một cái riêng của thí sinh nên em nghĩ thì nên hai giám khảo đánh giá thì nó sẽ đa dạng hơn, đa chiều và nhiều mặt hơn, mình sẽ rút được nhiều kinh nghiệm hơn và nó đúng chính xác hơn. (U1G1.S1)

271 Em thì cũng khoái có nhiều giám khảo đánh giá hơn. Khi đó lỡ mà một giám khảo chấm điểm mình thấp đi thì sẽ có giám khảo khác người ta sẽ giải thích là “không thể nào thấp được tại vì tôi thấy khía cạnh em này tốt mà” sẽ có một người nói nữa nên cái bài của mình nó sẽ được phân tích kĩ hơn chỉ với một ý kiến. (U3G2.S3)

271 Em thấy chỉ một giám khảo thôi là tốt nhứt. Tại vì là khi người ta nghe mình nói thì người ta cũng ghi lại được những gì mình nói rồi. Người ta cảm được thấy mình yếu chỗ nào, tốt chỗ nào người ta ghi lại. Chứ không nhất thiết là có hai giám khảo. Càng nhiều càng đông thì càng áp lực hơn thầy, em không thích. (U2G1.S1)

271 Theo em bao nhiêu giám khảo cũng được nhưng mà giám khảo phải có một phong thái của giám khảo là đừng vô phòng thi rồi một người đứng bên đây, một người đứng bên kia, một người bên kia nữa rồi ngó mình thì em không thích như vậy. Khi là giám khảo thì vô bàn ngồi và nói chuyện như một người bình thường thì thí sinh sẽ cảm thấy thoải mái hơn, bài thi sẽ tốt hơn. (U3G2.S1)

Page in thesis

Chapter Eight: IMPACT OF ORAL TESTING ON EFL TEACHING AND LEARNING

305 Buổi hôm trước cái phòng thi của em đó sinh viên ra em cảm thấy nó không khả quan như lần trước em gác thi… Em rút được kinh nghiệm nhiều lắm.Thứ nhất là mấy bạn không tự tin, và muốn các bạn tự tin phải làm gì, vô lớp bắt cho thảo luận nhiều, dành thời gian cho đứng lên nói và cầm micro nói chung trước lớp phải thể hiện càng nhiều càng tốt, thì khi trước đám đông như vậy các bạn bước ra sẽ tự tin. Đa số những bạn thi mà bị vấp váp có nhiều bạn nói kiến thức rất tốt, ngôn ngữ rất tốt nhưng mà trong lúc đó diễn đạt không tốt bị lấp vấp, rồi bị ngưng, bị mất thời gian nói chung silent time nó nhiều quá hoặc là gì đó bạn chưa diễn đạt được hết ý mặc dù là ý đó nhiều mà bị run nên trước nhất em nghĩ ở lớp là phải tập cho các bạn nói nhiều, thảo luận nhiều, thực hành rất nhiều hơn để các bạn tự tin. (U2.T2)

306 Em nghĩ là có. Tại qua cái đợt đó em học từ những người thi, mình học từ cái người gíao viên đứng lớp đó. Em sẽ dạy cho các sinh viên cái trình tự buổi thi như thế nào chẳng hạn, các câu hỏi sẽ như thế nào, mình sẽ dùng những mẫu câu gì. Rồi với cái dạng đề đó, câu hỏi đó có cái cách trả lời nào để mình ghi điểm. (U1.T3)


309 Cái của sinh viên Việt Nam mình làm không được trong mind map nó giống như là take turn. Giống như là một em nói thích ăn kem, cái em kia nói lại là thích uống Pepsi. Rồi cứ hai người qua lại vậy đến hết giờ rồi nó chán à có nghĩa là cứ vậy vậy á. Ví dụ tại sao bạn thích uống Pepsi, tại sao bạn thích ăn loại kem đó mà không thích ăn kem này thì nó sẽ vui hơn thì mình sẽ tạo được cái điều đó. Cái đó cũng là thiếu trong cái kỹ năng của sinh viên mà tới lớn mình cũng bị như vậy á. Khi anh nói chuyện với Tây anh cũng sẽ thấy có nghĩa là Tây họ nói chuyện họ nói rõ cái quan điểm của họ còn mình thì mình ngại còn nhiều khi mình hỏng dám nói là mình không đồng ý cái quan điểm đó. Chắc cái đó cũng là văn hóa đó anh. Văn hóa của mình là không muốn disagree một cách frankly. (U2.T3)