List of Figures
List of Tables
List of Abbreviations

Chapter 1 Introduction
1.1 Context of the research
1.2 Research questions
1.2.1 Research question 1 (Study one)
1.2.2 Research question 2 (Study two)
1.2.3 Research question 3 (Study three)
1.3 Research design overview
1.3.1 Study one: traditional scoring
1.3.2 Study two: confidence scoring
1.3.3 Study three: traditional scoring and confidence scoring
1.4 Potential contribution

Chapter 2 Review of the Literature
2.1 Early development: linking scores to expert experience
2.1.1 Expert experience: the "native speaker" benchmark
2.1.2 Practice perspective: (I)ELTS (1986 & 1989)
2.2 Major contribution: linking scores to rater perception
2.2.1 Teacher/Rater interpretation: "scaling descriptors"
2.2.2 Rater judgment: "binary comparisons"
2.2.3 Practice perspective: IELTS revision (1998-2001)
2.3 Work in progress: linking scores to candidate performance
2.3.1 Identifying features from rater perception
2.3.2 Identifying features from documents/rating scales
2.3.3 Practice perspective: TOEFL iBT and IELTS (operational)
2.4 The L2 Chinese context and identifying L2 Chinese features
2.4.1 Pronunciation
2.4.2 Fluency
2.4.3 Vocabulary
2.4.4 Grammar
2.5 Traditional scoring and problems of "indistinction" and "overlap"
2.6 Summary

Chapter 3 Study One: Traditional Scoring
3.1 Introduction
3.1.1 Traditional scoring
3.1.2 Research question
3.2 Method
3.2.1 Instrument: an L2 Chinese speaking test
3.2.2 Participants
3.2.3 Coding
3.2.4 Statistical analysis
3.3 Results
3.3.1 Correlations
3.3.2 Standard multiple regression
3.4 Discussion
3.5 Summary

Chapter 4 Study Two: Confidence Scoring
4.1 Introduction
4.1.1 Confidence scoring
4.1.2 Research question
4.2 Confidence scoring design
4.2.1 Raw confidence scores of adjacent levels
4.2.2 Raw confidence scores from different scales
4.2.3 Raw confidence scores to a confidence score
4.2.4 Score interpretation and use
4.3 Pilot study
4.3.1 Candidates and instruments
4.3.2 Coding system
4.3.3 Confidence scores and traditional scores
4.4 Discussion
4.5 Summary

Chapter 5 Study Three: Traditional Scoring and Confidence Scoring
5.1 Introduction
5.1.1 Mixed methods: the convergent parallel design
5.1.2 Research question
5.2 Method
5.2.1 Quantitative score data
5.2.2 Qualitative interview data
5.3 Analysis
5.3.1 Quantitative data analysis
5.3.2 Qualitative data analysis
5.4 Results and findings
5.4.1 Quantitative results
5.4.2 Qualitative findings
5.5 Discussion
5.6 Summary

Chapter 6 General Discussion and Conclusion
6.1 Study one: traditional scoring
6.1.1 Constructing rating scales based on candidate performance
6.1.2 Establishing a potential alignment of L2 speaking tests
6.2 Study two: confidence scoring
6.2.1 Applying confidence scoring in other educational contexts
6.2.2 Developing a computation package for confidence scoring
6.3 Study three: traditional scoring and confidence scoring
6.4 Limitations
6.5 Conclusion
6.6 Future agendas: where are we heading
6.6.1 Investigating more features representing the construct
6.6.2 Applying confidence scoring to different contexts
6.6.3 Combining automated scoring and raters' scoring

References

Appendices
Appendix 1 Holistic rating scale for traditional scoring
Appendix 2 The L2 Chinese speaking test
Appendix 3 Histograms
Appendix 4 Scatterplots
Appendix 5 Correlation matrix
Appendix 6 Histograms and scatterplots for the residuals (Study one)
Appendix 7 Center of gravity (COG) computation details
Appendix 8 Rating scales (used in Study two and Study three)
Appendix 9 Histograms and scatterplots for the residuals (Study three)
Appendix 10 Instructions for using the computation package for confidence scoring