The Taipei Times has published articles on the alleged decline in the English proficiency of students in Taiwan. “Failing in English,” by Eileen Tan (March 27, 2007), and Hugo Tsang’s “English scores low in college test” (July 24, 2007) are commentaries on the quality of the essays Taiwanese students write for college entrance examinations. The writers allege a serious decline in English proficiency; however, there appear to be no rigorous analyses comparing the literacy standards and performance of today’s students in Taiwan with those of previous years to support such accusations.
In fact, there is no evidence of a decline in the English proficiency of Taiwanese students; a study of scores from the Test of English as a Foreign Language (TOEFL) and from monthly English examinations in Taiwan’s senior high schools suggests that Taiwanese students rank satisfactorily, and even outperform students in other Asian countries. In performance analyses, however, Taiwanese students show poor English literacy: despite having studied the language for six years or more, they can hardly converse with foreigners, or experience great difficulty doing so. Many of these high school graduates are also unable to write a single sentence in English.
Taiwanese secondary schools adopt different combinations of examination types to assess student performance. The problem with non-uniform assessment examinations is the possible inconsistency between what a test measures and what it claims or purports to measure. Because the examinations are not uniform and examination scores are based on arbitrary levels, committees can raise or lower test standards inconsistently with the lessons given to students, with the standard academic content for a given grade level, and with the teaching methods used.
The validity of such examinations, and of the scores they yield, is therefore low. To address this issue, validation studies of examinations in Taiwan’s secondary schools must be carried out to determine whether the English proficiency of Taiwanese students has in fact declined and whether teaching methods should be replaced with ones that emphasize communication.

Rationale of the Study

Testing and assessment are part of daily life: schoolchildren are regularly assessed to monitor their educational progress, and governments use assessments to evaluate the quality of school systems.
Adults are tested to determine whether they are suitable for a specific job they have applied for, or whether employees have the skills necessary for a promotion. Entrance tests are administered by educational institutions and even by sovereign states. Tests play a fundamental role in allocating access to the world’s limited resources and opportunities. The importance of understanding what a test is, how it is constructed, and the impact that the use of tests has on individuals and society at large cannot be overstated.
In this study’s context, language testing is a core issue in foreign or second-language education. The practice of language testing draws upon, and also contributes to, all disciplines within applied linguistics. However, there is something fundamentally different about language testing. Language testing is all about building better tests, researching how to build better tests, and in so doing, understanding better the things that are assessed. Sociolinguists do not create sociolinguistic things; discourse analysts do not create discourses; phonologists do not create spoken utterances.
Language testing, in contrast, is about creating quality tests. The United Evening News cited research by the British Council stating that Taiwan’s English-learning environment faces five main obstacles to effective English learning: first, there are no standardized teaching materials; second, there is no communication or guidance to improve students’ weak points; third, teaching design is restrictive; fourth, teachers and materials do not meet international standards; and fifth, there is a lack of sufficient learning facilities (Taipei Times, 2007).
The validity problems most often observed in English tests are poor item writing, an inadequate number of items, and a lack of item-analysis procedures, pilot testing, validity analysis, and reliability studies.

Background

A Taiwanese secondary school administers monthly examinations (achievement tests for senior students), which take place three times in every semester of a tri-semester school year. A monthly examination is conducted after every six or seven weeks of school days and measures student achievement on the English lessons given within that period.
A computer software system processes the multiple-choice items on each student’s examination paper and prints out a report for each class. The report includes student information, the student’s answers, and whether each item is correct or incorrect. The software can also estimate the examination’s level of difficulty from the collective number of correct and incorrect answers of all students; from these statistics, a score curve can be derived, and the mean and standard deviation are shown to report each class’s performance in the subject.
Non-multiple-choice items, such as translation and vocabulary items whose answers students write out, require human coding and are corrected manually by teachers. The teachers calculate final scores by adding the scores from the computer-processed examination papers to those from the handwritten ones. A final report is given to each student; it includes information about the class and its performance. Each class has its own report, in which all the students’ names in that class are listed with their rankings both in the class and in the batch.
The students’ performances follow a normal distribution divided into thirds: the higher range, the average, and the lower range. Statistics such as the mean and standard deviation are also included in the report. The senior high school monthly examination is an achievement test that is both norm- and criterion-referenced: teachers specify particular grammar points and a fixed vocabulary within the designated lessons for students to master.

Analysis of Variance
In statistics, analysis of variance (ANOVA) is a collection of statistical models, and their associated procedures, in which the observed variance is partitioned into components attributed to different explanatory variables. For instance, if a group of testers wanted to test the validity of an English monthly examination for Taiwanese senior high school students, they could administer both the new test and the TOEFL to a large group of students and then calculate the degree of correlation between the two tests.
This method is criterion-related. Content validity requires the systematic examination of the degree to which a set of test items approximate the content or abilities the test is designed to assess. The items are always written to match the course objectives. The items should match the content and skills taught in the courses; hence, content validity is an integral part of the item development process. Construct validity is the experimental demonstration that a test is measuring the construct it claims to be measuring.
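As an illustration of the criterion-related approach described above, the correlation between scores on the monthly examination and scores on an external criterion such as the TOEFL could be computed as follows. This is a minimal sketch; all scores are invented for illustration.

```python
# Criterion-related validity sketch: correlate monthly-exam scores with an
# external criterion (here, hypothetical TOEFL scores for the same students).
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

monthly_scores = [62, 75, 80, 55, 90, 68, 73]         # invented monthly-exam scores
toefl_scores = [510, 560, 600, 480, 640, 530, 555]    # invented TOEFL scores

r = pearson_r(monthly_scores, toefl_scores)
print(f"criterion correlation r = {r:.2f}")
```

A high correlation would suggest the monthly examination measures something similar to the established criterion.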
A differential-group study is an experiment in which test performances are compared between two groups: one that possesses the construct and one that does not. If the group that possesses the construct performs better than the one that does not, the result supports the construct validity of the test. There is also the strategy called an intervention study, “wherein a group that is weak in the construct is measured using the test, then taught the construct, and measured again.
If a non-trivial difference is found between the pretest and posttest, that difference can be said to support the construct validity of the test” (Brown, 2000). Another approach to reliability estimation is the standard error of measurement, which estimates the distribution of errors around a particular score or cut point. These statistics serve primarily for reference and comparison.

Validation Study of Monthly Exams in English

The most effective validation study of the monthly English examinations in a Taiwanese senior high school is the intervention study.
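The standard error of measurement mentioned above follows the classical formula SEM = SD × √(1 − reliability). A minimal sketch, with an invented standard deviation and reliability coefficient:

```python
# Standard error of measurement (SEM) sketch; the SD and reliability
# coefficient below are hypothetical values, not data from the study.
import math

def standard_error_of_measurement(sd, reliability):
    """Classical SEM: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

sd = 12.0           # hypothetical standard deviation of exam scores
reliability = 0.84  # hypothetical reliability coefficient
sem = standard_error_of_measurement(sd, reliability)
print(f"SEM = {sem:.2f} points")
```

A band of roughly ± one SEM around an observed score gives an informal sense of how much that score might vary on retesting.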
One-way analysis of variance for repeated measures is used when the same subjects are measured under each treatment. The teachers will implement monthly English examinations throughout the school year, completing checklists in each of the three semesters, continuously collecting reports, and preparing a summary report for each semester. The validation test will be administered twice, in October and in April, to students who failed the monthly English examinations.
The validation tests will have the same content and standard as the examinations given in the regular testing sessions. Students who fail the monthly English examinations will be given remedial and special classes before taking the validation tests, which, as noted above, closely resemble the tests they have already taken and failed. The results of the validation study include all students who take the pretest and posttest administrations.
Students who scored high on the pretest and were exempted from the remedial classes will not take the posttest. In the analyses, such exempted students will be removed from the pretest data, and only students who actually attended remedial classes will be included in the final analysis. Three analyses will be conducted on the cross-sectional data, using teachers’ ratings of student achievement in the regular monthly examinations and the students’ standard scores on the validation tests: 1. correlations comparing students’ standard scores on the various subtests of the validation tests with the checklist and summary-report ratings of student achievement on the regular monthly examinations; 2. four-step hierarchical regressions examining the factors that account for the variance in the students’ validation-test scores; and 3. Receiver-Operating-Characteristic (ROC) curves, which determine whether a random pair of average and below-average scores on the validation tests is ranked correctly in terms of performance on the regular monthly examinations.
Evidence for the concurrent validity of the regular monthly examinations will be examined by computing correlations between the regular monthly examination subscale scores and the students’ standard subtest and broad scores on the validation tests, to show the amount of shared variance between the two assessments. Correlations of .70 to .75 are considered optimal, because they indicate a substantial overlap between the two assessments while recognizing that each instrument contributes independently to the assessment of students’ learning.
If correlations are high, that is, greater than .80, more than half of the variance between the regular monthly examinations and the validation tests is shared, and an argument can be made that the predictor, the validation tests, does not add enough new information to justify its use. Conversely, low correlations, those below .30, suggest very little overlap between the regular monthly examinations and the validation and other conventional achievement tests, raising the question of what exactly the predictor measures.
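These interpretive thresholds can be summarized in a small helper. This is a sketch of the rule of thumb stated above (the function name and labels are ours, not standard terminology); the shared variance is simply r squared.

```python
# Interpreting a validity correlation using the thresholds stated in the text:
# below .30 = little overlap, .70-.75 = optimal, above .80 = excessive overlap.
def interpret_validity_correlation(r):
    shared_variance = r ** 2  # proportion of variance the two assessments share
    if r < 0.30:
        verdict = "little overlap: unclear what the predictor measures"
    elif r > 0.80:
        verdict = "excessive overlap: predictor adds little new information"
    elif 0.70 <= r <= 0.75:
        verdict = "optimal: substantial overlap, independent contribution"
    else:
        verdict = "moderate overlap"
    return shared_variance, verdict

shared, verdict = interpret_validity_correlation(0.72)
print(f"shared variance = {shared:.0%} -> {verdict}")
```

Note that an r of .80 already implies 64% shared variance, which is why correlations above that level call the predictor's added value into question.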
Four-step hierarchical regression analyses will be used to determine whether the regular monthly examination checklists and summary-report ratings make a unique contribution to the students’ performance on the validation tests over and above the effects of students’ gender, age, socioeconomic status, and initial performance level on the regular monthly examinations. The demographic variables are entered in the first step of the four-step model, the regular monthly examination checklists in the second step, and the summary report in the third. In the final step, the students’ performance levels on the validation tests are entered.
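The hierarchical logic of entering predictor blocks step by step, and recording the increment in explained variance (ΔR²) at each step, can be sketched as follows. All data are randomly generated stand-ins, and NumPy's least-squares routine stands in for a full regression package.

```python
# Four-step hierarchical regression sketch: predictor blocks are added in
# order and the change in R^2 per step is recorded. All data are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 40
demographics = rng.normal(size=(n, 3))  # step 1: gender, age, SES (hypothetical)
checklist = rng.normal(size=(n, 1))     # step 2: monthly-exam checklist ratings
summary = rng.normal(size=(n, 1))       # step 3: summary-report ratings
performance = rng.normal(size=(n, 1))   # step 4: performance-level ratings
# Outcome (validation-test score), built so the later blocks carry signal.
y = (0.5 * checklist[:, 0] + 0.4 * summary[:, 0]
     + 0.3 * performance[:, 0] + rng.normal(scale=0.5, size=n))

def r_squared(X, y):
    """R^2 of an OLS fit with intercept, via least squares."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

blocks, prev, X = [demographics, checklist, summary, performance], 0.0, None
for step, block in enumerate(blocks, start=1):
    X = block if X is None else np.column_stack([X, block])
    r2 = r_squared(X, y)
    print(f"step {step}: R^2 = {r2:.3f}, delta R^2 = {r2 - prev:.3f}")
    prev = r2
```

A large ΔR² at steps 2-4 would indicate that the examination ratings explain variance beyond the demographic covariates.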
The increment in variance explained will be noted at each step in order to assess the contribution of the regular monthly examinations, and of the performance level on the validation tests, above and beyond the demographic factors. Receiver-Operating-Characteristic (ROC) curve analysis will be conducted to study the utility of using the regular monthly examinations to classify students in need of supportive educational services or remedial classes. ROC data enable investigators to examine whether two different assessments assign students to the same or different categories. ROC percentages above .80 are considered excellent. To accomplish this, the investigators will establish cutoffs for the validation tests and perform a cost-matrix analysis to obtain optimal cutoffs for the regular monthly examinations. The validation tests are commonly used in subject assessments with students suspected of having learning disabilities or language problems that might affect their academic success. This analysis will enable the investigators to determine the probability that regular monthly examination ratings can accurately assign students to a high-risk or low-risk group.
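The ROC statistic described here is equivalent to a pairwise-ranking probability: the proportion of (average, below-average) student pairs whose regular monthly-exam scores rank them in the same order as the validation test did. A minimal sketch with invented scores:

```python
# ROC area sketch: probability that a randomly chosen pair of students
# (one average, one below average on the validation test) is ranked
# correctly by their monthly-exam scores. Ties count as half-correct.
def roc_auc(at_risk_scores, not_at_risk_scores):
    correct = 0.0
    for hi in not_at_risk_scores:
        for lo in at_risk_scores:
            if hi > lo:
                correct += 1.0
            elif hi == lo:
                correct += 0.5
    return correct / (len(at_risk_scores) * len(not_at_risk_scores))

# Invented monthly-exam scores for two validation-test groups.
below_avg_monthly = [48, 52, 55, 50]  # below average on the validation test
average_monthly = [68, 72, 55, 75]    # average on the validation test

auc = roc_auc(below_avg_monthly, average_monthly)
print(f"ROC area = {auc:.2f}")
```

By the .80 criterion above, an area like this would count as excellent discrimination between the two groups.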
If there is a substantial difference between the regular monthly examination scores and the validation test scores, it suggests that the regular monthly examinations are valid, since they adequately assessed student performance before the remedial classes the students were required to undergo. If, however, there is minimal difference between the regular monthly examination scores and the validation test scores, it suggests that the regular monthly examinations are invalid, given that the students have already undergone remedial classes to strengthen the construct, in this case English proficiency.
The accuracy of this validation study requires that the students have no clinical learning disabilities and that the English teachers be competent and use effective teaching methods.

Logistic Requirements

The intervention study for validating the monthly English examinations in a Taiwanese senior high school will require at least 50% of the senior batch population, or at least 100 students. The 100 students who receive below-average grades will attend remedial classes before a validation test is administered. The researchers will draw sample units systematically so that conclusions about the test can be validly inferred for the entire population.
For a month of English remedial classes for 100 students, an estimated 4,000 Australian dollars will be needed for teachers’ compensation and academic materials. Processing the validation test results will require an estimated 2,000 Australian dollars, covering both computer and manual encoding. Data collection will cost about 200 Australian dollars, and conducting the analyses will require an estimated 4,000 Australian dollars for tools and professional fees, for a total of roughly 10,200 Australian dollars.

Facilities and Equipment

The frequency method will be used in the validation study of the monthly English examinations in a Taiwanese senior high school.
Researchers will tally and tabulate the scores of the respondents, in this case the students. The remedial classes will require classrooms conducive to language learning. Each classroom will have a complete computer setup with audio and video facilities, and the rooms will be stocked with books written in English. Language testers use state-of-the-art computer technology to measure test validity with maximum accuracy and efficiency. Experts in computational linguistics, computer engineering, and psychometrics will be supported by the appropriate computer hardware and software for language testing.
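The tallying and tabulating step of the frequency method amounts to counting how many students fall into each score band. A minimal sketch with invented scores (the band width is our assumption):

```python
# Frequency-method sketch: tally validation-test scores into 10-point bands.
# All scores are invented for illustration.
from collections import Counter

scores = [45, 52, 58, 61, 67, 72, 74, 78, 81, 90, 55, 63]

def frequency_table(scores, band_width=10):
    """Count scores per band; keys are the lower bound of each band."""
    bands = Counter((s // band_width) * band_width for s in scores)
    return dict(sorted(bands.items()))

for lower, count in frequency_table(scores).items():
    print(f"{lower}-{lower + 9}: {'#' * count} ({count})")
```

The resulting table shows at a glance where the bulk of a class's scores lie, which is the information the researchers tabulate.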
This equipment will be used for paper-based and computer-based or computer-adaptive language tests. Outside experts will perform the validation study and analyses so that independent judgments on the validity test are obtained.

Possible Problems and Solutions

Even a well-developed test will be unreliable if the range of ability in the testing population is restricted. Because the exempted students, who by definition scored high on the pretest, do not figure into the posttest results, they will have the effect of diminishing the observed differences.
Although ANOVA can be justified mathematically, part of the study cannot have theoretical proofs and can be demonstrated only empirically. A problem may occur when a test based on the normal distribution is used to analyze data from variables that are not, by nature, normally distributed. A large number of samples will be generated by computer following pre-designed specifications, and the results from those samples will be analyzed using a variety of tests. The paper-based tests, those corrected manually by teachers, can only be examined empirically, and there may be theoretical assumptions of the tests that are not met by the data.
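One empirical way to probe such distributional assumptions is to simulate repeated sampling from a population with known parameters and observe how the sample estimates behave. A minimal sketch, with all parameters invented:

```python
# Simulation sketch: draw many samples from a normal population with known
# mean and SD, then check that the sample means and their spread behave as
# theory predicts. All parameters are hypothetical.
import random
import statistics

random.seed(42)
POP_MEAN, POP_SD = 70.0, 12.0   # hypothesized population parameters
N_SAMPLES, SAMPLE_SIZE = 500, 30

means = []
for _ in range(N_SAMPLES):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    means.append(statistics.mean(sample))

avg_mean = statistics.mean(means)       # should hover near POP_MEAN
empirical_se = statistics.stdev(means)  # should approach POP_SD / sqrt(SAMPLE_SIZE)
print(f"average sample mean = {avg_mean:.2f}, empirical SE = {empirical_se:.2f}")
```

Replacing the normal generator with a skewed one would show how sensitive a normality-assuming analysis is to violations of that assumption.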
As a solution, Monte Carlo studies will be used extensively with normal-distribution-based tests to determine how sensitive the validation tests are to violations of the assumption that the analyzed variables are normally distributed in the population. In Monte Carlo studies, data are generated from a population with hypothesized parameter values; a large number of samples are drawn, a model is estimated for each sample, and parameter values and standard errors are averaged over the samples.

Conclusion
Although the performance-based testing approach has been widely challenged, especially on issues of validity and reliability, there is a consensus that performance assessment is valuable for measuring job applicants’ language proficiency in vocational situations, as well as for motivating language learners to make a greater effort to develop communicative language ability (Jones, 1979). There is thus a useful purpose in administering validity tests. A unified and expanded theory of validity presented by Samuel Messick includes the consequential and evidential bases of test interpretation and use.