Tuesday 8 October 2013

The validity of automated scoring software and its application in ELT contexts

This was the title of the closing plenary at this year's VUS-TESOL conference, given by Professor Timothy L. Farnsworth.  What follows is a summary of what he had to say.





What is automated scoring?
  • Computer software that automatically assigns scores to writing or speaking samples.
  • Essays can be assigned scores instantly by computer.
  • Test takers can call a testing centre and take an oral test without speaking to a human.
  • Scores can be reported instantly.
  • Some level of feedback is given to test takers.
  • There is a variety of software available.
How does a computer grade a test?

1.  Natural Language Processing (NLP)
  • software identifies and counts linguistic features.
  • software does not attempt to gauge content in any way.
  • used for testing writing.
2.  Speech recognition
  • software compares the speech sample to a large database of samples of the same test questions.
  • faster responses are treated as 'more fluent'.
  • used for testing speaking.
E-rater (ETS)
  • automated scoring of timed essays
  • uses NLP
  • currently used in a limited way to rate TOEFL and GRE
  • used for formative assessment (e.g. TOEFL practice online)
  • individual assessment
  • students submit essays, receive scores and re-write them as many times as they want in order to improve their score
E-rater takes an essay and counts:
  • the number of words
  • the number of sentences
  • the number of paragraphs
  • sentence length
  • the number of unique words used versus the total number of words (lexical diversity)
  • the number of low-frequency words (lexical depth)
  • the number of prompt-specific words (topic appropriateness)
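
The counts listed above are all surface features, so they are easy to sketch in a few lines of Python. This is only an illustration of the idea, not E-rater's actual implementation: the function name, the tokenising rules and the two word lists (`prompt_words` and `common_words`) are assumptions made for the example, and a real system would use far larger frequency lists and more careful sentence splitting.

```python
import re


def surface_features(essay, prompt_words, common_words):
    """Count simple surface features of an essay (illustrative only).

    prompt_words: hypothetical set of lower-case words tied to the essay prompt.
    common_words: hypothetical set of high-frequency words; anything outside
                  it is counted as 'low-frequency'.
    """
    # Paragraphs are taken to be blocks of text separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", essay) if p.strip()]
    # Sentences are split naively after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", essay.strip()) if s]
    # Words are runs of letters, optionally with one internal apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", essay.lower())
    unique_words = set(words)

    return {
        "words": len(words),
        "sentences": len(sentences),
        "paragraphs": len(paragraphs),
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        # Unique words divided by total words (lexical diversity).
        "lexical_diversity": len(unique_words) / len(words) if words else 0.0,
        # Words outside the 'common' list (lexical depth).
        "low_frequency_words": sum(1 for w in words if w not in common_words),
        # Words that also appear in the prompt list (topic appropriateness).
        "prompt_specific_words": sum(1 for w in words if w in prompt_words),
    }
```

Nothing in this sketch looks at what the essay actually says, which is the point: every number can be computed without understanding the text.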
The computer doesn't try to understand the essay, but it does look at grammar:
  • dependent/independent clauses
  • passive voice
  • subject-verb agreement
  • plurals
  • sequencing words
  • logical relations
  • mechanics (punctuation, for example)
What is a good essay according to E-rater?
  • It's long - longer is always better!
  • It has a standard structure.
  • It has many longer sentences with a lot of dependent clauses.
  • It has many explicit organisational words.
  • It has a lot of obscure vocabulary - for example, 'indubitably' would score much higher than 'surely'!
  • It has a wide range of vocabulary.
This is not necessarily a good thing!  Good English writing is often simple, clear and concise.

What does E-rater not notice?
  • Untruths
  • Grammatical errors
  • Lexical errors
  • Flawed arguments
  • Insanity!
Therefore, ETS doesn't use E-rater as the sole scorer for tests.  Rather, it is used in place of a second human rater in order to save money.  More than ten years of research hasn't solved the problems with E-rater - it's incredibly hard to get a computer to understand language!

Criterion

This is an E-rater application designed for in-class use.  Students' essays are instantly scored using E-rater software.  Students are given individual scores and extra resources to refer to about their errors.

Versant

This is the first fully automated oral language test used commercially.  It is a Pearson product.  The test is taken in a computer lab or over the phone (speaking to a computer).  The computer automatically rates the speech and produces scores.  It is used widely in business and increasingly in schools.  There are many versions with multiple uses and languages - for the aviation industry, for example.

The test is fifteen minutes long and includes:
  • repeating sentences
  • scrambled sentences
  • oral multiple choice
All responses are totally scripted with only one possible right answer.  There is an optional 'free response' answer, but this is not scored.  Answers are scored on:
  • fluency
  • pronunciation
  • sentence mastery
  • vocabulary
  • grammar
Speech is captured by microphone and compared to a large database of human-scored responses.  The database includes responses from native speakers from different countries, and English learners from different countries and of all proficiency levels.  Each response is scored according to the database responses it is most similar to.
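
One simple way to picture scoring by similarity to human-scored samples is a nearest-neighbour lookup. The sketch below is an assumption made for illustration, not a description of how Versant works internally: the function name, the idea of reducing each response to a tuple of numbers, and the choice of `k` are all hypothetical.

```python
import math


def knn_score(features, scored_samples, k=3):
    """Average the human scores of the k stored samples nearest to `features`.

    features:       tuple of numbers describing the new response
                    (a hypothetical representation, e.g. speech rate).
    scored_samples: list of (feature_tuple, human_score) pairs.
    """
    if not scored_samples:
        raise ValueError("scored_samples must not be empty")
    # Sort stored samples by Euclidean distance to the new response.
    nearest = sorted(scored_samples, key=lambda s: math.dist(features, s[0]))[:k]
    return sum(score for _, score in nearest) / len(nearest)
```

Under this kind of scheme a response can only ever be rated in terms of responses already in the database, which is why the range of speakers included in it matters so much.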

What is a good Versant response?
  • It's fast (fluency score)
  • It's clear
  • It's accurate
  • It has native-like pronunciation
This last criterion is the most contentious.  We talk about 'global English' now and, for most of us, comprehensibility is much more important than native-like speech.

What Versant doesn't measure:
  • the range of vocabulary used
  • extended speaking
  • pragmatics - cultural awareness, for example
  • the ability to interact with others
Advantages of these systems

Reliability
  • computers don't get tired
  • computers aren't biased for or against individuals
  • scores are more consistent than with human raters
Practicality
  • it's less expensive than using human raters
  • scores and feedback are obtained instantly
Research shows that when test takers are 'acting in good faith', scores are roughly equivalent to those of human raters.  Even though the scores are very similar, however, they are arrived at in very different ways.

Problems

Automated tests can be 'gamed' or tricked.  Versant scores, for example, can be quickly raised by coaching.

Positive effects on teaching
  • Students can get more and faster feedback.
Negative effects on teaching
  • The form of the test can influence what happens in the classroom.
  • Teachers tend to focus on what is tested at the expense of communicative teaching.
  • There can be a decreased focus on the quality of the content.
  • There can be an increased focus on grammatical accuracy and low-frequency vocabulary.
  • There is more oral repetition in order to increase the students' speed of response.
  • There is less time spent on developing critical thinking.
  • There is a decreased focus on the pragmatic.
To conclude

Despite the obvious drawbacks, computer-scored testing is in all our futures.
