New software described in this New York Times story allows teachers to leave essay grading to the computer. It was developed by EdX, the nonprofit organization that was founded jointly by Harvard University and the Massachusetts Institute of Technology and that will give the software to other schools for free. The story says that the software “uses artificial intelligence to grade student essays and short written answers.”
Multiple-choice exams have, of course, been graded by machine for a long time, but essays are another matter. Can a computer program accurately capture the depth, beauty, structure, relevance, creativity, etc., of an essay? The National Council of Teachers of English says, unequivocally, “no” in its newest position statement. Here it is:
Machine Scoring Fails the Test
[A] computer could not measure accuracy, reasoning, adequacy of evidence, good sense, ethical stance, convincing argument, meaningful organization, clarity, and veracity in your essay. If this is true I don’t believe a computer would be able to measure my full capabilities and grade me fairly. — Akash, student
[H]ow can the feedback a computer gives match the carefully considered comments a teacher leaves in the margins or at the end of your paper? — Pinar, student
(Responses to New York Times The Learning Network blog post, “How Would You Feel about a Computer Grading Your Essays?”, 5 April 2013)
Writing is a highly complex ability developed over years of practice, across a wide range of tasks and contexts, and with copious, meaningful feedback. Students must have this kind of sustained experience to meet the demands of higher education, the needs of a 21st-century workforce, the challenges of civic participation, and the realization of full, meaningful lives.
As the Common Core State Standards (CCSS) sweep into individual classrooms, they bring with them a renewed sense of the importance of writing to students’ education. Writing teachers have found many aspects of the CCSS to applaud; however, we must be diligent in developing assessment systems that do not threaten the possibilities for the rich, multifaceted approach to writing instruction advocated in the CCSS. Effective writing assessments need to account for the nature of writing, the ways students develop writing ability, and the role of the teacher in fostering that development.
Research (1) on the assessment of student writing consistently shows that high-stakes writing tests alter the normal conditions of writing by denying students the opportunity to think, read, talk with others, address real audiences, develop ideas, and revise their emerging texts over time. Often, the results of such tests can affect the livelihoods of teachers, the fate of schools, or the educational opportunities for students.
In such conditions, the narrowly conceived, artificial form of the tests begins to subvert attention to other purposes and varieties of writing development in the classroom. Eventually, the tests erode the foundations of excellence in writing instruction, resulting in students who are less prepared to meet the demands of their continued education and future occupations. Especially in the transition from high school to college, students are ill served when their writing experience has been dictated by tests that ignore the ever-more complex and varied types and uses of writing found in higher education.
Note: (1) All references to research are supported by the extensive work documented in the annotated bibliography attached to this report.
These concerns — increasingly voiced by parents, teachers, school administrators, students, and members of the general public — are intensified by the use of machine-scoring systems to read and evaluate students’ writing. To meet the outcomes of the Common Core State Standards, various consortia, private corporations, and testing agencies propose to use computerized assessments of student writing. The attraction is obvious: once programmed, machines might reduce the costs otherwise associated with the human labor of reading, interpreting, and evaluating the writing of our students. Yet when we consider what is lost because of machine scoring, the presumed savings turn into significant new costs — to students, to our educational institutions, and to society. Here’s why:
- Computers are unable to recognize or judge those elements that we most associate with good writing (logic, clarity, accuracy, ideas relevant to a specific topic, innovative style, effective appeals to audience, different forms of organization, types of persuasion, quality of evidence, humor or irony, and effective uses of repetition, to name just a few). Using computers to “read” and evaluate students’ writing (1) denies students the chance to have anything but limited features recognized in their writing; and (2) compels teachers to ignore what is most important in writing instruction in order to teach what is least important.
- Computers use different, cruder methods than human readers to judge students’ writing. For example, some systems gauge the sophistication of vocabulary by measuring the average length of words and how often the words are used in a corpus of texts; or they gauge the development of ideas by counting the length and number of sentences per paragraph.
- Computers are programmed to score papers written to very specific prompts, reducing the incentive for teachers to develop innovative and creative occasions for writing, even for assessment.
- Computers get progressively worse at scoring as the length of the writing increases, compelling test makers to design shorter writing tasks that don’t represent the range and variety of writing assignments needed to prepare students for the more complex writing they will encounter in college.
- Computer scoring favors the most objective, “surface” features of writing (grammar, spelling, punctuation), but problems in these areas are often created by the testing conditions and are the most easily rectified in normal writing conditions when there is time to revise and edit. Privileging surface features disproportionately penalizes nonnative speakers of English who may be on a developmental path that machine scoring fails to recognize.
- Conclusions that computers can score as well as humans are the result of humans being trained to score like the computers (for example, being told not to make judgments on the accuracy of information).
- Computer scoring systems can be “gamed” because they are poor at working with human language, further weakening the validity of their assessments and separating students not on the basis of writing ability but on whether they know and can use machine-tricking strategies.
- Computer scoring discriminates against students who are less familiar with using technology to write or complete tests. Further, machine scoring disadvantages school districts that lack funds to provide technology tools for every student and skews technology acquisition toward devices needed to meet testing requirements.
- Computer scoring removes the purpose from written communication — to create human interactions through a complex, socially consequential system of meaning making — and sends a message to students that writing is not worth their time because reading it is not worth the time of the people teaching and assessing them.
What Are the Alternatives?
Together with other professional organizations, the National Council of Teachers of English has established research-based guidelines for effective teaching and assessment of writing, such as the Standards for the Assessment of Reading and Writing (rev. ed., 2009), the Framework for Success in Postsecondary Writing (2011), the NCTE Beliefs about the Teaching of Writing (2004), and the Framework for 21st Century Curriculum and Assessment (2008, 2013). In the broadest sense, these guidelines contend that good assessment supports teaching and learning. Specifically, high-quality assessment practices will
- encourage students to become engaged in literacy learning, to reflect on their own reading and writing in productive ways, and to set respective literacy goals;
- yield high-quality, useful information to inform teachers about curriculum, instruction, and the assessment process itself;
- balance the need to assess summatively (make final judgments about the quality of student work) with the need to assess formatively (engage in ongoing, in-process judgments about what students know and can do, and what to teach next);
- recognize the complexity of literacy in today’s society and reflect that richness through holistic, authentic, and varied writing instruction;
- at their core, involve professionals who are experienced in teaching writing, knowledgeable about students’ literacy development, and familiar with current research in literacy education.
A number of effective practices enact these research-based principles, including portfolio assessment; teacher assessment teams; balanced assessment plans that involve more localized (classroom- and district-based) assessments designed and administered by classroom teachers; and “audit” teams of teachers, teacher educators, and writing specialists who visit districts to review samples of student work and the curriculum that has yielded them. We focus briefly here on portfolios because of the extensive scholarship that supports them and the positive experience that many educators, schools, and school districts have had with them.
Engaging teams of teachers in evaluating portfolios at the building, district, or state level has the potential to honor the challenging expectations of the CCSS while also reflecting what we know about effective assessment practices. Portfolios offer the opportunity to
- look at student writing across multiple events, capturing growth over time while avoiding the limitations of “one test on one day”;
- look at the range of writing across a group of students while preserving the individual character of each student’s writing;
- review student writing through multiple lenses, including content accuracy and use of resources;
- assess student writing in the context of local values and goals as well as national standards.
Just as portfolios provide multiple types of data for assessment, they also allow students to learn as a result of engaging in the assessment process, something seldom associated with more traditional one-time assessments. Students gain insight about their own writing, about ways to identify and describe its growth, and about how others — human readers — interpret their work. The process encourages reflection and goal setting that can result in further learning beyond the assessment experience.
Similarly, teachers grow as a result of administering and scoring the portfolio assessments, something seldom associated with more traditional one-time assessments. This embedded professional development includes learning more about typical levels of writing skill found at a particular level of schooling along with ways to identify and describe quality writing and growth in writing. The discussions about collections of writing samples and criteria for assessing the writing contribute to a shared investment among all participating teachers in the writing growth of all students.
Further, when the portfolios include a wide range of artifacts from learning and writing experiences, teachers assessing the portfolios learn new ideas for classroom instruction as well as ways to design more sophisticated methods of assessing student work on a daily basis.
Several states such as Kentucky, Nebraska, Vermont, and California have experimented with the development of large-scale portfolio assessment projects that make use of teams of teachers working collaboratively to assess samples of student work. Rather than investing heavily in assessment plans that cannot meet the goals of the CCSS, various legislative groups, private companies, and educational institutions could direct those funds into refining these nascent portfolio assessment systems. This investment would also support teacher professional development and enhance the quality of instruction in classrooms — something that machine-scored writing prompts cannot offer.
In 2010, the federal government awarded $330 million to two consortia of states “to provide ongoing feedback to teachers during the course of the school year, measure annual school growth, and move beyond narrowly focused bubble tests” (United States Department of Education).
Further, these assessments will need to align to the new standards for learning in English and mathematics. This has proven to be a formidable task, but it is achievable. By combining the already existing National Assessment of Educational Progress (NAEP) assessment structures for evaluating school system performance with ongoing portfolio assessment of student learning by educators, we can cost-effectively assess writing without relying on flawed machine-scoring methods. By doing so, we can simultaneously deepen student and educator learning while promoting grass-roots innovation at the classroom level. For a fraction of the cost in time and money of building a new generation of machine assessments, we can invest in rigorous assessment and teaching processes that enrich, rather than interrupt, high-quality instruction. Our students and their families deserve it, the research base supports it, and literacy educators and administrators will welcome it.
ERIC Identifier: ED458290
Publication Date: 2001-12-00
Author: Rudner, Lawrence - Gagne, Phill
Source: ERIC Clearinghouse on Assessment and Evaluation College Park MD.
An Overview of Three Approaches to Scoring Written Essays by Computer. ERIC Digest.
It is not surprising that extended-response items, typically short essays, are now an integral part of most large-scale assessments. Extended-response items provide an opportunity for students to demonstrate a wide range of skills and knowledge, including higher order thinking skills such as synthesis and analysis. Yet assessing students' writing is one of the most expensive and time-consuming activities for assessment programs. Prompts need to be designed, rubrics created, and raters trained, and then the extended responses need to be scored, typically by multiple raters. With different people evaluating different essays, interrater reliability becomes an additional concern in the writing assessment process. Even with rigorous training, differences in the background, training, and experience of the raters can lead to subtle but important differences in grading.
Computers and artificial intelligence have been proposed as tools to facilitate the evaluation of student essays. In theory, computer scoring can be faster, reduce costs, increase accuracy, and eliminate concerns about rater consistency and fatigue. Further, the computer can quickly re-score materials should the scoring rubric be redefined. This article describes the three most prominent approaches to essay scoring.
The most prominent writing assessment programs are:
- Project Essay Grade (PEG), introduced by Ellis Page in 1966,
- Intelligent Essay Assessor (IEA), first introduced for essay grading in 1997 by Thomas Landauer and Peter Foltz, and
- E-rater, used by Educational Testing Service (ETS) and developed by Jill Burstein.
Descriptions of these approaches can be found at the web sites listed at the end of this article and in Whittington and Hunt (1999) and Wresch (1993).
Page uses a regression model with surface features of the text (document length, word length, and punctuation) as the independent variables and the essay score as the dependent variable. Landauer's approach is a factor-analytic model of word co-occurrences which emphasizes essay content. Burstein uses a regression model with content features as the independent variables.
PEG - PEG grades essays predominantly on the basis of writing quality (Page, 1994). The underlying theory is that there are intrinsic qualities to a person's writing style called trins that need to be measured, analogous to true scores in measurement theory. PEG uses approximations of these variables, called proxes, to measure these underlying traits. Specific attributes of writing style, such as average word length, number of semicolons, and word rarity are examples of proxes that can be measured directly by PEG to generate a grade. For a given sample of essays, human raters grade a large number of essays (100 to 400), and determine values for up to 30 proxes. The grades are then entered as the criterion variable in a regression equation with all of the proxes as predictors, and beta weights are computed for each predictor. For the remaining unscored essays, the values of the proxes are found, and those values are then weighted by the betas from the initial analysis to calculate a score for the essay.
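The regression step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PEG itself: the three proxes (average word length, semicolon count, word count) are stand-ins for the up to 30 that PEG uses, the function names are hypothetical, and the weights are raw least-squares coefficients rather than standardized betas.

```python
import re

def extract_proxes(essay):
    """Return illustrative surface features (proxes) for one essay."""
    words = re.findall(r"[A-Za-z']+", essay)
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return [avg_word_len, float(essay.count(";")), float(len(words))]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_weights(prox_rows, grades):
    """Least-squares weights (intercept first) from human-graded essays."""
    X = [[1.0] + list(row) for row in prox_rows]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * g for r, g in zip(X, grades)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, proxes):
    """Score an unscored essay by weighting its proxes."""
    return weights[0] + sum(w * p for w, p in zip(weights[1:], proxes))
```

In use, `fit_weights` would be called once on the proxes and grades of the human-scored sample, and `predict` would then be applied to the proxes of each remaining essay.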
Page has over 30 years of research consistently showing exceptionally high correlations. In one study, Page (1994) analyzed samples of 495 and 599 senior essays from the 1988 and 1990 National Assessment of Educational Progress using responses to a question about a recreation opportunity: whether a city government should spend its recreation money fixing up some abandoned railroad tracks or converting an old warehouse to new uses. With 20 variables, PEG reached multiple Rs as high as .87, close to the apparent reliability of the targeted judge groups.
IEA - First patented in 1989, IEA was designed for indexing documents for information retrieval. The underlying idea is to identify which of several calibration documents are most similar to the new document based on the most specific (i.e., least frequent) index terms. For essays, the average grade on the most similar calibration documents is assigned as the computer-generated score (Landauer, Foltz, & Laham, 1998).
With IEA, each calibration document is arranged as a column in a matrix. A list of every relevant content term, defined as a word, sentence, or paragraph, that appears in any of the calibration documents is compiled, and these terms become the matrix rows. The value in a given cell of the matrix is an interaction between the presence of the term in the source and the weight assigned to that term. Terms not present in a source are assigned a cell value of 0 for that column. If a term is present, then the term may be weighted in a variety of ways, including a 1 to indicate that it is present, a tally of the number of times the term appears in the source, or some other weight criterion representative of the importance of the term to the document in which it appears or to the content domain overall.
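As a rough illustration of the matrix just described, the sketch below builds a term-by-document matrix from a handful of calibration documents. It makes simplifying assumptions that are not taken from IEA: terms are single lowercased words only, and each cell holds a raw count (one of the weighting options mentioned above). The function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercased word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_term_document_matrix(calibration_docs):
    """Return (terms, matrix): one row per term, one column per document.

    A cell is 0 when the term is absent from that document,
    otherwise the number of times the term appears in it.
    """
    counts = [Counter(tokenize(doc)) for doc in calibration_docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

terms, matrix = build_term_document_matrix(
    ["the war began", "the war ended the union"]
)
for term, row in zip(terms, matrix):
    print(term, row)
```

Replacing the raw count with a 0/1 presence flag or an importance weight changes only the expression that fills each cell.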
Each essay to be graded is converted into a column vector, with the essay representing a new source with cell values based on the terms (rows) from the original matrix. A similarity score is then calculated for the essay column vector relative to each column of the rubric matrix. The essay's grade is determined by averaging the similarity scores from a predetermined number of sources with which it is most similar. Their system also provides a great deal of diagnostic and evaluative feedback. As with PEG, high correlations between IEA scores and human-scored essays have been reported.
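The scoring step might look like the following sketch. Two points are assumptions rather than details from the source: cosine similarity is used as the similarity measure (the text does not name one), and the grade is taken as the average human grade of the k most similar calibration documents, following the earlier statement that the average grade on the most similar calibration documents is assigned. The function names and the default k are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def score_essay(essay_vec, calibration_vecs, calibration_grades, k=2):
    """Average the grades of the k calibration documents most similar to the essay."""
    if not calibration_vecs:
        raise ValueError("at least one calibration document is required")
    ranked = sorted(
        zip(calibration_vecs, calibration_grades),
        key=lambda pair: cosine(essay_vec, pair[0]),
        reverse=True,
    )
    top = ranked[:k]
    return sum(grade for _, grade in top) / len(top)

# Three calibration columns with human grades 5, 3, and 1.
print(score_essay([1, 1, 0], [[1, 1, 0], [1, 0, 0], [0, 0, 1]], [5, 3, 1]))
```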
E-rater - The Educational Testing Service's Electronic Essay Rater (e-rater) is a sophisticated "Hybrid Feature Technology" that uses syntactic variety, discourse structure (like PEG), and content analysis (like IEA). To measure syntactic variety, e-rater counts the number of complement, subordinate, infinitive, and relative clauses and the occurrences of modal verbs (would, could) to calculate ratios of these syntactic features per sentence and per essay. For structure analysis, e-rater uses 60 different features, similar to PEG's proxes. Two indices are created to evaluate the similarity of the target essay's content to the content of calibrated essays. As described by Burstein et al. (1998), in their EssayContent analysis module, the vocabulary of each score category is converted to a single vector whose elements represent the total frequency of each word in the training essays for that holistic score category. The system computes correlations between the vector for a given test essay and the vectors representing the trained categories. The score that is most similar to the test essay is assigned as the evaluation of its content. E-rater's ArgContent analysis module is based on the inverse document frequency, like IEA. The word frequency vectors for the score categories are converted to vectors of word weights. Scores on the different components are weighted using regression to predict human graders' scores.
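The EssayContent idea of one word-frequency vector per score category, compared against the test essay by correlation, can be sketched as follows. This is a toy reading of the description above, not e-rater's implementation: the tokenizer, the use of Pearson correlation over the shared vocabulary, and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercased word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def category_vectors(training):
    """Total word frequencies per holistic score; training is [(essay, score), ...]."""
    totals = {}
    for essay, score in training:
        totals.setdefault(score, Counter()).update(tokenize(essay))
    return totals

def content_score(essay, totals):
    """Return the score category whose vector correlates best with the essay."""
    vocab = sorted(set().union(*totals.values()))
    counts = Counter(tokenize(essay))
    essay_vec = [counts[w] for w in vocab]
    return max(
        totals,
        key=lambda s: pearson(essay_vec, [totals[s][w] for w in vocab]),
    )

totals = category_vectors([
    ("slavery tariffs secession slavery", 6),
    ("cake flour sugar cake", 1),
])
print(content_score("slavery and secession", totals))
```

Words in the test essay that never occur in the training essays are ignored here because they fall outside the shared vocabulary.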
Several studies have reported favorably on PEG, IEA, and e-rater. A review of the research on IEA found that its scores typically correlate as well with human raters as the raters do with each other (Chung & O'Neil, 1997). Research on PEG consistently reports relatively high correlations between PEG and human graders relative to correlations between human graders (e.g., Page, Poggio, & Keith, 1997). E-rater was deemed so impressive it is now operational and used to score the Graduate Management Admission Test (GMAT). All of the systems return grades that correlate significantly and meaningfully with those of human raters.
Compared to IEA and e-rater, PEG has the advantage of being conceptually simpler and less taxing on computer resources. PEG is also the better choice for evaluating writing style, as IEA returns grades that have literally nothing to do with writing style. IEA and e-rater, however, appear to be the superior choice for grading content, as PEG relies on writing quality to determine grades.
All three of these systems are proprietary and details of the exact process are not generally available. We do not know, for example, what variables are in any model or what their weights are. The use of automated essay scoring is also somewhat controversial. A well-written essay about baking a cake could receive a high score if PEG were used to grade essays about causes of the American Civil War. Conceivably, IEA could be tricked into giving a high score to an essay that was a string of relevant words with no sentence structure whatsoever. E-rater appears to overcome some of these criticisms at the expense of being fairly complicated. These criticisms are more problematic for PEG than for IEA and e-rater.
One should not expect perfect accuracy from any automated scoring approach. The correlation of human ratings on state assessment constructed-response items is typically only .70 - .75. Thus, correlating with human raters as well as human raters correlate with each other is not a very high, nor very meaningful, standard. Because the systems are all based on normative data, the current state of the art does not appear conducive to scoring essays that call for creativity or personal experiences. The greatest chance of success for essay scoring appears to be for long essays that have been calibrated on large numbers of examinees and which have a clear scoring rubric.
Those who are interested in pursuing essay scoring may be interested in the Bayesian Essay Test Scoring sYstem (BETSY), being developed by the author based on the naive Bayes text classification literature. Free software is available for research use.
While recognizing the limitations, perhaps it is time for states and other programs to consider automated scoring services. We don't advocate abolishing human raters. Rather we can envision the use of any of these technologies as a validation tool with each essay scored by one human and by the computer. When the scores differ, the essay would be flagged for a second read. This would be quicker and less expensive than current practice.
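The validation workflow proposed above is simple enough to sketch. The record layout, the function names, and the optional `tolerance` parameter are all hypothetical; with the default tolerance of 0, any difference between the human and computer scores triggers a second read, as described in the text.

```python
def needs_second_read(human_score, machine_score, tolerance=0):
    """True when the two scores differ by more than the allowed tolerance."""
    return abs(human_score - machine_score) > tolerance

def triage(records, tolerance=0):
    """Split (essay_id, human_score, machine_score) records into two lists.

    Returns (accepted, flagged): essay ids whose scores agree, and
    essay ids that should go to a second human reader.
    """
    accepted, flagged = [], []
    for essay_id, human, machine in records:
        if needs_second_read(human, machine, tolerance):
            flagged.append(essay_id)
        else:
            accepted.append(essay_id)
    return accepted, flagged

accepted, flagged = triage([("a", 4, 4), ("b", 5, 3), ("c", 2, 3)], tolerance=1)
print(accepted, flagged)
```

A program would likely set the tolerance to match whatever disagreement it already accepts between two human raters.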
We would also like to see retired essay prompts used as instructional tools. The retired essays and grades can be used to calibrate a scoring system. The entire system could then be made available to teachers to help them work with students on writing and high-order skills. The system could also be coupled with a wide range of diagnostic information, such as the information currently available with IEA.
KEY WEB SITES
PEG - http://188.8.131.52/pegdemo/ref.asp
IEA - http://www.knowledge-technologies.com/
E-rater - http://www.ets.org/research/erater.html
Betsy - http://ericae.net/betsy/
REFERENCES AND RECOMMENDED READING
Burstein, J., K. Kukich, S. Wolff, C. Lu, M. Chodorow, L. Braden-Harder, and M.D. Harris (1998). Automated scoring using a hybrid feature identification technique. In the Proceedings of the Annual Meeting of the Association of Computational Linguistics, August, 1998. Montreal, Canada. Available on-line: http://www.ets.org/reasearch/aclfinal.pdf
Chung, G. K. W. K., & O'Neil, H. F., Jr. (1997). Methodological Approaches to Online Scoring of Essays. ERIC Document Reproduction Service No. ED 418 101.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
Page, E.B. (1994). Computer grading of student prose, using modern concepts and software. Journal of Experimental Education, 62(2), 127-42.
Page, E. B., Poggio, J. P., & Keith, T. Z. (1997). Computer analysis of student essays: Finding trait differences in the student profile. AERA/NCME Symposium on Grading Essays by Computer.
Whittington, D., & Hunt, H. (1999). Approaches to the computerized assessment of free text responses. Proceedings of the Third Annual Computer Assisted Assessment Conference, 207-219. Available online: http://cvu.strath.ac.uk/dave/publications/caa99.html.
Wresch, W. (1993). The Imminence of Grading Essays by Computer - 25 Years Later. Computers and Composition, 10(2), 45-58. Available online: http://corax.cwrl.utexas.edu/cac/archiveas/v10/10_2_html/10_2_5_Wresch.html.
This Digest is based on an article appearing in Practical Assessment Research and Evaluation.