Grading essays: Human vs. machine
May 10th, 2012
06:20 AM ET

by Jordan Bienstock, CNN

(CNN) No one thinks twice about using machines to grade multiple-choice tests. For decades, teachers – and students – have trusted technology to accurately decipher which bubble was filled in on a Scantron form.

But can a machine take on the task of evaluating the written word?

A recent study by the University of Akron's College of Education collected 16,000 middle and high school test essays from six states, all of which had previously been graded by humans. The essays were then fed into a computer scoring program.

According to the researchers, the robo-graders “achieved virtually identical levels of accuracy, with the software in some cases proving to be more reliable.”

So the simple answer to whether machines can grade essays would appear to be yes. However, the situation is anything but simple.

The grading software looks for elements of good writing, such as strong vocabulary and correct grammar.

What it isn't able to do is discern nuance, or even truth.

Les Perelman, a director of writing at the Massachusetts Institute of Technology, is a critic of these robo-graders. He has studied how some of the programs work, and he says they can be gamed once you figure out the preferences built into the scoring algorithms.

For example, Perelman said in a New York Times article that the machines focus on composition but have no concern for accuracy. According to Perelman, “any fact will do as long as it is incorporated into a well-structured sentence.”
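
To make Perelman's point concrete, here is a minimal sketch of the kind of surface-feature scorer he describes – a hypothetical toy written in Python for illustration, not the actual e-Rater, whose internals are proprietary. It rewards length, long words and connector words, and never checks whether anything is true:

    # Toy surface-feature essay scorer – a hypothetical illustration,
    # NOT the real e-Rater. It shows why counting features can be gamed.
    CONNECTORS = {"however", "moreover", "therefore", "furthermore"}

    def toy_score(essay: str) -> int:
        """Score an essay from 1 to 6 using only surface features."""
        words = essay.lower().split()
        long_words = sum(1 for w in words if len(w) >= 8)  # crude "lexical complexity"
        connectors = sum(1 for w in words if w.strip(",.") in CONNECTORS)
        # Reward sheer length, big words and connector words; nothing here
        # checks whether any sentence is accurate, or even means anything.
        raw = 0.005 * len(words) + 0.1 * long_words + 0.3 * connectors
        return max(1, min(6, round(raw)))

    padding = ("Moreover, the egregious plethora of gargantuan circumstances "
               "notwithstanding, however, the quintessential paradigm persists. ")
    print(toy_score(padding * 20))  # nonsensical padding still earns a top score: 6

Pad an essay with long, connector-laden nonsense and a scorer like this hands out top marks every time, which is exactly the weakness Perelman exploited.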

Dr. Mark Shermis, dean of Akron’s College of Education and one of the authors of the study, acknowledges that “automatic grading doesn’t do well on very creative kinds of writing. But this technology works well for about 95 percent of all the writing that’s out there.”

Another point in the machine’s favor: speed. The New York Times article points out that human graders working as quickly as possible are expected to grade up to 30 essays in an hour.

In contrast, some robo-graders can score 16,000 essays in 20 seconds – a rate of 800 essays per second.

That disparity would seem to support Shermis’ view that robotic graders can serve “as a supplement for overworked” entry-level writing instructors. But he warns that his findings shouldn’t be used as a justification to replace writing instructors with robots.

What these robo-graders can do, Shermis says, is “provide unlimited feedback to how you can improve what you have generated, 24 hours a day, seven days a week.”

Filed under: Practice • Testing
soundoff (31 Responses)
  1. JUSTIN

    Computerized checking can be convenient. However, it won't be the best choice, because can machines really think like humans? People all have their own way of checking, and we can't always depend on machines, since they aren't always 100% accurate. We could use them for double-checking or for finding grammar mistakes, but I don't think we should check with machines.

    May 23, 2012 at 10:28 am |
  2. sheila

    yeah. same thing as resume scan. same difference

    May 21, 2012 at 7:47 am |
  3. alejandro

    YOLO

    May 14, 2012 at 2:59 pm |
    • fernanda

      haha Alejandro yur a looser >.<

      May 14, 2012 at 3:01 pm |
  4. aaron

    This sucks By aaron

    May 14, 2012 at 2:56 pm |
    • aaron

      ( .Y .)

      May 14, 2012 at 3:02 pm |
  5. nayeli

    This is a student and this sucks. I always copyright in school.

    May 14, 2012 at 2:54 pm |
  6. Caitlyn

    I understand having machines check multiple choice, grammar and all the easy stuff, but when it comes to an essay, that's where you need a person to actually understand your writing, because a machine might not always be able to understand what you're trying to say when an actual human can.

    May 14, 2012 at 12:38 pm |
  7. Emma

    I think it's good the computer can check grammar and vocabulary mistakes, but can it really think like us humans?

    May 14, 2012 at 12:35 pm |
    • Daria

      Good observation. I agree with you!!!

      May 14, 2012 at 12:37 pm |
  8. Sid

    "they can be gamed if you can determine the preferences set by the scoring algorithms."

    as opposed to professors, who can be gamed by determining their political views and parroting them in your writing, in which case you don't have to bother with things like grammar, vocabulary or spelling...

    May 11, 2012 at 8:25 am |
    • soulcatcher

      100% correct about most professors. But why stop with graders? Take my new product braindoubler to your test and finish the test in six seconds. And for you hackers out there taking information technology tests, we have SQLinjector.

      /sarcasm off

      May 11, 2012 at 8:42 am |
  9. Alan Nordling

    CNN left out the most damning portion of Mr. Perelman's thoughts on "RoboGrading":

    "The e-Rater’s biggest problem, he says, is that it can’t identify truth. He tells students not to waste time worrying about whether their facts are accurate, since pretty much any fact will do as long as it is incorporated into a well-structured sentence. “E-Rater doesn’t care if you say the War of 1812 started in 1945,” he said.

    Mr. Perelman found that e-Rater prefers long essays. A 716-word essay he wrote that was padded with more than a dozen nonsensical sentences received a top score of 6; a well-argued, well-written essay of 567 words was scored a 5.

    An automated reader can count, he said, so it can set parameters for the number of words in a good sentence and the number of sentences in a good paragraph. “Once you understand e-Rater’s biases,” he said, “it’s not hard to raise your test score.”

    E-Rater, he said, does not like short sentences.

    Or short paragraphs.

    Or sentences that begin with “or.” And sentences that start with “and.” Nor sentence fragments.

    However, he said, e-Rater likes connectors, like “however,” which serve as programming proxies for complex thinking. Moreover, “moreover” is good, too.

    Gargantuan words are indemnified because e-Rater interprets them as a sign of lexical complexity. “Whenever possible,” Mr. Perelman advises, “use a big word. ‘Egregious’ is better than ‘bad.’ ”

    The substance of an argument doesn’t matter, he said, as long as it looks to the computer as if it’s nicely argued.

    For a question asking students to discuss why college costs are so high, Mr. Perelman wrote that the No. 1 reason is excessive pay for greedy teaching assistants.

    “The average teaching assistant makes six times as much money as college presidents,” he wrote. “In addition, they often receive a plethora of extra benefits such as private jets, vacations in the south seas, starring roles in motion pictures.”

    E-Rater gave him a 6. He tossed in a line from Allen Ginsberg’s “Howl,” just to see if he could get away with it.

    He could.

    The possibilities are limitless. If E-Rater edited newspapers, Roger Clemens could say, “Remember the Maine,” Adele could say, “Give me liberty or give me death,” Patrick Henry could sing “Someone Like You.”

    So, yeah, let's all RoboGrade!

    May 11, 2012 at 7:04 am |
  10. Rod C. Venger

    This is all about form over substance. If it's used only to determine whether a student can correctly structure sentences and paragraphs into a supposedly coherent whole, then okay... just don't conclude that the result actually is coherent. I'd like to think that the hundreds or thousands of essays I've written were appreciated for their content as well as their form; otherwise, what was the point?

    May 10, 2012 at 11:56 pm |
  11. Ophelia

    I had my students use some writing-grading technology a few years ago. One boy discovered that if he copied and pasted the same complete sentence in over and over again, he earned top marks. My guess is that now the program also looks for key words, such as transitions, but what else could it really do?

    One major issue I have with computers reading papers instead of the teacher – me – is that I would no longer get to know my students the way I do throughout the year. When you have 34-36 kids per hour for 6 hours a day, it takes a while, but I get to know a lot of kids. I also refer kids for help when they write about depression, abuse, bullying and other problems they'd never reveal verbally.

    Even in expository essay writing, I learn about my students through their writing. I definitely learn who's listening, who's copying and who needs help. It might take a while to read all those papers, but kids and parents have told me for years how much they appreciate the comments on their papers, and how they liked knowing someone actually read their work.

    May 10, 2012 at 11:44 pm |
  12. MissingThePoint

    I believe that you are overlooking an important use of automated graders: they afford students more opportunities to practice writing and get "some level" of instant feedback. A trained machine is not meant to grade Hemingway or Joyce; it is meant to help Janie and Jose and Shaniqua have more opportunities to practice open-ended prompts. We would like to believe that teachers have the time to grade 120 essays each week in a timely manner with constructive feedback, but they do not. I have sat through high-stakes essay training, and I would trust a machine for my fifth graders over many of the people in that room any day.

    May 10, 2012 at 10:49 pm |
    • MissingTheLogic

      Yes, because we all know that if word gets around amongst 5th graders that they can sham and get a good grade, they definitely won't take full advantage of it.

      May 11, 2012 at 4:48 am |
  13. hamsta

    The reason this won't work is that the English language is complex and full of double and triple meanings. For instance, take the catchphrase from the milk commercial: GOT MILK? That could have three different meanings: (1) asking for a drink of milk, (2) an insult meaning she is a fat cow, or (3) a compliment meaning she has a nice body part. How is a computer going to determine that? Trust me, a woman won't get it wrong.

    May 10, 2012 at 9:30 pm |
  14. John

    I never had any real problem with writing essays in school, but I always thought they were a waste of time and energy.

    My secret to decent essay writing was to get into my teachers' heads and then proceed to give them wish fulfillment.

    I wonder if these robograders would put the kibosh on that?

    Inputting some of what are considered the great writers – Plato, Shakespeare, Melville, Twain, Dickens, Swift and a few others – as Alyssa suggested would be a fascinating study. If they got bad marks, it might indicate that what are considered good writing styles need to be revised.

    May 10, 2012 at 9:25 pm |
  15. Alyssa

    SixDegrees – well said! Let's put in some of the very best writers we have – Twain, Dickens, Tolstoy, Austen – and see what grade they get. If they all get A's, maybe I'll buy this whole robo-grading thing.

    May 10, 2012 at 7:38 pm |
  16. Steve

    When I was in 7th grade and kept getting checks instead of check plusses on my assignments, I figured out the teacher graded purely for length. At that point, I would toss in meaningless, inappropriate non sequiturs to pad the length - instant check plusses.

    Even humans can be gamed, especially when students generate more work than a teacher can realistically be expected to grade effectively (as was clearly the case in my class, in retrospect). Not to suggest the computer would have done any better: the sentences I used for padding were all grammatically correct, just factually nonsensical.

    May 10, 2012 at 4:23 pm |
  17. jj

    The machine looks for complete sentences. Many good writers do not always use complete sentences. And I'd love to see what the machine would do with ultra-complex sentences by Proust!

    May 10, 2012 at 4:21 pm |
  18. Fed Up

    Why bother to test at all? The kids who have the grades get in on their own merit, the kids with parental money get in through social connections, and the kids who are minorities apply and get "special consideration" that just lowers all of the standards. The minority kids end up failing and leaving without a degree, and taxpayers get to cover their loan debt. It might as well be college welfare.

    May 10, 2012 at 4:14 pm |
    • JFKman

      Your argument is stupid. Try connecting it with some thought first.

      May 10, 2012 at 9:05 pm |
  19. SixDegrees

    Rather than feeding the program student essays, it would be more interesting to see how it performed when grading writers considered to be outstanding – Shakespeare, Dickens, Twain, Vonnegut and so on. My guess: it would fail them miserably. One element that makes these writers great is that they not only knew the rules, but knew them well enough to stretch them to their very limits and beyond in pursuit of perfect expression. These excursions would doubtless be flagged as errors by the robo-grader.

    May 10, 2012 at 1:15 pm |
    • PaulM

      I totally agree. I grade your comment an A.

      May 10, 2012 at 4:17 pm |
    • Mike

      I agree with you to some extent, but you have to recognize that these machines grade evaluative essays and not other forms of writing. I would suspect that a Dickens essay would score much better than a Shakespearean sonnet.

      May 10, 2012 at 4:42 pm |
  20. William Vanstralen

    Well, looks like I'm out of a job...

    May 10, 2012 at 11:55 am |
    • Ashley M

      Unfortunately, I'm unable to access the link to what the "elements of good writing" are, but I'm very interested to find out. I previously scored the essay portion of a college entrance exam professionally, and I find it difficult to understand how this machine can truly determine whether an essay is "good" or not. While I completely agree that strong vocabulary and grammar are essential, when presenting an argument there are criteria that are not as black and white as the proper location of punctuation. What about complications, implications, counterarguments, flow, development, context, transitions, or expression, as SixDegrees noted? I realize different forms of writing require different elements in order to be effective, but the humans grading these essays have an inherent understanding of effective writing regardless of the content. I really hope they do not start replacing graders with robots.

      May 10, 2012 at 2:50 pm |
      • Steve

        The quality of a writer's grammar and vocabulary is probably more tightly related to the quality of the other aspects of his or her writing than we normally credit. It may very well be that you could arrive at the appropriate grade 95% of the time with the limited evaluation the computer is using.

        The computer is also free from all the psychological factors that rattle around in a grader's head and make them spend 'extra' time ensuring the grade they give is appropriate and fair to the student.

        May 10, 2012 at 4:36 pm |