AI not yet good enough to mark university essays, rewarding ‘style over substance’

Researchers have used top generative AI models to grade hundreds of undergraduate essays and found that AI matched the degree classification awarded by human examiners only around half the time, often failing to assess the best and worst submissions accurately.

A University of Cambridge-led team of psychologists and AI experts tested three “frontier” systems, including the latest versions (as of April 2026) of Claude and ChatGPT, on over 750 student essays from three UK universities submitted as part of a psychology degree.

While the accuracy of AI in grading the essays, which ranged from coursework to exam answers, was “not uniformly high”, say researchers, it did manage to match the broad grading bands – a first, 2:1, 2:2 and so on – given out by human examiners between 35% and 65% of the time.

However, major stumbling blocks for AI include routinely undervaluing work awarded top marks by humans and overvaluing essays ranked among the lowest.

Unlike human examiners, all the AI systems were “oversensitive to linguistic features”: giving out higher marks based on essay length, vocabulary range and sentence complexity, which are often unrelated to academic standards.
