Researchers have used top generative AI models to grade hundreds of undergraduate essays and found that the AI matched the human-awarded degree classification only around half the time, often failing to assess the best and worst submissions accurately.

A University of Cambridge-led team of psychologists and AI experts tested three “frontier” systems, including the latest versions (as of April 2026) of Claude and ChatGPT, on more than 750 student essays submitted as part of psychology degrees at three UK universities.

While the accuracy of AI in grading the essays, from coursework to exam answers, was “not uniformly high”, say researchers, it did manage to match the broad grading bands – a first, 2:1, 2:2 and so on – given out by human examiners between 35% and 65% of the time.

However, major stumbling blocks for AI include routinely undervaluing work awarded top marks by humans and overvaluing essays ranked among the lowest.

Unlike human examiners, all the AI systems were “oversensitive to linguistic features”: giving out higher marks based on essay length, vocabulary range and sentence complexity, which are often unrelated to academic standards.