Mathematicians grade AI performance on complex problem set at Harvard

Here’s a question that keeps researchers up at night: can AI actually do math, or is it just really good at pattern-matching against problems it’s already seen? A group of 30 mathematicians at Harvard decided to find out the hard way, by giving leading AI systems a test they couldn’t possibly have studied for.

The project, called “First Proof, Second Batch,” assembled its expert panel at Harvard’s Center of Mathematical Sciences and Applications in early June 2026. Their task was straightforward but unprecedented in scale: blind-grade AI-generated solutions to 10 original, unpublished research-level mathematics problems. The results, released on June 10, paint a picture that’s neither the doom scenario nor the triumph that partisans on either side might prefer.

The setup: why unpublished problems matter

The entire exercise hinges on one critical design choice. Every problem in the set was drawn from active, unpublished research. None of these questions had appeared in textbooks, on arXiv, or anywhere else they could have been scraped into an AI’s training data.

The mathematicians behind the project aren’t exactly lightweights, either. The roster includes Mohammed Abouzaid of Stanford, Nikhil Srivastava of UC Berkeley, Rachel Ward of UT Austin, and Lauren Williams of Harvard.
