FrontierMath benchmark undergoes major audit as Epoch AI flags errors in one-third of math problems

Epoch AI’s FrontierMath benchmark, a 350-problem test designed to push AI systems to their mathematical limits, is undergoing a significant correction after an internal review flagged errors in roughly one-third of its dataset. The audit, disclosed on May 11, 2026, revealed that the problems designed to stump the world’s most advanced AI models had a quality control issue of their own.

The organization plans to release updated scores once a thorough human review is completed.

What FrontierMath actually is, and why it matters

FrontierMath launched in November 2024 and was developed in collaboration with more than 60 mathematicians. The core dataset includes 300 problems across Tiers 1 through 3, spanning undergraduate to advanced graduate difficulty. Tier 4 adds another 50 research-level problems, the kind of questions that could take even professional mathematicians hours or days to solve.

Earlier reviews of the dataset, based on limited secondary checks, had suggested error rates of 7% to 10%. Epoch AI's AI-assisted review painted a much less flattering picture: it raised the estimate to approximately 33% of problems containing what the organization described as fatal errors.
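To put those percentages in terms of problem counts, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that each error rate applies to the full 350-problem set; the article does not say exactly which subset each review covered, and the helper name `implied_count` is hypothetical.

```python
# Problem counts as stated in the article.
TIER_1_TO_3 = 300
TIER_4 = 50
TOTAL = TIER_1_TO_3 + TIER_4  # 350 problems overall


def implied_count(percent: int, total: int = TOTAL) -> int:
    """Approximate number of problems implied by an error rate.

    Uses integer arithmetic (round half up) to avoid floating-point surprises.
    Assumption: the rate applies uniformly to `total` problems.
    """
    return (percent * total + 50) // 100


if __name__ == "__main__":
    low = implied_count(7)     # lower end of the earlier estimate
    high = implied_count(10)   # upper end of the earlier estimate
    audit = implied_count(33)  # estimate from the AI-assisted review
    print(f"Earlier estimate: roughly {low} to {high} problems")
    print(f"Audit estimate:   roughly {audit} problems")
```

Under that assumption, the earlier checks would imply a few dozen flawed problems, while the audit figure would imply well over a hundred. The final numbers depend on the human review that is still pending.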
