Epoch AI’s FrontierMath benchmark, a 350-problem test designed to push AI systems to their mathematical limits, is undergoing a significant correction after an internal review flagged errors in roughly one-third of its dataset. The audit, disclosed on May 11, 2026, revealed that the problems designed to stump the world’s most advanced AI models had a quality control issue of their own.
The organization plans to release updated scores once a thorough human review is completed.
What FrontierMath actually is, and why it matters
FrontierMath launched in November 2024 and was developed in collaboration with more than 60 mathematicians. The core dataset comprises 300 problems across Tiers 1 through 3, spanning undergraduate to advanced graduate difficulty. Tier 4 adds another 50 research-level problems, the kind of questions that even professional mathematicians might need hours or days to solve, bringing the total to 350.
Earlier reviews of the dataset, based on limited secondary checks, had suggested error rates of 7% to 10%. Epoch AI's AI-assisted review painted a much less flattering picture, raising the estimate to approximately 33% of problems containing what the organization described as fatal errors.
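To put those percentages in terms of problem counts, here is a rough back-of-the-envelope sketch. It assumes the error rates apply to all 350 problems (300 in Tiers 1 through 3 plus 50 in Tier 4), which the disclosure as summarized here does not confirm; the function name `implied_count` is purely illustrative.

```python
# Back-of-the-envelope arithmetic only. ASSUMPTION: each error rate applies
# to the full 350-problem set (300 in Tiers 1-3 plus 50 in Tier 4).
TOTAL_PROBLEMS = 300 + 50


def implied_count(percent: int, total: int = TOTAL_PROBLEMS) -> float:
    """Return the number of problems implied by a percentage error rate."""
    return total * percent / 100


if __name__ == "__main__":
    for label, percent in [
        ("earlier estimate, low end", 7),
        ("earlier estimate, high end", 10),
        ("AI-assisted review", 33),
    ]:
        print(f"{label}: {percent}% -> about {implied_count(percent):.1f} problems")
```

Under that assumption, the earlier estimates would correspond to roughly 25 to 35 flawed problems, while the new figure would correspond to about 115.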