New math benchmark reveals AI models confidently solve problems that have no solution

A consortium of 64 mathematicians built SOOHAK, a new AI benchmark with 439 handwritten tasks, including 99 that are deliberately unsolvable. Google's Gemini 3 Pro leads on research-level problems at 30 percent. But no model cracks 50 percent on spotting broken tasks. More compute makes models better at solving problems. It doesn't make them any better at admitting a problem has no answer. SOOHAK tries to pin down the gap between a few flashy results and the broad research skills AI systems still lack.

Sunday, May 17, 2026

A consortium of 64 mathematicians built a new benchmark for AI models that exposes two weaknesses: research-level math and the ability to recognize unsolvable tasks.

With today's frontier models already hitting IMO Gold level, AI research needs new math benchmarks. SOOHAK, developed at Carnegie Mellon University, EleutherAI, and Seoul National University, among others, consists of 439 original tasks.

They're split into two sections: a "Challenge" set with 340 problems at the graduate and research level, and a "Refusal" set with 99 intentionally flawed problems that contain contradictions or don't allow a clear answer.

Unlike common collections, SOOHAK wasn't pulled from competitions or textbooks. Every problem was written from scratch by a team of 38 professors, 25 PhD students and postdocs, and five IMO medalists. Before submitting, each contributor had to confirm they worked without AI help. Anyone caught sneaking in LLM-generated tasks was kicked out.

The SOOHAK benchmark went through several collection and review stages: submission, automated LLM checks, manual moderation, revisions, and final inclusion in the dataset. | Image: Son et al.
