A consortium of 64 mathematicians built a new benchmark for AI models that exposes two weaknesses: solving research-level math and recognizing unsolvable tasks.

With today's frontier models already reaching IMO gold-medal level, AI research needs new math benchmarks. SOOHAK, developed at Carnegie Mellon University, EleutherAI, and Seoul National University, among others, consists of 439 original tasks.

They're split into two sections: a "Challenge" set with 340 problems at the graduate and research level, and a "Refusal" set with 99 intentionally flawed problems that contain contradictions or have no well-defined answer.

Unlike common collections, SOOHAK wasn't pulled from competitions or textbooks. Every problem was written from scratch by a team of 38 professors, 25 PhD students and postdocs, and five IMO medalists. Before submitting, each contributor had to confirm they worked without AI help. Anyone caught sneaking in LLM-generated tasks was kicked out.

The SOOHAK benchmark went through several collection and review stages: submission, automated LLM checks, manual moderation, revisions, and final inclusion in the dataset. | Image: Son et al.