The general shape of the problem is that every public LLM benchmark is on a saturation clock that runs from the moment of its publication to the moment a model's training corpus has eaten it. The clock has been running, on the visible benchmarks of the last five years, for somewhere between twelve and thirty months before each one is no longer useful for differentiating frontier models. The benchmarks are not failing. They are doing exactly what they were designed to do, in the order they were designed to do it, and the field has been running through them faster than the people designing them anticipated.

I want to put numbers on the saturation pattern, walk through what the contamination evidence actually says, and then sit with the question of what an honest benchmark would have to look like in 2026 — because the "private held-out eval" answer that the labs are converging on has economics that are worth examining carefully before any of us salute it as the solution.

The saturation timeline, with numbers

HumanEval (Chen et al., OpenAI, July 2021). 164 hand-written Python problems. The benchmark was published with Codex at 28.8% pass@1; the underlying GPT-3 base model scored 0%. GPT-4 (March 2023) hit 67% in the original Technical Report. By late 2024, OpenAI's o1-preview and o1-mini both reached 96.3% pass@1; Claude 3.5 Sonnet sat at 93.7%. The benchmark is saturated in the operational sense — the relative spread across the top ten models is around 10 percentage points, which is too small a gap to differentiate them on, and most of the new models arrive within a percentage point or two of the ceiling. The successor variants (HumanEval+ from EvalPlus, with augmented test cases) are the field's response. Lifespan from publication to operational saturation: about 36 months.