A new benchmark from UC Berkeley suggests that AI agent timelines need a serious reality check.
The Agents’ Last Exam, a large-scale evaluation framework built with input from over 250 industry experts across more than 100 institutions, found that mainstream AI agents achieve an average full pass rate of just 2.6% on its hardest tier of real-world professional tasks. The best-performing agent, Codex running on gpt-5-5, managed roughly 26%.
What the benchmark actually tests
The benchmark covers 55 non-physical sub-industries organized into 13 clusters, derived from the O*NET/SOC 2018 taxonomy. So far, the team has cataloged more than 1,500 tasks, with an ambitious goal of reaching 5,000. Each task produces verifiable outcomes, meaning there’s no room for the kind of fluent-sounding-but-wrong outputs that large language models have become famous for.
The paper was submitted to arXiv on June 3, 2026, and the project lives at agents-last-exam.org. It’s designed as a living benchmark that will continue expanding in scope and complexity over time.













