Agents' Last Exam reveals AI agents struggle with real work tasks, passing just 2.6% of the time

A new benchmark from UC Berkeley suggests that AI agent timelines need a serious reality check.

The Agents’ Last Exam, a large-scale evaluation framework built with input from over 250 industry experts across more than 100 institutions, found that mainstream AI agents achieve an average full pass rate of just 2.6% on its hardest tier of real-world professional tasks. The best-performing agent, Codex running on gpt-5-5, managed roughly 26%.

What the benchmark actually tests

The benchmark covers 55 non-physical sub-industries organized into 13 clusters, derived from the O*NET/SOC 2018 taxonomy. So far, the team has cataloged more than 1,500 tasks, with an ambitious goal of reaching 5,000. Each task produces verifiable outcomes, so an answer that sounds fluent but is wrong, the kind of output large language models have become famous for, does not count as a pass.

The paper was submitted to arXiv on June 3, 2026, and the project lives at agents-last-exam.org. The benchmark is designed to be a living one that will keep expanding in scope and complexity over time.
