The Open Agent Leaderboard


How good are general purpose AI agents? We built an open evaluation framework to find out.

Most AI evaluations report a simple result: the score each model achieved on each benchmark task. But when you deploy an agent, you're not just choosing a model. You're choosing a full system: what tools the agent can use, how it plans its steps, what it remembers between actions, and how it recovers when something goes wrong. Change any of those and the same model can produce very different results at very different costs.

How well an AI agent works depends on how it's built, not just the model inside it.

Today we're launching the Open Agent Leaderboard, an open benchmark for comparing full agent systems, not just the models inside them. It reports both quality and cost, so you can see not just what works, but what's worth deploying.
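To make the "quality and cost" framing concrete, here is a minimal sketch of how one might compare full agent systems along both axes. Everything in it is an assumption for illustration: the `AgentResult` fields, the agent and model names, and the numbers are hypothetical and are not taken from the leaderboard or its data format. It keeps only the configurations that no other configuration beats on both quality and cost (a Pareto frontier).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentResult:
    """One hypothetical row: a full agent system, not just a model."""
    agent: str       # agent architecture (tools, planning, memory); illustrative name
    model: str       # underlying model; illustrative name
    quality: float   # task success rate in [0, 1]; made-up value
    cost_usd: float  # average cost per task in USD; made-up value


def pareto_frontier(results):
    """Return the results that no other result dominates.

    A result is dominated when another one has quality at least as high
    and cost at least as low, and is strictly better on one of the two.
    """
    frontier = []
    for r in results:
        dominated = any(
            o.quality >= r.quality
            and o.cost_usd <= r.cost_usd
            and (o.quality > r.quality or o.cost_usd < r.cost_usd)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    # Cheapest first, so the list reads as "what each extra dollar buys".
    return sorted(frontier, key=lambda r: r.cost_usd)


# Hypothetical data: the same model appears under different agent designs.
results = [
    AgentResult("agent-a", "model-x", quality=0.62, cost_usd=0.40),
    AgentResult("agent-b", "model-x", quality=0.71, cost_usd=1.90),
    AgentResult("agent-a", "model-y", quality=0.58, cost_usd=0.55),
    AgentResult("agent-c", "model-y", quality=0.71, cost_usd=2.30),
]

for r in pareto_frontier(results):
    print(f"{r.agent} + {r.model}: quality={r.quality:.2f}, cost=${r.cost_usd:.2f}")
```

In this made-up data, the same model lands at two different points depending on the agent built around it, which is the kind of difference a model-only score would hide.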


