Community Evals: Because we're done trusting black-box leaderboards over the community

Evaluation is broken What We're Shipping Why This Matters Get Started

TL;DR: Benchmark datasets on Hugging Face can now host leaderboards. Models store their own eval scores. Everything links together. The community can submit results via PR. Verified badges prove that the results can be reproduced.

Evaluation is broken

Let's be real about where we are with evals in 2026. MMLU is saturated above 91%. GSM8K hit 94%+. HumanEval is conquered. Yet usage reports suggest that some models that ace these benchmarks still can't reliably browse the web, write production code, or handle multi-step tasks without hallucinating. There is a clear gap between benchmark scores and real-world performance.

Related reading

Featuring Every Eval Ever Results on Hugging Face Model Pages

The Open Agent Leaderboard

Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs

Is it agentic enough? Benchmarking open models on your own tooling

FOD#68: AI Benchmarks vs Vibe Checks: Measuring AI Progress

Let's talk about LLM evaluation
