Featuring Every Eval Ever Results on Hugging Face Model Pages

Every Eval Ever (EEE) and Hugging Face Community Evals are now interoperable. We enable cross-posting and interpretation of evaluation results, with links to open models, leaderboards, and a unified, standardized metadata store.

EEE launched in February 2026 as a project of the EvalEval Coalition, the first cross-institutional effort to improve how AI evaluation results get reported by both first- and third-party evaluators. Hugging Face launched Community Evals in February 2026 to decentralize how benchmark scores get reported on the Hub. Together, they patch gaps in how users, researchers, and policymakers trust, understand, and choose evaluations and models.

Evaluation results are how we measure model capabilities, compare models against each other, and reason about safety and governance. Yet they are scattered and hard to compare: they live in papers, leaderboards, blog posts, harness logs, and elsewhere, each in its own format. The same model on the same benchmark often returns different scores depending on who ran it and how; LLaMA 65B, for one, has been reported at both 63.7 and 48.8 on MMLU. Such gaps can arise from evaluation settings that, as we found, commonly go unreported.
