Opus 4.8 tops the LLM leaderboard with 95% on skill evals


Wednesday, June 10, 2026

1,193 words, ~5 min read

We added Claude Opus 4.8 to our ongoing model benchmark. It scored 95% with skill context, which puts it 1.6 points above Opus 4.7 and 2.3 points above Cursor's Composer 2.5 Fast. It is also, by a meaningful margin, the slowest model we have tested.

TL;DR

Opus 4.8 scores 95% with skill context, taking the top spot from Opus 4.7.

Its 81% baseline is the highest ever recorded in this benchmark, above every other model's baseline, and it remains on top when models run the evals with skills loaded.

All three independent judges agreed within two points, the tightest spread we have seen across nine models. Previous high-variance models swung by more than seven points between judges.
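
The margins quoted above imply scores for the two runners-up and a skill lift for Opus 4.8. The sketch below works through that arithmetic in Python. It assumes the 95% headline figure is exact rather than rounded, so the derived values are approximate, and the names (`implied_score`, `skill_lift`) are hypothetical, chosen only for this illustration.

```python
# Figures quoted in the article (percent).
OPUS_48_WITH_SKILLS = 95.0
OPUS_48_BASELINE = 81.0

# Margins quoted in the article, in percentage points.
MARGIN_OVER_OPUS_47 = 1.6
MARGIN_OVER_COMPOSER_25_FAST = 2.3


def implied_score(leader: float, margin: float) -> float:
    """Score implied for a trailing model from the leader's score and its margin."""
    # Assumption: the leader's score is exact; rounding only tidies float noise.
    return round(leader - margin, 1)


implied = {
    "Opus 4.7": implied_score(OPUS_48_WITH_SKILLS, MARGIN_OVER_OPUS_47),
    "Composer 2.5 Fast": implied_score(OPUS_48_WITH_SKILLS, MARGIN_OVER_COMPOSER_25_FAST),
}

# Difference between the with-skills score and the no-skills baseline.
skill_lift = round(OPUS_48_WITH_SKILLS - OPUS_48_BASELINE, 1)

for model, score in implied.items():
    print(f"{model}: ~{score}% with skills (implied)")
print(f"Opus 4.8 skill lift: {skill_lift} points over baseline")
```

Under that assumption, Opus 4.7 lands near 93.4% and Composer 2.5 Fast near 92.7% with skills, while skill context adds about 14 points to Opus 4.8's own baseline.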


