We added Claude Opus 4.8 to our ongoing model benchmark. It scored 95% with skill context, which puts it 1.6 points above Opus 4.7 and 2.3 points above Cursor's Composer 2.5 Fast. It is also, by a meaningful margin, the slowest model we have tested.
TL;DR
Opus 4.8 scores 95% with skill context, taking the top spot from Opus 4.7.
Its 81% baseline is the highest ever recorded in this benchmark, and the model remains on top when evals are run with skills loaded.
All three independent judges agreed within two points, the tightest spread we have seen across nine models. Earlier high-variance models swung by more than seven points between judges.
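
For readers unfamiliar with the judge-spread figure, here is a minimal sketch of one way to compute it, assuming spread means the gap between the highest and lowest score that the judges gave the same model. The function name `judge_spread` and the scores below are hypothetical illustrations, not data or code from this benchmark.

```python
def judge_spread(scores):
    """Return the gap, in points, between the highest and lowest judge score."""
    if not scores:
        raise ValueError("need at least one judge score")
    return max(scores) - min(scores)


# Hypothetical per-judge scores for illustration only (NOT benchmark results).
tight_agreement = [94, 95, 96]   # judges land close together
wide_agreement = [84, 88, 92]    # judges disagree noticeably

print(judge_spread(tight_agreement))  # 2
print(judge_spread(wide_agreement))   # 8
```

Under this assumed definition, a smaller number means the judges agreed more closely, so the headline score depends less on which judge happened to grade the run.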









