Anthropic's latest flagship model, Claude Opus 4.8, leads most benchmarks and is designed to be more upfront about its own mistakes.

Anthropic says Opus 4.8 beats its predecessor as well as OpenAI's GPT-5.5 and Google's Gemini 3.1 Pro across most tested categories. On agentic coding (SWE-Bench Pro), the model hits 69.2 percent, up from 64.3 percent for Opus 4.7 and ahead of GPT-5.5's 58.6 percent. For multidisciplinary reasoning (Humanity's Last Exam), Opus 4.8 scores 49.8 percent without tools and 57.9 percent with tools, the highest marks in the field.

Opus 4.8 stacked against Opus 4.7, GPT-5.5, and Gemini 3.1 Pro. | Image: Anthropic

Less fake progress, more honesty

Anthropic calls the model's improved honesty one of its most noticeable upgrades. AI models have a habit of jumping to conclusions and claiming progress that falls apart on closer inspection, and the problem is widespread.