Anthropic's latest flagship model, Claude Opus 4.8, leads most benchmarks and is designed to be more upfront about its own mistakes.
Anthropic says Opus 4.8 beats its predecessor as well as OpenAI's GPT-5.5 and Google's Gemini 3.1 Pro across most tested categories. On agentic coding (SWE-Bench Pro), the model hits 69.2 percent, up from 64.3 percent for Opus 4.7 and ahead of GPT-5.5's 58.6 percent. For multidisciplinary reasoning (Humanity's Last Exam), Opus 4.8 scores 49.8 percent without tools and 57.9 percent with tools, the highest marks in the field.
Opus 4.8 stacked against Opus 4.7, GPT-5.5, and Gemini 3.1 Pro. | Image: Anthropic
Less fake progress, more honesty
Anthropic calls the model's improved honesty one of its most noticeable upgrades. AI models have a habit of jumping to conclusions and claiming progress that falls apart on closer inspection, and the problem is widespread.