I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.

TL;DR

Last week I benchmarked 5 open-weight models (Llama 4 Scout, Llama 3.3 70B, Qwen3 32B, GPT-OSS, Gemini 2.5 Flash) and the best scored 62.5%. People asked the obvious follow-up: does the closed-frontier story look better?

Short answer: yes, but with a twist that surprised me.

I ran the same harness against 5 frontier closed models accessed via OpenRouter:

Rank
