TL;DR

Last week I benchmarked 5 models, mostly open-weight (Llama 4 Scout, Llama 3.3 70B, Qwen3 32B, GPT-OSS, plus the closed Gemini 2.5 Flash), and the best scored 62.5%. People asked the obvious follow-up: does the closed-frontier story look better?

Short answer: yes, but with a twist that surprised me.

I ran the same harness against 5 closed frontier models, accessed via OpenRouter:

Rank