Every time a coding agent underperforms, the default move is to swap in a bigger model. I wanted to see what happens if you refuse that move and fix everything else instead.

The result: 61.6% ± 1.9 on Terminal-Bench 2.0 with GPT-5.1-Codex-Mini — rank #41, in the same band as stock harnesses running flagship models a tier or two larger. 445 runs, $27, ~35 hours.

This is not an argument that small models are secretly enough. It is an argument that the wrapper around the model is doing more work than most people give it credit for — and that you can see this clearly only when the model is small enough that harness mistakes actually hurt. What follows is a teardown of what survived.

Reading the number

The score is verified on the official leaderboard at rank #41 as of May 14, 2026, across 89 tasks with 5 runs each. Leaderboards move, so I treat the rank as a timestamped snapshot rather than a permanent claim. The useful comparison is the neighborhood around that snapshot: entries immediately around rank #41 run on GPT-5.2, Claude Opus 4.6, and Gemini 3 Pro.