How Far Can a Small Coding Model Go With a Better Harness?

Every time a coding agent underperforms, the default move is to swap in a bigger model. I wanted to see what happens if you refuse that move and fix everything else instead.

The result: 61.6% ± 1.9 on Terminal-Bench 2.0 with GPT-5.1-Codex-Mini — rank #41, in the same band as stock harnesses running flagship models a tier or two larger. 445 runs, $27, ~35 hours.

This is not an argument that small models are secretly enough. It is an argument that the wrapper around the model is doing more work than most people give it credit for — and that you can see this clearly only when the model is small enough that harness mistakes actually hurt. What follows is a teardown of what survived.

Reading the number

The score is verified on the official leaderboard at rank #41 as of May 14, 2026, across 89 tasks with 5 runs each. Leaderboards move, so I treat the rank as a timestamped snapshot rather than a permanent claim. The useful comparison is the neighborhood around that snapshot: entries immediately around rank #41 run on GPT-5.2, Claude Opus 4.6, and Gemini 3 Pro.
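The run count and budget quoted above follow from plain arithmetic, and a quick sanity check makes the scale concrete. The sketch below uses only the figures stated in this post (89 tasks, 5 runs each, $27, roughly 35 hours). The function name `run_budget` and its dictionary keys are illustrative, not part of any harness. The per-run time is an average that assumes the 35 hours is summed run time; if runs overlapped in parallel, the real per-run duration would be longer.

```python
# Figures quoted in the post; everything derived below is plain arithmetic.
TASKS = 89
RUNS_PER_TASK = 5
TOTAL_COST_USD = 27.0
TOTAL_HOURS = 35.0  # approximate ("~35 hours"); assumed here to be summed run time


def run_budget(tasks, runs_per_task, cost_usd, hours):
    """Return total runs plus the average cost and time per run (hypothetical helper)."""
    total_runs = tasks * runs_per_task
    return {
        "total_runs": total_runs,
        "usd_per_run": cost_usd / total_runs,
        "minutes_per_run": hours * 60 / total_runs,
    }


budget = run_budget(TASKS, RUNS_PER_TASK, TOTAL_COST_USD, TOTAL_HOURS)
print(f"total runs:      {budget['total_runs']}")
print(f"USD per run:     {budget['usd_per_run']:.3f}")
print(f"minutes per run: {budget['minutes_per_run']:.1f}")
```

On these numbers, 89 × 5 gives the 445 runs reported, at an average of about six cents and under five minutes per run.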
