Frontier-Quality Coding at Cheap-Tier Cost: What We Built, and How We Measured It

This is a /dev post for people who read benchmark tables for a living. The thesis is simple: a cascade that serves most requests from a cheap local model, escalating only the hard ones to a frontier model, can hit frontier-quality coding scores at a fraction of the per-request cost. The harder claim, the one we care about, is that the reliability comes from the structure, not the model. Whether that holds over long horizons at scale is exactly what our unrun benchmarks are meant to settle, so we flag it as a goal, not a result. Below is what we measured, how we kept the scoring honest, and where we still have no number at all.

The architecture, in one paragraph

Two channels. A capability channel (the cheap tier: gpt-oss-120b, an open roughly 120B model we run at a fraction of frontier price, doing the actual solving) and a structure channel (verification gates and guards that decide whether an answer is trustworthy or needs escalation). A cache sits in front so exact repeats do not re-solve. When the local model is confident and the guards pass, the request is served cheap. When the guards fail, it escalates to frontier. Most of the interesting behavior, and most of the measurement difficulty, lives in the structure channel.

The architecture, in one paragraph

Frontier-Quality Coding at Cheap-Tier Cost: What We Built, and How We Measured It

Frontier-Quality Coding at Cheap-Tier Cost: What We Built, and How We Measured It

Related reading

We built a coding harness that beats frontier models using open ones. It's in…

AI Weekly: Cheaper Coding Models, Custom Chips, and a Stateless MCP

I spent ten days forcing tiny local models to write real code. Here's what…

Serving cheap when two models agree: a measured cost lever

Free Models, Zero Compromise: Routing to Local and Free Tiers

Qwen3-Coder-Next review 2026: 80B params, 3B active, and the cheapest credible…

Related reading

We built a coding harness that beats frontier models using open ones. It's in…

AI Weekly: Cheaper Coding Models, Custom Chips, and a Stateless MCP

I spent ten days forcing tiny local models to write real code. Here's what…

Serving cheap when two models agree: a measured cost lever

Free Models, Zero Compromise: Routing to Local and Free Tiers

Qwen3-Coder-Next review 2026: 80B params, 3B active, and the cheapest credible…