This is a /dev post for people who read benchmark tables for a living. The thesis is simple: a cascade that serves most requests from a cheap local model, escalating only the hard ones to a frontier model, can hit frontier-quality coding scores at a fraction of the per-request cost. The harder claim, the one we care about, is that the reliability comes from the structure, not the model. Whether that holds over long horizons at scale is exactly what our unrun benchmarks are meant to settle, so we flag it as a goal, not a result. Below is what we measured, how we kept the scoring honest, and where we still have no number at all.

The architecture, in one paragraph

Two channels. A capability channel (the cheap tier: gpt-oss-120b, an open roughly 120B model we run at a fraction of frontier price, doing the actual solving) and a structure channel (verification gates and guards that decide whether an answer is trustworthy or needs escalation). A cache sits in front so exact repeats do not re-solve. When the local model is confident and the guards pass, the request is served cheap. When the guards fail, it escalates to frontier. Most of the interesting behavior, and most of the measurement difficulty, lives in the structure channel.