I had a thought a few weeks ago that wouldn't leave me alone - I depend on Claude Code every day. If it disappeared tomorrow, priced out, rate limited, whatever, I'd want a fallback I actually own. Not a cheaper subscription. Something that runs on my own hardware, forever, at zero marginal cost.

So I started an experiment. A coding harness where every reasoning call goes to a tiny local model (Gemma 4 2B, served by llama.cpp on a Jetson Orin Nano), and the harness does everything it can to make up the difference. One hard rule - no cloud fallback, ever. If the small model can't do something, decompose the work or move it into deterministic code. Never escalate to a bigger model.

The bet isn't mine alone. Projects like little-coder and NVIDIA's small-model research make the same wager - small models underperform agentic work because their harnesses are thin, not because the models are incapable. I wanted to find out exactly how true that is, with numbers I could trust. Ten days in, here's what I've learned.

The harness was throwing away right answers

My biggest early win wasn't making the model smarter. It was noticing that about 60% of my failures were the model producing correct logic with broken indentation. The module wouldn't even import, so it scored as a fail. The right answer was sitting there and my harness was discarding it over whitespace.