Cross-posted from the LLMKube blog.
A local 27B coding model, running on hardware in my house, is a coin flip. Some runs it nails the fix in twenty minutes. Some runs it edits the wrong file, writes a test that passes no matter what the code does, and tells you it's done. The bet behind LLMKube's Foreman was never that I would find a local model good enough to trust. It was that I could build a harness I trust more than any single model's output. This weekend tested that bet harder than any benchmark could, because the harness spent the weekend building its own guardrails.
Here is the short version of what happened across 0.8.12 and 0.8.13. My local coder built three new gates for itself. One of them shipped with the exact flaw it was written to catch, and the review caught it. Three new contributors sent four clean pull requests while the machines worked. The same model ran on an AMD box and an Apple Silicon Mac, and the Mac quietly won a round nobody expected. And not one byte of any of it touched a cloud API.
The thesis, stated plainly
Trust the harness, not the model. A coding agent on a local model produces output of wildly variable quality, and no amount of prompt tuning makes a 27B as reliable as a frontier model. So Foreman does not ask the model to be reliable. It wraps the model in a pipeline that is: the coder works in a cloned workspace, a fast in-workspace gate runs gofmt, vet, build, lint, and the unit tests for the packages it touched; a reviewer reads the diff against the issue; and a clean-room Kubernetes Job re-runs the full suite before anything is allowed to call itself a GO. Around all of that sit deterministic rails: scope checks, edit-free-streak detection, repo-map context. The model is a stochastic component inside a system whose job is to make the system's verdict trustworthy even when the component is not.






