Why a blocked count is not a success metric
Part of the ForgeFlow series — building a coding agent that runs its execution loop locally on an M5 Max, and writing down what actually breaks. Planning runs on Claude; code generation runs on a local model via Ollama, test-driven inside a Docker sandbox.
I built a gate to block bad code. It blocked 198 pieces of code, and I took that number as evidence the gate was working well.
Then I opened the blocked cases and read them one by one, checking each against the acceptance criteria for the task it came from. A large share of them weren't bad code. The gate had been wrong often enough that I could no longer read the block count as evidence it was working — it had been firing constantly, exactly as it was designed to fire, and I'd mistaken "it fires a lot" for "it's doing its job." Those are not the same statement.
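To make the difference concrete, here is a minimal sketch of the number the block count was standing in for: the share of blocks that were actually correct. It assumes each blocked case has been hand-labelled against its task's acceptance criteria. `BlockedCase`, `meets_acceptance`, `gate_precision`, and the four-row audit are all hypothetical names and made-up data for illustration, not ForgeFlow's code or its real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockedCase:
    """One piece of code the gate blocked, plus a manual verdict (hypothetical shape)."""
    task_id: str            # illustrative identifier for the originating task
    meets_acceptance: bool  # True if the blocked code actually satisfied the criteria


def gate_precision(cases: list[BlockedCase]) -> float:
    """Share of blocks that were correct, i.e. the blocked code really was bad."""
    if not cases:
        raise ValueError("no blocked cases to audit")
    correct_blocks = sum(1 for case in cases if not case.meets_acceptance)
    return correct_blocks / len(cases)


# Made-up audit of four blocks, for illustration only (not the post's data).
audit = [
    BlockedCase("task-a", meets_acceptance=False),  # correct block
    BlockedCase("task-b", meets_acceptance=True),   # wrong block: the code was fine
    BlockedCase("task-c", meets_acceptance=True),   # wrong block: the code was fine
    BlockedCase("task-d", meets_acceptance=False),  # correct block
]

print(f"blocked: {len(audit)}, precision: {gate_precision(audit):.2f}")
```

The block count (`len(audit)`) is the same whether every verdict is right or every verdict is wrong. Only the labelled audit separates "it fires a lot" from "it's doing its job."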
This is the second post in a short run about something I kept tripping over while building this agent: the things I use to verify my system can themselves be broken, and they tend to break in ways that look like success. The last post was about a test run that lied by passing. This one is about a gate that lied by blocking.






