TL;DRAI

A gate blocked 198 attempts, mostly false positives—correct code rejected for form, not substance. Block counts hide gate miscalibration; systems optimize for gate-shape, not improvement. For CI checks: verify rejections map to actual defects, not surface patterns.

Why a blocked count is not a success metric

Part of the ForgeFlow series — building a coding agent that runs its execution loop locally on an M5 Max, and writing down what actually breaks. Planning runs on Claude; code generation runs on a local model via Ollama, test-driven inside a Docker sandbox.

I built a gate to block bad code. It blocked 198 pieces of code, and I took that number as evidence the gate was working well.

Then I opened the blocked cases and read them one by one, checking each against the acceptance criteria for the task it came from. A large share of them weren't bad code. The gate had been wrong often enough that I could no longer read the block count as evidence it was working — it had been firing constantly, exactly as it was designed to fire, and I'd mistaken "it fires a lot" for "it's doing its job." Those are not the same statement.

This is the second post in a short run about something I kept tripping over while building this agent: the things I use to verify my system can themselves be broken, and they tend to break in ways that look like success. The last post was about a test run that lied by passing. This one is about a gate that lied by blocking.

dev.to

The Gate Fired 198 Times. I Called It "Working."

Why a blocked count is not a success metric Part of the ForgeFlow series — building a...

domenica 21 giugno 2026 New tab

TL;DRAI

1,876 words~9 min read

Why a blocked count is not a success metric

I built a gate to block bad code. It blocked 198 pieces of code, and I took that number as evidence the gate was working well.

The Gate Fired 198 Times. I Called It "Working."

The Gate Fired 198 Times. I Called It "Working."

Related reading

The Code Caught the Bad Number. I Had to Catch the Bad Story.

When pytest Said "Passed," It Was Lying

The Human in the Loop Doesn't Scale. I Kept Him Anyway.

When --cap-drop ALL Broke the Gate Socket

Gates Earned From Failure: a cost test for agent guardrails

It ran it works: I audited my own security platform and found a detection…

Related reading

The Code Caught the Bad Number. I Had to Catch the Bad Story.

When pytest Said "Passed," It Was Lying

The Human in the Loop Doesn't Scale. I Kept Him Anyway.

When --cap-drop ALL Broke the Gate Socket

Gates Earned From Failure: a cost test for agent guardrails

It ran it works: I audited my own security platform and found a detection…