How a polluted virtual environment made my green tests meaningless

Part of the ForgeFlow series — building a coding agent that runs its execution loop locally on an M5 Max, and writing down what actually breaks. Planning runs on Claude; code generation runs on a local model via Ollama, test-driven inside a Docker sandbox.

For a few days, I made decisions on top of a number that wasn't true.

The number was 186 passed. It came out of pytest, green, at the bottom of the terminal, the way it had dozens of times before. I trusted it the way you trust a number that has never been wrong before. Then I found out the run had been measured inside the wrong environment, and the green had very little to do with the code I thought I was checking.

To be fair to the tool: pytest wasn't wrong. It answered exactly the question I handed it — it just wasn't the question I meant to ask. This post is about that gap. Not a bug in a test, but a bug in how I measured the tests. It turned out to be one of the more uncomfortable lessons in the project so far, because it sat underneath everything else. If the floor is tilted, every measurement you take on top of it inherits the tilt, and you don't see it, because the floor looks like the floor.