When pytest Said "Passed," It Was Lying

How a polluted virtual environment made my green tests meaningless

Part of the ForgeFlow series — building a coding agent that runs its execution loop locally on an M5 Max, and writing down what actually breaks. Planning runs on Claude; code generation runs on a local model via Ollama, test-driven inside a Docker sandbox.

For a few days, I made decisions on top of a number that wasn't true.

The number was 186 passed. It came out of pytest, green, at the bottom of the terminal, the way it had dozens of times before. I trusted it the way you trust a number that has never been wrong before. Then I found out the run had been measured inside the wrong environment, and the green had very little to do with the code I thought I was checking.

To be fair to the tool: pytest wasn't wrong. It answered exactly the question I handed it — it just wasn't the question I meant to ask. This post is about that gap. Not a bug in a test, but a bug in how I measured the tests. It turned out to be one of the more uncomfortable lessons in the project so far, because it sat underneath everything else. If the floor is tilted, every measurement you take on top of it inherits the tilt, and you don't see it, because the floor looks like the floor.

How a polluted virtual environment made my green tests meaningless

For a few days, I made decisions on top of a number that wasn't true.

When pytest Said "Passed," It Was Lying

When pytest Said "Passed," It Was Lying

Related reading

The Human in the Loop Doesn't Scale. I Kept Him Anyway.

What I Learned by Deleting Rules My Agent Had Already Learned

The Gate Fired 198 Times. I Called It "Working."

My test suite was green. My software was lying to me.

How I Prevented Claude Code from Breaking My Architecture with 18 Tests That…

A green test suite proves less than you think

Related reading

The Human in the Loop Doesn't Scale. I Kept Him Anyway.

What I Learned by Deleting Rules My Agent Had Already Learned

The Gate Fired 198 Times. I Called It "Working."

My test suite was green. My software was lying to me.

How I Prevented Claude Code from Breaking My Architecture with 18 Tests That…

A green test suite proves less than you think