A green test suite proves less than you think

The test that scared me was the one that passed.

I had an integration test for a routing agent, the kind that takes a task and picks a capability to handle it. The test registered a new capability at runtime and then checked that the router would eventually route to it. Green run after run. Solid. I trusted it.

Then I read it properly. It reused the same task string on every iteration of the loop. My scorer was deterministic by design, it hashed the task and indexed into the capability list, so a fixed string mapped to a fixed slot, and the newly registered capability lived at a different slot that the fixed string could never reach. The test asserted that the new capability got selected. The new capability was structurally unreachable. And the assertion passed anyway, because the loop happened to land on something registered every time, which was all the weak version of the check actually demanded.

The test was not testing what it said it was testing. It was green for a reason that had nothing to do with the thing I cared about. The fix was almost insulting in its smallness, vary the task strings so the hash spreads across every slot including the new one, and suddenly the test could fail when the feature was broken, which is the entire point of a test. One line. I had been shipping false confidence behind a checkmark.