I shipped a breaking change to production three weeks ago. CI was green. 142 tests passed. Code review was approved. The PR was clean enough that I barely read it.

A mobile client started getting 500s four hours later.

Here's the part that kept me up that night: every single safety net I had was working as designed. That's what scared me. The system didn't fail. The system did exactly what I told it to do, and the thing still broke.

Let me walk you through how, because I think a lot of you are about to hit the same wall, if you haven't already.

What actually happened