Why API Breaking Changes Still Reach Production Even With CI/CD
A few years ago I watched a "tiny" API change take down checkout for about forty minutes. The change was a one-liner. The pull request had two approvals. CI was green across the board. And it still broke production, because the thing that actually mattered was never tested.
If you run microservices at any real scale, you have lived some version of this. Let's talk about why it keeps happening even with a mature pipeline, and what the teams that rarely get paged do differently.
The Problem
Here's the change that caused the outage. A payments service had a response that looked like this:






