There is a specific kind of incident that no alert ever fires for, and it is the one I trust least. Nothing crashed. No exception, no 500, no failed health check. The agent ran every day, returned answers every time, and stayed green on every dashboard you own. And yet, over six weeks, it got measurably worse — and you found out from a customer, not a monitor.

That is drift, and it is the failure mode I think the industry is least prepared for. We have gotten good at catching the cliff: the agent throws, the tool 500s, the JSON won't parse, CI goes red. We are still terrible at catching the slope: answer quality bleeding out two percent a week while every system reports perfect health. Crashes are loud and self-announcing. Drift is silent by construction, and that silence is exactly why it wins.

Here is the opinion I will defend: drift is not an outlier problem, it's a baseline problem. You cannot detect decay by looking at any single run, because a single run looks completely fine. Drift only exists as a change in a distribution over time — so if you are not continuously scoring production and trending the score, you are structurally incapable of seeing it. Not unlucky. Incapable.

Why your code didn't change but your behavior did