Originally published on kuryzhev.cloud

Why This Checklist

Three days. That's how long a memory leak crept through one of our checkout pods before anyone noticed — because our CPU-at-80% alert never fired. The leak was slow, gradual, and stayed well under every static threshold we had. When we finally caught it, it was a customer complaint that tipped us off, not Prometheus. That incident is the reason we built a Grafana AI anomaly detection pipeline, and it's the reason this checklist exists.

Static thresholds work fine for hard failures: disk full, service down, 500s spiking. They fail at catching slow-burn, multivariate drift: memory creeping up 2% a day, disk I/O latency inching from 4ms to 40ms over a week, a subtle shift in request patterns that only looks wrong when you compare it against the last 30 days of baseline. That's exactly the class of problem anomaly-detection models are good at. They learn "normal" for a given service and flag statistical deviation from it, not just the crossing of a fixed number.
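To make the contrast concrete, here is a minimal sketch on synthetic data. It is not our production pipeline: the 40% baseline, the daily sine-wave seasonality, and the z-score cutoff of 3 are assumptions chosen for illustration. A leak of 2% a day never gets near a static 80% threshold, yet it stands out against 30 days of baseline.

```python
import math
from statistics import mean, pstdev

STATIC_THRESHOLD = 80.0  # percent; a typical static alert level
Z_THRESHOLD = 3.0        # assumed cutoff for "statistically unusual"
BASELINE_HOURS = 30 * 24
LEAK_HOURS = 3 * 24


def memory_pct(hour, leak_start=None):
    """Synthetic memory usage: 40% baseline, +/-2% daily cycle, optional leak."""
    seasonal = 2.0 * math.sin(2 * math.pi * hour / 24)
    leak = 0.0
    if leak_start is not None and hour >= leak_start:
        leak = 2.0 * (hour - leak_start) / 24  # 2% per day
    return 40.0 + seasonal + leak


# 30 days of "normal", followed by 3 days with the leak active.
baseline = [memory_pct(h) for h in range(BASELINE_HOURS)]
recent = [
    memory_pct(h, leak_start=BASELINE_HOURS)
    for h in range(BASELINE_HOURS, BASELINE_HOURS + LEAK_HOURS)
]

mu, sigma = mean(baseline), pstdev(baseline)

# Static rule: fire only when the raw value crosses a fixed number.
static_alerts = [v for v in recent if v >= STATIC_THRESHOLD]
# Baseline rule: fire when the value is far from what this service usually does.
zscore_alerts = [v for v in recent if abs(v - mu) / sigma > Z_THRESHOLD]

print(f"static alerts:  {len(static_alerts)}")
print(f"z-score alerts: {len(zscore_alerts)}")
```

A plain z-score is the simplest possible stand-in for a learned baseline. A real model would also account for seasonality and trend, which is where the libraries mentioned later in this post come in.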

But here's the catch nobody tells you upfront: wiring an ML model's output into an alerting system is its own project, separate from building the model. You're not just training Prophet or PyOD on a metric. You're turning an anomaly score (say, a float between 0 and 1) into a reliable page that doesn't wake someone up for nothing. Alert fatigue is real, and a poorly tuned anomaly pipeline can generate more noise than the static rules it replaced. The setup cost is only worth it if you're running a large fleet, dealing with seasonal traffic, or managing multi-tenant systems where "normal" varies by customer.
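One common way to turn a raw score into a page is to require the score to stay high for several consecutive evaluations. The sketch below assumes scores already normalized to the 0 to 1 range. The function name `should_page`, the 0.8 threshold, and the three-sample window are hypothetical defaults for illustration, not values from our setup.

```python
from typing import Iterable


def should_page(
    scores: Iterable[float],
    threshold: float = 0.8,    # assumed score cutoff
    min_consecutive: int = 3,  # assumed number of sustained samples
) -> bool:
    """Return True only if the score stays at or above threshold
    for min_consecutive samples in a row."""
    streak = 0
    for score in scores:
        # Any sample below the threshold resets the streak.
        streak = streak + 1 if score >= threshold else 0
        if streak >= min_consecutive:
            return True
    return False


# A single spike is ignored; a sustained anomaly pages.
print(should_page([0.1, 0.95, 0.2, 0.1]))    # False
print(should_page([0.2, 0.85, 0.9, 0.92]))   # True
```

The trade-off is detection delay: a longer window suppresses one-off spikes but pages later, which matters less for slow-burn drift than for hard failures.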