Originally published on kuryzhev.cloud

We replaced 200 static Prometheus threshold alerts with an AI anomaly detection model — and spent the first month making things measurably worse before we figured out why. The model fired constantly, woke people up at 3am for non-issues, then went completely silent during a real incident. This is the honest account of what went wrong and what the working architecture actually looks like now.

Context — Why We Tried AI Anomaly Detection in the First Place

Our alerting stack before this experiment was a graveyard of static thresholds. CPU above 80%? Alert. Memory above 75%? Alert. P95 latency above 500ms? Alert. Every one of those numbers was picked by a human, at a point in time, for a service that has since changed completely. The result was a Prometheus setup with roughly 200 alert rules that the on-call rotation had learned to mostly ignore. Alert fatigue was real and documented — we had a Slack channel called #alerts-noise that received more traffic than #incidents.

The incident that finally pushed us to act was a gradual memory leak in a Go microservice. The leak was slow — about 12MB per hour. It stayed under every static threshold for six full hours. No alert fired. Then the service hit the container memory limit, got OOM-killed, and the cascade took down three downstream services before we caught it. The postmortem was uncomfortable. We had all the metrics. Prometheus had scraped every data point. We just had no rule that would have caught a slow, sustained drift rather than a sharp spike.