The fastest way to get a team to ignore Prometheus is to point an alert at up == 0 and ship it. Week one it catches a real outage and everyone's impressed. By week three it has fired a couple dozen times for scrape blips that healed themselves before anyone looked, and people have quietly built a mental filter that drops every page from that channel. The alert still fires. Nobody reads it.
That's the failure mode I want to talk about: alerts that are technically correct and operationally useless. An alert you've trained yourself to ignore is worse than no alert, because it gives you the feeling of coverage without the substance.
If you're running kube-prometheus-stack or plain Prometheus and you've reached the point where your alert channel is mostly noise, this is for you. If you've ever gotten a page that said "GPU temperature high" followed by a 36-character UUID and zero indication of which physical card was cooking, you're exactly the audience. The fix isn't fewer alerts. It's alerts that know what they're talking about.
What I reached for first (and why it didn't hold up)
The instinct when you stand up monitoring is to alert on everything you can see. Scrape is failing? Page. CPU over 80%? Page. Memory climbing? Page. Disk filling? Page. It feels responsible. You're "covering your bases."






