Grafana 'No Data' after migration: 7 reconcilers we had to kill first

The first fix lasted 90 seconds. We had corrected the Grafana datasource URL from prometheus:9999 back to prometheus:9090, watched the pod roll, refreshed the dashboard, and seen one panel come alive. By the time we opened a second tab, the ConfigMap was back to 9999. That was the real incident. The 'No Data' dashboards were a symptom of an observability stack that someone, or something, was actively re-corrupting from at least seven places we had not yet found.
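When an edit reverts this quickly, one place to look is the object's own metadata: Kubernetes records each field manager that wrote to an object under metadata.managedFields. The sketch below is not the procedure we followed during the incident. It is a minimal illustration that assumes the ConfigMap has been dumped to JSON with kubectl get ... -o json --show-managed-fields. The function name list_writers and the sample manager names are hypothetical.

```python
import json


def list_writers(obj):
    """Return (time, manager, operation) tuples from managedFields, newest first.

    `obj` is a parsed Kubernetes object (a dict), for example the JSON
    output of `kubectl get configmap <name> -o json --show-managed-fields`.
    """
    entries = obj.get("metadata", {}).get("managedFields", [])
    rows = [
        (e.get("time", ""), e.get("manager", "unknown"), e.get("operation", ""))
        for e in entries
    ]
    # RFC 3339 UTC timestamps ("...Z") sort correctly as plain strings.
    return sorted(rows, reverse=True)


def list_writers_from_json(text):
    """Convenience wrapper for raw JSON text (e.g. read from a file or stdin)."""
    return list_writers(json.loads(text))
```

If the newest entry is a manager other than the kubectl session that made the fix, that manager is a candidate for whatever is rewriting the ConfigMap. The metadata only names the writer, so it still has to be traced back to a controller, Helm release, or GitOps tool by hand.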

Problem signals:

Grafana dashboards show 'No Data' on every panel after a cluster migration, and fixes made with kubectl edit revert within 1-3 minutes

Prometheus targets page is empty or stuck on a namespace that no longer exists

ClusterRoleBindings you just recreated reference a ClusterRole name nobody on the team typed
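The third signal can be checked mechanically. The sketch below assumes two JSON dumps are available, one from kubectl get clusterrolebindings -o json and one from kubectl get clusterroles -o json, and reports bindings whose roleRef names a ClusterRole that is not in the cluster. The function name dangling_bindings is hypothetical, and a binding that points at an existing but unexpected role would still need a manual look.

```python
def dangling_bindings(bindings, roles):
    """Return (binding_name, role_name) pairs whose ClusterRole does not exist.

    `bindings` and `roles` are parsed List objects (dicts with an "items"
    key), as produced by `kubectl get clusterrolebindings -o json` and
    `kubectl get clusterroles -o json`.
    """
    known_roles = {r["metadata"]["name"] for r in roles.get("items", [])}
    missing = []
    for b in bindings.get("items", []):
        ref = b.get("roleRef", {})
        # Only ClusterRole references are compared against the ClusterRole list.
        if ref.get("kind") == "ClusterRole" and ref.get("name") not in known_roles:
            missing.append((b["metadata"]["name"], ref.get("name")))
    return sorted(missing)
```

Running the check twice, a few minutes apart, also shows whether a binding recreated by hand has been rewritten in the meantime.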
