The 2am Dashboard That Started Everything

It was 2:47am when I opened our feature flag dashboard and realized I had no idea what had changed. P99 latency on our payments service had spiked to 4.2 seconds about 20 minutes earlier, and the on-call playbook said to check recent flag changes first. We had LaunchDarkly for flag evaluation, Jira for change tickets, and a Notion doc that was supposed to track active experiment flags. The Notion doc hadn't been touched in six weeks. The Slack channel that was nominally our audit log had 340 unread messages from the previous day's deploy sprint.

The actual question I needed to answer — which flags changed in the last 30 minutes across all services — had no answer. Not a slow answer, not an approximate answer. No answer.

That's a knowledge management failure, not an infrastructure failure. We had three systems that each held a partial slice of production state, with no causal model shared between them. LaunchDarkly knew flag evaluation counts. Jira knew someone had opened a ticket. Notion knew whatever someone remembered to type. None of them knew that a flag flip in service A at 2:31am might be causally related to the latency spike in service B at 2:33am.
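To make the missing piece concrete, here is a minimal sketch of the query I wanted that night, assuming every flag change from every system had been normalized into one shared event record. Everything in it is hypothetical: the `ChangeEvent` schema, the function names, the flag names, and the calendar date are illustrative and do not correspond to LaunchDarkly's, Jira's, or Notion's actual APIs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """One flag change, normalized from any source (hypothetical schema)."""
    at: datetime
    service: str
    flag: str
    source: str  # which system reported it, e.g. "launchdarkly"


def changes_in_window(events, end, window=timedelta(minutes=30)):
    """Return changes across all services in (end - window, end], oldest first."""
    start = end - window
    return sorted((e for e in events if start < e.at <= end), key=lambda e: e.at)


def candidates_before(events, incident_at, lookback=timedelta(minutes=5)):
    """Changes shortly before an incident.

    This is temporal proximity only: a list of suspects, not proof of cause.
    """
    return changes_in_window(events, incident_at, lookback)


# Illustrative data only; the date, flag names, and services are made up.
day = datetime(2024, 1, 1)
events = [
    ChangeEvent(day.replace(hour=2, minute=31), "service-a", "new-retry-policy", "launchdarkly"),
    ChangeEvent(day.replace(hour=1, minute=10), "service-c", "checkout-banner", "launchdarkly"),
]

# "Which flags changed in the last 30 minutes across all services?"
recent = changes_in_window(events, end=day.replace(hour=2, minute=47))

# "What changed just before the latency spike?"
suspects = candidates_before(events, incident_at=day.replace(hour=2, minute=33))
```

The code is trivial on purpose. The hard part was never the query; it was that no single place held records in a shape like `ChangeEvent`, so there was nothing to run the query against.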