When failover isn’t safe: Building high-availability PostgreSQL on Kubernetes

Gamedays are one of the most effective ways we proactively uncover gaps in our systems and processes. At Datadog, we regularly run a variety of gamedays to intentionally stress our platforms and learn how our systems and teams respond under real-world conditions. These exercises help us surface hidden vulnerabilities, strengthen our operational readiness, and continually raise the bar for our infrastructure.

During one such gameday, we simulated a zonal failure by inducing network latency in a single availability zone of a staging environment. The exercise exposed a weakness in our PostgreSQL architecture. Several of our Kubernetes-based PostgreSQL clusters had their primary (writer) nodes running in the affected availability zone. As network latency spiked, those primaries could no longer communicate reliably with their replicas. Replication lag grew quickly, writes stalled, and applications began serving stale data. Because no replica was sufficiently up to date, promoting one would have risked losing writes that had not yet been replicated, so failover wasn’t safe and the clusters were effectively stuck.
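To make "sufficiently up to date" concrete, here is a minimal sketch of a failover safety check. It is an illustration, not the tooling we run: the `Replica` type, the `safe_failover_candidates` function, and the `max_lag_bytes` threshold are hypothetical names chosen for this example. The only PostgreSQL-specific detail is the standard LSN text format (`high/low` in hexadecimal), which is converted to a byte position so that lag can be computed by subtraction.

```python
from dataclasses import dataclass


def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


@dataclass
class Replica:
    # Hypothetical record for this sketch, not a real client type.
    name: str
    replay_lsn: str  # last WAL position the replica has replayed


def safe_failover_candidates(primary_lsn, replicas, max_lag_bytes):
    """Return (name, lag_bytes) for replicas within the lag budget.

    Results are sorted with the most up-to-date replica first.
    An empty list means no replica can be promoted safely.
    """
    primary_pos = parse_lsn(primary_lsn)
    candidates = []
    for replica in replicas:
        lag = primary_pos - parse_lsn(replica.replay_lsn)
        if lag <= max_lag_bytes:
            candidates.append((replica.name, lag))
    return sorted(candidates, key=lambda c: c[1])


replicas = [
    Replica("replica-a", "0/4FFFF00"),  # 256 bytes behind the primary
    Replica("replica-b", "0/1000000"),  # 64 MiB behind the primary
]

# With a 1 KiB budget, only replica-a qualifies.
print(safe_failover_candidates("0/5000000", replicas, max_lag_bytes=1024))

# With a zero-loss budget, nothing qualifies: the "stuck" case described above.
print(safe_failover_candidates("0/5000000", replicas, max_lag_bytes=0))
```

Under these assumptions, the second call returns an empty list, which mirrors what happened during the gameday: every replica was behind, so no promotion could be made without accepting data loss.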

We rely on PostgreSQL as the backend database for many Datadog products, and this architecture has served us well under normal conditions. But the gameday revealed an uncomfortable truth: In the face of certain network failures, our setup prioritized availability over durability in ways that left us with no safe recovery path.
