The Problem We Were Actually Solving

We needed to survive the Black Friday of game launches without throwing hardware at the problem. The treasure hunt engine is peak write-amplification: every coin placement and every search query mutates state and then broadcasts an event to hundreds of listening clients. Our first architecture carved the grid into shards, each backed by a dedicated PostgreSQL 14 instance with pg_bouncer in transaction pooling mode. It looked clean on paper: 64 shards, 8 read replicas per shard, Prometheus scraping every pg_stat_bgwriter.metrics at 5 s intervals.

The inflection came when a single viral TikTok clip drove 15× normal traffic. At 12 000 concurrent sessions we started seeing:

pg_bouncer logs: failed to get connection: timeout after 5000 ms

Prometheus counter pgsql_connections_max_overflow spiked to 42 on shard 23, exactly the one hosting the TikTok hotspot.