The Problem We Were Actually Solving
The product goal was simple: every player who walks into a building on the map should see the same treasure list within 300 ms. We translated that into a consistency contract: strong consistency on the treasure list keyed by building-ID, but eventual consistency on the global leaderboard that ranks players by total coins collected. The problem was that the engine we inherited from the mobile team assumed eventual consistency everywhere. Their Redis Cluster v6.2.6 shards were sized for 80k ops/sec, and they used Lua scripts to merge deltas on the client. When the Royale drop pushed 1.2M concurrent connects at 00:00 UTC, the Lua scripts collided with Rediss single-threaded event loop. We saw 47k script-timeouts per minute and a P99 tail latency of 4.2 seconds on the treasure-list endpoints.
What We Tried First (And Why It Failed)
Our first rollout kept the Lua merges but moved the treasure lists to a Go service backed by a single PostgreSQL 14 cluster with pgbouncer 1.17.0 connection pooling. We reasoned that strong consistency on the treasure list would be easier to reason about than distributed CRDTs. The migration script ran at 20:00 UTC the night before the drop. Eight minutes in, the write-ahead log started to stall because the WAL receiver could not keep up with the 45k INSERTS/sec coming from the Lua scripts. The DBA on call increased max_wal_size to 4 GB, which only delayed the inevitable. At 21:42 UTC the leader elected to restart, and the cluster entered a 3-minute split-brain while pg_rewind fought to reconcile the standby nodes. When the service came back, the Lua scripts had already enqueued 1.9 million backlogged treasure events. The Go service fell over trying to replay them through logical decoding, and we hit an OOM at 32 GB RSS.






