Why Your Treasure Hunt Engine Kept Crashing at 1.2M Concurrent Connections

The problem we were actually solving was not how to make the treasure hunt more fun, but how to keep the leaderboard from exploding the heap when 1.2 million players hammered the Redis cluster at exactly 3:17 PM every Tuesday. The marketing team called this "peak engagement"; I called it a memory avalanche. We were running Veltrix's open-source treasure-hunt engine on a 32 GB RAM instance, and every spike turned the node into a swap-to-death zombie. The leaderboard tier used an in-memory sorted set, which Redis documents as O(log N) per update, but at N = 1.2M the constant factor was high enough that the Lua scripts were spending more time context-switching than updating scores. We hit 400 MB of resident memory per process, and once the Go garbage collector paused for 420 ms, the TCP backlog overflowed and dropped 37k ZADD requests. That was the first time the CEO noticed the word "cache".
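To put the O(log N) claim in perspective, here is a back-of-the-envelope sketch in Python. It assumes one ordering comparison per level of a balanced structure, which is a simplification of the skip list Redis actually uses, and the function names are made up for illustration. It only shows that the logarithmic term stays small even at 1.2M members, which is consistent with the per-request constant costs being the real problem.

```python
import math

PLAYERS = 1_200_000  # leaderboard size from the incident


def comparisons_per_update(members: int) -> float:
    """Rough number of ordering comparisons for one O(log N) update.

    Assumption: one comparison per level of a balanced structure.
    """
    return math.log2(members)


def work_ratio(small: int, large: int) -> float:
    """How much the logarithmic term grows between two leaderboard sizes."""
    return comparisons_per_update(large) / comparisons_per_update(small)


if __name__ == "__main__":
    # Roughly 20 comparisons per update at 1.2M members.
    print(f"{comparisons_per_update(PLAYERS):.1f} comparisons per update")
    # Growing from 1,000 to 1.2M members only about doubles that term.
    print(f"{work_ratio(1_000, PLAYERS):.2f}x the work of a 1,000-player board")
```

Going from a thousand players to 1.2 million only about doubles the comparison count, so the sorted set itself was unlikely to be what fell over; script overhead, round trips, and GC pauses scale with request volume instead.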

What we tried first (and why it failed)

We started with the obvious: bump the Redis instance to 128 GB, move the leaderboard to a separate node, and slap a read replica in front. The operator docs called this "horizontal scaling". What they did not mention was that Redis Cluster shards by key across hash slots, so the keys touched by a single player's score update might live on three different primaries. When players near the top of the leaderboard updated their scores, the cluster saw a surge of cross-slot traffic that turned the network card into a bottleneck. We were pushing 8 Gbps of intra-cluster traffic with only 3 Gbps of actual game updates. The Redis cluster bus started dropping gossip messages, and the cluster lost its view of slot ownership for 11 seconds. Those 11 seconds were enough for two dozen nodes to start a new election cycle, and the leaderboard froze while the slot map reconverged.
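The fan-out is easier to see once you compute the slots by hand. The sketch below reimplements the key-to-slot mapping described in the Redis Cluster specification (CRC16/XMODEM of the key, modulo 16384, with the hash-tag rule) in plain Python. The key names are hypothetical, invented for illustration; they are not the engine's real schema.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (the XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: bytes) -> int:
    """Map a key to one of the 16384 Redis Cluster hash slots.

    If the key contains a non-empty {tag}, only the tag is hashed.
    """
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384


if __name__ == "__main__":
    # Hypothetical key names: without a hash tag, each key is hashed in
    # full, so related keys can end up on different primaries.
    for name in (b"leaderboard:global", b"leaderboard:weekly", b"player:42:score"):
        print(name.decode(), "->", key_slot(name))

    # With a shared hash tag, only "hunt" is hashed, so all three keys
    # share one slot and therefore one primary.
    for name in (b"{hunt}:global", b"{hunt}:weekly", b"{hunt}:player:42"):
        print(name.decode(), "->", key_slot(name))
```

Hash tags keep related keys together, but they also concentrate all of that load on a single primary, so they trade cross-slot traffic for a hotter node.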

Related reading

A Week in the Life of a Treasure Hunt Engine that Almost Went Off the Rails

Your Treasure Hunt Engine Was Probably a Latency Minefield (And Here's the…

Veltrix's Treasure Hunt Engine: Optimized for Long-Term Survival, Not Just…

How We Blew Up Our Event Pipeline at 3 AM Because the Treasure Hunt Engine Had…

The Day the Treasure Hunt Engine Drowned in 300 ms Queries

The Treasure Hunt Engine That Broke Before the Traffic Did
