The problem we were actually solving was not how to make the treasure hunt more fun, but how to keep the leaderboard from exploding the heap when 1.2 million players hammered the Redis cluster at exactly 3:17 PM every Tuesday. The marketing team called this "peak engagement"; I called it a memory avalanche. We were running Veltrix's open-source treasure-hunt engine on an instance with 32 GB of RAM, and every spike turned the node into a swap-to-death zombie. The leaderboard tier used an in-memory sorted set that Redis advertises as O(log N) per operation, but at N = 1.2 M the constant factor was high enough that the Lua scripts were spending more time context-switching than updating scores. We hit 400 MB of resident memory per process, and once the Go garbage collector paused for 420 ms, the TCP backlog overflowed and dropped 37k ZADD requests. That was the first time the CEO noticed the word "cache".

What we tried first (and why it failed)

We started with the obvious: bump the Redis instance to 128 GB, move the leaderboard to a separate node, and slap a read replica in front. The operator docs called this "horizontal scaling". What they did not mention was that Redis Cluster distributes keys across hash slots, so a single player's score update might fan out to three different primaries. When players near the top of the leaderboard updated their scores, the cluster saw a surge of cross-slot traffic that turned the network card into a bottleneck. We were pushing 8 Gbps of intra-cluster traffic to carry only 3 Gbps of actual game updates. The Redis cluster bus started dropping gossip messages, and the cluster lost its view of slot ownership for 11 seconds. Those 11 seconds were enough for two dozen nodes to start a new election cycle, and the leaderboard froze while the slot map reconverged.
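
To make the slot fan-out concrete, here is a minimal sketch of how Redis Cluster maps a key to one of its 16384 hash slots, following the published Redis Cluster specification: CRC16 (the XMODEM variant) of the key, modulo 16384, with an optional {hash tag} that limits which part of the key is hashed. The function names and the example key names are hypothetical and are not taken from our codebase or from the Veltrix engine.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0x0000."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_hash_slot(key: str) -> int:
    """Return the cluster hash slot (0..16383) for a key.

    If the key contains a non-empty {tag} (first '{' and the first '}'
    after it), only the tag is hashed, so keys sharing a tag share a slot.
    """
    raw = key.encode("utf-8")
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end != -1 and end > start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


if __name__ == "__main__":
    # Hypothetical key names, for illustration only.
    for key in ("leaderboard:global", "leaderboard:region:eu", "player:42:score"):
        print(key, key_hash_slot(key))
    # Keys that share a hash tag always land in the same slot.
    print(key_hash_slot("{lb}:global") == key_hash_slot("{lb}:region:eu"))
```

Keys without a shared tag hash independently, so a write that touches several of them can land on several primaries. Hash tags pin related keys to one slot, at the cost of concentrating their load on a single node.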