The Problem We Were Actually Solving
The treasure hunt engine used a single Redis sorted set key per map instance: hytale:treasure:global:top. With 200 concurrent maps and 40 k concurrent players, each map's push operation (a ZADD on hytale:treasure:global:top) triggered an implicit DEL when the key grew past Redis's active-expire threshold. The eviction logs showed 1.2 M key deletions per minute, which translated to 15 k QPS on the DEL path alone. That churn saturated the Redis cluster's CPU on the replica threads, lifted p99 latency from 18 ms to 412 ms, and caused upstream game servers to backpressure their event queues. Load-shedding kicked in at 18 k QPS on the DEL path, which meant 20 % of treasure completions were dropped. The logs didn't say why; they just printed "Too many active connections".
What We Tried First (And Why It Failed)
Scaling the Redis cluster from 3 to 6 shards was the first move. We used redis-shard 2.4.1. The shard count doubled, but the hot key was still on shard 0: a single key lives on exactly one shard, so adding shards does nothing to spread its load. We hit the per-shard connection limit inside 45 minutes and the cluster entered a loop of fencing and resharding. Metrics: p99 latency 311 ms, evictions 1.4 M per minute, and connection count north of 28 k per shard. The game team added local LRU caches in the lobby microservice, but the cache invalidation used a pub-sub channel that Redis itself couldn't deliver under load, so stale chests propagated for up to 90 seconds. The player reports read: "My treasure vanished." "My rank changed while I blinked."
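To see why doubling the shard count could not help, here is a minimal sketch of how a key is placed under standard Redis Cluster slotting (CRC16 of the key, modulo 16384 slots). Two assumptions are ours, not facts from the incident: that redis-shard distributes keys the same way Redis Cluster does, and that slots are split into equal contiguous ranges per shard (the hypothetical shard_for helper). The shard index printed here is therefore illustrative and need not match the shard 0 we saw in production. What carries over is that one key hashes to exactly one slot, and so to exactly one shard, whatever the shard count.

```python
SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Hash slot for a key, honouring {hash tags} as Redis Cluster does."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end > start + 1:  # only a non-empty tag is hashed on its own
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % SLOTS


def shard_for(key: str, shard_count: int) -> int:
    """Hypothetical layout: equal, contiguous slot ranges per shard."""
    return key_slot(key) * shard_count // SLOTS


HOT_KEY = "hytale:treasure:global:top"

for shards in (3, 6):
    # Every writer, on every map, resolves the hot key to the same shard.
    print(f"{shards} shards: slot {key_slot(HOT_KEY)}, "
          f"shard {shard_for(HOT_KEY, shards)}")
```

Resharding moves slots between nodes, but it cannot split a single key: all 200 maps kept writing to whichever node owned that one slot.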






