The Problem We Were Actually Solving
In 2024 we shipped the treasure-hunt engine for Veltrix, handling 2,300 concurrent sessions and 180,000 packets per second across four AWS AZs. Everything was fine until Black Friday weekend. On Friday at 14:01 UTC the multi-tenant orchestrator started getting a 429 on every DescribeCacheNodes call to ElastiCache. The Redis cluster itself was humming along at <3 ms P99, but the AWS control plane simply could not keep up with the discovery loop we had hard-coded: every 5 seconds the orchestrator issued a DescribeCacheNodes against every shard, multiplied by the number of games, multiplied again by the number of players per game. By 14:12 UTC we had 1.2 million DescribeCacheNodes calls outstanding, each one costing us 328 ms and 4 KB of bandwidth. At that point the Redis control plane started throttling, and the latency of Lua script executions jumped from 6 ms to 1.8 seconds. Players started reporting "We couldn't find the chest on the map."
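To make the fan-out concrete, here is a minimal Python sketch of the arithmetic behind that discovery loop. The function names and the shard, game, and player counts are hypothetical, chosen only for illustration; the only figures taken from the incident are the 5-second interval and the shape of the multiplication (shards × games × players per game).

```python
def discovery_calls_per_second(shards: int, games: int,
                               players_per_game: int,
                               interval_s: float = 5.0) -> float:
    """Control-plane calls per second when one call is issued per
    (shard, game, player) combination every `interval_s` seconds."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return shards * games * players_per_game / interval_s


def amplification_factor(games: int, players_per_game: int) -> int:
    """How many times more calls the loop makes than a loop that
    polls each shard only once per interval."""
    return games * players_per_game


if __name__ == "__main__":
    # Hypothetical inputs, not the production values.
    shards, games, players = 6, 100, 20
    rate = discovery_calls_per_second(shards, games, players)
    factor = amplification_factor(games, players)
    print(f"{rate:.0f} calls/s, {factor}x more than per-shard polling")
```

The point of the sketch is that the call rate grows with the product of three numbers, two of which (games and players) rise sharply during a traffic spike, while the number of shards that actually need describing stays the same.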
What We Tried First (And Why It Fails)
Our first configuration file looked like this:
orchestrator:






