The Problem We Were Actually Solving

We weren't just chasing p99 latency; we were solving a fundamental mismatch between the event model and the treasure hunt logic. Each treasure hunt round emits thousands of micro-events: player joins, item picks, time updates, leaderboard recalculations, and real-time notifications. The Node.js event loop was choking under the backpressure. The BullMQ worker was blocked on Redis pub/sub, not because of network latency, but because Node.js's single-threaded event loop couldn't keep up with the rate of incoming events. The Redis server itself was fine: CPU at 12%, memory at 68%, no evictions. The bottleneck wasn't the queue or the data store. It was the runtime.
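To make "couldn't keep up" concrete, here is a minimal sketch of a fluid queue model for a single-threaded consumer. The function names and the 40,000 events-per-second capacity are assumptions chosen for illustration, not measurements from our system; only the 47,000 events-per-second arrival rate comes from our own numbers.

```typescript
// Hypothetical fluid model of a single-threaded consumer.
// Rates are in events per second; nothing here is measured data.
function backlogAfter(arrivalRate: number, serviceRate: number, seconds: number): number {
  // The backlog only grows while events arrive faster than they are handled.
  const surplus = Math.max(0, arrivalRate - serviceRate);
  return surplus * seconds;
}

// How long the newest queued event waits before it is handled, in milliseconds.
function waitForNewestMs(backlog: number, serviceRate: number): number {
  return (backlog / serviceRate) * 1000;
}

// 47000 is the observed publish rate; 40000 is an ASSUMED loop capacity.
const assumedCapacity = 40000;
const backlogAtTenSeconds = backlogAfter(47000, assumedCapacity, 10);
const newestWaitMs = waitForNewestMs(backlogAtTenSeconds, assumedCapacity);
```

Under that assumed capacity, ten seconds of sustained load leaves 70,000 events queued and the newest one waits 1,750 ms. The point of the sketch is only that once arrivals exceed what one thread can process, latency is set by the backlog rather than by Redis or the network.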

I profiled the process with 0x and saw 78% of CPU time was spent in uv__io_poll, libuv's wrapper around the platform poller (epoll on Linux). The Node.js process was spending more time waiting for events than processing them. And because BullMQ keeps its queue state in Redis and uses Redis streams for its events, every publish and consume was a network round trip. The 250-microsecond RTT from us-east-1 to the Redis cluster was adding up when we were publishing 47,000 events per second. The p99 latency grew far faster than linearly with the number of concurrent players. At 5,000 players, it was 80 ms. At 10,000 players, 2.3 seconds, roughly a 29x increase for a 2x increase in load. The system wasn't degrading gracefully. It was falling off a cliff.
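The figures in this paragraph can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted here (47,000 events per second, a 250-microsecond RTT, and the two p99 readings); the function names are hypothetical, and the power-law fit is an assumption, since two data points cannot pin down the real curve.

```typescript
// Seconds of round-trip waiting generated per second of wall-clock time,
// assuming every event costs exactly one network round trip.
function roundTripSecondsPerSecond(eventsPerSecond: number, rttMicros: number): number {
  return (eventsPerSecond * rttMicros) / 1000000;
}

// Little's law: how many commands must be in flight at once to sustain
// the rate if each one takes a full round trip.
function minInFlight(eventsPerSecond: number, rttMicros: number): number {
  return Math.ceil(roundTripSecondsPerSecond(eventsPerSecond, rttMicros));
}

// Exponent k in "latency proportional to players^k" implied by two readings.
// ASSUMPTION: a power law is only a convenient yardstick, not a fitted model.
function scalingExponent(
  players1: number, latencyMs1: number,
  players2: number, latencyMs2: number,
): number {
  return Math.log(latencyMs2 / latencyMs1) / Math.log(players2 / players1);
}

const rttBudget = roundTripSecondsPerSecond(47000, 250);
const inFlightNeeded = minInFlight(47000, 250);
const impliedExponent = scalingExponent(5000, 80, 10000, 2300);
// What a true square-root law would have predicted at 10,000 players.
const sqrtPredictionMs = 80 * Math.sqrt(10000 / 5000);
```

At 47,000 events per second the round trips alone add up to 11.75 seconds of waiting per second, so at least 12 commands have to overlap (through pipelining, batching, or multiple connections) just to break even. The two p99 readings imply an exponent of roughly 4.85, whereas a square-root law would have put p99 at about 113 ms for 10,000 players instead of 2.3 seconds.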