The Problem We Were Actually Solving

We werent building a generic scale story; we were protecting a money-printing loop. The treasure-hunt engine awarded cash prizes every hour, and each award ran a small blockchain simulator to determine rarity. That simulator used about 4 MB of in-memory state per player. When the Rift hit, we had 85 k concurrent players and 290 GB of heap demanded by the single Node process. Our vertical-scrape plan—going from 12 cores / 64 GB to 32 cores / 256 GB—would have cost an extra $6 k per event and still risked another heap OOM on the next traffic surge because Nodes single-threaded GC cannot compact memory while the event loop is saturated. The real problem was not CPU or memory on a bigger box; it was the single-process model itself.

What We Tried First (And Why It Failed)

First we split users randomly across five Node processes behind HAProxy. That lowered max heap per process to ~1.1 GB, but we immediately hit a different wall: the in-memory simulation state was not serializable. We tried Redis to store the 4 MB blob per user, but the SET operations took 15–28 ms on a cloud Redis 7.0 cluster with 5 ms p99 latency. At 85 k players, that became 1.3 million round trips per second, and we saturated the 1 Gbps link between the Node pool and Redis. The error surfaced as 38 % of write operations timing out with: