The Problem We Were Actually Solving
Our player base exploded from 400k to 1.2M during Black-Friday week while the Treasure Hunt event gave 100k concurrent players a 30-second window to solve 5 puzzles. Rewards were dynamic: gold coins, exclusive skins, or nothing. The business wanted sub-second latency on /hunt so the UI felt instant, but AWS cost ceiling was $0.06 per player. We chose Redis with a bloom filter because the treasure engine doesnt need consistency for coins—only existence. What we didnt model was the write amplification from bloom regeneration and the session churn ratio of 2.1 per minute during the event. That churn meant TTL ≤ 30 s to keep memory bounded, but then the bloom false-positive rate spiked from 1 % to 12 % because the filter recycled every second. At 12 % false positives we were hitting Aurora with 80k point lookups per second—each query costing 30 ms on a db.t3.2xlarge. SLA burned, PagerDuty pages triggered, and finance sent a Slack alert titled Budget vs Reality: +37 %.
What We Tried First (And Why It Failed)
We started with Redis 7.2 clustered mode, three shards, replication factor 2, and a global session prefix. The bloom filter was a module named RedisBloom v2.4.4 with two parameters: capacity 10M and error rate 0.01. We set memory limit to 12 GiB per shard and assumed TTL at 5 minutes because 5 minutes felt long enough for a hunt session. The first load test with 20k concurrent players showed 150 ms p99 but we didnt simulate session churn. On the third day of internal QA the bloom hit 92 % memory usage in 45 minutes and OOM-killed the cluster. We switched to LFU eviction, then to a tiered setup with a 1 GiB hotset and a 10 GiB overflow. Hotset eviction still nuked the bloom filter and trigger regeneration on every request, cycle repeats. On the live day we saw the false positives spike and Aurora melt. We rolled back to a flat Redis hash (string type, value size 2.0 KiB) and set TTL 30 s. The p99 latency crashed back to 220 ms, CPU on instances dropped to 18 %, and cost per player stayed inside budget. Lesson learned: bloom filters and high-churn sessions are a toxic mix when your SLA is measured in milliseconds.






