A Week in the Life of a Treasure Hunt Engine that Almost Went Off the Rails

The Problem We Were Actually Solving

Our player base exploded from 400k to 1.2M during Black Friday week, while the Treasure Hunt event gave 100k concurrent players a 30-second window to solve 5 puzzles. Rewards were dynamic: gold coins, exclusive skins, or nothing. The business wanted sub-second latency on /hunt so the UI felt instant, but our AWS cost ceiling was $0.06 per player. We chose Redis with a bloom filter because the treasure engine doesn't need consistency for coins, only an existence check. What we didn't model was the write amplification from bloom regeneration and the session churn ratio of 2.1 per minute during the event. That churn forced a TTL of 30 s or less to keep memory bounded, but then the bloom false-positive rate spiked from 1 % to 12 % because the filter recycled every second. At 12 % false positives we were hitting Aurora with 80k point lookups per second, each query costing 30 ms on a db.t3.2xlarge. The SLA burned, PagerDuty pages fired, and finance sent a Slack alert titled "Budget vs Reality: +37 %".
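
The numbers above can be sanity-checked with some back-of-envelope arithmetic. The sketch below is illustrative only: the function names are made up for this post, and it assumes every Aurora point lookup was caused by a bloom false positive, which is a simplification.

```python
# Back-of-envelope arithmetic for the figures quoted above.
# Assumption: every Aurora point lookup comes from a bloom false positive.
# All function names here are hypothetical helpers, not part of any library.

def implied_filter_checks_per_sec(db_lookups_per_sec: float, fp_rate: float) -> float:
    """Filter checks per second needed to produce the observed DB lookups."""
    return db_lookups_per_sec / fp_rate


def db_lookups_per_sec(filter_checks_per_sec: float, fp_rate: float) -> float:
    """DB lookups per second that leak through at a given false-positive rate."""
    return filter_checks_per_sec * fp_rate


def queries_in_flight(lookups_per_sec: float, latency_ms: float) -> float:
    """Little's law: average concurrency = arrival rate * time in system."""
    return lookups_per_sec * latency_ms / 1000


if __name__ == "__main__":
    checks = implied_filter_checks_per_sec(80_000, 0.12)
    print(f"implied filter checks/s: {checks:,.0f}")
    print(f"DB lookups/s at 1% FP:   {db_lookups_per_sec(checks, 0.01):,.0f}")
    print(f"queries in flight:       {queries_in_flight(80_000, 30):,.0f}")
```

Under that assumption, the same traffic at the designed 1 % rate would have sent roughly a twelfth of the lookups to Aurora, and 80k lookups per second at 30 ms each means about 2,400 queries in flight at any moment.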

What We Tried First (And Why It Failed)

We started with Redis 7.2 in cluster mode: three shards, replication factor 2, and a global session prefix. The bloom filter came from the RedisBloom v2.4.4 module, configured with two parameters: capacity 10M and error rate 0.01. We set the memory limit to 12 GiB per shard and assumed a TTL of 5 minutes, because 5 minutes felt long enough for a hunt session. The first load test with 20k concurrent players showed a 150 ms p99, but we didn't simulate session churn. On the third day of internal QA the bloom hit 92 % memory usage in 45 minutes and the cluster was OOM-killed. We switched to LFU eviction, then to a tiered setup with a 1 GiB hotset and a 10 GiB overflow. Hotset eviction still nuked the bloom filter and triggered regeneration on every request, and the cycle repeated. On the live day we saw the false positives spike and Aurora melt. We rolled back to a flat Redis hash (string type, value size 2.0 KiB) and set the TTL to 30 s. The p99 latency dropped back to 220 ms, CPU on the instances fell to 18 %, and cost per player stayed inside budget. Lesson learned: bloom filters and high-churn sessions are a toxic mix when your SLA is measured in milliseconds.
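
For a feel of how sensitive a bloom filter is to its sizing assumptions, here is a minimal sketch using the textbook formulas for a fixed-size filter with the capacity (10M) and error rate (0.01) quoted above. It is not a model of RedisBloom's internals, which can stack additional sub-filters as they fill, and it does not claim to reproduce the exact mechanism behind our spike. The function names are illustrative.

```python
import math

# Textbook fixed-size bloom filter maths. This is NOT RedisBloom's
# internal sizing logic; it only illustrates the general relationship
# between filter size, item count, and false-positive rate.


def bloom_size_bits(capacity: int, error_rate: float) -> int:
    """Bits needed to hold `capacity` items at the target error rate."""
    return math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))


def optimal_hashes(size_bits: int, capacity: int) -> int:
    """Number of hash functions that minimises false positives."""
    return max(1, round(size_bits / capacity * math.log(2)))


def false_positive_rate(size_bits: int, hashes: int, inserted: int) -> float:
    """Expected false-positive rate after `inserted` distinct items."""
    return (1.0 - math.exp(-hashes * inserted / size_bits)) ** hashes


if __name__ == "__main__":
    capacity, target = 10_000_000, 0.01
    m = bloom_size_bits(capacity, target)
    k = optimal_hashes(m, capacity)
    print(f"bits={m:,} (~{m / 8 / 2**20:.1f} MiB), hashes={k}")
    for n in (5_000_000, 10_000_000, 15_000_000, 20_000_000):
        print(f"{n:>11,} items -> FP rate {false_positive_rate(m, k, n):.2%}")
```

In this simplified model the filter holds its 1 % target only up to the sized capacity, and the false-positive rate climbs past 12 % before the item count reaches twice that capacity. Anything that makes the filter see more distinct keys than it was sized for, such as high session churn, degrades it quickly.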

Related reading

Why Your Treasure Hunt Engine Kept Crashing at 1.2M Concurrent Connections

The Day the Treasure Hunt Engine Buried Itself Alive

Treasure Hunt Engine Was a Disaster Waiting to Happen: A Tale of Unchecked…

How We Blew Up Our Event Pipeline at 3 AM Because the Treasure Hunt Engine Had…

The Treasure Hunt Engine That Broke Before the Traffic Did

The Moment We Realized Our Treasure Hunt Engine Was Lying to Us
