How We Blew Up Our Event Pipeline at 3 AM Because the Treasure Hunt Engine Had No Clear Operator Bounds

The Problem We Were Actually Solving

We built the Treasure Hunt Engine to power in-game treasure hunts that reward players for exploring content. At 100,000 active players it felt fast. At 500,000 it started to stutter. The operator documentation told teams to set max_reindexing_concurrency = 2 and warned against full table scans. No one listened once the game grew faster than the docs.

The real problem wasn't the engine's speed; it was the lack of an explicit operator boundary. We had implicit rules (don't reindex during peak hours, don't run two jobs on the same shard) but no enforced policy. When the shard leader's disk filled up at 02:17, Prometheus fired an alert, and the on-call operator panicked and set max_reindexing_concurrency = 4 to speed up cleanup. That single command turned an idling indexer into a four-headed hydra that vacuumed every row in the events table.
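To make the difference between an implicit rule and an enforced one concrete, here is a minimal Python sketch of a bounds check that runs before any reindex request is accepted. Everything in it is an assumption for illustration: the OperatorBounds class, the check_reindex_request function, and the peak-hour window are hypothetical and are not part of the engine. Only the concurrency limit of 2 comes from the operator documentation.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class OperatorBounds:
    # Limit of 2 is taken from the operator docs; the peak window is an
    # invented placeholder, not a real value from our deployment.
    max_reindexing_concurrency: int = 2
    peak_start: time = time(17, 0)
    peak_end: time = time(23, 0)


class BoundsViolation(Exception):
    """Raised when a request falls outside the operator bounds."""


def check_reindex_request(bounds, requested_concurrency, now, shard, busy_shards):
    """Reject a reindex request that breaks any of the three rules.

    `now` is a datetime.time, `shard` is a shard name, and `busy_shards`
    is the set of shards that already have a job running.
    """
    if requested_concurrency > bounds.max_reindexing_concurrency:
        raise BoundsViolation(
            f"concurrency {requested_concurrency} exceeds limit "
            f"{bounds.max_reindexing_concurrency}"
        )
    if bounds.peak_start <= now < bounds.peak_end:
        raise BoundsViolation("reindexing is not allowed during peak hours")
    if shard in busy_shards:
        raise BoundsViolation(f"shard {shard!r} already has a job running")


if __name__ == "__main__":
    bounds = OperatorBounds()
    try:
        # The 02:17 request: off-peak, free shard, but concurrency 4.
        check_reindex_request(bounds, 4, time(2, 17), "shard-a", set())
    except BoundsViolation as err:
        print(f"rejected: {err}")
```

With a check like this in the path, raising the concurrency to 4 would have been refused outright rather than left to the judgment of whoever was on call.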

What We Tried First (And Why It Failed)

Our first attempt was a simple feature flag: reindexing_enabled = false. We shipped it behind LaunchDarkly and let game masters flip the switch during maintenance windows. In practice, people forgot. One GM set it to false, recompiled the engine config, and redeployed, only to discover that the flag was not evaluated until a full 60 seconds after the pod came up. Until then the old config was still active, so the system reindexed anyway and locked the table for a full 9 minutes while the new pod waited to read the new flag value.
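One way to close a startup window like this is to make the gate fail closed: until a flag value has actually been loaded, treat reindexing as disabled. The sketch below illustrates that idea in plain Python. FailClosedFlag and maybe_reindex are hypothetical names, and fetching the value from the flag service is deliberately left out; this is not how our engine or the LaunchDarkly SDK is implemented.

```python
import threading


class FailClosedFlag:
    """Gate that reports 'disabled' until a flag value has been loaded."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None  # None means "not evaluated yet"

    def update(self, enabled):
        """Record the latest value fetched from the flag service."""
        with self._lock:
            self._value = bool(enabled)

    def is_enabled(self):
        """True only if a value was loaded and that value is True."""
        with self._lock:
            return self._value is True


def maybe_reindex(flag, reindex):
    """Run `reindex` only when the flag is known to be enabled."""
    if not flag.is_enabled():
        return False
    reindex()
    return True


if __name__ == "__main__":
    flag = FailClosedFlag()
    runs = []

    # Pod has just started and the flag has not been read: nothing runs.
    print(maybe_reindex(flag, lambda: runs.append("reindex")))

    # Flag service says reindexing_enabled = false: still nothing runs.
    flag.update(False)
    print(maybe_reindex(flag, lambda: runs.append("reindex")))
```

The trade-off is that a slow or unreachable flag service blocks legitimate reindexing too, which for a job that can lock the events table is arguably the safer failure mode.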
