The Problem We Were Actually Solving
We had built the Treasure Hunt Engine to power in-game treasure hunts that reward players for exploring content. At 100 000 active players it felt fast. At 500 000 it started to stutter. The operator documentation told teams to set max_reindexing_concurrency = 2 and warned against full table scans. No one listened when the game grew faster than the docs.
The real problem wasn't the engine's speed; it was the lack of an explicit operator boundary. We had implicit rules (don't reindex during peak hours, don't run two jobs on the same shard) but no enforced policy. When the shard leader's disk filled up at 02:17, Prometheus fired an alert, and the on-call operator panicked and set max_reindexing_concurrency = 4 to speed up cleanup. That single command turned an idling indexer into a four-headed hydra that vacuumed every row in the events table.
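To make "enforced policy" concrete, here is a minimal sketch of what turning those implicit rules into a hard check could look like. It is an illustration, not the engine's actual code: the names ReindexRequest, PolicyViolation, and check_reindex_policy are hypothetical, and the peak window (17:00 to 23:00) is an assumed placeholder. Only the concurrency limit of 2 comes from the operator documentation.

```python
from dataclasses import dataclass
from datetime import time

# Limit taken from the operator documentation.
MAX_REINDEXING_CONCURRENCY = 2

# Assumed peak window; the real hours are not stated anywhere in this post.
PEAK_START = time(17, 0)
PEAK_END = time(23, 0)


class PolicyViolation(Exception):
    """Raised when a reindex request breaks an operator rule."""


@dataclass(frozen=True)
class ReindexRequest:
    shard: str
    concurrency: int
    at: time  # wall-clock time the job would start


def check_reindex_policy(request: ReindexRequest, busy_shards: set[str]) -> None:
    """Reject the request instead of trusting the operator to remember the rules."""
    if request.concurrency > MAX_REINDEXING_CONCURRENCY:
        raise PolicyViolation(
            f"concurrency {request.concurrency} exceeds limit "
            f"{MAX_REINDEXING_CONCURRENCY}"
        )
    if PEAK_START <= request.at < PEAK_END:
        raise PolicyViolation("reindexing is not allowed during peak hours")
    if request.shard in busy_shards:
        raise PolicyViolation(f"shard {request.shard!r} already has a job running")
```

Under this sketch, the 02:17 change to a concurrency of 4 would have been refused at the boundary rather than applied, regardless of how stressed the operator was.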
What We Tried First (And Why It Failed)
Our first attempt was a simple feature flag: reindexing_enabled = false. We shipped it behind LaunchDarkly and let game masters flip the switch during maintenance windows. In practice, people forgot. One GM set it to false, recompiled the engine config, and redeployed, only to discover that the flag was first evaluated a full 60 seconds after the pod came up. Until then the old config was still active, so the system reindexed anyway and locked the table for a full 9 minutes while the new pod waited to read the new flag value.
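The startup window is easier to see as a tiny model. The sketch below is a simplification under stated assumptions: reindexing_allowed and its parameters are hypothetical names, not part of the engine or of the LaunchDarkly SDK, and the fail_closed option is one possible mitigation shown for contrast, not something we had at the time. The 60-second delay is the figure from the incident.

```python
# Delay between pod start and the first flag evaluation, as observed.
FLAG_EVAL_DELAY_S = 60


def reindexing_allowed(
    uptime_s: float,
    flag_value: bool,
    stale_config_value: bool,
    fail_closed: bool = False,
) -> bool:
    """Model which setting governs reindexing at a given pod uptime.

    Before the first flag evaluation the pod cannot know the flag value.
    It either falls back to the old config (what bit us) or, if
    fail_closed is set, refuses to reindex until the flag is known.
    """
    if uptime_s >= FLAG_EVAL_DELAY_S:
        return flag_value
    if fail_closed:
        return False
    return stale_config_value
```

With the flag set to false and the old config still permitting reindexing, reindexing_allowed(10, False, True) returns True: ten seconds after startup the pod still honours the stale value. The same call with fail_closed=True returns False.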






