The Problem We Were Actually Solving
The Treasure Hunt Engine is a multiplayer game where players dig for virtual gems, craft tools, and compete on leaderboards. Every action produces an event: dig_started, gem_found, score_updated, inventory_cleared. We needed a system that could ingest, deduplicate, and propagate these events to every player in under 200 ms while guaranteeing no double-counting of rare gems.
Our first attempt used Kafka as a raw log, with a downstream Flink job for deduplication and scoring. The contract was simple: events arrive, Flink groups by player_id and action_type within a 5-second window, then publishes to Redis streams for fan-out. Flink also wrote every deduplicated event back to Kafka under a processed topic for replay.
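To make the deduplication contract concrete, here is a minimal plain-Python sketch of it. It is not Flink code and not our actual job. It assumes fixed, non-overlapping 5-second windows and a hypothetical timestamp_ms field on each event, neither of which is specified above.

```python
from dataclasses import dataclass

WINDOW_MS = 5_000  # the 5-second window from the text


@dataclass(frozen=True)
class Event:
    player_id: str
    action_type: str
    timestamp_ms: int  # hypothetical field: event time in milliseconds


def dedupe(events, window_ms=WINDOW_MS):
    """Keep the first event per (player_id, action_type) in each fixed window."""
    seen = set()
    kept = []
    for event in sorted(events, key=lambda e: e.timestamp_ms):
        # Assumption: fixed windows aligned to multiples of window_ms.
        window_index = event.timestamp_ms // window_ms
        key = (event.player_id, event.action_type, window_index)
        if key not in seen:
            seen.add(key)
            kept.append(event)
    return kept


events = [
    Event("p1", "gem_found", 1_000),  # first in window 0: kept
    Event("p1", "gem_found", 1_200),  # same key, window 0: dropped
    Event("p1", "gem_found", 4_900),  # same key, window 0: dropped
    Event("p1", "gem_found", 5_100),  # window 1: kept
    Event("p2", "gem_found", 1_100),  # different player: kept
]
deduped = dedupe(events)
print([(e.player_id, e.timestamp_ms) for e in deduped])
```

Under the fixed-window assumption, the copy at 5,100 ms is kept even though it arrives only 200 ms after the copy at 4,900 ms, because the two fall on opposite sides of a window boundary.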
The first latency spike came during the Black Friday gem drop, when 300,000 players clicked dig simultaneously. Flink's backpressure alarms fired within 90 seconds. The real-time score dashboard showed 1,847 duplicate gem_found events in the first minute, meaning Flink's deduplication window was either too narrow or too slow. We widened the window to 10 seconds and pushed the Flink autoscaler to 40 TaskManagers. The backpressure subsided, but the duplicate count stabilized at 1,102 per minute, which was still unusable. The same error kept repeating in our logs: FlinkException: Failed to commit checkpoint within 60000 ms. We discovered Flink was writing 1.2 TB of state to S3 every minute for exactly those 10-second windows. The cost of stronger consistency was turning into bill shock.
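To put the checkpoint volume in perspective, the figures above can be turned into sustained write rates. This is back-of-the-envelope arithmetic only. It assumes decimal units (1 TB = 10^12 bytes), a steady rate, and state spread evenly across the 40 TaskManagers, none of which we measured directly.

```python
TB = 10**12
GB = 10**9
MB = 10**6

state_bytes_per_minute = 12 * TB // 10  # 1.2 TB per minute, from the text
task_managers = 40                      # from the text

# Sustained aggregate write rate to S3.
bytes_per_second = state_bytes_per_minute // 60

# Assumption: state is spread evenly across TaskManagers.
per_task_manager = bytes_per_second // task_managers

# Volume if the same rate held for a full day.
bytes_per_day = state_bytes_per_minute * 60 * 24

print(f"aggregate: {bytes_per_second / GB:.0f} GB/s")            # aggregate: 20 GB/s
print(f"per TaskManager: {per_task_manager / MB:.0f} MB/s")      # per TaskManager: 500 MB/s
print(f"per day at this rate: {bytes_per_day / TB:.0f} TB")      # per day at this rate: 1728 TB
```

At roughly 20 GB/s in aggregate, a checkpoint has to move a great deal of data before the 60000 ms timeout expires, which is consistent with the commit failures in the logs.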






