The Operator's Regret: How We Blew Up the Event Bus at 3 AM

The Problem We Were Actually Solving

At 02:47 the Redis counters began to drift by as much as 18%. Players who had just spent 300 gold on a dig turned around and screamed on Discord that the server had stolen their loot. We had a classic symptom: event loss.

Our original topology was Kafka → Kafka Streams → Redis. It looked good on the whiteboard. We had 6 brokers, 240 partitions, a replication factor of 3, and acks=all on the producer. The logs said everything was healthy: lag was near zero and broker CPU was idling at 20%. The problem wasn't latency; it was missing messages.

The first clue came from the kafka-consumer-groups command. One consumer group, leaderboard-rebuild, showed a lag of 4.2 million messages. The Streams application had fallen behind: it was pausing for GC every 30 seconds, and the RocksDB state store couldn't keep up. By the time it caught up, players had already opened another chest and the old events were gone.
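
For readers unfamiliar with the lag figure, here is a minimal sketch of how it is derived: per partition, lag is the log end offset minus the group's committed offset, and the group total is the sum. The function name and the sample offsets are hypothetical, and treating a partition with no committed offset as starting from 0 is a simplification made for this example.

```python
from typing import Dict


def total_lag(log_end_offsets: Dict[int, int],
              committed_offsets: Dict[int, int]) -> int:
    """Sum per-partition lag (log end offset minus committed offset)."""
    lag = 0
    for partition, end_offset in log_end_offsets.items():
        # Assumption for this sketch: a partition with no commit counts from 0.
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero so a stale end-offset reading never yields negative lag.
        lag += max(end_offset - committed, 0)
    return lag


# Hypothetical offsets for three partitions of one consumer group.
end = {0: 1_000, 1: 2_500, 2: 400}
committed = {0: 1_000, 1: 1_900, 2: 150}
print(total_lag(end, committed))
```

A total that keeps growing between two readings means the consumer is processing more slowly than producers are writing, which is the situation described above.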

We needed exactly-once semantics across three systems: Kafka, Kafka Streams, and Redis. That's not the same as at-least-once; it's exactly-once with external side effects. The docs don't cover that.
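
To make "exactly-once with external side effects" concrete, here is a sketch of one common technique: accept at-least-once delivery and make the write to the external store idempotent by recording each event ID. This is an illustration, not necessarily the fix we shipped. The class, method, and key names are hypothetical, and the in-memory object only stands in for Redis. Against a real Redis instance the check and the increment would need to run atomically, for example inside a server-side script.

```python
from typing import Dict, Set


class InMemorySink:
    """Stand-in for Redis: counters plus the set of event IDs already applied."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = {}
        self.applied: Set[str] = set()

    def apply_once(self, event_id: str, key: str, delta: int) -> bool:
        """Apply delta to key unless event_id was seen before.

        Returns True if the event changed state, False if it was a duplicate.
        """
        if event_id in self.applied:
            return False  # redelivered event: skip the side effect
        self.applied.add(event_id)
        self.counters[key] = self.counters.get(key, 0) + delta
        return True


sink = InMemorySink()
# A player spends 300 gold; the same event is then redelivered after a retry.
sink.apply_once("evt-1", "gold:player-42", -300)
sink.apply_once("evt-1", "gold:player-42", -300)
print(sink.counters["gold:player-42"])
```

The trade-off in this design is storage: the set of applied IDs grows with traffic, so a real deployment would need some expiry or compaction policy for it.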

Related reading

The Veltrix Event Engine Blew Up Because We Trusted the Defaults

The Day We Realized Events Were the Bottleneck (And Why We Moved to Rust)

The Event Store That Survived Black Friday Without a Single 5xx

How We Stopped Losing 45 Minutes Every Time Production Broke

How We Blew Up Our Event Pipeline at 3 AM Because the Treasure Hunt Engine Had…

Treasure Hunt Engine: The Day We Realized the Event Bus Was Our Constraint