The Problem We Were Actually Solving
At 02:47 the Redis counters began to drift by as much as 18 %. Players who had just spent 300 gold on a dig turned around and screamed on Discord that the server had stolen their loot. We had a classic symptom: event loss.
Our original topology was Kafka → Kafka Streams → Redis. It looked good on the whiteboard. We had 6 brokers, 240 partitions, a replication factor of 3, and acks=all on the producer. The logs said everything was healthy: lag was near zero and broker CPU was idling at 20 %. The problem wasn't latency; it was missing messages.
The first clue came from the kafka-consumer-groups command. One consumer group, leaderboard-rebuild, showed a lag of 4.2 million messages. The Streams application had fallen behind: it was pausing for GC every 30 seconds, and the RocksDB state store couldn't keep up. By the time it caught up, players had already opened another chest and the old events were gone.
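For readers who have not stared at that output before: `kafka-consumer-groups --describe` reports a CURRENT-OFFSET, a LOG-END-OFFSET, and a LAG for each partition, and the group's total lag is the sum of the per-partition differences. The sketch below reproduces that arithmetic in plain Python. The three partitions and their offsets are made-up numbers chosen to add up to 4.2 million; they are not taken from our cluster, which had 240 partitions.

```python
from typing import Dict


def total_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> int:
    """Sum per-partition lag: log end offset minus committed offset."""
    lag = 0
    for partition, end in end_offsets.items():
        # Assumption for this sketch: a partition with no committed offset
        # is counted from offset 0. Real tooling depends on the reset policy.
        current = committed.get(partition, 0)
        # Clamp at zero so a stale end offset never yields negative lag.
        lag += max(end - current, 0)
    return lag


# Hypothetical offsets for three partitions (illustrative only).
end_offsets = {0: 3_000_000, 1: 2_500_000, 2: 2_000_000}
committed = {0: 1_600_000, 1: 1_100_000, 2: 600_000}

group_lag = total_lag(end_offsets, committed)
print(f"leaderboard-rebuild total lag: {group_lag}")
```

A lag that keeps growing while the brokers look idle points at the consumer side, which is where we ended up looking.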
We needed exactly-once semantics across three systems: Kafka, Kafka Streams, and Redis. That's not the same as at-least-once; it's exactly-once with external side effects, because a write to Redis happens outside Kafka's transactions. The docs don't cover that.
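To make the distinction concrete, here is a minimal sketch of what goes wrong when an at-least-once consumer applies a non-idempotent side effect. A plain dict stands in for the Redis counters, and the event IDs, player name, and gold amounts are hypothetical. The deduplicated variant is shown only for contrast; it is one common way to get an idempotent sink, not necessarily the design we settled on.

```python
from typing import Dict, List, Set, Tuple

# (event_id, player, gold_delta); all values here are illustrative.
Event = Tuple[str, str, int]


def apply_naive(store: Dict[str, int], events: List[Event]) -> None:
    """Apply every delivered event, like a bare counter increment."""
    for _event_id, player, delta in events:
        store[player] = store.get(player, 0) + delta


def apply_deduped(store: Dict[str, int], seen: Set[str], events: List[Event]) -> None:
    """Skip events whose ID was already applied, making the write idempotent."""
    for event_id, player, delta in events:
        if event_id in seen:
            continue  # redelivered event: the side effect already happened
        seen.add(event_id)
        store[player] = store.get(player, 0) + delta


batch: List[Event] = [("e1", "alice", -300), ("e2", "alice", 120)]
# Simulate a crash after e2 was written but before its offset was committed:
# on restart the consumer receives e2 again.
delivered = batch + [batch[1]]

naive: Dict[str, int] = {}
apply_naive(naive, delivered)

deduped: Dict[str, int] = {}
apply_deduped(deduped, set(), delivered)

print(naive["alice"], deduped["alice"])
```

In the naive path the redelivered event is counted twice, so the balance drifts. With a real Redis the membership check and the increment would also have to happen atomically, for example inside a Lua script or a MULTI/EXEC transaction, or the same race reappears.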






