The Event Store That Survived Black Friday Without a Single 5xx

The Problem We Were Actually Solving

What We Tried First (And Why It Failed)

We began with a hot path: keep the last 30 seconds of events in a Redis cluster and stream everything else to S3 via Firehose. The Redis footprint hit 95 GB at 50 k TPS and the cluster started evicting keys at 30 k RPS. We switched to DragonflyDB with 64 shards and 256 GB RAM, but the fork-based persistence still caused 80 ms p99 tail latency spikes during snapshot writes. Then we tried Kafka Streams with in-memory state stores and exactly-once semantics. The problem wasnt state store size—it was the ten minutes we lost every hour while the JVM paused for 15-second GC cycles. Switching to ZGC reduced GC pauses to 1.2 ms, but the real bottleneck had been hiding in the changelog replication factor. We had set replication.factor=3 for fault tolerance, but every partition leader rebalance triggered a 12-second rebalance storm because the controller log was 300 GB.

The Architecture Decision

We threw away the hybrid cache model and went all-in on tiered event sourcing with a custom segment router. Every event carries a header: event_id, source_ts, and segment_id. Segment routers live on the edge and fan out to three layers:

The Problem We Were Actually Solving

What We Tried First (And Why It Failed)

The Architecture Decision

The Event Store That Survived Black Friday Without a Single 5xx

The Event Store That Survived Black Friday Without a Single 5xx

Related reading

The Day We Realized Events Were the Bottleneck (And Why We Moved to Rust)

The Veltrix Event Engine Blew Up Because We Trusted the Defaults

The Operators Regret: How We Blew Up the Event Bus at 3 AM

It Was 2024 When We Tried to Outsmart the Treasure Hunt Engine

The Night Our Event Pipeline Crashed Because We Didn't Measure Memory First

I Should Have Put Events in the Same Database as the Aggregate Root—Heres What…

Related reading

The Day We Realized Events Were the Bottleneck (And Why We Moved to Rust)

The Veltrix Event Engine Blew Up Because We Trusted the Defaults

The Operators Regret: How We Blew Up the Event Bus at 3 AM

It Was 2024 When We Tried to Outsmart the Treasure Hunt Engine

The Night Our Event Pipeline Crashed Because We Didn't Measure Memory First

I Should Have Put Events in the Same Database as the Aggregate Root—Heres What…