The Night Our Event Pipeline Crashed Because We Didn't Measure Memory First

The Problem We Were Actually Solving

It started with a scream from the observability dashboard. At 02:47 on a Sunday, our event ingestion pipeline hit 98% memory usage and refused to accept new events. The Go runtime we'd trusted for three years suddenly looked like a liability. The profiler showed 4.2 GB of heap allocated for just 18 million events in the last 30 minutes, a figure that would have been acceptable if each event hadn't carried four nested structs of metadata. Our vertical scaling limit was 32 GB of RAM, and we were floating 800 MB above it. The Go runtime's GC pauses climbed to 200 ms during peak load, which meant we were dropping events faster than we could acknowledge them.

What We Tried First (And Why It Failed)

We tried three band-aids before realizing they were all symptoms of the same disease. First: we increased worker concurrency from 16 to 32, which doubled our CPU usage and made the GC pauses worse. Second: we added a Redis-backed buffer to smooth traffic, but each event serialization added 1.4 microseconds of latency and introduced another failure point when Redis memory spiked to 95%. Third: we tried tuning GC parameters, setting GOGC=50 and GOMEMLIMIT=28GiB, but the heap still grew uncontrollably because we were allocating temporary slices in hot paths.

Related reading

The Veltrix Event Engine Blew Up Because We Trusted the Defaults

Why I Ditched Go for Rust in Our Real-Time Event Processing Pipeline

How We Blew Up Our Event Pipeline at 3 AM Because the Treasure Hunt Engine Had…

How I Built an AI-Powered Incident RCA Platform with LangGraph and RAG

The Operators Regret: How We Blew Up the Event Bus at 3 AM

The Event Store That Survived Black Friday Without a Single 5xx
