The Problem We Were Actually Solving
It started with a scream from the observability dashboard. At 02:47 on a Sunday, our event ingestion pipeline hit 98% memory usage and refused to accept new events. The Go runtime we'd trusted for three years suddenly looked like a liability. The profiler showed 4.2 GB of heap allocated for just 18 million events in the last 30 minutes, numbers that would have been acceptable if each event didn't carry four nested structs of metadata. Our vertical scaling limit was 32 GB of RAM, and we were floating 800 MB above it. The Go runtime's GC pauses climbed to 200 ms during peak load, which meant we were dropping events faster than we could acknowledge them.
What We Tried First (And Why It Failed)
We tried three band-aids before realizing they all treated symptoms of the same disease. First, we increased worker concurrency from 16 to 32, which doubled our CPU usage and made the GC pauses worse. Second, we added a Redis-backed buffer to smooth traffic, but each event serialization added 1.4 microseconds of latency, and Redis became another failure point when its memory spiked to 95%. Third, we tuned the GC, setting GOGC=50 and GOMEMLIMIT=28GiB, but the heap still grew uncontrollably because we were allocating temporary slices in hot paths. GC settings control when the collector runs, not how much garbage the code produces.






