The Problem We Were Actually Solving
Our treasure-hunt engine, running on Go 1.21 and a three-tier microservice stack, was supposed to scale to 50,000 concurrent connections with sub-50 ms p99 latency. We had engineered around every other obvious constraint: connection pooling, sharded Redis clusters with write-behind caching, and a bespoke lock-free ring buffer for the move stream. Yet every Friday night, when North American players came online, the GC would kick in and jitter would spike above 80 ms. pprof showed 38% of wall time inside the sweep phase and 12% inside mark termination. We measured the heap at 7.6 GB per instance, even though live objects accounted for only 1.4 GB. The rest was fragmented or pinned.
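The gap between heap size and live data can be read straight from the runtime. The following is a minimal sketch, not the tooling we actually ran: the HeapSnapshot type and its method names are made up for illustration, while runtime.ReadMemStats and the MemStats fields are standard library. Note that HeapAlloc also counts garbage that has not been swept yet, so it overstates truly live data between cycles.

```go
package main

import (
	"fmt"
	"runtime"
)

// HeapSnapshot (hypothetical name) keeps the MemStats fields that matter
// when comparing object bytes against what the heap holds from the OS.
type HeapSnapshot struct {
	AllocBytes    uint64 // HeapAlloc: reachable objects plus garbage not yet freed
	InUseBytes    uint64 // HeapInuse: spans that contain at least one object
	IdleBytes     uint64 // HeapIdle: spans that contain no objects
	ReleasedBytes uint64 // HeapReleased: idle memory already returned to the OS
	SysBytes      uint64 // HeapSys: heap memory obtained from the OS
}

// TakeHeapSnapshot reads the current heap counters from the runtime.
// ReadMemStats briefly stops the world, so avoid calling it in a hot path.
func TakeHeapSnapshot() HeapSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return HeapSnapshot{
		AllocBytes:    m.HeapAlloc,
		InUseBytes:    m.HeapInuse,
		IdleBytes:     m.HeapIdle,
		ReleasedBytes: m.HeapReleased,
		SysBytes:      m.HeapSys,
	}
}

// SpanSlackBytes estimates memory dedicated to size classes but not
// currently occupied by objects (a rough fragmentation signal).
func (s HeapSnapshot) SpanSlackBytes() uint64 {
	if s.InUseBytes < s.AllocBytes {
		return 0
	}
	return s.InUseBytes - s.AllocBytes
}

// RetainedIdleBytes estimates idle memory the runtime is holding on to
// instead of returning it to the OS.
func (s HeapSnapshot) RetainedIdleBytes() uint64 {
	if s.IdleBytes < s.ReleasedBytes {
		return 0
	}
	return s.IdleBytes - s.ReleasedBytes
}

func main() {
	s := TakeHeapSnapshot()
	fmt.Printf("alloc=%d inuse=%d sys=%d\n", s.AllocBytes, s.InUseBytes, s.SysBytes)
	fmt.Printf("span slack=%d retained idle=%d\n", s.SpanSlackBytes(), s.RetainedIdleBytes())
}
```

Logging these two derived numbers alongside RSS during a spike is one way to tell fragmentation inside in-use spans apart from idle memory the runtime has simply not released yet.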
Worse, the Go runtime exposes only two GC pacing knobs, GOGC and GOMEMLIMIT, and neither bounds pause time directly. We tried GOGC=25, GOGC=10, even GOMEMLIMIT=4GiB, but we still saw the sweeper's work arrive in bursts that looked like stop-the-world stalls. The allocator on top of Go's page heap (the runtime's own, not jemalloc) was coalescing small arenas so aggressively that the allocation latency histogram developed a fat tail beyond 2 ms per allocation.
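To see what those GOGC values ask of the collector, it helps to work out the heap goal they imply. This is a simplified sketch: HeapGoal is a hypothetical helper, and it ignores the GC roots (goroutine stacks and globals) that Go 1.18 and later add into the calculation, so treat the results as approximations.

```go
package main

import "fmt"

// HeapGoal (hypothetical helper) approximates the heap size at which the
// next GC cycle is triggered: live + live*GOGC/100.
// Simplification: GC roots are left out of the formula.
func HeapGoal(liveBytes uint64, gogc uint) uint64 {
	return liveBytes + liveBytes*uint64(gogc)/100
}

func main() {
	const live = 1_400_000_000 // the 1.4 GB of live objects quoted above

	for _, gogc := range []uint{100, 25, 10} {
		goal := HeapGoal(live, gogc)
		fmt.Printf("GOGC=%d -> heap goal of about %.2f GB\n", gogc, float64(goal)/1e9)
	}
}
```

With 1.4 GB live, GOGC=25 works out to a goal of roughly 1.75 GB and GOGC=10 to roughly 1.54 GB, so each step down leaves the collector less headroom before the next cycle starts.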
What We Tried First (And Why It Failed)
First we attacked the symptom: we tuned the GOGC knob downward and raised GOMEMLIMIT in 500 MiB increments. At GOMEMLIMIT=3.2GiB the GC frequency doubled, but pause times dropped to 22 ms. Unfortunately, heap fragmentation pushed RSS up by 22%, which forced us to shrink the shard count from 32 to 24 per AZ. That meant fewer players per cluster and more cross-AZ traffic during the daily spike.
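A knob sweep like this can be scripted rather than done by redeploying with new environment variables. The sketch below is an illustration under assumptions, not our actual harness: Trial, RunTrial, and churn are hypothetical names, and the synthetic allocation loop stands in for real traffic. debug.SetGCPercent, debug.SetMemoryLimit (Go 1.19+), and debug.ReadGCStats are standard library calls.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"time"
)

// Trial (hypothetical) is one combination of the two pacing knobs.
type Trial struct {
	GCPercent   int   // same meaning as GOGC
	MemoryLimit int64 // same meaning as GOMEMLIMIT, in bytes
}

var sink []byte // package-level so the allocations are not optimized away

// churn is a stand-in workload: it allocates short-lived 64 KiB buffers.
func churn(rounds int) {
	for i := 0; i < rounds; i++ {
		sink = make([]byte, 64<<10)
	}
}

// RunTrial applies the knobs, runs the workload, and reports how many GC
// cycles completed and the longest stop-the-world pause among them.
// The previous settings are restored before returning.
func RunTrial(t Trial, rounds int) (cycles int64, maxPause time.Duration) {
	prevPct := debug.SetGCPercent(t.GCPercent)
	prevLimit := debug.SetMemoryLimit(t.MemoryLimit)
	defer debug.SetGCPercent(prevPct)
	defer debug.SetMemoryLimit(prevLimit)

	var before, after debug.GCStats
	debug.ReadGCStats(&before)

	churn(rounds)
	runtime.GC() // guarantee at least one completed cycle to inspect

	debug.ReadGCStats(&after)
	cycles = after.NumGC - before.NumGC

	// Pause is ordered most recent first and holds a bounded history.
	n := cycles
	if n > int64(len(after.Pause)) {
		n = int64(len(after.Pause))
	}
	for _, p := range after.Pause[:n] {
		if p > maxPause {
			maxPause = p
		}
	}
	return cycles, maxPause
}

func main() {
	trials := []Trial{
		{GCPercent: 25, MemoryLimit: 4 << 30},
		{GCPercent: 10, MemoryLimit: 4 << 30},
	}
	for _, t := range trials {
		cycles, maxPause := RunTrial(t, 2000)
		fmt.Printf("GOGC=%d limit=%d: %d cycles, max pause %v\n",
			t.GCPercent, t.MemoryLimit, cycles, maxPause)
	}
}
```

The pauses reported here are the runtime's own stop-the-world measurements, which are not the same thing as request-level p99 latency; a real comparison would record both, plus RSS, for each setting.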






