Rust Was the Constraint: How We Discovered the Language Was Our Scaling Bottleneck

The Problem We Were Actually Solving

Our treasure-hunt engine, running on Go 1.21 and a three-layer microservice stack, was supposed to scale to 50,000 concurrent connections with sub-50 ms p99 latency. We had engineered around every other obvious constraint: connection pooling, sharded Redis clusters with write-behind caching, and a bespoke lock-free ring buffer for the move stream. Yet every Friday night, when North American players came online, the GC would kick in and jitter would spike above 80 ms. pprof showed 38 % of wall time inside the sweep phase and 12 % inside mark termination. We measured the heap at 7.6 GB per instance, even though live objects accounted for only 1.4 GB. The remaining 6.2 GB, more than 80 % of the heap, was fragmented or pinned.

Worse, the Go runtime exposes only coarse knobs for GC pacing. We tried GOGC=25, GOGC=10, even GOMEMLIMIT=4GiB, but GC work still arrived in bursts that showed up as pauses. The runtime's own allocator was coalescing small arenas in Go's page heap so aggressively that the allocation latency histogram developed a fat tail beyond 2 ms per allocation.

What We Tried First (And Why It Failed)

First we attacked the symptom: we tuned the GOGC knob downward and increased GOMEMLIMIT in 500 MiB increments. With the limit at about 3.2 GiB, GC frequency doubled but pause times dropped to 22 ms. Unfortunately, heap fragmentation increased RSS by 22 %, which forced us to shrink the shard count from 32 to 24 per AZ, a 25 % cut. That meant fewer players per cluster and higher cross-AZ traffic during the spike.

Related reading

The Day the Language Became the Bottleneck

Rust Was Not the Silver Bullet I Expected for Our Treasure Hunt Engine

When the Runtime Was the Wall: How Rust Broke a 50 ms SLA and Saved the Day

The Day We Realized Events Were the Bottleneck (And Why We Moved to Rust)

Why I Ditched Go for Rust in Our Real-Time Event Processing Pipeline

The Moment the Config Parser Became the Bottleneck
