The Problem We Were Actually Solving

We ran the Treasure Hunt Engine at Veltrix, our real-time game backend that serves 15 k QPS from players who expect to resolve a treasure within 50 ms or they rage-quit and refund. The performance target is hard: 99th percentile latency must stay under 50 ms end-to-end, including network marshaling, game state lookup, and leaderboard write. In December 2025 we hit a wall: the Go runtime stopped scaling past 2.4 k concurrent connections on a single c6i.4xlarge instance. We were seeing 67 ms p99s and 8 % allocator contention under load. The p99 wasn't moving no matter how many connections we sharded. Flame graphs showed 32 % of CPU time inside the scheduler's work-stealing loop; the Go GC wasn't the bottleneck yet, but the scheduler was fighting itself under high context-switch rates. The team was ready to throw threads at it, but I knew that would only deepen the queueing delay. Something deeper had to change.

What We Tried First (And Why It Failed)

We started with Go 1.22.2 and the standard net/http server, then switched to github.com/valyala/fasthttp, which cut GC pressure by 20 %, but the p99 crept up again once we crossed 3 k connections. I pulled the Linux perf data: