The Problem We Were Actually Solving
Last July we rolled out a new tier of Veltrix: real-time treasure hunts where users solve location-based puzzles in under 30 seconds. The backend is a state machine that ingests GPS pings, validates them against event geofences, and emits updated leaderboards every second. Latency had to stay below 50 ms p99; anything higher and the UI stuttered and the fun died.
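To make the geofence validation step concrete, here is a minimal sketch in Go. It assumes circular geofences and a haversine distance check. The post does not say what shape the real fences have, so the `Ping` and `Geofence` types, their field names, and the coordinates are all hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// earthRadiusM is the mean Earth radius in meters.
const earthRadiusM = 6371000.0

// Ping is a hypothetical GPS ping; the field names are assumptions.
type Ping struct {
	Lat, Lon float64 // degrees
}

// Geofence is a hypothetical circular fence around a puzzle location.
type Geofence struct {
	Center  Ping
	RadiusM float64 // meters
}

// haversineM returns the great-circle distance between two points in meters.
func haversineM(a, b Ping) float64 {
	toRad := func(deg float64) float64 { return deg * math.Pi / 180 }
	dLat := toRad(b.Lat - a.Lat)
	dLon := toRad(b.Lon - a.Lon)
	h := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(toRad(a.Lat))*math.Cos(toRad(b.Lat))*
			math.Sin(dLon/2)*math.Sin(dLon/2)
	return 2 * earthRadiusM * math.Asin(math.Sqrt(h))
}

// Contains reports whether the ping falls inside the fence.
func (g Geofence) Contains(p Ping) bool {
	return haversineM(g.Center, p) <= g.RadiusM
}

func main() {
	fence := Geofence{Center: Ping{Lat: 52.5200, Lon: 13.4050}, RadiusM: 50}
	// A ping a few meters from the center.
	fmt.Println(fence.Contains(Ping{Lat: 52.5201, Lon: 13.4051}))
	// A ping roughly a kilometer to the north.
	fmt.Println(fence.Contains(Ping{Lat: 52.5300, Lon: 13.4050}))
}
```

A check like this allocates nothing per ping, so on its own it would not be expected to add GC pressure. Polygon fences or a spatial index would change that picture.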
We'd built the first version in Go because that's what most of Veltrix used. The service handled 8k rps on three c6g.large nodes, but the p99 tail was creeping up to 82 ms. Profiling with go tool pprof showed the GC was stopping the world for 12 ms every ~200 ms. That 12 ms pause, combined with a single slow neighbor, put us 16 ms over budget.
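For readers who want to check GC pauses from inside a process, here is a small sketch using the standard library's runtime/debug.ReadGCStats. It is not how the numbers above were gathered (those came from profiling). The helper names `gcPauseSummary` and `churn` are hypothetical, and the allocation loop exists only to give the collector something to do.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"time"
)

// sink keeps allocations reachable so the compiler cannot elide them.
var sink [][]byte

// churn allocates some garbage and forces a collection so that the
// runtime has at least one completed GC cycle to report.
func churn() {
	for i := 0; i < 1000; i++ {
		sink = append(sink, make([]byte, 64<<10))
	}
	sink = nil
	runtime.GC()
}

// gcPauseSummary returns the number of completed GC cycles and the
// longest pause in the runtime's recorded pause history.
func gcPauseSummary() (numGC int64, worst time.Duration) {
	var stats debug.GCStats
	// With five slots, ReadGCStats fills in the minimum, 25%, 50%, 75%,
	// and maximum pause times.
	stats.PauseQuantiles = make([]time.Duration, 5)
	debug.ReadGCStats(&stats)
	return stats.NumGC, stats.PauseQuantiles[4]
}

func main() {
	churn()
	n, worst := gcPauseSummary()
	fmt.Printf("GC cycles: %d, worst recorded pause: %v\n", n, worst)
}
```

Logging these two values periodically from a running service gives a rough view of pause frequency and size. Running with the GODEBUG=gctrace=1 environment variable prints a per-cycle summary without any code changes.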
What We Tried First (And Why It Failed)
We tried several Go-level tweaks:






