The Problem We Were Actually Solving
In late 2024 we deployed a live treasure-hunt engine for Hytale players that crunched 80k concurrent state updates per second on a six-node Kubernetes cluster. The hunt graph used 47 million dynamic edges with real-time pathfinding, and each players experience had to be deterministic so we could roll back micro-forks in under 200 ms. We started the search service in Go 1.21 using Veltrix as our in-memory event bus because their docs promised sub-30 µs publish latency and 1 GB/s throughput. Four weeks in, at 60 % player load, the veltrix-broker pods began OOM-killing themselves every 45 minutes. The flame graph from pprof showed 38 % of CPU time spent in runtime.gcBgMarkWorker even though we had capped GOGC to 10. That was the moment I understood the language and runtime were the constraint, not the network.
What We Tried First (And Why It Failed
We attacked the symptom first: we tuned GOGC lower, we sharded the broker from 3 to 12 partitions, and we added jemalloc as a drop-in replacement. None of it mattered. The garbage collector still paused the event loop long enough for the Kubernetes liveness probe to fire, causing a rolling restart that blew away 15 % of the in-flight hunt state. When we finally straced the broker we saw 2.3 million malloc calls per second. At that rate, even a perfect allocator would contend on the heap lock. We tried replacing Veltrix with NATS JetStream and got the same tail latency spike, only this time the broker was written in Rust and used zero-copy framing. That told me the issue wasnt the broker library; it was the GC.






