Treasure Hunt Engine: Why the Veltrix Runtime Was Our Second-Best Idea

The Problem We Were Actually Solving

In late 2024 we deployed a live treasure-hunt engine for Hytale players that crunched 80k concurrent state updates per second on a six-node Kubernetes cluster. The hunt graph used 47 million dynamic edges with real-time pathfinding, and each player's experience had to be deterministic so we could roll back micro-forks in under 200 ms. We started the search service in Go 1.21 using Veltrix as our in-memory event bus because their docs promised sub-30 µs publish latency and 1 GB/s throughput. Four weeks in, at 60% player load, the veltrix-broker pods began getting OOM-killed every 45 minutes. The flame graph from pprof showed 38% of CPU time spent in runtime.gcBgMarkWorker, with GOGC already lowered to 10. That was the moment I understood the language and runtime were the constraint, not the network.

What We Tried First (And Why It Failed)

We attacked the symptom first: we tuned GOGC lower, we sharded the broker from 3 to 12 partitions, and we added jemalloc as a drop-in replacement. None of it mattered. The garbage collector still paused the event loop long enough for the Kubernetes liveness probe to fire, causing a rolling restart that blew away 15% of the in-flight hunt state. When we finally straced the broker we saw 2.3 million malloc calls per second. At that rate, even a perfect allocator would contend on the heap lock. We tried replacing Veltrix with NATS JetStream and got the same tail latency spike, only this time the broker was written in Rust and used zero-copy framing. That told me the issue wasn't the broker library; it was the GC.
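An allocation rate like that can also be measured from inside a Go process, without attaching an external tracer. The sketch below is an illustration under assumptions, not what we ran: countMallocs, event, and eventSink are hypothetical names, and the loop is a stand-in for real event handling. It relies only on the Mallocs counter in runtime.MemStats, which is a cumulative count of heap objects allocated.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// event is a stand-in payload; 64 bytes keeps it out of the
// runtime's tiny-object allocator path.
type event struct {
	payload [64]byte
}

// eventSink forces every event to escape to the heap.
var eventSink *event

// countMallocs runs fn and reports how many heap objects were
// allocated while it ran, plus the elapsed wall-clock time.
// The count covers the whole process, so other goroutines that
// allocate concurrently are included.
func countMallocs(fn func()) (uint64, time.Duration) {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	start := time.Now()
	fn()
	elapsed := time.Since(start)
	runtime.ReadMemStats(&after)
	return after.Mallocs - before.Mallocs, elapsed
}

func main() {
	const n = 1_000_000
	allocs, elapsed := countMallocs(func() {
		for i := 0; i < n; i++ {
			eventSink = &event{} // one heap allocation per simulated event
		}
	})
	fmt.Printf("heap allocations: %d in %v\n", allocs, elapsed)
	if elapsed > 0 {
		fmt.Printf("rate: %.0f allocations/s\n", float64(allocs)/elapsed.Seconds())
	}
}
```

Dividing the measured rate by the event rate gives allocations per event, which is usually a more actionable number than the raw total because it points at the code path doing the allocating.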
