The Problem We Were Actually Solving
Our treasure hunt engine at Veltrix was a real-time geospatial matching service that processed 50 million location events daily. By month six it handled bursts of 2M concurrent users during events like Black Friday flash sales. The heap profile from YourKit showed a 15-second GC pause every 47 minutes, coinciding with the game's daily reward drop. The GC logs screamed OldGen exhaustion. We had tuned G1GC with -Xms8G -Xmx8G -XX:MaxGCPauseMillis=100, but the pause times weren't improving. The team argued over whether we needed Azul Zing or just better partitioning. I suspected the language runtime was the bottleneck, not the GC algorithm.
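For readers who want to see numbers like these without attaching a profiler, here is a minimal sketch that reads per-collector counts and accumulated collection time from the standard java.lang.management API. The class name GcPauseProbe and its helper methods are hypothetical, not something from our codebase. Note that getCollectionTime() reports approximate accumulated elapsed time, which is not the same as a single stop-the-world pause, and collector names vary by JDK version.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: prints collection counts and accumulated GC time per collector.
public class GcPauseProbe {

    /** Sum of accumulated collection time (ms) across all collectors; unsupported values (-1) are skipped. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Mean milliseconds per collection, or 0 if no collections have been counted. */
    public static double meanMillisPerCollection(long count, long timeMs) {
        return count <= 0 ? 0.0 : (double) timeMs / count;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            long time = gc.getCollectionTime();
            System.out.printf("%s: %d collections, %d ms total, %.1f ms mean%n",
                    gc.getName(), count, time, meanMillisPerCollection(count, time));
        }
        System.out.println("Total GC time so far (ms): " + totalGcMillis());
    }
}
```

Polling this on a timer and diffing successive readings is one way to spot a periodic spike like the one described above, though the GC logs remain the authoritative source for individual pause durations.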
What We Tried First (And Why It Failed)
We doubled the heap to 16G and increased MaxGCPauseMillis to 200. That reduced the pause frequency but lengthened the pauses themselves: 22-second GC pauses started appearing every 70 minutes. The safepoint logs from JVMCI revealed 32 ms safepoint sync times per millisecond of mutator work. The allocation rate hit 7.2 MB/s during peak, and despite off-heap caching with Chronicle Map, the Eden space was collapsing under object churn from our spatial index rebalancing.
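An allocation rate figure like the one above can be approximated in-process. The sketch below is an illustration, not our production tooling: it assumes a HotSpot-style JVM that exposes com.sun.management.ThreadMXBean, measures only the current thread, and uses a hypothetical loop as a stand-in for index rebalancing churn. The class and method names are made up for the example.

```java
import java.lang.management.ManagementFactory;

// Hypothetical probe: estimates the allocation rate of the current thread.
public class AllocationRateProbe {

    /** Bytes allocated so far by the current thread, or -1 if the JVM does not report it. */
    public static long allocatedBytes() {
        java.lang.management.ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (bean instanceof com.sun.management.ThreadMXBean) {
            com.sun.management.ThreadMXBean hs = (com.sun.management.ThreadMXBean) bean;
            if (hs.isThreadAllocatedMemorySupported() && hs.isThreadAllocatedMemoryEnabled()) {
                return hs.getThreadAllocatedBytes(Thread.currentThread().getId());
            }
        }
        return -1;
    }

    /** Converts a byte count over an elapsed time in nanoseconds to MB/s (1 MB = 1,000,000 bytes). */
    public static double megabytesPerSecond(long bytes, long nanos) {
        if (nanos <= 0) {
            return 0.0;
        }
        return (bytes / 1_000_000.0) / (nanos / 1_000_000_000.0);
    }

    public static void main(String[] args) {
        long startBytes = allocatedBytes();
        long startNanos = System.nanoTime();

        // Hypothetical workload: short-lived coordinate pairs, standing in for index churn.
        long sink = 0;
        for (int i = 0; i < 1_000_000; i++) {
            double[] point = new double[] { i * 0.001, i * 0.002 };
            sink += (long) point[0];
        }

        long endBytes = allocatedBytes();
        long nanos = System.nanoTime() - startNanos;
        if (startBytes < 0 || endBytes < 0) {
            System.out.println("Per-thread allocation tracking is not available on this JVM.");
        } else {
            System.out.printf("Allocated about %.2f MB/s on this thread (sink=%d)%n",
                    megabytesPerSecond(endBytes - startBytes, nanos), sink);
        }
    }
}
```

The JIT may eliminate some of these allocations through escape analysis, so the printed rate is only indicative; a service-wide figure would have to sum over all threads or come from the GC logs.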
We tried Azul Zing. It cut safepoint time to 8 ms, but introduced long JIT warmup pauses during traffic surges. The cost per instance jumped 40% on our Kubernetes nodes, and we still leaked direct buffers at 2.3 MB/s due to improper Netty arena sizing. At this point I pulled flame graphs using async-profiler and saw the real culprit: the JVM's biased locking and bias revocation events were consuming 18% of CPU during index splits. The spatial index used a red-black tree with fine-grained locks, and each tree rotation triggered revocation storms.
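A direct buffer leak of this kind can be watched from inside the JVM. The following is a minimal sketch using the standard BufferPoolMXBean; the class name DirectBufferProbe is hypothetical. One caveat to keep in mind: depending on its configuration, Netty may allocate some direct memory outside the JVM's "direct" pool, so treat this reading as a lower bound rather than a full accounting.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical probe: reports memory held by the JVM's "direct" buffer pool.
public class DirectBufferProbe {

    /** Bytes used by the "direct" buffer pool, or -1 if the pool is not found or the value is unknown. */
    public static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();

        // Allocate 1 MiB of direct memory so the pool reading visibly changes.
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);

        long after = directBytesUsed();
        System.out.printf("direct pool: before=%d bytes, after=%d bytes, buffer capacity=%d%n",
                before, after, buffer.capacity());
    }
}
```

Sampling directBytesUsed() at a fixed interval and dividing the growth by the elapsed seconds gives a leak rate comparable to the MB/s figure quoted above.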






