The Problem We Were Actually Solving

We ran Veltrix, a distributed event processing engine that powered real-time treasure hunts across retail stores. The business needed sub-50 ms latency for event ingestion and 99.99% uptime during Black Friday sales. Our first system was a Kafka Streams topology in Scala, carefully tuned with RocksDB state stores. Each pod had 32 vCPUs and a 16 GiB JVM heap, with G1GC configured with -XX:MaxGCPauseMillis=50. Yet during a load test at 500k events per second, the p99 latency spiked to 1.2 seconds and the JVM OOM'd twice.

What We Tried First (And Why It Failed)

We tried scaling out the Kafka Streams app to six pods, but the shuffle phase in the repartition topic introduced a 300 ms tail. We switched to exactly-once semantics and bumped the RocksDB cache to 4 GiB, but the blocking fsync on every commit pegged the disks at 100% iowait. Profiling with async-profiler showed 42% of the time was spent in JIT compilation stalls and 28% in GC pauses. The GC logs printed phrases like "Promoted 12 GB in 2.1 s", which was code for "we're about to crash".

We then rewrote the heavy join in C++ using RocksDB's JNI bindings. The median latency dropped to 28 ms, but every time the C++ library threw an uncaught exception, our JVM process exited with code 139. The ops team deployed a liveness probe that restarted the pod, but each restart left the treasure hunt UI showing stale leaderboards for 8–12 seconds after it refreshed. Marketing sent Slack messages that read "This is unacceptable."
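To make the failure mode concrete: a C++ exception that escapes a native function cannot be handled by the JVM, so it takes the whole process down with it (exit code 139 is the conventional 128 + 11, meaning the process died on SIGSEGV). One common mitigation, which is not something we had in place at the time, is to catch everything at the native boundary and hand back a plain status instead. The sketch below is a minimal illustration of that idea under stated assumptions: `JoinResult`, `guarded_call`, and `heavy_join` are hypothetical names, and the JNI plumbing is left out so the example compiles on its own.

```cpp
#include <cstdio>
#include <exception>
#include <stdexcept>

// Hypothetical result type handed back across the native boundary.
// A fixed-size buffer is used so that reporting an error cannot itself allocate.
struct JoinResult {
    int status;       // 0 = ok, 1 = failed
    char error[128];  // NUL-terminated message, empty on success
};

// Runs `body` and converts any C++ exception it throws into a JoinResult,
// so no exception unwinds past this function into the caller.
template <typename Fn>
JoinResult guarded_call(Fn&& body) noexcept {
    JoinResult result{0, ""};
    try {
        body();
    } catch (const std::exception& e) {
        result.status = 1;
        std::snprintf(result.error, sizeof(result.error), "%s", e.what());
    } catch (...) {
        result.status = 1;
        std::snprintf(result.error, sizeof(result.error), "%s",
                      "unknown native exception");
    }
    return result;
}

// Hypothetical stand-in for the heavy join: it only validates its inputs
// and throws on bad ones, which is enough to exercise the guard.
inline void heavy_join(int left_rows, int right_rows) {
    if (left_rows < 0 || right_rows < 0) {
        throw std::invalid_argument("negative row count");
    }
}
```

A caller would wrap the native work as `guarded_call([] { heavy_join(10, 20); })` and inspect `status`. In a real JNI entry point, a failed result would typically be turned into a Java exception with JNI's `ThrowNew`, so the Scala side sees a catchable error rather than a dead pod. This guard only covers C++ exceptions; it does nothing for genuine memory errors such as a null dereference.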