Three years ago I inherited a Rust-based treasure-hunt engine that processed upwards of 2 million in-game events per second. Latency was fine on the bench (median 3 ms, p99 12 ms), but every time we hit 2.5 M events/sec the JVM control plane ground to a halt at 30 % CPU and 512 MB RSS. I blamed GC pauses, tuned G1, disabled safepoints, and even rewrote the hot path in C. Nothing mattered.

Then I ran the collector under perf for 30 seconds and saw a 3.4 ms pause during which 92 % of threads were blocked on syscalls inside tokio::park_timeout. The async runtime was blocking the executor thread on the event queue's io_uring submission path. In my head, Rust meant no GC, so performance was only about the algorithm. It wasn't: the runtime was the constraint.
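To make that failure mode concrete, here is a toy, std-only Rust sketch of head-of-line blocking on a single executor thread. It is not tokio and not our engine: `Job`, `run_executor`, and `quick_job_delay` are hypothetical names, `thread::sleep` stands in for a blocking syscall, and handing the blocking work to a helper thread stands in for what `tokio::task::spawn_blocking` does in a real runtime.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// A unit of work for a toy single-threaded executor (hypothetical, not tokio's API).
enum Job {
    /// Stands in for a blocking syscall made on the executor thread.
    Blocking(Duration),
    /// A cheap event handler; reports how long it waited since `submitted`.
    Quick(Instant, mpsc::Sender<Duration>),
}

/// Drains jobs on the current thread. With `offload_blocking`, blocking work moves
/// to a helper thread (the idea behind tokio's `spawn_blocking`); without it, the
/// blocking call stalls every job queued behind it.
fn run_executor(rx: mpsc::Receiver<Job>, offload_blocking: bool) {
    let mut helpers = Vec::new();
    for job in rx {
        match job {
            Job::Blocking(d) if offload_blocking => {
                helpers.push(thread::spawn(move || thread::sleep(d)));
            }
            Job::Blocking(d) => thread::sleep(d),
            Job::Quick(submitted, reply) => {
                let _ = reply.send(submitted.elapsed());
            }
        }
    }
    // Wait for any offloaded work before returning.
    for h in helpers {
        h.join().unwrap();
    }
}

/// Submits a 100 ms blocking job followed by a quick job; returns the quick job's delay.
fn quick_job_delay(offload_blocking: bool) -> Duration {
    let (tx, rx) = mpsc::channel();
    let executor = thread::spawn(move || run_executor(rx, offload_blocking));
    let (reply_tx, reply_rx) = mpsc::channel();
    // Timestamp before either job is sent, so the measured delay includes any
    // time the executor spends stuck on the blocking job.
    let submitted = Instant::now();
    tx.send(Job::Blocking(Duration::from_millis(100))).unwrap();
    tx.send(Job::Quick(submitted, reply_tx)).unwrap();
    drop(tx); // closing the channel ends the executor's loop
    let delay = reply_rx.recv().unwrap();
    executor.join().unwrap();
    delay
}

fn main() {
    println!("inline blocking call:    quick job waited {:?}", quick_job_delay(false));
    println!("offloaded blocking call: quick job waited {:?}", quick_job_delay(true));
}
```

With the blocking call inline, the quick job always waits at least the full 100 ms; with it offloaded, the wait shrinks to roughly the cost of spawning a thread, which varies by machine.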

The Problem We Were Actually Solving

We were building a real-time open-world game where every player action—move, rotate, shoot, loot—generated an event. Those events streamed into a Kafka topic partitioned by shard (hot zones used 64 partitions). A cluster of Rust services consumed the topic, fed an in-memory state machine, then published updates to a second topic for the physics and rendering workers.
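The consumer side of that pipeline can be sketched as a loop: pick a partition by shard, apply each event to an in-memory state machine, and emit an update for downstream workers. Everything here is hypothetical (the `Action` fields, `partition_for`, `ShardState`, `Update`), there is no Kafka client, and the modulo partitioner is purely illustrative; Kafka's default partitioner hashes the record key with murmur2 instead.

```rust
use std::collections::HashMap;

/// Player actions from the post; the payload fields are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    Move { dx: i32, dy: i32 },
    Rotate { degrees: i32 },
    Shoot,
    Loot { item_id: u32 },
}

#[derive(Debug, Clone, Copy)]
struct Event {
    shard_id: u32,
    player_id: u64,
    action: Action,
}

/// Illustrative only: keying by shard sends every event for a shard to one partition.
fn partition_for(shard_id: u32, partitions: u32) -> u32 {
    shard_id % partitions
}

#[derive(Debug, Default, Clone, PartialEq)]
struct PlayerState {
    x: i32,
    y: i32,
    heading: i32,
    shots: u32,
    inventory: Vec<u32>,
}

/// Update that would be published to the second topic for physics/rendering.
#[derive(Debug, PartialEq)]
struct Update {
    player_id: u64,
    state: PlayerState,
}

/// In-memory state machine owned by one consumer.
#[derive(Default)]
struct ShardState {
    players: HashMap<u64, PlayerState>,
}

impl ShardState {
    /// Applies one event and returns the resulting player snapshot.
    fn apply(&mut self, ev: &Event) -> Update {
        let p = self.players.entry(ev.player_id).or_default();
        match ev.action {
            Action::Move { dx, dy } => {
                p.x += dx;
                p.y += dy;
            }
            Action::Rotate { degrees } => p.heading = (p.heading + degrees).rem_euclid(360),
            Action::Shoot => p.shots += 1,
            Action::Loot { item_id } => p.inventory.push(item_id),
        }
        Update { player_id: ev.player_id, state: p.clone() }
    }
}

fn main() {
    let events = [
        Event { shard_id: 7, player_id: 1, action: Action::Move { dx: 3, dy: -1 } },
        Event { shard_id: 7, player_id: 1, action: Action::Rotate { degrees: 450 } },
        Event { shard_id: 7, player_id: 1, action: Action::Shoot },
        Event { shard_id: 7, player_id: 1, action: Action::Loot { item_id: 42 } },
    ];
    let mut state = ShardState::default();
    for ev in &events {
        let part = partition_for(ev.shard_id, 64);
        let update = state.apply(ev);
        println!("partition {part}: {update:?}");
    }
}
```

Keying by shard keeps each zone's events on one partition, so a single consumer applies them in order.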

The system promised 2 ms end-to-end latency at 2 M events/sec. Early load tests at 1.5 M events/sec showed a median of 2.8 ms and a p99 of 14 ms. When we pushed to 2.2 M events/sec, the metrics inverted: the median dropped to 1.9 ms, but p99 exploded to 124 ms and 5 % of responses timed out. Collectors on the consumer pods showed CPU flat at 65 %, but RSS climbed 1.2 GB every minute, and the JVM control-plane nodes rebooted with OutOfMemoryError.
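A quick back-of-envelope pass over the 2.2 M events/sec numbers helps size the problem. This sketch assumes decimal units (1.2 GB = 1.2 × 10⁹ bytes) and treats the 1.2 GB/minute as growth across the whole consumer fleet handling all 2.2 M events/sec; if the figure was per pod, the per-event number scales up by the pod count. The function names are illustrative.

```rust
/// Bytes retained per processed event, given fleet-wide RSS growth per minute.
/// Assumes all growth comes from event processing (an assumption, not a measurement).
fn bytes_retained_per_event(rss_growth_bytes_per_min: f64, events_per_sec: f64) -> f64 {
    let growth_per_sec = rss_growth_bytes_per_min / 60.0;
    growth_per_sec / events_per_sec
}

/// Responses timing out per second at a given throughput and timeout fraction.
fn timeouts_per_sec(events_per_sec: f64, timeout_fraction: f64) -> f64 {
    events_per_sec * timeout_fraction
}

fn main() {
    let events_per_sec = 2.2e6;
    let leak = bytes_retained_per_event(1.2e9, events_per_sec);
    let timeouts = timeouts_per_sec(events_per_sec, 0.05);
    println!("~{leak:.1} bytes retained per event"); // ~9.1
    println!("~{timeouts:.0} timed-out responses per second"); // ~110000
}
```

Under those assumptions, RSS grows by roughly 9 bytes for every event processed, and 5 % timeouts at that rate means about 110,000 failed responses every second.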