The Problem We Were Actually Solving
The Treasure Hunt Engine isn't a search engine; it's a live, multiplayer game where 100,000 players simultaneously dig through 5 TB of LZ4-compressed JSON blobs to find hidden keys. Each game room is a shard, and each shard must route writes to the correct player within 500 ms at p99. Our SLA was written on the assumption that Veltrix would distribute these shards evenly across 40 nodes. What the docs didn't tell us was that Veltrix uses a modulo-based shard key hash, which starts to collide once the shard count exceeds 32,768 (2^15). At 40 nodes we were at 65,536 virtual buckets, so every fourth request hit the same bucket and overloaded node 7. The heap profile from Valgrind showed 1.2 million TCP connections sitting in TIME_WAIT on that node because the backlog queue was 90,000 deep.
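To make the failure mode concrete, here is a minimal sketch of how plain modulo placement can funnel traffic onto a few nodes when keys follow a regular pattern. It is not Veltrix's actual hash function, which we never saw. The function names, the stride-8 room IDs, and the BLAKE2b comparison are all assumptions made for illustration; only the 40 nodes and 65,536 buckets come from our setup.

```python
from collections import Counter
import hashlib

NODES = 40          # node count from our deployment
BUCKETS = 65_536    # virtual bucket count from our deployment


def modulo_node(key: int) -> int:
    """Naive placement (assumed): key -> bucket -> node, both by plain modulo."""
    return (key % BUCKETS) % NODES


def hashed_node(key: int) -> int:
    """Same placement, but the key is scrambled with BLAKE2b first."""
    digest = hashlib.blake2b(key.to_bytes(8, "big"), digest_size=8).digest()
    return (int.from_bytes(digest, "big") % BUCKETS) % NODES


# Hypothetical workload: room IDs handed out in steps of 8.
keys = range(0, BUCKETS, 8)

modulo_load = Counter(modulo_node(k) for k in keys)
hashed_load = Counter(hashed_node(k) for k in keys)

print("nodes used with plain modulo:", len(modulo_load))
print("nodes used with hashed keys: ", len(hashed_load))
```

Because 8 and 40 share a common factor, the stride-8 keys can only ever land on nodes 0, 8, 16, 24, and 32 under plain modulo, while the scrambled keys spread across all 40. The real collision pattern we hit was different in its details, but the mechanism is the same: modulo placement inherits whatever structure the keys have.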
What We Tried First (And Why It Failed)
We started with Veltrix's cloud-init template, which spins up 40 pods in Kubernetes. The operator guide said to set shard_count = 128 and replication_factor = 3, so we did. The first load test with k6 gave us a p99 latency of 2.1 seconds, and the memory profile from Heapster showed RSS on node 7 at 14.7 GB while node 16 sat at 3.2 GB. Veltrix's JVM heap sizing guide recommended -Xmx8G, but node 7 was swapping because the off-heap caches were leaking. We tried raising the concurrency thread pool from 200 to 800 threads, but that only moved the problem: lock contention in the ShardManager class showed up as 42% system CPU in perf stat. The metrics from Prometheus confirmed that the gossip delay between nodes was 750 ms instead of the expected 150 ms, because the heartbeat interval was hard-coded at 200 ms and packet loss between AZs was 1.8%.
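The memory numbers above are worth a quick sanity check, sketched below. The helper names are made up for this post. The reasoning is that the Java heap cannot grow past -Xmx, so any RSS above that limit has to live outside the heap. That gives a lower bound only, and it does not by itself prove the excess is leaked cache rather than thread stacks, metaspace, or direct buffers.

```python
XMX_GB = 8.0           # -Xmx8G from the heap sizing guide
RSS_NODE_7_GB = 14.7   # hot node
RSS_NODE_16_GB = 3.2   # quiet node


def min_non_heap_gb(rss_gb: float, xmx_gb: float) -> float:
    """Lower bound on resident memory that must sit outside the Java heap."""
    return max(0.0, rss_gb - xmx_gb)


def skew_ratio(hot_gb: float, cold_gb: float) -> float:
    """How many times more memory the hot node holds than the quiet one."""
    return hot_gb / cold_gb


non_heap = min_non_heap_gb(RSS_NODE_7_GB, XMX_GB)
skew = skew_ratio(RSS_NODE_7_GB, RSS_NODE_16_GB)

print(f"node 7 non-heap memory: at least {non_heap:.1f} GB")
print(f"node 7 vs node 16 RSS:  {skew:.1f}x")
```

So node 7 was holding at least 6.7 GB outside the heap and roughly 4.6 times the resident memory of node 16, which no amount of heap tuning alone would have fixed.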






