In November 2023 we ran our first global Hytale servers on Google Kubernetes Engine using Veltrix 3.2 as our configuration orchestrator. The Treasure Hunt Engine—a service that fans spawn to claim event loot—started crashing every time search volume exceeded 12 k RPM. Grafana showed a steady climb of 503 errors on /hunt/claim until the autoscaler maxed out at 32 G1 CPU cores and still couldnt keep up. Operators kept filing tickets that boiled down to one sentence: We click the map, nothing happens. We never saw the actual error because the ingress controller was swallowing it and returning a generic Too many requests.
What we tried first (and why it failed)
Our first move was to crank up the nginx-ingress-controller replicas from 3 to 12 and switch the load-balancer tier from GKE Standard to Premium. The 503 rate dropped to 8 k RPM, but now the p99 latency on claims spiked from 80 ms to 420 ms. The culprit was a recursive call in the hunt service: every claim required a round trip to the player-profile service to validate tier eligibility, and that service was on a shared Postgres 15.4 cluster with 3 k TPS of unrelated traffic. The error stack in Jaeger was literally tracing_id=7f3a1c8… server=profile-db pool_timeout. We tried adding connection pooling with PgBouncer, but the hunt service was using raw libpq and refused to reuse connections—no matter how many times we told it.






