The Problem We Were Actually Solving

During load testing at 50k concurrent hunters hitting the hunt endpoints, p99 latencies stayed under 200ms. But at 270k concurrent users in production, the hunt page suddenly took 1.8 seconds to load, triggering cascading 502s from our CDN. The error surfaced in Datadog as hunt_page_render_time_bucket{le=2.0} = 42% while le=0.5 dropped to 18%. The fingerprints were identical across three regions: high latency correlated exactly with Redis cache miss rate spiking from 12% to 48% during the hunt start window. Our cache-aside pattern with a 30-second TTL was amplifying miss storms.

We discovered that the treasure hunt start time was synchronised by marketing campaigns. When the clock struck 10:00:00 UTC, 270k users hit the endpoint within 30 seconds. Each request would check the cache (miss), fetch from PostgreSQL, render the page, and write the cache entry. But PostgreSQL couldnt keep up with 9k queries per second during that window, causing query queueing and connection exhaustion. The Redis layer, designed for 150k ops/sec, was not the bottleneck. The database was.

What We Tried First (And Why It Failed)

Our first attempt was to increase Redis TTL from 30 seconds to 5 minutes. This reduced cache misses from 48% to 24%, and p99 latency improved to 650ms. But at 320k concurrent users, the latency still spiked to 1.4s because the underlying database queries were still hitting the same table with the same indexes. The Redis layer was masking symptoms, not solving the root cause.