Treasure Hunting at Scale: Why Our Cache-Aside Cache Cost Us 40% in Tail Latency During Black Friday

The Problem We Were Actually Solving

In load tests with 50k concurrent hunters hitting the hunt endpoints, p99 latency stayed under 200ms. But at 270k concurrent users in production, the hunt page suddenly took 1.8 seconds to load, triggering cascading 502s from our CDN. The problem surfaced in Datadog as hunt_page_render_time_bucket{le=2.0} = 42% while le=0.5 dropped to 18%. The pattern was identical across three regions: high latency correlated exactly with the Redis cache miss rate spiking from 12% to 48% during the hunt start window. Our cache-aside pattern with a 30-second TTL was amplifying miss storms.

We discovered that the treasure hunt start time was synchronised by marketing campaigns. When the clock struck 10:00:00 UTC, 270k users hit the endpoint within 30 seconds, which works out to roughly 9k requests per second. Each request would check the cache (miss), fetch from PostgreSQL, render the page, and write the cache entry. But PostgreSQL couldn't keep up with 9k queries per second during that window, causing query queueing and connection exhaustion. The Redis layer, designed for 150k ops/sec, was not the bottleneck. The database was.
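To make the miss storm concrete, here is a minimal sketch of the cache-aside read path described above, using in-memory stand-ins rather than real Redis and PostgreSQL clients. The names FakeCache, FakeDatabase, get_hunt_page, and simulate_burst are hypothetical and exist only for this illustration, as is the 100ms simulated query time. The point it shows is that when many requests arrive for a cold key at once, each one sees a miss and issues its own database query.

```python
import threading
import time


class FakeDatabase:
    """Stand-in for PostgreSQL (assumption): counts queries, simulates a slow read."""

    def __init__(self, latency_s=0.1):
        self.latency_s = latency_s
        self.query_count = 0
        self._lock = threading.Lock()

    def load_hunt_page(self, hunt_id):
        with self._lock:
            self.query_count += 1
        time.sleep(self.latency_s)  # pretend the query and render take a while
        return f"rendered page for hunt {hunt_id}"


class FakeCache:
    """Stand-in for Redis (assumption): a dict with a per-key expiry time."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        return value if time.monotonic() < expires_at else None

    def set(self, key, value, ttl_s):
        with self._lock:
            self._data[key] = (value, time.monotonic() + ttl_s)


def get_hunt_page(cache, db, hunt_id, ttl_s=30):
    """Cache-aside read: check the cache, fall back to the database, write back."""
    key = f"hunt_page:{hunt_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    page = db.load_hunt_page(hunt_id)  # every concurrent miss lands here
    cache.set(key, page, ttl_s)
    return page


def simulate_burst(n_requests):
    """Release n_requests threads at the same moment against a cold cache."""
    cache, db = FakeCache(), FakeDatabase()
    barrier = threading.Barrier(n_requests)

    def worker():
        barrier.wait()  # mimic the synchronised 10:00:00 UTC start
        get_hunt_page(cache, db, hunt_id=1)

    threads = [threading.Thread(target=worker) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return db.query_count


if __name__ == "__main__":
    print("database queries for 20 simultaneous requests:", simulate_burst(20))
```

In this toy setup the database query count for a simultaneous burst is typically close to the number of requests, because nothing in the read path stops concurrent misses from all falling through. Sequential requests for the same key, by contrast, produce a single query followed by cache hits.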

What We Tried First (And Why It Failed)

Our first attempt was to increase the Redis TTL from 30 seconds to 5 minutes. This reduced the cache miss rate from 48% to 24%, and p99 latency improved to 650ms. But at 320k concurrent users, latency still spiked to 1.4s because the underlying database queries were still hitting the same table with the same indexes. The Redis layer was masking symptoms, not solving the root cause.
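A toy model helps show why a longer TTL lowers the miss count without removing misses altogether. The sketch below is an illustration under simplifying assumptions, not a reconstruction of our production traffic: it tracks a single hot key, requests arrive evenly spaced and one at a time, and the function name count_misses and the request rate are made up for the example.

```python
def count_misses(ttl_s, duration_s, requests_per_s):
    """Count cache misses for one hot key under evenly spaced, sequential requests.

    Time is tracked in integer ticks (one tick per request) to avoid
    floating-point drift. This is a deliberately simplified model.
    """
    ttl_ticks = ttl_s * requests_per_s
    total_ticks = duration_s * requests_per_s
    expires_at = None  # None means the key has never been cached
    misses = 0
    for now in range(total_ticks):
        if expires_at is None or now >= expires_at:
            misses += 1  # this request falls through to the database
            expires_at = now + ttl_ticks
    return misses


if __name__ == "__main__":
    for ttl in (30, 300):
        misses = count_misses(ttl_s=ttl, duration_s=600, requests_per_s=10)
        print(f"TTL {ttl}s: {misses} misses over 10 minutes")
```

In this model the 5-minute TTL cuts expiry-driven misses sharply, but the very first request still misses whatever the TTL is, and every miss that remains runs the same unchanged database query.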
