TL;DR: Exponential backoff with jitter spreads client retries over time so a recovering service doesn't get flattened by a synchronised wave. We added full jitter to our CI agents and a retry budget, and our internal 503 rate during recovery dropped from minutes of saturation to seconds.

A few months back at Buildkite a metadata service we run on ECS had a 40 second blip, and the recovery took six minutes instead of forty seconds. The service came back healthy, then immediately fell over again, then recovered, then fell over. Our build agents had all retried at the same fixed interval, so every two seconds a few thousand requests arrived in lockstep. That synchronised wave is a retry storm, and exponential backoff with jitter is the standard fix.

I'd read about this for years and never actually watched it bite us. Once it did, the maths got a lot more interesting. Here's what we changed and why it worked.

What is a retry storm

A retry storm happens when many clients fail at the same moment, then retry on the same schedule, producing a repeating spike of traffic that prevents the downstream service from recovering. Because the retries are correlated in time, each recovery attempt is immediately overwhelmed, so a short outage stretches into a long one. Jitter breaks the correlation by randomising when each client retries.