Exponential backoff with jitter stopped our CI retry storms

TL;DR: Exponential backoff with jitter spreads client retries over time so a recovering service doesn't get flattened by a synchronised wave. We added full jitter to our CI agents and a retry budget, and our internal 503 rate during recovery dropped from minutes of saturation to seconds.

A few months back at Buildkite a metadata service we run on ECS had a 40 second blip, and the recovery took six minutes instead of forty seconds. The service came back healthy, then immediately fell over again, then recovered, then fell over. Our build agents had all retried at the same fixed interval, so every two seconds a few thousand requests arrived in lockstep. That synchronised wave is a retry storm, and exponential backoff with jitter is the standard fix.

I'd read about this for years and never actually watched it bite us. Once it did, the maths got a lot more interesting. Here's what we changed and why it worked.

What is a retry storm

A retry storm happens when many clients fail at the same moment, then retry on the same schedule, producing a repeating spike of traffic that prevents the downstream service from recovering. Because the retries are correlated in time, each recovery attempt is immediately overwhelmed, so a short outage stretches into a long one. Jitter breaks the correlation by randomising when each client retries.

I'd read about this for years and never actually watched it bite us. Once it did, the maths got a lot more interesting. Here's what we changed and why it worked.

What is a retry storm

Exponential backoff with jitter stopped our CI retry storms

Exponential backoff with jitter stopped our CI retry storms

Related reading

The SIGTERM our build workers ignored, and the 90s that fixed it

Our retry loop made an outage worse. The circuit breaker stopped the cascade.

Resilix: How I Revived a Half-Baked Rate Limiter into an Enterprise-Grade…

PodDisruptionBudgets: Your Kubernetes Outage Insurance

Orchestrate Saga Compensation Timeouts in Real Time (Kiponos Java SDK)

Semantic caching our flaky-test summariser: 58% fewer LLM calls

Related reading

The SIGTERM our build workers ignored, and the 90s that fixed it

Our retry loop made an outage worse. The circuit breaker stopped the cascade.

Resilix: How I Revived a Half-Baked Rate Limiter into an Enterprise-Grade…

PodDisruptionBudgets: Your Kubernetes Outage Insurance

Orchestrate Saga Compensation Timeouts in Real Time (Kiponos Java SDK)

Semantic caching our flaky-test summariser: 58% fewer LLM calls