A few weeks back there was a 22-minute window where Anthropic returned a high rate of 5xx responses. Not a full outage. Degraded.
Our agent service had a retry policy that backed off and tried again on 5xx. With six workers and a shared retry budget that I had set too high, we were re-issuing failed calls about as fast as the API was returning errors. When the API recovered, we had a backlog of in-flight retries that pushed us straight back into rate limiting.
Total cost of the bad decision: about 18,000 wasted Anthropic calls and 9 minutes of additional recovery time after their side came back. Nothing user-visible blew up, but I felt bad about it.
I wrote llm-circuit-breaker the next day. It is small. The whole crate is under 400 lines of Rust. It pairs with llm-retry.
The state machine








