How we route around a 20-minute Anthropic outage

The shape of the bad day is always the same. A status page goes red, or doesn't go red but should have, or goes red 40 minutes after the customer's first failed request. The AI provider you depend on is having a moment. Your application's error rate spikes. You spend the next hour explaining to your own customer that "the model API is degraded" — a phrase that means absolutely nothing to them.

The previous Prism release — v1.4 Policy + Governance — made the cost side predictable. Today's release, v1.5, makes the reliability side predictable. The bet is the same: provider outages should be a routing problem, not a customer problem. Whether Anthropic, OpenAI, or Google is having a moment, the request that lands at api.ssimplifi.com should get a response or a structured error — never a silent hang, never a stream that never closes, never a stale "the model API is degraded" conversation.

The three problems the old failover had

The v1.0 failover code did roughly the right thing on paper: try the primary provider, retry once, then walk the fallback chain. It had three problems that only show up when something actually goes wrong.
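The flow described above (try the primary, retry once, then walk the fallback chain) can be sketched roughly as follows. This is a minimal illustration, not Prism's actual code: `ProviderError`, `call_with_failover`, and the idea that a provider is a plain callable taking a prompt are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence

# Hypothetical: a provider is any callable that takes a prompt and returns text.
Provider = Callable[[str], str]


class ProviderError(Exception):
    """Hypothetical error type raised when a provider call fails."""


def call_with_failover(
    primary: Provider,
    fallbacks: Sequence[Provider],
    prompt: str,
    retries: int = 1,
) -> str:
    """Try the primary, retry it `retries` times, then walk the fallbacks in order."""
    last_error: Optional[ProviderError] = None

    # First attempt plus the configured retries, all against the primary.
    for _ in range(1 + retries):
        try:
            return primary(prompt)
        except ProviderError as exc:
            last_error = exc

    # Primary is exhausted: each fallback gets a single attempt.
    for provider in fallbacks:
        try:
            return provider(prompt)
        except ProviderError as exc:
            last_error = exc

    # Nothing answered; surface one explicit error rather than hanging.
    raise ProviderError("all providers failed") from last_error
```

Note that nothing in this sketch remembers a failure between requests, so every request pays for the failed primary attempts again. That is where the health state in Problem 1 comes in.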

Problem 1: per-process in-memory state. Health was tracked in a Python dict that lived inside one uvicorn worker. Two workers handling traffic on the same EC2 instance had independent views: each had to see Anthropic fail three times on its own before it learned the provider was down. A container restart wiped the dict entirely, so the first three requests after a deploy ate the outage all over again.
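To make the failure mode concrete, here is a small sketch of per-process health tracking. The class name `InProcessHealth` and the constant `FAILURE_THRESHOLD` are hypothetical, and the threshold of three is inferred from the "three times" above rather than taken from the real code. Two tracker instances stand in for two uvicorn workers.

```python
FAILURE_THRESHOLD = 3  # assumed from the "three times" in the text


class InProcessHealth:
    """Hypothetical per-worker tracker: state lives only in this object."""

    def __init__(self) -> None:
        self._failures = {}  # provider name -> failure count

    def record_failure(self, provider: str) -> None:
        self._failures[provider] = self._failures.get(provider, 0) + 1

    def is_healthy(self, provider: str) -> bool:
        return self._failures.get(provider, 0) < FAILURE_THRESHOLD


# Two workers on the same instance, each with its own dict.
worker_a = InProcessHealth()
worker_b = InProcessHealth()

# Worker A sees three failures and stops trusting the provider.
for _ in range(FAILURE_THRESHOLD):
    worker_a.record_failure("anthropic")

a_view = worker_a.is_healthy("anthropic")  # False: A has learned
b_view = worker_b.is_healthy("anthropic")  # True: B has seen nothing yet

# A restart replaces the object, and the learned state is gone.
worker_a = InProcessHealth()
a_view_after_restart = worker_a.is_healthy("anthropic")  # True again
```

With N workers, up to N times the threshold requests can hit the failing provider before every worker has caught on, and each deploy resets the count.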
