The shape of the bad day is always the same. A status page goes red, or doesn't go red but should have, or goes red 40 minutes after the customer's first failed request. The AI provider you depend on is having a moment. Your application's error rate spikes. You spend the next hour explaining to your own customer that "the model API is degraded" — a phrase that means absolutely nothing to them.
The previous Prism release — v1.4 Policy + Governance — made the cost side predictable. Today's release, v1.5, makes the reliability side predictable. The bet is the same: provider outages should be a routing problem, not a customer problem. Whether Anthropic, OpenAI, or Google is having a moment, the request that lands at api.ssimplifi.com should get a response or a structured error — never a silent hang, never a stream that never closes, never a stale "the model API is degraded" conversation.
The three problems the old failover had
The v1.0 failover code did roughly the right thing on paper: try the primary provider, retry once, then walk the fallback chain. It had three problems that only show up when something actually goes wrong.
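To make that flow concrete, here is a minimal sketch of "try the primary, retry once, then walk the fallback chain". It is not the actual v1.0 code: `ProviderError`, `call_with_failover`, and the idea that a provider is a plain callable are all assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence

# Hypothetical: a provider is any callable that takes a prompt and returns text.
Provider = Callable[[str], str]


class ProviderError(Exception):
    """Hypothetical error raised when a provider call fails."""


def call_with_failover(
    request: str,
    primary: Provider,
    fallbacks: Sequence[Provider],
) -> str:
    """Try the primary twice (one retry), then each fallback once, in order."""
    attempts = [primary, primary, *fallbacks]
    last_error: Optional[ProviderError] = None
    for provider in attempts:
        try:
            return provider(request)
        except ProviderError as exc:
            # Remember the failure and move on to the next attempt.
            last_error = exc
    # Every attempt failed: surface one error instead of hanging.
    raise ProviderError("all providers failed") from last_error
```

Note that nothing in this loop remembers anything between requests. Any memory of which provider is failing has to live somewhere else, and that is where the problems below come from.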
Problem 1: per-process in-memory state. Health was tracked in a Python dict that lived inside one uvicorn worker. Two workers handling traffic on the same EC2 instance had independent views: each had to see Anthropic fail three times for itself before it learned the provider was down. A container restart wiped the dict entirely, so the first three requests after a deploy ate the outage all over again.
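A small sketch shows why this hurts. It assumes a failure counter with a threshold of three, inferred from the description above; the class and method names are hypothetical, not the real v1.0 code.

```python
from typing import Dict

# Assumption: three failures before a worker treats a provider as unhealthy.
FAILURE_THRESHOLD = 3


class InMemoryHealth:
    """Hypothetical per-process health tracker: one instance per worker."""

    def __init__(self) -> None:
        # This dict lives in one process and is never shared.
        self.failures: Dict[str, int] = {}

    def record_failure(self, provider: str) -> None:
        self.failures[provider] = self.failures.get(provider, 0) + 1

    def is_healthy(self, provider: str) -> bool:
        return self.failures.get(provider, 0) < FAILURE_THRESHOLD


# Two workers on the same instance, each with its own tracker.
worker_a = InMemoryHealth()
worker_b = InMemoryHealth()

# Worker A sees three failed requests and marks the provider unhealthy.
for _ in range(FAILURE_THRESHOLD):
    worker_a.record_failure("anthropic")

a_view = worker_a.is_healthy("anthropic")  # False: A has learned
b_view = worker_b.is_healthy("anthropic")  # True: B has seen nothing yet

# A restart is just a fresh object, so everything A learned is gone.
worker_a = InMemoryHealth()
after_restart = worker_a.is_healthy("anthropic")  # True again
```

Worker B keeps sending traffic to a provider that worker A already knows is down, and a restarted worker A rejoins it.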






