How I built a 3-provider LLM fallback system in production (and what actually broke)

I'm a pre-final year student. I built Socra (https://socra-production.up.railway.app/), a multi-agent LLM SaaS that interrogates your startup idea using 5 specialist AI personas before generating an architecture masterplan. It has paying users. It runs on Railway. And for the first two weeks of production, it was quietly broken in a way I didn't notice until real users hit it.

This is the story of how I built the 3-provider fallback chain (Anthropic → Google → Groq), what broke along the way, and the actual code that runs in production today.
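Before getting into the details, here is a minimal sketch of what "fallback chain" means in general. This is an illustration, not Socra's production code: the function names (complete_with_fallback, call_anthropic, call_google, call_groq) are hypothetical, and the provider calls are stubs that simulate failures instead of hitting real SDKs. The only thing taken from above is the order Anthropic → Google → Groq.

```python
from typing import Callable, Sequence

# A provider is modeled as (name, callable). The callable takes a prompt
# and returns the completion text, or raises on failure.
Provider = tuple[str, Callable[[str], str]]


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try each provider in order and return (provider_name, response_text)."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # deliberately broad for this sketch
            errors.append(f"{name}: {exc}")
    # Nothing succeeded: surface every failure instead of hiding it.
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real SDK calls (assumptions, not real APIs).
def call_anthropic(prompt: str) -> str:
    raise TimeoutError("simulated timeout")


def call_google(prompt: str) -> str:
    raise RuntimeError("simulated rate limit")


def call_groq(prompt: str) -> str:
    return f"groq answer to: {prompt}"


CHAIN: list[Provider] = [
    ("anthropic", call_anthropic),
    ("google", call_google),
    ("groq", call_groq),
]

if __name__ == "__main__":
    provider, text = complete_with_fallback("Is my startup idea viable?", CHAIN)
    print(provider, text)
```

Returning the provider name alongside the text is a design choice in this sketch: it lets the caller log which provider actually served a request, so silent fallbacks are less likely to go unnoticed.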

Why you need a fallback chain at all

When I first deployed Socra, the LLM routing was simple: one provider, one model, one API key. It worked fine in development.