When my agents started failing in production, I did what everyone does first: I went hunting for hallucinations. Better prompts, tighter output schemas, more guardrails. None of it moved the needle, because I was debugging the wrong layer. The agent's reasoning was fine. It was the plumbing that kept collapsing — and the single biggest culprit was the most boring thing imaginable: rate limits.

This turns out not to be just my problem. It's one of the dominant production failure modes for LLM applications right now, and almost nobody talks about it because it doesn't make for a good demo.

TL;DR: In production, the thing that takes your agent down usually isn't bad reasoning. It's capacity. Provider rate limits are now one of the largest sources of LLM call errors in real traces. A demo makes one request at a time; a production agent fans out into dozens of chained, retrying, concurrent calls and slams into limits the demo never touched. The fix isn't a smarter model. It's capacity engineering: budgeting, backpressure, retries with jitter, fallback models, and caching.
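Of the fixes in that list, retries with jitter is the easiest to show in a few lines. Here's a minimal sketch in Python, assuming a hypothetical `RateLimitError` that stands in for whatever your provider's SDK raises on an HTTP 429. The names `backoff_delays` and `call_with_retries` and the default values are mine, for illustration only, not from any library. It uses the "full jitter" variant of exponential backoff: each wait is a random duration between zero and an exponentially growing ceiling, so concurrent callers that hit the limit together don't all retry at the same instant.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=random.random):
    """Yield one full-jitter delay per retry.

    The ceiling for retry number `attempt` is min(cap, base * 2**attempt);
    the actual delay is a random fraction of that ceiling.
    """
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng() * ceiling


def call_with_retries(call, max_retries=5, sleep=time.sleep, **backoff):
    """Run call(); on RateLimitError, wait a jittered delay and try again.

    Makes at most max_retries + 1 attempts. Only rate-limit errors are
    retried; any other exception propagates immediately.
    """
    for delay in backoff_delays(max_retries, **backoff):
        try:
            return call()
        except RateLimitError:
            sleep(delay)
    # Final attempt: if this one is rate limited too, let the error surface.
    return call()
```

You'd wrap each provider call, for example `call_with_retries(lambda: client_call(prompt))`, where `client_call` is whatever function you already use. If your provider sends a `Retry-After` header, it's probably wiser to honor that than the computed delay. And note that retries alone only smooth out bursts: if the agent is consistently over its limit, you still need the budgeting and backpressure pieces.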

The data nobody puts in the pitch deck

Here's the number that reframed how I think about agent reliability. In Datadog's analysis of real LLM observability traces, rate limits accounted for roughly a third of all LLM span errors in March 2026, on the order of millions of individual errors. Their conclusion was blunt: when the dominant failure mode of your LLM application is capacity, you need to redouble your capacity engineering, not your prompt engineering.