The first few LLM API bugs I hit in production were easy to notice.
The request failed. The user saw an error. I opened the logs, found the stack trace, fixed the obvious thing, and moved on.
The harder bugs were quieter.
The API still returned a response, but it was slower than usual. A fallback model kicked in without anyone noticing. Token usage crept up over a few days. A retry made the request succeed, but doubled the latency. Streaming worked most of the time, except when it didn't.
Nothing looked "down." The app just started feeling worse.
