The first few LLM API bugs I hit in production were easy to notice.

The request failed. The user saw an error. I opened the logs, found the stack trace, fixed the obvious thing, and moved on.

The harder bugs were quieter.

The API still returned a response, but it was slower than usual. A fallback model kicked in without anyone noticing. Token usage crept up over a few days. A retry made the request succeed, but doubled the latency. Streaming worked most of the time, except when it didn't.

Nothing looked "down." The app just started feeling worse.