Most of the failures I see in production AI agents aren't reasoning failures. The model picks the right tool, fills in the right arguments, and makes a perfectly sensible decision. Then the agent charges the customer twice.

The reason is mundane and has nothing to do with intelligence. A write-capable agent (one that can send an email, create a ticket, move money, or update a database) lives inside the same unreliable network as any other distributed system. Requests time out. Connections drop after the server has already committed the write but before the response gets back. An orchestration framework retries a step that looked like it failed but didn't. And because the agent is a loop that re-plans on every observation, a single ambiguous outcome can lead it to simply try the action again.

In a read-only agent, a retry is free. In a write-capable agent, a retry is a second irreversible action in the real world. That asymmetry is the whole game, and the fix is older than LLMs: idempotency.

The shape of the bug

Here's the sequence that bites teams over and over. The agent calls send_invoice. The downstream service receives it, creates the invoice, and starts sending the response. Somewhere on the way back, the connection dies. From the agent's point of view, the call failed — it got a timeout, not a 200. So the agent, doing exactly what a resilient system is supposed to do, retries. Now there are two invoices.
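To make that sequence concrete, here is a minimal sketch that reproduces it in a few lines of Python. Everything in it is made up for illustration: FakeInvoiceService stands in for the downstream service, call_with_naive_retry stands in for whatever retry logic your agent framework applies, and the customer ID and amount are arbitrary. The only thing it is meant to show is the ordering: the write is committed before the response is lost.

```python
class FakeInvoiceService:
    """Hypothetical downstream service: commits the write, then loses the response."""

    def __init__(self, drop_first_response=True):
        self.invoices = []
        self._drop_next_response = drop_first_response

    def send_invoice(self, customer_id, amount_cents):
        # The write is committed first...
        self.invoices.append((customer_id, amount_cents))
        # ...and only afterwards does the response get lost on the way back.
        if self._drop_next_response:
            self._drop_next_response = False
            raise TimeoutError("response lost after commit")
        return {"status": 200}


def call_with_naive_retry(service, customer_id, amount_cents, max_attempts=2):
    """Retry on timeout, as a resilient client would, with nothing to dedupe on."""
    for attempt in range(1, max_attempts + 1):
        try:
            return service.send_invoice(customer_id, amount_cents)
        except TimeoutError:
            # The caller cannot tell "never arrived" from "arrived, reply lost".
            if attempt == max_attempts:
                raise


service = FakeInvoiceService()
result = call_with_naive_retry(service, "cust_42", 1999)
print(result["status"], len(service.invoices))  # prints "200 2"
```

The caller sees one clean success, and the service holds two invoices. Nothing in the retry loop is wrong by the usual standards of resilient client code, which is exactly why this bug survives code review.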