Your Agent's Retries Are Double-Charging Your Users (and Every Eval Is Green)

Your agent calls a tool. The tool times out at the network layer but actually succeeds on the server. Your harness sees no response, so it retries. Now charge_customer ran twice, send_email fired twice, and create_ticket left two tickets. The model did nothing wrong. Every eval you have is green. And a customer just got billed $198 for a $99 plan.

This is the failure mode nobody puts in a demo, because demos don't retry and demos don't have side effects that matter. Production has both. If your agent takes actions — not just generates text — then retry safety is not a nice-to-have, it is the difference between an autonomous system and a liability with a scheduler.

I want to argue two things. First: side-effect safety is a Tier 1 evaluation problem, not a prompt problem. Second: you cannot even see this class of bug without a trace of what the agent actually did, which is where the eval story and the observability story become the same story.

Why the model can't save you here

The instinct is to make the agent smarter. "Tell it to check whether the charge already went through before retrying." Please don't. You are asking a non-deterministic component to enforce an invariant that must hold every single time. The model will comply 95% of the time and the other 5% is a chargeback.

Why the model can't save you here

Your Agent's Retries Are Double-Charging Your Users (and Every Eval Is Green)

Your Agent's Retries Are Double-Charging Your Users (and Every Eval Is Green)

Other newsrooms on this story

Related reading

Your AI Agent Will Double-Charge on a Lost Response

Your Agent Failed in Prod. Good Luck Reproducing It.

Why your agentic system doesn't survive its own success — and what a…

Your agent demo works. That's the trap.

Your AI Agent Just Burned $108 in an Hour. Here's the 50-Line Fix.

Tool-Call Accuracy Is Lying to You: A Four-Layer Eval Stack for Agents

Related reading

Your AI Agent Will Double-Charge on a Lost Response

Your Agent Failed in Prod. Good Luck Reproducing It.

Why your agentic system doesn't survive its own success — and what a…

Your agent demo works. That's the trap.

Your AI Agent Just Burned $108 in an Hour. Here's the 50-Line Fix.

Tool-Call Accuracy Is Lying to You: A Four-Layer Eval Stack for Agents

Other newsrooms on this story