The first failure is usually not the expensive one.
The expensive part is what happens afterward, when the system keeps trying, keeps spending, and keeps producing the same outcome because nothing about the situation has changed.
We kept running into a simple pattern: the agent would miss a step, the runtime would retry, the next attempt would see the same state, and the loop would repeat until the cost was visible in the bill or the operator log. At that point the problem stops being a model-quality issue and becomes a control-system issue.
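One way to treat this as a control problem is to make the retry loop check whether anything changed before it spends again. The sketch below is a minimal illustration, not the runtime described here: `run_with_guard`, `fingerprint`, `RetryResult`, and the `read_state` and `attempt` callables are all hypothetical names, and it assumes the state the agent sees can be serialized to JSON.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RetryResult:
    succeeded: bool
    attempts: int
    stop_reason: str  # "success", "state_unchanged", or "max_attempts"


def fingerprint(state: dict) -> str:
    """Return a stable hash of the state the next attempt would see."""
    payload = json.dumps(state, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_with_guard(
    attempt: Callable[[dict], bool],
    read_state: Callable[[], dict],
    max_attempts: int = 3,
) -> RetryResult:
    """Retry `attempt`, but stop early if the state has not changed.

    Hypothetical helper: `attempt` returns True on success, and
    `read_state` returns whatever the agent would observe next.
    """
    last_seen: Optional[str] = None
    attempts = 0
    while attempts < max_attempts:
        state = read_state()
        current = fingerprint(state)
        # Same state as the last failed attempt: retrying would only
        # repeat the same outcome, so stop and hand off instead.
        if current == last_seen:
            return RetryResult(False, attempts, "state_unchanged")
        last_seen = current
        attempts += 1
        if attempt(state):
            return RetryResult(True, attempts, "success")
    return RetryResult(False, attempts, "max_attempts")


if __name__ == "__main__":
    # An attempt that always fails against a state that never changes.
    stuck_state = {"step": "fetch", "missing": ["auth_token"]}
    result = run_with_guard(lambda s: False, lambda: stuck_state, max_attempts=5)
    print(result.stop_reason, result.attempts)  # prints: state_unchanged 1
```

The attempt cap bounds the worst case, but the fingerprint check is what addresses the pattern above: the loop ends after one failed attempt instead of five, because the second attempt would have seen exactly what the first one saw.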
Why the loop hurts more than the mistake
A single bad step is recoverable. An unbounded retry loop compounds the mistake.






