The 3am pager that changed how I write agents
A few months back, I shipped what I thought was a clean agent for a client. It scraped web pages, summarized them, then routed the results to different downstream tools based on content. Worked great in dev. Worked great for the first week.
Then I got paged at 3am.
The agent had gotten into a loop. One of the tools timed out, returned a partial response, the LLM "decided" the task wasn't done, called the same tool again, got another partial response, and so on. By the time I caught it, we'd burned through about 4,000 API calls overnight.
The fix wasn't fun. The agent logic was scattered across if statements, retry decorators, prompt templates, and a while loop that was supposed to terminate when the LLM said "DONE". Spoiler: it sometimes did not say DONE.








