It demoed perfectly.

Clean output, every case green, the kind of run that makes you screenshot the terminal.

Then it reached production and fell apart inside a day.

What followed was the worst kind of debugging. The agent would return something wrong, and I could not tell why. Was it the prompt. Was it a timeout on a tool call. Was it stale data. Was it a retry firing twice and confusing the next step. I spent far longer staring at that agent than I ever spent building it.

That pain was on the dev.to front page this week. Someone wrote about spending ten times longer debugging AI code than writing it. I felt that one in my spine.