Your Agent Failed in Prod. Good Luck Reproducing It.

9:04 a.m.

A ticket lands. A customer ran your agent yesterday, it called the wrong tool, deleted the wrong record, and now there is a screenshot in your inbox with a red box drawn around the damage. You have the user ID. You have the timestamp. You copy the exact prompt out of the logs, paste it into the same model, with the same system prompt, and hit run.

It works perfectly.

You run it again. It works again. You run it ten more times. The agent behaves like a model employee every single time, and the one run that mattered, the one that cost a customer their data, is nowhere. You cannot make it happen again, which means you cannot debug it, which means you cannot promise it will not happen to the next customer.

This is the reproducibility problem, and if you are shipping anything built on a large language model, it is already your problem. This post is about why it happens, why some of it is actually a feature you do not want to remove, and what you can do to get back the one thing you need: the ability to replay a run exactly as it happened.

Your Agent Failed in Prod. Good Luck Reproducing It.

Related reading

You Recorded the Incident. Now Prove Your Fix Actually Works.

You Can't Reproduce Your Agent's Bugs—That's Why You Can't Fix Them

Your Agent Logs Are Lying to You: What to Actually Trace in an Agentic System

AI Agent Debugging Checklist: From Failed Run to Root Cause

Why Your AI Agent Works in Dev and Breaks in Prod

Your AI Agent Keeps Failing After 50 Clean Demos, Stop Treating the LLM Like a…