9:04 a.m.
A ticket lands. A customer ran your agent yesterday, it called the wrong tool, deleted the wrong record, and now there is a screenshot in your inbox with a red box drawn around the damage. You have the user ID. You have the timestamp. You copy the exact prompt out of the logs, paste it into the same model, with the same system prompt, and hit run.
It works perfectly.
You run it again. It works again. You run it ten more times. The agent behaves like a model employee every single time, and the one run that mattered, the one that cost a customer their data, is nowhere. You cannot make it happen again, which means you cannot debug it, which means you cannot promise it will not happen to the next customer.
This is the reproducibility problem, and if you are shipping anything built on a large language model, it is already your problem. This post is about why it happens, why some of it is actually a feature you do not want to remove, and what you can do to get back the one thing you need: the ability to replay a run exactly as it happened.






