Hand the same paired before/after dataset (n = 25) to ChatGPT five times. Same prompt: "These are the same subjects measured before and after an intervention. Did their scores change significantly?"
Four of the five runs return p = 0.009 from a paired t-test.
The fifth run does a Shapiro–Wilk normality check on the differences first, decides they're non-normal, switches to a Wilcoxon signed-rank test, and reports p = 0.000018.
All five reach the same conclusion (significant). But notice what happened: only one run out of five thought to check an assumption you'd want it to check. The other four skipped it. The choice of method — and the test statistic, and the p-value — depended on whether the LLM happened to run an assumption check that time. On borderline data, this is the difference between reject and don't reject.
If you're using LLMs for exploratory data analysis on a weekend project, you might shrug. If you're using them for anything that gets cited, gets submitted to a regulator, or gets handed to a clinician, this is a problem. It's a known problem — Cui & Alexander (2026) documented exactly this kind of method-divergence empirically; AIRepr (Zeng et al., 2025) shows the same thing across reproducibility metrics. The current answer in the literature is to constrain the agent so its execution is replayable. But replayability fixes "did we run the same code." It doesn't fix "did we run the right analysis."






