Why Your LLM Agent Gives a Different P-Value Every Time (And What to Build Instead)

Hand the same paired before/after dataset (n = 25) to ChatGPT five times. Same prompt: "These are the same subjects measured before and after an intervention. Did their scores change significantly?"

Four of the five runs return p = 0.009 from a paired t-test.

The fifth run does a Shapiro–Wilk normality check on the differences first, decides they're non-normal, switches to a Wilcoxon signed-rank test, and reports p = 0.000018.

All five reach the same conclusion: significant. But notice what happened. Only one run in five checked an assumption you would want checked; the other four skipped it. The choice of method, and with it the test statistic and the p-value, depended on whether the LLM happened to run an assumption check that time. Here the two p-values differ by a factor of 500 (0.009 vs. 0.000018) and still land on the same side of 0.05. On borderline data, that same coin flip decides whether you reject the null hypothesis or fail to reject it.
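To make the divergence concrete, here is a minimal sketch of the branch the fifth run took, written as a fixed rule so the assumption check always runs and its result is always reported. It assumes NumPy and SciPy; the function name `choose_paired_test`, the 0.05 normality threshold, and the synthetic data are illustrative assumptions, not anything ChatGPT produced. The rule itself (pretest with Shapiro–Wilk, then pick the test) is only one possible policy, shown because it is the one described above.

```python
import numpy as np
from scipy import stats


def choose_paired_test(before, after, alpha_normality=0.05):
    """Pick a paired test by a fixed rule and report which branch was taken.

    Hypothetical helper: the threshold and the rule are assumptions made
    for illustration, not a recommended analysis policy.
    """
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    if before.shape != after.shape:
        raise ValueError("before and after must have the same length")

    # The normality assumption of the paired t-test is about the differences.
    diffs = after - before
    _, shapiro_p = stats.shapiro(diffs)

    if shapiro_p >= alpha_normality:
        method = "paired t-test"
        result = stats.ttest_rel(after, before)
    else:
        method = "Wilcoxon signed-rank"
        result = stats.wilcoxon(after, before)

    return {
        "method": method,
        "shapiro_p": float(shapiro_p),
        "statistic": float(result.statistic),
        "p_value": float(result.pvalue),
    }


# Synthetic paired data (n = 25), seeded so the run is repeatable.
rng = np.random.default_rng(0)
before = rng.normal(loc=50.0, scale=10.0, size=25)
after = before + rng.normal(loc=3.0, scale=5.0, size=25)
print(choose_paired_test(before, after))
```

Calling this twice on the same data returns the same method and the same p-value, because the branch depends only on the data and the threshold. That is the property the five ChatGPT runs lacked: there, whether the check ran at all varied from run to run.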

If you're using LLMs for exploratory data analysis on a weekend project, you might shrug. If you're using them for anything that gets cited, gets submitted to a regulator, or gets handed to a clinician, this is a problem. It's a known problem: Cui & Alexander (2026) documented exactly this kind of method divergence empirically, and AIRepr (Zeng et al., 2025) shows the same thing across reproducibility metrics. The current answer in the literature is to constrain the agent so its execution is replayable. But replayability answers "did we run the same code?" It doesn't answer "did we run the right analysis?"

Related reading

I A/B tested 4 LLMs on the same 500 queries. The results surprised me.

How to Stop Evaluating LLM Outputs by Gut Feel

Bringing Scientific Rigor to LLM Comparison

Standard Benchmarks Fail -- Auditing LLM Agents in Finance Must Prioritize Risk

Which LLM is the best stock picker? I built a benchmark to find out.

Switching our LLM-as-judge from 5-class to binary in CI: the patterns we kept
