Bootstrap confidence intervals for your LLM eval metrics

TL;DR: A single eval number hides its own uncertainty. Eval confidence intervals from bootstrap resampling turn a point estimate like 84.2% accuracy into a range, so you stop shipping models on a difference that is noise.

Two checkpoints came back from a fine-tuning run at 84.2% and 85.7% on our 500-example agent eval set. The 1.5 point gap read like a win, and someone wanted to promote the second checkpoint to staging. Before that, I wanted eval confidence intervals on both numbers, because a 500-example set carries more sampling error than most teams admit. At 500 examples, the 95% interval on a single accuracy near 85% spans roughly 3 points on each side. The win sat well inside the noise.
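The "roughly 3 points" figure can be checked with the normal approximation for a proportion. The snippet below is a back-of-the-envelope sketch rather than the analysis used for the decision: the helper name `wald_half_width` is made up for illustration, and it treats each checkpoint's accuracy as an independent binomial proportion.

```python
import math

def wald_half_width(accuracy, n, z=1.96):
    """Half-width of the normal-approximation (Wald) interval for a proportion."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)

# Accuracy near 85% on a 500-example eval set.
half = wald_half_width(0.85, 500)
print(f"+/- {100 * half:.1f} points")  # prints "+/- 3.1 points"

# The two checkpoint scores from the run above, each with its own interval.
for acc in (0.842, 0.857):
    h = wald_half_width(acc, 500)
    print(f"{acc:.1%}: [{acc - h:.1%}, {acc + h:.1%}]")
```

Each interval is about twice as wide as the 1.5 point gap between the checkpoints, which is why the gap alone is not evidence of an improvement.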

I lead the fine-tuning and evaluation team at Nexus Labs, and the most common mistake I see is treating an eval score as exact. It isn't. Your eval set is a sample drawn from the input space you care about, and a different 500 examples would return a different number. Confidence intervals make that variance visible.

What an eval confidence interval actually tells you

An eval confidence interval is a range around a metric, like accuracy or F1, that quantifies how much the score would move if you resampled the eval set. A 95% bootstrap interval of [81.0%, 87.1%] means that across thousands of resamples of your data, each drawn with replacement and the same size as the original set, the middle 95% of the recomputed scores fell in that band. It measures sampling noise, not model quality.
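Here is a minimal sketch of that resampling procedure, using the percentile method and only the standard library. The function name `bootstrap_ci`, its defaults, and the synthetic outcomes (421 correct out of 500, matching an 84.2% score) are assumptions for illustration, not data or code from our pipeline.

```python
import random

def bootstrap_ci(outcomes, n_resamples=5_000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-example scores."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    # Resample the eval set with replacement and recompute the metric each time.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    # Cut an equal number of resampled scores off each tail.
    tail = round((1.0 - confidence) / 2 * n_resamples)
    lo = means[tail]
    hi = means[n_resamples - tail - 1]
    point = sum(outcomes) / n
    return point, lo, hi

# Synthetic per-example results: 1 = correct, 0 = incorrect.
outcomes = [1] * 421 + [0] * 79
point, lo, hi = bootstrap_ci(outcomes)
print(f"accuracy {point:.1%}, 95% bootstrap interval [{lo:.1%}, {hi:.1%}]")
```

The same function works for any metric that is a mean of per-example scores. For a metric like F1, the recomputation inside the loop would need to run on the resampled examples rather than average them.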

Related reading

We put confidence intervals on our LLM-judge scores. The error bars ate three…

Your Evals Are Flaky Too: Stop Trusting a Pass Rate You Can't Reproduce

Power analysis for LLM evals: how big does your eval set need to be to catch a…

Stop Shipping ML Models With Bare Floats: A Deep Dive Into Statistically…

More eval traces will not stabilize your kappa. Stratify the ones you have

Stop Trusting Your Accuracy Score: A Practical Guide to Evaluating Logistic…
