TL;DR: A single eval number hides its own uncertainty. Eval confidence intervals from bootstrap resampling turn a point estimate like 84.2% accuracy into a range, so you stop shipping models on a difference that is noise.
Two checkpoints came back from a fine-tuning run at 84.2% and 85.7% on our 500-example agent eval set. The 1.5-point gap read like a win, and someone wanted to promote the second checkpoint to staging. Before that, I wanted eval confidence intervals on both numbers, because a 500-example set carries more sampling error than most teams admit. At 500 examples, the 95% interval on a single accuracy near 85% spans roughly 3 points on each side, about twice the gap between the two checkpoints. The win sat well inside the noise.
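The "roughly 3 points on each side" figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes a normal approximation to the binomial, which is a shortcut and not the bootstrap itself; the helper name `normal_approx_half_width` is made up for illustration.

```python
import math

def normal_approx_half_width(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for an accuracy
    measured on n independent examples (normal approximation)."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return z * standard_error

# The two checkpoint scores from the 500-example eval set.
for acc in (0.842, 0.857):
    half = normal_approx_half_width(acc, n=500)
    print(f"{acc:.1%} +/- {half * 100:.1f} points "
          f"-> [{acc - half:.1%}, {acc + half:.1%}]")
```

Both half-widths come out a little above 3 points, so each checkpoint's score falls comfortably inside the other's interval.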
I lead the fine-tuning and evaluation team at Nexus Labs, and the most common mistake I see is treating an eval score as exact. It isn't. Your eval set is a sample drawn from the input space you care about, and a different 500 examples would return a different number. Confidence intervals make that variance visible.
What an eval confidence interval actually tells you
An eval confidence interval is a range around a metric, like accuracy or F1, that quantifies how much the score would move if you resampled the eval set. A 95% percentile bootstrap interval of [81.0%, 87.1%] means that across thousands of resamples of your data, each drawn with replacement at the original size, the middle 95% of the recomputed scores fell in that band. It measures sampling noise, not model quality.
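To make the resampling concrete, here is a minimal sketch of a percentile bootstrap for accuracy using only the standard library. It assumes per-example outcomes are stored as 1 (correct) and 0 (incorrect). The function name `bootstrap_ci`, the resample count, the seed, and the synthetic 421-of-500 data are illustrative assumptions, not our production harness.

```python
import random

def bootstrap_ci(outcomes, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the eval set with replacement, at its original size,
    # and recompute accuracy each time.
    scores = sorted(
        sum(rng.choices(outcomes, k=n)) / n
        for _ in range(n_resamples)
    )
    # Take the middle `confidence` fraction of the sorted scores.
    alpha = 1.0 - confidence
    low_idx = max(0, int(alpha / 2 * n_resamples))
    high_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return scores[low_idx], scores[high_idx]

# Hypothetical data: 421 correct out of 500, i.e. 84.2% accuracy.
outcomes = [1] * 421 + [0] * 79
low, high = bootstrap_ci(outcomes)
print(f"accuracy = {sum(outcomes) / len(outcomes):.1%}, "
      f"95% CI = [{low:.1%}, {high:.1%}]")
```

The same loop works for F1 or any other metric: swap the mean for the metric function and resample whole examples (prediction and label together) so each resample stays internally consistent.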






