TL;DR

A team bootstrapped confidence intervals on its weekly Cohen's kappa scores (LLM judge versus human labels) and found that a three-week decline (0.55→0.44) was noise, not signal. Most eval dashboards report raw point estimates and get retuned on 0.05 movements, a rigor gap compared with A/B testing standards.

We track weekly agreement between an LLM judge and human labels (Cohen's kappa) on a sample of production traces. For three weeks the point estimates told a story: 0.55, then 0.49, then 0.44. The team started hunting for what "broke" the judge.

Then we bootstrapped confidence intervals on each weekly number. At our sample size (50 traces a week), the 95% intervals were roughly plus or minus 0.15. All three weekly estimates sat inside one another's intervals. The decline we had spent two days investigating was indistinguishable from noise.
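For readers who want to reproduce this kind of interval, here is a minimal sketch of a percentile bootstrap on Cohen's kappa. It is not our production code: the function names, the 2,000 resamples, and the choice of the percentile method are assumptions made for illustration, and it uses only the standard library.

```python
import random
from collections import Counter


def cohens_kappa(judge, human):
    """Cohen's kappa for two equal-length sequences of labels."""
    n = len(judge)
    if n == 0 or n != len(human):
        raise ValueError("need two non-empty label lists of equal length")
    # Observed agreement: share of traces where judge and human match.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement from each rater's marginal label frequencies.
    judge_counts, human_counts = Counter(judge), Counter(human)
    expected = sum(judge_counts[c] * human_counts[c] for c in judge_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined: both raters used one identical label
    return (observed - expected) / (1.0 - expected)


def bootstrap_kappa_ci(judge, human, n_resamples=2000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap.

    Traces are resampled with replacement as (judge, human) pairs, so the
    pairing between the two labels is preserved in every resample.
    """
    rng = random.Random(seed)
    n = len(judge)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        k = cohens_kappa([judge[i] for i in idx], [human[i] for i in idx])
        if k == k:  # drop resamples where kappa is undefined (NaN)
            stats.append(k)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return cohens_kappa(judge, human), lower, upper
```

Calling `bootstrap_kappa_ci(judge_labels, human_labels)` on one week's 50 paired labels returns the point estimate with its lower and upper bounds. At that sample size, expect the band to be wide.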

What we changed

Stratified the weekly sample by score band and intent instead of sampling uniformly. Rare-but-important slices stopped vanishing from some weeks, which had been a major source of week-to-week wobble.
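A stratified draw along these lines can be sketched as follows. The field names `score_band` and `intent`, the per-stratum floor of two traces, and the rule of filling the rest of the sample uniformly from what is left are all assumptions for illustration, not a description of our pipeline.

```python
import random
from collections import defaultdict


def stratified_sample(traces, total, min_per_stratum=2, seed=0):
    """Draw `total` traces while guaranteeing each stratum a minimum presence.

    Each trace is a dict with hypothetical keys "score_band" and "intent".
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for trace in traces:
        strata[(trace["score_band"], trace["intent"])].append(trace)

    picked, leftovers = [], []
    for key in sorted(strata):
        members = strata[key][:]
        rng.shuffle(members)
        # Floor: take min_per_stratum traces, or everything a tiny stratum has.
        k = min(min_per_stratum, len(members))
        picked.extend(members[:k])
        leftovers.extend(members[k:])

    remaining = total - len(picked)
    if remaining < 0:
        raise ValueError("total is smaller than the per-stratum floor requires")
    # Fill the rest uniformly from unpicked traces, which roughly follows
    # the production mix of strata.
    picked.extend(rng.sample(leftovers, min(remaining, len(leftovers))))
    return picked
```

One caveat with any floor like this: rare strata end up over-represented relative to production, so a pooled kappa computed on the sample describes the sample mix unless the strata are reweighted.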

Report the interval, not the point. The dashboard shows the band. Nobody reacts to a movement smaller than the band. This alone has prevented at least two more pointless investigations.
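One way to encode "do not react inside the band" is a simple check of each week's estimate against the previous week's interval. The rule below is an assumption about how such a check could look, not the dashboard's actual logic, and it simplifies the interval to a symmetric half-width of 0.15 (bootstrap intervals are usually not symmetric).

```python
def outside_band(point, band):
    """True when a point estimate falls outside a (lower, upper) band."""
    lower, upper = band
    return point < lower or point > upper


def flag_weeks(points, half_width):
    """Compare each week with the previous week's band.

    Returns one flag per week-to-week step; True means the movement
    is larger than the band and may be worth a look.
    """
    flags = []
    for prev, curr in zip(points, points[1:]):
        band = (prev - half_width, prev + half_width)
        flags.append(outside_band(curr, band))
    return flags


weekly_kappa = [0.55, 0.49, 0.44]  # the three weeks from this post
print(flag_weeks(weekly_kappa, half_width=0.15))  # prints [False, False]
```

With the numbers from this post, neither week-to-week step leaves the band, which matches the conclusion that the "decline" was noise.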

dev.to

We put confidence intervals on our LLM-judge scores. The error bars ate three weeks of "trend"

Thursday, 11 June 2026
