We track weekly agreement between an LLM judge and human labels (Cohen's kappa) on a sample of production traces. For three weeks the point estimates told a story: 0.55, then 0.49, then 0.44. The team started hunting for what "broke" the judge.
Then we bootstrapped confidence intervals on each weekly number. At our sample size (50 traces a week), the 95% intervals were roughly plus or minus 0.15. All three weekly estimates sat inside one another's intervals: the whole three-week drop, 0.11, was smaller than the half-width of any single interval. The decline we had spent two days investigating was indistinguishable from noise.
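For readers who want to reproduce this kind of check, here is a minimal sketch of a percentile bootstrap around Cohen's kappa, using only the Python standard library. It is not our production code. The function names, the default of 2000 resamples, and the convention for the degenerate single-label case are all assumptions made for illustration.

```python
import random
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("need two non-empty sequences of equal length")
    # Observed agreement: fraction of items where the two raters match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: sum over labels of the product of marginal rates.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label, so kappa is 0/0.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def bootstrap_kappa_ci(judge, human, n_boot=2000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(judge)
    stats = []
    for _ in range(n_boot):
        # Resample traces with replacement, keeping each label pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(
            cohens_kappa([judge[i] for i in idx], [human[i] for i in idx])
        )
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return cohens_kappa(judge, human), lower, upper
```

Calling `bootstrap_kappa_ci(judge_labels, human_labels)` on one week's paired labels gives the point estimate plus the band to plot. The percentile method is the simplest variant; other bootstrap intervals exist and may behave better on small samples.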
What we changed
Stratified the weekly sample by score band and intent instead of sampling uniformly. Under uniform sampling, rare-but-important slices had vanished from some weeks entirely, which was a major source of week-to-week wobble; with stratification, every week includes them.
Reported the interval, not the point. The dashboard now shows the band, and nobody reacts to a movement smaller than the band. This alone has prevented at least two more pointless investigations.
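Both changes can be sketched in a few lines. The code below is an illustration under stated assumptions, not our actual pipeline: the trace fields `score_band` and `intent`, the per-stratum floor of 2, and the helper names are hypothetical. The band check is a deliberately blunt rule of thumb (compare the movement to the interval half-width), not a formal significance test.

```python
import random
from collections import defaultdict


def stratified_sample(traces, total=50, min_per_stratum=2, seed=0):
    """Sample roughly `total` traces, stratified by (score_band, intent).

    Each stratum gets a proportional quota with a floor of `min_per_stratum`
    so rare slices appear every week. Because of the floor, the sample can
    come out slightly larger than `total`.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for trace in traces:
        # Hypothetical field names; adapt to the real trace schema.
        strata[(trace["score_band"], trace["intent"])].append(trace)

    sample = []
    for key in sorted(strata):
        members = strata[key]
        share = len(members) / len(traces)
        quota = max(min_per_stratum, round(total * share))
        quota = min(quota, len(members))  # cannot draw more than exist
        sample.extend(rng.sample(members, quota))  # without replacement
    return sample


def within_band(previous, current, half_width):
    """True when the week-over-week movement is no larger than the band."""
    return abs(current - previous) <= half_width


# The three-week "decline" from this post, checked against a 0.15 band.
if within_band(0.55, 0.44, 0.15):
    print("movement is inside the band; no investigation needed")
```

One caveat on the design: a per-stratum floor oversamples rare slices relative to their true share, so an unweighted kappa over the stratified sample describes that sample, not raw production traffic. If a traffic-level number matters, reweight each stratum by its real share.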






