TL;DR: Our LLM-as-judge agreement (Cohen's kappa against human labels) swung between 0.41 and 0.63 week to week with no rubric change. Our first instinct was sample size, so we went from 50 weekly traces to 200. The variance barely moved. Then we kept the sample at 50 but stratified it by score class and a couple of known failure dimensions, and the swing shrank by more than quadrupling the sample had achieved. Composition was the lever, not volume.
The symptom: kappa that will not sit still
The judge scored production traces against a 5-point rubric. Each week we hand-labeled a calibration set and computed kappa. It bounced: 0.55, then 0.42, then 0.61. Nothing in the rubric or the judge prompt had changed. A kappa that moves 0.2 on noise is useless as an early-warning signal, because you cannot tell a real judge regression from the wobble.
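For concreteness, here is a minimal sketch of the weekly number, assuming plain unweighted Cohen's kappa (for an ordinal 5-point rubric a weighted variant is also common, and this sketch does not cover it). The function name `cohen_kappa` and the ten labels are made up for illustration; they are not real calibration data.

```python
from collections import Counter


def cohen_kappa(judge, human):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    p_e = sum(judge_counts[c] * human_counts[c] for c in judge_counts) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is 0/0.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Illustrative labels only: mostly 5s, with two disagreements on the 2s and 3s.
judge_labels = [5, 5, 5, 5, 3, 2, 5, 5, 3, 5]
human_labels = [5, 5, 5, 5, 3, 3, 5, 5, 2, 5]
kappa = cohen_kappa(judge_labels, human_labels)
print(round(kappa, 3))
```

In this toy set the two raters agree on 8 of 10 items, yet kappa lands well below 0.8, because agreement on the dominant 5s is mostly what chance would predict anyway. Two flipped labels among the rare classes are enough to move the number noticeably.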
Why adding traces did almost nothing
Random sampling pulls mostly from the majority class. For us that was clean passes, the easy 5s. Kappa is driven by agreement on the rare, ambiguous classes (the 2s and 3s), and random sampling yields those only in proportion to their base rate, so even a much larger sample contains relatively few of them. Going to 200 random traces mostly added more easy passes: more data, almost no new signal where it counts.
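The alternative named in the TL;DR, fixing the composition of the sample by stratum, can be sketched roughly as follows. This is a sketch under assumptions, not the exact pipeline: traces are taken to be dicts with a hypothetical `judge_score` field, the stratum key is the judge's own score (human labels do not exist yet at sampling time), and the pool proportions are invented.

```python
import random
from collections import defaultdict


def stratified_sample(traces, key, per_stratum, seed=0):
    """Draw up to `per_stratum` traces from each stratum defined by `key`."""
    rng = random.Random(seed)
    # Group traces by stratum so each class is sampled independently.
    buckets = defaultdict(list)
    for trace in traces:
        buckets[key(trace)].append(trace)
    sample = []
    # Sort strata so the result is reproducible for a given seed.
    for stratum in sorted(buckets):
        group = buckets[stratum]
        k = min(per_stratum, len(group))  # a rare stratum may hold fewer traces
        sample.extend(rng.sample(group, k))
    return sample


# Invented pool: 1000 traces dominated by easy 5s (hypothetical proportions).
scores = [5] * 850 + [4] * 80 + [3] * 40 + [2] * 20 + [1] * 10
pool = [{"id": i, "judge_score": s} for i, s in enumerate(scores)]

# 10 per score class gives a 50-trace calibration set with every class covered.
sample = stratified_sample(pool, key=lambda t: t["judge_score"], per_stratum=10)
print(len(sample))
```

To cover known failure dimensions as well, the key can return a tuple such as `(judge_score, failure_tag)`, where `failure_tag` is again a hypothetical field; the only requirement in this sketch is that all keys are mutually sortable.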






