More eval traces will not stabilize your kappa. Stratify the ones you have

TL;DR: Our LLM-as-judge agreement (Cohen's kappa against human labels) swung between 0.41 and 0.63 from week to week with no rubric change. Our first instinct was sample size, so we went from 50 weekly traces to 200. The variance barely moved. Then we stratified the 50 we already had by score class and a couple of known failure dimensions, and the swing shrank more than it had when we quadrupled the sample. Composition was the lever, not volume.

The symptom: kappa that will not sit still

The judge scored production traces against a 5-point rubric. Each week we hand-labeled a calibration set and computed kappa. It bounced: 0.55, then 0.42, then 0.61. Nothing in the rubric or the judge prompt had changed. A kappa that moves 0.2 on noise is useless as an early-warning signal, because you cannot tell a real judge regression from the wobble.
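For reference, here is a minimal sketch of the weekly calculation, assuming plain unweighted Cohen's kappa over two parallel lists of rubric scores. The function name and the example labels are made up for illustration; on an ordinal 5-point rubric a weighted kappa is also a reasonable choice, and the post does not say which variant was used.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)

    # Observed agreement: share of traces where both raters gave the same score.
    p_o = sum(h == j for h, j in zip(human, judge)) / n

    # Chance agreement: for each class, multiply the two raters' marginal
    # frequencies, then sum over classes.
    h_counts, j_counts = Counter(human), Counter(judge)
    p_e = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout: kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical calibration set of 10 traces (illustrative labels, not real data).
human_labels = [5, 5, 5, 5, 4, 3, 2, 5, 5, 3]
judge_labels = [5, 5, 5, 5, 4, 2, 2, 5, 5, 5]
print(round(cohen_kappa(human_labels, judge_labels), 3))  # 0.636
```

In this toy set the two disagreements both involve a 3. Flip either one to an agreement and kappa jumps to about 0.82, which is the kind of sensitivity to a few rare-class traces that makes a small random set wobble.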

Why adding traces did almost nothing

Random sampling pulls mostly from the majority class. For us that was clean passes, the easy 5s. Kappa is driven by agreement on the rare, ambiguous classes (the 2s and 3s), and random sampling yields those only at their base rate: a bigger sample adds a few more, but their share of the set stays as small as ever. So the 200 random traces were mostly more easy passes: more data, almost no new signal where it counts.
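One way to fix the composition is an equal quota per stratum. The sketch below is an illustration under stated assumptions, not our exact pipeline: the field name `judge_score`, the equal-quota rule, and the top-up step are all hypothetical, and it stratifies on the judge's score because that is the only score available before hand-labeling.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(traces, key, total, seed=None):
    """Draw up to `total` traces, taking an equal quota from every stratum.

    `key` maps a trace to its stratum, e.g. a score class or a
    (score class, failure tag) tuple. Stratum keys must be sortable.
    """
    if not traces:
        return []
    rng = random.Random(seed)

    # Bucket the pool by stratum.
    strata = defaultdict(list)
    for trace in traces:
        strata[key(trace)].append(trace)

    quota = total // len(strata)
    picked, leftovers = [], []
    for name in sorted(strata):
        items = list(strata[name])
        rng.shuffle(items)
        picked.extend(items[:quota])      # fixed quota for this stratum
        leftovers.extend(items[quota:])   # spare traces for topping up

    # If some stratum is smaller than its quota, or `total` does not divide
    # evenly, top up at random from what is left in the other strata.
    rng.shuffle(leftovers)
    picked.extend(leftovers[: total - len(picked)])
    return picked


# Hypothetical pool: 900 easy 5s and a thin tail of lower scores.
pool = [{"id": i, "judge_score": s}
        for i, s in enumerate([5] * 900 + [4] * 40 + [3] * 30 + [2] * 20 + [1] * 10)]
weekly = stratified_sample(pool, key=lambda t: t["judge_score"], total=50, seed=0)
print(sorted(Counter(t["judge_score"] for t in weekly).items()))
# [(1, 10), (2, 10), (3, 10), (4, 10), (5, 10)]
```

A random draw of 50 from the same pool would expect about one 2 and one or two 3s. Here every class contributes ten traces every week, so the set's composition no longer changes from one calibration run to the next. To stratify on a failure dimension as well, return a tuple from `key`.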

