Calibration set size for LLM-as-judge: when 50 traces is enough and when 200 is mandatory

TL;DR. The human-labeled calibration set you use to validate an LLM-as-judge does not need one fixed size; the right size depends on how balanced your labels are. For roughly balanced binary criteria with no heavy tail, 50 stratified traces will usually pin Cohen's kappa to within a tolerable band (in my runs, a 95 percent bootstrap interval on the order of plus or minus 0.10 to 0.15). The moment you have a rare-but-expensive category, say a safety violation that shows up in 6 percent of traces, 50 is not enough: at that rate a random draw of 50 traces contains about 3 violations on average, and the variance of kappa is dominated by the count of minority-class examples, not the total. Plan for 200 or more. Below I cover the kappa formula and why it is sensitive to the marginal distribution, the sample-size intuition, Wilson confidence intervals for small-n per-class precision, and the stratified-sampling routine that keeps marginals stable week to week. Pasteable Python at the end.

1. Kappa, and why the marginal distribution is doing more work than you think

Cohen's kappa (Cohen, 1960) measures agreement between two raters corrected for the agreement you would expect by chance. Here the two raters are your human labeler and your LLM-as-judge. The formula is kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement computed from the marginals. For a binary label, if the human marks "pass" with probability a and the judge with probability b, then p_e = a*b + (1-a)*(1-b).
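
To make the dependence on the marginals concrete, here is a minimal sketch of the binary formula above in plain Python. The function names (cohen_kappa_binary, labels_from_counts) and the two 100-trace confusion tables are illustrative assumptions, not data from my runs. Both tables have the same observed agreement, 90 percent, but one is balanced and the other has a 6 percent minority class.

```python
from typing import List, Sequence, Tuple


def cohen_kappa_binary(human: Sequence[int], judge: Sequence[int]) -> float:
    """Cohen's kappa for two binary raters (1 = pass, 0 = fail)."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(human)
    # p_o: fraction of traces where the two raters give the same label.
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # a, b: marginal pass rates of the human and the judge.
    a = sum(human) / n
    b = sum(judge) / n
    # p_e: agreement expected by chance given those marginals.
    p_e = a * b + (1 - a) * (1 - b)
    if p_e == 1.0:
        # Both raters used a single, identical label: kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


def labels_from_counts(
    both_pass: int, both_fail: int, human_pass_only: int, judge_pass_only: int
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: expand a 2x2 confusion table into label lists."""
    human = [1] * both_pass + [0] * both_fail + [1] * human_pass_only + [0] * judge_pass_only
    judge = [1] * both_pass + [0] * both_fail + [0] * human_pass_only + [1] * judge_pass_only
    return human, judge


# Illustrative tables, 100 traces each, 90 agreements in both.
# Balanced: human pass rate 0.50, judge pass rate 0.50.
kappa_balanced = cohen_kappa_binary(*labels_from_counts(45, 45, 5, 5))
# Skewed: human pass rate 0.94 (6 percent fail), judge pass rate 0.94.
kappa_skewed = cohen_kappa_binary(*labels_from_counts(89, 1, 5, 5))

print(f"balanced marginals: kappa = {kappa_balanced:.3f}")
print(f"skewed marginals:   kappa = {kappa_skewed:.3f}")
```

With balanced marginals p_e is 0.5 and kappa comes out at about 0.80. With the skewed marginals p_e rises to about 0.89, so the same 90 percent raw agreement yields a kappa of only about 0.11. In the skewed table the judge caught just one of the six minority-class traces, which is why a handful of rare examples can move the statistic so much.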
