TL;DR. The human-labeled calibration set you use to validate an LLM-as-judge does not need a fixed size. It needs a size that depends on how balanced your labels are. For roughly balanced binary criteria with no heavy tail, 50 stratified traces will usually pin Cohen's kappa to within a tolerable band (in my runs, a 95 percent bootstrap interval on the order of plus or minus 0.10 to 0.15). The moment you have a rare-but-expensive category, say a safety violation that shows up in 6 percent of traces, 50 is not enough: sampled at that rate, 50 traces contain only about 3 violations. Plan for 200 or more, because the variance of kappa is dominated by the count of minority-class examples, not the total. Below I cover the kappa formula and why it is sensitive to the marginal distribution, the sample-size intuition, Wilson confidence intervals for small-n per-class precision, and the stratified-sampling routine that keeps marginals stable week to week. Pasteable Python at the end.

1. Kappa, and why the marginal distribution is doing more work than you think

Cohen's kappa (Cohen, 1960) measures agreement between two raters corrected for the agreement you would expect by chance. Here the two raters are your human labeler and your LLM-as-judge. The formula is kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement computed from the marginals. For a binary label, if the human marks "pass" with probability a and the judge with probability b, then p_e = a*b + (1-a)*(1-b).
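
To make the dependence on the marginals concrete, here is a minimal sketch of the binary formula above in plain Python. The function names (cohen_kappa_binary, labels_from_counts) and the confusion counts are mine, invented for illustration; they are not measurements from a real judge. Both scenarios have the same observed agreement, p_o = 0.90, and differ only in how skewed the "pass" rate is.

```python
def cohen_kappa_binary(human, judge):
    """Cohen's kappa for two equal-length sequences of 0/1 labels (1 = pass)."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(human)
    # p_o: fraction of traces where the two raters agree
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # a, b: marginal "pass" rates for the human and the judge
    a = sum(human) / n
    b = sum(judge) / n
    # p_e: chance agreement implied by the marginals
    p_e = a * b + (1 - a) * (1 - b)
    if p_e == 1.0:
        # Both raters gave one identical constant label; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


def labels_from_counts(both_pass, both_fail, human_only_pass, judge_only_pass):
    """Hypothetical helper: expand 2x2 confusion counts into two label lists."""
    human = [1] * both_pass + [0] * both_fail + [1] * human_only_pass + [0] * judge_only_pass
    judge = [1] * both_pass + [0] * both_fail + [0] * human_only_pass + [1] * judge_only_pass
    return human, judge


# Illustrative counts only: 100 traces, 90 agreements in both scenarios.
balanced = cohen_kappa_binary(*labels_from_counts(45, 45, 5, 5))  # a = b = 0.50
skewed = cohen_kappa_binary(*labels_from_counts(88, 2, 5, 5))     # a = b = 0.93

print(f"balanced: {balanced:.2f}")  # balanced: 0.80
print(f"skewed:   {skewed:.2f}")    # skewed:   0.23
```

With identical raw agreement, kappa drops from 0.80 to about 0.23 because p_e rises from 0.50 to roughly 0.87 once both raters say "pass" 93 percent of the time. The denominator 1 - p_e shrinks, so each disagreement on the minority class costs far more.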