A few months back our LLM-as-judge ran on a 1-to-5 helpfulness scale. The CI gate stayed green because we were averaging that score. Spot-checking against humans put Cohen's kappa at 0.47. The rubric was the problem, not the tooling. Same labellers re-rating on per-criterion binary got to 0.78. The CI pipeline had to learn the new shape. This post is the engineering work that came after the methodology decision.

Not a war story. Pattern share.

What changed in our Promptfoo config

# Before: single 5-class assertion

assertions: