Switching our LLM-as-judge from 5-class to binary in CI: the patterns we kept


Wednesday, June 3, 2026


A few months back, our LLM-as-judge scored helpfulness on a 1-to-5 scale. The CI gate stayed green because it only checked the average of that score. Spot-checking against human labels put Cohen's kappa at 0.47. The problem was the rubric, not the tooling: the same labellers, re-rating against per-criterion binary checks, reached a kappa of 0.78. The CI pipeline then had to learn the new shape. This post covers the engineering work that came after the methodology decision.
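For readers who want to reproduce the agreement check, here is a minimal sketch of Cohen's kappa for two raters in plain Python. The function name `cohens_kappa` and the `judge` and `human` label lists are made up for illustration; they are not our data and not our actual harness.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters' marginal rates.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail labels for one binary criterion (1 = pass, 0 = fail).
judge = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
human = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]

kappa = cohens_kappa(judge, human)
print(f"kappa = {kappa:.2f}")
```

The same function works on 1-to-5 labels, since it only compares labels for equality. That also means it treats a 4-versus-5 disagreement the same as a 1-versus-5 one; a weighted kappa would not.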

Not a war story. Pattern share.

What changed in our Promptfoo config

# Before: single 5-class assertion

assertions:


Related reading

LLM-as-Judge Shouldn't Aggregate Scores: Binary Checks as Evidence, One…

I checked six LLM-as-judge tools against human labels. The scoreboard was the…

We put confidence intervals on our LLM-judge scores. The error bars ate three…

More eval traces will not stabilize your kappa. Stratify the ones you have

Your LLM-as-judge disagrees with itself between runs

LLM-as-a-Judge: I Built One From Scratch, Then Checked It Against Humans
