I checked six LLM-as-judge tools against human labels. The scoreboard was the wrong thing to read.

Most LLM-as-judge comparisons rank tools by which one gives you a number fastest. That is the wrong axis. A judge you have not validated against human labels is not a measurement, it is a vibe with a decimal point. So I ran six tools the way a methodologist would: not "which one scores," but "which one helps me prove the score is trustworthy."

Trust here has a specific meaning. An LLM judge inherits known failure modes: position bias (it favors the first answer it sees), verbosity bias (it rewards longer outputs), and self-preference (it scores outputs from its own model family higher). None of these show up in the score itself. They show up only when you compare the judge against a human-labeled set and compute agreement. The standard instrument for that is Cohen's kappa, not raw accuracy, because raw accuracy lies whenever your classes are imbalanced.

So the criterion I graded each tool on was simple: how much friction does it put between me and a confusion matrix against human labels?

DeepEval (G-Eval). The broadest eval breadth of the group, honestly. Chain-of-thought scoring via G-Eval, a pytest-style harness, a large catalog of metrics. It is the tool I reach for when I want coverage. What it does not do for you is the human-agreement step. You write the judge, you collect the labels, you compute kappa yourself. Reference: Liu et al., "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment" (arXiv:2303.16634), which is worth reading precisely because it measures Spearman correlation with human judgment rather than asserting it. (G-Eval is the paper's method; DeepEval is the tool that implements it.)

So the criterion I graded each tool on was simple: how much friction does it put between me and a confusion matrix against human labels?

I checked six LLM-as-judge tools against human labels. The scoreboard was the wrong thing to read.

Other newsrooms on this story

I checked six LLM-as-judge tools against human labels. The scoreboard was the wrong thing to read.

Other newsrooms on this story

Related reading

LLM-as-a-Judge: I Built One From Scratch, Then Checked It Against Humans

We put confidence intervals on our LLM-judge scores. The error bars ate three…

An open source LLM eval tool with two independent quality signals

Exploring LLM-as-a-Judge

Switching our LLM-as-judge from 5-class to binary in CI: the patterns we kept

Calibration set size for LLM-as-judge: when 50 traces is enough and when 200 is…

Related reading

LLM-as-a-Judge: I Built One From Scratch, Then Checked It Against Humans

We put confidence intervals on our LLM-judge scores. The error bars ate three…

An open source LLM eval tool with two independent quality signals

Exploring LLM-as-a-Judge

Switching our LLM-as-judge from 5-class to binary in CI: the patterns we kept

Calibration set size for LLM-as-judge: when 50 traces is enough and when 200 is…