A Better LLM Judge? The Rubric Made My Small Model Worse

Part 3 of an eval series. I tried to fix a 43%-agreement LLM judge two ways — a bigger model (DeepSeek & Qwen via OpenRouter) and a real anchored rubric — in a 2x2 against human votes. The rubric helped the big model and HURT the small one. A good rubric needs a model capable of following it.

Monday, 29 June 2026

In Part 2 I built the laziest possible LLM judge — a tiny model (Qwen2.5-1.5B) and a one-line rubric — and it agreed with human votes only ~43% of the time, crammed every score into a 7–8 band, and tied a third of the comparisons humans had no trouble separating.

Two things were wrong with that judge, and people usually fix only one:

The model was too small.

The rubric told it almost nothing.

I fixed each independently and measured the effect. The result wasn't the tidy "write a better rubric, it's free" story I expected — it was more interesting than that.
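For readers who want to reproduce numbers like the ~43% agreement and the tie rate, here is a minimal sketch of how they could be computed from pairwise judge scores and human votes. This is not the code from the series: the `Comparison` type, the field names, and the toy data are all hypothetical, and it assumes a judge tie counts as a disagreement with the human vote, which the post does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    """One pairwise comparison (hypothetical structure, for illustration only)."""
    score_a: int        # judge's score for response A
    score_b: int        # judge's score for response B
    human_winner: str   # "A" or "B", the side humans preferred


def judge_verdict(c: Comparison) -> str:
    """Turn two judge scores into a verdict: "A", "B", or "tie"."""
    if c.score_a > c.score_b:
        return "A"
    if c.score_b > c.score_a:
        return "B"
    return "tie"


def agreement_and_tie_rate(comparisons: list[Comparison]) -> tuple[float, float]:
    """Return (fraction of verdicts matching humans, fraction of judge ties).

    Assumption: a tie never matches a human vote, so it counts against agreement.
    """
    if not comparisons:
        raise ValueError("need at least one comparison")
    verdicts = [judge_verdict(c) for c in comparisons]
    agree = sum(v == c.human_winner for v, c in zip(verdicts, comparisons))
    ties = sum(v == "tie" for v in verdicts)
    n = len(comparisons)
    return agree / n, ties / n


# Toy data, made up for this example (not results from the series).
toy = [
    Comparison(8, 7, "A"),  # judge agrees with humans
    Comparison(7, 7, "B"),  # judge ties, humans did not
    Comparison(7, 8, "A"),  # judge disagrees
    Comparison(8, 7, "A"),  # judge agrees
]
agreement, tie_rate = agreement_and_tie_rate(toy)
print(f"agreement={agreement:.2f} tie_rate={tie_rate:.2f}")
```

Running the same function once per cell of the 2x2 (small or big model, one-line or anchored rubric) would give four agreement figures to compare side by side. If ties were scored differently, for example as half credit, the agreement number would shift, so the convention is worth stating alongside any result.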

Related reading

LLM-as-a-Judge: I Built One From Scratch, Then Checked It Against Humans

An open source LLM eval tool with two independent quality signals

Who Grades the Grader? Your LLM Judge Is an Unvalidated Model in Production

Exploring LLM-as-a-Judge

Let's talk about LLM evaluation

I Built an Adversarial Eval Framework and Attacked 5 LLMs — Every Single One…
