A RAG evaluator that admits what it can't judge

Fail-closed groundedness, deterministic corroborators, and a self-test — because an evaluator should be more trustworthy than the thing it grades.

The quiet flaw in "LLM-as-judge" evals

Most tools that score AI output amount to one LLM grading another, and they report every number in the same confident voice, the verified ones and the guessed ones alike. For evaluation, that's backwards. An evaluator's whole job is to be more trustworthy than the model it grades, not equally credulous.

rag-triad is a small local evaluator for retrieval-augmented answers built on one rule: lean on a deterministic check wherever one exists, and abstain — out loud — wherever one doesn't.
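To make that rule concrete, here is a minimal sketch of what "deterministic where possible, abstain otherwise" could look like. It is not rag-triad's actual code: the names `Verdict`, `check_claim`, and `is_grounded` are hypothetical, and the only deterministic check shown (every number in a claim must also appear in the retrieved context) is an assumption chosen for illustration.

```python
import re
from dataclasses import dataclass

# Matches integers and simple decimals, e.g. "512" or "3.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


@dataclass(frozen=True)
class Verdict:
    claim: str
    status: str  # "supported", "unsupported", or "abstain"
    reason: str


def check_claim(claim: str, context: str) -> Verdict:
    """Hypothetical check: judge a claim only by the numbers it contains."""
    numbers = NUMBER.findall(claim)
    if not numbers:
        # No deterministic check applies, so say so instead of guessing.
        return Verdict(claim, "abstain", "no deterministic check applies")
    context_numbers = set(NUMBER.findall(context))
    missing = [n for n in numbers if n not in context_numbers]
    if missing:
        return Verdict(claim, "unsupported", f"numbers not in context: {missing}")
    return Verdict(claim, "supported", "all numbers found in context")


def is_grounded(verdicts: list[Verdict]) -> bool:
    """Fail closed: an abstention or an empty list never counts as a pass."""
    return bool(verdicts) and all(v.status == "supported" for v in verdicts)


context = "The cache holds 512 entries and expires after 30 minutes."
claims = [
    "The cache holds 512 entries.",
    "Entries expire after 45 minutes.",
    "The cache is fast.",
]
verdicts = [check_claim(c, context) for c in claims]
for v in verdicts:
    print(f"{v.status:12} {v.claim} ({v.reason})")
print("grounded:", is_grounded(verdicts))
```

The design choice worth noticing is that "abstain" is its own status rather than a low score. A report built this way keeps the checked claims and the unjudged ones visibly separate, and the overall pass requires every claim to be verified.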

Localizing the failure, not just scoring it


