Fail-closed groundedness, deterministic corroborators, and a self-test — because an evaluator should be more trustworthy than the thing it grades.
The quiet flaw in "LLM-as-judge" evals
Most tools that score AI output are an LLM grading an LLM, and they report every number in the same confident voice — the verified ones and the guessed ones alike. For evaluation that's backwards. An evaluator's whole job is to be more trustworthy than the model it grades, not equally credulous.
rag-triad is a small local evaluator for retrieval-augmented answers built on one rule: lean on a deterministic check wherever one exists, and abstain — out loud — wherever one doesn't.
Localizing the failure, not just scoring it







