An open source LLM eval tool with two independent quality signals

LLM-as-judge has become the dominant pattern for evaluating language model outputs. Tools like Promptfoo, Braintrust, and LangSmith all converge on the same architecture: send your prompt to your model, send the output to a different model along with a rubric, and take the second model's score as the quality signal.
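As a rough illustration of that architecture, here is a minimal sketch. It is not taken from any of the tools named above: `ModelFn`, `judge_once`, the 0 to 10 scale, and the stub models are all hypothetical, and a real setup would call an actual model API where the stubs are.

```python
from typing import Callable

# Hypothetical type: any function that maps a prompt string to a completion string.
ModelFn = Callable[[str], str]


def judge_once(prompt: str, model: ModelFn, judge: ModelFn, rubric: str) -> float:
    """Run the model under test, then ask a second model to score its output."""
    output = model(prompt)
    judge_prompt = (
        f"Rubric:\n{rubric}\n\n"
        f"Prompt:\n{prompt}\n\n"
        f"Response:\n{output}\n\n"
        "Reply with a single score from 0 to 10."
    )
    # Assumption: the judge replies with a bare number. Real judges often
    # need stricter output parsing than this.
    return float(judge(judge_prompt).strip())


# Stub models so the sketch runs without any API key.
def fake_model(prompt: str) -> str:
    return "Paris"


def fake_judge(prompt: str) -> str:
    return " 8 "


score = judge_once("Capital of France?", fake_model, fake_judge, "Reward correct answers.")
```

Everything downstream depends on that one returned number, which is the single signal discussed next.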

This works. It's also expensive (judge tokens cost real money), slow (every eval needs an extra API round trip), variance-prone (the same eval gets different scores across runs), and architecturally a bit circular (you are using an LLM to evaluate an LLM trained on overlapping data distributions). The single signal becomes a bottleneck for trust.

So I built an eval module that has two independent signals instead of one.

What the tool does

Side-by-side blind comparison. Two agents answer the same prompt. One runs raw; the other can optionally have a cognitive harness wired in as a tool call. A separate blind judge model scores both responses. It sees only the labels A and B, with no knowledge of which agent produced which response. Standard setup so far.
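The blind comparison can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's actual code: `blind_compare`, `AgentFn`, `JudgeFn`, and the stub agents are hypothetical names, and randomizing which agent gets label A is one common way to keep the judge blind that the description above does not confirm.

```python
import random
from typing import Callable, Dict

# Hypothetical types: an agent maps a prompt to a response; a judge maps
# (prompt, response_a, response_b) to a score for each label.
AgentFn = Callable[[str], str]
JudgeFn = Callable[[str, str, str], Dict[str, float]]


def blind_compare(
    prompt: str,
    raw_agent: AgentFn,
    harness_agent: AgentFn,
    judge: JudgeFn,
    rng: random.Random,
) -> Dict[str, float]:
    """Score two agents on one prompt while hiding their identity from the judge."""
    responses = {"raw": raw_agent(prompt), "harness": harness_agent(prompt)}

    # Assumption: labels are assigned at random so position carries no information.
    order = ["raw", "harness"]
    rng.shuffle(order)
    label_to_agent = {"A": order[0], "B": order[1]}

    # The judge only ever sees the text behind labels A and B.
    scores = judge(
        prompt,
        responses[label_to_agent["A"]],
        responses[label_to_agent["B"]],
    )

    # Map the labelled scores back to the agents after judging.
    return {label_to_agent[label]: scores[label] for label in ("A", "B")}


# Stub agents and a stub judge (scores by length) so the sketch runs offline.
def stub_raw(prompt: str) -> str:
    return "short"


def stub_harness(prompt: str) -> str:
    return "a longer answer"


def stub_judge(prompt: str, a: str, b: str) -> Dict[str, float]:
    return {"A": float(len(a)), "B": float(len(b))}


result = blind_compare("Explain X.", stub_raw, stub_harness, stub_judge, random.Random(0))
```

Because the label mapping is undone after scoring, the result is keyed by agent rather than by label, so it does not depend on which agent happened to be shown as A.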
