My eval harness paid for itself on the first run: 0.57 to 0.96, two bugs no unit test could catch

I almost shipped a RAG pipeline that, on certain questions, cited exactly the right document — and then told the user the answer wasn't in it.

Every unit test was green. The retrieval returned the correct chunk. The API returned 200. The citation was attached to the response. By every check I had, it worked. The first run of my eval harness scored it 0.57, and that number is the only reason I found out before users did.
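To make that gap concrete, here is a minimal sketch, not the actual pipeline or harness. The response shape (`status`, `citations`, `answer`), the file name, the expected fact, and both helper functions are hypothetical, chosen only to show how component-level checks can all pass while a check on the answer text fails.

```python
# Hypothetical response shape and sample data, for illustration only.
response = {
    "status": 200,
    "citations": ["handbook.md"],
    "answer": "I could not find that in the provided documents.",
}
expected_source = "handbook.md"  # assumed: the document that holds the answer
expected_fact = "30 days"        # assumed: the fact a correct answer must state


def component_checks(resp: dict, source: str) -> bool:
    """The kind of checks a unit test makes: status code and citation presence."""
    return resp["status"] == 200 and source in resp["citations"]


def answer_check(resp: dict, fact: str) -> bool:
    """An end-to-end check: does the answer text contain the expected fact?"""
    return fact.lower() in resp["answer"].lower()


print(component_checks(response, expected_source))  # True
print(answer_check(response, expected_fact))        # False
```

A substring match is the crudest possible answer check; it stands in here for whatever scoring a real harness would use.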

This is the story of those two bugs, why no unit test I could have written would have caught them, and why I now believe an eval harness belongs in a GenAI project from day one — not "once it's stable."

What the eval harness actually does

For the RAG starter I was building, "chat with your documents," I wanted a test that exercised the thing users actually do, end to end. So the harness:
