How I set up RAG evals in CI/CD so they actually catch regressions

I have hit this a few times. A PR lands late in the day, the RAG eval runs in under a minute, green check, merge. Twelve hours later support tickets start coming in.

The trace shows the retriever switched its top-1 chunk on a class of queries the 30-example dataset never covered. Suite Groundedness stayed at 0.91. Production Groundedness on the affected traffic was 0.62.

The gate passed because it was not checking the right thing. Most CI eval gates I have seen for RAG are smoke tests. Small dataset, mean compared against a fixed floor, pass unless something is badly broken.

The dataset is not representative, the floor is not tied to baseline variance, and the threshold does not separate a real regression from judge noise. So a green check does not tell you much.
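
To make the noise point concrete, here is a minimal sketch of a gate that compares per-example scores against a baseline run instead of comparing a mean against a fixed floor. It is an illustration under assumptions, not necessarily the gate this post ends up with: the function name `paired_bootstrap_gate`, its parameters, and the choice of a paired bootstrap are all hypothetical, and it assumes you have stored per-example scores for the same examples from the baseline and from the PR.

```python
import random
import statistics


def paired_bootstrap_gate(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Hypothetical gate: fail only when the drop is larger than resampling noise.

    baseline, candidate: per-example scores for the SAME examples, same order.
    Returns (fail, mean_delta, upper_bound), where upper_bound is the one-sided
    (1 - alpha) bootstrap bound on the mean of candidate - baseline.
    """
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("need two non-empty score lists of equal length")

    # Work on paired differences so per-example difficulty cancels out.
    deltas = [c - b for b, c in zip(baseline, candidate)]
    n = len(deltas)
    rng = random.Random(seed)  # fixed seed keeps the CI result reproducible

    # Resample the differences with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )
    upper = means[min(n_resamples - 1, int((1 - alpha) * n_resamples))]

    # Fail only if even the optimistic end of the interval is below zero.
    return upper < 0.0, statistics.fmean(deltas), upper
```

A gate like this passes when the differences are indistinguishable from zero and fails when the whole interval sits below zero. It does nothing about a dataset that misses a class of queries; that still has to be fixed in the dataset itself.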

The way I think about it now: a gate wants to be three things at once (cheap, fast, and statistically significant), and you usually get two.
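
A rough way to see why significance costs money and time is to estimate how many examples a gate needs before it can detect a given drop. The sketch below uses the standard normal-approximation sample-size formula for a one-sided test on the mean paired difference. The function name `required_examples` and the default `alpha` and `power` values are assumptions for illustration, and the standard deviation of per-example score differences has to be measured on your own baseline runs.

```python
import math
from statistics import NormalDist


def required_examples(delta_sd, min_drop, alpha=0.05, power=0.8):
    """Approximate number of paired examples needed to detect a mean drop.

    delta_sd: standard deviation of per-example score differences
              (candidate - baseline); an input you must measure yourself.
    min_drop: smallest mean drop the gate should reliably catch.
    alpha:    one-sided false-alarm rate; power: chance of catching the drop.
    """
    if delta_sd <= 0 or min_drop <= 0:
        raise ValueError("delta_sd and min_drop must be positive")

    std_normal = NormalDist()
    # n = ((z_{1-alpha} + z_{power}) * sd / drop)^2, rounded up
    z_total = std_normal.inv_cdf(1 - alpha) + std_normal.inv_cdf(power)
    return math.ceil((z_total * delta_sd / min_drop) ** 2)
```

With an assumed standard deviation of 0.2 and a target drop of 0.03, this comes out at a few hundred examples, far more than a 30-example suite. Every one of those examples is another retrieval, generation, and judge call on each PR, which is where cheap and fast start to give.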
