I have hit this a few times. A PR lands late in the day, the RAG eval runs in under a minute, green check, merge. Twelve hours later support tickets start coming in.
The trace shows the retriever switched its top-1 chunk on a class of queries the 30-example dataset never covered. Suite Groundedness stayed at 0.91. Production Groundedness on the affected traffic was 0.62.
The gate passed because it was not checking the right thing. Most CI eval gates I have seen for RAG are smoke tests. Small dataset, mean compared against a fixed floor, pass unless something is badly broken.
The dataset is not representative, the floor is not tied to baseline variance, and the threshold does not separate a real regression from judge noise. So a green check does not tell you much.
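
To make the difference concrete, here is a minimal sketch of the two kinds of gate. Everything in it is an assumption for illustration: the function names, the 0.85 floor, the `z=2.0` cutoff, and the made-up scores. It is not taken from any particular eval framework. The first function is the smoke test described above. The second derives its threshold from repeated runs of the unchanged baseline, so the cutoff reflects how much the judge moves on its own.

```python
import statistics


def fixed_floor_gate(scores, floor=0.85):
    """Smoke-test gate: pass when the suite mean clears a fixed floor.

    The 0.85 floor is a hypothetical value, not tied to anything measured.
    """
    return statistics.mean(scores) >= floor


def noise_aware_gate(baseline_runs, candidate_scores, z=2.0):
    """Pass unless the candidate mean drops below the baseline mean by more
    than z standard deviations of run-to-run baseline noise.

    baseline_runs: per-example scores from several runs of the *unchanged*
    baseline on the same dataset (at least two runs are required).
    z=2.0 is an assumed cutoff; tune it to the false-alarm rate you can live with.
    """
    run_means = [statistics.mean(run) for run in baseline_runs]
    center = statistics.mean(run_means)
    noise = statistics.stdev(run_means)  # spread caused by judge/sampling noise
    return statistics.mean(candidate_scores) >= center - z * noise


# Made-up numbers: three baseline runs around 0.91, and a candidate at 0.87.
baseline_runs = [[0.90] * 30, [0.91] * 30, [0.92] * 30]
candidate = [0.87] * 30

print("fixed floor passes:", fixed_floor_gate(candidate))
print("noise-aware passes:", noise_aware_gate(baseline_runs, candidate))
```

With these illustrative numbers the fixed floor lets the candidate through, while the noise-aware check flags it as a drop larger than the baseline's own run-to-run spread. Neither version helps with the first problem: if the dataset does not cover the affected queries, both gates stay green.
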
The way I think about it now: a gate wants three things at once (cheap, fast, and statistically significant), and you usually get two.









