Everyone ships the RAG system. Almost nobody ships the eval system that tells them when the RAG system starts lying.
You updated the embedding model. Tweaked the system prompt. Swapped the re-ranker. Metrics look fine. Three weeks later, support tickets arrive — the system is drawing inferences the source documents never made. No alarm fired. No test failed. The system drifted silently.
This is not a model quality problem. It is an evaluation infrastructure problem.
The Four Metrics That Matter
Faithfulness — of the claims in the response, what fraction are directly supported by the retrieved context? Your primary hallucination guard. Does not require ground truth.






