I'm an automation tester. Usually my job is simple: the same input should give the same output, every time. Language models don't work that way. Ask the same question twice and you can get two different answers, and both can be right.

A RAG (retrieval-augmented generation) system makes it harder still. It searches your own documents and has a model write the answer from what it finds: think chat-with-your-PDF, or a support bot answering from a company's help pages. So a wrong answer has two possible causes: the search picked the wrong page, or it picked the right page and the model still got it wrong. To the user these look the same, but they're different problems with different fixes. If your tests can't tell them apart, you don't know which half to fix.

So I built a small RAG system and a test suite designed to tell the two apart.
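
To make the split concrete, here is a minimal sketch of the idea in Python. It is not the code from the repo: `RagResult`, `classify_failure`, and the file names are hypothetical names I made up for illustration, and the substring match is a crude stand-in for a real answer check. The point is only the order of the checks: look at what the search returned first, and judge the answer only if the right document was there.

```python
from dataclasses import dataclass


@dataclass
class RagResult:
    """Hypothetical record of one RAG call: what was retrieved, what was answered."""
    retrieved_ids: list[str]
    answer: str


def classify_failure(result: RagResult, expected_doc_id: str, expected_fact: str) -> str:
    """Return 'ok', 'retrieval', or 'generation' for a single test case."""
    # Step 1: did the search return the document that holds the answer?
    if expected_doc_id not in result.retrieved_ids:
        return "retrieval"
    # Step 2: the right document was found, so any miss is on the model.
    # A case-insensitive substring match is an assumption, not a real grader.
    if expected_fact.lower() not in result.answer.lower():
        return "generation"
    return "ok"


# Same wrong answer, two different causes.
wrong_page = RagResult(["pricing.md"], "Refunds take 30 days.")
right_page_wrong_answer = RagResult(["refunds.md"], "Refunds take 30 days.")

print(classify_failure(wrong_page, "refunds.md", "14 days"))
print(classify_failure(right_page_wrong_answer, "refunds.md", "14 days"))
```

Both cases give the user the same wrong sentence, but the first is labelled a retrieval failure and the second a generation failure, which is exactly the distinction a plain "is the answer right?" test loses.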

Repo: https://github.com/sbezjak/llm-rag

What it is