Introduction

In the previous RAG implementation, we built a working system, but the only way to check whether an answer was actually correct was to read it ourselves.

[Before] Manual verification

Ask "How do you calculate F1 score?" → check the answer by eye

[Now] Evals