Introduction
In the previous RAG implementation, we built a working system, but the only way to check whether an answer was actually correct was to read it ourselves.
[Before] Manual verification
Ask "How do you calculate F1 score?" → check the answer by eye
[Now] Evals
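As a rough illustration of the contrast, the by-eye check could be replaced with a small scripted one. The sketch below is only one possible shape: `EvalCase`, `answer_question`, `keyword_score`, `run_evals`, and the expected keywords are all hypothetical names and values chosen for this example, and `answer_question` is a canned stand-in rather than the RAG system from the previous implementation.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    expected_keywords: list[str]  # assumed: terms a correct answer should mention


def answer_question(question: str) -> str:
    # Hypothetical stand-in for the RAG pipeline; returns a canned answer.
    return "F1 is the harmonic mean of precision and recall: 2 * P * R / (P + R)."


def keyword_score(answer: str, expected_keywords: list[str]) -> float:
    """Return the fraction of expected keywords found in the answer (case-insensitive)."""
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)


def run_evals(cases: list[EvalCase], threshold: float = 1.0) -> list[tuple[str, float, bool]]:
    """Score every case and mark it as passed when the score reaches the threshold."""
    results = []
    for case in cases:
        score = keyword_score(answer_question(case.question), case.expected_keywords)
        results.append((case.question, score, score >= threshold))
    return results


cases = [
    EvalCase(
        question="How do you calculate F1 score?",
        expected_keywords=["precision", "recall", "harmonic mean"],
    )
]

for question, score, passed in run_evals(cases):
    print(f"{'PASS' if passed else 'FAIL'} {score:.2f} {question}")
```

Keyword matching is a deliberately crude criterion: it only checks that the expected terms appear, not that the explanation is right. Its value here is that the same question can be re-checked automatically every time the system changes.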






