Key Takeaways: You can't unit-test a coach agent the way you test a pure function; the output is...

Part 2 of an eval series. A 15-line LLM judge, scored against real Chatbot Arena human votes. It agreed with people on just 43% of pairs, tied a third of them, parked every score…

Fail-closed groundedness, deterministic corroborators, and a self-test — because an evaluator should...