Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub
Your golden eval set was good in March. It's December now. Half the queries you see in production don't look like anything in the eval set. The dashboard still shows a 98% pass rate, and that number is a lie, because the test set you're measuring against stopped resembling the workload months ago.
