Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team

Me: xgabriel.com | GitHub

Your golden eval set was good in March. It's December now. Half the queries you see in production don't look like anything in the eval set. The dashboard still shows a 98% pass rate, and that number is a lie, because the test set you're measuring against stopped resembling the workload months ago.
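One rough way to put a number on that gap is to ask what share of production queries has a close neighbor in the eval set. The sketch below is a minimal illustration, not a recommended method: it uses token-level Jaccard similarity as a cheap stand-in for whatever similarity measure you trust (embeddings would be the usual choice), and the function names, the sample queries, and the 0.5 threshold are all assumptions made up for this example.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase a query and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two token sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def eval_coverage(
    eval_queries: list[str],
    prod_queries: list[str],
    threshold: float = 0.5,  # assumed cutoff; tune it on your own data
) -> float:
    """Fraction of production queries whose closest eval query
    reaches at least `threshold` Jaccard similarity."""
    if not eval_queries or not prod_queries:
        return 0.0
    eval_token_sets = [tokens(q) for q in eval_queries]
    covered = 0
    for query in prod_queries:
        query_tokens = tokens(query)
        best = max(jaccard(query_tokens, e) for e in eval_token_sets)
        if best >= threshold:
            covered += 1
    return covered / len(prod_queries)


if __name__ == "__main__":
    # Hypothetical data: an eval set written in March, traffic from December.
    eval_set = ["reset my password", "cancel my subscription"]
    prod_sample = [
        "how do I reset my password",
        "cancel subscription",
        "summarize this pdf contract",
        "translate invoice to german",
    ]
    print(f"{eval_coverage(eval_set, prod_sample):.0%}")  # prints 50%
```

A coverage figure like this sits next to the pass rate rather than replacing it: a 98% pass rate over an eval set that covers half of the sampled traffic says nothing about the other half.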