Your agent traces are scattered across four incompatible formats, and that fragmentation is quietly the reason your evals don't cover production. You run OpenClaw in one service, someone bolted LangSmith onto the Python side, the platform team standardized on OpenTelemetry, and your homegrown recorder writes its own JSON. Four shapes. Four schemas. Zero shared triage. So when you finally sit down to find the production runs worth turning into eval cases, you either write four parsers or, far more likely, look at one source and call it a day.

I just built the adapter layer that makes that a non-problem, and the exercise taught me something about honest tooling I want to show you, bug and all.

The premise: your eval set should come from production, not imagination

I've argued before that the hardest part of agent evaluation isn't the scorer, it's the corpus: a rigorous judge over twelve hand-invented cases is grading fiction. The only honest source of eval cases is the traffic you actually serve. Your users run a free, adversarial fuzzing campaign against your agent every day; the job is to capture the runs that broke and promote them into permanent regression cases.

But there's a step zero nobody talks about: before you can promote a trace, you have to be able to read it. And "read it" is where the fragmentation tax hits. A trace store is only useful if the thing that grades runs can ingest whatever recorded them. Otherwise your beautiful trace archive is four silos, and your eval coverage collapses to whichever silo was easiest to parse.
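To make "ingest whatever recorded them" concrete, here is a minimal sketch of the general idea: one small adapter per source, each mapping its payload into a single shape the grader reads. Everything in it is an assumption for illustration. `NormalizedRun`, the two adapter functions, and the payload field names (`prompt`, `answer`, `attributes`, and so on) are hypothetical; they are not the real LangSmith or OpenTelemetry schemas, and this is not a claim about how any particular adapter layer is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class NormalizedRun:
    """The one shape the grader reads, regardless of what recorded the run."""
    source: str
    run_id: str
    user_input: str
    final_output: str
    failed: bool


# Hypothetical payload shapes: these field names are invented for the sketch
# and do NOT reflect any real tracing product's schema.
def from_homegrown(raw: Dict[str, Any]) -> NormalizedRun:
    return NormalizedRun(
        source="homegrown",
        run_id=str(raw["id"]),
        user_input=raw["prompt"],
        final_output=raw["answer"],
        failed=raw.get("status") != "ok",
    )


def from_span_export(raw: Dict[str, Any]) -> NormalizedRun:
    attrs = raw["attributes"]
    return NormalizedRun(
        source="span_export",
        run_id=str(raw["trace_id"]),
        user_input=attrs["input"],
        final_output=attrs["output"],
        failed=bool(attrs.get("error", False)),
    )


# Registry: supporting a new recorder means adding one function and one entry.
ADAPTERS: Dict[str, Callable[[Dict[str, Any]], NormalizedRun]] = {
    "homegrown": from_homegrown,
    "span_export": from_span_export,
}


def normalize(source: str, raw: Dict[str, Any]) -> NormalizedRun:
    """Dispatch a raw record to its adapter; fail loudly on unknown sources."""
    try:
        adapter = ADAPTERS[source]
    except KeyError:
        raise ValueError(f"no adapter registered for source {source!r}") from None
    return adapter(raw)


def failed_runs(records: Iterable[Tuple[str, Dict[str, Any]]]) -> List[NormalizedRun]:
    """Triage across every source at once: keep only the runs that failed."""
    normalized = (normalize(source, raw) for source, raw in records)
    return [run for run in normalized if run.failed]
```

The design choice worth noticing is that triage (`failed_runs` here) only ever sees `NormalizedRun`. An unknown source raises instead of being skipped, so a silo you forgot to wire up shows up as an error rather than as a silent hole in coverage.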