One Triage Pass, Every Trace Format: Stop Letting Fragmentation Shrink Your Eval Coverage

Your agent traces are scattered across OpenClaw, LangSmith, OpenTelemetry, and homegrown recorders - and that fragmentation quietly shrinks your eval coverage to one silo. Here is the adapter layer that reads them all into one deterministic triage pass, plus the moment the tool caught a bug in my own gate.

Thursday, July 2, 2026

Your agent traces are scattered across four incompatible formats, and that fragmentation is quietly the reason your evals don't cover production. You run OpenClaw in one service, someone bolted LangSmith onto the Python side, the platform team standardized on OpenTelemetry, and your homegrown recorder writes its own JSON. Four shapes. Four schemas. Zero shared triage. So when you finally sit down to find the production runs worth turning into eval cases, you either write four parsers or — far more likely — you look at one source and call it a day.

I just built the adapter layer that makes that a non-problem, and the exercise taught me something about honest tooling I want to show you, bug and all.

The premise: your eval set should come from production, not imagination

I've argued before that the hardest part of agent evaluation isn't the scorer, it's the corpus — that a rigorous judge over twelve hand-invented cases is grading fiction. The only honest source of eval cases is the traffic you actually serve. Your users run a free, adversarial fuzzing campaign against your agent every day; the job is to capture the runs that broke and promote them into permanent regression cases.

But there's a step zero nobody talks about: before you can promote a trace, you have to be able to read it. And "read it" is where the fragmentation tax hits. A trace store is only useful if the thing that grades runs can ingest whatever recorded them. Otherwise your beautiful trace archive is four silos, and your eval coverage quietly collapses to whichever silo was easiest to parse.
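To make the "ingest whatever recorded them" idea concrete, here is a minimal sketch of what such an adapter layer could look like, not the actual implementation. Everything in it is an assumption for illustration: the `Run` shape, the source names `otel_like` and `homegrown`, and every raw field name (`trace_id`, `status`, `spans`, `id`, `error`, `steps`) are hypothetical and do not reflect the real schema of any recorder. The point is only the structure: one small function per source, one shared shape, and one triage pass with a fixed sort order so the same runs always come out in the same order.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Run:
    """Common shape every adapter produces (hypothetical schema)."""
    source: str
    run_id: str
    failed: bool
    tool_calls: int


# One adapter per recorder. The raw field names are illustrative
# assumptions, not the real schemas of any tracing product.
def from_otel_like(rec: Dict[str, Any]) -> Run:
    return Run(
        source="otel_like",
        run_id=rec["trace_id"],
        failed=rec.get("status") == "ERROR",
        tool_calls=len(rec.get("spans", [])),
    )


def from_homegrown(rec: Dict[str, Any]) -> Run:
    return Run(
        source="homegrown",
        run_id=rec["id"],
        failed=bool(rec.get("error")),
        tool_calls=len(rec.get("steps", [])),
    )


ADAPTERS: Dict[str, Callable[[Dict[str, Any]], Run]] = {
    "otel_like": from_otel_like,
    "homegrown": from_homegrown,
}


def normalize(source: str, rec: Dict[str, Any]) -> Run:
    """Route a raw record to its adapter; fail loudly on unknown sources."""
    try:
        adapter = ADAPTERS[source]
    except KeyError:
        raise ValueError(f"no adapter registered for source {source!r}")
    return adapter(rec)


def triage(runs: Iterable[Run]) -> List[Run]:
    """Keep failed runs, ordered by a total key so output is deterministic."""
    failed = [r for r in runs if r.failed]
    # Most tool calls first; source and run_id break ties reproducibly.
    return sorted(failed, key=lambda r: (-r.tool_calls, r.source, r.run_id))
```

Supporting a new recorder under this design means writing one more adapter and registering it in `ADAPTERS`; the triage function never changes. Raising on an unknown source, rather than skipping it, keeps a silo from silently dropping out of coverage.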

Related reading

tracesage: See Inside Your LangGraph Agents

Distributed Tracing 101: The Mental Model, the Standards, and Your First…

I Turned on Agent Tracing for 30 Days. 4 Hidden Bottlenecks Were Eating 47% of…

Multi-agent runs need a handoff receipt, not just a shared trace

Distributed Tracing for LLM Agents: When MCP Makes Tool Calls Observable

cctrace: a local profiler for AI coding agents (Claude Code / Codex)
