Coding agents like Claude Code, Cursor, and Codex CLI handle the coding parts of building an AI application well. The harder work comes after: understanding why a response went wrong, building eval sets that reflect real production behavior, and keeping pace with an application that changes faster than any one-off script can track. Teams spend 60–80% of their time on evaluation and error analysis, and much of that work needs to be redone every time the stack shifts.
Datadog Agent Observability already captures the telemetry needed to answer those questions: it traces every prompt and response and runs online evaluations over them. To make that telemetry usable from inside your coding agent, we’ve built two foundations. The Agent Observability toolset in the Datadog MCP Server gives agents structured access to Agent Observability data. The Pup CLI provides a command-line interface into much of Datadog’s API surface. On top of these foundations, we’re shipping a set of Agent Skills that package common AI engineering tasks into single commands. Drop them into your agent’s skills directory, and your coding agent can classify sessions, debug production failures, and evaluate new versions of your application against real traffic.







