How we built a real-world evaluation platform for autonomous SRE agents at scale

We shipped a feature that made perfect sense. It improved a specific type of investigation we had been testing against. Then other investigations started getting worse.

Nothing crashed. No tests failed. But the overall quality of the agent had shifted, and we had no reliable way to detect it.

Bits AI SRE is Datadog’s autonomous agent for investigating production incidents. It reasons across metrics, logs, traces, infrastructure metadata, network telemetry, monitor configuration, and more to determine the root cause of an issue, then triage and remediate it.

As we built Bits, we expected behavior to improve incrementally with each feature we added. Instead, we saw something more subtle. Improvements in one area could quietly introduce regressions in another. The problem wasn’t just the model. We had no way to replay real production context, measure behavior consistently across diverse incidents, or track whether the agent was actually improving over time.

We needed infrastructure that could turn production issues into reproducible investigation environments. So we built a replayable evaluation platform from scratch.
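To make the failure mode above concrete, here is a minimal sketch of how a per-category comparison can surface a regression that an aggregate score hides. It is not the platform described in this post. The category names, the sample results, the `tolerance` threshold, and the helper functions are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def overall_pass_rate(results):
    """Fraction of investigations that passed, ignoring category."""
    return sum(passed for _, passed in results) / len(results)


def pass_rate_by_category(results):
    """Map each category to its pass rate, given (category, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def find_regressions(baseline, candidate, tolerance=0.05):
    """Return categories whose pass rate dropped by more than `tolerance`."""
    base = pass_rate_by_category(baseline)
    cand = pass_rate_by_category(candidate)
    return {
        c: (base[c], cand[c])
        for c in base
        if c in cand and base[c] - cand[c] > tolerance
    }


# Hypothetical results: (incident category, did the agent find the root cause?)
baseline = [("latency", True), ("latency", False), ("oom", True), ("oom", True)]
candidate = [("latency", True), ("latency", True), ("oom", True), ("oom", False)]

# The aggregate score is identical for both versions of the agent...
same_overall = overall_pass_rate(baseline) == overall_pass_rate(candidate)

# ...but the per-category view shows one category improved while another regressed.
regressions = find_regressions(baseline, candidate)
```

In this toy data the candidate fixes the `latency` investigations while breaking one of the `oom` investigations, so the overall pass rate does not move. Only the per-category breakdown shows the shift, and it is meaningful only if both runs are evaluated against the same fixed set of incidents.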
