We shipped a feature that made perfect sense. It improved a specific type of investigation we had been testing against. Then other investigations started getting worse.

Nothing crashed. No tests failed. But the overall quality of the agent had shifted, and we had no reliable way to detect it.

Bits AI SRE is Datadog’s autonomous agent for investigating production incidents. It reasons across metrics, logs, traces, infrastructure metadata, network telemetry, monitor configuration, and more to determine, triage, and remediate the root cause of an issue.

As we built Bits, we expected behavior to improve incrementally with each feature we added. Instead, we saw something more subtle. Improvements in one area could quietly introduce regressions in another. The problem wasn’t just the model. We had no way to replay real production context, measure behavior consistently across diverse incidents, or track whether the agent was actually improving over time.

We needed infrastructure that could turn production issues into reproducible investigation environments. So we built a replayable evaluation platform from scratch.