Datadog dashboards for prompt regression: the panels we actually keep

We wired our LLM eval suite into Datadog over about four months. Most of the panels we built got deleted. These are the five that stayed, and the metrics that feed them.

TL;DR: We run an LLM-as-judge eval suite on every PR that touches a prompt and ship the results to Datadog as custom metrics. The dashboard started with fourteen panels; we kept five. The panel that catches the most real regressions is pass rate broken out by judge criterion, not the single rolled-up pass rate: an aggregate of 0.91 hid the fact that one criterion had dropped from 0.95 to 0.62. Below are the metrics we emit, the Python that submits them, the monitor config we alert on, and the panels we tried and dropped.
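
To make the masking effect concrete, here is a minimal sketch of the arithmetic. The criterion names and verdict counts are hypothetical, picked only so that one criterion sits at 0.62 while the rolled-up number still lands at 0.91; they are not our real eval data.

```python
from collections import defaultdict


def pass_rates(verdicts):
    """Return (aggregate, per_criterion) pass rates.

    verdicts: iterable of (criterion, passed) pairs, one per judge verdict.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for criterion, passed in verdicts:
        totals[criterion] += 1
        passes[criterion] += int(passed)
    per_criterion = {c: passes[c] / totals[c] for c in totals}
    aggregate = sum(passes.values()) / sum(totals.values())
    return aggregate, per_criterion


# Hypothetical run: five criteria, 100 judged examples each.
hypothetical_passes = {
    "faithfulness": 62,   # the regressed criterion
    "format": 98,
    "tone": 98,
    "completeness": 98,
    "safety": 99,
}
verdicts = [
    (criterion, i < n_pass)
    for criterion, n_pass in hypothetical_passes.items()
    for i in range(100)
]

aggregate, per_criterion = pass_rates(verdicts)
worst = min(per_criterion, key=per_criterion.get)
print(f"aggregate={aggregate:.2f} worst={worst} ({per_criterion[worst]:.2f})")
```

With four healthy criteria diluting one bad one, the aggregate barely moves. A threshold on the rolled-up number would stay quiet here, while a threshold on the per-criterion minimum would not.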

Some context on the setup so the rest makes sense. We are a Series-C dev-tool startup. We have a handful of prompts in production that do real work (classification, extraction, a summarization step in an agent loop). Each one has an eval set of tagged examples, somewhere between 80 and 400 per prompt. The judge is a separate model call that scores each output against a rubric. We run the suite in GitHub Actions. The eval job emits metrics to Datadog at the end of every run. Backend service health was already in Datadog, so putting eval data next to it meant one place to look during an incident instead of two.
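
As a rough illustration of the "emit at the end of the run" step, the sketch below formats per-criterion pass rates as DogStatsD gauge datagrams and sends them over UDP. It assumes a Datadog Agent with DogStatsD is reachable from the CI runner on the default port 8125; a runner without an agent would need the HTTP metrics API instead. The metric name `llm_eval.pass_rate`, the tag keys, and the example prompt name are all hypothetical, not necessarily what our eval job uses.

```python
import socket


def gauge_datagram(metric, value, tags):
    """Build one DogStatsD gauge datagram: name:value|g|#k1:v1,k2:v2"""
    tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return f"{metric}:{value}|g|#{tag_str}"


def build_datagrams(prompt_name, per_criterion):
    """One gauge per judge criterion, tagged so a panel can split by it."""
    return [
        gauge_datagram(
            "llm_eval.pass_rate",  # hypothetical metric name
            rate,
            {"prompt": prompt_name, "criterion": criterion},
        )
        for criterion, rate in sorted(per_criterion.items())
    ]


def emit(datagrams, host="127.0.0.1", port=8125):
    """Send datagrams to a DogStatsD listener (assumed to be running)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for datagram in datagrams:
            sock.sendto(datagram.encode("utf-8"), (host, port))
    finally:
        sock.close()


# Hypothetical values for a single eval run.
datagrams = build_datagrams(
    "summarize_step", {"faithfulness": 0.62, "format": 0.98}
)
```

Calling `emit(datagrams)` as the last step of the CI job would push one gauge per criterion. Tagging by criterion rather than encoding it in the metric name keeps a single metric that a dashboard can group or filter.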
