How we cut Spark compute costs by 44% with agentic AI and Datadog Jobs Monitoring

Spark jobs only get more expensive and harder to debug as they scale. It’s a problem we’ve run into ourselves. Our Referential Data Platform team builds and maintains the knowledge graph that maps relationships between customers’ observability entities. ServiceQueryEdge is at the center of that graph, mapping service entities to their associated metric and log queries. It runs daily across seven datacenters, with individual partitions processing up to 27 TB of input and 16 billion records. At that scale, we were averaging $1.5k of infrastructure costs daily, with each run taking over 17 hours.

AI agents seemed like a natural fit for this problem. They’re good at reasoning over code, connecting symptoms to root causes, and generating hypotheses quickly. But an agent working from code alone is still guessing. It needs to know what’s actually slow.

In this post, we’ll walk through how we used Datadog’s Data Observability Jobs Monitoring and an AI agent built on Claude to debug and optimize ServiceQueryEdge. We’ll cover what worked, what didn’t, and the specific changes that cut our daily compute costs by 44% and reduced run duration by 60% in US1, our largest data center.

Closing the gap between Jobs Monitoring and the codebase

How we cut Spark compute costs by 44% with agentic AI and Datadog Jobs Monitoring | Datadog

How we cut Spark compute costs by 44% with agentic AI and Datadog Jobs Monitoring | Datadog

Other newsrooms on this story

Related reading

Optimize Spark and Databricks jobs with Datadog | Datadog

Attribute AI costs across providers with Datadog Cloud Cost Management | Datadog

How we saved over $3 million in idle compute costs with Datadog Kubernetes…

Databricks sales growth tops 80%, but margin are shrinking from swarm of AI…

We Cut Our AI Agent Costs by 60%. Here's What Worked.

Making agentic token costs visible in production | Datadog