As environments grow and new AI workloads are deployed every day, infrastructure teams must constantly adapt to and manage new resource patterns, scaling behavior, and operational risks. When application teams lack the expertise to respond to issues confidently on their own, infrastructure teams shoulder the burden of remediating those issues across the stack, including hosts, Kubernetes, serverless, and network infrastructure. Examples include disk saturation on hosts, CrashLoopBackOff and OOMKilled errors in Kubernetes, concurrency limits on AWS Lambda, expiring TLS certificates on networks, and memory pressure on Amazon ECS.

Datadog Bits Infrastructure Operations autonomously detects, investigates, and safely remediates common infrastructure issues before they impact your production environments and escalate into incidents. When Bits can safely act, it remediates issues automatically. When approval is required, it surfaces the highest-priority issues with the information your team needs to review and approve the next step. This approach reduces handoffs between application and infrastructure teams: application engineers can identify infrastructure issues affecting their services and safely remediate them, while platform engineers control the guardrails.
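
To make the split between automatic remediation and approval concrete, here is a minimal Python sketch of that kind of triage. It is an illustration only and does not represent Datadog's implementation or API: the `Issue` type, the `AUTO_APPROVED_KINDS` guardrail set, the issue kind names, and the numeric priority scale are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    resource: str   # affected resource, e.g. a host or pod name
    kind: str       # hypothetical issue type label
    priority: int   # hypothetical scale: higher means more urgent


# Hypothetical guardrail owned by platform engineers:
# only these issue kinds may be remediated without a human approval.
AUTO_APPROVED_KINDS = frozenset({"disk_saturation"})


def triage(issues, auto_approved=AUTO_APPROVED_KINDS):
    """Split issues into (auto-remediable, needs-approval).

    The needs-approval list is sorted so the highest-priority
    issue is reviewed first.
    """
    auto = [i for i in issues if i.kind in auto_approved]
    review = sorted(
        (i for i in issues if i.kind not in auto_approved),
        key=lambda i: i.priority,
        reverse=True,
    )
    return auto, review


issues = [
    Issue("host-1", "disk_saturation", 2),
    Issue("pod-a", "oom_killed", 3),
    Issue("fn-x", "lambda_concurrency", 5),
]
auto, review = triage(issues)
print([i.resource for i in auto])
print([i.resource for i in review])
```

In this sketch, widening or narrowing the guardrail set is the only change needed to move an issue kind between the automatic path and the approval queue, which mirrors the idea of platform engineers controlling the guardrails while application engineers act within them.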