GPU Incident at 3am: eBPF Tracing from Page to Root Cause in 60 Seconds

TL;DR

3am page: GPU training pipeline missed its SLA. Datadog shows 95% GPU utilization. nvidia-smi agrees. Everything looks green, but the job is 3x slower than expected. Zero tools to diagnose this. eBPF kernel tracing produces causal chains in 60 seconds: the host CPU was fighting with DataLoader workers, starving the GPU. A taskset fix, back to sleep, no ML engineer woken up. This is a field guide for GPU incident response using eBPF tracing to go from alert to root cause in under a minute.

The 3am Page Every GPU SRE Dreads

GPU incident response starts with a page that makes no sense. PagerDuty fires:

[CRITICAL] GPU Training Pipeline SLA Breached

GPU Incident at 3am: eBPF Tracing from Page to Root Cause in 60 Seconds

Other newsrooms on this story

Related reading

From Kernel Scheduler to Python Source Line: Tracing a GPU Stall End to End

GPUs keep falling off the PCIe bus, and standard node health does not notice

Solving the GPU Pinning Saga and Gemma's Meta-Commentary

Enterprise GPU utilization: why 95% of AI infrastructure spend is wasted

FOMO Driving GPU Overbuying, 95% of Capacity Idle

The $2 trillion AI infrastructure problem no one is talking about, and the…