A practical look at how debugging workflows, metrics, and automated runbooks are used to investigate slowdowns and failures in large-scale model training.