TL;DR: A chunk of our EC2 build agents got slow at the same time every afternoon. No CPU pressure, no memory pressure, no network weirdness. It was EBS gp2 burst credits draining to zero, and the fix was a one-line volume type change to gp3 plus a CloudWatch alarm we should've had years ago.
Right, so this one annoyed me for about a week before the penny dropped.
I work on the core compute platform at Buildkite. We run a fleet of EC2 build agents that pick up jobs off a queue and run them. Sydney afternoon, roughly 2pm local, a handful of agents would start dragging. Builds that normally finished in 4 minutes were taking 11. Not all agents. Maybe 15% of the fleet at any given time.
The symptom that lied to us
First instinct was the obvious stuff. CPU? Flat at 30%. Memory? Plenty free. The agent process itself looked bored.






