TL;DR: Our ECS build workers were quietly killing in-flight jobs every time we scaled in or deployed. The fix wasn't just a bigger timeout: we had to actually handle SIGTERM, and then bump stopTimeout to 120s. That cut our "agent lost" failures from ~2% of runs to under 0.1%.

So, here's the thing that bugged me for weeks. We run a chunk of Buildkite's build compute on ECS, and every deploy or scale-in event would cause a small spike of failed builds. Not heaps. Maybe 2% of running jobs at that moment. Enough that someone in the team Slack would go "oi, my build died again" once a day.

The error was always the same flavour: agent disconnected mid-step, job marked as lost, customer retries, moves on. Annoying but not loud enough to page anyone. Which is exactly why it survived for a month.

What was actually happening

ECS sends SIGTERM to your container when it wants the task gone. Scale-in, deployment, spot reclaim, all of it. You get a grace window, set by stopTimeout, then SIGKILL. The default stopTimeout is 30 seconds.
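
To make the SIGTERM-then-SIGKILL dance concrete, here's a rough Python sketch of what "handling SIGTERM" means for a worker loop: note the signal, stop picking up new jobs, and let the in-flight one finish. This is not our agent code. The names `run_worker`, `next_job`, `run_job` and `install_handlers` are made up for illustration, and it assumes a worker that runs one job at a time.

```python
import signal
import threading

# Set once something (ECS, in our case) has asked the process to stop.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    # Only record the request here; the worker loop decides when to exit.
    shutdown_requested.set()


def install_handlers():
    # Must be called from the main thread, which is where Python runs signal handlers.
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(next_job, run_job):
    """Run jobs until a stop is requested. Returns how many jobs finished.

    next_job and run_job are hypothetical callables standing in for the
    real agent: next_job() returns a job or None, run_job(job) executes it.
    """
    finished = 0
    while not shutdown_requested.is_set():
        job = next_job()
        if job is None:
            # Nothing to do: idle for up to a second, waking early on shutdown.
            shutdown_requested.wait(timeout=1.0)
            continue
        # A SIGTERM arriving in here only sets the flag, so the job keeps going.
        run_job(job)
        finished += 1
    return finished
```

The catch is that the in-flight job still has to finish inside the grace window. If it runs longer than stopTimeout, SIGKILL lands anyway and the job dies just like before.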