The SIGTERM our build workers ignored, and the 90s that fixed it

TL;DR: Our ECS build workers were quietly killing in-flight jobs every time we scaled in or deployed. The fix wasn't just a bigger timeout: it was actually handling SIGTERM, plus bumping stopTimeout from the 30s default to 120s. That cut our "agent lost" failures from ~2% of runs to under 0.1%.

So, the thing that bugged me for weeks. We run a chunk of Buildkite's build compute on ECS, and every deploy or scale-in event would spike a small batch of failed builds. Not heaps. Maybe 2% of running jobs at that moment. Enough that someone in the team Slack would go "oi, my build died again" once a day.

The error was always the same flavour: agent disconnected mid-step, job marked as lost, customer retries, moves on. Annoying but not loud enough to page anyone. Which is exactly why it survived for a month.

What was actually happening

ECS sends SIGTERM to your container when it wants the task gone. Scale-in, deployment, spot reclaim, all of it. You get a grace window, then SIGKILL. The default stopTimeout is 30 seconds.
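To make that concrete, here is a rough Python sketch of what "handling SIGTERM" can look like for a job-running worker. It is not our actual agent code: fetch_job and run_job are hypothetical stand-ins for whatever pulls and executes work, and the only idea it shows is "stop taking new jobs, let the current one finish". The drain still has to fit inside the grace window, otherwise SIGKILL arrives anyway.

```python
import signal
import threading

# Set once a stop has been requested; the worker loop checks it between jobs.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the stop request. Deliberately does not exit, so the job in flight can finish."""
    shutdown_requested.set()


def install_handlers():
    """Register the handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(fetch_job, run_job, poll_interval=1.0):
    """Run jobs until a stop is requested, then return the number of jobs completed.

    fetch_job and run_job are hypothetical callables (assumptions for this sketch):
    fetch_job() returns a job or None, run_job(job) executes it to completion.
    """
    completed = 0
    while not shutdown_requested.is_set():
        job = fetch_job()
        if job is None:
            # Idle: sleep, but wake up immediately if SIGTERM arrives.
            shutdown_requested.wait(poll_interval)
            continue
        run_job(job)  # SIGTERM only sets the flag, so this call is not cut short.
        completed += 1
    return completed
```

In a real entrypoint you would call install_handlers() once at startup and then run_worker(...) with your own job functions. Once SIGTERM lands, the loop finishes the current job, picks up nothing new, and returns so the process can exit cleanly before the grace window runs out.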
