Building a Lean, Single-Worker Broken URL Monitor for Data Pipelines

The Technical Problem: Websites Drift, Pipelines Don't Know

Long-running scraping pipelines have a structural assumption baked in: the URLs you configured last month still resolve today. That assumption is wrong more often than you'd expect.

Sites reorganize their URL structure during CMS migrations. Documentation pages get archived or consolidated. Blog posts get unpublished. Product pages disappear. This is called site drift — the slow, continuous decay of a website's link graph over time — and it's completely normal behavior from the target site's perspective. From your pipeline's perspective it's a quiet source of wasted work.

The failure mode looks like this: your scheduled scraper fires, constructs its list of target URLs from a cached sitemap or a hardcoded config, and dispatches requests to all of them. Some of those URLs now return 404 Not Found or 500 Internal Server Error. The pipeline then does one of three things: it silently swallows the errors, logs them somewhere nobody checks, or, worse, passes empty response bodies downstream into your parser, which produces garbage records. Your data store fills with empty or malformed entries. Compute units are consumed for zero useful output.

At small scale, this is a minor annoyance. At any meaningful schedule frequency — hourly, daily, continuous — it compounds into a real cost problem. You're paying for bandwidth and execution time on requests you already know are going to fail, because nobody built a gate to check first.
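To make the idea of a gate concrete, here is a minimal sketch of a single-worker pre-flight check, assuming plain HEAD requests are an acceptable probe for your targets (some servers reject HEAD, so treat that as an assumption to verify). The names head_status and partition_urls are hypothetical and used only for illustration. The probe is injectable so the partitioning logic can be exercised without touching the network.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Tuple


def head_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for a HEAD request, or 0 if no response arrived.

    Hypothetical helper: 0 is our own sentinel for DNS failures, timeouts,
    refused connections, and malformed URLs. It is not an HTTP status code.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # 4xx and 5xx responses are raised as HTTPError; the code is the status.
        return exc.code
    except (OSError, ValueError):
        # URLError and timeouts are OSError subclasses; ValueError covers
        # malformed URLs such as a missing scheme.
        return 0


def partition_urls(
    urls: Iterable[str],
    probe: Callable[[str], int] = head_status,
) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Split URLs into (live, broken) by probing each one sequentially."""
    live: List[str] = []
    broken: List[Tuple[str, int]] = []
    for url in urls:
        status = probe(url)
        if 200 <= status < 400:
            live.append(url)
        else:
            # Keep the status so the report shows why the URL was skipped.
            broken.append((url, status))
    return live, broken
```

In a scheduled run, the scraper would dispatch requests only to the live list and write the broken list to a report that someone actually reads. Whether a 500 should count as broken or as a transient failure worth retrying is a policy choice this sketch does not make.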
