Schema Drift Is the Silent Killer. Here's What to Log So You Actually Catch It.

TL;DR: Most scraper "bugs" aren't bugs. They're the source site changing its data shape underneath you while your selectors and your code keep returning success. This is schema drift, and you cannot prevent it. You can only detect it, and the detection has to be designed in. Here's how we do it.

I have a low opinion of any scraper that does not log a per-field availability rate. It's the single most useful number you can produce, and almost nobody produces it.

The premise: every record you scrape has a set of expected fields. After every run, you compute, for each field, the percentage of records that had a non-null value for it. You log that number. You alarm on it.

That's it. That's the whole technique.
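To make that concrete, here is a minimal sketch of the computation in Python. Treat it as one possible shape rather than our production code: the function names, the logger name, and the 95% alarm threshold are all assumptions made for illustration, and it assumes each record is a dict where a missing key or a None value counts as "not available".

```python
import logging
from typing import Any, Iterable, Mapping, Sequence

# Hypothetical logger name; use whatever your scraper already logs to.
logger = logging.getLogger("scraper.availability")


def availability_rates(
    records: Iterable[Mapping[str, Any]],
    expected_fields: Sequence[str],
) -> dict[str, float]:
    """Return, per expected field, the fraction of records with a non-null value."""
    records = list(records)
    total = len(records)
    rates: dict[str, float] = {}
    for field in expected_fields:
        # A missing key and an explicit None both count as unavailable.
        present = sum(1 for record in records if record.get(field) is not None)
        # Assumption: an empty run reports 0.0 so that it trips the alarm.
        rates[field] = present / total if total else 0.0
    return rates


def fields_below_threshold(rates: Mapping[str, float], min_rate: float = 0.95) -> list[str]:
    """Return the fields whose availability fell below min_rate (assumed threshold)."""
    return sorted(field for field, rate in rates.items() if rate < min_rate)


def log_run(
    records: Iterable[Mapping[str, Any]],
    expected_fields: Sequence[str],
    min_rate: float = 0.95,
) -> list[str]:
    """Log one availability line per field, then an error line if any field is low."""
    rates = availability_rates(records, expected_fields)
    for field, rate in rates.items():
        logger.info("field=%s availability=%.1f%%", field, rate * 100)
    failing = fields_below_threshold(rates, min_rate)
    if failing:
        logger.error(
            "availability below %.0f%% for: %s", min_rate * 100, ", ".join(failing)
        )
    return failing
```

Calling log_run(records, ["title", "price"]) at the end of a run gives you one log line per field plus a single error line to alert on. A fixed threshold is the simplest choice; comparing each field against its own rate from previous runs would suit fields that are legitimately sparse.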

Why this matters

Related reading

The Silent Killer in Your Streaming Pipeline: Schema Evolution Without Tears

Why we treat schema drift as a warn rather than a fail

Catch LLM Schema Drift Before It Breaks Production

Building a Lean, Single-Worker Broken URL Monitor for Data Pipelines

HTTP 200 Is a Lie: A 30-Line Schema Canary for Source Drift

Your structured data is probably broken, and your crawler isn't telling you
