When Regex Fails: Using LLMs to Extract Structured Data from Messy Pages

I’ve been doing web scraping for years. For most projects, I lean on BeautifulSoup, cssselect, and a handful of regex patterns. You know the drill: inspect the page, find the selector, extract the text, clean it up. It works great when every page follows the same template.

Then I hit a project that involved scraping product details from hundreds of small e-commerce sites. Every site had its own HTML structure. Some used <div class="price">, others <span itemprop="price">, and a few just had $29.99 buried in a paragraph with no class at all. My carefully crafted selectors broke within the first dozen sites. I was spending more time writing conditional parsers than actually using the data.

What I Tried That Didn’t Work

My first instinct was to throw more code at the problem. I wrote a meta‑parser that tried multiple selectors in turn and fell back to regex for patterns like prices. That worked until a site used a different currency symbol, or listed a discount price after the original one. Each new exception meant another special case, and debugging became a nightmare.
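To make the failure modes concrete, here is a minimal sketch of that kind of meta‑parser. It is a reconstruction, not my original code: the selector list, the regex, and the sample pages are all illustrative assumptions.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical selector list; in practice it grew with every new site.
PRICE_SELECTORS = [
    "div.price",
    'span[itemprop="price"]',
]

# Fallback pattern (an assumption): a dollar sign followed by a number.
PRICE_RE = re.compile(r"\$\s?(\d+(?:\.\d{2})?)")


def extract_price(html):
    """Return the first price found as a string, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Try each known selector in order.
    for selector in PRICE_SELECTORS:
        node = soup.select_one(selector)
        if node is not None:
            match = PRICE_RE.search(node.get_text())
            if match:
                return match.group(1)
    # No selector matched: scan the visible text of the whole page.
    match = PRICE_RE.search(soup.get_text(" "))
    return match.group(1) if match else None


# Made-up sample pages, one per situation described above.
pages = {
    "class": '<div class="price">$29.99</div>',
    "itemprop": '<span itemprop="price">$29.99</span>',
    "bare": "<p>Only $29.99 while stocks last.</p>",
    "euro": "<p>Nur 24,99 € pro Stück.</p>",
    "discount": "<p>Was $39.99, now $29.99.</p>",
}

for name, html in pages.items():
    print(name, extract_price(html))
```

The first three pages come back as 29.99. The euro page returns None because the pattern only knows about dollar signs, and the discount page returns 39.99, the original price, because the regex stops at the first match. Both can be patched, but every patch is one more branch to maintain.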

Next, I tried training a simple classifier to tag elements (price, name, description) based on their attributes and surrounding text. I used scikit‑learn with features like class names, tag name, and text length. It worked okay on the training set but failed on new layouts. The feature space was too shallow to generalize to markup it hadn’t seen.
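For reference, the classifier setup looked roughly like the sketch below. The feature function, the choice of LogisticRegression, and the tiny training set are assumptions made up for illustration; only the feature types (class name, tag name, text length) come from what I actually used.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def element_features(tag, class_name, text):
    """Turn one HTML element into a flat feature dict (hypothetical helper)."""
    return {
        "tag": tag,
        "class": class_name or "none",
        "text_length": len(text),
    }


# Toy training data: (tag, class, text) -> label. Invented for this sketch.
train = [
    (("div", "price", "$29.99"), "price"),
    (("span", "price", "$5.00"), "price"),
    (("h1", "product-title", "Blue Ceramic Mug"), "name"),
    (("h2", "product-title", "Steel Water Bottle"), "name"),
    (("p", "description", "A sturdy mug that holds 350 ml of coffee or tea."), "description"),
    (("div", "description", "Keeps drinks cold for a full day on the trail."), "description"),
]
X = [element_features(*element) for element, _ in train]
y = [label for _, label in train]

# DictVectorizer one-hot encodes the string features and passes numbers through.
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# An element that resembles the training data.
seen_style = element_features("div", "price", "$12.50")
# A price buried in a paragraph with no class, as on many real sites.
unseen_style = element_features("p", None, "Yours today for just $12.50!")

print(model.predict([seen_style]))
print(model.predict([unseen_style]))
```

Nothing in these features tells the model that the second element contains a price. It only sees a p tag, no class, and a mid-length string, so whatever it predicts is driven by surface attributes rather than by what the text says.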
