When Regex Fails: LLMs for Messy HTML Data

Last month I inherited a project that needed to extract product information from a legacy e‑commerce site. The HTML was a nightmare—no semantic classes, inconsistent attribute names, and the occasional blob of inline JavaScript. I thought I could just write a few regular expressions and be done in an hour. Six hours later I was staring at a wall of conditional logic that broke every time the page changed.

I needed a better way, and I ended up using a large language model (LLM) to handle the fuzzy extraction. Here’s what I learned—dead ends included—and a working approach you can copy‑paste today.

The Problem

The site had product cards like this:
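
As a rough illustration of why the regex route kept breaking, here is a minimal sketch. The two cards, their attribute names, and the `extract` helper are all invented for this example and are not the site's actual markup. A pattern written against one card layout silently returns nothing for the next.

```python
import re

# Hypothetical product cards, invented for illustration (not the real site's HTML).
CARD_A = '<div class="box1"><b>Blue Widget</b><span price="19.99">$19.99</span></div>'
CARD_B = '<td id="p_77"><font size="3">Red Widget</font> <i data-cost="24.50">USD 24.50</i></td>'

# A pattern written against CARD_A only: name in <b>, price in a `price` attribute.
PATTERN = re.compile(r'<b>(?P<name>[^<]+)</b><span price="(?P<price>[\d.]+)">')


def extract(html):
    """Return (name, price) if the pattern matches, else None."""
    match = PATTERN.search(html)
    if match is None:
        return None
    return match.group("name"), float(match.group("price"))


print(extract(CARD_A))  # ('Blue Widget', 19.99)
print(extract(CARD_B))  # None
```

Covering the second layout means another pattern or another branch, and each new variant adds one more, which is how the conditional logic piles up.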
