I Tried AI-Powered Web Scraping So My Selectors Could Finally Rest

A few months ago, I was building a price comparison tool that needed to pull product info from a dozen different e-commerce sites. Each one had its own lovingly crafted HTML structure—nested <div>s with classes like price-123abc that changed on every deployment. My initial approach was traditional: XPath, CSS selectors, and a sprinkle of regex. It worked until it didn’t. Then I discovered that I could throw an LLM at the raw HTML and let it figure out the extraction. Here’s what I learned.

The Problem That Made Me Want to Throw My Laptop

I had a scraper for Site A that used document.querySelector('.product-price'). It was fragile but worked for months. Then Site A redesigned. The selector broke. I updated it. A week later, another redesign. I started using regex to find patterns like \$\d+\.\d{2}. Then someone added a badge that said “$5 off” and my regex grabbed the wrong number.
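
To make that regex failure concrete, here is a minimal sketch in Python. The HTML snippet is hypothetical, not taken from any real site, and the badge is written as "$5.00 off" (an assumption) so that it matches the pattern exactly as quoted above, which requires two decimal places.

```python
import re

# The pattern quoted above: a dollar sign, digits, a dot, two decimals.
PRICE_RE = re.compile(r"\$\d+\.\d{2}")

# Hypothetical markup after a redesign: a discount badge now sits
# in front of the real price inside the same product container.
html = (
    '<div class="product">'
    '<span class="badge">$5.00 off</span>'
    '<span class="price-123abc">$24.99</span>'
    "</div>"
)

# search() returns the first match in document order, which is the badge.
first_match = PRICE_RE.search(html).group(0)

# findall() shows that the real price is present, just not first.
all_matches = PRICE_RE.findall(html)

print(first_match)
print(all_matches)
```

The pattern has no notion of which dollar amount is the price. It only knows what a dollar amount looks like, so whatever appears first in the markup wins.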

I needed something that could understand the meaning of a price, not just its structure. That’s when I wondered: could GPT-4 (or any language model) parse the raw HTML and give me the structured data I needed?
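
The idea can be sketched roughly like this. It is an illustration under assumptions, not the exact code from my tool: the prompt wording, the JSON field names, and the `call_model` parameter are all hypothetical, and `fake_model` is a stand-in that returns a canned reply so the sketch runs without any API. In practice `call_model` would wrap whichever LLM client you use.

```python
import json
from typing import Callable

# Hypothetical instructions; the wording and field names are assumptions.
INSTRUCTIONS = (
    "Extract the product's current selling price from the HTML below. "
    "Ignore discount badges, shipping costs, and crossed-out prices. "
    'Reply with JSON only: {"name": string, "price": number, "currency": string}.'
)


def build_prompt(html: str) -> str:
    """Combine the fixed instructions with the raw page HTML."""
    return INSTRUCTIONS + "\n\nHTML:\n" + html


def extract_product(html: str, call_model: Callable[[str], str]) -> dict:
    """Send the prompt to a model and validate the JSON it returns.

    call_model is any function that takes a prompt string and returns
    the model's text reply (hypothetical interface).
    """
    reply = call_model(build_prompt(html))
    data = json.loads(reply)  # raises a ValueError subclass on non-JSON replies
    if not isinstance(data, dict) or not isinstance(data.get("price"), (int, float)):
        raise ValueError(f"unexpected model reply: {reply!r}")
    return data


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return '{"name": "Example Widget", "price": 24.99, "currency": "USD"}'


page = '<div><span class="badge">$5.00 off</span><span>$24.99</span></div>'
result = extract_product(page, fake_model)
print(result["price"])
```

Passing the model in as a plain function keeps the extraction logic independent of any particular provider, and validating the reply before using it matters because a model can return text that is not the JSON you asked for.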

What I Tried That Failed (So You Don’t Have To)
