A few months ago, I was building a price comparison tool that needed to pull product info from a dozen different e-commerce sites. Each one had its own lovingly crafted HTML structure—nested <div>s with classes like price-123abc that changed on every deployment. My initial approach was traditional: XPath, CSS selectors, and a sprinkle of regex. It worked until it didn’t. Then I discovered that I could throw an LLM at the raw HTML and let it figure out the extraction. Here’s what I learned.
The Problem That Made Me Want to Throw My Laptop
I had a scraper for Site A that used document.querySelector('.product-price'). It was fragile, but it worked for months. Then Site A redesigned and the selector broke. I updated it. A week later, another redesign. I started using regex to find patterns like \$\d+\.\d{2}. Then someone added a badge that said “$5.00 off” ahead of the real price, and my regex grabbed the wrong number.
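Here is a minimal sketch of how that regex failure plays out, written in Python. The HTML fragment, class names, and prices are made up for illustration and do not come from any real site; the only assumption is a discount badge written as "$5.00 off" that appears before the actual price in the markup.

```python
import re
from typing import Optional

# Hypothetical product markup: the discount badge comes before the real price.
html = """
<div class="product">
  <span class="badge">$5.00 off</span>
  <span class="product-price">$24.99</span>
</div>
"""

# The same pattern as in the text: a dollar sign, digits, a dot, two digits.
PRICE_PATTERN = re.compile(r"\$\d+\.\d{2}")


def first_price(markup: str) -> Optional[str]:
    """Return the first dollar amount found in the markup, or None."""
    match = PRICE_PATTERN.search(markup)
    return match.group(0) if match else None


naive_price = first_price(html)
all_amounts = PRICE_PATTERN.findall(html)

print(naive_price)   # $5.00 (the badge, not the product price)
print(all_amounts)   # ['$5.00', '$24.99']
```

The pattern matches both amounts equally well. Nothing in it says which one is the price, so "take the first match" silently returns the discount.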
I needed something that could understand the meaning of a price, not just its structure. That’s when I wondered: could GPT-4 (or any language model) parse the raw HTML and give me the structured data I needed?
What I Tried That Failed (So You Don’t Have To)






