I Spent a Weekend Fighting HTML Parsing. Here's What Finally Worked

Last month, I needed to extract product specifications from a dozen e-commerce sites for a price comparison project. Simple, right? Just scrape the HTML, grab the <table> or <dl>, and parse it into JSON.

Two days later, I was ready to throw my laptop out the window. Every site used different markup. Some used <div> soup, others hid the data in JavaScript objects, and a few served the specs as an image of a table. Regex and BeautifulSoup got me maybe 40% of the way before everything fell apart.

What I Tried That Didn't Work

1. CSS selectors and XPath

I started with soup.select('table.specs tr'). Worked great on site A. Site B used ul.list. Site C had a nested <dl> inside a shadow DOM. I ended up with a 200-line function full of fallback logic that still missed half the fields.
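To give a sense of what that fallback logic looks like, here is a minimal sketch of the pattern, not my actual 200-line function. It tries one extractor per layout, in order, and returns the first non-empty result. A few assumptions to flag: it uses Python's standard-library html.parser instead of BeautifulSoup so it runs without dependencies, it matches on tag names only (no class selectors), and every name in it (CellCollector, collect, pairs, extract_specs and the from_* helpers) is made up for illustration.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every element whose tag is in `tags`, in document order."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.cells = []    # (tag, text) pairs
        self._open = None  # tag currently being captured, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self._open, self._buf = tag, []

    def handle_data(self, data):
        if self._open:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            self.cells.append((tag, "".join(self._buf).strip()))
            self._open = None


def collect(html, *tags):
    parser = CellCollector(tags)
    parser.feed(html)
    parser.close()
    return parser.cells


def pairs(cells, key_tag, value_tag):
    """Pair each key cell with the next value cell that follows it."""
    out, key = {}, None
    for tag, text in cells:
        if tag == key_tag:
            key = text
        elif tag == value_tag and key is not None:
            out[key] = text
            key = None
    return out


def from_table(html):
    # Layout A (assumed): <tr><th>Key</th><td>Value</td></tr>
    return pairs(collect(html, "th", "td"), "th", "td")


def from_dl(html):
    # Layout B (assumed): <dt>Key</dt><dd>Value</dd>
    return pairs(collect(html, "dt", "dd"), "dt", "dd")


def from_list(html):
    # Layout C (assumed): <li>Key: Value</li>
    out = {}
    for _, text in collect(html, "li"):
        key, sep, value = text.partition(":")
        if sep:
            out[key.strip()] = value.strip()
    return out


def extract_specs(html):
    """Try each layout-specific extractor in order; return the first hit."""
    for extractor in (from_table, from_dl, from_list):
        specs = extractor(html)
        if specs:
            return specs
    return {}
```

Every new site layout means another from_* function and another entry in the chain, which is how the function ballooned. Nothing in this sketch can see content inside a shadow DOM or data that only exists after JavaScript runs.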
