Last month, I needed to extract product specifications from a dozen e-commerce sites for a price comparison project. Simple, right? Just scrape the HTML, grab the <table> or <dl>, and parse it into JSON.

Two days later, I was ready to throw my laptop out the window. Every site used different markup. Some used <div> soup, others hid the data in JavaScript objects, and a few served the specs as an image of a table. Regex and BeautifulSoup got me maybe 40% of the way before everything fell apart.

What I Tried That Didn't Work

1. CSS selectors and XPath

I started with soup.select('table.specs tr'). Worked great on site A. Site B used ul.list. Site C had a nested <dl> inside a shadow DOM. I ended up with a 200-line function full of fallback logic that still missed half the fields.
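To give a feel for what that fallback logic looks like, here is a minimal sketch of the pattern, not my actual function. It tries one extractor per markup style and returns the first non-empty result. The function names (specs_from_table, specs_from_list, specs_from_dl, extract_specs) are made up for illustration, and the "Label: value" layout for the ul.list items is an assumption about how such a list might be written. It does not attempt the shadow DOM case, which typically needs a rendered page rather than static HTML.

```python
from bs4 import BeautifulSoup


def specs_from_table(soup):
    """Site A style: <table class="specs"> with a label cell and a value cell per row."""
    specs = {}
    for row in soup.select("table.specs tr"):
        cells = row.find_all(["th", "td"])
        if len(cells) >= 2:
            specs[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)
    return specs


def specs_from_list(soup):
    """Site B style: <ul class="list">, assuming each item reads 'Label: value'."""
    specs = {}
    for item in soup.select("ul.list li"):
        label, sep, value = item.get_text(" ", strip=True).partition(":")
        if sep:  # skip items that have no colon at all
            specs[label.strip()] = value.strip()
    return specs


def specs_from_dl(soup):
    """<dl> style: each <dt> label is paired with the next <dd> sibling."""
    specs = {}
    for dt in soup.select("dl dt"):
        dd = dt.find_next_sibling("dd")
        if dd is not None:
            specs[dt.get_text(strip=True)] = dd.get_text(strip=True)
    return specs


# Order matters: the first extractor that finds anything wins.
EXTRACTORS = [specs_from_table, specs_from_list, specs_from_dl]


def extract_specs(html):
    """Return a dict of specs from the first extractor that matches, else {}."""
    soup = BeautifulSoup(html, "html.parser")
    for extractor in EXTRACTORS:
        specs = extractor(soup)
        if specs:
            return specs
    return {}
```

Three markup styles already take three separate functions, and each new site tends to add another one or another special case inside an existing one. That is how you get to 200 lines.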