I’ve been doing web scraping for years. For most projects, I lean on BeautifulSoup, cssselect, and a handful of regex patterns. You know the drill: inspect the page, find the selector, extract the text, clean it up. It works great when every page follows the same template.
Then I hit a project that involved scraping product details from hundreds of small e-commerce sites. Every site had its own HTML structure. Some used <div class="price">, others <span itemprop="price">, and a few just had $29.99 buried in a paragraph with no class at all. My carefully crafted selectors broke within the first dozen sites. I was spending more time writing conditional parsers than actually using the data.
What I Tried That Didn’t Work
My first instinct was to throw more code at the problem. I wrote a meta-parser that tried multiple selectors in turn and fell back to regex for patterns like prices. That worked… until a site used a different currency symbol, or showed a discount price after the original one. Each new exception meant another branch in the parser, and debugging became a nightmare.
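To make the fallback idea concrete, here is a minimal sketch of that kind of chain. It is not the original meta-parser: the names ATTR_RULES, PRICE_RE, AttrTextCollector and extract_price are hypothetical, and it uses the standard library's html.parser in place of BeautifulSoup so it runs without extra installs.

```python
import re
from html.parser import HTMLParser

# Hypothetical attribute checks, tried in order. They stand in for CSS
# selectors such as div.price or span[itemprop=price].
ATTR_RULES = [("class", "price"), ("itemprop", "price")]

# Last-resort pattern: a dollar sign followed by an amount (assumes USD).
PRICE_RE = re.compile(r"\$\s?(\d+(?:\.\d{2})?)")


class AttrTextCollector(HTMLParser):
    """Collect the text of the first element whose attribute contains value."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr, self.value = attr, value
        self.tag = None      # tag name of the matched element, once found
        self.depth = 0       # nesting depth of same-named tags inside it
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.tag is None:
            tokens = (dict(attrs).get(self.attr) or "").split()
            if self.value in tokens:
                self.tag, self.depth = tag, 1
        elif tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if not self.done and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.tag is not None and not self.done:
            self.chunks.append(data)


def extract_price(html):
    """Try each attribute rule, then fall back to a regex over the raw HTML."""
    for attr, value in ATTR_RULES:
        collector = AttrTextCollector(attr, value)
        collector.feed(html)
        collector.close()
        match = PRICE_RE.search("".join(collector.chunks))
        if match:
            return float(match.group(1))
    # Fallback: the first dollar amount anywhere in the markup.
    match = PRICE_RE.search(html)
    return float(match.group(1)) if match else None
```

The sketch handles the three layouts mentioned earlier, but it also reproduces both failure modes. A page with `<p>€29,99</p>` yields None because the pattern only knows the dollar sign, and `<p><s>$39.99</s> now $29.99</p>` yields 39.99 because the first match is the crossed-out original price.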
Next, I tried training a simple classifier to tag elements (price, name, description) based on their attributes and surrounding text. I used scikit-learn with features like class names, tag name, and text length. It worked okay on the training set, but failed on new layouts. The feature space was too shallow: class and tag names are exactly the things that change from one site to the next.
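A small illustration of why such features generalize poorly, as a sketch under assumptions rather than the actual feature set. The function element_features and its feature names are hypothetical. The dict-of-features shape is the kind of input scikit-learn's DictVectorizer accepts, but the example itself needs only the standard library.

```python
def element_features(tag, attrs, text):
    """Hypothetical shallow features: tag name, class tokens, text length."""
    features = {f"tag={tag}": 1, "text_len": len(text)}
    for token in attrs.get("class", "").split():
        features[f"class={token}"] = 1
    return features


# A price element as it might appear on a site in the training set...
seen = element_features("div", {"class": "price"}, "$29.99")
# ...and the same price marked up differently on a site never seen before.
unseen = element_features("span", {"itemprop": "price"}, "$29.99")

# Feature names the two elements have in common.
shared = set(seen) & set(unseen)
print(shared)
```

The only feature the two elements share is text_len. Everything the classifier could learn from the tag or class name on the first site is absent on the second.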






