A few months ago, I was building a price tracker for limited-edition sneakers. I had a list of 50+ store URLs, and I needed to extract product name, price, availability, and size options. Classic scraping, right?
I started with CSS selectors: BeautifulSoup + requests. It worked for about a week. Then one site changed its class names. Another added a dynamic loader. A third injected ads that shifted the DOM. I spent more time fixing selectors than actually using the data.
I tried regex on the raw HTML. That was a disaster: fragile and unreadable. I tried headless browsers with Playwright, waiting for specific elements. That still broke whenever the layout changed.
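Here is a minimal sketch of that kind of breakage, using only Python's `re` module. The HTML snippets, the class names, and the `extract_price` helper are all made up for illustration and are not taken from any real store. The price is identical in both snippets; only the class name differs.

```python
import re

# Hypothetical markup for the same product, before and after a site redesign.
OLD_HTML = '<div class="product"><span class="price">$220.00</span></div>'
NEW_HTML = '<div class="product"><span class="pdp-price__value">$220.00</span></div>'

# The pattern is tied to the class name, not to what the content means.
PRICE_PATTERN = re.compile(r'<span class="price">\$([\d.]+)</span>')


def extract_price(html):
    """Return the price as a float, or None when the pattern no longer matches."""
    match = PRICE_PATTERN.search(html)
    return float(match.group(1)) if match else None


print(extract_price(OLD_HTML))  # 220.0
print(extract_price(NEW_HTML))  # None
```

The second call fails silently even though the price is still sitting right there on the page. A CSS selector keyed to the same class name would fail in the same way.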
The problem was fundamental: I was trying to reverse-engineer the presentation layer. What I really wanted was the meaning of the content: the product's price, not the CSS class it lived in.
The turning point: LLMs for structured extraction






