A few months ago, I was building a price tracker for limited-edition sneakers. I had a list of 50+ store URLs, and I needed to extract product name, price, availability, and size options. Classic scraping, right?

I started with CSS selectors: BeautifulSoup plus requests. It worked for about a week. Then one site changed its class names. Another added a dynamic loader. A third injected ads that shifted the DOM. I spent more time fixing selectors than actually using the data.

I tried regex on the raw HTML. That was a disaster: fragile and unreadable. I tried a headless browser with Playwright, waiting for specific elements to appear. That still broke whenever the layout changed.
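
To make that failure mode concrete, here is a minimal sketch of the kind of breakage I mean. The markup and class names are invented for illustration and do not come from any real store; the regex stands in for any extraction rule tied to a class name, including a CSS selector.

```python
import re
from typing import Optional

# Hypothetical markup, made up for this example.
OLD_HTML = '<span class="price">$189.99</span>'
# Same price, but the site renamed the class.
NEW_HTML = '<span class="pdp-price__value">$189.99</span>'

# Pattern bound to the exact class name, much like a class-based selector.
PRICE_RE = re.compile(r'<span class="price">\$([\d.]+)</span>')


def extract_price(html: str) -> Optional[float]:
    """Return the price as a float, or None if the pattern no longer matches."""
    match = PRICE_RE.search(html)
    return float(match.group(1)) if match else None


print(extract_price(OLD_HTML))  # matches the original markup
print(extract_price(NEW_HTML))  # no match after the class rename
```

The first call finds the price; the second returns None even though the price is still right there on the page. Nothing about the product changed, only the presentation did.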

The problem was fundamental: I was trying to reverse-engineer the presentation layer. What I really wanted was the meaning of the content: the product's price, not the CSS class it lived in.

The turning point: LLMs for structured extraction