TL;DR
Zero-shot JSON extraction replaces brittle CSS selectors with large language models (LLMs) that map unstructured web content to a predefined schema by meaning rather than by page structure. Because the pipeline passes cleaned HTML or Markdown through the LLM's context window instead of matching DOM paths, it stays resilient to UI changes, A/B tests, and dynamic class names. Data engineering effort shifts from constant selector maintenance to high-level schema definition, which in turn enables agentic data collection.
The Selector Maintenance Trap
Web scraping pipelines eventually hit the same bottleneck: selector maintenance. Traditional data extraction relies on identifying structural patterns in the Document Object Model (DOM). You write rules that target specific HTML nodes, using CSS selectors or XPath expressions through libraries like BeautifulSoup or Cheerio.
A standard selector might look like div.product-details > span:nth-child(3) > b.price-tag.
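To make the fragility concrete, here is a minimal sketch using only Python's standard library. Everything in it is an illustrative assumption: the two page snippets are hypothetical, the helper name extract_price is made up, and the XPath is only a rough counterpart of the CSS selector above (ElementTree supports a limited XPath subset, and span[3] means "third span" rather than "third child", which coincide in this markup). The snippets are also kept well-formed so the XML parser accepts them, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Rough XPath counterpart of: div.product-details > span:nth-child(3) > b.price-tag
PRICE_XPATH = ".//div[@class='product-details']/span[3]/b[@class='price-tag']"

# Hypothetical markup the selector was written against.
ORIGINAL_PAGE = """
<html><body>
  <div class="product-details">
    <span>Acme Kettle</span>
    <span>In stock</span>
    <span><b class="price-tag">$49.99</b></span>
  </div>
</body></html>
"""

# Hypothetical redesign: one extra sibling and a build-generated class name.
REDESIGNED_PAGE = """
<html><body>
  <div class="product-details">
    <span>Acme Kettle</span>
    <span>Free shipping</span>
    <span>In stock</span>
    <span><b class="price-tag_x7f2">$49.99</b></span>
  </div>
</body></html>
"""


def extract_price(page: str) -> Optional[str]:
    """Return the price text, or None when the selector matches nothing."""
    root = ET.fromstring(page.strip())
    node = root.find(PRICE_XPATH)
    return node.text if node is not None else None


print(extract_price(ORIGINAL_PAGE))    # $49.99
print(extract_price(REDESIGNED_PAGE))  # None
```

The price is still plainly visible to a human reader of the redesigned page, but the structural rule returns nothing. Note that the failure is silent: no exception is raised, so the pipeline keeps running and emits empty values until someone notices.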






