I needed to extract invoice line items from hundreds of PDF documents. Dates, amounts, vendor names. Sounded trivial with AI. But every naive approach burned cash, hallucinated values, or choked on varied formats.

Here’s what I tried, what failed, and the technique that finally worked – no hype, just honest trade-offs.

The Real Problem

We inherited a stack of PDF invoices from old acquisitions. The layouts and fonts varied, and some were scans. My boss wanted a structured CSV. “Just use an LLM,” they said.

I started with the simplest thing: dump each PDF's extracted text into GPT-4 with a prompt asking for a JSON array of line items. For ~10 documents it worked fine. At 200, I hit three walls: