Extracting structured data from messy text: what worked for me

I spent a good two weeks last quarter building an invoice extraction pipeline for our accounting team. The emails came in all shapes: some with PDF attachments, others with plain text tables, a few with scanned images that had been OCR'd into garbled nonsense. My job was to pull out vendor name, invoice number, date, and total amount.

At first I thought, "Regex, obviously." I wrote patterns for date formats, dollar amounts, and common invoice prefixes. It worked on the first ten samples. Then the real data came. One vendor sent invoices with "Invoice #" and another used "Ref:". Dates were mm/dd/yyyy, dd.mm.yyyy, or even "March 5, 2023". Regex broke fast.

I tried spaCy next. Training a custom NER model for four fields seemed reasonable. I manually labelled 200 invoices using Prodigy (the team had a license). The model got to ~85% F1, but then a new vendor showed up with a different layout and accuracy dropped to 60%. Retraining every week wasn't sustainable.

The approach that finally stuck: few-shot LLM extraction

I realised I didn't need a full-fledged model. I just needed something that could read instructions and follow examples. LLMs (even small ones) are surprisingly good at this when you provide a clear system prompt and a handful of examples.

Extracting structured data from messy text: what worked for me

Related reading

From Regex Hell to AI: How I Finally Tamed Messy PDF Invoices

I Thought Regex Could Handle It: My Data Extraction Rabbit Hole

Why regex couldn't parse my invoices (and what did)

I Spent a Month Fighting LLMs to Extract Structured Data

I stopped fighting with regex for data extraction. Here's how AI saved my…

I spent a week on regex before realizing AI agent was the answer for data…