Why regex couldn't parse my invoices (and what did)

Sunday, 21 June 2026

TL;DR

LLM function calling extracted invoice fields at 92% accuracy, versus 63% for regex, across languages and layouts, for about $0.01 per invoice. Teams can skip both regex and fine-tuning: structured outputs now reliably extract vendor, date, and line items from heterogeneous invoice formats.

I spent three weekends building a regex pipeline for invoice data extraction. By the end of it, I had 63% accuracy on a test set of 100 PDFs. My co-founder looked at me and said, "Is this production ready?"

No. No it wasn't.

This is the story of how I stopped trying to outsmart every edge case and started treating the problem as what it really is: a language understanding task. And no, the solution wasn't fine-tuning a model on my 100 invoices. The real trick was way simpler.

The problem: every invoice is a snowflake

We were building a small expense management tool. Users upload PDF invoices, and we needed to extract vendor name, date, total amount, and line items. Simple, right?
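To make the target concrete, here is a minimal sketch of the record we wanted out of each PDF. The four top-level fields come straight from the list above. The class names and the shape of a line item (description, quantity, unit price) are my assumptions for illustration, not a fixed spec.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal


@dataclass
class LineItem:
    # Hypothetical line-item shape; real invoices vary a lot here.
    description: str
    quantity: Decimal
    unit_price: Decimal

    @property
    def amount(self) -> Decimal:
        # Line amount before any tax or discount (an assumption).
        return self.quantity * self.unit_price


@dataclass
class Invoice:
    # The four fields the tool needs from every uploaded PDF.
    vendor_name: str
    invoice_date: date
    total_amount: Decimal  # Decimal avoids float rounding on money
    line_items: list[LineItem] = field(default_factory=list)


# Made-up example values, just to show the shape.
example = Invoice(
    vendor_name="Acme GmbH",
    invoice_date=date(2026, 1, 15),
    total_amount=Decimal("19.00"),
    line_items=[LineItem("Widget", Decimal("2"), Decimal("9.50"))],
)
```

Whatever does the extraction, regex or otherwise, has to fill in this same structure for every layout it sees.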