I spent three weekends building a regex pipeline for invoice data extraction. By the end of it, I had 63% accuracy on a test set of 100 PDFs. My co-founder looked at me and said, "Is this production-ready?"

No. No, it wasn't.

This is the story of how I stopped trying to outsmart every edge case and started treating the problem as what it really is: a language understanding task. And no, the solution wasn't fine-tuning a model on my 100 invoices. The real trick was way simpler.

The problem: every invoice is a snowflake

We were building a small expense management tool. Users would upload PDF invoices, and we needed to extract the vendor name, date, total amount, and line items. Simple, right?