My classifier calls an LLM on every single email. The LLM is not allowed to classify the email.

That sounds like a contradiction. It's the most important design decision in the thing.

A reader named @nazar_boyko left a comment on my last post — the one where a cheap model beat GPT-4o on email triage — and put it better than I did:

Once the LLM is a feature scorer and not the decider, "consistency over genius" falls right out of it, and a cheap fast model is exactly what you want for reading the same four signals the same way every time.

The price upset was the fun headline. This is the actual thesis. So here it is on its own.
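
To make the split concrete, here is a minimal sketch of "LLM scores, code decides." It is an illustration under assumptions, not my production code: the four signal names, the 0-3 scale, the thresholds, and the labels are all hypothetical, and `score_signals` uses keyword checks as a stand-in for the real model call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    # Hypothetical signal names, each scored 0-3 (assumed scale).
    urgency: int
    needs_reply: int
    sender_known: int
    looks_bulk: int


def score_signals(email_text: str) -> Signals:
    """Placeholder for the LLM call: keyword checks stand in for the model.

    In a real system this would prompt the model for the four scores and
    parse them; the model never returns a final label.
    """
    text = email_text.lower()
    return Signals(
        urgency=3 if "urgent" in text else 0,
        needs_reply=2 if "?" in text else 0,
        sender_known=1,
        looks_bulk=3 if "unsubscribe" in text else 0,
    )


def decide(s: Signals) -> str:
    """Plain, deterministic rules own the final label (thresholds assumed)."""
    if s.looks_bulk >= 2:
        return "archive"
    if s.urgency >= 2 and s.needs_reply >= 1:
        return "respond_now"
    if s.needs_reply >= 1:
        return "respond_later"
    return "read_later"


def classify(email_text: str) -> str:
    # The model is called on every email, but only to produce scores.
    return decide(score_signals(email_text))
```

Calling `classify("URGENT: can you approve this today?")` runs the scorer and then the rules. The label comes out of `decide`, so the same scores always produce the same answer, whichever model filled them in.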