A founder we work with had been stuck on the same problem for two months. Their RAG retrieval recall was sitting at 58%. They had tried OpenAI's text-embedding-3-small, then text-embedding-3-large, then BGE-M3, then Voyage. Each swap added a couple of points, then the curve flattened. The team was about to start fine-tuning their own embedding model.

We told them to stop and add a reranker first. The number went from 58% to 81% in a single afternoon. The fine-tuning project was cancelled.

This is the moment most teams discover that the bottleneck was never the embedding model. It was the architectural choice of representing each chunk with a single embedding in the first place. Late interaction is the family of techniques that fixes it, and most teams skip it because the name sounds intimidating.

What a single embedding per chunk loses

A bi-encoder (which is what every standard embedding model is) takes a chunk of text, compresses it into a single fixed-length vector, and stores it. At query time, the user's question is compressed into a single vector the same way, and a similarity score (typically cosine similarity or a dot product) is computed between the two.
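
To make the data flow concrete, here is a minimal sketch of the bi-encoder pattern in plain Python. The encoder is a toy stand-in: `toy_token_vector`, `encode`, `DIM`, and the sample chunks are all hypothetical names invented for illustration, and the hash-based vectors carry no real semantics. What the sketch does show accurately is the shape of the pipeline: every text, however long, is pooled down to one fixed-length vector before any comparison happens.

```python
import hashlib
import math

DIM = 16  # toy dimensionality (assumption); real models use hundreds or thousands


def toy_token_vector(token: str) -> list[float]:
    """Hypothetical stand-in for a learned token embedding.

    Derives a deterministic pseudo-random vector from a hash of the token.
    It has no semantic meaning; it only gives us numbers to pool.
    """
    digest = hashlib.sha256(token.lower().encode("utf-8")).digest()
    return [(b - 127.5) / 127.5 for b in digest[:DIM]]


def encode(text: str) -> list[float]:
    """Compress a whole text into ONE fixed-length vector via mean pooling."""
    tokens = text.split()
    if not tokens:
        raise ValueError("cannot encode empty text")
    vectors = [toy_token_vector(t) for t in tokens]
    # Average each dimension across all tokens: per-token detail is merged here.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Indexing time: each chunk becomes exactly one stored vector.
chunks = [
    "Refunds are processed within five business days of the request",
    "Our office is closed on public holidays",
]
index = [encode(chunk) for chunk in chunks]

# Query time: the question is compressed the same way, then compared.
query_vector = encode("how long do refunds take")
scores = [cosine(query_vector, chunk_vector) for chunk_vector in index]
best = max(range(len(chunks)), key=lambda i: scores[i])
print(chunks[best], scores[best])
```

Note that the chunk vector has the same length whether the chunk is five words or five hundred. Everything the query can match against has to survive that one pooling step.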