When recall plateaus: the late-interaction technique most teams skip

A founder we work with had been stuck on the same problem for two months. Their RAG retrieval recall was sitting at 58%. They had tried OpenAI's text-embedding-3-small, then text-embedding-3-large, then BGE-M3, then Voyage. Each swap added a couple of points, then the curve flattened. The team was about to start fine-tuning their own embedding model.

We told them to stop and add a reranker first. The number went from 58% to 81% in a single afternoon. The fine-tuning project was cancelled.

This is the moment most teams discover that the bottleneck was never the embedding model. It was the architectural decision to use a single embedding per chunk in the first place. Late interaction is the family of techniques that fixes it, and it is the one most teams skip because the name sounds intimidating.

What a single embedding per chunk loses

A bi-encoder (which is what every standard embedding model is) takes a chunk of text, compresses it into a single fixed-length vector, and stores it. At query time, the user's question is also compressed into a single vector, and similarity is computed between the two.
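To make the compression step concrete, here is a minimal sketch of bi-encoder scoring. It is an illustration under assumptions, not any particular model's implementation: the token vectors are random stand-ins for a real encoder's output, the dimension of 8 is arbitrary, and mean pooling is assumed (real models may pool differently, for example with a dedicated summary token). The helper names `pool` and `bi_encoder_score` are hypothetical.

```python
import numpy as np

def pool(token_vectors: np.ndarray) -> np.ndarray:
    """Collapse a (num_tokens, dim) matrix into one unit-length vector.

    Mean pooling is an assumption made for this sketch.
    """
    v = token_vectors.mean(axis=0)
    return v / np.linalg.norm(v)

def bi_encoder_score(query_tokens: np.ndarray, chunk_tokens: np.ndarray) -> float:
    """Cosine similarity between the two pooled vectors."""
    return float(pool(query_tokens) @ pool(chunk_tokens))

# Random stand-ins for encoder output: a 6-token query and a 120-token chunk.
rng = np.random.default_rng(0)
query_tokens = rng.normal(size=(6, 8))
chunk_tokens = rng.normal(size=(120, 8))

# Indexing time: the whole chunk becomes a single stored vector.
chunk_vector = pool(chunk_tokens)

# Query time: one number summarizes the match between query and chunk.
score = bi_encoder_score(query_tokens, chunk_tokens)
```

Note that the 120 token vectors of the chunk are reduced to one vector of the same dimension before the query is ever seen, so the score cannot depend on which individual tokens matched.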
