Originally published on lavkesh.com

I recall the excitement a year ago when models could hold a million tokens in context. That's roughly 750,000 words, or about ten average-length novels, sitting in a single prompt. The demos were impressive and researchers posted benchmarks, but teams soon realized that having a massive context window and knowing what to do with it are two different problems.

I'm not dismissing the capability; a million tokens in context is a real technical achievement. However, I think there's a version of the conversation happening right now that treats window size as the finish line, and that's worth pushing back on.

The pattern I've seen play out goes like this: a team gets access to a long-context model, loads in a large document or codebase, sends a query, and gets back results that are okay, sometimes good, and frustratingly hard to diagnose when they fall short. The model technically saw everything in the prompt, but whether it used the right parts is a different question entirely.

Researchers have identified a phenomenon called 'lost in the middle,' where models tend to pay disproportionate attention to content at the beginning and end of a context window and underweight material in the middle. So if you're feeding in a 200-page document and the critical detail is on page 94, the model may overlook it even though it is technically in the prompt.
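One way to check whether this affects your own setup is to plant a known fact at different depths in otherwise irrelevant text and ask the same question each time. The Python sketch below only builds those test prompts. It does not call any model, and the function names, the filler text, and the "access code" fact are all made up for illustration. Treat it as a starting point under those assumptions, not a rigorous benchmark.

```python
def build_probe(filler_paragraphs, needle, depth):
    """Return one prompt with `needle` inserted at a relative depth.

    depth=0.0 puts the needle first, depth=1.0 puts it last.
    (Hypothetical helper, not part of any library.)
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = int(depth * len(filler_paragraphs))
    paragraphs = filler_paragraphs[:index] + [needle] + filler_paragraphs[index:]
    return "\n\n".join(paragraphs)


def build_probe_set(filler_paragraphs, needle, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Build one prompt per depth so answers can be compared by position."""
    return {depth: build_probe(filler_paragraphs, needle, depth) for depth in depths}


# Illustrative usage with made-up filler and a made-up fact.
filler = [f"Filler paragraph {i}." for i in range(100)]
needle = "The access code is 7421."
probes = build_probe_set(filler, needle)
```

Send each prompt in `probes` to your model with the same question (here, "What is the access code?") and compare the answers by depth. If accuracy dips for the middle depths, position is a likely factor in the results you're struggling to diagnose.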