Nearly Optimal Attention Coresets | Pinecone

We consider the problem of estimating the Attention mechanism in small space, and prove the existence of coresets for it of nearly optimal size. Specifically, we show that for any set of unit-norm keys and values in , there exists a subset of size at most such thatsimultaneously for all queries whose norm is bounded by . This outperforms the best known results for this problem. We also offer an improved lower bound showing that -coresets must have size .tldr:Can we compress LLM's KV caches without breaking the attention mechanism? The answer is Yes.We prove that all KV caches admit small corests. The bound we get is nearly optimal.The Memory BottleneckMost modern Large Language Models rely on the Attention mechanism to manage knowledge and context. They store previously seen tokens in memory as vector arrays called a KV Cache. As context windows grow, this KV cache becomes massive. It eats up expensive GPU memory and significantly slows down text generation, creating a major bottleneck for long-context AI.Compression with coresetsA coreset, in general, is a highly representative subset of your data. The quality of a coreset is measured by two opposing objectives. First, it needs to be as small as possible. Smaller coresets consume less memory and compute resources. Second, they need to be accurate. Using the coreset, you expect to get almost the same result as using the whole dataset. The way to connect the two is to define an acceptable error tolerance and then find the smallest coreset that can achieve it. Coreset optimality is measured by the relationship between the error tolerance and the coreset size.This ResultThis paper shows that every KV cache admits a small coreset (a subset of keys and values) such that the attention vector computed only on the coreset is provably close to the attention computed on the entire KV store. And this holds simultaneously for all queries whose norm is bounded. The achieved coreset size improves on previously known bounds and (almost) matches the coreset lower bound.Why You Should CareIf you're an AI enthusiast, you know that scaling long context memory is the current frontier of AI. This tells AI researchers and engineers two important facts. One, that they don't necessarily need to invent entirely new attention architectures to solve the memory problem. Two, that compression by pruning is mathematically viable and provably effective. This also partly explains the empirical success of recent context pruning techniques.

Nearly Optimal Attention Coresets | Pinecone

Nearly Optimal Attention Coresets | Pinecone

Other newsrooms on this story

Related reading

A Flexible Resource for Top-Weighted Comparisons Between Sets and Rankings |…

Efficient Constant-Space Multi-Vector Retrieval | Pinecone

ColBERT-serve: Efficient Multi-Stage Memory-Mapped Scoring | Pinecone

Learn | Pinecone

Full Observability for Pinecone: Introducing an Open-Source Monitoring Stack…

Pinecone Nexus: The Knowledge Engine for Agents | Pinecone

Other newsrooms on this story

Related reading

A Flexible Resource for Top-Weighted Comparisons Between Sets and Rankings |…

Efficient Constant-Space Multi-Vector Retrieval | Pinecone

ColBERT-serve: Efficient Multi-Stage Memory-Mapped Scoring | Pinecone

Learn | Pinecone

Full Observability for Pinecone: Introducing an Open-Source Monitoring Stack…

Pinecone Nexus: The Knowledge Engine for Agents | Pinecone