Latent Context Language Models achieve 16x input compression without accuracy loss

AI models have a memory problem. The longer they run, the more tokens pile up from documents, reasoning traces, and conversation history. All that accumulated context demands more compute and more memory, which means slower responses and higher costs.

A research team spanning NYU, Columbia, Princeton, University of Maryland, Harvard, and Lawrence Livermore National Laboratory has published a paper proposing a way to shrink that context. Their approach, called Latent Context Language Models (LCLMs), compresses input context into compact latent embeddings at ratios as high as 16:1, with no accuracy loss on the benchmarks evaluated.

How LCLMs actually work

The architecture pairs a relatively small 0.6-billion-parameter encoder with a beefier 4-billion-parameter decoder. Both went through continued pre-training on over 350 billion tokens. The encoder handles the compression work, squeezing lengthy inputs down to dense representations. The decoder then reasons over those compressed embeddings as if it had the full original context.
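
To make the encoder-to-decoder handoff concrete, here is a toy sketch of the shape change only. The article does not describe the encoder's internals, so the mean pooling below is a placeholder assumption standing in for the learned 0.6B encoder, and the function name `compress` is hypothetical. The point it illustrates is that the decoder receives one latent vector for every `ratio` input tokens.

```python
def compress(token_embeddings, ratio):
    """Toy stand-in for the encoder (assumption: NOT the paper's method).

    Averages each run of `ratio` token vectors into a single latent vector,
    so the output has roughly len(token_embeddings) / ratio rows.
    """
    latents = []
    for start in range(0, len(token_embeddings), ratio):
        chunk = token_embeddings[start:start + ratio]
        dim = len(chunk[0])
        # Element-wise mean over the vectors in this chunk.
        latents.append([sum(vec[i] for vec in chunk) / len(chunk) for i in range(dim)])
    return latents


# 32 dummy token vectors of width 4 (values chosen only for illustration).
tokens = [[float(t)] * 4 for t in range(32)]
latents = compress(tokens, ratio=16)

print(len(tokens), "token vectors ->", len(latents), "latent vectors")
```

In the real system a trained encoder would produce those latent vectors and the 4B decoder would attend over them in place of the original tokens; the sketch only shows how much shorter that sequence becomes.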

The compression supports multiple ratios: 4x, 8x, and 16x. At the maximum 16x compression, the system maintained performance comparable to uncompressed baselines across the benchmarks tested.
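
A quick back-of-the-envelope calculation shows what those ratios mean for the decoder. The 32,000-token context is a hypothetical figure, not one from the paper, and rounding up a partial final chunk is an assumption. The pairwise count assumes standard full self-attention in the decoder, where work grows with the square of the sequence length.

```python
from math import ceil


def latent_length(num_tokens, ratio):
    """Number of latent embeddings for a context of `num_tokens` tokens.

    Assumption: a partial final chunk still yields one embedding (round up).
    """
    return ceil(num_tokens / ratio)


context_tokens = 32_000  # hypothetical context size, for illustration only

lengths = {ratio: latent_length(context_tokens, ratio) for ratio in (4, 8, 16)}

for ratio, length in lengths.items():
    # Full self-attention compares every position with every other position,
    # so the pair count shrinks by the square of the reduction in length.
    pair_reduction = (context_tokens ** 2) // (length ** 2)
    print(f"{ratio}x: {length} latent embeddings, {pair_reduction}x fewer attention pairs")
```

At 16x, the decoder in this example would see 2,000 positions instead of 32,000, which is where the compute and memory savings described earlier would come from.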
