AI models have a memory problem. The longer they run, the more tokens pile up from documents, reasoning traces, and conversation history. All that accumulated context demands more compute and more memory, which means slower responses and higher costs.

A research team spanning NYU, Columbia, Princeton, the University of Maryland, Harvard, and Lawrence Livermore National Laboratory just published a paper proposing a way to shrink that context. Their approach, called Latent Context Language Models (LCLMs), compresses input context into compact latent embeddings at ratios as high as 16:1, with no reported accuracy loss on the evaluated benchmarks.

How LCLMs actually work

The architecture pairs a relatively small 0.6-billion-parameter encoder with a beefier 4-billion-parameter decoder. Both were continually pre-trained on over 350 billion tokens. The encoder handles the compression work, squeezing lengthy inputs down to dense representations. The decoder then reasons over those compressed embeddings as if it had the full original context.
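To make the data flow concrete, here is a minimal toy sketch of the encoder step. It is not the paper's method: the real encoder is a learned 0.6-billion-parameter model, while this stand-in simply averages each block of token vectors (mean pooling). The function name `toy_encode` and the tiny embedding size are hypothetical, chosen only to show how the sequence handed to the decoder gets shorter.

```python
from typing import List

Vector = List[float]


def toy_encode(token_embeddings: List[Vector], ratio: int) -> List[Vector]:
    """Toy stand-in for the encoder (assumption: mean pooling, not the
    learned model from the paper). Each block of `ratio` token vectors
    becomes one latent vector, so the sequence shrinks by about `ratio`."""
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    latents: List[Vector] = []
    for start in range(0, len(token_embeddings), ratio):
        block = token_embeddings[start:start + ratio]
        dim = len(block[0])
        # Average the block dimension by dimension.
        latents.append(
            [sum(vec[i] for vec in block) / len(block) for i in range(dim)]
        )
    return latents


# 64 fake token embeddings of dimension 4 (hypothetical sizes).
tokens = [[float(t)] * 4 for t in range(64)]

for ratio in (4, 8, 16):
    latents = toy_encode(tokens, ratio)
    print(f"{ratio}x: {len(tokens)} token vectors -> {len(latents)} latent vectors")
```

In this sketch the decoder would receive the shorter list of latent vectors in place of the original token embeddings. How the actual decoder consumes them is not described here, so that part is left out.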

LCLMs support compression ratios of 4x, 8x, and 16x. At the maximum 16x setting, the system maintained performance comparable to uncompressed baselines across the benchmarks tested.
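A quick back-of-the-envelope calculation shows what those ratios mean for the decoder. The sketch below assumes a hypothetical 32,000-token input, assumes a partial final block still produces one latent embedding, and assumes standard dense self-attention, where the number of pairwise interactions grows with the square of sequence length. It ignores the cost of running the encoder, so treat the numbers as illustrative rather than as results from the paper.

```python
import math


def latent_length(num_tokens: int, ratio: int) -> int:
    """Number of latent embeddings for `num_tokens` input tokens.
    Assumption: a partial final block still yields one embedding."""
    return math.ceil(num_tokens / ratio)


def attention_pairs(seq_len: int) -> int:
    """Pairwise interactions under dense self-attention (assumption)."""
    return seq_len * seq_len


num_tokens = 32_000  # hypothetical context size
baseline_pairs = attention_pairs(num_tokens)

for ratio in (4, 8, 16):
    length = latent_length(num_tokens, ratio)
    saving = baseline_pairs // attention_pairs(length)
    print(f"{ratio}x: {length} latent embeddings, "
          f"about {saving}x fewer attention pairs in the decoder")
```

Under these assumptions, 16x compression turns 32,000 tokens into 2,000 latent embeddings, and the decoder's attention over that context involves roughly 256 times fewer pairs.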