This article is part of our coverage of the latest in AI research.

Researchers at Stanford University and Nvidia have developed a new model architecture and training technique that lets language models handle very long context tasks without blowing up memory and compute costs. Their technique tackles the challenge of “continual learning,” in which models adapt to changing information in dynamic environments rather than remaining static after their initial training.

Their method, which they call “End-to-End Test-Time Training for Long Context” (TTT-E2E), frames language modeling as a continual learning problem in which the model actively updates its own parameters during inference. The technique also changes the transformer architecture so that it no longer needs to cache the attention keys and values of every token in the input sequence. The result is a best-of-both-worlds situation. Their experiments show that on 128k context tasks, the model achieves the accuracy of full-attention transformers while running 2.7x faster than them, matching the speed of linear-attention models such as Mamba 2.

To understand the significance of this approach, it is necessary to look at the current tradeoff between accuracy and efficiency when working on longer contexts. Full-attention transformers are currently the gold standard for accuracy because they are designed to recall every token in the input sequence.
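A rough back-of-the-envelope calculation shows why recalling every token gets expensive. The sketch below estimates the size of the key-value cache a full-attention transformer keeps during inference. The model dimensions (32 layers, 8 key-value heads, head dimension 128, 16-bit values) are hypothetical, picked only to be in the range of a mid-sized model, and "128k" is taken here to mean 128 × 1024 tokens.

```python
def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate KV-cache size for one sequence.

    All default dimensions are illustrative assumptions, not figures
    from the paper. The leading 2 accounts for storing both a key and
    a value vector per token, per head, per layer.
    """
    return 2 * context_len * n_layers * n_kv_heads * head_dim * bytes_per_value


GIB = 2 ** 30

for tokens in (8 * 1024, 32 * 1024, 128 * 1024):
    size = kv_cache_bytes(tokens)
    print(f"{tokens:>7} tokens -> {size / GIB:.1f} GiB")
```

Under these assumptions the cache grows linearly with context length, from 1 GiB at 8k tokens to 16 GiB at 128k tokens, and every newly generated token has to attend over all of it.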