How test-time training allows models to ‘learn’ long documents instead of just caching them - TechTalks

This article is part of our coverage of the latest in AI research.

Researchers at Stanford University and Nvidia have developed a new model architecture and training technique that lets language models handle very long contexts without blowing up memory and compute costs. Their technique treats the problem as one of “continual learning,” in which models adapt to changing information in dynamic environments rather than remaining static after their initial training.

Their method, which they describe as “End-to-End Test-Time Training for Long Context” (TTT-E2E), frames language modeling as a continual learning problem in which the model actively updates its own parameters during inference. The technique also changes the transformer architecture so that it does not need to cache attention keys and values for every token in the input sequence. The result is a best-of-both-worlds tradeoff: in the researchers' experiments on 128k-token contexts, the model achieves the accuracy of full-attention transformers while being 2.7x faster, matching the speed of linear-attention models such as Mamba 2.

To understand the significance of this approach, it is necessary to look at the current tradeoff between accuracy and efficiency when working on longer contexts. Full-attention transformers are currently the gold standard for accuracy because they are designed to recall every token in the input sequence.
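To make the memory side of that tradeoff concrete, here is a back-of-the-envelope Python sketch of how a full-attention key/value cache grows with context length. The model shape (32 layers, 8 key/value heads, head dimension 128, 2 bytes per value) is a hypothetical configuration, not one taken from the research, and "128k" is assumed here to mean 128 × 1024 tokens.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate key/value cache size for one sequence under full attention."""
    # Factor of 2: one key vector and one value vector per token, per layer, per head.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value

# Hypothetical model shape, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for num_tokens in (8 * 1024, 128 * 1024):
    gib = kv_cache_bytes(num_tokens, **config) / 2**30
    print(f"{num_tokens:>7} tokens -> {gib:.1f} GiB of cache")
```

The cache scales linearly with the number of tokens, so a context 16 times longer needs 16 times the memory, and every new token must attend over all of it.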
