How test-time training allows models to ‘learn’ long documents instead of just caching them - TechTalks
By treating language modeling as a continual learning problem, the TTT-E2E architecture achieves the accuracy of full-attention Transformers on 128k context tasks while matching the speed of linear models.