There’s a persistent question in AI development that sounds deceptively simple: why do bigger models just… work better? Not incrementally better. Qualitatively better, picking up skills that smaller models never seem to learn at all. A new paper from researchers at Stanford, Harvard’s Kempner Institute, MIT, and Anthropic finally offers a mechanistic answer, and it has real implications for how the industry thinks about scaling.
The study, titled “Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention” and posted to arXiv (2605.29548), identifies reduced gradient interference as the core reason larger models outperform smaller ones on rare and complex tasks. In plain English: bigger models get the easy stuff out of the way early, which frees up space for harder lessons to actually stick.
The gradient interference problem
In neural networks, gradient updates from frequent tasks are strong and persistent, so they dominate the training process. Rare tasks produce weaker gradient signals, and in smaller models those signals get overwritten before they can solidify into learned behavior. The researchers found that larger models sidestep this problem through a specific sequence of events during training.
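To make the idea of interference a little more concrete, here is a toy sketch in Python. It is not the paper's method. It assumes, purely for illustration, that the gradients of a frequent task and a rare task can be stood in for by random Gaussian vectors, and that "model size" is just the length of those vectors. The function names and the sizes 8 and 512 are hypothetical. The sketch measures how much the two gradients overlap (absolute cosine similarity), which is one common way to quantify how strongly an update for one task can disturb another.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_abs_overlap(dim, trials=200, seed=0):
    """Average |cosine| between two random 'task gradients' of size dim.

    Assumption: random Gaussian vectors stand in for real gradients.
    A value near 0 means the two updates barely touch the same directions;
    a value near 1 means they compete for the same directions.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        frequent = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        rare = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(frequent, rare))
    return total / trials


# Hypothetical "small" and "large" parameter counts, chosen for illustration.
small = mean_abs_overlap(dim=8)
large = mean_abs_overlap(dim=512)

print(f"small model overlap: {small:.3f}")
print(f"large model overlap: {large:.3f}")
```

In this toy setup the overlap shrinks as the vectors get longer, because random directions in a higher-dimensional space tend to be closer to orthogonal. Real training gradients are not random, so this only illustrates why extra capacity can leave more room for a weak signal to survive. It says nothing about the training sequence the researchers describe.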