Stanford, MIT, Harvard, Anthropic study reveals why larger models learn rare tasks better

There’s a persistent question in AI development that sounds deceptively simple: why do bigger models just… work better? Not incrementally better. Qualitatively better, picking up skills that smaller models never seem to learn at all. A new paper from researchers at Stanford, Harvard’s Kempner Institute, MIT, and Anthropic finally offers a mechanistic answer, and it has real implications for how the industry thinks about scaling.

The study, titled “Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention” and published on arXiv (2605.29548), pinpoints a phenomenon called reduced gradient interference as the core reason larger models outperform smaller ones on rare and complex tasks. In plain English: bigger models get the easy stuff out of the way early, which frees up capacity for harder lessons to actually stick.

The gradient interference problem

In neural networks, gradient updates from frequent tasks are strong and persistent. They dominate the training process. Rare tasks produce weaker gradient signals that get overwritten in smaller models before they can solidify into learned behavior. The researchers found that larger models sidestep this problem through a specific sequence of events during training.
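To make the idea concrete, here is a toy sketch of gradient interference in Python with NumPy. It is not the paper's method or code. The two-parameter setup, the vectors, the 5% rare-task frequency, and the function names are all assumptions chosen for illustration. Each vector is a descent direction (the direction a task "wants" the weights to move), and the average step is the frequency-weighted mix of the two.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity; a negative value means the two tasks pull against each other."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def expected_step(d_frequent, d_rare, p_rare):
    """Average update direction when the rare task shows up with probability p_rare."""
    return (1.0 - p_rare) * d_frequent + p_rare * d_rare

def rare_progress(d_frequent, d_rare, p_rare):
    """Projection of the average step onto the rare task's descent direction.
    Positive: the rare task improves on average. Negative: it is being overwritten."""
    return float(np.dot(expected_step(d_frequent, d_rare, p_rare), d_rare))

# Hypothetical descent directions (illustrative numbers, not from the paper).
d_frequent = np.array([1.0, 0.0])

# Assumed "small model" case: the tasks share parameters and conflict.
d_rare_conflicting = np.array([-0.6, 0.8])

# Assumed "larger model" case: the rare task has its own direction to move in.
d_rare_orthogonal = np.array([0.0, 1.0])

interference = cosine(d_frequent, d_rare_conflicting)
progress_conflicting = rare_progress(d_frequent, d_rare_conflicting, p_rare=0.05)
progress_orthogonal = rare_progress(d_frequent, d_rare_orthogonal, p_rare=0.05)

print(f"cosine between tasks (conflicting case): {interference:.2f}")
print(f"rare-task progress, conflicting: {progress_conflicting:.2f}")
print(f"rare-task progress, orthogonal:  {progress_orthogonal:.2f}")
```

In the conflicting case the average step points away from what the rare task needs, so its signal is undone faster than it accumulates. In the orthogonal case the same rare signal, at the same low frequency, makes slow but steady progress.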
