TLDR; It’s true - direct scaling is showing signs of diminishing returns. However, a missing part of the debate is that massive scale is perhaps unavoidable. Rather, large base models are the critical infrastructure required to distill, train, and validate the efficient small models (or alternate architectures) of the future. In other words, efficiency gains we see today, including distillation, synthetic data, RLHF, and GRPO, are not alternatives to scaling. They are dividends from the scaling investments that preceded them.As investments in AI explode (Big Tech expected to spend 300 Billion on AI in 2025), data centers are built, and most industry labs seem entrenched in the view that more GPUs is better, there is a natural question: is the quest to throw compute at the problem valid? Or better still, should it be the only approach?There seem to be two camps. On one side are folks who agree that scaling has got us this far when nothing else did, and will likely be the gift that keeps on giving. A faction of the bitter lesson believers [1]. On the other are those who argue it’s a dead end, either due to the clear unsustainability of it all, or because there are fundamental flaws in the current transformer architecture itself that mean we can’t get much farther than we are today. This second camp often points to new approaches, such as world models and alternative architectures, as the path forward, or point to emerging cracks or instability in the scaling laws.As I think about both, I am left with the idea that there is some nuance that is worth clarifying. We probably should not try to scale ad infinitum, but at the same time, scale is unavoidable in that its more of a design feature than a bug.This post is in part a reflection inspired by Sara Hooker’s essay, On Slow Death of Scaling [2], where she argues that recent developments show cracks in the scaling laws that have enabled progress thus far, and that it is time to invest in other areas to extract performance gains: data quality, improvements in system scaffolding (agents), and a focus on UX. The reasons cited include diminishing returns on scaling, unreliable scaling laws, and the availability of surgical synthetic data generation for training smaller, higher-quality models.One of the arguments Sara makes on the unreliability of scaling laws is that they predict test loss, not downstream capabilities. When you measure what models actually do, the scaling law predictions sometimes break down, which means companies betting everything on scale are probably under-investing elsewhere.I agree with everything Sara says here (she’s great and definitely read more of her writing).However, the minor challenge I see is that a reader of that article might walk away thinking we probably do not need the super-scaled models. I think we critically and necessarily do. And that there is a cyclic, chicken-and-egg dependency between the super-scaled models (for lack of a better term) and all of the innovations that help us move away from them.In plain terms: small models are a derivative of the larger models, and probably will never be better than them. This follows a well-established engineering pattern.Several observations point to the constant need for super-scaled models - and how they are (IMO) necessary infrastructure for progress.Knowledge distillation, training smaller “student” models to mimic larger “teacher” models, has become a cornerstone of efficient AI deployment. But the technique has an obvious dependency: you need the teacher first. Even when the goal is to just improve performance (e.g., reasoning models), you still need strong, capable base models.DeepSeek’s work on distilling reasoning capabilities validated this directly: “reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to reasoning patterns discovered through RL on small models” [3]. The implication is clear. You cannot bootstrap reasoning in small models without first having large models that possess it.The synthetic data revolution, which enables training smaller, specialized models on carefully curated generated data, depends entirely on capable generators. Research on self-improvement methods shows that when models generate their own training data, they are “only limited by the best model available” [4]. This creates a ceiling: your synthetic data is only as good as your largest model.The industry pattern is consistent. NVIDIA’s Nemotron-4 340B, IBM’s LAB methodology, and Mixtral-8x7B all use large teacher models as the source for synthetic training data. The small, efficient models that result are derivatives of the large ones that generated their training signal.Reinforcement learning from human feedback (RLHF) and techniques like Group Relative Policy Optimization (GRPO) have dramatically improved model capabilities. But these techniques are refinements, not foundations.As Nathan Lambert notes in The RLHF Book: “Effective RLHF requires a strong starting point, so RLHF cannot be a solution to every problem alone” [5]. OpenAI’s research on weak-to-strong generalization found that “weak-to-strong generalization is particularly poor for ChatGPT reward modeling... naive RLHF will likely scale poorly to superhuman models without additional work” [6].GRPO, which enabled DeepSeek-R1’s impressive reasoning capabilities, works by sampling multiple completions and using group-relative rewards. But this requires a base model strong enough that variance exists in sample quality and some samples actually reach correct answers. DeepSeek-R1-Zero, often cited as evidence that pure RL can discover reasoning, started from DeepSeek-V3-Base, a 671-billion parameter model [3, 7]. The “pure RL without supervised fine-tuning” innovation was only possible because they had already paid the massive scaling tax.The relationship looks something like this:Large Scaled Models (the "inefficient" phase)