Getting the most out of your parameters

Traditional scaling laws tell us that to achieve the best performance, we need to scale FLOPs, often with more parameters or data. But as models move to the edge and inference costs skyrocket, we wonder: Can we scale quality without inflating memory footprint?

To that end, we’ve been exploring looped architectures, models that increase compute by passing activations through the same layers multiple times. While promising, these models have been unstable to train. We tackle this issue directly and introduce Parcae, a stable looped architecture that:

- Is better than prior looped models: Parcae achieves up to 6.3% lower validation perplexity than previous large-scale looped recipes.
- Punches above its weight: Our 770M Parcae matches the quality of a 1.3B parameter transformer trained on the same data, achieving the same performance with roughly half the parameters.
- Scales Predictably: We establish the first scaling laws for looping, finding that compute-optimal training requires increasing looping and data in tandem.

Looped models are cool, but hard to train in practice

As models move to the edge and inference deployments take on larger portions of compute, there is an increasing interest in scaling model quality without increasing parameters. One mechanism we have been excited about is layer looping, where initial works have trained looped models that match the quality of larger fixed-depth architectures.

To turn a vanilla Transformer into a looped model, we follow prior work and partition its layers into three functional blocks: a prelude ($\mathcal{P}$), a recurrent ($\mathcal{R}$), and a coda ($\mathcal{C}$). The forward pass works in three stages:

- Embedding: The prelude transforms the input into a latent state $e$.
- Recurrence: The recurrent block iteratively updates a hidden state $h_t$ for $T$ loops.
To maintain the input’s influence, $e$ is injected into each loop, typically via addition <a id="cite-1" href="#ref-1">[1]</a> ($h_{t+1} = \mathcal{R}(h_t + e)$) or concatenation with projection <a id="cite-2" href="#ref-2">[2]</a> ($h_{t+1} = \mathcal{R}(W[h_t; e])$).
- Output: The coda processes the final $h_T$ to generate the model’s output.

Unfortunately, looped models are a headache to train <a id="cite-2b" href="#ref-2">[2]</a><a id="cite-3" href="#ref-3">[3]</a><a id="cite-4" href="#ref-4">[4]</a>. We personally found them to suffer from residual state explosion and loss spikes. What makes looped models even trickier is that the recurrent block is composed of several vanilla Transformer blocks, making it difficult to reason about the source of instability.

Understanding the instability of looping

While instability is a fickle foe, we observed that a simple linear framework captured a significant source of instability. Specifically, we recast looping as a nonlinear time-variant dynamical system over the residual, whose update rule is:

$$h_{t+1} = \overline{A} h_t + \overline{B} e + \overline{\mathcal{R}}(h_t, e)$$

where $\overline{A}, \overline{B}$ perform injection and $\overline{\mathcal{R}}$ is the contribution of the Transformer blocks to the residual stream. For the subquadratic sequence mixing fanatics out there, observe that if we ignore the nonlinear term $\overline{\mathcal{R}}$, the resulting system is a discrete linear time-invariant (LTI) dynamical system over the residual state, across model depth.

What's cool is that for discrete LTI systems, their stability and convergence are determined by the eigenvalues of $\overline{A}$.
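
To make the role of the eigenvalues concrete, here is a minimal sketch of the update rule above with the nonlinear term switched off. It is a toy illustration, not the Parcae implementation: the function name `run_loop`, the zero initial state, the diagonal $\overline{A}$ and $\overline{B}$, and all numeric values are assumptions chosen for readability.

```python
import math

def run_loop(e, a_bar, b_bar, num_loops, block=None):
    """Iterate h_{t+1} = A_bar h_t + B_bar e + R_bar(h_t, e).

    A_bar and B_bar are assumed diagonal and passed as lists of their
    diagonal entries. `block` is a hypothetical stand-in for the nonlinear
    Transformer contribution R_bar; passing None drops it, which leaves
    only the linear time-invariant part of the system.
    """
    h = [0.0] * len(e)  # assumption: the hidden state starts at zero
    norms = []
    for _ in range(num_loops):
        r = block(h, e) if block is not None else [0.0] * len(e)
        h = [a * hi + b * ei + ri
             for a, b, hi, ei, ri in zip(a_bar, b_bar, h, e, r)]
        norms.append(math.sqrt(sum(v * v for v in h)))
    return h, norms

e = [1.0, -2.0, 0.5]          # toy latent state from the prelude
b_bar = [0.1, 0.1, 0.1]

# For a diagonal matrix, the eigenvalues are the diagonal entries.
stable = [0.9, 0.5, 0.8]      # largest |eigenvalue| is 0.9, below 1
unstable = [1.1, 0.5, 0.8]    # largest |eigenvalue| is 1.1, above 1

h_stable, norms_stable = run_loop(e, stable, b_bar, num_loops=200)
h_unstable, norms_unstable = run_loop(e, unstable, b_bar, num_loops=200)

# In the stable case the state settles at the fixed point b * e / (1 - a).
fixed_point = [b * ei / (1.0 - a) for a, b, ei in zip(stable, b_bar, e)]

print("stable state:  ", h_stable)
print("fixed point:   ", fixed_point)
print("unstable norm: ", norms_unstable[-1])
```

With every eigenvalue inside the unit circle the residual state settles to a fixed point, while a single eigenvalue outside it is enough for the norm to grow geometrically with the number of loops.
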
Specifically, stability is categorized using the spectral radius $\rho(\overline{A})$ (i.e., the largest absolute value among the eigenvalues of $\overline{A}$), with stable (convergent) systems having $\rho(\overline{A}) < 1$ and unstable (divergent) systems having $\rho(\overline{A}) > 1$; $\rho(\overline{A}) = 1$ is the marginal case.

While this analysis bypasses the nonlinearities of looping (e.g., Attention and MLP units), the table and figure above confirm that it matters empirically: divergent runs learn a spectral radius of $\rho(\overline{A}) \geq 1$, with convergent runs maintaining $\rho(\overline{A}) < 1$. When we maintain LTI conditions with Parcae, looped models become significantly more robust to hyperparameter selection.

Parcae: A stable, hassle-free looped model

So how do we stabilize? We designed a new looped model, Parcae, which maintains the stability conditions observed in the section above by construction. Specifically, we parameterize the input injection parameters using a continuous formulation $A, B$, which we discretize with zero-order hold (ZOH) and Euler schemes (i.e., $\overline{A} = \exp(\Delta A)$ and $\overline{B} = \Delta B$), using a learned $\Delta \in \mathbb{R}^{d_h}$. We then constrain $A := \texttt{Diag}(-\exp(\texttt{log}_A))$ as a negative diagonal matrix, where $\texttt{Diag}(-\exp(\cdot))$ of a vector enforces negativity and $\texttt{log}_A\in \mathbb{R}^{d_h}$ is our learnable vector. Because every diagonal entry of $\Delta A$ is then negative (for positive $\Delta$), each eigenvalue of $\overline{A}$ lies in $(0, 1)$. This ensures that $\rho(\overline{A}) < 1$!

So, have we fixed all the issues and stabilized looped models? Unfortunately, there were still several other small tricks needed to get clean training of Parcae. For those interested, check out our [paper](link).

Back to language modeling: Scaling up Parcae

Not only does Parcae train more reliably, but we found that it produces a higher-quality model in comparison to prior RDMs.
Using the exact setup of RDMs <a id="cite-2c" href="#ref-2">[2]</a>, a prior looped model, we tested against parameter- and data-matched RDMs, observing that Parcae reduces validation perplexity by up to 6.3%.
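
Returning briefly to the stability construction from the Parcae section: the sketch below checks numerically that the constrained parameterization keeps the spectral radius below 1 for arbitrary values of the learnable vectors. It is a hedged illustration rather than the released code. The helper names `softplus` and `discretize` are hypothetical, $B$ is treated as diagonal for simplicity, and keeping $\Delta$ positive with a softplus is an assumption, since the text above does not say how positivity of $\Delta$ is enforced.

```python
import math
import random

def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def discretize(log_a, delta_raw, b):
    """Return the diagonals of (A_bar, B_bar) from the continuous parameters.

    A = Diag(-exp(log_a)) is negative by construction.
    A_bar = exp(delta * A) follows the ZOH rule; B_bar = delta * B follows Euler.
    Assumption: delta = softplus(delta_raw) so that delta stays positive.
    Assumption: B is diagonal, given as a list of its diagonal entries.
    """
    a_bar, b_bar = [], []
    for la, dr, bi in zip(log_a, delta_raw, b):
        delta = softplus(dr)
        a = -math.exp(la)                  # strictly negative entry of A
        a_bar.append(math.exp(delta * a))  # exp of a negative number: in (0, 1)
        b_bar.append(delta * bi)
    return a_bar, b_bar

random.seed(0)
d_h = 8  # toy hidden size
log_a = [random.uniform(-3.0, 3.0) for _ in range(d_h)]
delta_raw = [random.uniform(-3.0, 3.0) for _ in range(d_h)]
b = [random.uniform(-1.0, 1.0) for _ in range(d_h)]

a_bar, b_bar = discretize(log_a, delta_raw, b)

# A_bar is diagonal, so its spectral radius is the largest |diagonal entry|.
spectral_radius = max(abs(v) for v in a_bar)
print("spectral radius of A_bar:", spectral_radius)
```

Whatever unconstrained values an optimizer assigns to `log_a` and `delta_raw`, the exponent `delta * a` stays negative, so no training step can push the linear part of the recurrence into the divergent regime.
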
Parcae: Doing more with fewer parameters using stable looped models
Parcae is a stable looped language model that matches the quality of a Transformer twice its size — a 770M model reaching 1.3B-level performance. We introduce the first scaling laws for looping and show that increasing recurrence, not just data, is a compute-efficient path to bet















