TL;DR — I pretrained two decoder-only Transformers from scratch on 840 chain-of-thought conversations, changing only one thing: the number of attention layers (1 vs 12). The 12-layer model reached 22% lower training loss (perplexity 43.5 → 18.7) using only 32% more parameters. Small experiment, clear signal: depth helps the model fit reasoning patterns.
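As a quick sanity check on those numbers, the sketch below converts the two reported perplexities back into losses. It assumes perplexity is the exponential of the mean per-token cross-entropy measured in nats, which is the usual convention but is not stated explicitly here. The function and variable names are illustrative only.

```python
import math


def loss_from_perplexity(ppl: float) -> float:
    """Mean cross-entropy in nats, ASSUMING perplexity = exp(loss)."""
    return math.log(ppl)


# Perplexities reported in the TL;DR (1-layer vs 12-layer model).
ppl_shallow, ppl_deep = 43.5, 18.7

loss_shallow = loss_from_perplexity(ppl_shallow)
loss_deep = loss_from_perplexity(ppl_deep)

# Relative reduction in loss, which is what the "22% lower" figure describes
# under the assumption above.
relative_loss_drop = 1.0 - loss_deep / loss_shallow

# Relative reduction in perplexity itself, a different and larger number.
relative_ppl_drop = 1.0 - ppl_deep / ppl_shallow

print(f"1-layer loss:    {loss_shallow:.2f} nats")
print(f"12-layer loss:   {loss_deep:.2f} nats")
print(f"loss drop:       {relative_loss_drop:.1%}")
print(f"perplexity drop: {relative_ppl_drop:.1%}")
```

Under that assumption the loss reduction comes out at roughly 22%, matching the headline figure. The perplexity reduction is much larger in relative terms because perplexity is exponential in the loss, so it is worth being explicit about which of the two a percentage refers to.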
The question
Modern LLMs stack dozens of attention layers. The intuition is that reasoning isn't a one-shot operation — it's iterative: look, combine, refine, conclude. But is that actually visible in a controlled experiment, or is it just folklore that "bigger = deeper = better"?
I wanted to test it directly on the smallest setup that could possibly show the effect: chain-of-thought data, from-scratch pretraining, two Kaggle T4s, one variable.












