Does Depth Actually Help Reasoning? A Tiny Experiment on 2× T4

TL;DR: I pretrained two decoder-only Transformers from scratch on 840 chain-of-thought conversations, changing only one thing: the number of attention layers (1 vs 12). The 12-layer model reached 22% lower training loss (training perplexity 43.5 → 18.7) with only 32% more parameters. Small experiment, clear signal: depth helps the model fit reasoning patterns.
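The headline numbers can be sanity-checked with a few lines of arithmetic. The sketch below assumes the reported perplexity is the exponential of the mean per-token cross-entropy loss (the usual convention, though the post does not spell it out). The helper names are hypothetical. The last function additionally assumes that the total parameter count splits into a shared part (embeddings, output head) plus identical per-layer blocks, which is a simplification and not something stated in the post.

```python
import math


def loss_from_perplexity(ppl: float) -> float:
    """Mean per-token cross-entropy (in nats) implied by a perplexity value."""
    return math.log(ppl)


def relative_drop(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


def per_layer_share(n_small: int, n_large: int, param_ratio: float) -> float:
    """Size of one layer relative to the shared (non-layer) parameters.

    Assumption (not from the post): total = shared + n_layers * per_layer.
    Solving (shared + n_large * p) / (shared + n_small * p) = param_ratio
    for p / shared gives the expression below.
    """
    return (param_ratio - 1.0) / (n_large - param_ratio * n_small)


# Numbers quoted in the TL;DR.
ppl_1_layer, ppl_12_layer = 43.5, 18.7

loss_1 = loss_from_perplexity(ppl_1_layer)
loss_12 = loss_from_perplexity(ppl_12_layer)

loss_drop = relative_drop(loss_1, loss_12)           # about 0.22
ppl_drop = relative_drop(ppl_1_layer, ppl_12_layer)  # about 0.57
layer_share = per_layer_share(1, 12, 1.32)           # about 0.03

print(f"loss: {loss_1:.3f} -> {loss_12:.3f} ({loss_drop:.1%} lower)")
print(f"perplexity: {ppl_1_layer} -> {ppl_12_layer} ({ppl_drop:.1%} lower)")
print(f"one layer is roughly {layer_share:.1%} of the shared parameters")
```

Two things fall out of this. The 22% figure refers to the loss, not the perplexity, which drops by a much larger fraction; the relative loss drop is the same whichever log base is used, since the base cancels in the ratio. And under the stated assumption, going from 1 to 12 layers adds only 32% more parameters because each layer is small next to the shared parameters, so most of the budget sits outside the attention stack.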

The question

Modern LLMs stack dozens of attention layers. The intuition is that reasoning isn't a one-shot operation — it's iterative: look, combine, refine, conclude. But is that actually visible in a controlled experiment, or is it just folklore that "bigger = deeper = better"?

I wanted to test it directly on the smallest setup that could possibly show the effect: chain-of-thought data, from-scratch pretraining, two Kaggle T4s, one variable.
