Very small language models (SLMs) can outperform leading large language models (LLMs) in reasoning tasks, according to a new study by Shanghai AI Laboratory. The authors show that with the right tools and test-time scaling techniques, an SLM with 1 billion parameters can surpass a 405-billion-parameter LLM on complex math benchmarks.

The ability to deploy SLMs on complex reasoning tasks can be very useful as enterprises look for new ways to apply these models across different environments and applications.

Test-time scaling (TTS) is the process of giving LLMs extra compute cycles during inference to improve their performance on various tasks. Leading reasoning models, such as OpenAI o1 and DeepSeek-R1, use “internal TTS,” which means they are trained to “think” slowly by generating a long string of chain-of-thought (CoT) tokens.

An alternative approach is “external TTS,” where model performance is enhanced with (as the name implies) outside help. External TTS is suitable for repurposing existing models for reasoning tasks without further fine-tuning them. An external TTS setup is usually composed of a “policy model,” which is the main LLM generating the answer, and a process reward model (PRM) that evaluates the policy model’s answers. These two components are coupled together through a sampling or search method.
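To make that coupling concrete, here is a minimal sketch of one common external TTS strategy, best-of-N sampling, where the policy model proposes several candidate answers and the PRM picks the best one. The functions `generate_candidates` and `score_with_prm` are hypothetical stand-ins, not the paper's actual models or APIs:

```python
# Illustrative sketch of external TTS via best-of-N sampling.
# Both helper functions below are hypothetical placeholders: in a real
# setup they would wrap an actual policy LLM and a trained PRM.

import random
from typing import List, Tuple

def generate_candidates(question: str, n: int) -> List[str]:
    # Hypothetical policy model: in practice, sample n completions
    # from the LLM (with temperature > 0 so candidates differ).
    return [f"candidate answer {i} to: {question}" for i in range(n)]

def score_with_prm(question: str, answer: str) -> float:
    # Hypothetical process reward model: in practice, the PRM scores
    # each reasoning step and the step scores are aggregated.
    return random.random()

def best_of_n(question: str, n: int = 8) -> Tuple[str, float]:
    # Couple the policy model and the PRM: generate n candidates,
    # score each one, and keep the highest-scoring answer.
    candidates = generate_candidates(question, n)
    scored = [(ans, score_with_prm(question, ans)) for ans in candidates]
    return max(scored, key=lambda pair: pair[1])

if __name__ == "__main__":
    answer, score = best_of_n("What is 17 * 24?")
    print(f"selected (score={score:.2f}): {answer}")
```

Spending more compute here simply means raising `n`: more candidates sampled at inference time, more chances for the PRM to find a correct answer, with no change to the underlying model's weights.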