A language model with 3 billion parameters just matched the reasoning performance of systems that are 200 times its size. The team behind it doesn’t work at OpenAI, Google DeepMind, or Anthropic. They work at a microblogging company.

Sina Weibo, the Chinese social media platform most people associate with viral posts rather than frontier AI research, published a 14-page technical report on arXiv detailing VibeThinker-3B. The model scored 94.3 on AIME 2026, one of the most demanding standardized math competitions in the world, placing it alongside DeepSeek V3.2 and its 671 billion parameters.

Small model, big numbers

The benchmark results tell the story. On AIME 2026, VibeThinker-3B hit 94.3, a score that climbs to 97.1 when using claim-level test-time scaling. On LiveCodeBench v6, a coding benchmark, it posted a Pass@1 score of 80.2. The model also demonstrated superior out-of-distribution performance on recent LeetCode contests, often matching or beating those much larger systems.

The model is built on top of Qwen2.5-Coder-3B as its base architecture. The Sina Weibo team, comprising nine researchers including Sen Xu, Shixi Liu, and Wei Wang, enhanced performance through a combination of curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation techniques. The paper also introduces the Parametric Compression-Coverage Hypothesis, which offers a theoretical framework for why smaller models can punch above their weight in structured reasoning tasks.