Mixture of Experts (MoE): what it actually does under the hood, and when it pays off

You deployed a 7B model in production. Response times are fine — 45 ms per token — but you want to scale to a 70B without buying four more GPUs. Someone mentions MoE: "70B performance at 7B compute." It sounds like free lunch. So you look at the Mixtral 8x7B paper, you see 45 billion parameters and a claim that each token only activates about 13 billion of them, and you wonder: how is that physically possible, and what is the catch?

This post explains the sparse MoE architecture that powers Mixtral, DeepSeek-MoE, Qwen2.5-MoE, DBRX, and Grok-1: what the router actually does, why load-balancing is the hardest problem in training them, and the three specific constraints that determine whether MoE is the right choice for your deployment.

Why the distinction between total parameters and active parameters matters

A dense transformer (like Llama 3.2) activates 100 percent of its parameters for every token. The FFN layer in each transformer block runs the same matrix multiplication for every input. This makes memory use predictable and throughput easy to model, but it also means that scaling from 7B to 70B multiplies both memory and compute by 10x.