Mixture of Experts (MoE): what it actually does under the hood, and when it pays off

You deployed a 7B model in production. Response times are fine (45 ms per token), but you want to scale to a 70B model without buying four more GPUs. Someone mentions MoE: "70B performance at 7B compute." It sounds like a free lunch. So you look at the Mixtral 8x7B paper, you see about 47 billion parameters and a claim that each token only activates about 13 billion of them, and you wonder: how is that physically possible, and what is the catch?

This post explains the sparse MoE architecture that powers Mixtral, DeepSeek-MoE, Qwen2.5-MoE, DBRX, and Grok-1: what the router actually does, why load balancing is the hardest problem in training these models, and the three specific constraints that determine whether MoE is the right choice for your deployment.

Why the distinction between total parameters and active parameters matters

A dense transformer (like Llama 3.2) activates 100 percent of its parameters for every token. The FFN layer in each transformer block applies the same weight matrices to every token, whatever its content. This makes memory use predictable and throughput easy to model, but it also means that scaling from 7B to 70B multiplies both weight memory and per-token compute by roughly 10.
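
To make the total-versus-active gap concrete, here is a minimal back-of-the-envelope sketch in Python. The configuration values (hidden size 4096, expert FFN size 14336, 32 layers, 8 experts with 2 routed per token, 32 query heads, 8 key/value heads, vocabulary 32000) are assumptions recalled from the publicly released Mixtral 8x7B configuration, not numbers stated in this post, and the names `MoEConfig` and `param_counts` are made up for illustration. The count assumes three weight matrices per expert (a gated FFN) and untied input and output embeddings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    # Assumed values, recalled from the public Mixtral 8x7B config.
    hidden: int = 4096
    ffn_hidden: int = 14336
    layers: int = 32
    experts: int = 8
    top_k: int = 2
    vocab: int = 32000
    heads: int = 32
    kv_heads: int = 8


def param_counts(cfg: MoEConfig) -> tuple[int, int]:
    """Return (total, active-per-token) parameter counts for a sparse MoE."""
    head_dim = cfg.hidden // cfg.heads
    kv_dim = cfg.kv_heads * head_dim

    # Attention: query and output projections are hidden x hidden,
    # key and value projections are hidden x kv_dim (grouped-query attention).
    attn = 2 * cfg.hidden * cfg.hidden + 2 * cfg.hidden * kv_dim
    # One expert is a gated FFN with three hidden x ffn_hidden matrices.
    expert = 3 * cfg.hidden * cfg.ffn_hidden
    # The router is a single linear layer scoring each expert.
    router = cfg.hidden * cfg.experts
    # Two normalization weight vectors per block.
    norms = 2 * cfg.hidden

    # Everything that every token touches, regardless of routing.
    shared = (
        cfg.layers * (attn + router + norms)
        + 2 * cfg.vocab * cfg.hidden  # input embedding + output head
        + cfg.hidden                  # final normalization
    )

    total = shared + cfg.layers * cfg.experts * expert
    active = shared + cfg.layers * cfg.top_k * expert
    return total, active


total, active = param_counts(MoEConfig())
print(f"total:  {total / 1e9:.1f}B")   # total:  46.7B
print(f"active: {active / 1e9:.1f}B")  # active: 12.9B
```

Under these assumptions, the expert FFNs account for nearly all of the total, while attention, embeddings, and the router are shared by every token. That is why all of the weights still have to sit in memory even though each token only pays compute for roughly 13 billion of them.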
