Boosting MoE Training Throughput with Advanced Fusion Kernels | NVIDIA Technical Blog

Mixture-of-experts (MoE) models have quickly become a foundational component of modern, large-scale AI systems. They are widely adopted because they enable substantially larger model capacity while activating only a subset of parameters for each token, so model quality can scale within a practical compute budget. As models continue to grow, optimizing MoE blocks becomes critical for maximizing training throughput.
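To make "activating only a subset of parameters for each token" concrete, here is a minimal pure-Python sketch of top-k expert routing. It is an illustration only, not the kernels described in this post: the names `route_top_k`, `moe_forward`, and `make_expert` are hypothetical, the experts are toy scaling functions rather than MLPs, and renormalizing the top-k router probabilities is one common convention that is assumed here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(token, experts, router_logits, k):
    """Run only the k selected experts and combine their outputs."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](token)  # the other experts are never evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out


def make_expert(scale):
    """Hypothetical toy expert: scales its input (stands in for an MLP)."""
    return lambda x: [scale * v for v in x]


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
router_logits = [0.1, 2.0, 0.3, 2.0]  # assumed router scores for this token

selected = route_top_k(router_logits, k=2)
output = moe_forward(token, experts, router_logits, k=2)
print(selected, output)
```

With four experts and `k=2`, only half of the expert parameters are touched for this token, which is why total capacity can grow faster than per-token compute.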

To push these boundaries, we are introducing advanced fused MLP kernels for dense and MoE models, custom-built with the CuTe DSL. By tackling inherent memory and synchronization bottlenecks, these new kernels deliver an impressive 1.3x–2x kernel-level speedup over unfused paths while enabling sync-free MoE execution for full-iteration CUDA Graphs.

In NVIDIA’s full-stack DeepSeek-V3 pre-training setup, this optimization contributes an 8% end-to-end performance improvement. Similarly, in the GPT-OSS pre-training setup, it contributes a 93% end-to-end performance improvement. Whether you want to cut training times or improve hardware utilization, these kernels are available today in the cuDNN Frontend and can be accessed through Transformer Engine and Megatron-Core.
