Mixture-of-experts (MoE) models have quickly become a foundational component of modern, large-scale AI systems. They are widely adopted because they enable substantially larger model capacity while activating only a subset of parameters for each token, offering an unparalleled approach for scaling performance within a practical compute budget. As model scales continue to grow, the optimization of these blocks becomes critical for maximizing training throughput.

To push these boundaries, we are introducing advanced fused MLP kernels for dense and MoE models, custom-built with the CuTe DSL. By tackling inherent memory and synchronization bottlenecks, these new kernels deliver an impressive 1.3x–2x kernel-level speedup over unfused paths while enabling sync-free MoE execution for full-iteration CUDA Graphs.

In NVIDIA’s full-stack DeepSeek-V3 pre-training setup, this optimization contributes an 8% end-to-end performance improvement. Similarly for the GPT-OSS pre-training setup, this optimization contributes a 93% end-to-end performance improvement. Whether you want to slash training times or optimize hardware utilization, these kernels are available today in the cuDNN Frontend and can be seamlessly accessed through Transformer Engine and Megatron-Core.