Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure | NVIDIA Technical Blog

As enterprise AI adoption scales, developers are increasingly forced to stitch together fragmented pipelines, with separate models for text, vision, and code. The result is added complexity, higher costs, and slower iteration.

MiniMax M3, available on NVIDIA accelerated infrastructure including NVIDIA Blackwell, changes this by providing a single multimodal model capable of long-context reasoning, agentic workflows, and creative tasks.

The 428B-parameter mixture-of-experts (MoE) model supports a context window of up to 1M tokens and native multimodal input. Developers can build applications such as long-video understanding, extended coding sessions (8+ hours), and high-quality design workflows, all with a unified model and production-ready deployment paths on NVIDIA platforms.

| Specification | Value |
| --- | --- |
| Name | MiniMax M3 |
| Input modalities | Video, image, text |
| Total parameters | 428B |
| Visual encoder parameters | 600M |
| Active parameters | 22B |
| Context length | 1M tokens |
| Experts | 128 total, 4 activated per token |
| Precision format | BF16, MXFP8 |

Table 1. Specifications of MiniMax M3, a VLM MoE model
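To put the Table 1 figures in perspective, the short sketch below works out the share of parameters and experts active per token, along with a rough weight-only memory footprint for each precision format. The parameter and expert counts come from the table. The bytes-per-parameter values are assumptions (2 bytes for BF16, 1 byte for MXFP8 with block-scale overhead ignored), and the estimate leaves out the KV cache, activations, and runtime overhead, so treat it as back-of-the-envelope arithmetic rather than a sizing guide.

```python
# Figures from Table 1.
TOTAL_PARAMS = 428e9
ACTIVE_PARAMS = 22e9
TOTAL_EXPERTS = 128
ACTIVE_EXPERTS = 4

# Assumption: storage cost per parameter. The MXFP8 value ignores the
# small overhead of per-block scale factors.
BYTES_PER_PARAM = {"BF16": 2.0, "MXFP8": 1.0}


def weight_footprint_gb(num_params: float, fmt: str) -> float:
    """Weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


# Share of the model that does work for any single token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS    # about 5.1%
expert_fraction = ACTIVE_EXPERTS / TOTAL_EXPERTS  # 3.125%

if __name__ == "__main__":
    print(f"Active parameters per token: {active_fraction:.1%}")
    print(f"Experts activated per token: {expert_fraction:.3%}")
    for fmt in BYTES_PER_PARAM:
        gb = weight_footprint_gb(TOTAL_PARAMS, fmt)
        print(f"{fmt} weights: ~{gb:.0f} GB")
```

Under these assumptions, every token touches only about one twentieth of the weights, but all 428B parameters still need to be resident in memory, which is why the lower-precision MXFP8 format matters for deployment.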

MiniMax M3’s core architectural innovation is MiniMax Sparse Attention (MSA), which replaces standard quadratic attention with a pre-filtering stage that identifies the relevant context blocks and attends only to those. At the operator level, each KV cache block is read once with contiguous memory access, which makes the operator more than 4x faster than existing sparse attention implementations. At 1M-token context, this yields 1/20th the per-token compute of M2, with 9x faster prefill and 15x faster decoding, all without compressing keys and values or sacrificing precision. The model is also trained natively on text, images, and video from step 0 across ~100 trillion interleaved tokens, rather than having multimodality added after training.
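The general idea of pre-filtering blocks and then attending only to the selected ones can be illustrated with a small, single-query sketch in plain Python. This is not the MSA implementation: the block scoring rule used here (the query's dot product with each block's mean key), the function name `block_sparse_attention`, and the `block_size` and `top_k` parameters are all assumptions chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Attend only to the top_k KV blocks picked by a cheap pre-filter."""
    # Split the KV cache into contiguous (start, end) blocks.
    blocks = [(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]

    # Pre-filter (assumed heuristic, not MSA's actual rule): score each
    # block by the query's dot product with the block's mean key.
    def block_score(span):
        lo, hi = span
        mean_key = [sum(k[d] for k in keys[lo:hi]) / (hi - lo)
                    for d in range(len(query))]
        return dot(query, mean_key)

    selected = sorted(blocks, key=block_score, reverse=True)[:top_k]
    selected.sort()  # visit chosen blocks in memory order, each once

    # Ordinary softmax attention, restricted to the selected blocks.
    # Keys and values are used uncompressed at full precision.
    idx = [i for lo, hi in selected for i in range(lo, hi)]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in idx]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
           for d in range(len(values[0]))]
    return out, selected


# Toy cache: 8 tokens in blocks of 2; only tokens 4 and 5 match the query.
query = [1.0, 0.0]
keys = [[0.0, 1.0]] * 4 + [[5.0, 0.0]] * 2 + [[0.0, 1.0]] * 2
values = [[float(i)] for i in range(8)]

out, selected = block_sparse_attention(query, keys, values,
                                       block_size=2, top_k=1)
print(selected)  # [(4, 6)]
print(out)       # [4.5]
```

In this toy version the attention step reads `top_k * block_size` keys instead of the full cache, so its cost depends on the number of selected blocks rather than on the context length. The pre-filter here still scans every key to build the block means; a practical implementation would presumably maintain block summaries incrementally, which this sketch does not attempt.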

