As context windows grow longer, moving large model weights efficiently becomes critical to performance. A common way to address this is quantization, an optimization technique that compresses model weights into a smaller data format. One such format is NVFP4, a 4-bit floating-point format introduced with the NVIDIA Blackwell architecture.
That’s the approach behind our new Nemotron 3 Ultra NVFP4 checkpoint: we quantized the model into NVFP4 using NVIDIA Model Optimizer. The result is a model that achieves up to 5.9x higher inference throughput than the GLM-5.1 754B FP4 model on decode-heavy workloads while matching BF16 accuracy across nearly every benchmark, as shown in Figure 1.
While the performance benefits of NVFP4 are well understood, the process of producing a high-quality NVFP4 checkpoint is not. This post walks through how we quantized Nemotron 3 Ultra (550B) to NVFP4 with NVIDIA Model Optimizer, and shows developers how to generate the best quantized checkpoints for their own models.
Figure 1. Performance of Nemotron 3 Ultra NVFP4 compared to other NVFP4 models
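To make the idea of 4-bit floating-point quantization concrete, here is a minimal sketch in plain Python. It is not the Model Optimizer implementation. It assumes the commonly described NVFP4 layout of E2M1 elements that share one scale factor per 16-element block, and it simplifies in two labeled ways: the block scale is kept in full precision rather than stored in FP8, and ties round toward the smaller magnitude. The function names `quantize_block` and `fake_quantize` are hypothetical, chosen only for illustration.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumption: one shared scale factor per 16-element block


def quantize_block(block):
    """Return (scale, codes), where each code is a signed E2M1 value."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest E2M1 value (6.0).
    # Simplification: the scale stays a Python float instead of being stored in FP8.
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        # Pick the nearest representable magnitude (ties go to the smaller one).
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - target))
        codes.append(-nearest if x < 0 else nearest)
    return scale, codes


def fake_quantize(weights):
    """Quantize, then dequantize, a flat list of weights block by block."""
    restored = []
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        scale, codes = quantize_block(block)
        restored.extend(code * scale for code in codes)
    return restored


# Toy "weights": 32 evenly spaced values, i.e. two blocks of 16.
weights = [0.01 * i - 0.15 for i in range(32)]
restored = fake_quantize(weights)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"max abs error: {max_error:.4f}")
```

Because each block gets its own scale, a single outlier only degrades the resolution of the 16 values that share its block, which is one reason block-scaled formats can stay close to higher-precision accuracy.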
The Nemotron 3 Ultra NVFP4 checkpoint
