Creating the NVIDIA Nemotron 3 Ultra NVFP4 Checkpoint with NVIDIA Model Optimizer | NVIDIA Technical Blog

As context windows grow longer, moving large model weights efficiently becomes critical to performance. A common way to address this is quantization, an optimization technique that compresses model weights into a smaller data format. One such format is NVFP4, an innovative 4-bit floating-point format introduced with the NVIDIA Blackwell architecture.
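To make the idea of compressing weights into a 4-bit floating-point format concrete, here is a minimal Python sketch of block-wise "fake quantization" onto the E2M1 value grid that NVFP4 uses. It is an illustration under simplifying assumptions, not how Model Optimizer implements it: the function names are hypothetical, and the per-block scale is kept in full precision, whereas real NVFP4 stores it as an FP8 value alongside a per-tensor scale.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across each block of 16 values


def quantize_block(block):
    """Scale a block so its largest magnitude maps to 6.0, snap each value
    to the nearest grid point, then scale back (quantize + dequantize)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]  # simplification: scale stays in full precision
    out = []
    for x in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(nearest * scale, x))
    return out


def fake_quantize(weights):
    """Apply quantize_block to consecutive blocks of BLOCK_SIZE values."""
    out = []
    for start in range(0, len(weights), BLOCK_SIZE):
        out.extend(quantize_block(weights[start:start + BLOCK_SIZE]))
    return out


if __name__ == "__main__":
    print(fake_quantize([6.0, -3.0, 1.4, 0.2]))
```

Because each block carries its own scale, a single outlier only coarsens the resolution of the 16 values it shares a block with, rather than the whole tensor.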

That’s the approach behind our new Nemotron 3 Ultra NVFP4 checkpoint: we quantized the model into NVFP4 using NVIDIA Model Optimizer. The result is a model that achieves up to 5.9x higher inference throughput than the GLM-5.1 754B FP4 model on decode-heavy workloads while matching BF16 accuracy across nearly every benchmark, as shown in Figure 1.

While the performance benefits of NVFP4 are well understood, the process of producing a high-quality NVFP4 checkpoint is not. This post walks through how we quantized Nemotron 3 Ultra (550B) to NVFP4 with NVIDIA Model Optimizer, and shows developers how to generate the best quantized checkpoints for their own models.
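For a rough sense of scale, the following back-of-the-envelope Python sketch estimates the weight footprint of a 550B-parameter model in BF16 versus NVFP4. It assumes every parameter is quantized and that NVFP4 costs 4 bits per value plus one 8-bit scale per 16-value block. Real checkpoints typically keep some layers in higher precision, so these are idealized numbers, not measurements of the actual checkpoint.

```python
PARAMS = 550e9  # parameter count quoted in this post

BF16_BITS = 16.0
# Assumption: 4-bit value plus one 8-bit block scale amortized over 16 values.
NVFP4_BITS = 4.0 + 8.0 / 16.0  # 4.5 bits per weight


def weight_gigabytes(params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8.0 / 1e9


if __name__ == "__main__":
    bf16_gb = weight_gigabytes(PARAMS, BF16_BITS)
    nvfp4_gb = weight_gigabytes(PARAMS, NVFP4_BITS)
    print(f"BF16:  {bf16_gb:.1f} GB")
    print(f"NVFP4: {nvfp4_gb:.1f} GB")
    print(f"Reduction: {bf16_gb / nvfp4_gb:.2f}x")
```

Under these assumptions the weights shrink from about 1,100 GB to about 309 GB, roughly a 3.6x reduction in the bytes that have to move through memory on every decode step.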

Figure 1. Performance of Nemotron 3 Ultra NVFP4 compared to other NVFP4 models

The Nemotron 3 Ultra NVFP4 checkpoint

Related reading

Model Quantization: Post-Training Quantization Using NVIDIA Model Optimizer |…

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell |…

Model Quantization: Turn FP8 Checkpoints into High-Performance Inference…

Nvidia’s NVFP4 enables 4-bit LLM training without the accuracy trade-off -…

Quantization formats compared: GGUF vs GPTQ vs AWQ vs NF4

NVIDIA Nemotron 3 Ultra 550B: Developer Guide — Architecture, Benchmarks &…
