Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell | NVIDIA Technical Blog

Pretraining frontier LLMs comes down to throughput. When training spans trillions of tokens across thousands of accelerators, every percentage point of step time can add up to days of training and substantial compute costs. Numerical precision is one of the highest-leverage knobs available, but low-bit mixed-precision pretraining is hard to get right.

To address this, the NVFP4 training recipe in TransformerEngine brings sub-byte precision to JAX pretraining. For an end-to-end example, see the recipe in MaxText, a high-performance, scalable LLM framework. The result is high-throughput, 4-bit mixed-precision pretraining on NVIDIA Blackwell with no measurable accuracy loss compared to the FP8 baseline.

This post explains the NVFP4 format and how it is built to achieve high performance and accuracy at ultra-low precision. It also shows how to apply a MaxText NVFP4 pretraining recipe and how to collect performance data that demonstrates the gains. For methodology details, see the NVFP4 pretraining paper.

NVFP4 format and benefits

The introductory NVFP4 post explains the format and how two-level microscaling encodes signals with less error than other microscaling formats. It also explains how native hardware support for NVFP4 on the NVIDIA GB300 Grace Blackwell Ultra Superchip delivers 7x the GEMM throughput of native FP8 precision on NVIDIA Hopper. That higher throughput, combined with the NVFP4 pretraining recipe, shortens training step time with negligible accuracy loss. This enables AI factories to train more and larger models within the same time budget, or to train the same models in less time.
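To make two-level microscaling concrete, here is a minimal pure-Python reference sketch. It is not TransformerEngine's implementation and leaves out the rest of the pretraining recipe. The format details are assumptions brought in from the public NVFP4 format description rather than from this post: 4-bit E2M1 element values, one FP8 E4M3 scale per 16-element block, and one FP32 scale per tensor. The function names are hypothetical, and ties in the element rounding simply go to the smaller magnitude.

```python
import math

# Assumed NVFP4 layout (not stated in this post): E2M1 elements,
# one E4M3 scale per 16-element block, one FP32 scale per tensor.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable magnitudes
FP4_MAX = 6.0
E4M3_MAX = 448.0
BLOCK = 16


def round_to_e2m1(x):
    """Round x to the nearest E2M1 value, saturating at +/-6."""
    mag = min(abs(x), FP4_MAX)
    nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest, x)


def round_to_e4m3(x):
    """Round a non-negative scale to E4M3 precision (3 mantissa bits)."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    _, e = math.frexp(x)          # x = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)          # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)       # spacing of representable values
    return min(round(x / step) * step, E4M3_MAX)


def quantize_nvfp4(values):
    """Return (E2M1 values, per-block E4M3 scales, per-tensor scale)."""
    amax = max(abs(v) for v in values)
    # Level 1: a per-tensor scale that maps the largest block scale to E4M3_MAX.
    tensor_scale = amax / (FP4_MAX * E4M3_MAX) if amax > 0 else 1.0
    codes, block_scales = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        block_amax = max(abs(v) for v in block)
        # Level 2: a per-block scale, stored at E4M3 precision.
        scale = round_to_e4m3(block_amax / FP4_MAX / tensor_scale)
        block_scales.append(scale)
        denom = scale * tensor_scale
        codes.extend(round_to_e2m1(v / denom) if denom > 0 else 0.0 for v in block)
    return codes, block_scales, tensor_scale


def dequantize_nvfp4(codes, block_scales, tensor_scale):
    """Reconstruct approximate values from the two-level encoding."""
    return [c * block_scales[i // BLOCK] * tensor_scale for i, c in enumerate(codes)]


# One block of small values followed by one block of large values.
values = [0.01 * i for i in range(16)] + [100.0 * i for i in range(16)]
codes, block_scales, tensor_scale = quantize_nvfp4(values)
restored = dequantize_nvfp4(codes, block_scales, tensor_scale)
print(max(abs(a - b) for a, b in zip(values[:16], restored[:16])))
```

In this example the two blocks differ in magnitude by four orders. Because each block carries its own scale, the small block is still spread across the E2M1 grid. With a single scale derived from the tensor-wide maximum, every value in the small block would round to zero.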


