Pretraining frontier LLMs comes down to throughput. When training spans trillions of tokens across thousands of accelerators, every percentage point of step time can add up to days of training and substantial compute cost. Numerical precision is one of the highest-leverage knobs available, but low-bit mixed-precision pretraining is hard to get right.
To address this, the NVFP4 training recipe in TransformerEngine brings sub-byte (4-bit) precision to JAX pretraining. For an end-to-end example, see the recipe in MaxText, a high-performance, scalable LLM framework. The result is high-throughput, 4-bit mixed-precision pretraining on NVIDIA Blackwell with no measurable accuracy loss compared to the FP8 baseline.
This post explains the NVFP4 format and how it is designed to deliver both high performance and accuracy at ultra-low precision. It also shows how to apply the MaxText NVFP4 pretraining recipe and collect data that demonstrates the performance gains. For methodology details, see the NVFP4 pretraining paper.
NVFP4 format and benefits
The NVFP4 introductory post explains the format and how its two-level microscaling encodes signals with less error than other microscaling formats. It also explains how native hardware support for NVFP4 on the NVIDIA GB300 Grace Blackwell Ultra Superchip delivers 7x the GEMM throughput of native FP8 on NVIDIA Hopper. That higher throughput, combined with the NVFP4 pretraining recipe, shortens training step time with negligible accuracy loss. This enables AI factories to train more and larger models within the same time budget, or to train the same models in less time.
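To make the two-level scaling idea concrete, here is a minimal pure-Python sketch of an NVFP4-style quantize-and-dequantize round trip. It is an illustration only, not the TransformerEngine or MaxText implementation. The format parameters it uses (E2M1 4-bit values, blocks of 16 elements, an FP8 E4M3 scale per block, and a single FP32 scale per tensor) are assumptions taken from the public description of NVFP4 rather than from this section, and the function names `round_to_e4m3` and `quantize_nvfp4` are hypothetical.

```python
import math

# Assumed format parameters (from the public NVFP4 description, not this post):
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable FP4 magnitudes
E2M1_MAX = 6.0      # largest FP4 (E2M1) magnitude
E4M3_MAX = 448.0    # largest finite FP8 (E4M3) magnitude
BLOCK = 16          # elements that share one block scale


def round_to_e4m3(v):
    """Round a non-negative float to the nearest E4M3 value, saturating at 448."""
    if v <= 0.0:
        return 0.0
    v = min(v, E4M3_MAX)
    _, e = math.frexp(v)        # v = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)        # below 2**-6 the spacing stays at the subnormal step
    step = 2.0 ** (exp - 3)     # 3 mantissa bits give 8 steps per power of two
    return min(round(v / step) * step, E4M3_MAX)


def quantize_nvfp4(xs):
    """Return xs after a simulated NVFP4 quantize/dequantize round trip."""
    amax = max((abs(x) for x in xs), default=0.0)
    if amax == 0.0:
        return [0.0] * len(xs)

    # Level 1: one FP32 scale per tensor, chosen so that the largest
    # block scale lands at the top of the E4M3 range.
    global_scale = amax / (E2M1_MAX * E4M3_MAX)

    out = []
    for i in range(0, len(xs), BLOCK):
        block = xs[i:i + BLOCK]
        block_amax = max(abs(x) for x in block)

        # Level 2: one E4M3 scale per 16-element block.
        block_scale = round_to_e4m3(block_amax / E2M1_MAX / global_scale)
        scale = block_scale * global_scale

        for x in block:
            if scale == 0.0:
                out.append(0.0)
                continue
            mag = min(abs(x) / scale, E2M1_MAX)
            # Nearest grid point; ties go to the smaller magnitude here,
            # which is a simplification of hardware rounding.
            q = min(E2M1_GRID, key=lambda g: abs(g - mag))
            out.append(math.copysign(q * scale, x))
    return out
```

Calling `quantize_nvfp4` on a tensor that mixes a block of small values with a block containing a large outlier shows the point of the per-block scale: the small block keeps its own scale, so the outlier does not flush it to zero. With a single scale for the whole tensor, those small values would fall below the smallest nonzero FP4 step.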











