NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon

Pretraining frontier-scale LLMs in FP8 is now standard practice, but moving to 4-bit floating point has remained an open research problem because narrower formats compress dynamic range and amplify quantization error at long token horizons. New research from NVIDIA describes a pretraining methodology built around NVFP4, a 4-bit microscaling format supported natively by Blackwell Tensor Cores, and validates it by pretraining a 12-billion-parameter hybrid Mamba-Transformer on 10 trillion tokens. The research team states that this is the longest publicly documented training run in 4-bit precision to date. The resulting model attains 62.58% on MMLU-Pro (5-shot) versus 62.62% for the FP8 baseline, and the NVFP4 recipe is supported in NVIDIA’s Transformer Engine.

What NVFP4 Actually Is

To understand why NVFP4 is important, it helps to revisit how microscaling formats work. In a microscaling (MX) format, a contiguous block of low-precision elements shares a single scale factor, which is used to map the block back into a wider numerical range during the matrix multiply. MXFP4 uses 32-element blocks in which each element is stored as E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit), a layout that can encode only the values ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, and ±6. Block scale factors are stored in UE8M0, an unsigned exponent-only format with 8 exponent bits and no mantissa bits, which restricts them to powers of two.
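The MXFP4 layout described above can be illustrated with a small simulation. This is a minimal sketch, not NVIDIA's or any library's implementation: the helper names (`power_of_two_scale`, `round_to_e2m1`, `fake_quantize_block`) are hypothetical, the scale rule (smallest power of two that brings the block maximum within ±6) is an assumption chosen for simplicity, and rounding ties go to the smaller magnitude rather than following a hardware rounding mode. Values are kept as ordinary Python floats, so the code simulates the numerics only and does not pack anything into 4 bits.

```python
import math

# Magnitudes representable in E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # MXFP4 shares one scale across 32 elements


def power_of_two_scale(block):
    """Pick a power-of-two scale for the block.

    Assumption for this sketch: use the smallest power of two such that
    max|x| / scale <= 6, the largest E2M1 magnitude.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / 6.0))


def round_to_e2m1(x):
    """Round x to the nearest E2M1 value, saturating at +/-6.

    Ties resolve to the smaller magnitude (a simplification).
    """
    mag = min(abs(x), 6.0)
    nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
    return math.copysign(nearest, x)


def fake_quantize_block(block):
    """Quantize one block to E2M1 with a shared scale, then dequantize."""
    assert len(block) == BLOCK_SIZE
    scale = power_of_two_scale(block)
    codes = [round_to_e2m1(x / scale) for x in block]
    return scale, [c * scale for c in codes]


if __name__ == "__main__":
    # One large value forces a coarse scale for the whole block.
    block = [12.0, 0.3] + [1.0] * (BLOCK_SIZE - 2)
    scale, dequantized = fake_quantize_block(block)
    print("scale:", scale)
    print("first values:", dequantized[:4])
```

Because the scale must be a power of two and is shared by all 32 elements, a single large value in the block coarsens the grid for every other element, which is the kind of quantization error the article refers to.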
