Pretraining frontier-scale LLMs in FP8 is now standard practice, but moving to 4-bit floating point has remained an open research problem because narrower formats compress dynamic range and amplify quantization error at long token horizons. New research from NVIDIA describes a pretraining methodology built around NVFP4, a 4-bit microscaling format supported natively by Blackwell Tensor Cores, and validates it by pretraining a 12-billion-parameter hybrid Mamba-Transformer on 10 trillion tokens. The research team states that this is the longest publicly documented training run in 4-bit precision to date. The resulting model attains 62.58% on MMLU-Pro (5-shot) versus 62.62% for the FP8 baseline, and NVFP4 training is supported in NVIDIA’s Transformer Engine.
What NVFP4 Actually Is
To understand why NVFP4 is important, it helps to revisit how microscaling formats work. In a microscaling (MX) format, a contiguous block of low-precision elements shares a single scale factor, which is used to map the block back into a wider numerical range during the matrix multiply. MXFP4 uses 32-element blocks in which each element is stored as E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit), which can encode only the values ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, and ±6. Block scale factors are stored in UE8M0, which restricts them to powers of two.
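To make the block-scaling idea concrete, here is a minimal pure-Python sketch of MXFP4-style quantization for a single 32-element block. It is an illustration, not NVIDIA's or Transformer Engine's implementation: the function names are hypothetical, the scale exponent is rounded up so that no element saturates (real implementations may choose the exponent differently), ties in rounding go to the smaller magnitude rather than following hardware rounding rules, and the finite exponent range of UE8M0 is ignored.

```python
import math

# Non-negative magnitudes representable in E2M1 (the sign is a separate bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # MXFP4 block length


def round_to_e2m1(x):
    """Round x to the nearest E2M1 value, saturating at +/-6.

    Simplification: ties resolve to the smaller magnitude.
    """
    mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) - m))
    return math.copysign(mag, x)


def power_of_two_scale(block):
    """Pick a power-of-two scale so the block's largest magnitude maps to <= 6.

    Assumption: the exponent is rounded up to avoid saturation.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / 6.0))


def quantize_block(block):
    """Return (scale, codes), where every code is an E2M1 value."""
    assert len(block) == BLOCK_SIZE
    scale = power_of_two_scale(block)
    return scale, [round_to_e2m1(v / scale) for v in block]


def dequantize_block(scale, codes):
    """Map the 4-bit codes back into the wider range using the shared scale."""
    return [scale * c for c in codes]


if __name__ == "__main__":
    block = [12.0] + [1.0] * 31
    scale, codes = quantize_block(block)
    print(scale, codes[:2], dequantize_block(scale, codes)[:2])
```

Because the scale is shared by the whole block and limited to powers of two, a single large outlier sets the step size for all 32 elements. In this sketch, a block containing 100.0 and 0.3 gets a scale of 32, so 100.0 is reconstructed as 96.0 and 0.3 is flushed to zero.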