As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy Optimization (GRPO) power this transition, enabling reasoning-grade models to continuously improve through iterative feedback. Unlike standard supervised fine-tuning, RL training loops are bifurcated into two distinct, high-intensity phases: a generation phase with a stringent latency requirement and a training phase requiring high throughput.

To make these workloads viable, researchers and engineers are turning to low-precision datatypes like FP8 to boost performance in training and throughput-oriented generation. Moreover, in some scenarios where generation is bound by GPU memory bandwidth, using low-precision parameters can improve performance due to fewer bytes per parameter.

This post dives deep into the systemic challenges of low-precision RL and how NVIDIA NeMo RL—an open source library within the NVIDIA NeMo framework—speeds up RL workloads while maintaining accuracy.

FP8 for linear layers in RL

Our recipe uses the block-wise quantized FP8 introduced by the DeepSeek-V3 Technical Report. Table 1 gives the details of tensor formats in linear projection layers.