
1. The One Terabyte Problem
2. Why bf16 RL Weights Are Almost Always Sparse
3. HF Buckets and the Architecture
   3.1 What is a Bucket?
   3.2 The Three Boxes
4. The Protocol
   4.1 Safetensors as the Wire Format
   4.2 The Trainer Side: a Boolean Mask From an Optimizer Hook
   4.3 The vLLM Side: a 30 Line Extension
5. Standing It Up on Spaces, For Real
6. So What Does This Actually Unlock?
7. What's Still on Our Plate
8. Try It

TL;DR, because you have models to train and we respect that:

Async RL has a dirty secret: every step, the trainer has to ship the whole model to the inference engine. For a 7B in bf16 that is 14 GB. For a frontier 1T model checkpoint that is on the order of a terabyte. Per step.
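The arithmetic behind those numbers is just two bytes per weight. A quick sketch, where `full_sync_bytes` is a helper name made up for this illustration:

```python
BYTES_PER_BF16 = 2  # bf16 stores each weight in 16 bits


def full_sync_bytes(num_params: int) -> int:
    """Bytes needed to ship every weight once in bf16."""
    return num_params * BYTES_PER_BF16


for name, num_params in [("7B", 7_000_000_000), ("1T", 1_000_000_000_000)]:
    gigabytes = full_sync_bytes(num_params) / 1e9
    print(f"{name}: {gigabytes:,.0f} GB per full sync")
```

This counts raw weight bytes only and ignores file headers and any non-bf16 tensors.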

It turns out you do not have to. Between two consecutive RL optimizer steps, roughly 99% of bf16 weights are bit-identical (and never less than 98% in the worst case). The actual delta is tiny.
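To make "bit-identical" concrete, here is a minimal sketch in plain Python that rounds a few made-up weights to bf16 before and after an update and counts how many 16-bit patterns survive. The weights and updates are synthetic. The helper names are ours, and the sketch assumes the update is computed at higher precision and then cast to bf16 with round-to-nearest-even. It illustrates one plausible way small updates can vanish. It is not a measurement of a real training run.

```python
import struct


def to_bf16_bits(x: float) -> int:
    """Round a float to bf16 (round-to-nearest-even) and return its 16-bit pattern.

    Sketch only: NaN and overflow edge cases are not handled.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))  # float32 bit pattern
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)           # ties go to even
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def unchanged_fraction(before: list[int], after: list[int]) -> float:
    """Fraction of positions whose bf16 bit pattern did not change."""
    same = sum(b == a for b, a in zip(before, after))
    return same / len(before)


# Synthetic example: three tiny updates and one comparatively large one.
weights = [0.05, -0.3, 1.25, 0.002]
updates = [1e-6, 1e-6, 1e-6, 1e-3]

before = [to_bf16_bits(w) for w in weights]
after = [to_bf16_bits(w + u) for w, u in zip(weights, updates)]

print(f"bit-identical: {unchanged_fraction(before, after):.0%}")
```

In this toy case the three tiny updates are smaller than half the spacing between neighboring bf16 values, so they round away and only the large one changes its bits.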

We landed a TRL PR that encodes just the changed elements as a sparse safetensors file, uploads it to a Hugging Face Bucket, and tells vLLM to fetch it. On Qwen3-0.6B, the per-step payload drops from 1.2 GB to between 20 and 35 MB.
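The core idea of "ship only what changed" fits in a few lines. The sketch below uses plain Python lists of 16-bit patterns. The layout (a flat index plus the new value per changed element) and the names `encode_delta`, `apply_delta`, and `sparse_payload_bytes` are assumptions for illustration. They are not the format or API of the TRL PR, which writes a sparse safetensors file.

```python
def encode_delta(before: list[int], after: list[int]) -> tuple[list[int], list[int]]:
    """Return (indices, values) for positions whose bit pattern changed."""
    indices = [i for i, (b, a) in enumerate(zip(before, after)) if b != a]
    values = [after[i] for i in indices]
    return indices, values


def apply_delta(weights: list[int], indices: list[int], values: list[int]) -> list[int]:
    """Patch a copy of the receiver's weights with the changed elements."""
    patched = list(weights)
    for i, v in zip(indices, values):
        patched[i] = v
    return patched


def sparse_payload_bytes(num_changed: int, index_bytes: int = 4, value_bytes: int = 2) -> int:
    """Rough payload size, assuming a 32-bit index and a bf16 value per change."""
    return num_changed * (index_bytes + value_bytes)


# Toy round trip on made-up 16-bit patterns: only position 2 changes.
before = [0x3D4D, 0xBE9A, 0x3FA0, 0x3B03]
after = [0x3D4D, 0xBE9A, 0x3FA1, 0x3B03]

indices, values = encode_delta(before, after)
assert apply_delta(before, indices, values) == after

# Back-of-envelope: 1.2 GB of bf16 is about 600M weights; assume 1% of them change.
num_changed = 600_000_000 // 100
print(f"{sparse_payload_bytes(num_changed) / 1e6:.0f} MB")  # prints "36 MB"
```

Under these assumed sizes the estimate lands in the same ballpark as the payloads quoted above. The real figure depends on the actual encoding and on how many weights change in a given step.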