Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL


1. The One Terabyte Problem
2. Why bf16 RL Weights Are Almost Always Sparse
3. HF Buckets and the Architecture
   3.1 What is a Bucket?
   3.2 The Three Boxes
4. The Protocol
   4.1 Safetensors as the Wire Format
   4.2 The Trainer Side: a Boolean Mask From an Optimizer Hook
   4.3 The vLLM Side: a 30 Line Extension
5. Standing It Up on Spaces, For Real
6. So What Does This Actually Unlock?
7. What's Still on Our Plate
8. Try It

TL;DR, because you have models to train and we respect that:

Async RL has a dirty secret: every step, the trainer has to ship the whole model to the inference engine. For a 7B model in bf16 that is 14 GB. For a frontier 1T-parameter checkpoint it is on the order of a terabyte. Per step.
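The arithmetic behind those numbers is simple enough to check by hand. The sketch below assumes a dense model stored entirely in bf16 (2 bytes per parameter) and decimal units; `full_sync_bytes` is a hypothetical helper name, not part of TRL.

```python
def full_sync_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Bytes shipped per step if every weight is sent (bf16 = 2 bytes each)."""
    return num_params * bytes_per_param

GB = 10**9
TB = 10**12

print(full_sync_bytes(7 * 10**9) / GB)   # 14.0 (GB for a 7B model)
print(full_sync_bytes(10**12) / TB)      # 2.0 (TB for a 1T model)
```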

It turns out you do not have to. Between two consecutive RL optimizer steps, roughly 99% of bf16 weights are bit-identical, and at least 98% even in the worst case. The actual delta is tiny.
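One way to see why so many weights can stay bit-identical: bf16 keeps only 8 bits of significand, so an update much smaller than the weight itself rounds back to the same bit pattern. The sketch below emulates bf16 rounding (round to nearest even) with the standard library and counts unchanged elements. The helper names and the toy values are illustrative assumptions, not measurements from the PR, and NaN handling is omitted.

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Round a float to bf16 (round-to-nearest-even) and return its 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)          # ties go to even
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def unchanged_fraction(weights, updates) -> float:
    """Fraction of weights whose bf16 bits are identical after the update."""
    same = sum(
        to_bf16_bits(w) == to_bf16_bits(w + u)
        for w, u in zip(weights, updates)
    )
    return same / len(weights)

# A 1e-6 step on a weight of 1.0 rounds away; a 1e-2 step does not.
print(unchanged_fraction([1.0, 1.0], [-1e-6, -1e-2]))  # 0.5
```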

We landed a TRL PR that encodes just the changed elements as a sparse safetensors file, uploads it to a Hugging Face Bucket, and tells vLLM to fetch it. On Qwen3-0.6B, the per-step payload drops from 1.2 GB to between 20 and 35 MB.
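The core idea of sending only the changed elements can be sketched in a few lines. This is a minimal illustration on plain Python lists of bf16 bit patterns, not the PR's actual encoding: the real implementation writes a safetensors file, and the `encode_delta` / `apply_delta` names and the index-plus-value layout here are assumptions made for the example.

```python
def encode_delta(old_bits, new_bits):
    """Return (indices, values) for every element whose bits changed."""
    indices = [i for i, (a, b) in enumerate(zip(old_bits, new_bits)) if a != b]
    values = [new_bits[i] for i in indices]
    return indices, values

def apply_delta(old_bits, indices, values):
    """Rebuild the new weights on the receiver from its old copy plus the delta."""
    out = list(old_bits)
    for i, v in zip(indices, values):
        out[i] = v
    return out

# Four bf16 weights as raw 16-bit patterns; only element 1 changes.
old = [0x3F80, 0x4000, 0xBF80, 0x3F00]
new = [0x3F80, 0x4001, 0xBF80, 0x3F00]

indices, values = encode_delta(old, new)
print(indices, values)                          # [1] [16385]
print(apply_delta(old, indices, values) == new)  # True
```

Because the receiver overwrites changed elements with their new bit patterns rather than adding a difference, the reconstruction is exact and no rounding error accumulates across steps in this sketch.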
