Continuous batching has been an ongoing effort in transformers for a few months now. The aim is a fast, memory-aware generation path that lives inside the library itself, and it has been documented as it grew: first the core mechanism, then the asynchronous version (h/t @ror 🐐).
Now those efforts have gone beyond generation and into training. GRPO in TRL can use continuous batching for its rollouts.
Online RL is generation-heavy: producing the rollouts is usually the most expensive part of the loop, so the generation path is where the speed lives. Until now TRL gave you two options: the default generate(), which is simple and in-process but wasteful when you ask for many completions, or vLLM, which is very fast but is a separate inference engine to bring in and manage (as its own server, or colocated on the training GPUs). Continuous batching fills the gap in the middle: an in-process path that does not waste compute and memory when the number of completions N is high, using transformers directly, with no vLLM dependency and no weight syncing between two copies of the model.
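To make the "wasteful when you ask for many completions" point concrete, here is a toy simulation in plain Python. It is only a sketch under simplifying assumptions: it is not the transformers or TRL implementation, the function names are hypothetical, and it counts decode forward passes only, ignoring prefill, KV cache memory, and the fact that real step cost varies with batch contents. It compares a static batch, which runs until its longest sequence finishes, with a continuous batch, which refills a slot as soon as a sequence ends.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Forward passes when each batch runs until its longest sequence ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Short sequences sit idle (padded) until the longest one is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Forward passes when a finished sequence's slot is refilled right away."""
    queue = deque(lengths)  # completion lengths still waiting (all >= 1)
    active = []             # remaining tokens for each running sequence
    steps = 0
    while queue or active:
        # Fill free slots from the queue before the next forward pass.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Each running sequence emits one token; drop the ones that finished.
        active = [r - 1 for r in active if r > 1]
    return steps


# Hypothetical completion lengths: a few long rollouts among many short ones.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
static = static_batching_steps(lengths, batch_size=4)
continuous = continuous_batching_steps(lengths, batch_size=4)
print(f"static: {static} steps, continuous: {continuous} steps")
```

With these made-up lengths the static schedule needs 16 forward passes and the continuous one needs 10, because the slots freed by short completions are reused instead of waiting on the longest one. The gap depends entirely on how uneven the completion lengths are, so treat the numbers as an illustration of the mechanism, not as a benchmark.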















