PipelineRL uses vLLM as the inference engine for rollout generation. The

inference engine samples tokens and returns token logprobs; the trainer uses

those logprobs to compute policy ratios, KL, clip rate, entropy, and reward.

Any discrepancy in how those logprobs are computed can change the training