PipelineRL uses vLLM as the inference engine for rollout generation. The
inference engine samples tokens and returns token logprobs; the trainer uses
those logprobs to compute policy ratios, KL, clip rate, entropy, and reward.
Any discrepancy in how those logprobs are computed can change the training







