Deriving the PPO Loss from First Principles

I have been trying to wrap my head around reinforcement learning methods like DPO, GRPO, and RLVR for a while now, especially with all the recent work showing how effective they can be for LLM post-training. Since I am still pretty new to RL, I figured the best place to start was Proximal Policy Optimization (PPO), the algorithm OpenAI used in the InstructGPT paper to show that reinforcement learning could meaningfully improve LLM alignment. My hope is that getting comfortable with PPO will give me the right mental model for the policy-gradient side of things and make it easier to understand the newer LLM-specific RL methods built on similar ideas.

If you start learning RL, you quickly realize it involves a lot of math! So I decided to lean into that and do a few (possibly annoying) derivation sessions to really understand the PPO objective by building it up from first principles, similar to how Umar Jamil does in his video.

A huge shoutout to Umar Jamil's video on RLHF and PPO: it was incredibly helpful for building intuition and understanding the math behind the PPO loss.

Below is my attempt at the derivation based on the original PPO and InstructGPT papers and Umar Jamil’s video.
