I have been trying to wrap my head around reinforcement learning methods like DPO, GRPO, and RLVR for a while now, especially with all the recent work showing how effective they can be for LLM post-training. Since I am still pretty new to RL, I figured the best place to start was Proximal Policy Optimization (PPO), the algorithm OpenAI used in the InstructGPT paper to show how reinforcement learning could meaningfully improve LLM alignment. My hope is that getting comfortable with PPO will give me the right mental model for the policy-gradient side of things and make it easier to understand the newer LLM-specific RL methods built on similar ideas.
If you start learning RL, you quickly realize it involves a lot of math! So I decided to lean into that and do a few (possibly annoying) derivation sessions to really understand the PPO objective by building it up from first principles, similar to how Umar Jamil does in his video.
A huge shoutout to Umar Jamil's video on RLHF and PPO: it was incredibly helpful for building intuition and understanding the math behind the PPO loss.
Below is my attempt at the derivation, based on the original PPO and InstructGPT papers and Umar Jamil's video.







