RLHF vs DPO vs IPO vs KTO: which alignment method should you use

You have a base model, say Llama 3.1 8B, that can write poetry in any meter and pass the bar exam. It can also generate instructions for synthesizing controlled substances, roleplay as a manipulative therapist, and explain in loving detail why your pull request is an affront to good taste. You need to align it: remove the harmful outputs while keeping the capability. Your mentor says "use RLHF." A paper on your feed says DPO is simpler. Your colleague swears by KTO because they only have thumbs-up/thumbs-down log data from production. Where do you start?

Choosing an alignment method is not a theoretical debate. It is a practical decision that depends on your data, your compute budget, and the failure modes you are trying to avoid. This post compares the four dominant approaches side by side, with the actual math, the data requirements, and the sharp edges you will hit in production.

Why this matters

The alignment method you pick determines three things that directly affect shipping timelines:
