Understanding Reinforcement Learning with Human Feedback Part 4: Teaching Models Human Preferences


Saturday, May 23, 2026


In the previous article, we looked at how human preferences are collected. In this article, we will see how that preference data is used to train a model.

To train a model that gives higher scores to preferred responses, we first make a copy of the model that has already gone through supervised fine-tuning.

Modifying the Model

Next, we modify this copied model.

We remove the unembedding layer, which normally turns the model's final hidden state into a prediction of the next token, and replace it with a layer that produces a single output value: a score for the response.
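The copy-and-modify step can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: `ToyLM`, `ToyRewardModel`, the tiny sizes, and the use of averaged embeddings in place of a real transformer body are all hypothetical and not taken from the article. The point is the shape change: the original model outputs one number per vocabulary entry, while the modified copy outputs a single number.

```python
import copy
import random


def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


class ToyLM:
    """Hypothetical stand-in for the supervised fine-tuned model (not a real transformer)."""

    def __init__(self, vocab_size=10, hidden_size=4, seed=0):
        rng = random.Random(seed)
        self.embed = [[rng.uniform(-1, 1) for _ in range(hidden_size)]
                      for _ in range(vocab_size)]
        # Unembedding layer: one row of weights per vocabulary entry.
        self.unembed = [[rng.uniform(-1, 1) for _ in range(hidden_size)]
                        for _ in range(vocab_size)]

    def hidden(self, token_ids):
        # Placeholder for the model body: average the token embeddings.
        vecs = [self.embed[t] for t in token_ids]
        return [sum(col) / len(vecs) for col in zip(*vecs)]

    def next_token_logits(self, token_ids):
        h = self.hidden(token_ids)
        return [dot(row, h) for row in self.unembed]


class ToyRewardModel:
    """Copy of the toy model with the unembedding layer swapped for a scalar head."""

    def __init__(self, sft_model, seed=1):
        # Work on a copy so the fine-tuned model itself is left untouched.
        self.body = copy.deepcopy(sft_model)
        hidden_size = len(self.body.embed[0])
        del self.body.unembed  # remove the next-token prediction layer
        rng = random.Random(seed)
        # New head: a single weight vector, so the output is one number.
        self.score_weights = [rng.uniform(-1, 1) for _ in range(hidden_size)]
        self.score_bias = 0.0

    def score(self, token_ids):
        h = self.body.hidden(token_ids)
        return dot(self.score_weights, h) + self.score_bias


sft = ToyLM()
rm = ToyRewardModel(sft)
tokens = [1, 5, 7]  # made-up token ids for a prompt plus response

logits = sft.next_token_logits(tokens)  # one value per vocabulary entry
reward = rm.score(tokens)               # a single scalar score
print(len(logits), type(reward).__name__)
```

The new head starts with random weights, so the score means nothing until the model has been trained on the preference data.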
