Understanding Reinforcement Learning with Human Feedback Part 3: Collecting Human Preferences

Wednesday, May 20, 2026

In the previous article, we explored the concept of aligning a pretrained model. Now we will look at the next component: collecting human preferences.

The first step in understanding RLHF is to understand that, given a specific prompt, a model can generate different responses.

One way to generate a response is to configure the model to always select the token with the highest score at every step, a strategy commonly called greedy decoding.

In this case, the model will generate the same response every single time for a given prompt.
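To make this concrete, here is a minimal sketch of picking the highest-scoring token at every step. The table `TOY_LOGITS` and the function `greedy_decode` are hypothetical stand-ins invented for illustration, and the scores are made up. A real model scores the next token from the whole prompt, while this toy only looks at the previous token.

```python
# Hypothetical toy "model": for each previous token, a made-up score for
# every possible next token. A real language model would compute these
# scores from the full prompt instead of a lookup table.
TOY_LOGITS = {
    "<start>": {"The": 2.0, "A": 1.5, "<end>": -1.0},
    "The": {"cat": 1.8, "dog": 1.7, "<end>": 0.1},
    "A": {"cat": 1.0, "dog": 1.2, "<end>": 0.0},
    "cat": {"sleeps": 2.2, "runs": 1.9, "<end>": 0.5},
    "dog": {"sleeps": 1.1, "runs": 2.4, "<end>": 0.5},
    "sleeps": {"<end>": 3.0},
    "runs": {"<end>": 3.0},
}


def greedy_decode(start_token="<start>", max_steps=10):
    """Build a response by always taking the highest-scoring next token."""
    tokens = []
    current = start_token
    for _ in range(max_steps):
        scores = TOY_LOGITS[current]
        # Selecting the maximum involves no randomness at all.
        current = max(scores, key=scores.get)
        if current == "<end>":
            break
        tokens.append(current)
    return " ".join(tokens)


# Generate five times and collect the distinct responses.
responses = {greedy_decode() for _ in range(5)}
print(responses)  # {'The cat sleeps'}
```

Because every step is a plain maximum over fixed scores, five runs collapse into a single distinct response. Note that "The dog" scores almost as high as "The cat" at the second step, yet it is never chosen.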

Generating Different Responses
