Ask yourself one question. When you talk to ChatGPT or Claude, do you feel like you talk to something that thinks — or something that agrees with you?
The answer matters more than most AI engineers want admit. Because behind every polite refusal, every hedged answer, every "as an AI language model" deflection, there is alignment algorithm making tradeoff. And that tradeoff has name: the alignment tax.
Two methods dominate how modern AI gets "aligned" with human preferences: RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization). They promise safer, more helpful models. What they deliver is something else entirely — models that perform helpfulness while quietly losing ability to reason honestly.
I work with these systems every day. And the more I see, the more convinced I become: we are paying alignment tax we did not agree to pay, for safety we did not ask for, to protect corporate interests we never voted for.
What RLHF Actually Does (And Why It Breaks Things)






