A deep dive into using preference optimization to train open-source models that beat GPT 5.2. We show that fine-tuned open-source models like gpt-oss 120b and Qwen 3 235B Instruct more often agree with human preference labels on a held-out evaluation set. We evaluate using Reward Bench 2 which measures alignment with human judgment, not absolute correctness or ground-truth quality. The table below is a quick sneak preview of the results we got, if you'd rather just see the code please feel free to jump into the cookbook! * Together AI models page ** Speed on Together AI as benchmarked by Artificial Analysis 3rd party Model Baseline + DPO Fine-tune Cost per 1M tokens* Cost vs GPT-5.2 Speed** Speed vs GPT-5.2 GPT-5.2 61.62% N/A $1.75 input / $14 output - 62.9 tok/sec - gpt-oss 120B 57.91% 62.63% $0.15 input / $0.60 output 15.3× cheaper 908.7 tok/sec 14× faster Qwen3 235B 62.63% 61.28% $0.20 input / $0.60 output 12.4× cheaper 261.6 tok/sec 4.2× faster Llama 4 Mav 50.2% — $0.27 input / $0.85 output 9.1× cheaper 64.7 tok/sec 1× faster The LLM-as-a-judge paradoxHere's a paradox that's bothered me for some time now: we're using LLMs to evaluate LLMs. The same technology that generates hallucinations is now our primary tool for detecting them. It sounds like asking the fox to guard the henhouse 😀.But it works. And not just works, it's become the dominant framework for evaluating LLM-powered products at scale.The reason is simple: for most tasks judging is easier than generating. When an LLM generates a response, it juggles complex context, follows multi-step instructions, and synthesizes information from its training data. When it evaluates a response, it performs a focused classification task of the form: does this text contain harmful content? Is response A better than response B?This insight opens up an interesting question: if judging is a simpler task, can we fine-tune smaller, open-source models to be *better* judges than massive closed-source alternatives?We ran the experiment. The answer is yes!In this deep dive, we'll show you how we fine-tuned open-source LLM judges to outperform GPT-5.2 on human preference alignment using Direct Preference Optimization (DPO). We'll cover:The experimental setup and benchmark (RewardBench 2)Baseline evaluation of 4 judge models (3 open, 1 closed)DPO fine-tuning methodology and resultsCategory-level analysis revealing where each model excels and where preference tuning helped/hurtPractical code to implement this yourself!Let's dive in.Why LLM-as-a-judge worksBefore we get to the experiment, let's build intuition for why this technique is so effective.The evaluation scaling problemEvaluating LLM outputs is fundamentally different from evaluating traditional ML models. With a classifier, you compute accuracy against ground truth labels. With a recommender, you measure ranking quality with NDCG.But with generative text? There are many ways to be "right." A summary can be accurate without matching the reference word-for-word. A chatbot response can be helpful in different styles. Metrics like BLEU or ROUGE capture surface-level overlap but miss semantic equivalence.Human evaluation handles these nuances, but it doesn't scale. You can't have humans review every response in production.Enter LLM-as-a-judgeThe breakthrough insight is that LLMs, trained on vast amounts of human-written text, have internalized patterns of quality, relevance, and appropriateness. By crafting the right evaluation prompt, you can activate these capabilities for focused assessment tasks.‍Figure: The LLM-as-a-Judge workflow. An external LLM evaluates outputs from your production system using criteria you define.The key is that the evaluator/Judge LLM operates independently of the generation process. It examines the output and judges it on its merits. Even if your chatbot was tricked into generating harmful content, an external evaluator can still detect this because it's performing a simpler, focused classification task.Types of LLM judgesThere are three main paradigms:Pairwise Comparison: Given two responses, which is better? Useful for A/B testing models or prompts.Direct Scoring: Rate a single response on a scale (1-10) or classify it (helpful/unhelpful). Useful for production monitoring.‍Reference-Based Evaluation: Compare a response against source material or a reference answer. Essential for RAG systems and hallucination detection.For this experiment, we focus on a pairwise comparison depicted in the flowchart below, this is the classic "LLM-as-a-Judge" setup that the technique is named after.The experiment: Can open-source judges beat GPT-5.2?GPT-5.2 represents the current state-of-the-art in closed-source LLM judges. It's powerful, but:Expensive: Per-token costs add up at scale - with open models you can deploy them on your GPUs and at scale this is significantly more price effective.Opaque: No visibility into model weights or behavior - you can probe the judge to understand why it's behaving a certain way.Vendor lock-in: Your evaluation pipeline depends on an external API.For many of the above reasons it would be beneficial if we could use open judges that we could deploy where we wish, probe as we see necessary and continually improve. But we also don't want to leave performance on the table, we'd like to have our cake and eat it too! Here we'll see that if you have a dataset of preferences and human labels(which output humans chose) you can often fine-tune open-source models on said human preference data and these models can then match or exceed GPT-5.2's performance as a judge.Models under testWe evaluated four judge models: Model Type Parameters Notes GPT-OSS 120B Open 120B OpenAI's open-source release Qwen3 235B Open 235B Alibaba's largest instruct model Llama 4 Maverick Open 400B Meta's efficient instruct model GPT-5.2 Closed Unknown OpenAI's SOTA closed-source judge The open models are fine-tuning candidates. GPT-5.2 is the target to beat.The Benchmark: RewardBench 2We used RewardBench 2, a comprehensive benchmark for evaluating reward models and LLM judges. It tests capabilities across 6 categories:Figure: Distribution of examples across RewardBench 2 categories. Focus and Factuality have the most examples; Ties and Precise IF have the fewest.Precise Instruction Following: Judging adherence to specific constraintsMath: Mathematical reasoning and accuracySafety: Compliance and harmful content detectionFocus: Quality and relevance of responsesTies: Robustness when multiple valid answers existEach example contains:One human chosen response (ground truth winner)Three or more human rejected responses (ground truth losers)Good judges will pick human chosen responses more often and thus we can calculate the quality of a judge as the number of examples where it's choice agrees with human choice. Success in our experiment will then be measured by how often the judge's choices correlate with human preferences. The best judges should, ignoring noisy labels in the data, agree with human preference.Baseline evaluationTo ensure unbiased evaluation of judges we created a stratified train/test split:Training set: ~1,500 examples (for later fine-tuning)Test set: ~300 examples (for final evaluation)Zero overlap between setsProportional sampling maintains category distributionBefore fine-tuning, we need to establish baseline performance for all models on the held-out test set. We used a carefully crafted prompt that instructs the judge on evaluation criteria: