Fine-Tuning Llama 3.2 3B on Medical QA: Week 4 - When Lower Loss Meant a Worse Model


Tuesday, 16 June 2026


What Happened This Week

Week 3 produced a working fine-tuned model: one epoch, one dataset, a clear improvement over the base model. Week 4 was supposed to make it better: more data (a second dataset), two epochs, and a cleaner setup.

The eval loss dropped from 2.495 to 2.275. By that number alone, Week 4 was a success.

The model was worse.

This is the story of how a better loss number hid a serious regression, how I diagnosed it, and what it took to actually fix it. It is one of the most useful things I have learned in this project.
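To put the size of that loss drop in perspective, here is a minimal sketch that converts the two eval losses into perplexity. It assumes the reported eval loss is a mean per-token cross-entropy in nats, which is an assumption on my part about the setup rather than something the numbers themselves guarantee. The function and variable names are illustrative only.

```python
import math


def perplexity(mean_token_loss: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_token_loss)


# Eval losses quoted above; the interpretation as nats per token is assumed.
week3_loss = 2.495
week4_loss = 2.275

week3_ppl = perplexity(week3_loss)
week4_ppl = perplexity(week4_loss)

# Relative change in loss and in perplexity between the two runs.
loss_drop = (week3_loss - week4_loss) / week3_loss
ppl_drop = (week3_ppl - week4_ppl) / week3_ppl

print(f"Week 3 perplexity: {week3_ppl:.2f}")
print(f"Week 4 perplexity: {week4_ppl:.2f}")
print(f"Relative loss drop: {loss_drop:.1%}")
print(f"Relative perplexity drop: {ppl_drop:.1%}")
```

Under that assumption the drop is not noise: a 0.22 reduction in loss corresponds to roughly a fifth less perplexity on the eval set. That is exactly why the number looked so convincing, and why it says nothing on its own about whether the generated answers got better.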


Related reading

Fine-Tuning Llama 3.2 3B on Medical QA: Week 2 - Data Preparation

I Fine-Tuned a 270M Model on My Laptop (Full Fine-Tuning, From Scratch)

If a 270M Model Already Worked, Why Did I Fine-Tune a 7B One?

LoRA and QLoRA fine-tuning: what they actually do under the hood

I was fine-tuning a language model on Arabic. The loss was perfect. It spoke…

LoRA: I Trained <1% of a 1.5B Model and Matched a Full Fine-Tune
