In Part 1 I fully fine-tuned a 270M-parameter model, updating every weight. That's fine for a tiny model. It gets painful as models grow, because full fine-tuning needs gradients and optimizer state for every parameter: with an Adam-style optimizer that comes to roughly 4× the memory of the weights alone, before counting activations.
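To put a rough number on that, here is a back-of-the-envelope sketch. It assumes an Adam-style optimizer (two extra values per parameter), 4 bytes per value, and no activation memory; the function name and the 7B comparison size are my own illustration, not something from Part 1.

```python
def full_finetune_memory_gb(n_params, bytes_per_value=4):
    """Rough memory for weights + gradients + Adam state, excluding activations.

    Assumes every tensor is stored at the same precision (4 bytes = fp32).
    """
    weights = n_params * bytes_per_value
    grads = n_params * bytes_per_value
    adam_state = 2 * n_params * bytes_per_value  # first and second moment estimates
    return (weights + grads + adam_state) / 1e9


# 270M is the model from Part 1; 7B is a hypothetical larger model for comparison.
for name, n_params in [("270M", 270e6), ("7B", 7e9)]:
    print(f"{name}: ~{full_finetune_memory_gb(n_params):.1f} GB")
# 270M: ~4.3 GB
# 7B: ~112.0 GB
```

Real numbers shift with mixed precision, optimizer choice, and batch size, but the scaling is the point: the cost grows linearly with parameter count.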
So what do you do when the model is too big to comfortably fine-tune in full?
The idea behind LoRA
LoRA (Low-Rank Adaptation) rests on one observation: the change fine-tuning makes to a weight matrix tends to be "low rank", meaning it lives in a small subspace. You don't need to learn the full update ΔW; you can learn it as the product of two skinny matrices, B·A. If W is d×k, then B is d×r and A is r×k, with the rank r much smaller than d or k:
output = W·x + (B·A)·x
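As a minimal sketch of that formula in NumPy: the sizes below are hypothetical toy values, and the initialization (small random A, zero B, so the update starts at zero) follows the usual LoRA convention rather than anything specific to this series.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 4  # hypothetical toy sizes; r is the LoRA rank

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))               # trainable, zero init so B·A starts at 0
x = rng.normal(size=d_in)


def lora_forward(W, A, B, x):
    # W·x + (B·A)·x, computed as B·(A·x) so the full d_out × d_in
    # update matrix is never materialized.
    return W @ x + B @ (A @ x)


# With B at zero, the adapted layer matches the original layer exactly.
assert np.allclose(lora_forward(W, A, B, x), W @ x)

# Stand-in for a trained B: the result equals using the merged weight W + B·A.
B_trained = rng.normal(size=(d_out, r))
assert np.allclose(lora_forward(W, A, B_trained, x), (W + B_trained @ A) @ x)

full_params = W.size           # what full fine-tuning would update
lora_params = A.size + B.size  # what LoRA updates
print(full_params, "vs", lora_params)
# 3072 vs 448
```

Even at these toy sizes the trainable parameter count drops by almost 7×, and the gap widens as the matrix dimensions grow while r stays small.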






