Fine-tuning a large model used to mean one painful thing: update every weight in it, keep a full copy per task, and pay for the GPUs to do it. A 7-billion-parameter model has 7 billion knobs. The Adam optimizer keeps two extra numbers per knob, so before you even count gradients and activations you're holding roughly three times the model in memory. Then, for every new task you tune, you store another full checkpoint, roughly 13 GB at 16-bit precision. It works, but it's absurdly out of proportion to how much the model actually needs to change to, say, adopt a new tone or learn your domain's vocabulary.
LoRA (Low-Rank Adaptation) is the trick that makes most of that problem go away. And it's built on one clean observation.
The update is low-rank
When you fine-tune, what you really produce is a difference: the new weights minus the old weights, call it ΔW. The insight behind LoRA is that this difference tends to have low "intrinsic rank." The useful change lives in a tiny subspace, not the full d×d space of the matrix. And a low-rank matrix, however big, can be reconstructed almost perfectly from the product of two skinny matrices.
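To make the "two skinny matrices" claim concrete, here is a minimal NumPy sketch. It is only an illustration of the linear algebra, not how LoRA trains: it builds a synthetic, approximately low-rank matrix and recovers two skinny factors with a truncated SVD. The sizes d = 512 and r = 8, the noise level, and all variable names are assumptions picked for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8  # hypothetical sizes, chosen only for illustration

# Build a d x d matrix whose true rank is r, then add a little noise
# so it is only approximately low-rank (a stand-in for a real update).
low_rank = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
delta_w = low_rank + 1e-3 * rng.standard_normal((d, d))

# Truncated SVD: keep only the top-r singular directions.
u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
left = u[:, :r] * s[:r]   # d x r skinny factor (columns scaled by singular values)
right = vt[:r, :]         # r x d skinny factor
approx = left @ right     # rank-r reconstruction of delta_w

rel_error = np.linalg.norm(delta_w - approx) / np.linalg.norm(delta_w)
full_params = delta_w.size
factor_params = left.size + right.size

print(f"relative reconstruction error: {rel_error:.2e}")
print(f"numbers stored: {factor_params} (factors) vs {full_params} (full matrix)")
```

With these sizes the two factors hold 8,192 numbers against 262,144 for the full matrix, a 32-fold saving, and the reconstruction error stays small because almost everything in the matrix lives in those 8 directions.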
So instead of learning ΔW directly, LoRA learns two small factors whose product is ΔW:
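Written out, using the names B and A from the LoRA paper (assumed here, since the text above does not name the factors) and a rank r much smaller than d, the factorization is ΔW = B·A, where B has shape d×r and A has shape r×d. The sketch below checks the shapes and parameter counts for hypothetical sizes d = 256 and r = 4, and shows that a layer can apply the update without ever building the full d×d matrix. The random values stand in for learned ones; nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 256, 4  # hypothetical layer width and rank, for illustration only

w = rng.standard_normal((d, d))  # stand-in for a frozen pretrained weight
b = rng.standard_normal((d, r))  # factor B: d x r (random stand-in for learned values)
a = rng.standard_normal((r, d))  # factor A: r x d (random stand-in for learned values)

delta_w = b @ a                  # the implied update: d x d, but rank at most r
update_rank = np.linalg.matrix_rank(delta_w)

full_params = d * d              # numbers needed to learn delta_w directly
lora_params = b.size + a.size    # numbers needed for the two factors: 2 * d * r

# Applying the adapted layer to a batch of inputs, two equivalent ways.
x = rng.standard_normal((3, d))
y_merged = x @ (w + delta_w).T            # builds the full d x d update first
y_factored = x @ w.T + (x @ a.T) @ b.T    # only ever multiplies by skinny matrices

print(f"rank of update: {update_rank}")
print(f"parameters: {lora_params} (factors) vs {full_params} (full update)")
print(f"outputs match: {np.allclose(y_merged, y_factored)}")
```

For these sizes the factors need 2,048 numbers instead of 65,536. The gap widens as d grows, because the factor count scales with 2·d·r while the full update scales with d².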






