LoRA and QLoRA fine-tuning: what they actually do under the hood

You spent three weeks curating a dataset of legal contract summaries: 12,000 pairs of dense legalese and plain-English counterparts. The model you picked, a 7B-parameter instruction-tuned Llama, understands your prompts but produces summaries that read like a junior associate who memorized Blackstone but never saw a real merger clause. You reach for full fine-tuning, the obvious move. Then torch.cuda.OutOfMemoryError hits at step 20 on your RTX 4090. You try gradient checkpointing. You try a smaller batch. You try half-precision. Still OOM. Your colleague says "just use LoRA" and walks off, as if that explains anything.

This is the gap this post fills. You do not need another high-level "LoRA is a PEFT method" post. You need the math and the trade-offs that let you decide between LoRA, QLoRA, and full fine-tuning for your specific hardware and quality requirements.

Why parameter-efficient fine-tuning exists

The cost of full fine-tuning is straightforward to estimate: a model with P parameters requires storing, at minimum, the model weights (2P bytes in fp16), the optimizer states (8P bytes for Adam, which keeps two fp32 moment estimates per parameter), and the gradients (2P bytes in fp16). For Llama 3 8B, that is roughly 16 GB for weights plus 64 GB for optimizer state plus 16 GB for gradients, or 96 GB total, before a single activation is stored. An RTX 4090 has 24 GB. Even a single 80 GB A100 falls about 16 GB short, so a full fine-tune at this scale does not fit on one card without sharding or offloading the optimizer state.
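The arithmetic above is easy to rerun for other model sizes. Below is a minimal sketch, assuming Llama 3 8B is rounded to 8 billion parameters and 1 GB means 10^9 bytes. The per-parameter byte counts are the ones from the paragraph; the optional fp32 master-weight term (4 bytes per parameter) is an addition reflecting what mixed-precision training commonly keeps. `full_finetune_memory` and `MemoryEstimate` are hypothetical helpers for illustration, not part of any library.

```python
from dataclasses import dataclass

# Bytes per parameter, following the accounting in the text:
# fp16 weights (2), Adam's two fp32 moment buffers (4 + 4 = 8), fp16 gradients (2).
WEIGHT_BYTES = 2
ADAM_STATE_BYTES = 8
GRAD_BYTES = 2
# Assumption (not in the text's minimum): mixed-precision recipes often keep
# an fp32 master copy of the weights, adding 4 bytes per parameter.
FP32_MASTER_BYTES = 4


@dataclass
class MemoryEstimate:
    weights_gb: float
    optimizer_gb: float
    gradients_gb: float
    master_gb: float

    @property
    def total_gb(self) -> float:
        return self.weights_gb + self.optimizer_gb + self.gradients_gb + self.master_gb


def full_finetune_memory(num_params: float, fp32_master: bool = False) -> MemoryEstimate:
    """Lower-bound memory for full fine-tuning with Adam, in decimal GB (1e9 bytes).

    Ignores activations, temporary buffers, and framework/CUDA overhead.
    """
    gb = 1e9
    return MemoryEstimate(
        weights_gb=num_params * WEIGHT_BYTES / gb,
        optimizer_gb=num_params * ADAM_STATE_BYTES / gb,
        gradients_gb=num_params * GRAD_BYTES / gb,
        master_gb=num_params * FP32_MASTER_BYTES / gb if fp32_master else 0.0,
    )


if __name__ == "__main__":
    P = 8e9  # assumption: Llama 3 8B rounded to 8 billion parameters
    est = full_finetune_memory(P)
    print(f"weights {est.weights_gb:.0f} GB, Adam {est.optimizer_gb:.0f} GB, "
          f"grads {est.gradients_gb:.0f} GB, total {est.total_gb:.0f} GB")
    # Compare the lower bound against single-GPU capacities from the text.
    for name, capacity_gb in [("RTX 4090", 24), ("A100 80GB", 80)]:
        shortfall = est.total_gb - capacity_gb
        print(f"{name}: short by {shortfall:.0f} GB before activations")
    with_master = full_finetune_memory(P, fp32_master=True)
    print(f"with fp32 master weights: {with_master.total_gb:.0f} GB")
```

Run as a script, this prints a 96 GB total, a 72 GB shortfall on the RTX 4090 and a 16 GB shortfall on the 80 GB A100, and 128 GB once the fp32 master copy is counted. Every number is a floor: activations grow with batch size and sequence length on top of it.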
