QLoRA: Fine-Tuning a 7B Model on a 16GB GPU (It Shrank to 5.4GB in Front of Me)

Part 3 of a 4-part series. QLoRA explained — quantize the frozen base to 4-bit, then LoRA on top. The BitsAndBytesConfig that matters, the memory-footprint moment, and why it's about fit, not speed.

Sunday, 21 June 2026

In Part 2, LoRA let me fine-tune a 1.5B model by freezing it and training tiny adapters. But the frozen base still sat in memory in 16-bit (~3GB). Now I wanted to go to Qwen2.5-7B — and hit a wall that LoRA alone doesn't solve.

The problem

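A 7B model takes roughly 15GB in 16-bit precision (about 2 bytes per parameter). A free-tier T4 GPU has 16GB. The weights would barely load, leaving no room for the activations, gradients, and optimizer state that training needs.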
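The arithmetic behind those numbers can be sketched in a few lines. This is a rough, weights-only estimate: the parameter counts (7.6e9 for the 7B model, 1.5e9 for the Part 2 model) are approximations assumed for illustration, and `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in decimal GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumptions: approximate parameter counts, not exact model-card figures.
N_PARAMS_7B = 7.6e9
N_PARAMS_1_5B = 1.5e9
T4_MEMORY_GB = 16  # nominal T4 memory, as stated in the text

fp16_7b_gb = weight_memory_gb(N_PARAMS_7B, 16)
fp16_1_5b_gb = weight_memory_gb(N_PARAMS_1_5B, 16)
headroom_gb = T4_MEMORY_GB - fp16_7b_gb

print(f"1.5B model in 16-bit: {fp16_1_5b_gb:.1f} GB")
print(f"7B model in 16-bit:   {fp16_7b_gb:.1f} GB")
print(f"Left over on a T4:    {headroom_gb:.1f} GB")
```

Under these assumptions the 7B weights leave less than 1GB free before any training state is allocated, which is the wall described above.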

The QLoRA insight

QLoRA asks the question that naturally follows from LoRA: the base is frozen and only ever read — so why store it in full precision?
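In the Hugging Face stack, storing the frozen base in 4-bit is usually expressed through a `BitsAndBytesConfig`. The following is a minimal sketch with commonly used QLoRA-style settings, not necessarily the exact configuration from this series: the specific values, the model id `Qwen/Qwen2.5-7B`, and the helper names `load_4bit_base` and `footprint_gb` are assumptions made for illustration.

```python
# Assumed QLoRA-style settings; the values are illustrative, not taken from the post.
QLORA_4BIT_KWARGS = {
    "load_in_4bit": True,                 # store the frozen base weights in 4-bit
    "bnb_4bit_quant_type": "nf4",         # NormalFloat4 data type
    "bnb_4bit_use_double_quant": True,    # also quantize the quantization constants
    "bnb_4bit_compute_dtype": "float16",  # compute in fp16 (a T4 lacks native bfloat16)
}


def load_4bit_base(model_id: str = "Qwen/Qwen2.5-7B"):
    """Load a causal LM with a 4-bit frozen base.

    Needs a CUDA GPU plus the transformers, accelerate, and bitsandbytes packages.
    The default model id is an assumption.
    """
    # Imported lazily so the settings above can be inspected without a GPU stack.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(**QLORA_4BIT_KWARGS)
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )


def footprint_gb(model) -> float:
    """Report the loaded model's memory footprint in decimal GB."""
    return model.get_memory_footprint() / 1e9
```

On a GPU runtime, `footprint_gb(load_4bit_base())` would report the size of the quantized base. The exact figure depends on which layers stay in 16-bit and on library versions, so no number is asserted here.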
