Quantization formats compared: GGUF vs GPTQ vs AWQ vs NF4
You just finished fine-tuning a 7B parameter model. The raw FP16 weights are 14 GB. Your target deployment is a single consumer GPU with 8 GB of VRAM, or perhaps an ARM MacBook with unified memory, or maybe a cloud instance where you pay per GB of GPU memory. The numbers do not add up. The model, as is, does not fit. You need to shrink it, and you need to shrink it in a way that does not turn it into a random-number generator.
This is where weight quantization enters the picture. Reducing each parameter from 16 bits to 4 bits drops the memory footprint by 4x, from 14 GB to roughly 3.5 GB for a 7B model. The trick is how you do it, because not all 4-bit values are the same, and the trade-offs between memory, speed, accuracy, and portability are different for every format.
Why quantization format choice matters
The format determines three things: which hardware can run the model, how fast inference runs, and how much accuracy you give up. These three constraints are in tension. A format optimized for CPU inference (GGUF) uses a different quantization scheme than one designed for GPU batch serving (GPTQ). A format that preserves more accuracy at the same bit-width (AWQ) may cost more to calibrate. A format designed for training (NF4 via bitsandbytes) is not the best choice for inference deployment.








