Quantization formats compared: GGUF vs GPTQ vs AWQ vs NF4

You just finished fine-tuning a 7B-parameter model. The raw FP16 weights are 14 GB (7 billion parameters at 2 bytes each). Your target deployment is a single consumer GPU with 8 GB of VRAM, or perhaps an ARM MacBook with unified memory, or maybe a cloud instance where you pay per GB of GPU memory. The numbers do not add up: the model, as is, does not fit. You need to shrink it, and you need to shrink it in a way that does not turn it into a random-number generator.

This is where weight quantization comes in. Reducing each parameter from 16 bits to 4 bits cuts the weight footprint by 4x, from 14 GB to roughly 3.5 GB for a 7B model (slightly more in practice, since quantized formats also store scaling metadata alongside the weights). The trick is how you do it: not all 4-bit encodings are the same, and each format strikes a different balance between memory, speed, accuracy, and portability.
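The arithmetic above can be sketched in a few lines. This is a rough, weights-only estimate that ignores the KV cache and activations; the helper names, the group size of 128, and the 16-bit per-group scale are illustrative assumptions, not the layout of any specific format.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


def effective_bits(weight_bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Bits per weight once one shared scale per group is counted.

    Assumed layout: each group of `group_size` weights stores one scale of
    `scale_bits` bits. Real formats differ in the details.
    """
    return weight_bits + scale_bits / group_size


N_PARAMS = 7e9  # the 7B model from the text

fp16_gb = weight_memory_gb(N_PARAMS, 16)
ideal_4bit_gb = weight_memory_gb(N_PARAMS, 4)
grouped_4bit_gb = weight_memory_gb(N_PARAMS, effective_bits(4, group_size=128))

print(f"FP16 weights:            {fp16_gb:.2f} GB")
print(f"4-bit, no metadata:      {ideal_4bit_gb:.2f} GB")
print(f"4-bit, scale per 128:    {grouped_4bit_gb:.2f} GB")
```

The gap between the last two figures is the metadata overhead: small per weight, but it is one reason two "4-bit" files of the same model rarely have the same size.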

Why quantization format choice matters

The format determines three things: which hardware can run the model, how fast inference runs, and how much accuracy you give up. These three are in tension. A format optimized for CPU inference (GGUF) uses a different quantization scheme than one designed for GPU batch serving (GPTQ). A format that preserves more accuracy at the same bit-width (AWQ) may cost more to calibrate. A format designed for memory-efficient fine-tuning (NF4 via bitsandbytes, the data type QLoRA uses) is not necessarily the best choice for inference deployment.
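To make "quantization scheme" concrete, here is a minimal sketch of the simplest baseline: symmetric round-to-nearest quantization of one group of weights to 4 bits with a shared scale. It is not the algorithm of GGUF, GPTQ, AWQ, or NF4; it is only meant to show the pieces (integer codes, a scale, a rounding rule) that a scheme has to choose. The function names and sample weights are made up for illustration.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of floats.

    Returns integer codes in [-qmax, qmax] and the shared scale.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round each weight to the nearest grid point, clamping for safety.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


# Hypothetical weights for one small group.
group = [0.02, -0.15, 0.07, 0.70, -0.33, 0.01, 0.25, -0.05]

codes, scale = quantize_group(group)
restored = dequantize_group(codes, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print("codes:   ", codes)
print("scale:   ", scale)
print("max error:", max_err)
```

Note how the single largest weight sets the scale for the whole group, so small weights share a coarse grid. Roughly speaking, the formats differ in how they deal with that: how groups are formed, how the grid points are spaced, and whether calibration data is used to decide which weights deserve more precision.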
