Model Quantization: Post-Training Quantization Using NVIDIA Model Optimizer | NVIDIA Technical Blog

Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs. By lowering computational and memory requirements while preserving model quality, quantization helps AI models run more efficiently in resource-constrained environments.

This post walks through how to use NVIDIA Model Optimizer to quantize a CLIP model to FP8 format using post-training quantization (PTQ). For a general introduction to model quantization, see "Model Quantization: Concepts, Methods, and Why It Matters."

What is NVIDIA Model Optimizer?

The NVIDIA Model Optimizer (ModelOpt) library incorporates state-of-the-art model optimization techniques to compress and accelerate AI models. These techniques include quantization, distillation, pruning, speculative decoding, and sparsity. ModelOpt accepts Hugging Face, PyTorch, or ONNX format models as input and provides Python APIs for users to easily combine different optimization techniques to produce optimized checkpoints.

ModelOpt supports highly performant quantization formats such as FP4, FP8, INT8, and INT4, and advanced algorithms including SmoothQuant, AWQ, SVDQuant, and Double Quantization. It supports both PTQ and quantization-aware training (QAT).
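To make the idea of FP8 PTQ concrete, the following is a minimal, library-free sketch of per-tensor FP8 (E4M3) quantization. It observes the largest absolute value in some calibration data, derives a scale, then rounds scaled values onto the E4M3 grid. It is not ModelOpt's implementation and does not use its API. The helper names (`round_to_e4m3`, `calibrate_scale`, `fake_quantize`) and the per-tensor max-calibration choice are assumptions made purely for illustration.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), FP8_E4M3_MAX)  # saturate instead of overflowing
    _, e = math.frexp(a)           # a = m * 2**e with 0.5 <= m < 1
    # With 3 mantissa bits, the grid spacing is 2**(exponent - 3).
    # Subnormals reuse the spacing of the smallest normal exponent (-6).
    step = 2.0 ** (max(e - 1, -6) - 3)
    # Python's round() is round-half-to-even, matching IEEE nearest-even.
    return sign * round(a / step) * step


def calibrate_scale(calibration_batches) -> float:
    """Hypothetical max calibration: map the observed |max| onto 448."""
    amax = max(abs(v) for batch in calibration_batches for v in batch)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def fake_quantize(values, scale: float):
    """Quantize to the E4M3 grid and dequantize back to float."""
    return [round_to_e4m3(v / scale) * scale for v in values]


if __name__ == "__main__":
    # Toy "activations" standing in for real calibration data (assumption).
    batches = [[0.5, -2.0, 0.03], [1.0, 4.48, -0.7]]
    scale = calibrate_scale(batches)
    for batch in batches:
        print(batch, "->", fake_quantize(batch, scale))
```

Because the scale is fixed from calibration data and no weights are retrained, this is the PTQ setting. QAT would instead keep a rounding step like this inside the training loop so the model can adapt to it.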

Related reading

Model Quantization: Turn FP8 Checkpoints into High-Performance Inference…

Creating the NVIDIA Nemotron 3 Ultra NVFP4 Checkpoint with NVIDIA Model…

Compute-Optimal Quantization-Aware Training

Google's TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x

Inference Optimization for the Rest of Us — KV Cache, Quantization, and Latency…

Quantization formats compared: GGUF vs GPTQ vs AWQ vs NF4
