Converting a quantized checkpoint into an NVIDIA TensorRT engine bridges the gap between model optimization and production deployment, enabling faster inference, higher throughput, and more efficient GPU utilization at scale.

In a previous post, we produced a high-quality FP8-quantized Contrastive Language-Image Pretraining (CLIP) checkpoint with NVIDIA TensorRT Model Optimizer.

This post picks up where we left off, walking through how to export the checkpoint to ONNX and compile it into an NVIDIA TensorRT engine ready for production inference. We also profile the resulting FP8 TensorRT engine against the FP16 baseline to measure the real-world speedup the quantized model delivers.

Figure 1 shows the five stages of a typical end-to-end quantization workflow, which is the standard pipeline for deploying a quantized CLIP model. Quantized LLMs follow a different path through TensorRT-LLM, which is covered in a separate tutorial.

Figure 1. End-to-end quantization and deployment workflow with ModelOpt and TensorRT