Generative AI workloads are rapidly outgrowing the memory and compute budget of single GPUs. For inference developers building media generation pipelines, the challenge is scaling across multiple devices without sacrificing the critical optimizations—like kernel fusions, memory planning, and quantization—that NVIDIA TensorRT delivers for production deployments.

Multi-device inference support, introduced in TensorRT 11.0, brings native, high-performance multi-GPU inference to the TensorRT runtime, enabling multi-device production deployments targeting edge devices.

Combining the multi-device inference support in TensorRT with Torch-TensorRT, developers can convert and deploy massive PyTorch models out-of-framework, shattering single-device memory and compute limits.

Download TensorRT 11.0 with multi-device inference support from the NVIDIA Developer Portal to unlock native, high-performance multi-device acceleration for your models.

NVIDIA NCCL: The transport layer for distributed inference