Enhancing Distributed Inference Performance with the NVIDIA Inference Transfer Library | NVIDIA Technical Blog

Deploying large language models (LLMs) requires large-scale distributed inference, which spreads model computation and request handling across many GPUs and nodes to scale to more users while reducing latency. Distributed inference frameworks use techniques such as disaggregated serving, KV cache loading, and wide expert parallelism.In disaggregated serving environments, prefill and decode phases are run on separate GPUs, requiring efficient KV cache transfers between them. Low-latency and high-throughput communication to move these KV caches are critical to gain benefits from disaggregated serving.

In KV cache loading, storage is used to help with growing KV caches in multiturn and agentic AI workloads such as coding assistants and reasoning. For the case of long context KV, the previous results can be loaded from local SSDs and remote storage, instead of recomputing them as prefill. This is one example that explains why storage is becoming a core part of inference workloads.

In wide expert parallelism, experts are split across many GPUs, where the intermediate results (activations) have to be dispatched to and combined from these experts. Due to the requirement for ultra-low-latency communication for intermediate activations between stages, these transfers are typically initiated by the GPU through optimized kernels, referred to as device side APIs for networking, or device API in short.

Enhancing Distributed Inference Performance with the NVIDIA Inference Transfer Library | NVIDIA Technical Blog

Enhancing Distributed Inference Performance with the NVIDIA Inference Transfer Library | NVIDIA Technical Blog

Related reading

Deploying Disaggregated LLM Inference Workloads on Kubernetes | NVIDIA…

NVIDIA Technical Blog

Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference…

Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA…

The same 16 GPUs, twice the users: Inference-aware routing for LLM clusters

Optimizing inference speed and costs: Lessons learned from large-scale…

Related reading

Deploying Disaggregated LLM Inference Workloads on Kubernetes | NVIDIA…

NVIDIA Technical Blog

Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference…

Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA…

The same 16 GPUs, twice the users: Inference-aware routing for LLM clusters

Optimizing inference speed and costs: Lessons learned from large-scale…