Deploying Disaggregated LLM Inference Workloads on Kubernetes | NVIDIA Technical Blog

As large language model (LLM) inference workloads grow in complexity, a single monolithic serving process starts to hit its limits. Prefill and decode stages have fundamentally different compute profiles, yet traditional deployments force them onto the same hardware, leaving GPUs underutilized and scaling inflexible.

Disaggregated serving addresses this by splitting the inference pipeline into distinct stages such as prefill, decode, and routing, each running as an independent service that can be resourced and scaled on its own terms.

This post gives an overview of how disaggregated inference is deployed on Kubernetes, explores the available ecosystem solutions and how they run on a cluster, and evaluates what each provides out of the box.

How do aggregated and disaggregated inference differ?

Before diving into Kubernetes manifests, it helps to understand the two inference deployment modes for LLMs. In aggregated serving, a single process (or tightly coupled group of processes) handles the entire inference lifecycle from input to output. In disaggregated serving, the pipeline is split into distinct stages such as prefill, decode, and routing, each running as an independent service (see Figure 1, below).
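To make the distinction concrete, here is a minimal Python sketch of the two modes. It is a toy model, not how any particular serving framework implements them: `KVCache`, `prefill`, `decode`, `aggregated_serve`, and `Router` are hypothetical names, the "tokens" are placeholders, and the handoff between stages is a plain function call, where a real deployment would use a network hop and a KV cache transfer.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Hypothetical stand-in for the attention state prefill hands to decode."""
    tokens: list = field(default_factory=list)


def prefill(prompt: str) -> KVCache:
    # Prefill processes the whole prompt in a single pass.
    return KVCache(tokens=prompt.split())


def decode(cache: KVCache, max_new_tokens: int) -> list:
    # Decode emits one token per step, extending the cache each time.
    generated = []
    for step in range(max_new_tokens):
        token = f"tok{step}"  # placeholder for a sampled token
        cache.tokens.append(token)
        generated.append(token)
    return generated


def aggregated_serve(prompt: str, max_new_tokens: int) -> list:
    # Aggregated: one process owns the lifecycle from input to output.
    return decode(prefill(prompt), max_new_tokens)


class Router:
    """Disaggregated: separate prefill and decode pools behind a router."""

    def __init__(self, prefill_workers, decode_workers):
        self.prefill_workers = prefill_workers
        self.decode_workers = decode_workers
        self._requests = 0

    def serve(self, prompt: str, max_new_tokens: int) -> list:
        # Round-robin over each pool independently, so the pools
        # can have different sizes.
        n = self._requests
        self._requests += 1
        prefill_worker = self.prefill_workers[n % len(self.prefill_workers)]
        decode_worker = self.decode_workers[n % len(self.decode_workers)]
        cache = prefill_worker(prompt)  # stage 1: prefill pool
        return decode_worker(cache, max_new_tokens)  # stage 2: decode pool


# One prefill worker and two decode workers: the pools are sized separately.
router = Router(prefill_workers=[prefill], decode_workers=[decode, decode])
print(router.serve("why is the sky blue", 3))
```

Both paths return the same output for the same request. What changes is that the disaggregated router can size the prefill and decode pools independently, which is the property the rest of this post relies on.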
