GPU autoscaling on Kubernetes with KEDA: building an external scaler with NVML

If you run vLLM, Triton, or any other inference server on Kubernetes, you have probably noticed that the HPA cannot see the GPU. Autoscaling decisions are driven by CPU and memory, while the resource that actually determines inference capacity remains invisible. A CNCF blog post published in May 2026 describes how to fix this by building a KEDA external scaler.

The problem with default autoscaling

The Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU and memory metrics by default. For traditional web workloads, that is enough. For LLM inference, it is not: a GPU can be running at 95% utilization while the HPA sees low CPU and decides not to scale.
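To make the blind spot concrete, here is a minimal sketch of the replica calculation documented for the HPA, ceil(currentReplicas * currentMetric / targetMetric), clamped to a replica range. The deployment numbers (2 replicas, 20% CPU, 95% GPU, a 60% target, a range of 2 to 8) are hypothetical, and the real controller also applies a tolerance band and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Simplified HPA rule: ceil(current * metric / target), then clamp."""
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical inference deployment: 2 replicas, pods at 20% CPU and 95% GPU.
cpu_decision = desired_replicas(2, current_value=20, target_value=60,
                                min_replicas=2, max_replicas=8)
gpu_decision = desired_replicas(2, current_value=95, target_value=60,
                                min_replicas=2, max_replicas=8)

print("replicas when scaling on CPU:", cpu_decision)
print("replicas when scaling on GPU:", gpu_decision)
```

With these assumed numbers, the CPU-driven decision stays at the minimum replica count while the same rule fed with GPU utilization asks for more pods, which is the gap the rest of the article is about.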

KEDA (Kubernetes Event-driven Autoscaling) addresses part of this by enabling scaling on external events and custom metrics. But someone still has to read the GPU hardware metrics and expose them in a form KEDA can consume. That is the role of the external scaler.
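The division of labor described above can be sketched in plain Python. This is an illustration under stated assumptions, not the scaler from the CNCF post: a real KEDA external scaler is a gRPC service (its interface defines IsActive, StreamIsActive, GetMetricSpec, and GetMetrics), whereas the hypothetical `GpuScaler` class below only mirrors those decisions as ordinary methods returning simplified dicts. The metric name `gpu_utilization` and the 70% target are made up for the example. The NVML read uses the `pynvml` bindings and is imported lazily, so the rest of the sketch runs on a machine without a GPU.

```python
from typing import Callable, List


def read_nvml_utilizations() -> List[int]:
    """Read per-GPU utilization (0-100) through NVML via the pynvml bindings."""
    import pynvml  # lazy import: only needed on a node that has NVIDIA GPUs

    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        return [
            pynvml.nvmlDeviceGetUtilizationRates(
                pynvml.nvmlDeviceGetHandleByIndex(i)
            ).gpu
            for i in range(count)
        ]
    finally:
        pynvml.nvmlShutdown()


class GpuScaler:
    """Hypothetical stand-in for what an external scaler reports to KEDA."""

    def __init__(self, reader: Callable[[], List[int]],
                 target_utilization: int = 70,
                 metric_name: str = "gpu_utilization") -> None:
        self.reader = reader
        self.target_utilization = target_utilization
        self.metric_name = metric_name

    def average_utilization(self) -> float:
        values = self.reader()
        return sum(values) / len(values) if values else 0.0

    def is_active(self) -> bool:
        # Any GPU activity keeps the workload scaled above zero.
        return self.average_utilization() > 0

    def get_metric_spec(self) -> dict:
        # The target the autoscaler should try to hold per replica.
        return {"metric_name": self.metric_name,
                "target_size": self.target_utilization}

    def get_metrics(self) -> dict:
        # The current value, rounded to an integer percentage.
        return {"metric_name": self.metric_name,
                "metric_value": round(self.average_utilization())}


# Usage with a fake reader; on a GPU node, pass read_nvml_utilizations instead.
scaler = GpuScaler(reader=lambda: [95, 91])
print(scaler.get_metric_spec())
print(scaler.get_metrics())
```

Passing the reader in as a callable keeps the hardware access separate from the scaling logic, so the reporting side can be exercised with fixed values before it is wired to NVML and a gRPC server.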

How the external scaler works

