If you run vLLM, Triton, or any other inference server on Kubernetes, you have probably noticed that the HPA cannot see the GPU. Autoscaling decisions are driven by CPU and memory, while the resource that actually determines inference capacity remains invisible. A CNCF blog post published in May 2026 describes how to fix this by building a KEDA external scaler.
The problem with default autoscaling
The Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU and memory metrics by default. For traditional web workloads, that is enough. For LLM inference, it is not, because the bottleneck is the GPU and those metrics do not reflect it. A GPU can be running at 95% utilization while the HPA sees low CPU and decides not to scale.
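To make the mismatch concrete, here is a minimal sketch of the replica calculation the HPA applies to a single metric, as described in the Kubernetes documentation: desired = ceil(current * currentValue / targetValue), with no change when the ratio is within a tolerance (0.1 by default). The utilization figures and the 60% target are illustrative assumptions, not numbers from the post.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Approximate the HPA's per-metric replica calculation."""
    ratio = current_value / target_value
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Assumed scenario: 2 replicas, 60% utilization target on both metrics.
cpu_based = desired_replicas(2, current_value=55, target_value=60)
gpu_based = desired_replicas(2, current_value=95, target_value=60)

print(cpu_based)  # 2: CPU looks fine, so nothing happens
print(gpu_based)  # 4: the same formula fed GPU utilization would scale out
```

The formula itself is not the problem. It simply never receives the GPU number.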
KEDA (Kubernetes Event-driven Autoscaling) addresses part of this by enabling scaling on external events and custom metrics. But someone still has to read the GPU hardware metrics and expose them in a form KEDA can consume. That is the role of the external scaler.
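As a rough illustration of that role, the sketch below mirrors three methods of KEDA's external scaler gRPC service (IsActive, GetMetricSpec, GetMetrics) as plain Python, leaving out the gRPC wiring and the StreamIsActive method. The class name, the metric name, the 70% target, and the read_gpu_utilization callable are all hypothetical; a real scaler would implement the service generated from KEDA's externalscaler.proto and read utilization from actual GPU telemetry. This is not the implementation from the CNCF post.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MetricSpec:
    metric_name: str
    target_size: int  # per-replica target that KEDA hands to the HPA


@dataclass
class MetricValue:
    metric_name: str
    metric_value: int


class GpuScalerSketch:
    """Hypothetical scaler logic; names and defaults are assumptions."""

    def __init__(
        self,
        read_gpu_utilization: Callable[[], List[float]],
        target_percent: int = 70,
        metric_name: str = "gpu_utilization_percent",
    ):
        # read_gpu_utilization stands in for whatever reads the hardware.
        self._read = read_gpu_utilization
        self._target = target_percent
        self._name = metric_name

    def _average(self) -> float:
        samples = self._read()
        return sum(samples) / len(samples) if samples else 0.0

    def is_active(self) -> bool:
        # Mirrors IsActive: should the workload be scaled up from zero?
        return self._average() > 0

    def get_metric_spec(self) -> List[MetricSpec]:
        # Mirrors GetMetricSpec: which metric, and what target value.
        return [MetricSpec(self._name, self._target)]

    def get_metrics(self, metric_name: str) -> List[MetricValue]:
        # Mirrors GetMetrics: the current value of the requested metric.
        if metric_name != self._name:
            return []
        return [MetricValue(self._name, int(round(self._average())))]


# Usage with a stubbed reader that reports two busy GPUs.
scaler = GpuScalerSketch(lambda: [95.0, 91.0])
print(scaler.get_metric_spec())
print(scaler.get_metrics("gpu_utilization_percent"))
```

Injecting the reader as a callable keeps the scaling logic testable without a GPU; only that one function would need to change to plug in a real metrics source.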
How the external scaler works







