A single A10 GPU on OCI costs $1.52/hr. Running 24/7, that's about $1,094 for a 30-day month. For a production inference service with steady traffic, that's fine. But I had a staging environment and a couple of internal tools that got maybe 20 requests per day. I was paying over $2,000/month for GPUs that sat idle 95% of the time.
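For anyone who wants to plug in their own numbers, here is a rough back-of-the-envelope version of that math. The hourly rate and the 95% idle figure come from above; the 30-day month and the GPU count of two are assumptions (two always-on A10s is simply the smallest count that lands "over $2,000/month"), and the function names are made up for illustration.

```python
HOURLY_RATE_USD = 1.52     # A10 on OCI, from the figure quoted above
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month, running 24/7
IDLE_FRACTION = 0.95       # from the estimate quoted above


def monthly_cost(gpus: int, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost of keeping `gpus` GPUs running around the clock for a month."""
    return gpus * hourly_rate * HOURS_PER_MONTH


def idle_spend(gpus: int, idle_fraction: float = IDLE_FRACTION) -> float:
    """Portion of the monthly bill that pays for GPUs doing nothing."""
    return monthly_cost(gpus) * idle_fraction


if __name__ == "__main__":
    # Assumption: two always-on GPUs (the exact count is a guess).
    for gpus in (1, 2):
        total = monthly_cost(gpus)
        wasted = idle_spend(gpus)
        print(f"{gpus} GPU(s): ${total:,.2f}/month, ${wasted:,.2f} of it idle")
```

With two GPUs this works out to roughly $2,189/month, of which about $2,079 pays for idle time.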
The obvious solution: scale to zero when there's no traffic, spin up when a request comes in. KEDA does this on Kubernetes, but getting it to work properly with GPU pods took some figuring out.
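To make "scale to zero when idle, spin up on demand" concrete, here is a toy model of the decision an autoscaler has to make. This is not KEDA's actual code, just a simplified sketch: the function name, the parameters, and the per-replica target of 10 requests are all hypothetical. The 300-second cooldown mirrors the default of KEDA's `cooldownPeriod` setting, but treat every number here as a placeholder.

```python
import math


def desired_replicas(
    current_replicas: int,
    pending_requests: int,
    idle_seconds: float,
    *,
    target_per_replica: int = 10,     # hypothetical load one pod can take
    cooldown_seconds: float = 300.0,  # wait this long before dropping to zero
    max_replicas: int = 2,            # hypothetical cap (GPUs are expensive)
) -> int:
    """Simplified scale-to-zero decision; an illustration, not KEDA's logic."""
    if pending_requests > 0:
        # Traffic present: run at least one pod, more if load demands it,
        # but never more than the configured cap.
        wanted = math.ceil(pending_requests / target_per_replica)
        return min(max_replicas, max(1, wanted))
    if idle_seconds >= cooldown_seconds:
        # No traffic for the whole cooldown window: release the GPU.
        return 0
    # No traffic yet, but still inside the cooldown window: hold steady.
    return current_replicas


if __name__ == "__main__":
    print(desired_replicas(0, 1, 0.0))    # first request arrives while at zero
    print(desired_replicas(1, 0, 120.0))  # idle, cooldown not yet elapsed
    print(desired_replicas(1, 0, 300.0))  # idle long enough to scale to zero
```

The cooldown is there so that a short gap between requests doesn't tear the pod down and force the next caller to wait for a fresh start.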
Why Scaling GPUs Is Harder Than Scaling CPU Pods
With normal HTTP services, KEDA watches a metric (HTTP requests, queue depth, whatever), and Kubernetes can spin up a new pod in seconds. The user barely notices.
GPU pods are different:












