Deploying vLLM on OKE with NVIDIA A10 GPUs: The 20-Minute Setup Nobody Talks About

Tuesday, June 16, 2026

Last month I needed to stand up a Llama 3 inference endpoint for an internal tool. The requirements were simple: OpenAI-compatible API, auto-scaling, and it couldn't cost more than the team's coffee budget. AWS wanted $3.06/hr for a g5.xlarge. Azure quoted something similar.

Then I looked at OCI's GPU shapes. VM.GPU.A10.1 — a single NVIDIA A10 with 24GB VRAM — at $1.52/hr on-demand. Half the price. And on preemptible? $0.46/hr. That's a latte.
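To put those hourly numbers in monthly terms, here is a rough sketch of the arithmetic. The rates are simply the ones quoted above, not checked against current price lists, and the 730-hour month is an assumption (an always-on node, no scale-down). The helper names are made up for illustration.

```python
# Hourly rates as quoted in this post (USD); not verified against live pricing.
HOURLY_RATES = {
    "AWS g5.xlarge (on-demand)": 3.06,
    "OCI VM.GPU.A10.1 (on-demand)": 1.52,
    "OCI VM.GPU.A10.1 (preemptible)": 0.46,
}

# Assumption: one node running 24/7 for an average month (8760 h / 12).
HOURS_PER_MONTH = 730


def monthly_cost(hourly_rate: float, hours: int = HOURS_PER_MONTH) -> float:
    """Cost in USD of running one instance for the given number of hours."""
    return round(hourly_rate * hours, 2)


def savings_percent(baseline_rate: float, rate: float) -> float:
    """How much cheaper `rate` is than `baseline_rate`, as a percentage."""
    return round(100 * (1 - rate / baseline_rate), 1)


if __name__ == "__main__":
    baseline = HOURLY_RATES["AWS g5.xlarge (on-demand)"]
    for name, rate in HOURLY_RATES.items():
        cost = monthly_cost(rate)
        saved = savings_percent(baseline, rate)
        print(f"{name:34s} ${cost:>9,.2f}/month  ({saved:.1f}% below baseline)")
```

With these inputs, the on-demand A10 comes out at roughly half the AWS figure over a month, and the preemptible rate at about 85% less. Preemptible capacity can be reclaimed, so treat that last number as a best case.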

Here's how I got vLLM running on OKE in about 20 minutes.

The OKE Cluster Setup

If you already have an OKE cluster, skip ahead. If not, this is the fastest path:
