Last month I needed to stand up a Llama 3 inference endpoint for an internal tool. The requirements were simple: OpenAI-compatible API, auto-scaling, and it couldn't cost more than the team's coffee budget. AWS wanted $3.06/hr for a g5.xlarge. Azure quoted something similar.

Then I looked at OCI's GPU shapes. VM.GPU.A10.1 — a single NVIDIA A10 with 24GB VRAM — at $1.52/hr on-demand. Half the price. And on preemptible? $0.46/hr. That's a latte.

Here's how I got vLLM running on OKE in about 20 minutes.

The OKE Cluster Setup

If you already have an OKE cluster, skip ahead. If not, this is the fastest path: