I have done this setup process more times than I want to count. Every time I find something that the documentation skipped or assumed. This is the version I wish I had read first.
This covers deploying a production-ready self-hosted LLM inference server for an enterprise RAG use case. I am using Llama 3 8B with vLLM on a single A100 instance. Adjust for your hardware.
What you actually need before you touch a single command
GPU memory math first. Llama 3 8B in fp16 needs roughly 16GB VRAM just for model weights. Add KV cache for your expected concurrent sessions and you are pushing 35-40GB. One A100 80GB handles this comfortably. One A100 40GB will work but you are tight. Two A10Gs in tensor parallel will work. Know your numbers before provisioning.
Your network topology matters. The inference server needs to reach your vector database and your application layer. If those are in a private VPC, your inference server needs to be in the same VPC or peered. Setting this up after the fact while production is waiting is miserable.






