I run a homelab with four RTX 3090s — 96 GB of VRAM, 44 CPU cores. For two weeks I tried to make it my daily driver for local LLM inference instead of paying for cloud APIs. I got it working. Then I looked at the numbers and subscribed to a paid API anyway.
Here's the uncomfortable part, and the optimizations that still made it worth doing.
## The setup
4× RTX 3090 (Ampere — no native BF16), 96 GB VRAM total, 44 cores
Models: Qwen3.6-35B-A3B (Q8_0, MoE) and Qwen3-Coder-Next (Q6_K, hybrid)







