I spent two weeks optimizing 96GB of VRAM for local LLMs. Paid APIs still won.

Saturday, 20 June 2026

TL;DR

A 96 GB VRAM local LLM rig (4× RTX 3090) reached only 6% GPU utilization because of a sequential CPU dispatch bottleneck. Cloud APIs win economically once hardware depreciation and 11 kWh/day of power are counted: it is a TCO problem, not a capability gap.

I run a homelab with four RTX 3090s — 96 GB of VRAM, 44 CPU cores. For two weeks I tried to make it my daily driver for local LLM inference instead of paying for cloud APIs. I got it working. Then I looked at the numbers and subscribed to a paid API anyway.

Here's the uncomfortable part, and the optimizations that still made it worth doing.

## The setup

4× RTX 3090 (Ampere, no native FP8), 96 GB VRAM total, 44 CPU cores

Models: Qwen3.6-35B-A3B (Q8_0, MoE) and Qwen3-Coder-Next (Q6_K, hybrid)
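To get a feel for how much of the 96 GB the weights alone take, here is a rough back-of-envelope sketch. It assumes that "35B" in the model name means about 35e9 total parameters and uses the Q8_0 block layout (32 int8 weights plus one fp16 scale, so 8.5 bits per weight). It ignores KV cache, activations, and runtime buffers, and it mixes decimal GB with the cards' nominal 24 GB, so treat the result as an estimate only. Qwen3-Coder-Next is left out because its parameter count is not stated here.

```python
# Back-of-envelope weight footprint: parameters * bits-per-weight / 8.
# Q8_0 stores 32 int8 weights plus one fp16 scale per block:
# 34 bytes / 32 weights = 8.5 bits per weight.
Q8_0_BITS_PER_WEIGHT = 8.5

TOTAL_VRAM_GB = 96   # 4 x 24 GB, from the setup above
NUM_GPUS = 4


def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: "35B" is read as roughly 35e9 total parameters.
qwen_35b_q8 = weight_footprint_gb(35e9, Q8_0_BITS_PER_WEIGHT)
headroom_gb = TOTAL_VRAM_GB - qwen_35b_q8

print(f"weights: {qwen_35b_q8:.1f} GB")
print(f"left for KV cache, activations, buffers: {headroom_gb:.1f} GB")
print(f"per-GPU share if split evenly: {qwen_35b_q8 / NUM_GPUS:.1f} GB of 24 GB")
```

Under these assumptions the weights occupy well under half of the total VRAM, which matches the overall picture: capacity is not the limiting factor on this rig.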
