This article was originally published on runaihome.com
TL;DR: Qwen 3.6 35B-A3B is a Mixture-of-Experts model that activates only 3B parameters per token, so it computes like a 3B model. All 35B parameters must still be resident in VRAM at once, which makes 24GB the floor for running it entirely on the GPU. On a 24GB RTX 4090 it reaches 120 tok/s at Q4_K_M; a used RTX 3090 gets 107 tok/s with Ollama, or 135.7 tok/s with a properly tuned llama.cpp config. Any GPU with 16GB of VRAM has to offload part of the model to the CPU, which cuts speed by half.
RTX 4090 24GB
RTX 3090 24GB
Mac Mini M4 Pro 24GB
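
To see where the 24GB floor in the TL;DR comes from, here is a rough back-of-the-envelope sketch of the weight footprint. The 35B total and 3B active parameter counts come from the model name; the average of 4.85 bits per weight for Q4_K_M is an assumption (the real figure varies by model and tensor mix), and the helper name weight_memory_gb is hypothetical. The estimate covers weights only and ignores the KV cache and runtime buffers, which need additional room.

```python
def weight_memory_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    total_bits = total_params_b * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


TOTAL_PARAMS_B = 35.0   # all experts must be resident in memory
ACTIVE_PARAMS_B = 3.0   # parameters actually used per token
ASSUMED_BPW = 4.85      # ASSUMPTION: average bits per weight for Q4_K_M

weights_gb = weight_memory_gb(TOTAL_PARAMS_B, ASSUMED_BPW)
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(f"Quantized weights: ~{weights_gb:.1f} GB")
print(f"Share of weights active per token: {active_fraction:.1%}")

# Compare against the two VRAM sizes discussed in the article.
for vram_gb in (16, 24):
    headroom_gb = vram_gb - weights_gb
    verdict = "fits on the GPU" if headroom_gb > 0 else "needs CPU offloading"
    print(f"{vram_gb} GB card: {headroom_gb:+.1f} GB headroom, {verdict}")
```

Under these assumptions the weights alone come to roughly 21 GB, which leaves only a few GB of headroom on a 24GB card and a shortfall of several GB on a 16GB card. Only a small share of the weights is used for any given token, which is why compute cost is low even though the memory requirement is not.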






