Every few days someone in a local LLM thread asks the same question: "will this run on my 3060?" And the answers are almost always vibes. "Should be fine." "Probably need to quantize." Nobody shows the math, so you download 16GB, load it up, and find out the hard way.
I did exactly that a while back. Grabbed an 8B model, it loaded fine on a 12GB card, I felt clever, and then it OOM'd about 20,000 tokens into a long document. The weights fit. The KV cache didn't. That gap is the whole reason for this post.
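To put a rough number on that gap, here is a minimal sketch. It assumes Llama 3 8B's published shapes (32 layers, 8 KV heads, head dimension 128) and an fp16 cache, which may not match whatever model and runtime you are using. The function name and its defaults are made up for illustration, and a real runtime adds its own overhead on top.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate bytes needed to cache keys and values for n_tokens of context."""
    # Each layer stores two tensors (K and V); each holds
    # n_kv_heads * head_dim values per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token


GIB = 1024 ** 3

# Assumed shapes: Llama 3 8B (32 layers, 8 KV heads, head_dim 128), fp16 cache.
cache = kv_cache_bytes(20_000, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"KV cache at 20,000 tokens: {cache / GIB:.2f} GiB")  # 2.44 GiB
```

Under those assumptions the cache costs 128 KiB per token, so it grows linearly with context and reaches a couple of GiB at the point where my run fell over. Whether that tips a 12GB card into OOM depends on how much room the weights and runtime have already taken.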
So here is the actual math, with real numbers for Llama 3 and Gemma. That includes the part that surprised me: two models that look identical on paper can need very different amounts of memory.
Three things eat your VRAM
When you run a model locally, your GPU memory goes to three places:






