Did you know that a 35-billion-parameter model can generate tokens at roughly the same per-token compute cost as a dense 4B model? That single fact made me abandon a multi-model agent architecture I'd spent a weekend building. But I had to run the benchmarks first to understand why.

Here's the full breakdown, with commands, numbers, and the architectural reason it all falls apart on shared-memory hardware.

The Discovery That Changed Everything

I'd been running qwen3.6:35b on my Minisforum UM790Pro for weeks as my daily coding assistant. It ran at 17.8 tokens/second, which is genuinely usable for interactive work. But I kept wondering: could I run a lightweight sidecar model alongside it for quick classification and tool-calling in an agent pipeline?

Before I even started benchmarking, I dug into what qwen3.6:35b actually is under the hood. It's a Mixture of Experts model: 256 total experts with only 8 activated per token. The architecture also incorporates SSM (State Space Model) components alongside traditional attention -- Mamba-style layers that handle certain sequence patterns more efficiently than pure transformers.
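
To make the 8-of-256 routing concrete, here's a minimal sketch of generic top-k expert gating in plain Python. It is an illustration, not the model's actual router: the function names, the random router logits, and the choice to softmax over only the selected experts are all assumptions made for the example. Only the two constants come from the numbers above.

```python
import math
import random

NUM_EXPERTS = 256  # total experts per MoE layer (figure quoted above)
TOP_K = 8          # experts activated per token (figure quoted above)


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token.

    Returns a list of (expert_index, weight) pairs. The weights are a
    softmax over the selected logits only, so they sum to 1.
    This is a hypothetical, generic top-k gate for illustration.
    """
    # Indices of the top_k largest router logits, highest first.
    chosen = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def active_expert_fraction(num_experts=NUM_EXPERTS, top_k=TOP_K):
    """Fraction of a layer's experts that run for a single token."""
    return top_k / num_experts


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in router scores; a real model computes these from the token's hidden state.
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    routing = route_token(logits)
    print(f"{len(routing)} of {NUM_EXPERTS} experts run for this token")
    print(f"active expert fraction: {active_expert_fraction():.4%}")
```

With 8 of 256 experts selected, only 3.125% of each layer's experts do any work for a given token. That's a fraction of experts, not of total parameters: attention layers, embeddings, and anything else shared across tokens still run every time, so the active parameter share is higher than 3.125%.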