Why DDR5 Bandwidth Kills Dual-LLM Inference on APUs (Benchmarks Inside)

Did you know that a 35-billion-parameter model can generate tokens at roughly the same per-token compute cost as a 4B model? That single fact made me abandon a multi-model agent architecture I'd spent a weekend building. But I had to run the benchmarks first to understand why.

Here's the full breakdown, with commands, numbers, and the architectural reason it all falls apart on shared-memory hardware.

The Discovery That Changed Everything

I'd been running qwen3.6:35b on my Minisforum UM790Pro for weeks as my daily coding assistant. 17.8 tokens/second -- genuinely usable for interactive work. But I kept wondering: could I run a lightweight sidecar model alongside it for quick classification and tool-calling in an agent pipeline?

Before I even started benchmarking, I dug into what qwen3.6:35b actually is under the hood. It's a Mixture of Experts model: 256 total experts with only 8 activated per token. The architecture also incorporates SSM (State Space Model) components alongside traditional attention -- Mamba-style layers that handle certain sequence patterns more efficiently than pure transformers.
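
To make the "8 of 256" figure concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is an illustration only, not qwen3.6's actual router: the random logits stand in for a learned router's output, the function name `route_top_k` is made up for this example, and normalizing the weights over just the selected experts is an assumption about how the gating is done.

```python
import math
import random

NUM_EXPERTS = 256  # total experts, as stated in the article
TOP_K = 8          # experts activated per token, as stated in the article


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Hypothetical helper: normalizing over only the selected experts is an
    assumption, not a confirmed detail of the model.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Stand-in router scores for one token (a real router computes these
# from the token's hidden state).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route_top_k(logits, TOP_K)
active_fraction = TOP_K / NUM_EXPERTS

print("experts used for this token:", sorted(i for i, _ in selected))
print(f"fraction of experts touched per token: {active_fraction:.4f}")
```

Only the selected experts' feed-forward weights take part in the computation for that token, which is why the per-token work tracks the active parameters and not the full parameter count.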
