Running Mixtral 8x7B at 21+ TPS on Pure CPU via io_uring and Predictive Caching

The current consensus in AI infrastructure is unyielding: if you want to run frontier Mixture of Experts (MoE) models at speeds that keep pace with human reading, you must pay the VRAM premium. Traditionally, the entire model is pinned in high-bandwidth GPU memory so that the execution pipeline never stalls waiting on weights.

At Amalgafy Labs, we built the Micro-Expert-Router (MER) to challenge this assumption.

We wanted to prove that, with low-level systems engineering, an intelligent software abstraction layer can turn cheap, abundant, CPU-heavy commodity cloud shapes into high-throughput inference engines.

Yesterday, we took the engine out of the "proven on paper" phase and validated it on live cloud silicon. Running Mixtral 8x7B (47B parameters, q4_0 quantization) on a standard virtual machine using only host CPU execution, the engine delivered a sustained 21.38 tokens per second (TPS) over a 5,000-token context window.
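To put the model size in perspective, here is a rough back-of-the-envelope sketch of the weight footprint. It assumes the ggml q4_0 layout (blocks of 32 weights stored as 16 bytes of 4-bit values plus a 2-byte scale, about 4.5 bits per weight) and an approximate figure of 13B active parameters per token, as Mistral AI has reported for Mixtral 8x7B. Neither number comes from our benchmark run, and real model files keep some tensors at higher precision, so treat the results as estimates only.

```python
# Back-of-the-envelope sizing for a q4_0-quantized MoE model.
# Assumption: ggml q4_0 layout, 32 weights per block, 18 bytes per block.
Q4_0_BLOCK_WEIGHTS = 32
Q4_0_BLOCK_BYTES = 18  # 16 bytes of packed 4-bit values + 2-byte fp16 scale


def q4_0_bytes(num_params: float) -> float:
    """Approximate storage, in bytes, for num_params weights in q4_0."""
    return num_params / Q4_0_BLOCK_WEIGHTS * Q4_0_BLOCK_BYTES


TOTAL_PARAMS = 47e9   # total parameter count quoted in this article
ACTIVE_PARAMS = 13e9  # assumption: approximate parameters used per token

total_gb = q4_0_bytes(TOTAL_PARAMS) / 1e9
active_gb = q4_0_bytes(ACTIVE_PARAMS) / 1e9

if __name__ == "__main__":
    print(f"full model weights:       ~{total_gb:.1f} GB")
    print(f"weights touched per token: ~{active_gb:.1f} GB")
```

Under these assumptions the full model weighs in at roughly 26 GB, while a single token touches only about 7 GB of it. That gap between total and active weights is what makes it plausible to keep only part of the model hot at any given moment.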

The full source code is now available on GitHub: randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER.
