I built a Rust inference engine that streams MoE expert weights from NVMe SSDs, no GPU required

Most people trying to run Mixtral or DeepSeek-V3 locally hit the same wall: they don't have 80GB of VRAM. The common answer is "get better hardware." I wanted to see if there was another way.

The idea is straightforward, and it builds on Apple’s landmark research paper, "LLM in a flash: Efficient Large Language Model Inference with Limited Memory". NVMe SSDs have gotten fast enough (PCIe Gen5 arrays are hitting ~56 GB/s) that you can treat them as a first-class memory tier for LLM inference instead of just storage. This is especially interesting for Mixture-of-Experts models, because at any given token step only a few experts are active: 2 of 8 on Mixtral 8x7B. That's ~6GB of active weights, not 24GB.
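The arithmetic behind those numbers can be sketched in a few lines of Rust. This is a back-of-envelope model, not code from the project: the function names are made up for illustration, it treats the full 24GB as expert weights (shared attention weights are ignored), and it assumes the worst case where every active expert is re-read from the SSD on every token with no caching.

```rust
/// Weight footprint (in GB) that is actually touched per token when only
/// `active` of `total` experts run. Simplification: all weights are
/// assumed to belong to experts.
fn active_weight_gb(total_expert_gb: f64, active: u32, total: u32) -> f64 {
    total_expert_gb * f64::from(active) / f64::from(total)
}

/// Worst-case seconds spent streaming per token, assuming every active
/// weight is read from the SSD on every step (no caching, no overlap
/// between I/O and compute).
fn worst_case_stream_seconds(active_gb: f64, bandwidth_gb_per_s: f64) -> f64 {
    active_gb / bandwidth_gb_per_s
}
```

With the figures quoted above, active_weight_gb(24.0, 2, 8) gives 6 GB, and worst_case_stream_seconds(6.0, 56.0) comes out at roughly 0.107 s, or about 9 tokens per second before any caching. Real throughput depends on how often experts are reused between tokens.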

Micro-Expert-Router is the result. It's a Rust inference engine that streams MoE expert weights directly from NVMe via io_uring with O_DIRECT, routes tokens through real SwiGLU FFN kernels, and exposes an OpenAI-compatible HTTP API with SSE streaming.
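To make "routes tokens through SwiGLU FFN" concrete, here is a minimal, unoptimized sketch of top-k routing followed by a SwiGLU feed-forward pass. It is an illustration under assumptions, not the engine's actual kernels: the Expert struct, its row-major Vec<Vec<f32>> layout, and all function names are hypothetical, and the gate weights are a softmax over only the selected logits, which is one common scheme.

```rust
/// One expert's FFN weights, row-major. Hypothetical layout for illustration.
struct Expert {
    w_gate: Vec<Vec<f32>>, // [hidden][dim]
    w_up: Vec<Vec<f32>>,   // [hidden][dim]
    w_down: Vec<Vec<f32>>, // [dim][hidden]
}

/// Plain matrix-vector product: one dot product per row of `w`.
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// SiLU (swish) activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SwiGLU FFN: down( silu(gate(x)) * up(x) ), elementwise product in the middle.
fn swiglu_ffn(e: &Expert, x: &[f32]) -> Vec<f32> {
    let gate = matvec(&e.w_gate, x);
    let up = matvec(&e.w_up, x);
    let hidden: Vec<f32> = gate.iter().zip(&up).map(|(g, u)| silu(*g) * u).collect();
    matvec(&e.w_down, &hidden)
}

/// Pick the `k` highest router logits and softmax over just those.
/// Returns (expert_index, gate_weight) pairs, highest logit first.
fn top_k_experts(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> =
        router_logits.iter().copied().enumerate().collect();
    // Stable sort, descending by logit; total_cmp is a total order on f32.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(k.min(indexed.len()));
    if indexed.is_empty() {
        return indexed;
    }
    // Subtract the max logit before exponentiating for numerical stability.
    let max = indexed[0].1;
    let exps: Vec<f32> = indexed.iter().map(|&(_, l)| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    indexed.iter().zip(exps).map(|(&(i, _), e)| (i, e / sum)).collect()
}

/// Run one token through its top-k experts and mix outputs by gate weight.
fn moe_forward(experts: &[Expert], router_logits: &[f32], x: &[f32], k: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; x.len()];
    for (idx, weight) in top_k_experts(router_logits, k) {
        let y = swiglu_ffn(&experts[idx], x);
        for (o, v) in out.iter_mut().zip(y) {
            *o += weight * v;
        }
    }
    out
}
```

Only the experts returned by top_k_experts are ever indexed, which is the property that makes SSD streaming viable: the other experts' weights never need to be resident for that token.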

What's in it:

SSD-streamed expert loading via io_uring fixed buffers and O_DIRECT pread
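One practical detail with O_DIRECT is alignment: the file offset, the read length, and the buffer address generally all have to be multiples of the device's logical block size. The sketch below shows only the offset and length half of that, widening an expert's byte range to block boundaries. It is an assumption-laden illustration rather than the project's code: AlignedRead and align_for_direct_io are hypothetical names, and the block size you pass in (4096 in the example) should be whatever your device actually reports.

```rust
/// The block-aligned window to request from the device, plus where the
/// bytes we actually wanted start inside the returned buffer.
#[derive(Debug, PartialEq)]
struct AlignedRead {
    file_offset: u64,     // rounded down to a block boundary
    read_len: u64,        // multiple of the block size
    payload_start: usize, // offset of the requested bytes within the buffer
}

/// Widen the byte range [offset, offset + len) so that both its start and
/// its length are multiples of `block`, which must be a power of two.
fn align_for_direct_io(offset: u64, len: u64, block: u64) -> AlignedRead {
    assert!(block.is_power_of_two(), "block size must be a power of two");
    let mask = !(block - 1);
    let start = offset & mask; // round start down
    let end = (offset + len + block - 1) & mask; // round end up
    AlignedRead {
        file_offset: start,
        read_len: end - start,
        payload_start: (offset - start) as usize,
    }
}
```

For example, an expert stored at byte 5000 with length 3000 and a 4096-byte block becomes a single 4096-byte read at offset 4096, with the payload starting 904 bytes into the buffer. The destination buffer still needs its own aligned allocation, which this sketch leaves out.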


