Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs

Inference speed is becoming a competitive metric for large language models. Xiaomi’s MiMo team has released MiMo-V2.5-Pro-UltraSpeed, built in collaboration with the TileRT systems group. It decodes at more than 1000 tokens per second on a 1-trillion-parameter model, which the Xiaomi team describes as a first at trillion-parameter scale. Demos show generation peaking near 1200 tokens per second. The notable part is the hardware: it runs on commodity GPUs, not custom silicon.

What is MiMo-V2.5-Pro-UltraSpeed

UltraSpeed is a high-speed serving mode for the existing MiMo-V2.5-Pro model. The base model uses a Mixture-of-Experts (MoE) architecture at trillion-parameter scale. UltraSpeed targets generation speed rather than model capability: it changes how fast the model produces output tokens, not what the model can do. The speedup comes from three coordinated techniques across the model and the serving system, an approach Xiaomi calls extreme model-system codesign. Crucially, the entire stack runs on a single standard 8-GPU commodity node.

The Speed Case: Three Layers Working Together

The first layer is FP4 quantization. At trillion-parameter scale, FP8 or FP16 weights create heavy memory and bandwidth pressure. Lower bit-width weights move through memory faster, which directly lifts decode speed. Xiaomi uses the MXFP4 format, applied selectively to the MoE experts only. Other modules keep higher precision, which TileRT reports as FP8. Experts hold most of the parameters and tolerate quantization best, so the tradeoff is favorable. Quantization-Aware Training (QAT) keeps benchmark quality essentially on par with the original.
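
To make the bandwidth argument concrete, here is a minimal Python sketch of block-scaled 4-bit quantization in the style of the OCP Microscaling MXFP4 format (32-element blocks, one shared power-of-two scale, E2M1 elements). It is an illustration only, not Xiaomi's or TileRT's implementation: the function names, the scale-selection rule, and the tie-breaking are assumptions, and the footprint figures treat all 1 trillion parameters uniformly, whereas only the experts are reported to use MXFP4.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is stored separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32    # elements that share one scale in MXFP4
ELEMENT_BITS = 4   # bits per E2M1 element
SCALE_BITS = 8     # bits for the shared power-of-two (E8M0) scale


def quantize_block(block):
    """Quantize one block to (shared_exponent, signed E2M1 grid values).

    Hypothetical helper: the scale rule below is one common choice,
    not a confirmed detail of the MiMo or TileRT stack.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # E2M1's largest value is 6 = 1.5 * 2**2, so its top exponent is 2.
    # Align the block's largest exponent with that top exponent.
    shared_exp = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** shared_exp
    quantized = []
    for x in block:
        # Clamp to the largest representable magnitude, then snap to the grid.
        mag = min(abs(x) / scale, E2M1_GRID[-1])
        # Ties go to the smaller magnitude here, a simplification.
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        quantized.append(math.copysign(nearest, x))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Recover approximate real values from a quantized block."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


def bits_per_weight():
    """Effective storage cost: 4-bit elements plus the amortized shared scale."""
    return ELEMENT_BITS + SCALE_BITS / BLOCK_SIZE


def weight_bytes(num_params, bits):
    """Bytes needed to store num_params weights at the given bit-width."""
    return num_params * bits / 8


if __name__ == "__main__":
    params = 10**12  # illustrative: all weights treated uniformly
    for name, bits in [("FP16", 16), ("FP8", 8), ("MXFP4", bits_per_weight())]:
        print(f"{name}: {weight_bytes(params, bits) / 1e12:.3f} TB")
    exp, q = quantize_block([0.3, -1.7, 2.2, 0.05])
    print(exp, q, dequantize_block(exp, q))
```

Under these assumptions MXFP4 costs 4.25 bits per weight, so a hypothetical all-MXFP4 trillion-parameter model would need about 0.53 TB for weights, against 1 TB at FP8 and 2 TB at FP16. Since each decoded token has to stream the active weights from memory, fewer bytes per weight translates fairly directly into higher decode throughput.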
