M*: A Modular, Extensible, Serving System for Multimodal Models

Atindra Jha, Naomi Sagan, Keisuke Kamahori, Xikai(Noah) Meng, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, Stephanie Wang

June 15, 2026

Stanford University · University of Washington · Correspondence: atindra@cs.stanford.edu

Today's models no longer fit the mold of autoregressive token generation, but the systems supporting LLM inference have not kept up. These models have composite architectures best captured by dataflow graphs, and requests are just walks on these graphs. M* is designed to fit this paradigm and to maximize flexibility and performance for current and future composite models. In our tests, M* achieves nearly 2.7x higher throughput than vLLM-Omni and 4x higher throughput than SGLang-Omni, while maintaining a lower real-time factor (RTF) than both, on a Qwen3-Omni text-to-speech (TTS) workload.
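To make the "requests are walks on a dataflow graph" framing concrete, here is a minimal sketch in Python. It is an illustration only, not M*'s API: the class `DataflowGraph`, the method `is_walk`, and the component names (`text_encoder`, `audio_encoder`, `llm`, `speech_decoder`, `vocoder`) are all hypothetical and chosen for readability. The sketch assumes a composite model can be described as a directed graph of components, with a self-loop standing in for repeated autoregressive decode steps.

```python
from dataclasses import dataclass, field


@dataclass
class DataflowGraph:
    """Directed graph whose nodes are model components (hypothetical names)."""

    edges: dict[str, set[str]] = field(default_factory=dict)

    def add_edge(self, src: str, dst: str) -> None:
        # Register both endpoints so every node appears as a key.
        self.edges.setdefault(src, set()).add(dst)
        self.edges.setdefault(dst, set())

    def is_walk(self, path: list[str]) -> bool:
        """Return True if each consecutive pair in `path` is an edge.

        A walk, unlike a simple path, may revisit nodes and edges.
        """
        if not path or path[0] not in self.edges:
            return False
        return all(b in self.edges.get(a, set()) for a, b in zip(path, path[1:]))


def build_example_graph() -> DataflowGraph:
    # Assumed topology for illustration; not taken from any real model card.
    g = DataflowGraph()
    g.add_edge("text_encoder", "llm")
    g.add_edge("audio_encoder", "llm")
    g.add_edge("llm", "llm")  # self-loop: repeated autoregressive decode steps
    g.add_edge("llm", "speech_decoder")
    g.add_edge("speech_decoder", "vocoder")
    return g


graph = build_example_graph()

# Two request shapes over the same graph, plus one that skips required stages.
tts_request = ["text_encoder", "llm", "llm", "llm", "speech_decoder", "vocoder"]
text_only_request = ["text_encoder", "llm", "llm"]
invalid_request = ["text_encoder", "vocoder"]

for name, walk in [("tts", tts_request),
                   ("text-only", text_only_request),
                   ("invalid", invalid_request)]:
    print(name, graph.is_walk(walk))
```

Under these assumptions, a text-to-speech request and a text-only request are simply different walks over one shared graph, which is the property a single-loop serving design cannot express directly.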

Inference is no longer a single loop

