🧠 Models: https://huggingface.co/collections/allenai/emo | 📄 Tech report: https://allenai.org/papers/emo | 💻 Code: https://github.com/allenai/EMO | 📊 Visualization: https://emovisualization.netlify.app/
Today we're releasing EMO, a new mixture-of-experts (MoE) model pretrained end-to-end so that modular structure emerges directly from the data, without relying on human-defined priors. EMO lets you run a small subset of its experts - just 12.5% of the total - for a given task while retaining near-full-model performance, and it still works as a strong general-purpose model when all experts are used together.
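To make the expert-subsetting idea concrete, here is a minimal sketch of how restricting a standard top-k MoE router to an allowed subset of experts can work in general. This is not EMO's actual API; the function name, tensor shapes, and the 64-expert / 8-expert split below are illustrative assumptions only.

```python
# Illustrative sketch (not EMO's implementation): confine a standard
# top-k MoE router to an allowed subset of experts by masking the
# router logits of all other experts before selection.
import torch

def route_to_subset(router_logits, allowed_experts, k=2):
    """Top-k routing restricted to an allowed expert subset.

    router_logits: (num_tokens, num_experts) scores from the router.
    allowed_experts: indices of the experts kept for this task,
        e.g. 12.5% of the total.
    Returns per-token expert indices and renormalized routing weights.
    """
    num_experts = router_logits.size(-1)
    # Mask out every expert outside the allowed subset.
    mask = torch.full((num_experts,), float("-inf"))
    mask[allowed_experts] = 0.0
    masked_logits = router_logits + mask
    # Standard top-k selection, now confined to the subset.
    weights, indices = torch.topk(masked_logits, k, dim=-1)
    weights = torch.softmax(weights, dim=-1)  # renormalize over the chosen experts
    return indices, weights

# Example: 64 experts total, keep 8 (12.5%) for a task-specific deployment.
logits = torch.randn(4, 64)       # 4 tokens, 64 experts (hypothetical sizes)
subset = torch.arange(8)          # hypothetical task-specific expert subset
idx, w = route_to_subset(logits, subset, k=2)
```

Because the unused experts are never selected, their parameters never need to be loaded or served, which is where the memory and compute savings of running a subset come from.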
Large language models are typically trained and deployed as monolithic systems: a single model is initialized, pretrained, fine-tuned, and served as one unified entity. But applications often need only a subset of capabilities, such as code generation, mathematical reasoning, or domain-specific knowledge. As frontier language models routinely reach trillions of parameters, using and adapting the full model becomes impractical for most users, incurring unnecessary compute and memory costs to host parameters that may never be needed.