MolmoMotion: Language-guided 3D motion forecasting


🧠 Models: https://huggingface.co/collections/allenai/molmomotion | 📄 Tech Report: https://allenai.org/papers/molmomotion | 📊 Data: https://huggingface.co/datasets/allenai/molmo-motion-1m | 💻 Code: https://github.com/allenai/molmo-motion.git | 🌐 Project Page: https://molmomotion.github.io/

Machines have become remarkably good at perceiving motion. Given a video, modern models can track how objects and points move through a scene with exceptionally high confidence. But perception is inherently retrospective: it explains motion that has already happened. Many of the systems and applications we want to build need to look forward instead. A robot reaching for a cup has to anticipate how the cup will move before it touches it. A video generator has to know what realistic motion comes next if it's going to produce physically plausible frames.

Predicting motion is harder than observing it, but it's also far more useful in many scenarios.

This idea motivated MolmoMotion, a new motion forecasting model we're releasing today. Given a video frame, 3D points marked on an object, and a written instruction describing the intended action (e.g., “Move and rotate the wooden bowl with fruit on the table”), MolmoMotion predicts where those points will move in 3D space over the next few seconds, and it does so with substantially stronger performance than existing forecasting methods.
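To make the task concrete, here is a minimal sketch of the inputs and outputs described above. It is not the released MolmoMotion API: the `ForecastQuery` container, the `static_baseline` function, the file name, and the point coordinates are all hypothetical, chosen only to show the shape of the problem (one future 3D trajectory per marked point).

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class ForecastQuery:
    """Hypothetical container for one forecasting request (not the released API)."""
    frame_path: str        # path to the conditioning video frame
    points: List[Point3D]  # 3D points marked on the object
    instruction: str       # written description of the intended action


def static_baseline(query: ForecastQuery, num_steps: int) -> List[List[Point3D]]:
    """Placeholder forecaster: assumes every point stays where it is.

    Returns one trajectory per query point, each holding num_steps positions.
    A learned model would instead condition on the frame and the instruction.
    """
    return [[p] * num_steps for p in query.points]


query = ForecastQuery(
    frame_path="frame_000.png",  # hypothetical file name
    points=[(0.10, 0.25, 0.80), (0.12, 0.27, 0.81)],  # made-up coordinates
    instruction="Move and rotate the wooden bowl with fruit on the table",
)

trajectories = static_baseline(query, num_steps=8)
print(len(trajectories), len(trajectories[0]))  # prints: 2 8
```

The "nothing moves" baseline is deliberately trivial. It only fixes the output format that a forecaster has to fill in, which is the part the paragraph above specifies.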


Related reading

Molmo learns to point and act | Ai2

MolmoPoint: Better pointing architecture for vision-language models | Ai2

Ai2’s MolmoAct model ‘thinks in 3D’ to challenge Nvidia and Google in robotics…

MolmoBot: Training robot manipulation entirely in simulation | Ai2

MolmoAct 2: An open foundation for robots that work in the real world | Ai2

Expanding the Alpamayo Open Platform for Developing Reasoning AVs Across…
