Nvidia just dropped what it’s calling the first open omni-model built specifically for physical AI. Cosmos 3, unveiled on May 31, integrates reasoning, world generation, and action capabilities into a single system designed to help robots and autonomous vehicles actually understand the messy, unpredictable real world.
Cosmos 3 can generate predictive video sequences of up to 30 seconds based on text, image, or video inputs, essentially letting a robot “imagine” what happens next in its environment before it moves a single actuator.
What Cosmos 3 actually does
Cosmos 3 uses what Nvidia calls a Mixture of Transformers architecture to process multiple types of input simultaneously. The model supports sound and action modalities, meaning a robot equipped with Cosmos 3 can process what it sees, hears, and does in a unified framework.
The practical application centers on something called robot policy learning. Cosmos 3 serves as a backbone for what Nvidia terms World Action Models, or WAMs, which allow embodied agents to operate across environments they’ve never encountered before.











