The new, open NVIDIA world foundation model brings vision reasoning, multimodal generation and action prediction together to help robots, autonomous vehicles and vision AI agents think before acting in the real world.
The real world is always in motion. To operate autonomously, physical AI systems — including robots, autonomous vehicles (AVs) and smart spaces — need to understand not just what they see and what caused that to happen, but what’s likely to happen next.
In a warehouse, a robot may encounter object configurations it’s never seen before. On the road, an AV may need to respond when a pedestrian steps out from between parked cars. And in a factory, a safety system must predict where a forklift is heading, not just detect that it’s there.
Capturing and recreating those scenarios in the real world is slow, expensive and often impossible to repeat at scale.
NVIDIA Cosmos 3 is built for that loop. The new world foundation model — announced today at NVIDIA GTC Taipei at COMPUTEX — combines vision reasoning and multimodal generation across text, video, images, ambient sound and action in a single model to help developers create world data with physical context.








