Google DeepMind just dropped what might be the most capable video generation model yet. Gemini Omni, unveiled at Google I/O on May 19-20, 2026, accepts text, images, audio, and video as inputs and spits out short video clips, roughly 10 seconds long, complete with synchronized audio.
The model’s first variant, Gemini Omni Flash, is the tip of the spear. It replaces Google’s earlier Veo model inside the Gemini app, marking a shift from standalone video generation toward what Google is calling “anything from anything” creation.
What Gemini Omni actually does
Early demonstrations showed effective text rendering within video, along with advanced scene editing capabilities.
Google is emphasizing improvements in world understanding, physics simulation, and character consistency. The company drew comparisons to its Nano Banana image model, which earned praise for visual fidelity. Gemini Omni extends that same approach into motion and sound, wrapping everything in a conversational interface where users can iteratively edit and refine their clips.