The first model in DeepMind’s new Omni family will generate and edit video from any combination of image, audio, video, and text inputs. Speech editing is being withheld; SynthID watermarking is on by default.

Google introduced Gemini Omni, a new multimodal model family from Google DeepMind, on Tuesday at its I/O 2026 developer conference. The family is designed to generate and edit video from any combination of image, audio, video, and text inputs.

The first model in the family, Gemini Omni Flash, started rolling out the same day to the Gemini app and Google Flow for Google AI Plus, Pro, and Ultra subscribers, and to YouTube Shorts and the YouTube Create app at no cost. API access for developers and enterprise customers will follow in the coming weeks.

Koray Kavukcuoglu, CTO of Google DeepMind and Chief AI Architect at Google, framed the product this way: Omni ‘combines images, audio, video, and text as input and generates high-quality videos grounded in Gemini’s real-world knowledge.’ Inputs can be mixed in a single prompt.

Edits are made conversationally, with each instruction building on the previous one, so that characters, physics, and scene context persist across turns. Output modalities beyond video, including image and audio generation, are ‘coming in time,’ Kavukcuoglu wrote on the company’s blog.