I spent a weekend wiring Google's Gemini and Veo APIs into a single app just to feel where the edges of multimodal AI actually are. It turned into a small studio I now use daily, and along the way I learned more about these models from plumbing them than from any paper. Here's the honest technical debrief.

Three pipelines, three completely different problems

I wanted one prompt box that could do video, image editing, and document Q&A. Naively I assumed they'd share most of the stack. They don't.

1. Image-to-video: the enemy is time, not pixels

Generating one good frame is solved. Video is about temporal coherence — frame 13 must agree with frame 12 or you get flicker and identity drift. Modern video models treat the clip as one object in space and time (latent diffusion over a width x height x time volume, with spatiotemporal attention) rather than 120 independent images. Conditioning on a reference image as the first frame is what makes image-to-video feel controlled: you've handed the model a strong anchor and asked it to extrapolate motion, not invent a world.