xAI's Grok Imagine 1.5 lands as an image-to-video model with a deliberate trade-off: tighter fidelity to your source frame, new audio prompting, and a hard requirement that you bring a still image to start. Here's what actually changed from the original, and the one constraint that reshapes how you build with it.
1.5 vs the original: fidelity, audio, and the image-only constraint
Grok Imagine 1.5 is xAI's image-to-video model, announced on June 3, 2026 and exposed through the API as grok-imagine-video-1.5-preview (alias grok-imagine-video-1.5-2026-05-30). It takes a still starting frame plus a motion prompt and preserves the source image's lighting, composition, and subject identity more faithfully than a pure text reinterpretation. Practically, that means your prompt only drives what changes (a camera push-in, drifting embers, a product rotation), not what the subject looks like.
Audio prompting is new. xAI advises describing sound design, ambient room tone, and pacing in the same prompt as camera motion, and the model is benchmarked in audio-enabled tracks.
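To make the "one prompt for motion and sound" advice concrete, here is a minimal sketch of a prompt builder. The ShotPrompt class, its field names, and the "Sound design:" / "Ambience:" / "Pacing:" labels are illustrative assumptions, not an xAI convention; the only point is that the audio cues travel in the same string as the camera motion.

```python
from dataclasses import dataclass


@dataclass
class ShotPrompt:
    """Hypothetical helper: bundles motion and audio cues into one prompt string."""

    motion: str             # what changes on screen, e.g. a camera move
    sound_design: str = ""  # foreground sounds (optional)
    ambience: str = ""      # ambient room tone (optional)
    pacing: str = ""        # tempo of the shot (optional)

    def render(self) -> str:
        if not self.motion.strip():
            raise ValueError("a motion description is required")
        parts = [self.motion.strip()]
        # Labels below are an assumed style, not a documented prompt syntax.
        for label, value in (
            ("Sound design", self.sound_design),
            ("Ambience", self.ambience),
            ("Pacing", self.pacing),
        ):
            if value.strip():
                parts.append(f"{label}: {value.strip()}")
        # End every part with a period so the cues read as separate sentences.
        return " ".join(p if p.endswith(".") else p + "." for p in parts)


prompt = ShotPrompt(
    motion="Slow push-in on the lantern",
    ambience="quiet room tone",
    pacing="unhurried",
).render()
print(prompt)
```

Keeping the cues in separate fields makes it easy to vary the audio description while holding the camera move fixed, which is useful when comparing generations.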
The key constraint: grok-imagine-video-1.5-preview does not support text-to-video. You must supply or generate a starting image first. For text-to-video, extension, or editing, the standard grok-imagine-video model remains the option. Output tops out at 720p and 15 seconds, with 5–8 seconds noted as the sweet spot for motion stability.
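The routing rule above can be sketched as a small pre-flight check that runs before any request is sent. This is a sketch under stated assumptions: plan_job and the keys of the dictionary it returns are hypothetical and do not mirror the real request schema; only the two model identifiers and the 15-second and 5–8 second figures come from the text. The standard model's own limits are not stated here, so the sketch applies the cap only to the 1.5 model.

```python
from typing import Optional

I2V_MODEL = "grok-imagine-video-1.5-preview"  # image-to-video only
STANDARD_MODEL = "grok-imagine-video"         # text-to-video, extension, editing
MAX_SECONDS_1_5 = 15                          # stated ceiling for 1.5 output
SWEET_SPOT = (5, 8)                           # stated range for stable motion


def plan_job(prompt: str, start_image: Optional[str] = None, seconds: int = 6) -> dict:
    """Pick a model and sanity-check duration. Returns a hypothetical job plan."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if start_image:
        # A starting frame is available, so the 1.5 model is usable.
        if seconds > MAX_SECONDS_1_5:
            raise ValueError(f"1.5 output tops out at {MAX_SECONDS_1_5} seconds")
        model = I2V_MODEL
    else:
        # No starting frame: 1.5 cannot do text-to-video, so fall back.
        model = STANDARD_MODEL
    return {
        "model": model,
        "prompt": prompt,
        "start_image": start_image,
        "seconds": seconds,
        "in_sweet_spot": SWEET_SPOT[0] <= seconds <= SWEET_SPOT[1],
    }


print(plan_job("Slow push-in, embers drifting", start_image="frame.png"))
print(plan_job("A fox runs through snow"))
```

Falling back to the standard model is one design choice; a pipeline could instead generate a still first and then hand it to the 1.5 model, as the text suggests.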










