How to prompt Gemini 3.1's new text to speech model

Gemini 3.1 Flash text to speech (TTS) is a new model that you can direct to get the precise audio performance you want. In this blog post I'll share some tips on how to guide the model with prompts, along with some examples of its strengths.

Out of the box, gemini-3.1-flash-tts-preview interprets a transcript and determines how your words should be delivered, so simple transcripts sound natural without any additional prompting. But 3.1 Flash TTS also comes with tools you can use to steer it.

You can give the model plenty of context, such as an audio profile: who is speaking, how they are speaking, what their voice sounds like, and so on. You can also describe the scene (where they are, what they are doing, the environment) and provide any extra "director's notes" to guide the performance. The model will use that information to generate speech that sounds right for that context.
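
To make that concrete, here is a minimal sketch of how you might assemble those pieces into a single prompt string. The PerformanceBrief class, the build_prompt helper, and the section labels are my own illustrative choices, not a format the model requires, and the sample text is made up.

```python
from dataclasses import dataclass


@dataclass
class PerformanceBrief:
    """Hypothetical container for the context you want to give the model."""
    audio_profile: str    # who is speaking and what their voice sounds like
    scene: str            # where they are, what they are doing, the environment
    directors_notes: str  # any extra guidance for the performance
    transcript: str       # the words to be spoken


def build_prompt(brief: PerformanceBrief) -> str:
    # The section labels below are an assumption, chosen for readability only.
    sections = [
        ("Audio profile", brief.audio_profile),
        ("Scene", brief.scene),
        ("Director's notes", brief.directors_notes),
        ("Transcript", brief.transcript),
    ]
    # Skip empty sections so a bare transcript still yields a clean prompt.
    return "\n\n".join(
        f"{title}:\n{body.strip()}" for title, body in sections if body.strip()
    )


brief = PerformanceBrief(
    audio_profile="A calm late-night radio host with a warm, low voice.",
    scene="Alone in a small studio at 2 a.m., rain against the window.",
    directors_notes="Unhurried pacing, with a smile in the voice.",
    transcript="Welcome back. It's just you and me for the next hour.",
)
prompt = build_prompt(brief)
print(prompt)
```

The resulting string is what you would send as the text input to the model. Keeping the pieces in separate fields makes it easy to change the scene or the notes while leaving the transcript untouched.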

You can now also use tags, inline modifiers like [whispers] or [laughs], to control the delivery of specific parts of the transcript. Use them to change the tone, pace, and emotional vibe of a line or section, or to add interjections and a few other non-verbal sounds to the performance, like [cough], [sighs] or [gasp].
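
If you write transcripts with many tags, a small helper can show you which tags a transcript uses and what the spoken text looks like without them. This is only a sketch: the square-bracket syntax comes from the examples above, but the pattern (lowercase letters and spaces only) and the find_tags and strip_tags names are my assumptions, and nothing here checks whether the model supports a given tag.

```python
import re

# Assumed tag shape: lowercase words inside square brackets, e.g. [whispers].
TAG_PATTERN = re.compile(r"\[([a-z][a-z ]*)\]")


def find_tags(transcript: str) -> list[str]:
    """Return the tag names in the order they appear."""
    return TAG_PATTERN.findall(transcript)


def strip_tags(transcript: str) -> str:
    """Return the transcript with tags removed and whitespace collapsed."""
    without_tags = TAG_PATTERN.sub("", transcript)
    return re.sub(r"\s+", " ", without_tags).strip()


transcript = (
    "[whispers] I think someone is here. "
    "[gasp] Did you hear that? "
    "[laughs] Never mind."
)
print(find_tags(transcript))   # ['whispers', 'gasp', 'laughs']
print(strip_tags(transcript))  # I think someone is here. Did you hear that? Never mind.
```

A check like this is handy for spotting typos such as an unclosed bracket, since a malformed tag will not match the pattern and will remain in the stripped text.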
