Gemini 3.1 Flash text-to-speech (TTS) is a new model that you can direct to get the precise audio performance you want. In this blog post, I'll share some tips on guiding the model with prompts, along with a few examples of its strengths.
Out of the box, gemini-3.1-flash-tts-preview natively interprets a transcript and determines how your words should be delivered. Simple transcripts sound natural without any additional prompting. But 3.1 Flash TTS also comes with tools you can use to steer it.
You can give the model plenty of context, such as an audio profile: who is speaking, how they are speaking, what their voice sounds like, and so on. You can also describe the scene (where the speaker is, what they are doing, what the environment is like) and provide any extra "director's notes" to guide the performance. The model uses that information to generate speech that sounds right for that context.
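To make the idea of layered context concrete, here is a minimal Python sketch that assembles an audio profile, a scene, director's notes, and a transcript into a single prompt string. The `PerformanceBrief` class and the section labels ("AUDIO PROFILE", "SCENE", and so on) are my own illustrative assumptions, not a format the model requires; the sketch only builds text and does not call any API.

```python
from dataclasses import dataclass


@dataclass
class PerformanceBrief:
    """Hypothetical container for the context you hand to the TTS model."""

    transcript: str
    audio_profile: str = ""
    scene: str = ""
    directors_notes: str = ""

    def to_prompt(self) -> str:
        # Section labels are an assumption for readability, not a required format.
        sections = [
            ("AUDIO PROFILE", self.audio_profile),
            ("SCENE", self.scene),
            ("DIRECTOR'S NOTES", self.directors_notes),
            ("TRANSCRIPT", self.transcript),
        ]
        # Leave out any section that was not filled in.
        return "\n\n".join(
            f"{title}:\n{body.strip()}" for title, body in sections if body.strip()
        )


brief = PerformanceBrief(
    transcript="Good evening, and welcome back.",
    audio_profile="A warm, unhurried radio host in her fifties.",
    scene="A late-night studio, rain against the window.",
)
prompt = brief.to_prompt()
print(prompt)
```

Keeping each kind of context in its own field makes it easy to change one thing at a time, for example swapping the scene while holding the audio profile fixed, and to hear what each change does to the performance.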
You can now also use tags to control the delivery of specific parts of the transcript. Tags are inline modifiers like [whispers] or [laughs] that give you granular control over the tone, pace, and emotional vibe of a line or section of the transcript. You can also use them to add interjections and a few other non-verbal sounds to the performance, like [cough], [sighs] or [gasp].
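Because tags are just bracketed text inside the transcript, it can help to check them before sending anything to the model. The Python sketch below is a small, hypothetical helper that lists the tags in a transcript, flags any that are not among the ones named in this post, and strips tags to recover the plain spoken text. The `KNOWN_TAGS` set is an assumption drawn only from the examples above; the model may well accept other tags.

```python
import re

# Tags mentioned in this post. This is NOT an official or complete list.
KNOWN_TAGS = {"whispers", "laughs", "cough", "sighs", "gasp"}

# Matches a single bracketed tag such as [whispers].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_tags(transcript: str) -> list[str]:
    """Return every inline tag in order of appearance, lowercased."""
    return [m.group(1).strip().lower() for m in TAG_PATTERN.finditer(transcript)]


def unknown_tags(transcript: str) -> list[str]:
    """Return tags that are not in KNOWN_TAGS, e.g. to catch typos."""
    return [tag for tag in find_tags(transcript) if tag not in KNOWN_TAGS]


def strip_tags(transcript: str) -> str:
    """Remove all tags and collapse leftover whitespace."""
    without_tags = TAG_PATTERN.sub("", transcript)
    return re.sub(r"\s+", " ", without_tags).strip()


transcript = "[whispers] Don't wake the baby. [sighs] Too late. [laughs]"
print(find_tags(transcript))
print(unknown_tags("[whispres] Don't wake the baby."))
print(strip_tags(transcript))
```

A check like this is mostly useful for catching misspelled tags in long scripts, and the stripped text doubles as a plain-text caption for the generated audio.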








