When boto3 doesn't have it (yet), you write it: a realtime speech-to-speech story in Python

During the networking session at a meetup, someone mentioned: "the new speech-to-speech feature in Teams is really cool". Microsoft Teams had added the interpreter agent, with realtime AI-powered speech-to-speech translation during calls. So the natural question: how complicated is it to build one with AWS? And what performance does it deliver?

Meanwhile, for PyCon IT 2026, with inclusivity as a goal, the plan was already to use bilardi/realtime-transcription with a monitor in the room showing the talk transcript. But wouldn't it be handier if each attendee had the translated transcript directly on their own phone, and maybe the audio in their own language too, without installing anything, of course?

And so bilardi/realtime-speech-to-speech was born, ready to use at any conference or meetup. Under the hood, three AWS services are chained together: Transcribe Streaming for Automatic Speech Recognition (ASR), from audio to text; Translate for the translation; and Polly bidirectional streaming for Text-to-Speech (TTS), from text to audio. Architecture, costs and usage live in the repo; here, instead, I walk through the choices and what went sideways along the way.
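To make the three-stage chain concrete, here is a minimal sketch of its shape in plain Python. It is not the code from the repo: the class name `SpeechToSpeech`, the three stage signatures and the `fake_*` stand-ins are all assumptions made for illustration, and no AWS call is made. In a real setup each stage would wrap Transcribe Streaming, Translate and Polly respectively.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical stage signatures (assumptions, not the repo's API):
Asr = Callable[[Iterable[bytes]], Iterator[str]]  # audio chunks -> sentences
Translator = Callable[[str, str], str]            # (text, target language) -> text
Tts = Callable[[str, str], bytes]                 # (text, target language) -> audio


@dataclass
class SpeechToSpeech:
    """Chains ASR -> translation -> TTS, one recognized sentence at a time."""

    asr: Asr
    translate: Translator
    tts: Tts

    def run(
        self, audio_chunks: Iterable[bytes], target_lang: str
    ) -> Iterator[Tuple[str, bytes]]:
        # Each sentence is translated and synthesized as soon as the ASR
        # stage yields it, so output starts before the input has ended.
        for sentence in self.asr(audio_chunks):
            translated = self.translate(sentence, target_lang)
            yield translated, self.tts(translated, target_lang)


# Stand-in stages so the sketch runs offline, without AWS credentials.
def fake_asr(chunks: Iterable[bytes]) -> Iterator[str]:
    for chunk in chunks:
        yield chunk.decode("utf-8")  # pretend each chunk is one sentence


def fake_translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"  # tag the text instead of translating it


def fake_tts(text: str, target_lang: str) -> bytes:
    return text.encode("utf-8")  # pretend the bytes are audio


pipeline = SpeechToSpeech(fake_asr, fake_translate, fake_tts)
results = list(pipeline.run([b"hello", b"world"], "it"))
```

Keeping the stages as injected callables means each one can be swapped or tested on its own, which helps when one of the three services is the part that goes sideways.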

A stage PoC for multilingual meetups

There were three initial alternatives, from the simplest to the most complex.
