During the networking session at a meetup, someone remarked: "the new speech-to-speech feature in Teams is really cool". Microsoft Teams has added the Interpreter agent, which provides real-time, AI-powered speech-to-speech translation during calls. So the natural question: how complicated is it to build one with AWS? And how well does it perform?

Meanwhile, for PyCon IT 2026, with inclusivity as a goal, the plan was already to use bilardi/realtime-transcription with a monitor in the room showing the talk transcript. But wouldn't it be handier if each attendee had the translated transcript directly on their own phone, and maybe the audio in their own language too, without installing anything, of course?

And so bilardi/realtime-speech-to-speech was born, ready to use for any conference or meetup. Under the hood, three AWS services are chained together: Transcribe Streaming for Automatic Speech Recognition (ASR), from audio to text; Translate for the translation; and Polly bidirectional streaming for Text-to-Speech (TTS), from text to audio. Architecture, costs and usage live in the repo; here, instead, I cover the choices made and what went sideways along the way.
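To make the three-stage chain concrete, here is a minimal sketch of how ASR, translation and TTS can be composed as a generator. It is not the code from the repo: the function `speech_to_speech`, the stage signatures and the `fake_*` stand-ins are all hypothetical, and the stand-ins replace the real Transcribe, Translate and Polly calls so the example runs without AWS credentials.

```python
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical stage signatures, chosen only for this illustration.
Transcriber = Callable[[bytes], str]        # audio chunk -> source text (ASR)
Translator = Callable[[str, str], str]      # (text, target language) -> translated text
Synthesizer = Callable[[str, str], bytes]   # (text, target language) -> audio (TTS)


def speech_to_speech(
    chunks: Iterable[bytes],
    target_lang: str,
    transcribe: Transcriber,
    translate: Translator,
    synthesize: Synthesizer,
) -> Iterator[Tuple[str, bytes]]:
    """Chain ASR -> translation -> TTS, yielding (translated text, audio) pairs."""
    for chunk in chunks:
        text = transcribe(chunk)
        if not text:
            # Nothing recognized in this chunk (e.g. silence): skip it.
            continue
        translated = translate(text, target_lang)
        yield translated, synthesize(translated, target_lang)


# Stand-ins for the real services (assumptions, not AWS APIs).
def fake_transcribe(chunk: bytes) -> str:
    return chunk.decode("utf-8")


def fake_translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"


def fake_synthesize(text: str, target_lang: str) -> bytes:
    return text.encode("utf-8")


results = list(
    speech_to_speech(
        [b"hello", b"", b"world"],
        "it",
        fake_transcribe,
        fake_translate,
        fake_synthesize,
    )
)
for translated, audio in results:
    print(translated, len(audio))
```

Passing the stages in as callables keeps the chain itself independent of any one service, so each stand-in could in principle be swapped for a real client call without touching the loop.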

A stage PoC for multilingual meetups

There were three initial alternatives, from the simplest to the most complex.