Build real-time voice applications with Amazon SageMaker AI and vLLM

Voice agents, live captioning, contact center analytics, and accessibility tools all depend on real-time speech-to-text, where your application streams audio in and receives transcription back simultaneously over a single persistent connection. Traditional request-response inference falls short here because transcription cannot begin until the entire audio recording has been received, adding latency that breaks the real-time […]

Wednesday, May 20, 2026

Starting November 2025, you can stream data continuously in both directions between your clients and model containers using Amazon SageMaker AI bidirectional streaming for real-time inference. vLLM now lets you transcribe audio in real time through its Realtime API, where you use WebSockets for bidirectional streaming between client and server.
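To make the client side of that stream concrete, here is a minimal sketch of how raw audio might be sliced into small messages before being sent over a WebSocket. It assumes 16 kHz, 16-bit mono PCM input and a JSON event of the form `{"type": "input_audio_buffer.append", "audio": <base64>}`. Both the audio format and the event name are assumptions modeled on OpenAI-style realtime APIs, not something this post confirms, so check the vLLM Realtime API documentation for the exact schema. The helper `frame_audio` is a hypothetical name used only for illustration.

```python
import base64
import json

SAMPLE_RATE_HZ = 16_000   # assumed input sample rate
BYTES_PER_SAMPLE = 2      # assumed 16-bit PCM, mono


def frame_audio(pcm: bytes, chunk_ms: int = 100) -> list[str]:
    """Split raw PCM bytes into JSON text frames of roughly chunk_ms each."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    frames = []
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        frames.append(json.dumps({
            "type": "input_audio_buffer.append",  # assumed event name
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
    return frames


# One second of silence is split into ten 100 ms frames.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
frames = frame_audio(one_second)
```

Smaller chunks let transcription start sooner, at the cost of more messages per second; 100 ms here is an arbitrary illustrative value, not a recommendation from the post.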

In this post, we bring these two capabilities together. We show how to deploy Voxtral-Mini-4B-Realtime-2602, Mistral AI’s compact real-time speech model, to a SageMaker AI endpoint using a vLLM container with bidirectional streaming. The result is a fully managed speech-to-text service where audio flows in and transcription flows back in real time. You can follow along with the full example in the GitHub repository.
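Audio flowing in while transcription flows back means the client sends and receives concurrently rather than taking turns. The sketch below illustrates that pattern with `asyncio`, using in-memory queues and a stand-in coroutine, `fake_endpoint`, in place of a real SageMaker AI endpoint. None of the names here come from the SageMaker AI or vLLM APIs, and the "results" are placeholder strings, not real transcriptions.

```python
import asyncio


async def fake_endpoint(inbound: asyncio.Queue, outbound: asyncio.Queue) -> None:
    """Stand-in for the model container: emits one placeholder result per chunk."""
    count = 0
    while (chunk := await inbound.get()) is not None:
        count += 1
        await outbound.put(f"partial result {count} ({len(chunk)} bytes)")
    await outbound.put(None)  # signal end of stream


async def send_audio(chunks: list[bytes], inbound: asyncio.Queue) -> None:
    """Push audio chunks upstream, then signal that the audio has ended."""
    for chunk in chunks:
        await inbound.put(chunk)
        await asyncio.sleep(0)  # yield so other tasks can run between sends
    await inbound.put(None)


async def receive_text(outbound: asyncio.Queue) -> list[str]:
    """Collect results as they arrive, until the end-of-stream marker."""
    results = []
    while (msg := await outbound.get()) is not None:
        results.append(msg)
    return results


async def stream(chunks: list[bytes]) -> list[str]:
    inbound, outbound = asyncio.Queue(), asyncio.Queue()
    # The fake endpoint, the sender, and the receiver all run concurrently.
    _, _, results = await asyncio.gather(
        fake_endpoint(inbound, outbound),
        send_audio(chunks, inbound),
        receive_text(outbound),
    )
    return results


results = asyncio.run(stream([b"\x00" * 3200] * 3))
```

In a real client, the two queues would be replaced by the send and receive halves of the bidirectional stream; the separation into a sender task and a receiver task is the part that carries over.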

Related reading

Announcing the fastest inference for realtime voice AI agents

The "Zero-Latency" Deep Dive: Architecting Concurrent Voice AI in Python

Introducing Scribe v2 Realtime

Together AI Launches Speech-to-Text: High-Performance Whisper APIs

Real-time voice agents with Stream Vision Agents and Amazon Nova 2 Sonic |…

Build realtime voice agents on AI Gateway
