Voice interfaces are one of the hallmarks of a truly AI native application. From transcription to speech-to-code to outbound calling to custom podcasts, voice makes applications engaging and productive. But developers often have to piece together a number of specialized voice services to ship a single voice application. This tends to slow development while adding complexity, latency and cost.We're pleased to announce the addition of a greatly expanded set of high performance, low latency voice infrastructure to our cloud. We've worked hard to provide voice services that are frontier quality, developer friendly and very low latency.With these additions, we've expanded our voice offering from transcription to a full set of building blocks that can power some or all of an application's voice pipeline. These services support real-time and batch patterns in developer-friendly serverless and dedicated form factors.Streaming speech-to-text for voice agentsStreaming WhisperTraditional batch transcription waits for complete audio files. Voice agents need to process speech as it arrives, and intelligently detect when users finish speaking.We've built the industry's fastest speech-to-text API by combining optimized model inference with intelligent system design — WebSocket streaming to eliminate connection overhead, carefully tuned voice activity detection (VAD), and purpose-built infrastructure for realtime audio processing. The result: Whisper running in real time with minimal quality degradation, completing transcripts up to 35% faster than alternatives.The key is optimizing for time-to-complete-transcript, not just time-to-first-token. Voice agents need to know precisely when a user stops speaking to begin formulating responses. Our VAD tuning ensures your agent responds at the right moment, not too early (cutting users off) or too late (creating dead air).
Announcing the fastest inference for realtime voice AI agents
Together AI launches the fastest voice AI stack: streaming Whisper STT, serverless open-source TTS (Orpheus & Kokoro), and Voxtral transcription. Sub-second latency for production voice agents.















