Real-time voice agents often fail when speech is treated as transcription rather than conversation. Getting the words right is only part of the challenge: the system also has to detect turn boundaries, handle interruptions and overlap, and respond quickly enough to keep the exchange feeling natural. When teams try to patch those gaps with endpointing logic, routing layers, and extra providers, they often add latency and operational fragility right back into the system. Deepgram’s models are purpose-built for that layer, where transcription, turn-taking, and responsiveness have to work together in real time.

Deepgram’s STT and TTS model lineup now runs natively on Together AI, the AI Native Cloud for building real-time voice agents, so teams can pair Deepgram transcription and synthesis with any LLM in the Together catalog and run the full voice pipeline on one production platform. For the broader architecture, see our real-time voice agents announcement.

“Voice agents live or die by latency, and every network hop between providers is a place where the experience breaks down. By hosting Deepgram’s STT and TTS natively on Together AI’s infrastructure, we’re giving developers production-grade transcription without the tradeoff. Fast, accurate, and co-located with the rest of the pipeline.”
- Abe Pursell, VP of Partnerships, Deepgram

Flux: Conversational STT with turn detection

Accurate transcription is only part of the job. A voice agent also has to know when the speaker is actually finished, because if it misreads the turn, it either talks over the caller or waits too long and feels unresponsive.

Flux is Deepgram’s conversational STT model for real-time agents, built not just to transcribe speech but to produce turn signals from conversational context rather than silence alone. That matters because many teams still rely on extra endpointing logic to bridge this gap, which adds complexity and makes latency harder to control.
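
To make the cost of hand-rolled endpointing concrete, here is a minimal Python sketch of a silence-only endpointer. It is not Deepgram's API and says nothing about how Flux works internally; the function name `silence_endpoint`, the 100 ms frame size, and both thresholds are hypothetical, chosen only to illustrate the tradeoff.

```python
from typing import Optional, Sequence


def silence_endpoint(
    voiced: Sequence[bool], frame_ms: int, silence_ms: int
) -> Optional[int]:
    """Hypothetical silence-only endpointer (illustration, not a real API).

    Returns the time in ms at which end of turn is declared, or None if
    the required run of silence never occurs after speech has started.
    """
    needed = silence_ms // frame_ms  # consecutive silent frames required
    run = 0
    heard_speech = False
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            heard_speech = True
            run = 0  # any speech resets the silence counter
        elif heard_speech:
            run += 1
            if run >= needed:
                return (i + 1) * frame_ms
    return None


# Illustrative audio in 100 ms frames: 600 ms of speech, a 400 ms
# mid-sentence pause, 600 ms more speech, then 1000 ms of trailing silence.
# The speaker actually finishes at 1600 ms.
frames = [True] * 6 + [False] * 4 + [True] * 6 + [False] * 10

eager = silence_endpoint(frames, frame_ms=100, silence_ms=300)
patient = silence_endpoint(frames, frame_ms=100, silence_ms=800)
print(eager, patient)
```

With the 300 ms threshold the endpointer fires at 900 ms, inside the speaker's mid-sentence pause, so the agent would talk over the caller. With the 800 ms threshold it survives the pause but waits until 2400 ms, a full 800 ms after the speaker finished. Silence alone cannot avoid both failures at once, which is the gap that context-based turn signals are meant to close.
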
Flux simplifies that part of the stack and helps keep turn-taking more predictable in production with 250ms end-of-turn detection.

Nova-3: Production transcription for real-world audio

Production audio is messier than benchmark audio. Calls come with background noise, overlapping speakers, accents, telephony compression, and interruptions, and the model still has to return text the rest of the pipeline can trust. Nova-3 is built for those conditions, with support for vocabulary customization so teams can improve recognition of domain-specific terms without retraining.

Nova-3 Multilingual extends that approach across multiple languages, which matters in deployments where callers switch languages mid-conversation.

Aura-2: Enterprise TTS for production voice agents

Aura-2 covers the synthesis side of the pipeline for business environments where clarity and consistency matter. Teams can use Deepgram STT and TTS together while keeping output stable for domain-specific terms and structured entities.

That difference shows up in delivery. The voice has to stay clear, direct, and reliable when it reads structured information or specialized language back to the user. A voice that sounds fine in a demo is not enough if it starts to stumble once the interaction becomes operational.