A voice agent can be correct and still feel broken. Customers judge it like a phone call: if it hesitates, sounds synthetic, or mispronounces a key term, trust collapses before they can evaluate reasoning. In production, that experience comes down to a real-time loop: STT (speech-to-text) models transcribe speech, the LLM decides what to say, and TTS (text-to-speech) speaks the response. At scale, teams stitch that loop across multiple vendors, so latency, reliability, observability, and ultimately what the customer hears become difficult to manage end-to-end.Starting today on Together AI, the AI Native Cloud, we're adding Rime Arcana v2 and Mist v2 to the Together Model Library, bringing proprietary TTS models into the same API, authentication, and observability surface you already use for LLM and speech workloads. Arcana v2 delivers expressive, conversational voices trained on real customer service interactions, with 40+ voices across multiple languages and regional dialects for quality-critical scenarios. Mist v2 brings deterministic pronunciation control to high-volume production environments, reaching about 225ms time-to-first-audio on Together AI dedicated endpoints—you define how a term sounds once via API and it renders consistently across all voices, flows, and channels. Both run as dedicated endpoints on a single cloud alongside your LLM and STT workloads, so your end-to-end voice stack operates on one production platform — instead of being split across multiple providers.