Last month, a 340ms latency spike in the TTS stage of our pipeline caused 12% of Loquent callers to talk over the AI mid-response. We didn't catch it for six hours because we were measuring the wrong thing: average latency instead of tail latency at each pipeline stage. That incident is why we built vox-bench, and why we're releasing it today.
Why we needed this
When you're building a voice AI agent that handles thousands of live phone calls per month — dental appointment bookings, patient intake, after-hours triage — latency isn't a nice-to-have metric. It's the difference between a conversation that feels human and one that feels like talking to a broken IVR.
Our Loquent pipeline has five stages: Twilio media stream ingestion, speech-to-text via Deepgram, LLM inference via Anthropic Claude (with OpenAI as fallback), text-to-speech via ElevenLabs, and audio streaming back through Twilio. Each stage adds time. The total round-trip — from the moment a caller stops speaking to the moment they hear the AI respond — needs to stay under 800ms to feel natural. Go above 1.2 seconds and callers start repeating themselves. Go above 1.8 seconds and they hang up.
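To make those thresholds concrete, here is a minimal Python sketch of how a round-trip time could be bucketed, and why a per-stage average can hide a spike that a tail percentile exposes. This is not vox-bench's API. The function names, the "degraded" label for the 800ms to 1.2s band (which the numbers above leave unnamed), and the sample data are all assumptions made up for illustration.

```python
import math
from statistics import mean

# Thresholds taken from the round-trip numbers quoted above (milliseconds).
NATURAL_MS = 800
REPEAT_MS = 1200
HANGUP_MS = 1800


def classify_round_trip(total_ms):
    """Bucket a caller-stops-speaking to AI-audio-starts time (hypothetical labels)."""
    if total_ms < NATURAL_MS:
        return "natural"
    if total_ms <= REPEAT_MS:
        return "degraded"  # assumed label: this band is not named in the post
    if total_ms <= HANGUP_MS:
        return "caller_repeats"
    return "caller_hangs_up"


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_stage(samples_ms):
    """Report the mean next to the p99 so a tail spike is visible."""
    return {"mean_ms": mean(samples_ms), "p99_ms": percentile(samples_ms, 99)}


# Made-up data: 95 TTS responses at 200ms and 5 that carry a 340ms spike.
tts_samples_ms = [200] * 95 + [540] * 5
tts_summary = summarize_stage(tts_samples_ms)
```

With this synthetic data the mean moves only from 200ms to 217ms, while the p99 sits at 540ms. A dashboard that plots only the average would show a barely visible bump, even though one call in twenty absorbed the full spike.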
We know these numbers because we tracked them across 10,000+ calls over six months of running Loquent in production. But for the first four months, we were tracking them wrong.









