Originally published on prodinit.com

Key Takeaways

Sub-300ms end-to-end latency is the human-conversation threshold for voice AI.

The latency budget breaks into four layers: STT (80–120ms), LLM first-token (150–250ms), TTS first-chunk (60–100ms), and network transport (20–60ms). Missing the target in any one layer pushes the total over 500ms.
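
To make the budget arithmetic concrete, here is a minimal Python sketch that sums the per-layer ranges quoted above and flags any layer that exceeds its upper bound. The layer names, the helper functions, and the measured values are hypothetical and purely illustrative. Only the millisecond ranges come from the text.

```python
# Per-layer latency budget in milliseconds as (low, high), using the ranges quoted above.
# The dictionary keys are illustrative names, not identifiers from any real system.
BUDGET_MS = {
    "stt": (80, 120),
    "llm_first_token": (150, 250),
    "tts_first_chunk": (60, 100),
    "network": (20, 60),
}


def total_range(budget):
    """Sum the low and high bounds across all layers."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def over_budget(measured, budget):
    """Return the names of layers whose measured latency exceeds the upper bound."""
    return [name for name, ms in measured.items() if ms > budget[name][1]]


best_case, worst_case = total_range(BUDGET_MS)
print(f"budget range: {best_case}-{worst_case} ms")

# Hypothetical measurements: every layer is on target except the LLM.
measured = {"stt": 100, "llm_first_token": 320, "tts_first_chunk": 80, "network": 40}
print("total:", sum(measured.values()), "ms; over budget:", over_budget(measured, BUDGET_MS))
```

With these made-up measurements, a single slow layer (LLM first-token at 320ms against a 250ms ceiling) is enough to bring the total to 540ms, even though the other three layers sit mid-range.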

WebRTC with ICE Trickle is the correct transport for browser and mobile clients. SIP is the right choice for PSTN integration and legacy telephony.