A founder pinged us with a UX problem disguised as an engineering question. His team had launched an AI chat product. Users were abandoning the conversation before the agent finished responding. The team had measured p95 response latency at 31 seconds. Their assumption was that they needed to switch to a faster model.
The actual model was responsible for about 35% of the total latency. The other 65% was sequential tool calls, unnecessary intermediate LLM steps, and a missing streaming layer. Switching to a smaller model would have cut maybe 5 seconds off the worst case while degrading response quality.
We made four changes that did not touch the model. P95 latency dropped from 31 seconds to 8 seconds. User abandonment rate dropped 70%. The model stayed the same.
This is the latency stack that most teams either do not measure or do not know how to optimize. Here is the pattern.
Where the time actually goes













