Cutting agent latency from 30s to 8s without model swap

A founder pinged us with a UX problem disguised as an engineering question. His team had launched an AI chat product. Users were abandoning the conversation before the agent finished responding. The team had measured p95 response latency at 31 seconds. Their assumption was that they needed to switch to a faster model.

The actual model was responsible for about 35% of the total latency. The other 65% was sequential tool calls, unnecessary intermediate LLM steps, and a missing streaming layer. Switching to a smaller model would have cut maybe 5 seconds off the worst case while degrading response quality.

We made four changes that did not touch the model. P95 latency dropped from 31 seconds to 8 seconds. User abandonment rate dropped 70%. The model stayed the same.

This is the latency stack that most teams either do not measure or do not know how to optimize. Here is the pattern.

Where the time actually goes

We made four changes that did not touch the model. P95 latency dropped from 31 seconds to 8 seconds. User abandonment rate dropped 70%. The model stayed the same.

This is the latency stack that most teams either do not measure or do not know how to optimize. Here is the pattern.

Where the time actually goes

Cutting agent latency from 30s to 8s without model swap

Cutting agent latency from 30s to 8s without model swap

Related reading

9 Practical Ways Senior ML Engineers Reduce Inference Latency

Feedback Latency Is the Agent's IQ

I Benchmarked 5 Voice AI Stacks. Only 2 Stayed Under 300ms.

When not to build an AI agent (and what to ship instead)

How I Built AegisDesk: A Zero-Token Semantic IT Agent with <5ms Latency

How Fast Should Your AI Voice Agent Respond?

Related reading

9 Practical Ways Senior ML Engineers Reduce Inference Latency

Feedback Latency Is the Agent's IQ

I Benchmarked 5 Voice AI Stacks. Only 2 Stayed Under 300ms.

When not to build an AI agent (and what to ship instead)

How I Built AegisDesk: A Zero-Token Semantic IT Agent with <5ms Latency

How Fast Should Your AI Voice Agent Respond?