Balancing speed (measured in tokens/second/user) against throughput (an AI server's total tokens/second) is one of the many challenges enterprises face when deploying AI agents in production in a cost-efficient, scalable manner.

GPUs enabled the first wave of AI, but they hit the "Agentic Wall": the point where they cannot sustain the per-request token speeds that complex reasoning loops need to support near real-time agentic use cases, especially on larger models like DeepSeek.

While general chat might feel "fast" at 20 tokens per second (t/s), just above human reading speed, AI agents require much higher velocities. This is because agents often operate in "test-time compute" paradigms, involving reasoning chains, tool-use loops, and multi-step reflection before generating an answer.
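
To make the gap concrete, here is a rough back-of-the-envelope sketch in Python. The step count and tokens per step are illustrative assumptions, not measurements, and the estimate counts only token generation time; it ignores prompt processing, tool execution, and network latency.

```python
def agent_latency_seconds(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Generation time for an agent loop whose steps run one after another."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return steps * tokens_per_step / tokens_per_second


# Hypothetical workload (assumed for illustration): 5 sequential
# reasoning/tool-use steps, each generating 400 tokens.
STEPS = 5
TOKENS_PER_STEP = 400

for speed in (20, 200):
    seconds = agent_latency_seconds(STEPS, TOKENS_PER_STEP, speed)
    print(f"{speed} t/s -> {seconds:.0f} s of generation time")
```

Under these assumed numbers, the same task takes 100 seconds of generation at 20 t/s and 10 seconds at 200 t/s, because every step in the loop has to finish before the next one starts.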

To meet enterprise usability requirements, infrastructure must deliver sustained speeds of 200+ tokens per second per request on larger models like DeepSeek for agentic planning. AI deployments therefore face a key infrastructure decision: how to balance the number of requests the AI hardware serves against the per-request speed that agents require.
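
The shape of this trade-off can be sketched with simple arithmetic. The snippet below assumes, as a deliberate simplification, that a server's aggregate throughput is a fixed budget shared evenly across concurrent requests. Real systems are not this linear, since batching changes aggregate throughput, and the 4,000 t/s figure is a made-up placeholder rather than a benchmark of any hardware.

```python
def max_concurrent_requests(aggregate_tps: float, per_request_tps: float) -> int:
    """Requests a server can serve at once if each needs per_request_tps.

    Simplifying assumption: aggregate throughput is a fixed budget that is
    divided evenly among concurrent requests.
    """
    if aggregate_tps < 0 or per_request_tps <= 0:
        raise ValueError("throughput values must be positive")
    return int(aggregate_tps // per_request_tps)


# Hypothetical server throughput, chosen only to illustrate the arithmetic.
AGGREGATE_TPS = 4000

for target in (20, 200):
    n = max_concurrent_requests(AGGREGATE_TPS, target)
    print(f"{target} t/s per request -> up to {n} concurrent requests")
```

In this simplified model, raising the per-request target from 20 t/s to 200 t/s cuts the number of requests served at once from 200 to 20, which is why per-request speed and total requests served have to be planned together.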

Architecture Designed for Agentic AI