Inference Speed or Throughput? With RDUs, You Don't Have to Choose - SambaNova

Balancing speed (as measured in tokens/second/user) and throughput (total tokens/second of an AI server) is one of the many challenges enterprises face in deploying AI agents in production in a cost-efficient, scalable manner.

While GPUs have enabled the first wave of AI, they end up hitting the "Agentic Wall" — where GPUs cannot sustain the token speeds per request required for complex reasoning loops to support near real-time agentic use cases, especially on larger models like DeepSeek.

While general chat might feel "fast" at 20 t/sv— just above human reading speed — AI agents require much higher velocities. This is because agents often operate in "test-time compute" paradigms, involving reasoning chains, tool-use loops, and multi-step reflection before generating an answer.

To meet enterprise usability requirements, infrastructure must at least deliver sustained speeds of 200+ tokens per second (t/s) per request with larger models like DeepSeek for agentic planning. For AI deployments, a key infrastructure decision needs to be made to balance the number of requests served from AI hardware versus the per request speed required for agents.

Architecture Designed for Agentic AI

Inference Speed or Throughput? With RDUs, You Don't Have to Choose - SambaNova

Inference Speed or Throughput? With RDUs, You Don't Have to Choose - SambaNova

Related reading

The First Disaggregated Inference Demo for AI Agents Is Live

Introducing the SN50 RDU: Purpose-Built for Agentic Inference

Solving the Infrastructure Crisis for AI Inference with Dataflow

Tensordyne claims 13x throughput improvement over Nvidia's NVL72 GB300 with new…

Solving the Decode Bottleneck: Why Agentic Inference Needs Hybrid Hardware

Benchmarking inference at scale: coding agents

Related reading

The First Disaggregated Inference Demo for AI Agents Is Live

Introducing the SN50 RDU: Purpose-Built for Agentic Inference

Solving the Infrastructure Crisis for AI Inference with Dataflow

Tensordyne claims 13x throughput improvement over Nvidia's NVL72 GB300 with new…

Solving the Decode Bottleneck: Why Agentic Inference Needs Hybrid Hardware

Benchmarking inference at scale: coding agents