Generative AI’s explosive first chapter was defined by humans sending requests and models responding. The agentic chapter is different.

Agents don’t follow a predetermined sequence of actions. They call tools, spawn sub-agents with different tasks and models, retain information in memory, manage their own context window, and decide for themselves when they’re finished. In doing so, these systems drive token consumption, context length, and latency requirements to extremes, which are exactly the pressures now shaping the NVIDIA extreme co-design stack and the NVIDIA Vera Rubin platform.
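To make that loop concrete, here is a minimal toy sketch in Python. It is an illustration under stated assumptions, not how any particular agent framework works: `ToyAgent`, `fake_model`, `fake_tool`, and `count_tokens` are hypothetical names, the "model" is a hard-coded stub, and tokens are approximated by counting whitespace-separated words. It shows only the shape of the behavior described above: the agent picks its own next action, appends tool results to its context, and decides when to stop, while every model call re-reads the context accumulated so far.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    # Crude proxy (assumption): one token per whitespace-separated word.
    return len(text.split())


@dataclass
class ToyAgent:
    max_steps: int = 8
    context: list[str] = field(default_factory=list)
    tokens_read: int = 0  # cumulative input tokens across all model calls

    def fake_model(self, step: int) -> str:
        # Stand-in for a real model call: it "reads" the full context each time.
        self.tokens_read += sum(count_tokens(m) for m in self.context)
        # Hypothetical policy: call a tool three times, then declare completion.
        return "FINISH" if step >= 3 else f"CALL search query-{step}"

    def fake_tool(self, action: str) -> str:
        # Stand-in tool that always returns a fixed, ten-word result.
        return "result " * 10

    def run(self, task: str) -> int:
        self.context.append(task)
        for step in range(self.max_steps):
            action = self.fake_model(step)
            self.context.append(action)
            if action == "FINISH":  # the agent, not the caller, decides to stop
                return step + 1
            self.context.append(self.fake_tool(action))
        return self.max_steps


agent = ToyAgent()
steps = agent.run("find three sources on agent serving costs")
final_context_tokens = sum(count_tokens(m) for m in agent.context)
print(steps, agent.tokens_read, final_context_tokens)
```

Even in this toy, the cumulative tokens read across model calls exceed the size of the final context, because each step re-reads everything that came before it. That compounding is what separates an agent loop from a single request and response.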

This post analyzes that evolution across three parts:

How agents consume tokens

Why their economics break under conventional serving