AI‑native organizations increasingly face scaling challenges as agentic AI workflows drive context windows to millions of tokens and models scale toward trillions of parameters. These systems rely on agentic long‑term memory for context that persists across turns, tools, and sessions so agents can build on prior reasoning instead of starting from scratch on every request.
As context windows grow, Key-Value (KV) cache capacity requirements grow in proportion to context length, while the compute required to recalculate that history grows much faster. This makes KV cache reuse and efficient storage essential for performance and efficiency.
This growth puts pressure on existing memory hierarchies. AI providers are forced to choose between scarce GPU high‑bandwidth memory (HBM) and general‑purpose storage tiers that are optimized for durability, data management, and protection rather than for serving ephemeral, AI-native KV cache. The result is higher power consumption, inflated cost per token, and expensive GPUs left underutilized.
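To make the capacity-versus-compute scaling concrete, here is a minimal back-of-the-envelope sketch. It assumes a standard transformer decoder with grouped-query attention. The model shape (80 layers, 64 query heads, 8 KV heads, head dimension 128, 2-byte values) is purely illustrative and does not describe any specific model or NVIDIA product. The FLOP estimate counts only the attention score and weighted-sum terms, ignoring projections, MLP layers, and causal masking, so treat it as a scaling illustration rather than a performance model.

```python
def kv_cache_bytes(num_tokens, num_layers=80, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Approximate KV cache size for one sequence (hypothetical model shape)."""
    # Factor of 2: one key vector and one value vector per token, per layer, per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def attention_recompute_flops(num_tokens, num_layers=80, num_heads=64, head_dim=128):
    """Rough FLOPs to recompute attention over the full history (hypothetical shape)."""
    # Q @ K^T and scores @ V each cost about 2 * n^2 * head_dim FLOPs per head.
    return 2 * 2 * num_layers * num_heads * head_dim * num_tokens ** 2


if __name__ == "__main__":
    for n in (100_000, 1_000_000):
        gib = kv_cache_bytes(n) / 2**30
        pflop = attention_recompute_flops(n) / 1e15
        print(f"{n:>9,} tokens: ~{gib:,.1f} GiB KV cache, ~{pflop:,.1f} PFLOP attention recompute")
```

Under these assumptions, a tenfold increase in context length multiplies the cache size by ten but the attention recompute cost by one hundred, which is why storing and reusing the cache becomes more attractive than recalculating it as contexts grow.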
The NVIDIA Vera Rubin platform enables organizations to scale every phase of AI, from pretraining through post-training and test-time scaling to real-time agentic inference. The platform organizes AI infrastructure into compute, networking, and storage racks that serve as configurable building blocks for AI factories.







