When AWS announces a new generation of OpenSearch Serverless aimed at agentic AI, the technical signal that matters is not in the press release. It is in the design implications most architects will discover too late: cold starts that break latency SLOs, OCU costs that explode under batch embedding workloads, and the illusion that 'serverless' eliminates the need for partition modeling and concurrency control. I have spent 16 years building financial systems on AWS infrastructure, and I know what happens when a promising architectural pattern meets the reality of a regulated environment. This article tears down the agentic RAG pattern from the ground up: the problem it solves, its internal anatomy, the numbers that matter, and, most importantly, when you should not use it.

The Real Problem: Why Classical RAG Breaks in Agentic Workflows

Classical RAG is a two-phase pattern: you retrieve the k most relevant documents via vector search, then inject them into an LLM's context for generation. It works well for static Q&A over a stable knowledge base. The problem surfaces when you add agency, that is, when the LLM iteratively decides which tools to call, which queries to reformulate, and how to compose the final answer from multiple heterogeneous sources.
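To make the two phases concrete, here is a minimal sketch of classical RAG in Python. It is an illustration under loud assumptions, not an OpenSearch integration: the embed function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector index, and generate is a hypothetical callable representing whatever LLM client you use.

```python
import math
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Phase 1: rank every document against the query and keep the top k."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str]) -> str:
    """Inject the retrieved documents into the context handed to the LLM."""
    context = "\n".join(f"- {doc}" for doc in docs)
    return f"Context:\n{context}\n\nQuestion: {query}"


def classical_rag(query: str, corpus: list[str],
                  generate: Callable[[str], str], k: int = 2) -> str:
    """Phase 2: exactly one retrieval, then exactly one generation call."""
    docs = retrieve(query, corpus, k)
    return generate(build_prompt(query, docs))
```

Note the shape of classical_rag: one retrieval, one generation, no loop. In an agentic workflow that single call becomes an iteration in which the LLM chooses the next query or tool on each turn, which is where the fixed two-phase structure stops fitting.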