A production AI assistant is not "an LLM with a prompt". It is a system that accepts intent, keeps state, decides when to retrieve or act, and exposes enough runtime detail to debug failures.

That systems-level view is what the AI Systems cluster explores when assistants move beyond a single model invocation.

OpenAI describes agents as applications that plan, call tools, collaborate, and keep enough state for multi-step work, while Anthropic frames the same problem as a managed harness that can work with files, run commands, access the web, and execute code securely.

The cleanest architecture splits responsibilities into five layers: LLM, Memory, Tooling, Routing, and Observability. That split matches the capabilities exposed by major provider APIs, by MCP, by self-hosted runtimes such as vLLM and llama.cpp, and by real assistant systems such as OpenClaw and Hermes.
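To make the five-layer split concrete, here is a minimal sketch in Python. Every name in it (`Observability`, `Memory`, `stub_llm`, `route`, `handle`) is hypothetical and chosen for illustration; none of them comes from a provider SDK, MCP, vLLM, llama.cpp, OpenClaw, or Hermes. The model call is a stub and the routing rule is a toy prefix match, so treat this as an outline of where each responsibility could live rather than a reference design.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Observability:
    """Observability layer: collects structured events for later debugging."""
    events: List[dict] = field(default_factory=list)

    def record(self, layer: str, **detail) -> None:
        self.events.append({"layer": layer, **detail})


@dataclass
class Memory:
    """Memory layer: conversation state kept outside the model."""
    turns: List[str] = field(default_factory=list)

    def remember(self, text: str) -> None:
        self.turns.append(text)


def stub_llm(prompt: str) -> str:
    """LLM layer: placeholder standing in for a real model call."""
    return f"[model reply to: {prompt}]"


def route(message: str, tools: Dict[str, Callable[[str], str]]) -> str:
    """Routing layer: return a tool name if the message starts with
    '<tool>:', otherwise 'llm'. A toy rule, assumed for illustration."""
    for name in tools:
        if message.startswith(name + ":"):
            return name
    return "llm"


def handle(
    message: str,
    memory: Memory,
    tools: Dict[str, Callable[[str], str]],
    obs: Observability,
    llm: Callable[[str], str] = stub_llm,
) -> str:
    """Run one turn through routing, then either tooling or the LLM."""
    memory.remember(f"user: {message}")
    target = route(message, tools)
    obs.record("routing", target=target)

    if target == "llm":
        reply = llm(message)
        obs.record("llm", prompt_chars=len(message))
    else:
        # Tooling layer: strip the '<tool>:' prefix and call the tool.
        argument = message.split(":", 1)[1].strip()
        reply = tools[target](argument)
        obs.record("tooling", tool=target, argument=argument)

    memory.remember(f"assistant: {reply}")
    return reply


# Usage: one tool call and one plain model turn.
tools = {"upper": str.upper}
memory, obs = Memory(), Observability()
print(handle("upper: hello", memory, tools, obs))
print(handle("what can you do?", memory, tools, obs))
print(obs.events)
```

Keeping the layers as separate objects means each one can be swapped independently: the stub can be replaced by a hosted or self-hosted model, and the event list by a real tracing backend, without touching the routing logic.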

Memory should be treated as more than "longer context". Retrieval systems turn external knowledge into explicit non-parametric memory, the same design space covered in depth by Retrieval-Augmented Generation (RAG). Both Anthropic's context guidance and the "Lost in the Middle" paper warn that merely cramming more tokens into context does not guarantee reliable recall.
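The difference between retrieval and "longer context" can be shown with a small sketch. It ranks stored passages against the query and places only the top few in the prompt instead of concatenating everything. The function names (`tokenize`, `retrieve`, `build_prompt`) and the sample documents are hypothetical, and the scoring is plain word-overlap (Jaccard similarity), a stand-in assumed here for the embedding or hybrid search a production RAG system would more likely use.

```python
import re
from typing import List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return up to k (score, document) pairs ranked by Jaccard overlap
    with the query. Documents with no shared words are dropped."""
    query_words = tokenize(query)
    scored = []
    for doc in documents:
        doc_words = tokenize(doc)
        union = query_words | doc_words
        score = len(query_words & doc_words) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(query: str, documents: List[str], k: int = 2) -> str:
    """Build a prompt that contains only the retrieved passages,
    not the whole document store."""
    passages = [doc for _, doc in retrieve(query, documents, k)]
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Usage with made-up documents.
documents = [
    "The billing service retries failed payments three times.",
    "Office plants are watered on Fridays.",
    "Payments that fail are logged by the billing service.",
]
print(build_prompt("How does billing handle failed payments?", documents))
```

The point of the sketch is the budget: `k` caps how much external knowledge enters the context, so irrelevant passages never compete with relevant ones for the model's attention.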