It was 1 a.m. when a colleague messaged me: our customer service bot had amnesia again. “The user just told us the return address, and in the very next turn the bot asked ‘What would you like to return?’”

I opened the logs and started eyeballing memory variables across dozens of conversation turns. An hour later I finally nailed it: the k parameter in ConversationBufferWindowMemory was misconfigured, so the memory kept only the single most recent exchange. At that moment I thought: do we really have to test LLM memory by hand, chatting turn by turn? This can’t go on.

Breaking down the problem

Once an LLM-powered application has a Memory component, its behavior gets harder to reason about. Is memory being written at the right time? Is it keeping or forgetting information as expected? In multi-turn conversations, memory types such as summary, buffer, and entity memory can be stacked on top of each other, and a tiny misconfiguration can make the model forget what was just said.
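To make that kind of failure concrete, here is a minimal sketch of a sliding-window memory. The WindowMemory class and the replay helper are hypothetical stand-ins written in plain Python, not LangChain's implementation. They only mimic the idea of keeping the last k exchanges, and the conversation text is invented for illustration.

```python
from collections import deque


class WindowMemory:
    """Hypothetical stand-in for a sliding-window conversation memory.

    This is NOT LangChain's ConversationBufferWindowMemory; it only
    imitates the idea of retaining the k most recent exchanges.
    """

    def __init__(self, k):
        # A deque with maxlen=k silently drops the oldest exchange
        # once more than k exchanges have been saved.
        self.turns = deque(maxlen=k)

    def save_context(self, user_text, bot_text):
        self.turns.append((user_text, bot_text))

    def load_history(self):
        # Render the retained exchanges the way a prompt might see them.
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self.turns)


def replay(k):
    """Replay a two-turn return request and return what memory retains."""
    memory = WindowMemory(k)
    memory.save_context("I want to return my headphones.",
                        "Sure, where should we pick them up?")
    memory.save_context("12 Example Street.", "Got it.")
    return memory.load_history()


# With k=1 the first exchange (the item being returned) is already gone.
assert "headphones" not in replay(k=1)
# With k=2 both the item and the address are still available.
assert "headphones" in replay(k=2)
```

The two assertions at the end are the interesting part: they turn "did the bot forget?" into a check that can run without anyone chatting with the bot.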

Manual validation usually means opening a terminal, entering a few rounds of conversation, and reading the output of memory.load_memory_variables({}) by eye. Sometimes you even have to infer the memory state from the model’s replies. This approach has fatal flaws: