At 1 a.m., the user group chat exploded: "I just spent half an hour discussing project details with the AI, and then it asked 'Who are you?' Everything I'd said was wasted." I checked the backend: the session data was sitting right there in Redis, in the Memory Store, but after the 23rd turn the context log stopped as if it had been cut off. This was already the third "amnesia" incident this month. Until then we had tested long conversations by clicking through them manually, and before each release we got through 10 turns at most before losing patience. That night, I cursed at the screen: "If we don't automate this, it'll drive us insane sooner or later." So we made up our minds: build an end-to-end test suite with Playwright + pytest that simulates 100 turns and catches memory consistency bugs right in CI.

The Problem: Why Long-Conversation Memory Is a Blind Spot for Manual Testing

One of our AI product's core selling points is "remembering what you've said": you tell the AI in turn 3 that your name is "Lao Zhang" and that you like iced Americano, and by turn 50 it can still ask, "Lao Zhang, iced Americano as usual?" This experience relies on the underlying Memory Store (usually Redis or a vector DB) writing, reading, and updating the session context on every turn. The problem is that memory loss tends to show up in long-range dependencies: the context window fills up and gets truncated, the session TTL expires, serialization or deserialization fails, concurrent writes from multiple threads overwrite each other, and so on. Manual testing never reaches these scenarios. Can a QA engineer spend half an hour clicking through 50 turns by hand? Unrealistic. Manual tests cover the first 10 turns at most and then get marked "passed." The result? As soon as a real user has a longer chat, the bug surfaces, and we end up fixing it in the middle of the night, every time.
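To make the failure mode concrete, here is a minimal Python sketch of the kind of check a 10-turn manual test can never perform: plant a fact at turn 3, keep chatting, and probe for that fact at regular intervals. Everything in it is an assumption for illustration. FakeChatSession is a hypothetical in-memory stand-in that silently drops old messages, not our real backend, and the window size, the probe interval, and the function name first_forgotten_turn are made up.

```python
from collections import deque

FACT_PREFIX = "My name is "
PROBE = "What is my name?"


class FakeChatSession:
    """Hypothetical stand-in for the chat backend (not the real product API).

    It keeps only the last `window` messages, mimicking a context window
    that silently truncates older turns.
    """

    def __init__(self, window: int) -> None:
        self.history = deque(maxlen=window)

    def send(self, message: str) -> str:
        self.history.append(message)
        if message == PROBE:
            # Answer from whatever is still in the (possibly truncated) history.
            for past in self.history:
                if past.startswith(FACT_PREFIX):
                    return past[len(FACT_PREFIX):]
            return "Who are you?"
        return "ok"


def first_forgotten_turn(session, total_turns=100, probe_every=10):
    """Plant a fact at turn 3, probe it every `probe_every` turns.

    Returns the first turn at which recall fails, or None if the fact
    survives all `total_turns` turns.
    """
    for turn in range(1, total_turns + 1):
        if turn == 3:
            session.send(FACT_PREFIX + "Lao Zhang")
        elif turn % probe_every == 0:
            if session.send(PROBE) != "Lao Zhang":
                return turn
        else:
            session.send(f"filler message {turn}")
    return None


if __name__ == "__main__":
    print(first_forgotten_turn(FakeChatSession(window=20)))
    print(first_forgotten_turn(FakeChatSession(window=200)))
```

With the assumed 20-message window, the probes at turns 10 and 20 still pass and the first failure appears at turn 30, which is exactly the region a 10-turn manual test never visits. With a window larger than the conversation, the function returns None.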