Self-evolving retrieval lifts benchmark scores 25%

Agents that adapt their retrieval configuration at run time score roughly a quarter higher on established benchmarks: EvolveMem reports a 25.7% relative lift over the strongest static baseline [1]. The result challenges the long-standing assumption that retrieval stacks should be frozen after deployment. Instead, the system treats the whole memory-access pipeline as a mutable policy that can be improved on the fly. This shift opens a design space in which an LLM-driven “diagnosis” module rewrites the agent's search strategy as new queries arrive.

Before this work, LLM agents relied on a fixed retrieval infrastructure: scoring functions, fusion heuristics, and answer‑generation policies were hand‑tuned once and left unchanged for the life of the service. Researchers routinely built separate pipelines for data ingestion and for query execution, assuming that any performance gains had to come from larger models or richer corpora rather than from the retrieval logic itself. That static mindset limited the ability of agents to learn from their own failures in the field.

EvolveMem’s closed-loop process turns that limitation into an advantage, reaching a 25.7% relative improvement on LoCoMo and a 78.0% relative gain over a minimal baseline [1]. In each evolution round, the diagnosis LLM reads per-question failure logs, pinpoints root causes, and proposes concrete configuration changes; the meta-analyzer applies the changes and evaluates their impact, and the loop repeats until convergence. The same system also posts an 18.9% lift on the text-only MemBench benchmark, showing that the approach improves results even without bespoke engineering for that benchmark.
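The evolution round described above can be sketched as a small loop. This is a minimal illustration, not EvolveMem's actual code: `RetrievalConfig`, its fields, `evolve`, and the toy `toy_evaluate` and `toy_diagnose` functions are all hypothetical names, the diagnosis LLM is replaced by a trivial stand-in, and the rule of keeping a change only when the score improves is an assumption about what "until convergence" means.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class RetrievalConfig:
    """Hypothetical retrieval knobs; the real configuration space is not given here."""
    top_k: int = 5
    keyword_weight: float = 0.5


# evaluate(config) -> (score in [0, 1], failure-log entries for missed questions)
Evaluate = Callable[[RetrievalConfig], Tuple[float, List[str]]]
# diagnose(config, failures) -> proposed config (an LLM call in a real system)
Diagnose = Callable[[RetrievalConfig, List[str]], RetrievalConfig]


def evolve(
    config: RetrievalConfig,
    evaluate: Evaluate,
    diagnose: Diagnose,
    max_rounds: int = 10,
    min_gain: float = 1e-3,
) -> Tuple[RetrievalConfig, float]:
    """Run diagnose -> apply -> evaluate rounds until the score stops improving."""
    best_score, failures = evaluate(config)
    for _ in range(max_rounds):
        if not failures:
            break  # nothing left to diagnose
        candidate = diagnose(config, failures)
        score, candidate_failures = evaluate(candidate)
        if score - best_score < min_gain:
            break  # assumed convergence test: no meaningful gain this round
        config, best_score, failures = candidate, score, candidate_failures
    return config, best_score


def toy_evaluate(config: RetrievalConfig) -> Tuple[float, List[str]]:
    # Toy benchmark of 8 questions: question i is answered only if top_k > i.
    failures = [f"q{i}: evidence ranked below top_k" for i in range(config.top_k, 8)]
    return (8 - len(failures)) / 8, failures


def toy_diagnose(config: RetrievalConfig, failures: List[str]) -> RetrievalConfig:
    # Stand-in for the diagnosis LLM: always proposes retrieving one more item.
    return replace(config, top_k=config.top_k + 1)
```

With these toy stand-ins, `evolve(RetrievalConfig(), toy_evaluate, toy_diagnose)` widens `top_k` from 5 to 8 and stops once no failures remain. In a real system the two callables would wrap a benchmark harness and an LLM prompt, and the configuration would cover scoring, fusion, and answer-generation settings rather than two numbers.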

Why agents need memory that improves itself
