A structural retrieval engine with KNN voting shows a promising signal over basic RAG: AUC = 0.705, averaged over snapshots from step 60 onward on held-out trajectories, while basic RAG plateaus around 0.60 AUC at any cosine threshold, across coverage from 9% to 100%.

Results are preliminary: the evaluation covers a single repository, and the distinct-PR condition (covered in benchmark design) limits the eval set to 220 trajectories. As a result, the 95% bootstrap confidence intervals overlap, and the gap should be confirmed on larger datasets, though finding a sufficient public dataset is itself an open problem.
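To make the confidence-interval caveat concrete, here is a minimal sketch of a percentile bootstrap for AUC using only the standard library. It is not the article's evaluation code: the function names (`auc`, `bootstrap_auc_ci`, `overlaps`), the resample count, and the synthetic data are all assumptions made for illustration.

```python
import random

def auc(labels, scores):
    """Pairwise AUC: chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample trajectories with replacement, recompute AUC each time."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # a resample with a single class has no defined AUC
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    low = stats[int((alpha / 2) * n_resamples)]
    high = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

def overlaps(ci_a, ci_b):
    """True when two (low, high) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Synthetic stand-in data, NOT the article's results: 220 labels with noisy scores.
rng = random.Random(42)
labels = [rng.randint(0, 1) for _ in range(220)]
scores = [0.5 * y + rng.random() for y in labels]

point = auc(labels, scores)
low, high = bootstrap_auc_ci(labels, scores, n_resamples=500)
print(f"AUC = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With only a few hundred trajectories the interval is typically wide, which is why two methods whose point estimates differ can still have overlapping intervals under a check like `overlaps`.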

As the number of agents and trajectory databases grows, a new retrieval need emerges: finding trajectories that are structurally similar, rather than semantically similar by text content. The structure of a trajectory, meaning the sequence of decisions and outcomes independent of semantics, carries signals useful for monitoring, filtering for experience and reflection pipelines, evaluation, and other layers of agent infrastructure.

Our experiment shows that basic RAG loses this structural signal because it is designed for documents, not action sequences. We propose an approach where a trajectory is represented as a sequence of typed elements, and have implemented it in an open-source Python package, episodiq, described at the end of this article.
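As a rough illustration of the idea, the sketch below represents a trajectory as a sequence of typed elements, compares two trajectories by their type sequences only, and scores a query by a KNN vote. It is not episodiq's API: the `Element` class, the type labels, the use of `difflib.SequenceMatcher` as the similarity measure, and the toy labels are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass(frozen=True)
class Element:
    kind: str       # hypothetical type label, e.g. "edit", "run_tests", "test_fail"
    text: str = ""  # free-form content, ignored by the structural view

def signature(trajectory):
    """Structural view of a trajectory: the sequence of element types, with text dropped."""
    return tuple(e.kind for e in trajectory)

def structural_similarity(a, b):
    """Similarity in [0, 1] between two type sequences (assumed measure: SequenceMatcher ratio)."""
    return SequenceMatcher(None, signature(a), signature(b), autojunk=False).ratio()

def knn_vote(query, labeled, k=3):
    """Fraction of the k structurally nearest labeled trajectories whose label is 1."""
    ranked = sorted(labeled, key=lambda item: structural_similarity(query, item[0]), reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)

# Toy data: two trajectories with the same structure but unrelated text.
fix_parser = [Element("edit", "patch parser.py"), Element("run_tests"), Element("test_pass")]
fix_cache = [Element("edit", "rewrite cache layer"), Element("run_tests"), Element("test_pass")]
thrash = [Element("edit"), Element("run_tests"), Element("test_fail"),
          Element("edit"), Element("run_tests"), Element("test_fail")]

labeled = [(fix_cache, 1), (thrash, 0)]  # hypothetical labels: 1 = success, 0 = failure
same_structure = structural_similarity(fix_parser, fix_cache)
nearest_vote = knn_vote(fix_parser, labeled, k=1)
print(same_structure, nearest_vote)
```

The two "fix" trajectories share no text, so a text-embedding comparison could place them far apart, yet their type sequences are identical. That is the kind of signal a structural representation is meant to preserve.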