Structural retrieval shows promise over basic RAG for agent failure prediction on trajectory snapshots

A structural retrieval engine with KNN voting shows a promising signal over basic RAG. It reaches AUC = 0.705, averaged over snapshots from step 60 onward on held-out trajectories, while basic RAG plateaus around 0.60 AUC at any cosine threshold, across coverage levels from 9% to 100%.
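To make the evaluation concrete, here is a minimal sketch of KNN voting for failure prediction and of the AUC used to score it. It is an illustration under assumptions, not the engine's implementation: the `Neighbor` record, the numeric feature vectors, the Euclidean distance, and `k=3` are all hypothetical stand-ins for whatever structural representation and distance the retrieval engine actually uses.

```python
from collections import namedtuple

# Hypothetical record: a stored trajectory snapshot reduced to a feature
# vector plus its final outcome (1 = the run eventually failed).
Neighbor = namedtuple("Neighbor", ["features", "failed"])

def euclidean(a, b):
    """Stand-in distance; a structural engine would use its own metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_failure_score(query, index, k=3):
    """Fraction of the k nearest stored snapshots that ended in failure."""
    nearest = sorted(index, key=lambda n: euclidean(query, n.features))[:k]
    return sum(n.failed for n in nearest) / len(nearest)

def auc(scores, labels):
    """Probability that a random failed run outscores a random successful
    one, counting ties as half (the rank-based definition of ROC AUC)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy index: three successful and three failed snapshots.
index = [
    Neighbor((0.0, 0.0), 0), Neighbor((0.1, 0.2), 0), Neighbor((0.2, 0.1), 0),
    Neighbor((1.0, 1.0), 1), Neighbor((0.9, 1.1), 1), Neighbor((1.1, 0.8), 1),
]
queries = [(0.1, 0.1), (1.0, 0.9), (0.5, 0.4), (0.8, 0.9)]
labels = [0, 1, 0, 1]
scores = [knn_failure_score(q, index) for q in queries]
toy_auc = auc(scores, labels)
```

On real data the score would be computed per snapshot step and the AUC averaged over the steps of interest, as in the step-60-onward figure above.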

These results are preliminary. The single-repo setting and the distinct-PR condition (covered in benchmark design) limit the eval set to 220 trajectories, so the 95% bootstrap confidence intervals of the two methods overlap. The gap should be confirmed on larger datasets, though finding a sufficient public dataset is itself an open problem.
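For readers who want to reproduce this kind of interval, the sketch below computes a percentile bootstrap confidence interval for AUC by resampling trajectories with replacement. The function names, the resample count, and the seed are assumptions chosen for illustration; the article does not specify the exact bootstrap procedure used.

```python
import random

def auc(scores, labels):
    """Rank-based ROC AUC; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, labels, n_boot=1000, seed=0):
    """95% percentile bootstrap interval for AUC (hypothetical helper)."""
    if len(set(labels)) < 2:
        raise ValueError("AUC needs both failed and successful runs")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(labels)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples where AUC is undefined
        values.append(auc([scores[i] for i in idx], ys))
    values.sort()
    lo = values[int(0.025 * n_boot)]
    hi = values[int(0.975 * n_boot) - 1]
    return lo, hi

def intervals_overlap(ci_a, ci_b):
    """True when two (lo, hi) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
```

With only a couple of hundred trajectories the resampled AUC values spread widely, which is why two methods about 0.1 AUC apart can still produce overlapping intervals.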

As the number of agents and trajectory databases grows, a new retrieval need emerges: finding trajectories that are structurally similar, rather than semantically similar by text content. The structure of a trajectory, meaning the sequence of decisions and outcomes independent of semantics, carries signals useful for monitoring, filtering for experience / reflection pipelines, evaluation, and other layers of agent infrastructure.

Our experiment shows that basic RAG loses this structural signal because it is designed for documents, not action sequences. We propose an approach where a trajectory is represented as a sequence of typed elements, and have implemented it in an open-source Python package, episodiq, described at the end of this article.
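As a rough illustration of what "a sequence of typed elements" can mean, the sketch below reduces each step to a (kind, outcome) pair and compares two trajectories on those pairs alone, ignoring the text. Everything here is an assumption made for the example: the `Step` fields, the type names, and the use of `difflib.SequenceMatcher` as the similarity measure. It is not the episodiq API or its matching algorithm.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass(frozen=True)
class Step:
    kind: str       # hypothetical element type, e.g. "search", "edit", "test"
    outcome: str    # hypothetical outcome tag, e.g. "ok", "error"
    text: str = ""  # semantic payload; deliberately ignored below

def signature(trajectory):
    """Keep only the typed structure of each step, dropping its text."""
    return [(step.kind, step.outcome) for step in trajectory]

def structural_similarity(a, b):
    """Similarity in [0, 1] based on matching runs of typed elements."""
    matcher = SequenceMatcher(None, signature(a), signature(b), autojunk=False)
    return matcher.ratio()

# Same shape, unrelated text: a text embedding would see these as different.
fix_auth = [Step("search", "ok", "grep login handler"),
            Step("edit", "ok", "patch token check"),
            Step("test", "ok", "pytest tests/auth")]
fix_csv = [Step("search", "ok", "find csv parser"),
           Step("edit", "ok", "handle quoted commas"),
           Step("test", "ok", "pytest tests/io")]
# Different shape: repeated failing edits followed by a failing test.
stuck = [Step("search", "ok", "grep login handler"),
         Step("edit", "error", "patch token check"),
         Step("edit", "error", "patch token check again"),
         Step("test", "error", "pytest tests/auth")]

same_shape = structural_similarity(fix_auth, fix_csv)
different_shape = structural_similarity(fix_auth, stuck)
```

Here `fix_auth` and `fix_csv` match exactly on structure despite sharing no vocabulary, while `fix_auth` and `stuck` share text but diverge structurally, which is the distinction a text-only retriever tends to miss.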
