A month ago I started building a RAG system for question answering over financial documents. The first test got 2 of 20 questions right (10%). The latest test reached 57% accuracy on 100 queries, validated against human labels.
This post is about which improvements actually worked, which didn't, and the one finding that surprised me most.
The setup
The system answers questions about SEC filings (10-K, 10-Q, earnings reports) from 84 public companies. I evaluated it against FinanceBench by Patronus AI, which has 150 expert-annotated Q&A pairs with ground-truth answers.
Final stack: GPT-4o for generation, text-embedding-3-small for embeddings, Qdrant for vector storage (hybrid dense + BM25), LangGraph for orchestration (CRAG pipeline with document grading), BAAI/bge-reranker-base for reranking, and contextual retrieval with metadata prefixes on every chunk.
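To make the "metadata prefixes on every chunk" part concrete, here is a minimal sketch of the idea, not the code from my pipeline. The field names (company, form_type, fiscal_period), the bracketed prefix format, and the naive word-based splitter are all assumptions picked for illustration. The point is only that each chunk carries a short description of the filing it came from before it gets embedded and indexed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilingMeta:
    # Hypothetical fields, chosen for illustration only.
    company: str
    form_type: str      # e.g. "10-K" or "10-Q"
    fiscal_period: str  # e.g. "FY2022"


def with_metadata_prefix(chunk: str, meta: FilingMeta) -> str:
    """Prepend a one-line description of the source filing to a chunk."""
    prefix = f"[{meta.company} | {meta.form_type} | {meta.fiscal_period}]"
    return f"{prefix}\n{chunk.strip()}"


def split_into_chunks(text: str, max_words: int = 200) -> list[str]:
    """Naive fixed-size word splitter, here only to keep the sketch runnable."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def prepare_chunks(text: str, meta: FilingMeta, max_words: int = 200) -> list[str]:
    """Split a filing into chunks and prefix each one with its metadata."""
    return [
        with_metadata_prefix(chunk, meta)
        for chunk in split_into_chunks(text, max_words)
    ]


if __name__ == "__main__":
    meta = FilingMeta(company="ExampleCo", form_type="10-K", fiscal_period="FY2022")
    for chunk in prepare_chunks("Net revenue increased year over year. " * 3, meta):
        print(chunk)
```

The prefixed string is what would be embedded and stored, so a question that names a company and a fiscal year has matching text in the chunk itself even when the original passage never repeats those details.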






