From 10% to 57% Accuracy on FinanceBench: What Actually Moved the Needle

A month ago I started building a RAG system for financial document Q&A. First test: 2 out of 20 questions correct. Last test: 57% accuracy on 100 queries, validated against human labels.

This post is about which improvements actually worked, which didn't, and the one finding that surprised me most.

The setup

The system answers questions about SEC filings (10-K, 10-Q, earnings reports) from 84 public companies. It is evaluated against FinanceBench by Patronus AI: 150 expert-annotated Q&A pairs with ground-truth answers.

Final stack: GPT-4o for generation, text-embedding-3-small for embeddings, Qdrant for vector storage (hybrid dense + BM25), LangGraph for orchestration (CRAG pipeline with document grading), BAAI/bge-reranker-base for reranking, and contextual retrieval with metadata prefixes on every chunk.
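To make two pieces of that stack concrete, here is a minimal sketch of a metadata prefix on a chunk and of fusing a dense ranking with a BM25 ranking. It is an illustration under assumptions, not the code from this project: the `Chunk` fields, the prefix format, and the function names are hypothetical, and reciprocal rank fusion (RRF) is one common way to combine the two rankings, not necessarily the method used here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical metadata fields; the real schema is not given in the post.
    text: str
    company: str
    form_type: str
    fiscal_year: int


def with_metadata_prefix(chunk: Chunk) -> str:
    """Prepend filing metadata so the embedded text carries its own context."""
    prefix = f"[{chunk.company} | {chunk.form_type} | FY{chunk.fiscal_year}]"
    return f"{prefix} {chunk.text}"


def rrf_fuse(dense_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document.

    Assumption: RRF is used here only as an example of hybrid fusion.
    """
    scores: dict[str, float] = {}
    for ranking in (dense_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    chunk = Chunk("Revenue was $10B.", "ACME Corp", "10-K", 2022)
    print(with_metadata_prefix(chunk))
    print(rrf_fuse(["a", "b", "c"], ["b", "d", "a"]))
```

The prefix means a chunk that only says "Revenue was $10B." still embeds and keyword-matches on the company, form type, and fiscal year. The fusion step rewards documents that rank well in both lists without needing the dense and BM25 scores to be on the same scale.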

Related reading

5 Failure Modes I Found in My Financial RAG (And the One That Actually Mattered)

I rebuilt my Financial Mentor retrieval from scratch. Here's everything the RAG…

Why RAG gives wrong answers (and how to fix retrieval failures)

RAG Evaluation with RAGAs: Faithfulness, Context Recall, and Answer Relevance

RAG Architecture Deep Dive

Most RAG Problems Are Retrieval Problems. Here Are 8 Fixes That Worked for Me
