How I Finally Tamed Long Document Analysis with LLMs (It Wasn't Simple Chunking)

A few months ago, I was building a system to answer questions from 100-page technical PDFs. You know the drill: a user uploads a manual, asks “What’s the torque spec for bolt A?” and wants an accurate answer fast. What I thought would be a quick LangChain script turned into a weeks-long battle with token limits, hallucination, and cost explosions. Here’s what I learned.

The Problem

I had a client who needed to make sense of legacy engineering documents. Each PDF was dense, with tables, diagrams, and footnotes. My first prototype used GPT-4 with the full text crammed into a prompt. That worked for 10-page docs, but at 100 pages I hit the 128K token cap even before adding the question. And if I squeezed it in, the model would “forget” details from the middle (the lost-in-the-middle effect). Cost per query was laughable — $0.50+ per call.

I needed a solution that could handle arbitrarily long documents, return accurate answers, and not bankrupt us.

What I Tried (and Failed At)

How I Finally Tamed Long Document Analysis with LLMs (It Wasn't Simple Chunking)

Related reading

How I Stopped Fighting Hallucinations in LLM Data Extraction

Why I Stopped Building My Own Document Q&A from Scratch

38/60 Days System Design Questions

I Built a Q&A Bot for My Docs and Almost Gave Up (Here's What Worked)

I Spent a Month Fighting LLMs to Extract Structured Data

The LLM Was the Easy Part: Building a Hybrid RAG API