Building a Fully-Local Research RAG on 2 GTX 1080 Ti + an RTX 3090 — 3 Gotchas

I wanted to ask questions about my own papers without shipping them to a cloud API. This is the real story of building that — a private, fully-offline RAG with hybrid retrieval and reranking — across a pile of old GPUs and one newer one. Three things each cost me the better part of a day, and none of them were what I expected.

The goal: a private RAG over my own papers

I'm a researcher with a folder of PDFs I can't (and won't) upload to a hosted API. I wanted natural-language, cited answers over that corpus, running entirely on my own hardware. So I built a small tool — paper-rag, about 200 lines of Python — with the whole stack local:

PDFs → chunk → BGE-M3 dense (Ollama)  ┐
               BM25 sparse (fastembed)┴→ Qdrant (embedded, on disk)
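To make the two ends of that diagram concrete, here is a minimal, dependency-free sketch of the chunking step and of one common way to merge dense and sparse result lists, reciprocal rank fusion (RRF). This is not paper-rag's actual code: the function names `chunk_words` and `rrf_fuse`, the word-based windowing, the chunk sizes, and the choice of RRF with k = 60 are all assumptions for illustration. The post does not say how the two retrievers' results are combined.

```python
from collections import defaultdict


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words.

    Hypothetical helper: sizes are illustrative, not taken from paper-rag.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list adds 1 / (k + rank) to a document's score.

    Assumed fusion method; `rankings` holds ids ordered best-first,
    e.g. one list from the dense retriever and one from BM25.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    print(chunk_words("a b c d e f g", size=4, overlap=2))
    dense_hits = ["a", "b", "c"]   # placeholder ids, best first
    sparse_hits = ["b", "c", "d"]
    print(rrf_fuse([dense_hits, sparse_hits]))
```

RRF only looks at ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. A document that appears in both lists outranks one that tops only a single list.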

Related reading

I spent two weeks optimizing 96GB of VRAM for local LLMs. Paid APIs still won.

How I benchmarked a 100% local RAG pipeline to 9/9 (zero API keys)

RAG SOTA: I Tested 7 Pipelines and Built SEQUOIA (Open Source)

Running 35B–400B LLMs on a GPU-less Cluster to Mine 10,000 Papers — and the 4…

Building a Local-Only RAG System with Ollama and TypeScript

RAG SOTA: I Built SEQUOIA and Tested 7 Pipelines — Full Results
