TL;DR (AI-generated)

A CPU-only LLM pipeline (llama.cpp running a quantized 35B MoE model) extracted data from 10,000 papers on-prem at 6 tokens/s per node. Extraction at this scale without GPUs is feasible with quantized MoE models. The critical insight for data-sensitive teams: silent data loss (79% collisions) means every pipeline hop needs input-output reconciliation.
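To make "input-output reconciliation" concrete, here is a minimal sketch of a per-hop check. It is an illustration under assumptions, not the pipeline's actual code: it assumes every document carries a stable ID, and the names `HopReport`, `reconcile`, and `assert_clean` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class HopReport:
    """Result of comparing the IDs entering a hop with the IDs leaving it."""
    missing: frozenset      # IDs that went in but never came out
    unexpected: frozenset   # IDs that came out but were never sent in
    duplicated: frozenset   # IDs that appear more than once in the output


def reconcile(input_ids, output_ids):
    """Compare input and output document IDs for one pipeline hop.

    Assumes (hypothetically) that each document has a stable, unique ID.
    """
    in_set = set(input_ids)
    out_counts = Counter(output_ids)
    out_set = set(out_counts)
    return HopReport(
        missing=frozenset(in_set - out_set),
        unexpected=frozenset(out_set - in_set),
        duplicated=frozenset(k for k, n in out_counts.items() if n > 1),
    )


def assert_clean(report):
    """Fail loudly instead of letting a lossy hop pass silently."""
    if report.missing or report.unexpected or report.duplicated:
        raise ValueError(f"hop reconciliation failed: {report}")
```

Calling `assert_clean(reconcile(ids_in, ids_out))` after each stage turns a silent drop or key collision into an immediate error rather than a surprise in the final dataset.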

A field report from building a CPU-only, distributed LLM pipeline for large-scale scientific literature extraction. No GPUs. A lot of quantization. And four silent data-quality bugs that taught me more than the happy path ever did.

The constraint that started it all

Our team runs an internal research cluster: a couple dozen older x86 servers, plenty of RAM, zero GPUs. The mandate was to extract structured data — effect sizes, the entity each one describes, and the direction of effect — from ~10,000 full-text research papers, so a downstream meta-analysis could pool them.
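As a rough illustration of the target record, the three extracted fields could be modelled like this. This is a sketch under assumptions: the field names, the `Direction` categories, and the JSON-like `raw` dict shape are hypothetical and not taken from the original pipeline.

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    # Hypothetical categories; the real pipeline may use different labels.
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NULL = "null"


@dataclass(frozen=True)
class Extraction:
    paper_id: str         # which paper the record came from
    entity: str           # what the effect size describes
    effect_size: float    # the reported effect size
    direction: Direction  # direction of the effect


def parse_extraction(paper_id, raw):
    """Validate one model output (assumed to be a dict) into a typed record.

    Raises KeyError or ValueError on malformed output instead of
    passing bad rows downstream.
    """
    entity = str(raw["entity"]).strip()
    if not entity:
        raise ValueError("entity must not be empty")
    return Extraction(
        paper_id=paper_id,
        entity=entity,
        effect_size=float(raw["effect_size"]),
        direction=Direction(raw["direction"]),
    )
```

Validating at the point of extraction means a malformed model response fails on the spot, not later when the meta-analysis pools the records.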

The obvious 2024-era answer is "send it to a hosted LLM API." That wasn't on the table for data-governance reasons: the corpus had to stay on-prem. So the real question became:

Can you do serious LLM extraction at the 10k-document scale with CPUs only?

dev.to

Running 35B–400B LLMs on a GPU-less Cluster to Mine 10,000 Papers — and the 4 Bugs That Almost Ruined the Data

A field report: a CPU-only, GPU-less distributed LLM pipeline (llama.cpp + quantized MoE) mining 10,000 papers — and the 4 silent data-quality bugs that nearly ruined the results.

Wednesday, June 3, 2026

2,153 words · ~10 min read
