TL;DRAI

Local Gemma 4 26B (RTX 4090) can't compete on batch: OpenAI Batch (gpt-4-mini) 50% off, line-isolated JSONL, ~1¢/doc, zero errors on high-volume filing extractions. Batch processing without cross-doc context: managed API beats self-hosted on reliability, latency, and hardware TCO.

I run a one-person AI shop. For 2asy.ai's filing pipeline that needs thousands of single-document extractions per cycle, the local rig lost the batch lane and OpenAI Batch won. Per-pipeline, not per-company.

The rule that decided it: no cross-document attention. Each filing gets its own prompt window. No string concatenation. The rule came from a Neo4j rollback I already paid for.

Quick results.

Local Gemma 4 26B on llama.cpp (RTX 4090 + W6800): live serving fine. Batch lane blocked. vLLM has no 4-bit MoE path I need, container wants CUDA 12.9, host driver is 12.8. GGML_CUDA_DISABLE_GRAPHS=1 keeps llama.cpp alive when graph optimizer segfaults.

OpenRouter: no real batch. Live pricing. At concurrency 32, latency 2 to 17 seconds, 121s timeouts, 429s.

dev.to

I Built a Local LLM Rig to Escape API Bills. Then I Paid OpenAI Again.

A solo AI operator tested llama.cpp, OpenRouter, Gemini batch, and OpenAI Batch under one strict rule for the 2asy.ai extraction pipeline. What broke and what fit.

sabato 13 giugno 2026 New tab

TL;DRAI

187 words~1 min read

The rule that decided it: no cross-document attention. Each filing gets its own prompt window. No string concatenation. The rule came from a Neo4j rollback I already paid for.

Quick results.

OpenRouter: no real batch. Live pricing. At concurrency 32, latency 2 to 17 seconds, 121s timeouts, 429s.

I Built a Local LLM Rig to Escape API Bills. Then I Paid OpenAI Again.

I Built a Local LLM Rig to Escape API Bills. Then I Paid OpenAI Again.

Related reading

How I Cut My LLM Costs by 90% Without Changing My App Logic

How I Built a Drop-In Proxy to Slash My OpenAI Bills by 20%+ Automatically

Turn ~800M Free AI Tokens Into a Single OpenAI API with FreeLLMAPI

How I Slashed My AI API Bill by 95% — A Practical Guide for 2026

How I Cut My AI Bill by Caching LLM Responses in Node.js

Run Your Own AI Server for $0/month with Ollama

Related reading

How I Cut My LLM Costs by 90% Without Changing My App Logic

How I Built a Drop-In Proxy to Slash My OpenAI Bills by 20%+ Automatically

Turn ~800M Free AI Tokens Into a Single OpenAI API with FreeLLMAPI

How I Slashed My AI API Bill by 95% — A Practical Guide for 2026

How I Cut My AI Bill by Caching LLM Responses in Node.js

Run Your Own AI Server for $0/month with Ollama