I run a one-person AI shop. For 2asy.ai's filing pipeline that needs thousands of single-document extractions per cycle, the local rig lost the batch lane and OpenAI Batch won. Per-pipeline, not per-company.
The rule that decided it: no cross-document attention. Each filing gets its own prompt window. No string concatenation. The rule came from a Neo4j rollback I already paid for.
Quick results.
Local Gemma 4 26B on llama.cpp (RTX 4090 + W6800): live serving fine. Batch lane blocked. vLLM has no 4-bit MoE path I need, container wants CUDA 12.9, host driver is 12.8. GGML_CUDA_DISABLE_GRAPHS=1 keeps llama.cpp alive when graph optimizer segfaults.
OpenRouter: no real batch. Live pricing. At concurrency 32, latency 2 to 17 seconds, 121s timeouts, 429s.






