Shipping a Local LLM API with FastAPI and Ollama


Wednesday, June 24, 2026


Phase 2 of the de-swarm project — how I turned a 3B text-to-SQL model into a production API for $0.

The setup

Three weeks ago, I distilled a 120B+ text-to-SQL pipeline into a 3B QLoRA fine-tune of Qwen2.5-Coder-3B-Instruct. The model hit 90% in-domain accuracy and 55.5% on Spider, ran on a laptop CPU via Ollama, and cost $0 to train and $0 to run inference. I wrote about it in the Phase 1 post.
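For readers who haven't used Ollama: a model served by Ollama is already reachable over HTTP, but only on the machine running it. Below is a minimal sketch of querying it from Python with only the standard library. It assumes Ollama's default local port (11434) and its `/api/generate` endpoint. The model tag `text2sql-3b` and the prompt template are hypothetical placeholders, not the ones used in Phase 1.

```python
import json
import urllib.request

# Ollama serves a local HTTP API on port 11434 by default.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Hypothetical tag: substitute whatever name the fine-tune is registered under.
MODEL_TAG = "text2sql-3b"


def build_payload(question: str, schema: str, model: str = MODEL_TAG) -> dict:
    """Build a non-streaming /api/generate request body.

    The prompt template is an illustrative assumption, not the format
    the fine-tune was actually trained on.
    """
    prompt = f"-- Schema:\n{schema}\n-- Question: {question}\n-- SQL:\n"
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                # ask for one JSON object, not a token stream
        "options": {"temperature": 0},  # greedy decoding keeps SQL output stable
    }


def generate_sql(question: str, schema: str, url: str = OLLAMA_URL,
                 timeout: float = 60.0) -> str:
    """Send one question to the local model and return the generated SQL."""
    body = json.dumps(build_payload(question, schema)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        result = json.load(response)
    # With stream=False, Ollama returns the full completion in the "response" field.
    return result["response"].strip()
```

With Ollama running and the model pulled, `generate_sql("How many users signed up in 2025?", "CREATE TABLE users (id INTEGER, signup_date DATE);")` returns the model's SQL as a string. Everything stays on localhost, with no input validation, authentication, or error handling.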

But "I have a model that runs on my laptop" is a different category of deliverable than "I have an API anyone can call." The first is a research artifact. The second is a product.

Phase 2 was about crossing that gap. This post is the story of how I did it.
