Hearth: scale-to-zero LLM serving on Kubernetes — and you can hack on it without a GPU

Repo: github.com/hearth-project/hearth · Apache-2.0 · v0.1.0, alpha.

I've been building Hearth, a Kubernetes operator that serves open-source LLMs (Qwen, DeepSeek, GLM, …) declaratively and scales them to zero when idle. It's at a point where the core works end-to-end on real GPUs, and I'm looking for people to build it with me. The thing I most want you to know up front: you can contribute without owning an accelerator. More on that below.

## The one interesting problem

Self-hosting an LLM on K8s is easy until you notice the GPU is burning money while nobody's using the model. The obvious fix — "scale to zero" — runs straight into a chicken-and-egg problem: a stock HPA can't scale up from zero, because zero replicas means zero metrics, which means it never wakes up.
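One way to see the dead end is the ratio rule given in the Kubernetes HPA documentation, `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`. The sketch below just evaluates that rule; the function name and the sample numbers are illustrative, and a real HPA adds tolerances, stabilization windows, and min/max clamping that are left out here.

```python
import math


def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float) -> int:
    """Evaluate the documented HPA ratio rule, with no tolerance or clamping."""
    return math.ceil(current_replicas * (current_metric / target_metric))


# Two running pods, each 50% over target: the rule asks for a third pod.
print(hpa_desired_replicas(2, 150.0, 100.0))

# Zero running pods: the product is zero whatever the metric says,
# so the rule alone can never produce the first replica.
print(hpa_desired_replicas(0, 150.0, 100.0))
```

With zero pods there is also nothing to report a per-pod metric in the first place, which is why the wake-up signal has to come from somewhere outside the backend.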

Hearth puts a small gateway (an OpenAI-compatible reverse proxy) in front of each model. When a request arrives at a scaled-to-zero backend, the gateway accepts it, holds the connection open (SSE keepalive heartbeats so nothing times out), and bumps a pending counter exposed at /hearth/queue. KEDA polls that endpoint, sees pending > 0, and scales the backend 0 → 1. The pod loads weights from a warm cache, becomes Ready, and the gateway forwards the buffered request and streams tokens back. Idle again → KEDA scales it back to 0.
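The hold-and-count step can be sketched in a few lines. This is not Hearth's actual gateway code: `PendingQueue`, `handle_request`, the `{"pending": N}` JSON shape, and the choice to keep a request counted until it has been forwarded are all assumptions made for illustration. Only the `/hearth/queue` path and the pending counter come from the description above.

```python
import json
import threading
from contextlib import contextmanager


class PendingQueue:
    """Tracks held requests and whether the backend is Ready (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending = 0
        self._ready = threading.Event()

    @contextmanager
    def hold(self):
        # Count the request for as long as the gateway is holding it.
        with self._lock:
            self._pending += 1
        try:
            yield
        finally:
            with self._lock:
                self._pending -= 1

    def mark_ready(self) -> None:
        # Called when the backend pod passes its readiness check.
        self._ready.set()

    def mark_scaled_to_zero(self) -> None:
        self._ready.clear()

    def wait_until_ready(self, timeout: float) -> bool:
        return self._ready.wait(timeout)

    def queue_payload(self) -> str:
        # Body a gateway could serve at /hearth/queue (shape is assumed).
        with self._lock:
            return json.dumps({"pending": self._pending})


def handle_request(queue: PendingQueue, forward, prompt: str,
                   heartbeat_interval: float = 0.05):
    """Hold one request until the backend is Ready, then forward it."""
    heartbeats = 0
    with queue.hold():
        while not queue.wait_until_ready(heartbeat_interval):
            # A real gateway would write a keepalive to the client here,
            # for example an SSE comment line such as ": keepalive".
            heartbeats += 1
        return forward(prompt), heartbeats
```

To try it, run `handle_request` in a thread with any callable as `forward`, poll `queue_payload()` to watch the count go to 1, then call `mark_ready()` to release the request. On the autoscaler side, KEDA's metrics-api scaler can read a numeric field from a JSON endpoint like this one; whether Hearth wires it up that way is not stated here.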
