Making a fleet of self-hosted LLM agents trustworthy

Running one local LLM node is easy. Running a fleet of them, off-cluster, and trusting it to stay current and stay honest, is the hard part. This is the work that got LLMKube there: declarative, health-gated self-update for off-cluster agents (Helm and Homebrew for the edge), liveness and admission validation so dead or malformed nodes cannot lie to the control plane, and a real end-to-end test. Plus the build-in-public part: the bugs our own dogfooding caught that the unit tests could not, including a self-update path that had quietly disabled itself in production.

Sunday, June 14, 2026

Originally published at llmkube.com/blog/making-self-hosted-llm-agents-trustworthy. Cross-posted here for the dev.to audience.

Running a single local LLM node is a solved problem. You write an InferenceService, the operator schedules it, llama.cpp or MLX serves it, and you get an OpenAI-compatible endpoint. We have been doing that for months.

Running a fleet of them is where it stops being easy. My fleet is heterogeneous on purpose: CUDA pods in the cluster, and Apple Silicon Macs sitting off-cluster on the homelab network, each one running two separate agents (one for inference, one for the agentic coding harness). The day I shipped 0.8.4 to that fleet, I learned exactly how it does not scale.

I updated each Mac by hand. The control plane had no idea what version any agent was running. And the launchd reload I used to restart an agent was a silent no-op on an already-loaded service, so the old binary kept running while I believed I had updated it. I found that out by hand-inspecting a process tree. Three machines made it annoying. Thirty would make it impossible, and the whole pitch for sovereign, on-prem AI is that you run a lot more than three.
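If you hit the same launchd trap, one way to force an already-loaded agent onto a new binary is `launchctl kickstart -k`, which kills the running instance and starts the service again. Below is a minimal Python sketch, assuming a per-user LaunchAgent. The label `com.example.llmkube-agent` and the helper names are hypothetical, and this is not necessarily how LLMKube's updater restarts its agents.

```python
import os
import subprocess
from typing import Callable


def kickstart_command(label: str, uid: int) -> list[str]:
    """Build a launchctl command that restarts an already-loaded per-user agent."""
    # Per-user LaunchAgents are addressed as gui/<uid>/<label>.
    # "-k" kills the running instance before starting it again, so the
    # binary currently on disk is the one that comes back up.
    return ["launchctl", "kickstart", "-k", f"gui/{uid}/{label}"]


def restart_agent(label: str, run: Callable[..., object] = subprocess.run) -> None:
    """Force-restart an agent; `run` is injectable so the logic can be tested off macOS."""
    # check=True raises CalledProcessError on failure instead of failing silently.
    run(kickstart_command(label, os.getuid()), check=True)
```

On a Mac you would call `restart_agent("com.example.llmkube-agent")`. A successful kickstart only proves launchd restarted *something*. The check that would have caught the stale binary is asking the agent which version it is actually running.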

So the last stretch of work on LLMKube was not about a faster runtime or a bigger model. It was about making the fleet trustworthy: able to update itself safely, and unable to lie to the control plane about its own state. Here is what that took.
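The two properties, updating safely and not lying about state, can be illustrated with a small sketch. It assumes each agent reports its running version plus a heartbeat timestamp, and that the control plane treats a missed heartbeat as death. `AgentStatus`, `needs_attention`, `rollout_gate` and the 30-second TTL used below are hypothetical illustrations, not LLMKube's actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class AgentStatus:
    """Hypothetical report an off-cluster agent sends to the control plane."""
    name: str
    version: str
    last_heartbeat: datetime


def is_live(status: AgentStatus, now: datetime, ttl: timedelta) -> bool:
    # An agent that stopped heartbeating counts as dead, whatever it last claimed.
    return now - status.last_heartbeat <= ttl


def needs_attention(statuses: list[AgentStatus], desired_version: str,
                    now: datetime, ttl: timedelta) -> list[tuple[str, str]]:
    """Return (name, reason) for every agent that is stale or on the wrong version."""
    flagged = []
    for s in statuses:
        if not is_live(s, now, ttl):
            # Liveness first: a version report from a dead agent is not trustworthy.
            flagged.append((s.name, "stale heartbeat"))
        elif s.version != desired_version:
            flagged.append((s.name, f"running {s.version}, want {desired_version}"))
    return flagged


def rollout_gate(status: AgentStatus, target_version: str,
                 now: datetime, ttl: timedelta) -> bool:
    # Health gate: move on to the next agent only once this one is live on the target version.
    return is_live(status, now, ttl) and status.version == target_version
```

In a rollout loop, the control plane would update one agent, poll until `rollout_gate` returns True, and halt if it never does, so a bad release stops at the first machine instead of spreading across the fleet.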

