Not every request needs a frontier model, and a surprising share of them can run for nothing at all. The problem is that "free" usually sounds like "worse," so teams pay for every request just to be safe. Routing is what removes that trade-off.
There are actually two separate pools of zero-cost inference, and they behave very differently. It's worth knowing both before you decide what to send where.
Pool one: local models
Run a model on your own hardware and the marginal cost of a request is zero. Manifest connects local servers the same way it connects anything else: Ollama, LM Studio, and llama.cpp, plus any other OpenAI-compatible server you point it at.
Three things make local special. It's free, in the sense that you pay for electricity, not per token. It's private, because the prompt never leaves your machine. And it has no rate limits, because you aren't sharing a quota with strangers. The catch is just as simple: you need the hardware, and a small local model is not Opus. Which is exactly why you don't send it the hard work.







