New in llama.cpp: Model Management

Back to Articles

llama.cpp server now ships with router mode, which lets you dynamically load, unload, and switch between multiple models without restarting.

Reminder: llama.cpp server is a lightweight, OpenAI-compatible HTTP server for running LLMs locally.

This feature was a popular request to bring Ollama-style model management to llama.cpp. It uses a multi-process architecture where each model runs in its own process, so if one model crashes, others remain unaffected.

Quick Start

New in llama.cpp: Model Management

Related reading

Llamafile vs vLLM: Two Ways to Serve a Local Model, and When Each Makes Sense

Using OCR models with llama.cpp

Run Llama 2 with an API – Replicate blog

Introducing LlamaStash: a zero-overhead, terminal-native llama.cpp launcher

Fine-Tune Llama 3 706B Model Locally

Week 3 of LLaMA 🦙 – Replicate blog