Serving a Fleet of SLMs on One RTX 5080: Multi-Model on a Single Consumer GPU

Serving several chat LLMs from one 16 GB RTX 5080 by reusing an existing router plus ~150 lines of shell, with controlled benchmarks: where the memory split doesn't matter, where prefix caching doubles throughput, and where 'smart' routing made things slower.

Tuesday, 26 May 2026

Every number below was measured on a single RTX 5080 (16 GB) and is reproducible from the repo. Each result states the exact config it was measured under; I don't compare numbers across configs, and I flag anything I did **not** cleanly measure.

TL;DR

You can serve several small chat LLMs from one 16 GB RTX 5080, behind a single
