Fitting WhisperX large-v3 + a 24B LLM on one 3090: a reproducible context-capping recipe


Wednesday, June 3, 2026


This is the technical, reproducible version of a fix I shipped on my own homelab. If you want the narrative version, that's on Medium. This one is the recipe: the measurements, the math, the Modelfile, and the exact prompt I gave Claude Code to generate it. Copy-paste friendly.

Repo for the dashboard used throughout: https://github.com/SikamikanikoBG/homelab-monitor

TL;DR

One 24GB RTX 3090, two GPU services: WhisperX large-v3 (STT, 7.7GB peak) and a Devstral Small 24B email-triage LLM (Q4_K_M, ~18.3GB).

18.3 + 7.7 = 26.0GB, about 2GB more than the 24GB card holds → CUDA OOM whenever the two services overlapped.
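The budget arithmetic from the TL;DR can be sketched in a few lines of Python. This is only an illustration: the function name `vram_budget` and the dictionary keys are hypothetical, the figures are the ones quoted above converted to MB, and 24,000 MB is the card's nominal size (usable VRAM on a real 3090 is somewhat lower, so treat the ceiling as optimistic).

```python
# Figures from the TL;DR, expressed in MB so the arithmetic stays in integers.
CARD_MB = 24_000           # nominal RTX 3090 VRAM (assumption: usable VRAM is a bit less)
WHISPERX_PEAK_MB = 7_700   # WhisperX large-v3 peak
LLM_MB = 18_300            # Devstral Small 24B, Q4_K_M, approximate footprint


def vram_budget(card_mb: int, fixed_mb: int, llm_mb: int) -> dict:
    """Return the combined footprint, the overcommit, and the LLM's ceiling.

    fixed_mb is the service whose peak we treat as non-negotiable
    (here WhisperX); the LLM has to fit in whatever is left over.
    """
    total = fixed_mb + llm_mb
    return {
        "total_mb": total,
        "overcommit_mb": max(0, total - card_mb),
        "llm_ceiling_mb": card_mb - fixed_mb,
    }


budget = vram_budget(CARD_MB, WHISPERX_PEAK_MB, LLM_MB)
print(budget)
```

Read this way, the LLM has to come in under roughly 16.3GB for both services to stay resident at once, which means shaving about 2GB off its current footprint.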

