Running LTX-2.3 Alongside TTS on a Single 96GB GPU with a Cold-Start Architecture

How to go from 86 GiB idle VRAM (instant OOM) to 0 GiB idle / 40 GiB peak by using a cold-start design for LTX-2.3 on one RTX Pro 6000 Blackwell.

Friday, May 22, 2026

When integrating LTX-2.3 (a 22B audio-to-video model) into a voice roleplay product, I ran straight into a VRAM wall. Run as a persistent server, the model held 86 GiB of VRAM, which immediately pushed the TTS / Ditto / MuseTalk stack sharing the same GPU out of memory. This is the story of switching to a cold-start design that idles at 0 GiB and peaks at 40 GiB.
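As a rough illustration of what "cold start" means here, the sketch below runs each job in a fresh Python process that exits when the job is done. When a process exits, the driver reclaims its GPU memory, so the parent service holds no VRAM between jobs. This is a minimal sketch of the general pattern, not the implementation used in this project: the worker body, the job fields, and the name run_cold_start_job are all hypothetical, and the worker only echoes its input instead of loading a model.

```python
import json
import subprocess
import sys

# Hypothetical worker body. In a real setup this is where the model would be
# loaded, one generation would run, and the result would be written to disk.
# Here it only echoes the job so the pattern can run without a GPU.
WORKER_SOURCE = r"""
import json, sys
job = json.loads(sys.stdin.read())
# (assumption) model load + inference would happen here, inside this process
print(json.dumps({"status": "ok", "output": job["output_path"]}))
"""


def run_cold_start_job(job: dict, timeout_s: float = 600.0) -> dict:
    """Run one job in a fresh Python process and return its JSON result.

    The worker is a separate process, so whatever GPU memory it allocates
    is released when it exits; the caller itself never touches the GPU.
    """
    proc = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=json.dumps(job),   # job is passed on stdin as JSON
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,              # raise if the worker exits non-zero
    )
    # The worker prints exactly one JSON line on stdout.
    return json.loads(proc.stdout.strip().splitlines()[-1])


if __name__ == "__main__":
    print(run_cold_start_job({
        "prompt": "a person talking",
        "audio_path": "speech.wav",
        "image": "reference.png",
        "output_path": "out.mp4",
    }))
```

The trade-off of this pattern is that every job pays the full model load time again, in exchange for zero idle VRAM.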

Hardware: RTX Pro 6000 Blackwell Max-Q (94.97 GiB). Software: LTX-2 official repo and bitsandbytes 0.49.1.
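To make the numbers concrete, the arithmetic below computes how much VRAM is left for the rest of the stack in each case, using only the figures quoted above (94.97 GiB total, 86 GiB persistent, 0 GiB idle, 40 GiB peak). The helper name headroom_gib is just for illustration, and the calculation ignores driver and framework overhead.

```python
TOTAL_GIB = 94.97  # reported capacity of the RTX Pro 6000 Blackwell Max-Q


def headroom_gib(ltx_usage_gib: float, total_gib: float = TOTAL_GIB) -> float:
    """VRAM left for everything else while LTX holds ltx_usage_gib."""
    return round(total_gib - ltx_usage_gib, 2)


persistent = headroom_gib(86.0)  # persistent server, all the time
cold_idle = headroom_gib(0.0)    # cold-start design, between jobs
cold_peak = headroom_gib(40.0)   # cold-start design, during a job

print(f"persistent server: {persistent} GiB free")
print(f"cold start, idle:  {cold_idle} GiB free")
print(f"cold start, peak:  {cold_peak} GiB free")
```

With the persistent server, under 9 GiB is left at all times. With cold start, the stack has the whole card while LTX is idle and still about 55 GiB during a generation.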

What I Was Trying to Do

A2V (audio-to-video) mode generates lip-sync video from audio + a reference image + a text prompt. Specifically, it uses A2VidPipelineTwoStage:

prompt + audio_path + image
