When integrating LTX-2.3 (a 22B audio-to-video model) into a voice roleplay product, I ran straight into a VRAM wall. Kept resident as a persistent server, the model held 86 GiB, which immediately OOM-ed the TTS / Ditto / MuseTalk stack sharing the same GPU. This is the story of switching to a cold-start design that idles at 0 GiB and peaks at 40 GiB.
Hardware: RTX Pro 6000 Blackwell Max-Q (94.97 GiB). Software: LTX-2 official repo and bitsandbytes 0.49.1.
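To make the numbers concrete, here is a minimal sketch of the budget arithmetic and of the general cold-start idea: run each job in a short-lived child process so that all of its VRAM is released when it exits. The function names `headroom_gib` and `run_cold_job` are hypothetical, and the child command is a stand-in rather than the real generation script.

```python
import subprocess
import sys

TOTAL_GIB = 94.97       # GPU capacity from the hardware line above
PERSISTENT_GIB = 86.0   # footprint of the persistent server
COLD_PEAK_GIB = 40.0    # peak footprint of the cold-start design


def headroom_gib(total: float, used: float) -> float:
    """VRAM left over for the other models sharing the GPU."""
    return round(total - used, 2)


def run_cold_job(argv: list[str], timeout_s: float = 600.0) -> str:
    """Run one job in a throwaway child process and return its stdout.

    Any CUDA context is owned by the child, so once the child exits the
    driver frees its VRAM and the idle footprint goes back to zero.
    """
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout_s, check=True
    )
    return result.stdout


if __name__ == "__main__":
    print(headroom_gib(TOTAL_GIB, PERSISTENT_GIB))  # 8.97
    print(headroom_gib(TOTAL_GIB, COLD_PEAK_GIB))   # 54.97
    # Stand-in command; a real setup would launch the generation script here.
    print(run_cold_job([sys.executable, "-c", "print('job done')"]).strip())
```

The trade-off is that every request pays the model load time again, in exchange for leaving the GPU free between requests.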
What I Was Trying to Do
A2V (audio-to-video) mode generates lip-sync video from audio + a reference image + a text prompt. Specifically, it uses A2VidPipelineTwoStage:
prompt + audio_path + image