Introduction

LTX-2.3 is a video generation model from Lightricks that includes audio support. In A2V (Audio-to-Video) mode, it takes a single image + audio + prompt and generates lip sync, facial expressions, and head/hair motion all at once. Unlike lip-sync-only models like MuseTalk, it can animate an entire scene, which makes it a powerful tool for directing.

The catch: the base checkpoint is 22B parameters / 43 GB, and keeping it resident in bf16 with the transformer loaded twice (one copy per stage) burns ~86 GiB at idle. On an RTX PRO 6000 Blackwell with 96 GiB, that leaves roughly 10 GiB for the TTS / Ditto-TalkingHead / Qwen3-TTS-vLLM services running alongside it.

After testing several quantization approaches, I used LTX-2's native fp8_cast to cut peak VRAM from 40 GiB to 24 GiB (A2V cold start, 768×512 / 97 frames). optimum-quanto int8/fp8, by contrast, has a compatibility issue with the LTX-2 transformer and does not work. This post documents the debugging and the decisions made along the way.
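
To make the numbers above concrete, here is a back-of-the-envelope sketch of the VRAM budget. It only re-derives figures already quoted in this post; the variable names are mine, the "one transformer copy per stage" reading of the idle footprint is an assumption, and GB and GiB are treated interchangeably, as they are in the text.

```python
# Rough VRAM budget using the figures quoted in this post.
# Assumption: the idle footprint is two resident copies of the checkpoint
# (one per stage), and GB/GiB are treated loosely as the same unit.

GPU_TOTAL = 96          # RTX PRO 6000 Blackwell
CHECKPOINT = 43         # base checkpoint size
RESIDENT_COPIES = 2     # assumed: one transformer copy per stage

idle_resident = CHECKPOINT * RESIDENT_COPIES   # memory held at idle
headroom = GPU_TOTAL - idle_resident           # what is left for other services

# Peak VRAM for an A2V cold start at 768x512 / 97 frames
PEAK_BF16 = 40
PEAK_FP8_CAST = 24

saved = PEAK_BF16 - PEAK_FP8_CAST
saved_pct = 100 * saved / PEAK_BF16

print(f"idle resident: {idle_resident}, headroom: {headroom}")
print(f"fp8_cast saves {saved} at peak ({saved_pct:.0f}%)")
```

The point of the sketch is the headroom figure: with both copies resident there is very little room left for the co-located services, which is why a cut at peak matters on this machine.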

Environment