Cutting LTX-2 22B Peak VRAM by 40% with fp8_cast — and Why optimum-quanto Was a Trap

Introduction

LTX-2.3 is a video generation model from Lightricks that includes audio support. In A2V (Audio-to-Video) mode, it takes a single image, an audio clip, and a prompt, and generates lip sync, facial expressions, and head/hair motion all at once. Unlike lip-sync-only models such as MuseTalk, it can animate an entire scene, which makes it a powerful tool for directing.

The catch: the base checkpoint is 22B parameters / 43 GB, and keeping it resident in bf16 for both transformer stages (transformer × 2) burns ~86 GiB at idle. On an RTX PRO 6000 Blackwell with 96 GiB, that leaves almost nothing for the TTS, Ditto-TalkingHead, and Qwen3-TTS-vLLM services running alongside it.
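To make the budget concrete, here is a back-of-the-envelope sketch of the numbers above. It assumes weights dominate memory and uses 2 bytes per parameter for bf16 and 1 byte for fp8. The helper name `weight_gib` is purely illustrative, and the estimate ignores anything else stored in the checkpoint, so it will not match the 43 GB file size exactly.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_gib(num_params: float, bytes_per_param: int) -> float:
    """Rough weight footprint in GiB (hypothetical helper, weights only)."""
    return num_params * bytes_per_param / GIB


# 22B parameters, as stated for the base checkpoint.
bf16_gib = weight_gib(22e9, 2)  # bf16: 2 bytes per parameter
fp8_gib = weight_gib(22e9, 1)   # fp8: 1 byte per parameter

# Figures quoted in the text: ~86 GiB resident at idle on a 96 GiB card.
gpu_gib = 96.0
idle_gib = 86.0
headroom_gib = gpu_gib - idle_gib  # what is left for the other services

print(f"bf16 weights (one copy): ~{bf16_gib:.1f} GiB")
print(f"fp8 weights (one copy):  ~{fp8_gib:.1f} GiB")
print(f"headroom at idle:        {headroom_gib:.0f} GiB")
```

Two bf16 copies of a model this size land in the same range as the ~86 GiB idle figure, which is why only about 10 GiB remains for everything else on the card.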

After testing several quantization approaches, I got LTX-2's native fp8_cast to cut peak VRAM from 40 GiB to 24 GiB (A2V cold-start, 768×512 / 97f). optimum-quanto int8/fp8, by contrast, has a compatibility issue with the LTX-2 transformer and simply doesn't work. This post documents the debugging and the decisions made along the way.
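The 40% in the title follows directly from those two peak figures. A minimal check, with variable names chosen only for illustration:

```python
# Peak VRAM figures quoted above (A2V cold-start, 768x512 / 97f).
peak_bf16_gib = 40.0
peak_fp8_gib = 24.0

saved_gib = peak_bf16_gib - peak_fp8_gib  # absolute saving in GiB
reduction = saved_gib / peak_bf16_gib     # fraction of the bf16 peak

print(f"saved {saved_gib:.0f} GiB ({reduction:.0%} of the bf16 peak)")
```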

Environment
