Motivation

NVIDIA Cosmos Predict 2.5 is a large-scale world model capable of generating physically plausible videos conditioned on text, images, or video clips. It is a general-purpose model, however, so adapting it to a specific domain, such as robot manipulation or a particular camera viewpoint, still requires targeted fine-tuning.

Training robot policies requires demonstration data, but collecting real-robot trajectories is slow and expensive. Generating synthetic trajectories with a fine-tuned video world model offers a scalable alternative. However, fully fine-tuning a 2B-parameter model is costly in memory and compute, and it risks catastrophic forgetting of the model's general knowledge. LoRA and DoRA instead inject small trainable adapter modules into the frozen base model. Because only the adapter weights are updated, gradients and optimizer state are needed for just a small fraction of the parameters, which reduces memory requirements and keeps the adapter files small and portable. This makes it practical to fine-tune on a single GPU and to swap adapters for different domains at inference time.
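To give a rough sense of how small these adapters are, the sketch below counts the trainable parameters added to a single linear layer. The layer size and rank are hypothetical values chosen for illustration, not the actual Cosmos Predict 2.5 configuration, and `adapter_param_count` is a helper written for this example rather than a library function. It assumes the standard LoRA factorization (a rank-`r` update `B @ A`) and, for DoRA, one extra learned magnitude entry per output feature.

```python
def adapter_param_count(d_in: int, d_out: int, rank: int, use_dora: bool = False) -> int:
    """Trainable parameters an adapter adds to one linear layer of shape (d_out, d_in)."""
    # LoRA learns a low-rank update B @ A, with A: (rank, d_in) and B: (d_out, rank).
    count = rank * d_in + d_out * rank
    if use_dora:
        # DoRA also learns a magnitude vector; one entry per output
        # feature is assumed here.
        count += d_out
    return count


# Hypothetical layer size and rank, for illustration only.
d_in, d_out, rank = 2048, 2048, 16

full = d_in * d_out  # parameters updated by full fine-tuning of this layer
lora = adapter_param_count(d_in, d_out, rank)
dora = adapter_param_count(d_in, d_out, rank, use_dora=True)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA adapter:     {lora:,} ({lora / full:.2%} of the layer)")
print(f"DoRA adapter:     {dora:,} ({dora / full:.2%} of the layer)")
```

Under these assumed numbers, either adapter trains less than 2% of the layer's weights, which is where the savings in memory and checkpoint size come from.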

This guide walks through parameter-efficient fine-tuning of Cosmos Predict 2.5 with LoRA and DoRA, using the diffusers and accelerate libraries with support for both single- and multi-GPU training. We then show how to use the fine-tuned model to generate synthetic robot trajectories for downstream robot learning tasks.