
Training Modes Guide

The trainer supports several training modes, each suited for different use cases and requirements.

🎯 Standard LoRA Training (Video-Only)

Standard LoRA (Low-Rank Adaptation) training fine-tunes the model by adding small, trainable adapter layers while keeping the base model frozen. This approach:

  • Requires significantly less memory and compute than full fine-tuning
  • Produces small, portable weight files (typically a few hundred MB)
  • Is ideal for learning specific styles, effects, or concepts
  • Can be easily combined with other LoRAs during inference

Configure standard LoRA training with:

model:
  training_mode: "lora"

training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
  with_audio: false  # Video-only training
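
In a real run you will typically also configure the LoRA adapter itself (its rank and which layers it attaches to). The snippet below is only a rough sketch of what that might look like; the rank and alpha key names are assumptions to check against your trainer's config schema, and target_modules is discussed under the audio-video section below and in Understanding Target Modules:

lora:
  rank: 64             # assumed key name: adapter rank; larger rank = more capacity and a larger weight file
  alpha: 64            # assumed key name: scaling factor, often set equal to the rank
  target_modules:
    - "to_k"           # broad suffix pattern; see the target_modules guidance below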

🔊 Audio-Video LoRA Training

LTX-2 supports joint audio-video generation. You can train LoRA adapters that affect both video and audio output:

  • Synchronized audio-video generation - Audio matches the visual content
  • Same efficient LoRA approach - Just enable audio training
  • Requires audio latents - Dataset must include preprocessed audio

Configure audio-video training with:

model:
  training_mode: "lora"

training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
  with_audio: true  # Enable audio training
  audio_latents_dir: "audio_latents"  # Directory containing audio latents

Example configuration file:

Dataset structure for audio-video training:

preprocessed_data_root/
├── latents/           # Video latents
├── conditions/        # Text embeddings
└── audio_latents/     # Audio latents (required when with_audio: true)

When training audio-video LoRAs, ensure your target_modules configuration captures video, audio, and cross-modal attention branches. Use patterns like "to_k" instead of "attn1.to_k" to match:

  • Video modules: attn1.to_k, attn2.to_k
  • Audio modules: audio_attn1.to_k, audio_attn2.to_k
  • Cross-modal modules: audio_to_video_attn.to_k, video_to_audio_attn.to_k

The cross-modal attention modules (audio_to_video_attn and video_to_audio_attn) enable bidirectional information flow between audio and video, which is critical for synchronized audiovisual generation. See Understanding Target Modules for detailed guidance.
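
As an illustration, a target_modules list using broad suffix patterns might look like the sketch below. Placing it under a lora section is an assumption about the config layout, and the projection names other than to_k are likewise assumed to follow the same naming convention:

lora:
  target_modules:
    # "to_k" matches attn1.to_k, attn2.to_k, audio_attn1.to_k, audio_attn2.to_k,
    # audio_to_video_attn.to_k, and video_to_audio_attn.to_k
    - "to_k"
    - "to_q"    # assumed: query projections, matched the same way
    - "to_v"    # assumed: value projections
    - "to_out"  # assumed: output projections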

You can generate audio during validation even if you're not training the audio branch. Set validation.generate_audio: true independently of training_strategy.with_audio.
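
For example, to keep the audio branch frozen while still previewing audio in validation clips:

training_strategy:
  name: "text_to_video"
  with_audio: false      # audio branch is not trained

validation:
  generate_audio: true   # validation videos still include generated audio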

🔥 Full Model Fine-tuning

Full model fine-tuning updates all parameters of the base model, providing maximum flexibility but requiring substantial computational resources and larger training datasets:

  • Offers the highest potential quality and capability improvements
  • Requires multiple GPUs and distributed training techniques (e.g., FSDP)
  • Produces large checkpoint files (several GB)
  • Best for major model adaptations or when LoRA limitations are reached

Configure full fine-tuning with:

model:
  training_mode: "full"

training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1

Full fine-tuning of LTX-2 requires multiple high-end GPUs (e.g., 4-8× H100 80GB) and distributed training with FSDP. See Training Guide for multi-GPU setup instructions.

🔄 In-Context LoRA (IC-LoRA) Training

IC-LoRA is a specialized training mode for video-to-video transformations. Unlike the standard modes, which learn from individual videos, IC-LoRA learns a transformation from pairs of videos, enabling a wide range of advanced video-to-video applications, such as:

  • Control adapters (e.g., Depth, Pose): Learn to map from a control signal (like a depth map or pose skeleton) to a target video
  • Video deblurring: Transform blurry input videos into sharp, high-quality outputs
  • Style transfer: Apply the style of a reference video to a target video sequence
  • Colorization: Convert grayscale reference videos into colorized outputs
  • Restoration and enhancement: Denoise, upscale, or restore old or degraded videos

By providing paired reference and target videos, IC-LoRA can learn complex transformations that go beyond caption-based conditioning.

IC-LoRA training fundamentally differs from standard LoRA and full fine-tuning:

  • Reference videos provide clean, unnoised conditioning input showing the "before" state
  • Target videos are noised during training and represent the desired "after" state
  • The model learns transformations from reference videos to target videos
  • Loss is applied only to the target portion, not the reference
  • Training and inference time increase significantly due to the doubled sequence length

To enable IC-LoRA training, configure your YAML file with:

model:
  training_mode: "lora"  # Required: IC-LoRA uses LoRA mode

training_strategy:
  name: "video_to_video"
  first_frame_conditioning_p: 0.1
  reference_latents_dir: "reference_latents"  # Directory for reference video latents

Example configuration file:

Dataset Requirements for IC-LoRA

  • Your dataset must contain paired videos where each target video has a corresponding reference video
  • Reference and target videos must have identical resolution and length
  • Both reference and target videos should be preprocessed together using the same resolution buckets

Dataset structure for IC-LoRA training:

preprocessed_data_root/
├── latents/            # Target video latents (what the model learns to generate)
├── conditions/         # Text embeddings for each video
└── reference_latents/  # Reference video latents (conditioning input)

Generating Reference Videos

We provide an example script to generate reference videos (e.g., Canny edge maps) for a given dataset. The script takes a JSON file as input (e.g., the output of caption_videos.py) and updates it with the generated reference video paths.

uv run python scripts/compute_reference.py scenes_output_dir/ \
    --output scenes_output_dir/dataset.json

To compute a different condition (depth maps, pose skeletons, etc.), modify the compute_reference() function in the script.

Configuration Requirements for IC-LoRA

  • You must provide reference_videos in your validation configuration when using IC-LoRA training
  • The number of reference videos must match the number of validation prompts

Example validation configuration for IC-LoRA:

validation:
  prompts:
    - "First prompt describing the desired output"
    - "Second prompt describing the desired output"
  reference_videos:
    - "/path/to/reference1.mp4"
    - "/path/to/reference2.mp4"
  include_reference_in_output: true  # Show reference side-by-side with output

📊 Training Mode Comparison

| Aspect | LoRA | Audio-Video LoRA | Full Fine-tuning | IC-LoRA |
| --- | --- | --- | --- | --- |
| Memory Usage | Low | Low-Medium | High | Medium |
| Training Speed | Fast | Fast | Slow | Medium |
| Output Size | 100MB-few GB (depends on rank) | 100MB-few GB (depends on rank) | Tens of GB | 100MB-few GB (depends on rank) |
| Flexibility | Medium | Medium | High | Specialized |
| Audio Support | Optional | Yes | Optional | No |
| Reference Videos | No | No | No | Yes (required) |

🎬 Using Trained Models for Inference

After training, use the ltx-pipelines package for production inference with your trained LoRAs:

| Training Mode | Recommended Pipeline |
| --- | --- |
| LoRA / Audio-Video LoRA | TI2VidOneStagePipeline or TI2VidTwoStagesPipeline |
| IC-LoRA | ICLoraPipeline |

All pipelines support loading custom LoRAs via the loras parameter. See the ltx-pipelines package documentation for detailed usage instructions.

🚀 Next Steps

Once you've chosen your training mode: