
Training Modes Guide

The trainer supports several training modes, each suited for different use cases and requirements.

🎯 Standard LoRA Training (Video-Only)

Standard LoRA (Low-Rank Adaptation) training fine-tunes the model by adding small, trainable adapter layers while keeping the base model frozen. This approach:

  • Requires significantly less memory and compute than full fine-tuning
  • Produces small, portable weight files (typically a few hundred MB)
  • Is ideal for learning specific styles, effects, or concepts
  • Can be easily combined with other LoRAs during inference

Configure standard LoRA training with:

model:
  training_mode: "lora"

training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
  with_audio: false  # Video-only training
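
In a real run you will typically also configure the LoRA adapter itself (its rank and which layers it attaches to). The snippet below is only a rough sketch of what that might look like; the rank and alpha key names are assumptions to check against your trainer's config schema, and target_modules is discussed under the audio-video section below and in Understanding Target Modules:

lora:
  rank: 64             # assumed key name: adapter rank; larger rank = more capacity and a larger weight file
  alpha: 64            # assumed key name: scaling factor, often set equal to the rank
  target_modules:
    - "to_k"           # broad suffix pattern; see the target_modules guidance below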

🔊 Audio-Video LoRA Training

LTX-2 supports joint audio-video generation. You can train LoRA adapters that affect both video and audio output:

  • Synchronized audio-video generation - Audio matches the visual content
  • Same efficient LoRA approach - Just enable audio training
  • Requires audio latents - Dataset must include preprocessed audio

Configure audio-video training with:

model:
  training_mode: "lora"

training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
  with_audio: true  # Enable audio training
  audio_latents_dir: "audio_latents"  # Directory containing audio latents

Example configuration file:

Dataset structure for audio-video training:

preprocessed_data_root/
├── latents/           # Video latents
├── conditions/        # Text embeddings
└── audio_latents/     # Audio latents (required when with_audio: true)

When training audio-video LoRAs, ensure your target_modules configuration captures video, audio, and cross-modal attention branches. Use patterns like "to_k" instead of "attn1.to_k" to match:

  • Video modules: attn1.to_k, attn2.to_k
  • Audio modules: audio_attn1.to_k, audio_attn2.to_k
  • Cross-modal modules: audio_to_video_attn.to_k, video_to_audio_attn.to_k

The cross-modal attention modules (audio_to_video_attn and video_to_audio_attn) enable bidirectional information flow between audio and video, which is critical for synchronized audiovisual generation. See Understanding Target Modules for detailed guidance.
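
As an illustration, a target_modules list using broad suffix patterns might look like the sketch below. Placing it under a lora section is an assumption about the config layout, and the projection names other than to_k are likewise assumed to follow the same naming convention:

lora:
  target_modules:
    # "to_k" matches attn1.to_k, attn2.to_k, audio_attn1.to_k, audio_attn2.to_k,
    # audio_to_video_attn.to_k, and video_to_audio_attn.to_k
    - "to_k"
    - "to_q"    # assumed: query projections, matched the same way
    - "to_v"    # assumed: value projections
    - "to_out"  # assumed: output projections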

You can generate audio during validation even if you're not training the audio branch. Set validation.generate_audio: true independently of training_strategy.with_audio.
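
For example, to keep the audio branch frozen while still previewing audio in validation clips:

training_strategy:
  name: "text_to_video"
  with_audio: false      # audio branch is not trained

validation:
  generate_audio: true   # validation videos still include generated audio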

🔥 Full Model Fine-tuning

Full model fine-tuning updates all parameters of the base model, providing maximum flexibility but requiring substantial computational resources and larger training datasets:

  • Offers the highest potential quality and capability improvements
  • Requires multiple GPUs and distributed training techniques (e.g., FSDP)
  • Produces large checkpoint files (several GB)
  • Best for major model adaptations or when LoRA limitations are reached

Configure full fine-tuning with:

model:
  training_mode: "full"

training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1

Full fine-tuning of LTX-2 requires multiple high-end GPUs (e.g., 4-8× H100 80GB) and distributed training with FSDP. See Training Guide for multi-GPU setup instructions.

🔄 In-Context LoRA (IC-LoRA) Training

IC-LoRA is a specialized training mode for video-to-video transformations. Unlike the standard modes, which learn from individual videos, IC-LoRA learns a transformation from pairs of videos, enabling a wide range of advanced video-to-video applications, such as:

  • Control adapters (e.g., Depth, Pose): Learn to map from a control signal (like a depth map or pose skeleton) to a target video
  • Video deblurring: Transform blurry input videos into sharp, high-quality outputs
  • Style transfer: Apply the style of a reference video to a target video sequence
  • Colorization: Convert grayscale reference videos into colorized outputs
  • Restoration and enhancement: Denoise, upscale, or restore old or degraded videos

By providing paired reference and target videos, IC-LoRA can learn complex transformations that go beyond caption-based conditioning.

IC-LoRA training fundamentally differs from standard LoRA and full fine-tuning:

  • Reference videos provide clean, unnoised conditioning input showing the "before" state
  • Target videos are noised during training and represent the desired "after" state
  • The model learns transformations from reference videos to target videos
  • Loss is applied only to the target portion, not the reference
  • Training and inference time increase significantly due to the doubled sequence length

To enable IC-LoRA training, configure your YAML file with:

model:
  training_mode: "lora"  # Required: IC-LoRA uses LoRA mode

training_strategy:
  name: "video_to_video"
  first_frame_conditioning_p: 0.1
  reference_latents_dir: "reference_latents"  # Directory for reference video latents

Example configuration file:

Dataset Requirements for IC-LoRA

  • Your dataset must contain paired videos where each target video has a corresponding reference video
  • Reference and target videos must have identical resolution and length
  • Both reference and target videos should be preprocessed together using the same resolution buckets

Dataset structure for IC-LoRA training:

preprocessed_data_root/
├── latents/            # Target video latents (what the model learns to generate)
├── conditions/         # Text embeddings for each video
└── reference_latents/  # Reference video latents (conditioning input)

Generating Reference Videos

We provide an example script to generate reference videos (e.g., Canny edge maps) for a given dataset. The script takes a JSON file as input (e.g., the output of caption_videos.py) and updates it with the generated reference video paths.

uv run python scripts/compute_reference.py scenes_output_dir/ \
    --output scenes_output_dir/dataset.json

To compute a different condition (depth maps, pose skeletons, etc.), modify the compute_reference() function in the script.

Configuration Requirements for IC-LoRA

  • You must provide reference_videos in your validation configuration when using IC-LoRA training
  • The number of reference videos must match the number of validation prompts

Example validation configuration for IC-LoRA:

validation:
  prompts:
    - "First prompt describing the desired output"
    - "Second prompt describing the desired output"
  reference_videos:
    - "/path/to/reference1.mp4"
    - "/path/to/reference2.mp4"
  include_reference_in_output: true  # Show reference side-by-side with output

📊 Training Mode Comparison

| Aspect | LoRA | Audio-Video LoRA | Full Fine-tuning | IC-LoRA |
| --- | --- | --- | --- | --- |
| Memory Usage | Low | Low-Medium | High | Medium |
| Training Speed | Fast | Fast | Slow | Medium |
| Output Size | 100MB-few GB (depends on rank) | 100MB-few GB (depends on rank) | Tens of GB | 100MB-few GB (depends on rank) |
| Flexibility | Medium | Medium | High | Specialized |
| Audio Support | Optional | Yes | Optional | No |
| Reference Videos | No | No | No | Yes (required) |

🎬 Using Trained Models for Inference

After training, use the ltx-pipelines package for production inference with your trained LoRAs:

| Training Mode | Recommended Pipeline |
| --- | --- |
| LoRA / Audio-Video LoRA | TI2VidOneStagePipeline or TI2VidTwoStagesPipeline |
| IC-LoRA | ICLoraPipeline |

All pipelines support loading custom LoRAs via the loras parameter. See the ltx-pipelines package documentation for detailed usage instructions.

🚀 Next Steps

Once you've chosen your training mode: