# Training Modes Guide
The trainer supports several training modes, each suited for different use cases and requirements.
## 🎯 Standard LoRA Training (Video-Only)
Standard LoRA (Low-Rank Adaptation) training fine-tunes the model by adding small, trainable adapter layers while keeping the base model frozen. This approach:
- Requires significantly less memory and compute than full fine-tuning
- Produces small, portable weight files (typically a few hundred MB)
- Is ideal for learning specific styles, effects, or concepts
- Can be easily combined with other LoRAs during inference
Configure standard LoRA training with:

```yaml
model:
  training_mode: "lora"
training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
  with_audio: false  # Video-only training
```
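The low-rank idea behind LoRA can be sketched in a few lines: a frozen weight matrix `W` is augmented with a trainable update `(alpha / rank) * B @ A`, where `A` and `B` are small. This is a pure-Python illustration of the math, not the trainer's actual implementation (which applies adapters to the model's attention layers):

```python
# Minimal LoRA sketch (illustrative only -- the real trainer operates on
# the model's attention projections, not toy matrices).
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(w, a, b, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B train."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))      # low-rank path: down-project, up-project
    scale = alpha / rank
    return [bi + scale * ui for bi, ui in zip(base, update)]

# Frozen 2x2 identity weight plus a rank-1 adapter (A: 1x2, B: 2x1).
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]          # down-projection to rank 1
b = [[0.5], [0.0]]        # up-projection back to 2 dims
y = lora_forward(w, a, b, [2.0, 3.0], alpha=2, rank=1)  # [7.0, 3.0]
```

Only `a` and `b` receive gradients during training, which is why the resulting weight files stay small.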
## 🔊 Audio-Video LoRA Training
LTX-2 supports joint audio-video generation. You can train LoRA adapters that affect both video and audio output:
- Synchronized audio-video generation - Audio matches the visual content
- Same efficient LoRA approach - Just enable audio training
- Requires audio latents - Dataset must include preprocessed audio
Configure audio-video training with:

```yaml
model:
  training_mode: "lora"
training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
  with_audio: true                    # Enable audio training
  audio_latents_dir: "audio_latents"  # Directory containing audio latents
```
Dataset structure for audio-video training:

```
preprocessed_data_root/
├── latents/        # Video latents
├── conditions/     # Text embeddings
└── audio_latents/  # Audio latents (required when with_audio: true)
```
When training audio-video LoRAs, ensure your `target_modules` configuration captures video, audio, and cross-modal attention branches. Use patterns like `"to_k"` instead of `"attn1.to_k"` to match:

- Video modules: `attn1.to_k`, `attn2.to_k`
- Audio modules: `audio_attn1.to_k`, `audio_attn2.to_k`
- Cross-modal modules: `audio_to_video_attn.to_k`, `video_to_audio_attn.to_k`

The cross-modal attention modules (`audio_to_video_attn` and `video_to_audio_attn`) enable bidirectional information flow between audio and video, which is critical for synchronized audiovisual generation. See Understanding Target Modules for detailed guidance.
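To see why the short pattern is needed, consider matching patterns against dot-separated suffixes of module names. The module names below are hypothetical, and the trainer's actual matching logic may differ; this sketch only illustrates why `"to_k"` reaches every branch while `"attn1.to_k"` does not:

```python
# Hypothetical module names for illustration; the real model's names and
# the trainer's matching rules may differ.
modules = [
    "blocks.0.attn1.to_k",
    "blocks.0.attn2.to_k",
    "blocks.0.audio_attn1.to_k",
    "blocks.0.audio_attn2.to_k",
    "blocks.0.audio_to_video_attn.to_k",
    "blocks.0.video_to_audio_attn.to_k",
]

def matches(pattern, names):
    """Return names whose trailing dot-separated components equal the pattern."""
    parts = pattern.split(".")
    return [n for n in names if n.split(".")[-len(parts):] == parts]

broad = matches("to_k", modules)         # hits all six branches
narrow = matches("attn1.to_k", modules)  # hits only the video attn1 branch
```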
You can generate audio during validation even if you're not training the audio branch. Set `validation.generate_audio: true` independently of `training_strategy.with_audio`.
## 🔥 Full Model Fine-tuning
Full model fine-tuning updates all parameters of the base model, providing maximum flexibility but requiring substantial computational resources and larger training datasets:
- Offers the highest potential quality and capability improvements
- Requires multiple GPUs and distributed training techniques (e.g., FSDP)
- Produces large checkpoint files (several GB)
- Best for major model adaptations or when LoRA limitations are reached
Configure full fine-tuning with:

```yaml
model:
  training_mode: "full"
training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
```
Full fine-tuning of LTX-2 requires multiple high-end GPUs (e.g., 4-8Γ H100 80GB) and distributed training with FSDP. See Training Guide for multi-GPU setup instructions.
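A back-of-envelope estimate shows why multiple GPUs are unavoidable. Assuming mixed-precision training with Adam (bf16 weights and gradients at 2 bytes each, plus fp32 master weights and two Adam moments at 4 bytes each), optimizer state alone costs roughly 16 bytes per parameter. The 13B parameter count below is a hypothetical figure for illustration, not LTX-2's actual size:

```python
# Rough per-parameter memory for full fine-tuning with Adam in mixed
# precision (illustrative accounting; activations, buffers, and the exact
# sharding strategy add more on top):
#   bf16 weights (2 B) + bf16 grads (2 B)
#   + fp32 master weights, Adam m and v (3 x 4 B) = 16 B per parameter
def training_memory_gb(num_params, bytes_per_param=16):
    return num_params * bytes_per_param / 1024**3

total_gb = training_memory_gb(13e9)  # hypothetical 13B-parameter model
per_gpu_gb = total_gb / 4            # FSDP shards states across 4 GPUs
```

Even before activations, ~194 GB of training state for a 13B model only fits on a single 80 GB GPU once FSDP shards it across several devices.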
## 🔄 In-Context LoRA (IC-LoRA) Training
IC-LoRA is a specialized training mode for video-to-video transformations. Unlike standard training modes that learn from individual videos, IC-LoRA learns transformations from pairs of videos. IC-LoRA enables a wide range of advanced video-to-video applications, such as:
- Control adapters (e.g., Depth, Pose): Learn to map from a control signal (like a depth map or pose skeleton) to a target video
- Video deblurring: Transform blurry input videos into sharp, high-quality outputs
- Style transfer: Apply the style of a reference video to a target video sequence
- Colorization: Convert grayscale reference videos into colorized outputs
- Restoration and enhancement: Denoise, upscale, or restore old or degraded videos
By providing paired reference and target videos, IC-LoRA can learn complex transformations that go beyond caption-based conditioning.
IC-LoRA training fundamentally differs from standard LoRA and full fine-tuning:
- Reference videos provide clean, unnoised conditioning input showing the "before" state
- Target videos are noised during training and represent the desired "after" state
- The model learns transformations from reference videos to target videos
- Loss is applied only to the target portion, not the reference
- Training and inference time increase significantly due to the doubled sequence length
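The batch layout described above can be sketched as follows. This is an illustrative pure-Python sketch (latents as flat lists, a hypothetical `build_ic_lora_batch` helper), not the trainer's code:

```python
import random

# Sketch of the IC-LoRA batch layout: reference latents stay clean, target
# latents are noised, and the loss mask covers only the target half of the
# doubled sequence. Illustrative only.
def build_ic_lora_batch(reference, target, noise_scale=1.0):
    noised_target = [t + noise_scale * random.gauss(0, 1) for t in target]
    sequence = reference + noised_target              # doubled sequence length
    loss_mask = [0] * len(reference) + [1] * len(target)  # loss on target only
    return sequence, loss_mask

ref = [0.1, 0.2, 0.3]  # clean "before" latents (e.g., depth map)
tgt = [0.4, 0.5, 0.6]  # desired "after" latents
seq, mask = build_ic_lora_batch(ref, tgt)
# Loss is computed only where mask == 1 (the target portion).
```

The doubled sequence is what makes IC-LoRA training and inference noticeably slower than standard LoRA at the same resolution.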
To enable IC-LoRA training, configure your YAML file with:

```yaml
model:
  training_mode: "lora"  # Required: IC-LoRA uses LoRA mode
training_strategy:
  name: "video_to_video"
  first_frame_conditioning_p: 0.1
  reference_latents_dir: "reference_latents"  # Directory for reference video latents
```
Example configuration file:

- 🔄 IC-LoRA Training - Video-to-video transformation training
### Dataset Requirements for IC-LoRA
- Your dataset must contain paired videos where each target video has a corresponding reference video
- Reference and target videos must have identical resolution and length
- Both reference and target videos should be preprocessed together using the same resolution buckets
Dataset structure for IC-LoRA training:

```
preprocessed_data_root/
├── latents/            # Target video latents (what the model learns to generate)
├── conditions/         # Text embeddings for each video
└── reference_latents/  # Reference video latents (conditioning input)
```
### Generating Reference Videos
We provide an example script that generates reference videos (e.g., Canny edge maps) for a given dataset. The script takes a JSON file as input (e.g., the output of `caption_videos.py`) and updates it with the paths of the generated reference videos.
```bash
uv run python scripts/compute_reference.py scenes_output_dir/ \
  --output scenes_output_dir/dataset.json
```
To compute a different condition (depth maps, pose skeletons, etc.), modify the `compute_reference()` function in the script.
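As a concrete example of swapping in a different condition, a colorization IC-LoRA only needs grayscale references, which is a per-pixel luminance conversion. The sketch below operates on a toy nested-list RGB frame; the real `compute_reference()` would apply the same logic to actual decoded video frames:

```python
# Illustrative grayscale-reference computation for a colorization IC-LoRA.
# Operates on a toy frame (rows of (r, g, b) tuples); real code would run
# this over every decoded frame of each video.
def to_grayscale(frame):
    """Convert rows of (r, g, b) pixels to rows of ITU-R BT.601 luminance."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in frame
    ]

frame = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (255, 255, 255)]]
gray = to_grayscale(frame)  # [[76, 150], [29, 255]]
```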
### Configuration Requirements for IC-LoRA
- You must provide `reference_videos` in your validation configuration when using IC-LoRA training
- The number of reference videos must match the number of validation prompts
Example validation configuration for IC-LoRA:

```yaml
validation:
  prompts:
    - "First prompt describing the desired output"
    - "Second prompt describing the desired output"
  reference_videos:
    - "/path/to/reference1.mp4"
    - "/path/to/reference2.mp4"
  include_reference_in_output: true  # Show reference side-by-side with output
```
## 📊 Training Mode Comparison
| Aspect | LoRA | Audio-Video LoRA | Full Fine-tuning | IC-LoRA |
|---|---|---|---|---|
| Memory Usage | Low | Low-Medium | High | Medium |
| Training Speed | Fast | Fast | Slow | Medium |
| Output Size | 100MB-few GB (depends on rank) | 100MB-few GB (depends on rank) | Tens of GB | 100MB-few GB (depends on rank) |
| Flexibility | Medium | Medium | High | Specialized |
| Audio Support | Optional | Yes | Optional | No |
| Reference Videos | No | No | No | Yes (required) |
## 🎬 Using Trained Models for Inference
After training, use the `ltx-pipelines` package for production inference with your trained LoRAs:
| Training Mode | Recommended Pipeline |
|---|---|
| LoRA / Audio-Video LoRA | TI2VidOneStagePipeline or TI2VidTwoStagesPipeline |
| IC-LoRA | ICLoraPipeline |
All pipelines support loading custom LoRAs via the `loras` parameter. See the `ltx-pipelines` package documentation for detailed usage instructions.
## 🚀 Next Steps
Once you've chosen your training mode:
- Set up your dataset using Dataset Preparation
- Configure your training parameters in Configuration Reference
- Start training with the Training Guide