# Training Modes Guide
The trainer supports several training modes, each suited for different use cases and requirements.
## 🎯 Standard LoRA Training (Video-Only)
Standard LoRA (Low-Rank Adaptation) training fine-tunes the model by adding small, trainable adapter layers while
keeping the base model frozen. This approach:
- **Requires significantly less memory and compute** than full fine-tuning
- **Produces small, portable weight files** (typically a few hundred MB)
- **Is ideal for learning specific styles, effects, or concepts**
- **Can be easily combined with other LoRAs** during inference
Configure standard LoRA training with:
```yaml
model:
  training_mode: "lora"

training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
  with_audio: false  # Video-only training
```
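For intuition about why LoRA checkpoints stay small, the sketch below shows the low-rank adapter idea in isolation: a frozen base projection plus a small trainable bypass. This is a minimal, illustrative sketch (assuming a PyTorch-style `nn.Linear`), not the trainer's actual implementation; the trainer wires adapters into the transformer for you.
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA adapter: y = W x + (alpha / rank) * B(A(x)).

    The base weight W stays frozen; only the low-rank factors A and B
    receive gradients, which is why the saved weight files are small.
    """

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # keep the base model frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)  # A: project down to `rank`
        self.up = nn.Linear(rank, base.out_features, bias=False)   # B: project back up
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```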
## πŸ”Š Audio-Video LoRA Training
LTX-2 supports joint audio-video generation. You can train LoRA adapters that affect both video and audio output:
- **Synchronized audio-video generation** - Audio matches the visual content
- **Same efficient LoRA approach** - Just enable audio training
- **Requires audio latents** - Dataset must include preprocessed audio
Configure audio-video training with:
```yaml
model:
  training_mode: "lora"

training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
  with_audio: true  # Enable audio training
  audio_latents_dir: "audio_latents"  # Directory containing audio latents
```
**Example configuration file:**
- πŸ“„ [Audio-Video LoRA Training](../configs/ltx2_av_lora.yaml)
**Dataset structure for audio-video training:**
```
preprocessed_data_root/
β”œβ”€β”€ latents/ # Video latents
β”œβ”€β”€ conditions/ # Text embeddings
└── audio_latents/ # Audio latents (required when with_audio: true)
```
> [!IMPORTANT]
> When training audio-video LoRAs, ensure your `target_modules` configuration captures video, audio, and
> cross-modal attention branches. Use patterns like `"to_k"` instead of `"attn1.to_k"` to match:
> - Video modules: `attn1.to_k`, `attn2.to_k`
> - Audio modules: `audio_attn1.to_k`, `audio_attn2.to_k`
> - Cross-modal modules: `audio_to_video_attn.to_k`, `video_to_audio_attn.to_k`
>
> The cross-modal attention modules (`audio_to_video_attn` and `video_to_audio_attn`) enable bidirectional
> information flow between audio and video, which is critical for synchronized audiovisual generation.
> See [Understanding Target Modules](configuration-reference.md#understanding-target-modules) for detailed guidance.
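To see why the broad patterns work, here is a quick sketch assuming suffix-style matching on dotted module names (the convention used by common LoRA libraries such as PEFT); the `blocks.0.` prefixes are made up for illustration:
```python
# Hypothetical module names following the branch naming above.
module_names = [
    "blocks.0.attn1.to_k",                # video self-attention
    "blocks.0.attn2.to_k",                # video cross-attention
    "blocks.0.audio_attn1.to_k",          # audio self-attention
    "blocks.0.audio_to_video_attn.to_k",  # cross-modal: audio -> video
    "blocks.0.video_to_audio_attn.to_k",  # cross-modal: video -> audio
]

def matches(name: str, pattern: str) -> bool:
    # Suffix-style matching on dotted-name boundaries (assumed semantics).
    return name == pattern or name.endswith("." + pattern)

print([n for n in module_names if matches(n, "to_k")])        # matches all five branches
print([n for n in module_names if matches(n, "attn1.to_k")])  # matches only the video branch
```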
> [!NOTE]
> You can generate audio during validation even if you're not training the audio branch.
> Set `validation.generate_audio: true` independently of `training_strategy.with_audio`.
## πŸ”₯ Full Model Fine-tuning
Full model fine-tuning updates all parameters of the base model, providing maximum flexibility but
requiring substantial computational resources and larger training datasets:
- **Offers the highest potential quality and capability improvements**
- **Requires multiple GPUs** and distributed training techniques (e.g., FSDP)
- **Produces large checkpoint files** (several GB)
- **Best for major model adaptations** or when LoRA limitations are reached
Configure full fine-tuning with:
```yaml
model:
  training_mode: "full"

training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
```
> [!IMPORTANT]
> Full fine-tuning of LTX-2 requires multiple high-end GPUs (e.g., 4-8Γ— H100 80GB) and distributed
> training with FSDP. See [Training Guide](training-guide.md) for multi-GPU setup instructions.
## πŸ”„ In-Context LoRA (IC-LoRA) Training
IC-LoRA is a specialized training mode for video-to-video transformations.
Unlike standard training modes that learn from individual videos, IC-LoRA learns transformations from pairs of videos.
IC-LoRA enables a wide range of advanced video-to-video applications, such as:
- **Control adapters** (e.g., Depth, Pose): Learn to map from a control signal (like a depth map or pose skeleton) to a
target video
- **Video deblurring**: Transform blurry input videos into sharp, high-quality outputs
- **Style transfer**: Apply the style of a reference video to a target video sequence
- **Colorization**: Convert grayscale reference videos into colorized outputs
- **Restoration and enhancement**: Denoise, upscale, or restore old or degraded videos
By providing paired reference and target videos, IC-LoRA can learn complex transformations that go beyond caption-based conditioning.
IC-LoRA training fundamentally differs from standard LoRA and full fine-tuning:
- **Reference videos** provide clean, unnoised conditioning input showing the "before" state
- **Target videos** are noised during training and represent the desired "after" state
- **The model learns transformations** from reference videos to target videos
- **Loss is applied only to the target portion**, not the reference
- **Training and inference time increase significantly** due to the doubled sequence length (see the sketch below)
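Conceptually, the clean reference latents and the noised target latents are concatenated along the token axis, and the loss is masked to the target half. The following is a minimal sketch of that layout for intuition only (shapes are made up; this is not the trainer's actual code):
```python
import torch

# Made-up shapes: (batch, tokens, channels)
ref_tokens = torch.randn(1, 1024, 128)  # clean reference latents (never noised)
tgt_tokens = torch.randn(1, 1024, 128)  # noised target latents

# Doubled sequence length: 2048 tokens instead of 1024.
tokens = torch.cat([ref_tokens, tgt_tokens], dim=1)

# Loss is computed on the target half only.
loss_mask = torch.cat([torch.zeros(1, 1024), torch.ones(1, 1024)], dim=1)

pred = torch.randn_like(tokens)    # stand-in for the model's prediction
target = torch.randn_like(tokens)  # stand-in for the denoising target

loss = ((pred - target) ** 2 * loss_mask.unsqueeze(-1)).sum() / (loss_mask.sum() * tokens.shape[-1])
```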
To enable IC-LoRA training, configure your YAML file with:
```yaml
model:
  training_mode: "lora"  # Required: IC-LoRA uses LoRA mode

training_strategy:
  name: "video_to_video"
  first_frame_conditioning_p: 0.1
  reference_latents_dir: "reference_latents"  # Directory for reference video latents
```
**Example configuration file:**
- πŸ“„ [IC-LoRA Training](../configs/ltx2_v2v_ic_lora.yaml) - Video-to-video transformation training
### Dataset Requirements for IC-LoRA
- Your dataset must contain **paired videos** where each target video has a corresponding reference video
- Reference and target videos must have **identical resolution and length**
- Both reference and target videos should be **preprocessed together** using the same resolution buckets
**Dataset structure for IC-LoRA training:**
```
preprocessed_data_root/
β”œβ”€β”€ latents/ # Target video latents (what the model learns to generate)
β”œβ”€β”€ conditions/ # Text embeddings for each video
└── reference_latents/ # Reference video latents (conditioning input)
```
### Generating Reference Videos
We provide an example script to generate reference videos (e.g., Canny edge maps) for a given dataset.
The script takes a JSON file as input (e.g., output of `caption_videos.py`) and updates it with the generated reference
video paths.
```bash
uv run python scripts/compute_reference.py scenes_output_dir/ \
    --output scenes_output_dir/dataset.json
```
To compute a different condition (depth maps, pose skeletons, etc.), modify the `compute_reference()` function in the
script.
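For instance, a grayscale reference for a colorization task might look like the sketch below. The per-frame signature here is an assumption for illustration; check the script for the actual interface of `compute_reference()`.
```python
import cv2
import numpy as np

def compute_reference(frame: np.ndarray) -> np.ndarray:
    """Per-frame reference for a colorization task (illustrative interface:
    BGR uint8 frame in, same-shape reference frame out).

    Swap the body for cv2.Canny edges, a depth estimator, etc., as needed.
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
```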
### Configuration Requirements for IC-LoRA
- You **must** provide `reference_videos` in your validation configuration when using IC-LoRA training
- The number of reference videos must match the number of validation prompts
Example validation configuration for IC-LoRA:
```yaml
validation:
  prompts:
    - "First prompt describing the desired output"
    - "Second prompt describing the desired output"
  reference_videos:
    - "/path/to/reference1.mp4"
    - "/path/to/reference2.mp4"
  include_reference_in_output: true  # Show reference side-by-side with output
```
## πŸ“Š Training Mode Comparison
| Aspect | LoRA | Audio-Video LoRA | Full Fine-tuning | IC-LoRA |
|----------------------|------------|------------------|------------------|----------------|
| **Memory Usage** | Low | Low-Medium | High | Medium |
| **Training Speed** | Fast | Fast | Slow | Medium |
| **Output Size** | 100 MB to a few GB (depends on rank) | 100 MB to a few GB (depends on rank) | Tens of GB | 100 MB to a few GB (depends on rank) |
| **Flexibility** | Medium | Medium | High | Specialized |
| **Audio Support** | Optional | Yes | Optional | No |
| **Reference Videos** | No | No | No | Yes (required) |
## 🎬 Using Trained Models for Inference
After training, use the [`ltx-pipelines`](../../ltx-pipelines/) package for production inference with your trained LoRAs:
| Training Mode | Recommended Pipeline |
|---------------|---------------------|
| LoRA / Audio-Video LoRA | `TI2VidOneStagePipeline` or `TI2VidTwoStagesPipeline` |
| IC-LoRA | `ICLoraPipeline` |
All pipelines support loading custom LoRAs via the `loras` parameter. See the [`ltx-pipelines`](../../ltx-pipelines/) package
documentation for detailed usage instructions.
## πŸš€ Next Steps
Once you've chosen your training mode:
- Set up your dataset using [Dataset Preparation](dataset-preparation.md)
- Configure your training parameters in [Configuration Reference](configuration-reference.md)
- Start training with the [Training Guide](training-guide.md)