# Training Modes Guide
The trainer supports several training modes, each suited for different use cases and requirements.
## 🎯 Standard LoRA Training (Video-Only)
Standard LoRA (Low-Rank Adaptation) training fine-tunes the model by adding small, trainable adapter layers while
keeping the base model frozen. This approach:
- **Requires significantly less memory and compute** than full fine-tuning
- **Produces small, portable weight files** (typically a few hundred MB)
- **Is ideal for learning specific styles, effects, or concepts**
- **Can be easily combined with other LoRAs** during inference
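The idea behind these adapters can be sketched in a few lines of NumPy. This is purely illustrative (toy dimensions, not LTX-2's real shapes) and not the trainer's implementation:

```python
import numpy as np

# Frozen base weight of a hypothetical attention projection.
d_out, d_in, rank = 64, 64, 8
W = np.random.randn(d_out, d_in)

# LoRA adds two small trainable factors A (rank x d_in) and B (d_out x rank);
# only A and B receive gradients, W stays frozen.
A = np.random.randn(rank, d_in) * 0.01
B = np.zeros((d_out, rank))  # B starts at zero, so training begins at the base model

def forward(x, scale=1.0):
    return W @ x + scale * (B @ (A @ x))

# The adapter stores far fewer parameters than the base weight:
base_params = W.size           # 4096
lora_params = A.size + B.size  # 1024
print(lora_params / base_params)  # 0.25 at rank 8 (much smaller at real model widths)
```

At realistic hidden sizes (thousands, not 64) the ratio shrinks dramatically, which is why LoRA weight files stay small.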
Configure standard LoRA training with:
```yaml
model:
  training_mode: "lora"
training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
  with_audio: false  # Video-only training
```
## 🔊 Audio-Video LoRA Training
LTX-2 supports joint audio-video generation. You can train LoRA adapters that affect both video and audio output:
- **Synchronized audio-video generation** - Audio matches the visual content
- **Same efficient LoRA approach** - Just enable audio training
- **Requires audio latents** - Dataset must include preprocessed audio
Configure audio-video training with:
```yaml
model:
  training_mode: "lora"
training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
  with_audio: true                    # Enable audio training
  audio_latents_dir: "audio_latents"  # Directory containing audio latents
```
**Example configuration file:**
- 📄 [Audio-Video LoRA Training](../configs/ltx2_av_lora.yaml)
**Dataset structure for audio-video training:**
```
preprocessed_data_root/
├── latents/          # Video latents
├── conditions/       # Text embeddings
└── audio_latents/    # Audio latents (required when with_audio: true)
```
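A quick sanity check of this layout can save a failed run. The helper below is a hypothetical utility, not part of the trainer, and assumes the directory names shown above:

```python
from pathlib import Path

REQUIRED = ["latents", "conditions"]

def check_dataset_root(root: str, with_audio: bool) -> list[str]:
    """Return the required subdirectories missing from a preprocessed dataset.

    Directory names follow the layout in the docs; audio_latents is only
    required when training with with_audio: true.
    """
    needed = REQUIRED + (["audio_latents"] if with_audio else [])
    return [d for d in needed if not (Path(root) / d).is_dir()]
```

Run it against `preprocessed_data_root` before launching training; an empty list means the expected folders are present.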
> [!IMPORTANT]
> When training audio-video LoRAs, ensure your `target_modules` configuration captures video, audio, and
> cross-modal attention branches. Use patterns like `"to_k"` instead of `"attn1.to_k"` to match:
> - Video modules: `attn1.to_k`, `attn2.to_k`
> - Audio modules: `audio_attn1.to_k`, `audio_attn2.to_k`
> - Cross-modal modules: `audio_to_video_attn.to_k`, `video_to_audio_attn.to_k`
>
> The cross-modal attention modules (`audio_to_video_attn` and `video_to_audio_attn`) enable bidirectional
> information flow between audio and video, which is critical for synchronized audiovisual generation.
> See [Understanding Target Modules](configuration-reference.md#understanding-target-modules) for detailed guidance.
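Many LoRA toolkits resolve `target_modules` by suffix matching against full module paths. Assuming this trainer behaves similarly, a toy example shows why the broad pattern is needed (the module paths below are illustrative):

```python
module_names = [
    "blocks.0.attn1.to_k",
    "blocks.0.attn2.to_k",
    "blocks.0.audio_attn1.to_k",
    "blocks.0.audio_to_video_attn.to_k",
    "blocks.0.video_to_audio_attn.to_k",
]

def matches(pattern: str, name: str) -> bool:
    # Suffix-style matching as used by common LoRA tooling (an assumption
    # about this trainer; verify against the configuration reference).
    return name == pattern or name.endswith("." + pattern)

narrow = [n for n in module_names if matches("attn1.to_k", n)]
broad = [n for n in module_names if matches("to_k", n)]
print(len(narrow))  # 1 -- only the video branch is adapted
print(len(broad))   # 5 -- video, audio, and cross-modal branches
```

Note that `audio_attn1.to_k` does not end with `.attn1.to_k`, so the narrow pattern silently skips the audio branch.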

> [!NOTE]
> You can generate audio during validation even if you're not training the audio branch.
> Set `validation.generate_audio: true` independently of `training_strategy.with_audio`.
## 🔥 Full Model Fine-tuning
Full model fine-tuning updates all parameters of the base model, providing maximum flexibility but
requiring substantial computational resources and larger training datasets:
- **Offers the highest potential quality and capability improvements**
- **Requires multiple GPUs** and distributed training techniques (e.g., FSDP)
- **Produces large checkpoint files** (several GB)
- **Best for major model adaptations** or when LoRA limitations are reached
Configure full fine-tuning with:
```yaml
model:
  training_mode: "full"
training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
```
> [!IMPORTANT]
> Full fine-tuning of LTX-2 requires multiple high-end GPUs (e.g., 4-8× H100 80GB) and distributed
> training with FSDP. See [Training Guide](training-guide.md) for multi-GPU setup instructions.
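A rough estimate shows why a single GPU is not enough: with Adam, every parameter needs bytes for the weight, its gradient, and two optimizer moments. The parameter count below is a placeholder, not LTX-2's actual size:

```python
# Back-of-the-envelope memory for full fine-tuning with Adam.
params = 13e9  # hypothetical parameter count, for illustration only

bytes_per_param = {
    "weights (bf16)": 2,
    "gradients (bf16)": 2,
    "Adam m state (fp32)": 4,
    "Adam v state (fp32)": 4,
}
total_gb = params * sum(bytes_per_param.values()) / 1024**3
print(round(total_gb))  # ~145 GB before activations -- hence FSDP sharding across GPUs
```

Activations and temporary buffers come on top of this, which is why FSDP shards the states across several 80 GB devices.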
## 🎞 In-Context LoRA (IC-LoRA) Training
IC-LoRA is a specialized training mode for video-to-video transformations.
Unlike standard training modes that learn from individual videos, IC-LoRA learns transformations from pairs of videos.
IC-LoRA enables a wide range of advanced video-to-video applications, such as:
- **Control adapters** (e.g., Depth, Pose): Learn to map from a control signal (like a depth map or pose skeleton) to a
target video
- **Video deblurring**: Transform blurry input videos into sharp, high-quality outputs
- **Style transfer**: Apply the style of a reference video to a target video sequence
- **Colorization**: Convert grayscale reference videos into colorized outputs
- **Restoration and enhancement**: Denoise, upscale, or restore old or degraded videos
By providing paired reference and target videos, IC-LoRA can learn complex transformations that go beyond caption-based conditioning.
IC-LoRA training fundamentally differs from standard LoRA and full fine-tuning:
- **Reference videos** provide clean, unnoised conditioning input showing the "before" state
- **Target videos** are noised during training and represent the desired "after" state
- **The model learns transformations** from reference videos to target videos
- **Loss is applied only to the target portion**, not the reference
- **Training and inference time increase significantly** due to the doubled sequence length
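The target-only loss can be pictured as a mask over the concatenated sequence. A simplified pure-Python sketch, not the trainer's actual code:

```python
def masked_mse(pred, target, ref_len):
    """MSE over the target portion only; the first ref_len positions hold the
    clean reference tokens and contribute no loss."""
    errs = [(p - t) ** 2 for p, t in zip(pred[ref_len:], target[ref_len:])]
    return sum(errs) / len(errs)

ref_len = 3
# Concatenated sequences: [reference tokens | target tokens]
pred   = [9.0, 9.0, 9.0, 1.0, 2.0, 3.0]
target = [0.0, 0.0, 0.0, 1.0, 2.0, 5.0]
print(masked_mse(pred, target, ref_len))  # 4/3 -- the reference mismatch is ignored
```

The doubled sequence length mentioned above comes from this concatenation: the model attends over reference plus target tokens at every step.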
To enable IC-LoRA training, configure your YAML file with:
```yaml
model:
  training_mode: "lora"  # Required: IC-LoRA uses LoRA mode
training_strategy:
  name: "video_to_video"
  first_frame_conditioning_p: 0.1
  reference_latents_dir: "reference_latents"  # Directory for reference video latents
```
**Example configuration file:**
- 📄 [IC-LoRA Training](../configs/ltx2_v2v_ic_lora.yaml) - Video-to-video transformation training
### Dataset Requirements for IC-LoRA
- Your dataset must contain **paired videos** where each target video has a corresponding reference video
- Reference and target videos must have **identical resolution and length**
- Both reference and target videos should be **preprocessed together** using the same resolution buckets
**Dataset structure for IC-LoRA training:**
```
preprocessed_data_root/
├── latents/            # Target video latents (what the model learns to generate)
├── conditions/         # Text embeddings for each video
└── reference_latents/  # Reference video latents (conditioning input)
```
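Since every target video needs a reference, a pairing check before training is worthwhile. This hypothetical helper assumes paired files share file stems across the two directories; adjust it to whatever naming your preprocessing actually emits:

```python
from pathlib import Path

def check_reference_pairs(root: str) -> list[str]:
    """Return target latent stems that have no matching reference latent.

    Assumes latents/ and reference_latents/ use matching file stems, which
    holds when both are preprocessed together as described above.
    """
    root = Path(root)
    targets = {p.stem for p in (root / "latents").glob("*") if p.is_file()}
    refs = {p.stem for p in (root / "reference_latents").glob("*") if p.is_file()}
    return sorted(targets - refs)
```

An empty result means every target has a reference; anything else names the clips to fix before training.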
### Generating Reference Videos
We provide an example script to generate reference videos (e.g., Canny edge maps) for a given dataset.
The script takes a JSON file as input (e.g., output of `caption_videos.py`) and updates it with the generated reference
video paths.
```bash
uv run python scripts/compute_reference.py scenes_output_dir/ \
--output scenes_output_dir/dataset.json
```
To compute a different condition (depth maps, pose skeletons, etc.), modify the `compute_reference()` function in the
script.
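As a stand-in for a real detector such as OpenCV's Canny, a toy gradient-magnitude transform illustrates the kind of per-frame function `compute_reference()` applies. This sketch is illustrative only, not the script's implementation:

```python
def edge_map(frame):
    """Toy gradient-magnitude 'edge' transform on a 2D grayscale frame.

    A pure-Python stand-in for a real detector like cv2.Canny; borders
    are left at zero for simplicity.
    """
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = frame[y][x + 1] - frame[y][x - 1]  # horizontal gradient
            gy = frame[y + 1][x] - frame[y - 1][x]  # vertical gradient
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

# A vertical step edge produces responses along the boundary columns:
frame = [[0, 0, 1, 1]] * 4
edges = edge_map(frame)
```

A depth or pose condition would swap this per-frame function for the corresponding estimator while keeping the same input/output contract.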
### Configuration Requirements for IC-LoRA
- You **must** provide `reference_videos` in your validation configuration when using IC-LoRA training
- The number of reference videos must match the number of validation prompts
Example validation configuration for IC-LoRA:
```yaml
validation:
  prompts:
    - "First prompt describing the desired output"
    - "Second prompt describing the desired output"
  reference_videos:
    - "/path/to/reference1.mp4"
    - "/path/to/reference2.mp4"
  include_reference_in_output: true  # Show reference side-by-side with output
```
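These two requirements are easy to enforce before a run. A hypothetical pre-flight check (key names follow the YAML layout above):

```python
def validate_ic_lora_validation(cfg: dict) -> None:
    """Raise if an IC-LoRA validation config is inconsistent.

    Mirrors the two requirements above: reference_videos must be present,
    and its length must match the number of prompts.
    """
    prompts = cfg.get("prompts", [])
    refs = cfg.get("reference_videos")
    if refs is None:
        raise ValueError("IC-LoRA training requires validation.reference_videos")
    if len(refs) != len(prompts):
        raise ValueError(
            f"reference_videos ({len(refs)}) must match prompts ({len(prompts)})"
        )
```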
## 📊 Training Mode Comparison
| Aspect | LoRA | Audio-Video LoRA | Full Fine-tuning | IC-LoRA |
|----------------------|------------|------------------|------------------|----------------|
| **Memory Usage** | Low | Low-Medium | High | Medium |
| **Training Speed** | Fast | Fast | Slow | Medium |
| **Output Size** | 100MB-few GB (depends on rank) | 100MB-few GB (depends on rank) | Tens of GB | 100MB-few GB (depends on rank) |
| **Flexibility** | Medium | Medium | High | Specialized |
| **Audio Support** | Optional | Yes | Optional | No |
| **Reference Videos** | No | No | No | Yes (required) |
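The "depends on rank" entries follow from LoRA's parameter count: each adapted weight of shape `d_out × d_in` stores factors of `rank × d_in` and `d_out × rank`. The module count and dimensions below are placeholders for illustration, not LTX-2's real shapes:

```python
def lora_size_mb(n_modules, d_in, d_out, rank, bytes_per_value=2):
    """Estimated LoRA file size in MB, assuming bf16 storage (2 bytes/value)."""
    params_per_module = rank * (d_in + d_out)
    return n_modules * params_per_module * bytes_per_value / 1024**2

# Hypothetical: 400 adapted square projections of width 4096.
small = lora_size_mb(n_modules=400, d_in=4096, d_out=4096, rank=16)
large = lora_size_mb(n_modules=400, d_in=4096, d_out=4096, rank=128)
print(round(small), round(large))  # 100 800 -- size scales linearly with rank
```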
## 🎬 Using Trained Models for Inference
After training, use the [`ltx-pipelines`](../../ltx-pipelines/) package for production inference with your trained LoRAs:
| Training Mode | Recommended Pipeline |
|---------------|---------------------|
| LoRA / Audio-Video LoRA | `TI2VidOneStagePipeline` or `TI2VidTwoStagesPipeline` |
| IC-LoRA | `ICLoraPipeline` |
All pipelines support loading custom LoRAs via the `loras` parameter. See the [`ltx-pipelines`](../../ltx-pipelines/) package
documentation for detailed usage instructions.
## 🚀 Next Steps
Once you've chosen your training mode:
- Set up your dataset using [Dataset Preparation](dataset-preparation.md)
- Configure your training parameters in [Configuration Reference](configuration-reference.md)
- Start training with the [Training Guide](training-guide.md)