# Training Modes Guide

The trainer supports several training modes, each suited for different use cases and requirements.

## 🎯 Standard LoRA Training (Video-Only)

Standard LoRA (Low-Rank Adaptation) training fine-tunes the model by adding small, trainable adapter layers while
keeping the base model frozen. This approach:

- **Requires significantly less memory and compute** than full fine-tuning
- **Produces small, portable weight files** (typically a few hundred MB)
- **Is ideal for learning specific styles, effects, or concepts**
- **Can be easily combined with other LoRAs** during inference

Configure standard LoRA training with:

```yaml
model:
  training_mode: "lora"

training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
  with_audio: false  # Video-only training
```

## πŸ”Š Audio-Video LoRA Training

LTX-2 supports joint audio-video generation. You can train LoRA adapters that affect both video and audio output:

- **Synchronized audio-video generation** - Audio matches the visual content
- **Same efficient LoRA approach** - Just enable audio training
- **Requires audio latents** - Dataset must include preprocessed audio

Configure audio-video training with:

```yaml
model:
  training_mode: "lora"

training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
  with_audio: true  # Enable audio training
  audio_latents_dir: "audio_latents"  # Directory containing audio latents
```

**Example configuration file:**

- πŸ“„ [Audio-Video LoRA Training](../configs/ltx2_av_lora.yaml)

**Dataset structure for audio-video training:**

```
preprocessed_data_root/
β”œβ”€β”€ latents/           # Video latents
β”œβ”€β”€ conditions/        # Text embeddings
└── audio_latents/     # Audio latents (required when with_audio: true)
```

> [!IMPORTANT]
> When training audio-video LoRAs, ensure your `target_modules` configuration captures video, audio, and
> cross-modal attention branches. Use patterns like `"to_k"` instead of `"attn1.to_k"` to match:
> - Video modules: `attn1.to_k`, `attn2.to_k`
> - Audio modules: `audio_attn1.to_k`, `audio_attn2.to_k`
> - Cross-modal modules: `audio_to_video_attn.to_k`, `video_to_audio_attn.to_k`
>
> The cross-modal attention modules (`audio_to_video_attn` and `video_to_audio_attn`) enable bidirectional
> information flow between audio and video, which is critical for synchronized audiovisual generation.
> See [Understanding Target Modules](configuration-reference.md#understanding-target-modules) for detailed guidance.
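
As an illustrative sketch of the pattern-matching idea above, a `target_modules` list using short suffix patterns might look like the following. The surrounding key layout (`lora.rank`, `lora.target_modules`) is an assumption here; check the configuration reference for the exact schema used by your trainer version:

```yaml
lora:
  rank: 64
  target_modules:
    - "to_k"    # matches attn1.to_k, audio_attn1.to_k, audio_to_video_attn.to_k, ...
    - "to_q"
    - "to_v"
    - "to_out"
```

Because each pattern is a suffix shared by the video, audio, and cross-modal attention branches, a single entry covers all three without listing every module path explicitly.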

> [!NOTE]
> You can generate audio during validation even if you're not training the audio branch.
> Set `validation.generate_audio: true` independently of `training_strategy.with_audio`.

## πŸ”₯ Full Model Fine-tuning

Full model fine-tuning updates all parameters of the base model, providing maximum flexibility but
requiring substantial computational resources and larger training datasets:

- **Offers the highest potential quality and capability improvements**
- **Requires multiple GPUs** and distributed training techniques (e.g., FSDP)
- **Produces large checkpoint files** (several GB)
- **Best for major model adaptations** or when LoRA limitations are reached

Configure full fine-tuning with:

```yaml
model:
  training_mode: "full"

training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
```

> [!IMPORTANT]
> Full fine-tuning of LTX-2 requires multiple high-end GPUs (e.g., 4-8Γ— H100 80GB) and distributed
> training with FSDP. See [Training Guide](training-guide.md) for multi-GPU setup instructions.

## πŸ”„ In-Context LoRA (IC-LoRA) Training

IC-LoRA is a specialized training mode for video-to-video transformations.
Unlike standard training modes that learn from individual videos, IC-LoRA learns transformations from pairs of videos.
IC-LoRA enables a wide range of advanced video-to-video applications, such as:

- **Control adapters** (e.g., Depth, Pose): Learn to map from a control signal (like a depth map or pose skeleton) to a
  target video
- **Video deblurring**: Transform blurry input videos into sharp, high-quality outputs
- **Style transfer**: Apply the style of a reference video to a target video sequence
- **Colorization**: Convert grayscale reference videos into colorized outputs
- **Restoration and enhancement**: Denoise, upscale, or restore old or degraded videos

By providing paired reference and target videos, IC-LoRA can learn complex transformations that go beyond caption-based conditioning.

IC-LoRA training fundamentally differs from standard LoRA and full fine-tuning:

- **Reference videos** provide clean, unnoised conditioning input showing the "before" state
- **Target videos** are noised during training and represent the desired "after" state
- **The model learns transformations** from reference videos to target videos
- **Loss is applied only to the target portion**, not the reference
- **Training and inference time increase significantly** due to the doubled sequence length
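
The sequence construction and masked loss described above can be sketched as follows. This is a simplified illustration with made-up shapes and a toy linear noising step, not the trainer's actual implementation:

```python
import numpy as np

def build_ic_lora_sequence(ref_tokens, tgt_tokens, noise, t):
    """Concatenate clean reference tokens with noised target tokens.

    Only the target half is noised; the reference half stays clean so it
    can serve as conditioning. Shapes: (num_tokens, channels).
    """
    noised_tgt = (1.0 - t) * tgt_tokens + t * noise  # toy linear noising
    sequence = np.concatenate([ref_tokens, noised_tgt], axis=0)
    # Loss mask: 0 over the reference portion, 1 over the target portion.
    loss_mask = np.concatenate(
        [np.zeros(ref_tokens.shape[0]), np.ones(tgt_tokens.shape[0])]
    )
    return sequence, loss_mask

def masked_mse(pred, target, loss_mask):
    """MSE restricted to positions where loss_mask == 1 (target tokens)."""
    sq_err = ((pred - target) ** 2) * loss_mask[:, None]
    return float(sq_err.sum() / (loss_mask.sum() * pred.shape[1]))
```

Note that the concatenation is what doubles the sequence length: the transformer attends over both halves, but errors on the reference half contribute nothing to the loss.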

To enable IC-LoRA training, configure your YAML file with:

```yaml
model:
  training_mode: "lora"  # Required: IC-LoRA uses LoRA mode

training_strategy:
  name: "video_to_video"
  first_frame_conditioning_p: 0.1
  reference_latents_dir: "reference_latents"  # Directory for reference video latents
```

**Example configuration file:**

- πŸ“„ [IC-LoRA Training](../configs/ltx2_v2v_ic_lora.yaml) - Video-to-video transformation training

### Dataset Requirements for IC-LoRA

- Your dataset must contain **paired videos** where each target video has a corresponding reference video
- Reference and target videos must have **identical resolution and length**
- Both reference and target videos should be **preprocessed together** using the same resolution buckets

**Dataset structure for IC-LoRA training:**

```
preprocessed_data_root/
β”œβ”€β”€ latents/            # Target video latents (what the model learns to generate)
β”œβ”€β”€ conditions/         # Text embeddings for each video
└── reference_latents/  # Reference video latents (conditioning input)
```

### Generating Reference Videos

We provide an example script to generate reference videos (e.g., Canny edge maps) for a given dataset.
The script takes a JSON file as input (e.g., output of `caption_videos.py`) and updates it with the generated reference
video paths.

```bash
uv run python scripts/compute_reference.py scenes_output_dir/ \
    --output scenes_output_dir/dataset.json
```

To compute a different condition (depth maps, pose skeletons, etc.), modify the `compute_reference()` function in the
script.
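
As an illustration of the kind of per-frame computation `compute_reference()` performs, here is a minimal gradient-magnitude edge detector. This is a stand-in for the script's actual edge computation; the real function operates on video frames and its signature may differ:

```python
import numpy as np

def compute_reference(frame: np.ndarray, threshold: float = 0.2) -> np.ndarray:
    """Toy edge-map reference for a single grayscale frame in [0, 1].

    A stand-in for an edge/depth/pose computation: returns a binary map
    marking pixels whose gradient magnitude exceeds `threshold`.
    """
    gy, gx = np.gradient(frame)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    return (magnitude > threshold).astype(np.float32)
```

Swapping in a depth estimator or pose detector amounts to replacing this body while keeping the same frame-in, frame-out contract.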

### Configuration Requirements for IC-LoRA

- You **must** provide `reference_videos` in your validation configuration when using IC-LoRA training
- The number of reference videos must match the number of validation prompts

Example validation configuration for IC-LoRA:

```yaml
validation:
  prompts:
    - "First prompt describing the desired output"
    - "Second prompt describing the desired output"
  reference_videos:
    - "/path/to/reference1.mp4"
    - "/path/to/reference2.mp4"
  include_reference_in_output: true  # Show reference side-by-side with output
```

## πŸ“Š Training Mode Comparison

| Aspect               | LoRA       | Audio-Video LoRA | Full Fine-tuning | IC-LoRA        |
|----------------------|------------|------------------|------------------|----------------|
| **Memory Usage**     | Low        | Low-Medium       | High             | Medium         |
| **Training Speed**   | Fast       | Fast             | Slow             | Medium         |
| **Output Size**      | ~100 MB to a few GB (rank-dependent) | ~100 MB to a few GB (rank-dependent) | Tens of GB | ~100 MB to a few GB (rank-dependent) |
| **Flexibility**      | Medium     | Medium           | High             | Specialized    |
| **Audio Support**    | Optional   | Yes              | Optional         | No             |
| **Reference Videos** | No         | No               | No               | Yes (required) |

## 🎬 Using Trained Models for Inference

After training, use the [`ltx-pipelines`](../../ltx-pipelines/) package for production inference with your trained LoRAs:

| Training Mode | Recommended Pipeline |
|---------------|---------------------|
| LoRA / Audio-Video LoRA | `TI2VidOneStagePipeline` or `TI2VidTwoStagesPipeline` |
| IC-LoRA | `ICLoraPipeline` |

All pipelines support loading custom LoRAs via the `loras` parameter. See the [`ltx-pipelines`](../../ltx-pipelines/) package
documentation for detailed usage instructions.

## πŸš€ Next Steps

Once you've chosen your training mode:

- Set up your dataset using [Dataset Preparation](dataset-preparation.md)
- Configure your training parameters in [Configuration Reference](configuration-reference.md)
- Start training with the [Training Guide](training-guide.md)