# Utility Scripts Reference

This guide covers the various utility scripts available for preprocessing, conversion, and debugging tasks.

## 🎬 Dataset Processing Scripts

### Video Scene Splitting

The `scripts/split_scenes.py` script automatically splits long videos into shorter, coherent scenes.

```bash
# Basic scene splitting
uv run python scripts/split_scenes.py input.mp4 output_dir/ --filter-shorter-than 5s
```

**Key features:**

- **Automatic scene detection**: Uses PySceneDetect for intelligent splitting
- **Multiple algorithms**: Content-based, adaptive, threshold, and histogram detection
- **Filtering options**: Remove scenes shorter than specified duration
- **Customizable parameters**: Thresholds, window sizes, and detection modes

**Common options:**

```bash
# See all available options
uv run python scripts/split_scenes.py --help

# Use adaptive detection with custom threshold
uv run python scripts/split_scenes.py video.mp4 scenes/ --detector adaptive --threshold 30.0

# Limit to maximum number of scenes
uv run python scripts/split_scenes.py video.mp4 scenes/ --max-scenes 50
```
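
To split every clip in a folder, a plain shell loop over the same CLI works (a sketch; adjust the glob, output layout, and options to your data):

```bash
# Split each .mp4 into scenes under its own subdirectory
for f in raw_videos/*.mp4; do
    name=$(basename "$f" .mp4)
    uv run python scripts/split_scenes.py "$f" "scenes/$name/" --filter-shorter-than 5s
done
```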

### Automatic Video Captioning

The `scripts/caption_videos.py` script generates captions for videos, including their audio tracks, using multimodal models.

```bash
# Generate captions for all videos in a directory (uses Qwen2.5-Omni by default)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json

# Use 8-bit quantization to reduce VRAM usage
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --use-8bit

# Use Gemini Flash API instead (requires API key)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json \
    --captioner-type gemini_flash --api-key YOUR_API_KEY

# Caption without audio processing (video-only)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --no-audio

# Force re-caption all files
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --override
```

**Key features:**

- **Audio-visual captioning**: Processes both video and audio content, including speech transcription
- **Multiple backends**:
  - `qwen_omni` (default): Local Qwen2.5-Omni model - processes video + audio locally
  - `gemini_flash`: Google Gemini Flash API - cloud-based, requires API key
- **Structured output**: Captions include visual description, speech transcription, sounds, and on-screen text
- **Memory optimization**: 8-bit quantization option for limited VRAM
- **Incremental processing**: Skips already-captioned files by default
- **Multiple output formats**: JSON, JSONL, CSV, or TXT

**Caption format:**

The captioner produces structured captions with four sections:
- `[VISUAL]`: Detailed description of visual content
- `[SPEECH]`: Word-for-word transcription of spoken content
- `[SOUNDS]`: Description of music, ambient sounds, sound effects
- `[TEXT]`: Any on-screen text visible in the video
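
For instance, a caption for a short clip might look like this (an illustrative example, not actual model output):

```text
[VISUAL] A woman in a red coat walks down a rain-soaked city street at night, neon signs reflecting in the puddles.
[SPEECH] "I didn't expect to see you here."
[SOUNDS] Steady rainfall, distant traffic, footsteps on wet pavement.
[TEXT] OPEN 24 HOURS
```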

**Environment variables (for Gemini Flash):**

Set one of these to use Gemini Flash without passing `--api-key`:
- `GOOGLE_API_KEY`
- `GEMINI_API_KEY`
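
For example (hypothetical key value; the flags are those documented above):

```bash
export GEMINI_API_KEY="your-key-here"
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --captioner-type gemini_flash
```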

### Dataset Preprocessing

The `scripts/process_dataset.py` script processes videos and caches latents for training. Resolution buckets are specified as `WIDTHxHEIGHTxFRAMES` strings; for example, `960x544x49` selects 960×544 pixels and 49 frames.

```bash
# Basic preprocessing
uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model

# With audio processing
uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model \
    --with-audio

# With video decoding for verification
uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model \
    --decode
```

Multiple resolution buckets can be specified, separated by `;`:

```bash
uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49;512x512x81" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model
```

> [!NOTE]
> When training with multiple resolution buckets, set `optimization.batch_size: 1`.
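
In the training YAML, that setting lives under the `optimization` section (key path inferred from the note above; surrounding config omitted):

```yaml
optimization:
  batch_size: 1
```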

For detailed usage, see the [Dataset Preparation Guide](dataset-preparation.md).

### Reference Video Generation

The `scripts/compute_reference.py` script provides a template for creating reference videos needed for IC-LoRA training.
The default implementation generates Canny edge reference videos.

```bash
# Generate Canny edge reference videos
uv run python scripts/compute_reference.py videos_dir/ --output dataset.json
```

**Key features:**

- **Canny edge detection**: Creates edge-based reference videos
- **In-place editing**: Updates existing dataset JSON files
- **Customizable**: Modify the `compute_reference()` function for different conditions (depth, pose, etc.)

> [!TIP]
> You can edit this script to generate other types of reference videos for IC-LoRA training,
> such as depth maps, segmentation masks, or any custom video transformation.

## 🔍 Debugging and Verification Scripts

### Latents Decoding

The `scripts/decode_latents.py` script decodes precomputed video latents back into video files for visual inspection.

```bash
# Basic usage
uv run python scripts/decode_latents.py /path/to/latents/dir \
    --output-dir /path/to/output \
    --model-path /path/to/ltx-2-model.safetensors

# With VAE tiling for large videos
uv run python scripts/decode_latents.py /path/to/latents/dir \
    --output-dir /path/to/output \
    --model-path /path/to/ltx-2-model.safetensors \
    --vae-tiling

# Decode both video and audio latents
uv run python scripts/decode_latents.py /path/to/latents/dir \
    --output-dir /path/to/output \
    --model-path /path/to/ltx-2-model.safetensors \
    --with-audio
```

**The script will:**

1. **Load the VAE model** from the specified path
2. **Process all `.pt` latent files** in the input directory
3. **Decode each latent** back into a video using the VAE
4. **Save resulting videos** as MP4 files in the output directory

**When to use:**

- **Verify preprocessing quality**: Check that your videos were encoded correctly
- **Debug training data**: Visualize what the model actually sees during training
- **Quality assessment**: Ensure latent encoding preserves important visual details
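
To spot-check a handful of samples instead of decoding the whole cache, you can copy a few latent files into a scratch directory first (a sketch using standard shell tools; adjust paths to your setup):

```bash
# Decode only the first five latents as a quick sanity check
mkdir -p /tmp/latents_sample
ls /path/to/latents/dir/*.pt | head -5 | xargs -I{} cp {} /tmp/latents_sample/
uv run python scripts/decode_latents.py /tmp/latents_sample \
    --output-dir /tmp/decoded_sample \
    --model-path /path/to/ltx-2-model.safetensors
```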


### Inference Script

The `scripts/inference.py` script runs inference with a trained model.

> [!TIP]
> For production inference, consider using the [`ltx-pipelines`](../../ltx-pipelines/) package which provides optimized,
> feature-rich pipelines for various use cases:
> - **Text/Image-to-Video**: `TI2VidOneStagePipeline`, `TI2VidTwoStagesPipeline`
> - **Distilled (fast) inference**: `DistilledPipeline`
> - **IC-LoRA video-to-video**: `ICLoraPipeline`
> - **Keyframe interpolation**: `KeyframeInterpolationPipeline`
>
> All pipelines support loading custom LoRAs trained with this trainer.

```bash
# Text-to-video inference (with audio by default)
# By default, uses CFG scale 3.0 and STG scale 1.0 with block 29
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --output output.mp4

# Video-only (skip audio generation)
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --skip-audio \
    --output output.mp4

# Image-to-video with conditioning image
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat walking" \
    --condition-image first_frame.png \
    --output output.mp4

# Custom guidance settings
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --guidance-scale 3.0 \
    --stg-scale 1.0 \
    --stg-blocks 29 \
    --output output.mp4

# Disable STG (CFG only)
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --stg-scale 0.0 \
    --output output.mp4
```

**Guidance parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--guidance-scale` | 3.0 | CFG (Classifier-Free Guidance) scale |
| `--stg-scale` | 1.0 | STG (Spatio-Temporal Guidance) scale. 0.0 disables STG |
| `--stg-blocks` | 29 | Transformer block(s) to perturb for STG |
| `--stg-mode` | stg_av | `stg_av` perturbs both audio and video, `stg_v` video only |
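
For example, to perturb only the video branch during STG (combining the flags documented above):

```bash
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --stg-mode stg_v \
    --output output.mp4
```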

## 🚀 Training Scripts

### Basic and Distributed Training

Use `scripts/train.py` for both single-GPU and multi-GPU runs:

```bash
# Single-GPU training
uv run python scripts/train.py configs/ltx2_av_lora.yaml

# Multi-GPU (uses your accelerate config)
uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml

# Override number of processes
uv run accelerate launch --num_processes 4 scripts/train.py configs/ltx2_av_lora.yaml
```
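
If you have not configured 🤗 Accelerate yet, create a config interactively first (standard Accelerate usage, not specific to this repo):

```bash
uv run accelerate config
```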

For detailed usage, see the [Training Guide](training-guide.md).

## 💡 Tips for Using Utility Scripts

- **Start with `--help`**: Always check available options for each script
- **Test on small datasets**: Verify workflows with a few files before processing large datasets
- **Use decode verification**: Always decode a few samples to verify preprocessing quality
- **Monitor VRAM usage**: Use `--use-8bit` or quantization flags when running into memory issues
- **Keep backups**: Make copies of important dataset files before running conversion scripts