# Utility Scripts Reference
This guide covers the various utility scripts available for preprocessing, conversion, and debugging tasks.
## 🎬 Dataset Processing Scripts
### Video Scene Splitting
The `scripts/split_scenes.py` script automatically splits long videos into shorter, coherent scenes.
```bash
# Basic scene splitting
uv run python scripts/split_scenes.py input.mp4 output_dir/ --filter-shorter-than 5s
```
**Key features:**
- **Automatic scene detection**: Uses PySceneDetect for intelligent splitting
- **Multiple algorithms**: Content-based, adaptive, threshold, and histogram detection
- **Filtering options**: Remove scenes shorter than specified duration
- **Customizable parameters**: Thresholds, window sizes, and detection modes
**Common options:**
```bash
# See all available options
uv run python scripts/split_scenes.py --help
# Use adaptive detection with custom threshold
uv run python scripts/split_scenes.py video.mp4 scenes/ --detector adaptive --threshold 30.0
# Limit to maximum number of scenes
uv run python scripts/split_scenes.py video.mp4 scenes/ --max-scenes 50
```
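Conceptually, content-based detection compares consecutive frames and cuts wherever the change exceeds a threshold, then discards scenes that are too short. The actual script uses PySceneDetect; the sketch below is a pure-Python illustration of that idea, and `frame_diff`, `split_on_threshold`, and the synthetic frame values are assumptions for demonstration, not the script's API:

```python
def frame_diff(a, b):
    """Mean absolute pixel difference between two frames (flat pixel lists)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def split_on_threshold(frames, threshold=30.0, min_len=2):
    """Cut a scene wherever consecutive frames differ by more than
    `threshold`; drop scenes shorter than `min_len` frames. This loosely
    mirrors what --threshold and --filter-shorter-than control."""
    scenes, start = [], 0
    for i in range(1, len(frames)):
        if frame_diff(frames[i - 1], frames[i]) > threshold:
            scenes.append((start, i))
            start = i
    scenes.append((start, len(frames)))
    return [s for s in scenes if s[1] - s[0] >= min_len]

# Two static shots of 3 frames each, with a hard cut in the middle.
frames = [[10] * 4] * 3 + [[200] * 4] * 3
print(split_on_threshold(frames))  # [(0, 3), (3, 6)]
```

The adaptive and histogram detectors refine this basic comparison (e.g., normalizing against a rolling window), but the cut-on-large-change structure is the same.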
### Automatic Video Captioning
The `scripts/caption_videos.py` script generates captions for videos, including their audio tracks, using multimodal models.
```bash
# Generate captions for all videos in a directory (uses Qwen2.5-Omni by default)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json
# Use 8-bit quantization to reduce VRAM usage
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --use-8bit
# Use Gemini Flash API instead (requires API key)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json \
--captioner-type gemini_flash --api-key YOUR_API_KEY
# Caption without audio processing (video-only)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --no-audio
# Force re-caption all files
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --override
```
**Key features:**
- **Audio-visual captioning**: Processes both video and audio content, including speech transcription
- **Multiple backends**:
- `qwen_omni` (default): Local Qwen2.5-Omni model - processes video + audio locally
- `gemini_flash`: Google Gemini Flash API - cloud-based, requires API key
- **Structured output**: Captions include visual description, speech transcription, sounds, and on-screen text
- **Memory optimization**: 8-bit quantization option for limited VRAM
- **Incremental processing**: Skips already-captioned files by default
- **Multiple output formats**: JSON, JSONL, CSV, or TXT
**Caption format:**
The captioner produces structured captions with four sections:
- `[VISUAL]`: Detailed description of visual content
- `[SPEECH]`: Word-for-word transcription of spoken content
- `[SOUNDS]`: Description of music, ambient sounds, sound effects
- `[TEXT]`: Any on-screen text visible in the video
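If you need the sections separately downstream (e.g., speech-only subtitles), the bracketed markers are easy to split on. A minimal sketch, assuming captions follow the four-section format above (`parse_caption` is illustrative, not a function the script exports):

```python
import re

SECTIONS = ("VISUAL", "SPEECH", "SOUNDS", "TEXT")

def parse_caption(caption: str) -> dict:
    """Split a structured caption into its four sections.
    Returns {section: text}; missing sections map to ''."""
    parts = {name: "" for name in SECTIONS}
    # Capture each [SECTION] marker and the text up to the next marker.
    pattern = r"\[(VISUAL|SPEECH|SOUNDS|TEXT)\]\s*(.*?)(?=\[(?:VISUAL|SPEECH|SOUNDS|TEXT)\]|$)"
    for match in re.finditer(pattern, caption, flags=re.S):
        parts[match.group(1)] = match.group(2).strip()
    return parts

caption = "[VISUAL] A cat on a sofa. [SPEECH] Hello there. [SOUNDS] Purring. [TEXT]"
print(parse_caption(caption)["SPEECH"])  # Hello there.
```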
**Environment variables (for Gemini Flash):**
Set one of these to use Gemini Flash without passing `--api-key`:
- `GOOGLE_API_KEY`
- `GEMINI_API_KEY`
### Dataset Preprocessing
The `scripts/process_dataset.py` script processes videos and caches latents for training.
```bash
# Basic preprocessing
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model
# With audio processing
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model \
--with-audio
# With video decoding for verification
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model \
--decode
```
Multiple resolution buckets can be specified, separated by `;`:
```bash
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49;512x512x81" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model
```
> [!NOTE]
> When training with multiple resolution buckets, set `optimization.batch_size: 1`.
For detailed usage, see the [Dataset Preparation Guide](dataset-preparation.md).
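The bucket string follows a `WIDTHxHEIGHTxFRAMES` convention, with `;` separating buckets. A small sketch of how such a spec can be parsed (`parse_buckets` is illustrative, not the script's internal function):

```python
def parse_buckets(spec: str):
    """Parse a --resolution-buckets string like '960x544x49;512x512x81'
    into (width, height, frames) tuples."""
    buckets = []
    for part in spec.split(";"):
        w, h, f = (int(v) for v in part.split("x"))
        buckets.append((w, h, f))
    return buckets

print(parse_buckets("960x544x49;512x512x81"))
# [(960, 544, 49), (512, 512, 81)]
```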
### Reference Video Generation
The `scripts/compute_reference.py` script provides a template for creating reference videos needed for IC-LoRA training.
The default implementation generates Canny edge reference videos.
```bash
# Generate Canny edge reference videos
uv run python scripts/compute_reference.py videos_dir/ --output dataset.json
```
**Key features:**
- **Canny edge detection**: Creates edge-based reference videos
- **In-place editing**: Updates existing dataset JSON files
- **Customizable**: Modify the `compute_reference()` function for different conditions (depth, pose, etc.)
> [!TIP]
> You can edit this script to generate other types of reference videos for IC-LoRA training,
> such as depth maps, segmentation masks, or any custom video transformation.
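The shape of the customization is a per-frame transform: `compute_reference()` maps each input frame to a reference frame. The pure-Python sketch below uses a crude gradient-threshold edge detector as a stand-in for a real Canny implementation (such as OpenCV's); the function name, threshold, and test frame are illustrative assumptions:

```python
def edge_map(frame, threshold=50):
    """Crude edge detector: mark a pixel 255 where the horizontal or
    vertical gradient exceeds `threshold`, else 0. A stand-in for a real
    Canny pass, showing the per-frame transform you would swap out to
    produce depth maps, pose maps, etc."""
    h, w = len(frame), len(frame[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = abs(frame[y][x] - frame[y][x - 1]) if x > 0 else 0
            gy = abs(frame[y][x] - frame[y - 1][x]) if y > 0 else 0
            if max(gx, gy) > threshold:
                out[y][x] = 255
    return out

# A frame with a vertical brightness edge down the middle.
frame = [[0, 0, 200, 200]] * 3
print(edge_map(frame)[0])  # [0, 0, 255, 0]
```

Whatever transform you substitute, keep the output a video with the same frame count and resolution as the input so the reference stays aligned with the training clip.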
## 🔍 Debugging and Verification Scripts
### Latents Decoding
The `scripts/decode_latents.py` script decodes precomputed video latents back into video files for visual inspection.
```bash
# Basic usage
uv run python scripts/decode_latents.py /path/to/latents/dir \
--output-dir /path/to/output \
--model-path /path/to/ltx-2-model.safetensors
# With VAE tiling for large videos
uv run python scripts/decode_latents.py /path/to/latents/dir \
--output-dir /path/to/output \
--model-path /path/to/ltx-2-model.safetensors \
--vae-tiling
# Decode both video and audio latents
uv run python scripts/decode_latents.py /path/to/latents/dir \
--output-dir /path/to/output \
--model-path /path/to/ltx-2-model.safetensors \
--with-audio
```
**The script will:**
1. **Load the VAE model** from the specified path
2. **Process all `.pt` latent files** in the input directory
3. **Decode each latent** back into a video using the VAE
4. **Save resulting videos** as MP4 files in the output directory
**When to use:**
- **Verify preprocessing quality**: Check that your videos were encoded correctly
- **Debug training data**: Visualize what the model actually sees during training
- **Quality assessment**: Ensure latent encoding preserves important visual details
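The outer loop of the four steps above can be sketched without the VAE itself; here `decode_fn` is a hypothetical stand-in for the load-latent/VAE-decode/write-MP4 step, and `decode_all` is illustrative, not the script's API:

```python
from pathlib import Path

def decode_all(latents_dir: str, output_dir: str, decode_fn) -> list:
    """Find every .pt latent file in latents_dir and write a same-named
    .mp4 into output_dir. `decode_fn(latent_path, out_path)` stands in
    for the VAE decode + video-write step."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for latent_path in sorted(Path(latents_dir).glob("*.pt")):
        out_path = out_dir / latent_path.with_suffix(".mp4").name
        decode_fn(latent_path, out_path)
        written.append(out_path)
    return written

# Usage with a stub decoder that just touches the output file:
import tempfile
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    (Path(src) / "clip_0001.pt").write_bytes(b"")
    outs = decode_all(src, dst, lambda lat, out: out.write_bytes(b""))
    print([p.name for p in outs])  # ['clip_0001.mp4']
```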
### Inference Script
The `scripts/inference.py` script runs inference with a trained model.
> [!TIP]
> For production inference, consider using the [`ltx-pipelines`](../../ltx-pipelines/) package which provides optimized,
> feature-rich pipelines for various use cases:
> - **Text/Image-to-Video**: `TI2VidOneStagePipeline`, `TI2VidTwoStagesPipeline`
> - **Distilled (fast) inference**: `DistilledPipeline`
> - **IC-LoRA video-to-video**: `ICLoraPipeline`
> - **Keyframe interpolation**: `KeyframeInterpolationPipeline`
>
> All pipelines support loading custom LoRAs trained with this trainer.
```bash
# Text-to-video inference (with audio by default)
# By default, uses CFG scale 3.0 and STG scale 1.0 with block 29
uv run python scripts/inference.py \
--checkpoint /path/to/model.safetensors \
--text-encoder-path /path/to/gemma \
--prompt "A cat playing with a ball" \
--output output.mp4
# Video-only (skip audio generation)
uv run python scripts/inference.py \
--checkpoint /path/to/model.safetensors \
--text-encoder-path /path/to/gemma \
--prompt "A cat playing with a ball" \
--skip-audio \
--output output.mp4
# Image-to-video with conditioning image
uv run python scripts/inference.py \
--checkpoint /path/to/model.safetensors \
--text-encoder-path /path/to/gemma \
--prompt "A cat walking" \
--condition-image first_frame.png \
--output output.mp4
# Custom guidance settings
uv run python scripts/inference.py \
--checkpoint /path/to/model.safetensors \
--text-encoder-path /path/to/gemma \
--prompt "A cat playing with a ball" \
--guidance-scale 3.0 \
--stg-scale 1.0 \
--stg-blocks 29 \
--output output.mp4
# Disable STG (CFG only)
uv run python scripts/inference.py \
--checkpoint /path/to/model.safetensors \
--text-encoder-path /path/to/gemma \
--prompt "A cat playing with a ball" \
--stg-scale 0.0 \
--output output.mp4
```
**Guidance parameters:**
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--guidance-scale` | 3.0 | CFG (Classifier-Free Guidance) scale |
| `--stg-scale` | 1.0 | STG (Spatio-Temporal Guidance) scale. 0.0 disables STG |
| `--stg-blocks` | 29 | Transformer block(s) to perturb for STG |
| `--stg-mode` | stg_av | `stg_av` perturbs both audio and video, `stg_v` video only |
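To see how the two scales interact, here is a common way CFG and STG terms are combined at each denoising step: CFG pushes the prediction away from the unconditional branch, while STG pushes it away from a branch where the chosen transformer block is perturbed. This is a hedged sketch of that standard formulation, not necessarily the exact code path in `scripts/inference.py`:

```python
def apply_guidance(uncond, cond, perturbed, guidance_scale=3.0, stg_scale=1.0):
    """Elementwise: out = uncond + g * (cond - uncond) + s * (cond - perturbed).
    With stg_scale=0.0 this reduces to plain CFG."""
    return [
        u + guidance_scale * (c - u) + stg_scale * (c - p)
        for u, c, p in zip(uncond, cond, perturbed)
    ]

print(apply_guidance([0.0], [1.0], [0.5]))                 # [3.5]
print(apply_guidance([0.0], [1.0], [0.5], stg_scale=0.0))  # [3.0] (CFG only)
```

This also explains why `--stg-scale 0.0` in the example above disables STG: its term simply drops out of the sum.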
## 🚀 Training Scripts
### Basic and Distributed Training
Use `scripts/train.py` for both single-GPU and multi-GPU runs:
```bash
# Single-GPU training
uv run python scripts/train.py configs/ltx2_av_lora.yaml
# Multi-GPU (uses your accelerate config)
uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml
# Override number of processes
uv run accelerate launch --num_processes 4 scripts/train.py configs/ltx2_av_lora.yaml
```
For detailed usage, see the [Training Guide](training-guide.md).
## 💡 Tips for Using Utility Scripts
- **Start with `--help`**: Always check available options for each script
- **Test on small datasets**: Verify workflows with a few files before processing large datasets
- **Use decode verification**: Always decode a few samples to verify preprocessing quality
- **Monitor VRAM usage**: Use `--use-8bit` or quantization flags when running into memory issues
- **Keep backups**: Make copies of important dataset files before running conversion scripts