# Utility Scripts Reference

This guide covers the various utility scripts available for preprocessing, conversion, and debugging tasks.

## 🎬 Dataset Processing Scripts

### Video Scene Splitting

The `scripts/split_scenes.py` script automatically splits long videos into shorter, coherent scenes.

```bash
# Basic scene splitting
uv run python scripts/split_scenes.py input.mp4 output_dir/ --filter-shorter-than 5s
```

**Key features:**

- **Automatic scene detection**: Uses PySceneDetect for intelligent splitting
- **Multiple algorithms**: Content-based, adaptive, threshold, and histogram detection
- **Filtering options**: Remove scenes shorter than specified duration
- **Customizable parameters**: Thresholds, window sizes, and detection modes

**Common options:**

```bash
# See all available options
uv run python scripts/split_scenes.py --help

# Use adaptive detection with custom threshold
uv run python scripts/split_scenes.py video.mp4 scenes/ --detector adaptive --threshold 30.0

# Limit to maximum number of scenes
uv run python scripts/split_scenes.py video.mp4 scenes/ --max-scenes 50
```
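
To split every clip in a folder, a plain shell loop over the same CLI works (a sketch; adjust the glob, output layout, and options to your data):

```bash
# Split each .mp4 into scenes under its own subdirectory
for f in raw_videos/*.mp4; do
    name=$(basename "$f" .mp4)
    uv run python scripts/split_scenes.py "$f" "scenes/$name/" --filter-shorter-than 5s
done
```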

### Automatic Video Captioning

The `scripts/caption_videos.py` script generates captions for videos, including their audio tracks, using multimodal models.

```bash
# Generate captions for all videos in a directory (uses Qwen2.5-Omni by default)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json

# Use 8-bit quantization to reduce VRAM usage
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --use-8bit

# Use Gemini Flash API instead (requires API key)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json \
    --captioner-type gemini_flash --api-key YOUR_API_KEY

# Caption without audio processing (video-only)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --no-audio

# Force re-caption all files
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --override
```

**Key features:**

- **Audio-visual captioning**: Processes both video and audio content, including speech transcription
- **Multiple backends**:
  - `qwen_omni` (default): Local Qwen2.5-Omni model - processes video + audio locally
  - `gemini_flash`: Google Gemini Flash API - cloud-based, requires API key
- **Structured output**: Captions include visual description, speech transcription, sounds, and on-screen text
- **Memory optimization**: 8-bit quantization option for limited VRAM
- **Incremental processing**: Skips already-captioned files by default
- **Multiple output formats**: JSON, JSONL, CSV, or TXT

**Caption format:**

The captioner produces structured captions with four sections:
- `[VISUAL]`: Detailed description of visual content
- `[SPEECH]`: Word-for-word transcription of spoken content
- `[SOUNDS]`: Description of music, ambient sounds, sound effects
- `[TEXT]`: Any on-screen text visible in the video
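
For instance, a caption for a short clip might look like this (an illustrative example, not actual model output):

```text
[VISUAL] A woman in a red coat walks down a rain-soaked city street at night, neon signs reflecting in the puddles.
[SPEECH] "I didn't expect to see you here."
[SOUNDS] Steady rainfall, distant traffic, footsteps on wet pavement.
[TEXT] OPEN 24 HOURS
```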

**Environment variables (for Gemini Flash):**

Set one of these to use Gemini Flash without passing `--api-key`:
- `GOOGLE_API_KEY`
- `GEMINI_API_KEY`
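
For example (hypothetical key value; the flags are those documented above):

```bash
export GEMINI_API_KEY="your-key-here"
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --captioner-type gemini_flash
```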

### Dataset Preprocessing

The `scripts/process_dataset.py` script processes videos and caches latents for training. Resolution buckets are specified as `WIDTHxHEIGHTxFRAMES` strings; for example, `960x544x49` selects 960×544 pixels and 49 frames.

```bash
# Basic preprocessing
uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model

# With audio processing
uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model \
    --with-audio

# With video decoding for verification
uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model \
    --decode
```

Multiple resolution buckets can be specified, separated by `;`:

```bash
uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49;512x512x81" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model
```

> [!NOTE]
> When training with multiple resolution buckets, set `optimization.batch_size: 1`.
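
In the training YAML, that setting lives under the `optimization` section (key path inferred from the note above; surrounding config omitted):

```yaml
optimization:
  batch_size: 1
```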

For detailed usage, see the [Dataset Preparation Guide](dataset-preparation.md).

### Reference Video Generation

The `scripts/compute_reference.py` script provides a template for creating reference videos needed for IC-LoRA training.
The default implementation generates Canny edge reference videos.

```bash
# Generate Canny edge reference videos
uv run python scripts/compute_reference.py videos_dir/ --output dataset.json
```

**Key features:**

- **Canny edge detection**: Creates edge-based reference videos
- **In-place editing**: Updates existing dataset JSON files
- **Customizable**: Modify the `compute_reference()` function for different conditions (depth, pose, etc.)

> [!TIP]
> You can edit this script to generate other types of reference videos for IC-LoRA training,
> such as depth maps, segmentation masks, or any custom video transformation.

## 🔍 Debugging and Verification Scripts

### Latents Decoding

The `scripts/decode_latents.py` script decodes precomputed video latents back into video files for visual inspection.

```bash
# Basic usage
uv run python scripts/decode_latents.py /path/to/latents/dir \
    --output-dir /path/to/output \
    --model-path /path/to/ltx-2-model.safetensors

# With VAE tiling for large videos
uv run python scripts/decode_latents.py /path/to/latents/dir \
    --output-dir /path/to/output \
    --model-path /path/to/ltx-2-model.safetensors \
    --vae-tiling

# Decode both video and audio latents
uv run python scripts/decode_latents.py /path/to/latents/dir \
    --output-dir /path/to/output \
    --model-path /path/to/ltx-2-model.safetensors \
    --with-audio
```

**The script will:**

1. **Load the VAE model** from the specified path
2. **Process all `.pt` latent files** in the input directory
3. **Decode each latent** back into a video using the VAE
4. **Save resulting videos** as MP4 files in the output directory

**When to use:**

- **Verify preprocessing quality**: Check that your videos were encoded correctly
- **Debug training data**: Visualize what the model actually sees during training
- **Quality assessment**: Ensure latent encoding preserves important visual details
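
To spot-check a handful of samples instead of decoding the whole cache, you can copy a few latent files into a scratch directory first (a sketch using standard shell tools; adjust paths to your setup):

```bash
# Decode only the first five latents as a quick sanity check
mkdir -p /tmp/latents_sample
ls /path/to/latents/dir/*.pt | head -5 | xargs -I{} cp {} /tmp/latents_sample/
uv run python scripts/decode_latents.py /tmp/latents_sample \
    --output-dir /tmp/decoded_sample \
    --model-path /path/to/ltx-2-model.safetensors
```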


### Inference Script

The `scripts/inference.py` script runs inference with a trained model.

> [!TIP]
> For production inference, consider using the [`ltx-pipelines`](../../ltx-pipelines/) package which provides optimized,
> feature-rich pipelines for various use cases:
> - **Text/Image-to-Video**: `TI2VidOneStagePipeline`, `TI2VidTwoStagesPipeline`
> - **Distilled (fast) inference**: `DistilledPipeline`
> - **IC-LoRA video-to-video**: `ICLoraPipeline`
> - **Keyframe interpolation**: `KeyframeInterpolationPipeline`
>
> All pipelines support loading custom LoRAs trained with this trainer.

```bash
# Text-to-video inference (with audio by default)
# By default, uses CFG scale 3.0 and STG scale 1.0 with block 29
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --output output.mp4

# Video-only (skip audio generation)
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --skip-audio \
    --output output.mp4

# Image-to-video with conditioning image
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat walking" \
    --condition-image first_frame.png \
    --output output.mp4

# Custom guidance settings
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --guidance-scale 3.0 \
    --stg-scale 1.0 \
    --stg-blocks 29 \
    --output output.mp4

# Disable STG (CFG only)
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --stg-scale 0.0 \
    --output output.mp4
```

**Guidance parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--guidance-scale` | 3.0 | CFG (Classifier-Free Guidance) scale |
| `--stg-scale` | 1.0 | STG (Spatio-Temporal Guidance) scale. 0.0 disables STG |
| `--stg-blocks` | 29 | Transformer block(s) to perturb for STG |
| `--stg-mode` | stg_av | `stg_av` perturbs both audio and video, `stg_v` video only |
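
For example, to perturb only the video branch during STG (combining the flags documented above):

```bash
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --stg-mode stg_v \
    --output output.mp4
```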

## 🚀 Training Scripts

### Basic and Distributed Training

Use `scripts/train.py` for both single-GPU and multi-GPU runs:

```bash
# Single-GPU training
uv run python scripts/train.py configs/ltx2_av_lora.yaml

# Multi-GPU (uses your accelerate config)
uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml

# Override number of processes
uv run accelerate launch --num_processes 4 scripts/train.py configs/ltx2_av_lora.yaml
```
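
If you have not configured 🤗 Accelerate yet, create a config interactively first (standard Accelerate usage, not specific to this repo):

```bash
uv run accelerate config
```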

For detailed usage, see the [Training Guide](training-guide.md).

## 💡 Tips for Using Utility Scripts

- **Start with `--help`**: Always check available options for each script
- **Test on small datasets**: Verify workflows with a few files before processing large datasets
- **Use decode verification**: Always decode a few samples to verify preprocessing quality
- **Monitor VRAM usage**: Use `--use-8bit` or quantization flags when running into memory issues
- **Keep backups**: Make copies of important dataset files before running conversion scripts