| # Utility Scripts Reference |
|
|
| This guide covers the various utility scripts available for preprocessing, conversion, and debugging tasks. |
|
|
| ## ๐ฌ Dataset Processing Scripts |
|
|
| ### Video Scene Splitting |
|
|
| The `scripts/split_scenes.py` script automatically splits long videos into shorter, coherent scenes. |
|
|
| ```bash |
| # Basic scene splitting |
| uv run python scripts/split_scenes.py input.mp4 output_dir/ --filter-shorter-than 5s |
| ``` |
|
|
| **Key features:** |
|
|
| - **Automatic scene detection**: Uses PySceneDetect for intelligent splitting |
| - **Multiple algorithms**: Content-based, adaptive, threshold, and histogram detection |
| - **Filtering options**: Remove scenes shorter than specified duration |
| - **Customizable parameters**: Thresholds, window sizes, and detection modes |
|
|
| **Common options:** |
|
|
| ```bash |
| # See all available options |
| uv run python scripts/split_scenes.py --help |
| |
| # Use adaptive detection with custom threshold |
| uv run python scripts/split_scenes.py video.mp4 scenes/ --detector adaptive --threshold 30.0 |
| |
| # Limit to maximum number of scenes |
| uv run python scripts/split_scenes.py video.mp4 scenes/ --max-scenes 50 |
| ``` |
|
|
| ### Automatic Video Captioning |
|
|
The `scripts/caption_videos.py` script generates captions for videos, including their audio tracks, using multimodal models.
|
|
| ```bash |
| # Generate captions for all videos in a directory (uses Qwen2.5-Omni by default) |
| uv run python scripts/caption_videos.py videos_dir/ --output dataset.json |
| |
| # Use 8-bit quantization to reduce VRAM usage |
| uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --use-8bit |
| |
| # Use Gemini Flash API instead (requires API key) |
| uv run python scripts/caption_videos.py videos_dir/ --output dataset.json \ |
| --captioner-type gemini_flash --api-key YOUR_API_KEY |
| |
| # Caption without audio processing (video-only) |
| uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --no-audio |
| |
| # Force re-caption all files |
| uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --override |
| ``` |
|
|
| **Key features:** |
|
|
| - **Audio-visual captioning**: Processes both video and audio content, including speech transcription |
| - **Multiple backends**: |
| - `qwen_omni` (default): Local Qwen2.5-Omni model - processes video + audio locally |
| - `gemini_flash`: Google Gemini Flash API - cloud-based, requires API key |
| - **Structured output**: Captions include visual description, speech transcription, sounds, and on-screen text |
| - **Memory optimization**: 8-bit quantization option for limited VRAM |
| - **Incremental processing**: Skips already-captioned files by default |
| - **Multiple output formats**: JSON, JSONL, CSV, or TXT |
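
If you want a different output format, one plausible invocation (assuming the script infers the format from the `--output` file extension; this is an assumption, not confirmed here) is:

```bash
# Hypothetical: write JSONL instead of JSON (assumes format is inferred from the extension)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.jsonl
```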
|
|
| **Caption format:** |
|
|
| The captioner produces structured captions with four sections: |
| - `[VISUAL]`: Detailed description of visual content |
| - `[SPEECH]`: Word-for-word transcription of spoken content |
| - `[SOUNDS]`: Description of music, ambient sounds, sound effects |
| - `[TEXT]`: Any on-screen text visible in the video |
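
For illustration, a caption for a hypothetical clip might look like this (the content below is invented; only the section markers come from the captioner):

```
[VISUAL] A woman in a red jacket walks down a rainy city street at night, neon signs reflecting in the puddles.
[SPEECH] "I can't believe we're finally here."
[SOUNDS] Steady rain, distant traffic, soft synth music.
[TEXT] A storefront sign reads "OPEN 24 HOURS".
```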
|
|
| **Environment variables (for Gemini Flash):** |
|
|
| Set one of these to use Gemini Flash without passing `--api-key`: |
| - `GOOGLE_API_KEY` |
| - `GEMINI_API_KEY` |
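
For example (the key value is a placeholder):

```bash
# Export the key once per shell session, then omit --api-key
export GEMINI_API_KEY="your-key-here"
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json \
    --captioner-type gemini_flash
```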
|
|
| ### Dataset Preprocessing |
|
|
| The `scripts/process_dataset.py` script processes videos and caches latents for training. |
|
|
| ```bash |
| # Basic preprocessing |
| uv run python scripts/process_dataset.py dataset.json \ |
| --resolution-buckets "960x544x49" \ |
| --model-path /path/to/ltx-2-model.safetensors \ |
| --text-encoder-path /path/to/gemma-model |
| |
| # With audio processing |
| uv run python scripts/process_dataset.py dataset.json \ |
| --resolution-buckets "960x544x49" \ |
| --model-path /path/to/ltx-2-model.safetensors \ |
| --text-encoder-path /path/to/gemma-model \ |
| --with-audio |
| |
| # With video decoding for verification |
| uv run python scripts/process_dataset.py dataset.json \ |
| --resolution-buckets "960x544x49" \ |
| --model-path /path/to/ltx-2-model.safetensors \ |
| --text-encoder-path /path/to/gemma-model \ |
| --decode |
| ``` |
|
|
Multiple resolution buckets can be specified, separated by `;`. Each bucket is written as `WIDTHxHEIGHTxFRAMES`, so `960x544x49` means 960×544 pixels and 49 frames:
|
|
| ```bash |
| uv run python scripts/process_dataset.py dataset.json \ |
| --resolution-buckets "960x544x49;512x512x81" \ |
| --model-path /path/to/ltx-2-model.safetensors \ |
| --text-encoder-path /path/to/gemma-model |
| ``` |
|
|
| > [!NOTE] |
| > When training with multiple resolution buckets, set `optimization.batch_size: 1`. |
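
For example, the relevant fragment of a training config might look like this (the surrounding structure is a minimal sketch; only the `optimization.batch_size` key comes from this note):

```yaml
optimization:
  batch_size: 1  # required when training with multiple resolution buckets
```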
| |
| For detailed usage, see the [Dataset Preparation Guide](dataset-preparation.md). |
| |
| ### Reference Video Generation |
| |
| The `scripts/compute_reference.py` script provides a template for creating reference videos needed for IC-LoRA training. |
| The default implementation generates Canny edge reference videos. |
|
|
| ```bash |
| # Generate Canny edge reference videos |
| uv run python scripts/compute_reference.py videos_dir/ --output dataset.json |
| ``` |
|
|
| **Key features:** |
|
|
| - **Canny edge detection**: Creates edge-based reference videos |
| - **In-place editing**: Updates existing dataset JSON files |
| - **Customizable**: Modify the `compute_reference()` function for different conditions (depth, pose, etc.) |
|
|
| > [!TIP] |
| > You can edit this script to generate other types of reference videos for IC-LoRA training, |
| > such as depth maps, segmentation masks, or any custom video transformation. |
|
|
| ## ๐ Debugging and Verification Scripts |
|
|
| ### Latents Decoding |
|
|
| The `scripts/decode_latents.py` script decodes precomputed video latents back into video files for visual inspection. |
|
|
| ```bash |
| # Basic usage |
| uv run python scripts/decode_latents.py /path/to/latents/dir \ |
| --output-dir /path/to/output \ |
| --model-path /path/to/ltx-2-model.safetensors |
| |
| # With VAE tiling for large videos |
| uv run python scripts/decode_latents.py /path/to/latents/dir \ |
| --output-dir /path/to/output \ |
| --model-path /path/to/ltx-2-model.safetensors \ |
| --vae-tiling |
| |
| # Decode both video and audio latents |
| uv run python scripts/decode_latents.py /path/to/latents/dir \ |
| --output-dir /path/to/output \ |
| --model-path /path/to/ltx-2-model.safetensors \ |
| --with-audio |
| ``` |
|
|
| **The script will:** |
|
|
| 1. **Load the VAE model** from the specified path |
| 2. **Process all `.pt` latent files** in the input directory |
| 3. **Decode each latent** back into a video using the VAE |
| 4. **Save resulting videos** as MP4 files in the output directory |
|
|
| **When to use:** |
|
|
| - **Verify preprocessing quality**: Check that your videos were encoded correctly |
| - **Debug training data**: Visualize what the model actually sees during training |
| - **Quality assessment**: Ensure latent encoding preserves important visual details |
|
|
| ### Inference Script |
|
|
| The `scripts/inference.py` script runs inference with a trained model. |
|
|
| > [!TIP] |
| > For production inference, consider using the [`ltx-pipelines`](../../ltx-pipelines/) package which provides optimized, |
| > feature-rich pipelines for various use cases: |
| > - **Text/Image-to-Video**: `TI2VidOneStagePipeline`, `TI2VidTwoStagesPipeline` |
| > - **Distilled (fast) inference**: `DistilledPipeline` |
| > - **IC-LoRA video-to-video**: `ICLoraPipeline` |
| > - **Keyframe interpolation**: `KeyframeInterpolationPipeline` |
| > |
| > All pipelines support loading custom LoRAs trained with this trainer. |
|
|
| ```bash |
| # Text-to-video inference (with audio by default) |
| # By default, uses CFG scale 4.0 and STG scale 1.0 with block 29 |
| uv run python scripts/inference.py \ |
| --checkpoint /path/to/model.safetensors \ |
| --text-encoder-path /path/to/gemma \ |
| --prompt "A cat playing with a ball" \ |
| --output output.mp4 |
| |
| # Video-only (skip audio generation) |
| uv run python scripts/inference.py \ |
| --checkpoint /path/to/model.safetensors \ |
| --text-encoder-path /path/to/gemma \ |
| --prompt "A cat playing with a ball" \ |
| --skip-audio \ |
| --output output.mp4 |
| |
| # Image-to-video with conditioning image |
| uv run python scripts/inference.py \ |
| --checkpoint /path/to/model.safetensors \ |
| --text-encoder-path /path/to/gemma \ |
| --prompt "A cat walking" \ |
| --condition-image first_frame.png \ |
| --output output.mp4 |
| |
| # Custom guidance settings |
| uv run python scripts/inference.py \ |
| --checkpoint /path/to/model.safetensors \ |
| --text-encoder-path /path/to/gemma \ |
| --prompt "A cat playing with a ball" \ |
| --guidance-scale 4.0 \ |
| --stg-scale 1.0 \ |
| --stg-blocks 29 \ |
| --output output.mp4 |
| |
| # Disable STG (CFG only) |
| uv run python scripts/inference.py \ |
| --checkpoint /path/to/model.safetensors \ |
| --text-encoder-path /path/to/gemma \ |
| --prompt "A cat playing with a ball" \ |
| --stg-scale 0.0 \ |
| --output output.mp4 |
| ``` |
|
|
| **Guidance parameters:** |
|
|
| | Parameter | Default | Description | |
| |-----------|---------|-------------| |
| | `--guidance-scale` | 4.0 | CFG (Classifier-Free Guidance) scale | |
| | `--stg-scale` | 1.0 | STG (Spatio-Temporal Guidance) scale. 0.0 disables STG | |
| | `--stg-blocks` | 29 | Transformer block(s) to perturb for STG | |
| `--stg-mode` | stg_av | `stg_av` perturbs both audio and video; `stg_v` perturbs video only |
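
The `--stg-mode` option does not appear in the examples above; here is a sketch of video-only STG perturbation, reusing the placeholder paths from the earlier examples:

```bash
# Perturb only the video stream for STG
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --stg-mode stg_v \
    --output output.mp4
```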
| |
| ## ๐ Training Scripts |
| |
| ### Basic and Distributed Training |
| |
Use `scripts/train.py` for both single-GPU and multi-GPU runs:
| |
| ```bash |
| # Single-GPU training |
| uv run python scripts/train.py configs/ltx2_av_lora.yaml |
| |
| # Multi-GPU (uses your accelerate config) |
| uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml |
| |
| # Override number of processes |
| uv run accelerate launch --num_processes 4 scripts/train.py configs/ltx2_av_lora.yaml |
| ``` |
| |
| For detailed usage, see the [Training Guide](training-guide.md). |
| |
| ## ๐ก Tips for Using Utility Scripts |
| |
| - **Start with `--help`**: Always check available options for each script |
| - **Test on small datasets**: Verify workflows with a few files before processing large datasets |
| - **Use decode verification**: Always decode a few samples to verify preprocessing quality |
| - **Monitor VRAM usage**: Use `--use-8bit` or quantization flags when running into memory issues |
| - **Keep backups**: Make copies of important dataset files before running conversion scripts |
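
For example, a quick backup before a destructive run (the filename is illustrative):

```bash
# Keep a dated copy of the dataset file before converting it
cp dataset.json "dataset.json.bak.$(date +%Y%m%d)"
```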
| |