# Utility Scripts Reference

This guide covers the various utility scripts available for preprocessing, conversion, and debugging tasks.

## 🎬 Dataset Processing Scripts

### Video Scene Splitting

The `scripts/split_scenes.py` script automatically splits long videos into shorter, coherent scenes.

```bash
# Basic scene splitting
uv run python scripts/split_scenes.py input.mp4 output_dir/ --filter-shorter-than 5s
```

**Key features:**

- **Automatic scene detection**: Uses PySceneDetect for intelligent splitting
- **Multiple algorithms**: Content-based, adaptive, threshold, and histogram detection
- **Filtering options**: Remove scenes shorter than specified duration
- **Customizable parameters**: Thresholds, window sizes, and detection modes

**Common options:**

```bash
# See all available options
uv run python scripts/split_scenes.py --help

# Use adaptive detection with custom threshold
uv run python scripts/split_scenes.py video.mp4 scenes/ --detector adaptive --threshold 30.0

# Limit to maximum number of scenes
uv run python scripts/split_scenes.py video.mp4 scenes/ --max-scenes 50
```

### Automatic Video Captioning

The `scripts/caption_videos.py` script generates captions for videos (with audio) using multimodal models.

```bash
# Generate captions for all videos in a directory (uses Qwen2.5-Omni by default)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json

# Use 8-bit quantization to reduce VRAM usage
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --use-8bit

# Use Gemini Flash API instead (requires API key)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json \
    --captioner-type gemini_flash --api-key YOUR_API_KEY

# Caption without audio processing (video-only)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --no-audio

# Force re-caption all files
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --override
```

**Key features:**

- **Audio-visual captioning**: Processes both video and audio content, including speech transcription
- **Multiple backends**:
  - `qwen_omni` (default): Local Qwen2.5-Omni model - processes video + audio locally
  - `gemini_flash`: Google Gemini Flash API - cloud-based, requires API key
- **Structured output**: Captions include visual description, speech transcription, sounds, and on-screen text
- **Memory optimization**: 8-bit quantization option for limited VRAM
- **Incremental processing**: Skips already-captioned files by default
- **Multiple output formats**: JSON, JSONL, CSV, or TXT

**Caption format:**

The captioner produces structured captions with four sections:

- `[VISUAL]`: Detailed description of visual content
- `[SPEECH]`: Word-for-word transcription of spoken content
- `[SOUNDS]`: Description of music, ambient sounds, sound effects
- `[TEXT]`: Any on-screen text visible in the video
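If you need to inspect or filter captions downstream, the four sections are straightforward to pull apart. Below is a minimal sketch (not part of the repository's tooling) that assumes the tags appear inline in each caption string; the example caption itself is hypothetical:

```python
import re

# Hypothetical caption in the documented four-section format.
caption = (
    '[VISUAL] A chef slices peppers on a wooden cutting board. '
    '[SPEECH] "Today we are making a quick stir-fry." '
    '[SOUNDS] Knife taps, sizzling oil. '
    '[TEXT] Episode 12'
)

# Split the caption into its sections, keyed by tag name.
TAGS = ("VISUAL", "SPEECH", "SOUNDS", "TEXT")
tag_re = "|".join(TAGS)
pattern = rf"\[({tag_re})\]\s*(.*?)(?=\[(?:{tag_re})\]|$)"
sections = {tag: text.strip() for tag, text in re.findall(pattern, caption, re.S)}

print(sections["SPEECH"])
```

The same pattern can be used to drop a section (for example, on-screen text) before writing captions back to your dataset file.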
**Environment variables (for Gemini Flash):**

Set one of these to use Gemini Flash without passing `--api-key`:

- `GOOGLE_API_KEY`
- `GEMINI_API_KEY`

### Dataset Preprocessing

The `scripts/process_dataset.py` script processes videos and caches latents for training.

```bash
# Basic preprocessing
uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model

# With audio processing
uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model \
    --with-audio

# With video decoding for verification
uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model \
    --decode
```

Each bucket is given as `WIDTHxHEIGHTxFRAMES` (so `960x544x49` means 960x544 pixels and 49 frames). Multiple resolution buckets can be specified, separated by `;`:

```bash
uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49;512x512x81" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model
```

> [!NOTE]
> When training with multiple resolution buckets, set `optimization.batch_size: 1`.

For detailed usage, see the [Dataset Preparation Guide](dataset-preparation.md).

### Reference Video Generation

The `scripts/compute_reference.py` script provides a template for creating reference videos needed for IC-LoRA training. The default implementation generates Canny edge reference videos.

```bash
# Generate Canny edge reference videos
uv run python scripts/compute_reference.py videos_dir/ --output dataset.json
```

**Key features:**

- **Canny edge detection**: Creates edge-based reference videos
- **In-place editing**: Updates existing dataset JSON files
- **Customizable**: Modify the `compute_reference()` function for different conditions such as depth or pose (see the sketch after the tip below)

> [!TIP]
> You can edit this script to generate other types of reference videos for IC-LoRA training,
> such as depth maps, segmentation masks, or any custom video transformation.
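For illustration, here is a hypothetical replacement body for such a customization: it swaps Canny edges for a blurred-grayscale placeholder, standing in for a real depth or pose estimator. The function name, signature, and frame layout are assumptions for this sketch; match them to whatever interface `scripts/compute_reference.py` actually exposes.

```python
import cv2
import numpy as np

def compute_reference(frames: np.ndarray) -> np.ndarray:
    """Hypothetical replacement for the script's Canny-edge transform.

    Assumes `frames` is a (num_frames, height, width, 3) uint8 RGB array
    and returns an array of the same shape containing the reference video.
    """
    out = np.empty_like(frames)
    for i, frame in enumerate(frames):
        # Placeholder condition: blurred grayscale. Swap this block for a
        # real depth or pose estimator to build other reference types.
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        gray = cv2.GaussianBlur(gray, (9, 9), 0)
        out[i] = np.stack([gray, gray, gray], axis=-1)
    return out
```

Whatever transform you substitute, keep the output shape and dtype aligned with the source clips so the IC-LoRA preprocessing stage sees reference videos with the same geometry.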
## 🔍 Debugging and Verification Scripts

### Latents Decoding

The `scripts/decode_latents.py` script decodes precomputed video latents back into video files for visual inspection.

```bash
# Basic usage
uv run python scripts/decode_latents.py /path/to/latents/dir \
    --output-dir /path/to/output \
    --model-path /path/to/ltx-2-model.safetensors

# With VAE tiling for large videos
uv run python scripts/decode_latents.py /path/to/latents/dir \
    --output-dir /path/to/output \
    --model-path /path/to/ltx-2-model.safetensors \
    --vae-tiling

# Decode both video and audio latents
uv run python scripts/decode_latents.py /path/to/latents/dir \
    --output-dir /path/to/output \
    --model-path /path/to/ltx-2-model.safetensors \
    --with-audio
```

**The script will:**

1. **Load the VAE model** from the specified path
2. **Process all `.pt` latent files** in the input directory
3. **Decode each latent** back into a video using the VAE
4. **Save resulting videos** as MP4 files in the output directory

**When to use:**

- **Verify preprocessing quality**: Check that your videos were encoded correctly
- **Debug training data**: Visualize what the model actually sees during training
- **Quality assessment**: Ensure latent encoding preserves important visual details

### Inference Script

The `scripts/inference.py` script runs inference with a trained model.

> [!TIP]
> For production inference, consider using the [`ltx-pipelines`](../../ltx-pipelines/) package, which provides optimized,
> feature-rich pipelines for various use cases:
>
> - **Text/Image-to-Video**: `TI2VidOneStagePipeline`, `TI2VidTwoStagesPipeline`
> - **Distilled (fast) inference**: `DistilledPipeline`
> - **IC-LoRA video-to-video**: `ICLoraPipeline`
> - **Keyframe interpolation**: `KeyframeInterpolationPipeline`
>
> All pipelines support loading custom LoRAs trained with this trainer.

```bash
# Text-to-video inference (with audio by default)
# By default, uses CFG scale 3.0 and STG scale 1.0 with block 29
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --output output.mp4

# Video-only (skip audio generation)
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --skip-audio \
    --output output.mp4

# Image-to-video with conditioning image
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat walking" \
    --condition-image first_frame.png \
    --output output.mp4

# Custom guidance settings
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --guidance-scale 3.0 \
    --stg-scale 1.0 \
    --stg-blocks 29 \
    --output output.mp4

# Disable STG (CFG only)
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --stg-scale 0.0 \
    --output output.mp4
```

**Guidance parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--guidance-scale` | 3.0 | CFG (Classifier-Free Guidance) scale |
| `--stg-scale` | 1.0 | STG (Spatio-Temporal Guidance) scale. 0.0 disables STG |
| `--stg-blocks` | 29 | Transformer block(s) to perturb for STG |
| `--stg-mode` | stg_av | `stg_av` perturbs both audio and video, `stg_v` video only |

## 🚀 Training Scripts

### Basic and Distributed Training

Use `scripts/train.py` for both single-GPU and multi-GPU runs:

```bash
# Single-GPU training
uv run python scripts/train.py configs/ltx2_av_lora.yaml

# Multi-GPU (uses your accelerate config)
uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml

# Override number of processes
uv run accelerate launch --num_processes 4 scripts/train.py configs/ltx2_av_lora.yaml
```

For detailed usage, see the [Training Guide](training-guide.md).

## 💡 Tips for Using Utility Scripts

- **Start with `--help`**: Always check available options for each script
- **Test on small datasets**: Verify workflows with a few files before processing large datasets
- **Use decode verification**: Always decode a few samples to verify preprocessing quality (see the sketch after this list)
- **Monitor VRAM usage**: Use `--use-8bit` or quantization flags when running into memory issues
- **Keep backups**: Make copies of important dataset files before running conversion scripts
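As a companion to the decode-verification tip, here is a minimal sketch for spot-checking one cached latent before decoding a whole directory. It assumes only that preprocessing saves `.pt` files loadable with `torch.load` (as the decode script expects); the exact layout of a file, a bare tensor or a dict of tensors, is not assumed:

```python
import torch

# Hypothetical path to one cached latent produced by scripts/process_dataset.py.
obj = torch.load("/path/to/latents/dir/sample_000.pt", map_location="cpu")

# Print tensor shapes without assuming a specific file layout.
if isinstance(obj, dict):
    for key, value in obj.items():
        if torch.is_tensor(value):
            print(f"{key}: shape={tuple(value.shape)}, dtype={value.dtype}")
        else:
            print(f"{key}: {type(value).__name__}")
else:
    print(f"latent: shape={tuple(obj.shape)}, dtype={obj.dtype}")
```

If the shapes look wrong here, re-run preprocessing before spending time on a full decode.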