Utility Scripts Reference
This guide covers the various utility scripts available for preprocessing, conversion, and debugging tasks.
🎬 Dataset Processing Scripts
Video Scene Splitting
The scripts/split_scenes.py script automatically splits long videos into shorter, coherent scenes.
# Basic scene splitting
uv run python scripts/split_scenes.py input.mp4 output_dir/ --filter-shorter-than 5s
Key features:
- Automatic scene detection: Uses PySceneDetect for intelligent splitting
- Multiple algorithms: Content-based, adaptive, threshold, and histogram detection
- Filtering options: Remove scenes shorter than specified duration
- Customizable parameters: Thresholds, window sizes, and detection modes
Common options:
# See all available options
uv run python scripts/split_scenes.py --help
# Use adaptive detection with custom threshold
uv run python scripts/split_scenes.py video.mp4 scenes/ --detector adaptive --threshold 30.0
# Limit to maximum number of scenes
uv run python scripts/split_scenes.py video.mp4 scenes/ --max-scenes 50
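The options above can also be combined in a single run; for example (same flags as shown above, combining them this way is assumed to be supported):
# Adaptive detection, drop scenes shorter than 3 seconds, keep at most 100 scenes
uv run python scripts/split_scenes.py video.mp4 scenes/ --detector adaptive --filter-shorter-than 3s --max-scenes 100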
Automatic Video Captioning
The scripts/caption_videos.py script generates captions for videos (with audio) using multimodal models.
# Generate captions for all videos in a directory (uses Qwen2.5-Omni by default)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json
# Use 8-bit quantization to reduce VRAM usage
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --use-8bit
# Use Gemini Flash API instead (requires API key)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json \
--captioner-type gemini_flash --api-key YOUR_API_KEY
# Caption without audio processing (video-only)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --no-audio
# Force re-caption all files
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --override
Key features:
- Audio-visual captioning: Processes both video and audio content, including speech transcription
- Multiple backends:
  - qwen_omni (default): Local Qwen2.5-Omni model - processes video + audio locally
  - gemini_flash: Google Gemini Flash API - cloud-based, requires an API key
- Structured output: Captions include visual description, speech transcription, sounds, and on-screen text
- Memory optimization: 8-bit quantization option for limited VRAM
- Incremental processing: Skips already-captioned files by default
- Multiple output formats: JSON, JSONL, CSV, or TXT
Caption format:
The captioner produces structured captions with four sections:
- [VISUAL]: Detailed description of visual content
- [SPEECH]: Word-for-word transcription of spoken content
- [SOUNDS]: Description of music, ambient sounds, sound effects
- [TEXT]: Any on-screen text visible in the video
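Put together, a caption might look like the following (illustrative example only, not actual model output; exact delimiters may differ):
[VISUAL]: A woman in a red coat walks along a rainy city street at dusk, passing glowing shop windows.
[SPEECH]: "I didn't think you'd actually come."
[SOUNDS]: Steady rainfall, distant traffic, soft footsteps on wet pavement.
[TEXT]: A neon sign reads "OPEN 24 HOURS".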
Environment variables (for Gemini Flash):
Set one of these to use Gemini Flash without passing --api-key:
- GOOGLE_API_KEY
- GEMINI_API_KEY
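For example, in a bash-like shell:
# Export the key once, then caption with the Gemini Flash backend (no --api-key needed)
export GEMINI_API_KEY="your-key-here"
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --captioner-type gemini_flash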
Dataset Preprocessing
The scripts/process_dataset.py script processes videos and caches latents for training.
# Basic preprocessing
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model
# With audio processing
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model \
--with-audio
# With video decoding for verification
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model \
--decode
Multiple resolution buckets can be specified, separated by semicolons (;):
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49;512x512x81" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model
When training with multiple resolution buckets, set optimization.batch_size: 1.
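In the training config YAML, that key path corresponds to something like the following (assuming the usual nested layout for the optimization section):
optimization:
  batch_size: 1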
For detailed usage, see the Dataset Preparation Guide.
Reference Video Generation
The scripts/compute_reference.py script provides a template for creating reference videos needed for IC-LoRA training.
The default implementation generates Canny edge reference videos.
# Generate Canny edge reference videos
uv run python scripts/compute_reference.py videos_dir/ --output dataset.json
Key features:
- Canny edge detection: Creates edge-based reference videos
- In-place editing: Updates existing dataset JSON files
- Customizable: Modify the compute_reference() function for different conditions (depth, pose, etc.)
You can edit this script to generate other types of reference videos for IC-LoRA training, such as depth maps, segmentation masks, or any custom video transformation.
🔍 Debugging and Verification Scripts
Latents Decoding
The scripts/decode_latents.py script decodes precomputed video latents back into video files for visual inspection.
# Basic usage
uv run python scripts/decode_latents.py /path/to/latents/dir \
--output-dir /path/to/output \
--model-path /path/to/ltx-2-model.safetensors
# With VAE tiling for large videos
uv run python scripts/decode_latents.py /path/to/latents/dir \
--output-dir /path/to/output \
--model-path /path/to/ltx-2-model.safetensors \
--vae-tiling
# Decode both video and audio latents
uv run python scripts/decode_latents.py /path/to/latents/dir \
--output-dir /path/to/output \
--model-path /path/to/ltx-2-model.safetensors \
--with-audio
The script will:
- Load the VAE model from the specified path
- Process all .pt latent files in the input directory
- Decode each latent back into a video using the VAE
- Save resulting videos as MP4 files in the output directory
When to use:
- Verify preprocessing quality: Check that your videos were encoded correctly
- Debug training data: Visualize what the model actually sees during training
- Quality assessment: Ensure latent encoding preserves important visual details
Inference Script
The scripts/inference.py script runs inference with a trained model.
For production inference, consider using the ltx-pipelines package, which provides optimized, feature-rich pipelines for various use cases:
- Text/Image-to-Video: TI2VidOneStagePipeline, TI2VidTwoStagesPipeline
- Distilled (fast) inference: DistilledPipeline
- IC-LoRA video-to-video: ICLoraPipeline
- Keyframe interpolation: KeyframeInterpolationPipeline
All pipelines support loading custom LoRAs trained with this trainer.
# Text-to-video inference (with audio by default)
# By default, uses CFG scale 3.0 and STG scale 1.0 with block 29
uv run python scripts/inference.py \
--checkpoint /path/to/model.safetensors \
--text-encoder-path /path/to/gemma \
--prompt "A cat playing with a ball" \
--output output.mp4
# Video-only (skip audio generation)
uv run python scripts/inference.py \
--checkpoint /path/to/model.safetensors \
--text-encoder-path /path/to/gemma \
--prompt "A cat playing with a ball" \
--skip-audio \
--output output.mp4
# Image-to-video with conditioning image
uv run python scripts/inference.py \
--checkpoint /path/to/model.safetensors \
--text-encoder-path /path/to/gemma \
--prompt "A cat walking" \
--condition-image first_frame.png \
--output output.mp4
# Custom guidance settings
uv run python scripts/inference.py \
--checkpoint /path/to/model.safetensors \
--text-encoder-path /path/to/gemma \
--prompt "A cat playing with a ball" \
--guidance-scale 3.0 \
--stg-scale 1.0 \
--stg-blocks 29 \
--output output.mp4
# Disable STG (CFG only)
uv run python scripts/inference.py \
--checkpoint /path/to/model.safetensors \
--text-encoder-path /path/to/gemma \
--prompt "A cat playing with a ball" \
--stg-scale 0.0 \
--output output.mp4
Guidance parameters:
| Parameter | Default | Description |
|---|---|---|
| --guidance-scale | 3.0 | CFG (Classifier-Free Guidance) scale |
| --stg-scale | 1.0 | STG (Spatio-Temporal Guidance) scale. 0.0 disables STG |
| --stg-blocks | 29 | Transformer block(s) to perturb for STG |
| --stg-mode | stg_av | stg_av perturbs both audio and video; stg_v perturbs video only |
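For example, to apply STG to the video branch only (all flags as listed above; combining them this way is assumed to be supported):
# Video-only spatio-temporal guidance
uv run python scripts/inference.py \
--checkpoint /path/to/model.safetensors \
--text-encoder-path /path/to/gemma \
--prompt "A cat playing with a ball" \
--stg-mode stg_v \
--output output.mp4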
🚀 Training Scripts
Basic and Distributed Training
Use scripts/train.py for both single GPU and multi-GPU runs:
# Single-GPU training
uv run python scripts/train.py configs/ltx2_av_lora.yaml
# Multi-GPU (uses your accelerate config)
uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml
# Override number of processes
uv run accelerate launch --num_processes 4 scripts/train.py configs/ltx2_av_lora.yaml
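If you haven't set up an accelerate config yet, you can create one interactively with the standard Accelerate CLI (not specific to this repo):
# Answer the interactive prompts to generate a default accelerate config
uv run accelerate config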
For detailed usage, see the Training Guide.
💡 Tips for Using Utility Scripts
- Start with --help: Always check available options for each script
- Test on small datasets: Verify workflows with a few files before processing large datasets
- Use decode verification: Always decode a few samples to verify preprocessing quality
- Monitor VRAM usage: Use --use-8bit or other quantization flags when running into memory issues
- Keep backups: Make copies of important dataset files before running conversion scripts