Utility Scripts Reference

This guide covers the utility scripts available for preprocessing, conversion, debugging, and training tasks.

🎬 Dataset Processing Scripts

Video Scene Splitting

The scripts/split_scenes.py script automatically splits long videos into shorter, coherent scenes.

# Basic scene splitting
uv run python scripts/split_scenes.py input.mp4 output_dir/ --filter-shorter-than 5s

Key features:

  • Automatic scene detection: Uses PySceneDetect for intelligent splitting
  • Multiple algorithms: Content-based, adaptive, threshold, and histogram detection
  • Filtering options: Remove scenes shorter than specified duration
  • Customizable parameters: Thresholds, window sizes, and detection modes

Common options:

# See all available options
uv run python scripts/split_scenes.py --help

# Use adaptive detection with custom threshold
uv run python scripts/split_scenes.py video.mp4 scenes/ --detector adaptive --threshold 30.0

# Limit to maximum number of scenes
uv run python scripts/split_scenes.py video.mp4 scenes/ --max-scenes 50
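
These options compose; for example, adaptive detection, short-scene filtering, and a scene cap can run in a single pass (every flag here appears above, but verify the combination against --help):

# Adaptive detection, drop scenes shorter than 5s, keep at most 50 scenes
uv run python scripts/split_scenes.py video.mp4 scenes/ \
    --detector adaptive --threshold 30.0 \
    --filter-shorter-than 5s --max-scenes 50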

Automatic Video Captioning

The scripts/caption_videos.py script generates captions for videos (with audio) using multimodal models.

# Generate captions for all videos in a directory (uses Qwen2.5-Omni by default)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json

# Use 8-bit quantization to reduce VRAM usage
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --use-8bit

# Use Gemini Flash API instead (requires API key)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json \
    --captioner-type gemini_flash --api-key YOUR_API_KEY

# Caption without audio processing (video-only)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --no-audio

# Force re-caption all files
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --override

Key features:

  • Audio-visual captioning: Processes both video and audio content, including speech transcription
  • Multiple backends:
    • qwen_omni (default): Local Qwen2.5-Omni model - processes video + audio locally
    • gemini_flash: Google Gemini Flash API - cloud-based, requires API key
  • Structured output: Captions include visual description, speech transcription, sounds, and on-screen text
  • Memory optimization: 8-bit quantization option for limited VRAM
  • Incremental processing: Skips already-captioned files by default
  • Multiple output formats: JSON, JSONL, CSV, or TXT (see the example after this list)
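
For example, assuming the output format follows the file extension (an assumption worth confirming with --help), a JSONL dataset would be requested like this:

# Write captions as JSONL instead of JSON
uv run python scripts/caption_videos.py videos_dir/ --output dataset.jsonl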

Caption format:

The captioner produces structured captions with four sections:

  • [VISUAL]: Detailed description of visual content
  • [SPEECH]: Word-for-word transcription of spoken content
  • [SOUNDS]: Description of music, ambient sounds, sound effects
  • [TEXT]: Any on-screen text visible in the video
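
A caption in this format might look like the following (illustrative content, not actual script output):

[VISUAL] A woman in a red coat walks down a rainy city street at dusk, passing glowing shop windows.
[SPEECH] "I didn't expect to see you here."
[SOUNDS] Light rain, distant traffic, footsteps on wet pavement.
[TEXT] A neon sign reading "OPEN 24 HOURS".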

Environment variables (for Gemini Flash):

Set one of these to use Gemini Flash without passing --api-key:

  • GOOGLE_API_KEY
  • GEMINI_API_KEY
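
For example (the key value is a placeholder):

# Export the key once, then omit --api-key
export GEMINI_API_KEY="your-key-here"
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json \
    --captioner-type gemini_flash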

Dataset Preprocessing

The scripts/process_dataset.py script processes videos and caches latents for training. Resolution buckets are given as WIDTHxHEIGHTxFRAMES, so 960x544x49 below means 960x544 pixels and 49 frames.

# Basic preprocessing
uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model

# With audio processing
uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model \
    --with-audio

# With video decoding for verification
uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model \
    --decode

Multiple resolution buckets can be specified by separating them with a semicolon (;):

uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49;512x512x81" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model

When training with multiple resolution buckets, set optimization.batch_size: 1.
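
In the training config YAML, that option sits under the optimization section; a minimal fragment (surrounding keys omitted; see the files in configs/ for complete examples):

optimization:
  batch_size: 1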

For detailed usage, see the Dataset Preparation Guide.

Reference Video Generation

The scripts/compute_reference.py script provides a template for creating reference videos needed for IC-LoRA training. The default implementation generates Canny edge reference videos.

# Generate Canny edge reference videos
uv run python scripts/compute_reference.py videos_dir/ --output dataset.json

Key features:

  • Canny edge detection: Creates edge-based reference videos
  • In-place editing: Updates existing dataset JSON files
  • Customizable: Modify the compute_reference() function for different conditions (depth, pose, etc.)

You can edit this script to generate other types of reference videos for IC-LoRA training, such as depth maps, segmentation masks, or any custom video transformation.
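
As a sketch of what such a modification might look like, here is a per-frame grayscale-edge transform in the spirit of the default Canny implementation. The function name, frame layout, and return convention are assumptions; check the actual signature in scripts/compute_reference.py before adapting it:

import cv2
import numpy as np

def compute_reference(frames: np.ndarray) -> np.ndarray:
    """Hypothetical per-frame transform. `frames` is assumed to be a
    uint8 array of shape (num_frames, height, width, 3) in RGB order;
    the real script's signature may differ."""
    out = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        blurred = cv2.GaussianBlur(gray, (5, 5), 0)  # suppress noise before edge detection
        edges = cv2.Canny(blurred, 100, 200)         # standard Canny threshold pair
        out.append(cv2.cvtColor(edges, cv2.COLOR_GRAY2RGB))  # back to 3 channels for video encoding
    return np.stack(out)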

πŸ” Debugging and Verification Scripts

Latents Decoding

The scripts/decode_latents.py script decodes precomputed video latents back into video files for visual inspection.

# Basic usage
uv run python scripts/decode_latents.py /path/to/latents/dir \
    --output-dir /path/to/output \
    --model-path /path/to/ltx-2-model.safetensors

# With VAE tiling for large videos
uv run python scripts/decode_latents.py /path/to/latents/dir \
    --output-dir /path/to/output \
    --model-path /path/to/ltx-2-model.safetensors \
    --vae-tiling

# Decode both video and audio latents
uv run python scripts/decode_latents.py /path/to/latents/dir \
    --output-dir /path/to/output \
    --model-path /path/to/ltx-2-model.safetensors \
    --with-audio

The script will:

  1. Load the VAE model from the specified path
  2. Process all .pt latent files in the input directory
  3. Decode each latent back into a video using the VAE
  4. Save resulting videos as MP4 files in the output directory

When to use:

  • Verify preprocessing quality: Check that your videos were encoded correctly (see the spot-check command after this list)
  • Debug training data: Visualize what the model actually sees during training
  • Quality assessment: Ensure latent encoding preserves important visual details
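
After decoding, a quick way to spot-check a sample's resolution and frame count without opening it is ffprobe (assuming it is installed; sample.mp4 stands in for any decoded file):

# Print width, height, and frame count of a decoded sample
ffprobe -v error -select_streams v:0 \
    -show_entries stream=width,height,nb_frames \
    -of default=noprint_wrappers=1 /path/to/output/sample.mp4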

Inference Script

The scripts/inference.py script runs inference with a trained model.

For production inference, consider using the ltx-pipelines package, which provides optimized, feature-rich pipelines for various use cases:

  • Text/Image-to-Video: TI2VidOneStagePipeline, TI2VidTwoStagesPipeline
  • Distilled (fast) inference: DistilledPipeline
  • IC-LoRA video-to-video: ICLoraPipeline
  • Keyframe interpolation: KeyframeInterpolationPipeline

All pipelines support loading custom LoRAs trained with this trainer.

# Text-to-video inference (with audio by default)
# By default, uses CFG scale 3.0 and STG scale 1.0 with block 29
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --output output.mp4

# Video-only (skip audio generation)
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --skip-audio \
    --output output.mp4

# Image-to-video with conditioning image
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat walking" \
    --condition-image first_frame.png \
    --output output.mp4

# Custom guidance settings
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --guidance-scale 3.0 \
    --stg-scale 1.0 \
    --stg-blocks 29 \
    --output output.mp4

# Disable STG (CFG only)
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --stg-scale 0.0 \
    --output output.mp4

Guidance parameters:

Parameter          Default   Description
--guidance-scale   3.0       CFG (Classifier-Free Guidance) scale
--stg-scale        1.0       STG (Spatio-Temporal Guidance) scale; 0.0 disables STG
--stg-blocks       29        Transformer block(s) to perturb for STG
--stg-mode         stg_av    stg_av perturbs both audio and video; stg_v perturbs video only

πŸš€ Training Scripts

Basic and Distributed Training

Use scripts/train.py for both single GPU and multi-GPU runs:

# Single-GPU training
uv run python scripts/train.py configs/ltx2_av_lora.yaml

# Multi-GPU (uses your accelerate config)
uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml

# Override number of processes
uv run accelerate launch --num_processes 4 scripts/train.py configs/ltx2_av_lora.yaml
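
If you have not created an accelerate config yet, the interactive setup walks you through device, mixed-precision, and multi-GPU choices:

# One-time interactive setup for accelerate
uv run accelerate config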

For detailed usage, see the Training Guide.

πŸ’‘ Tips for Using Utility Scripts

  • Start with --help: Always check available options for each script
  • Test on small datasets: Verify workflows with a few files before processing large datasets
  • Use decode verification: Always decode a few samples to verify preprocessing quality
  • Monitor VRAM usage: Use --use-8bit or quantization flags when running into memory issues
  • Keep backups: Make copies of important dataset files before running conversion scripts