# Utility Scripts Reference
This guide covers the various utility scripts available for preprocessing, conversion, and debugging tasks.
## 🎬 Dataset Processing Scripts
### Video Scene Splitting
The `scripts/split_scenes.py` script automatically splits long videos into shorter, coherent scenes.
```bash
# Basic scene splitting
uv run python scripts/split_scenes.py input.mp4 output_dir/ --filter-shorter-than 5s
```
**Key features:**
- **Automatic scene detection**: Uses PySceneDetect for intelligent splitting
- **Multiple algorithms**: Content-based, adaptive, threshold, and histogram detection
- **Filtering options**: Remove scenes shorter than specified duration
- **Customizable parameters**: Thresholds, window sizes, and detection modes
**Common options:**
```bash
# See all available options
uv run python scripts/split_scenes.py --help
# Use adaptive detection with custom threshold
uv run python scripts/split_scenes.py video.mp4 scenes/ --detector adaptive --threshold 30.0
# Limit to maximum number of scenes
uv run python scripts/split_scenes.py video.mp4 scenes/ --max-scenes 50
```
### Automatic Video Captioning
The `scripts/caption_videos.py` script generates captions for videos (including their audio tracks) using multimodal models.
```bash
# Generate captions for all videos in a directory (uses Qwen2.5-Omni by default)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json
# Use 8-bit quantization to reduce VRAM usage
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --use-8bit
# Use Gemini Flash API instead (requires API key)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json \
--captioner-type gemini_flash --api-key YOUR_API_KEY
# Caption without audio processing (video-only)
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --no-audio
# Force re-caption all files
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json --override
```
**Key features:**
- **Audio-visual captioning**: Processes both video and audio content, including speech transcription
- **Multiple backends**:
- `qwen_omni` (default): Local Qwen2.5-Omni model - processes video + audio locally
- `gemini_flash`: Google Gemini Flash API - cloud-based, requires API key
- **Structured output**: Captions include visual description, speech transcription, sounds, and on-screen text
- **Memory optimization**: 8-bit quantization option for limited VRAM
- **Incremental processing**: Skips already-captioned files by default
- **Multiple output formats**: JSON, JSONL, CSV, or TXT
**Caption format:**
The captioner produces structured captions with four sections:
- `[VISUAL]`: Detailed description of visual content
- `[SPEECH]`: Word-for-word transcription of spoken content
- `[SOUNDS]`: Description of music, ambient sounds, sound effects
- `[TEXT]`: Any on-screen text visible in the video
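For illustration, a generated caption might look like the following (a hypothetical example; the exact wording and level of detail depend on the captioning model used):
```
[VISUAL] A chef in a white apron dices vegetables on a wooden cutting board in a sunlit kitchen.
[SPEECH] "Today we're making a quick vegetable stir-fry."
[SOUNDS] Rhythmic knife chopping over soft kitchen ambience.
[TEXT] "Episode 12: Weeknight Dinners"
```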
**Environment variables (for Gemini Flash):**
Set one of these to use Gemini Flash without passing `--api-key`:
- `GOOGLE_API_KEY`
- `GEMINI_API_KEY`
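For example, with the key exported in your shell environment, the `--api-key` flag can be omitted:
```bash
# Export the key once per shell session (either variable name works)
export GEMINI_API_KEY="YOUR_API_KEY"

# The script picks up the key from the environment
uv run python scripts/caption_videos.py videos_dir/ --output dataset.json \
    --captioner-type gemini_flash
```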
### Dataset Preprocessing
The `scripts/process_dataset.py` script processes videos and caches latents for training.
```bash
# Basic preprocessing
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model
# With audio processing
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model \
--with-audio
# With video decoding for verification
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model \
--decode
```
Multiple resolution buckets can be specified, separated by `;`:
```bash
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49;512x512x81" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model
```
> [!NOTE]
> When training with multiple resolution buckets, set `optimization.batch_size: 1`.
For detailed usage, see the [Dataset Preparation Guide](dataset-preparation.md).
### Reference Video Generation
The `scripts/compute_reference.py` script provides a template for creating reference videos needed for IC-LoRA training.
The default implementation generates Canny edge reference videos.
```bash
# Generate Canny edge reference videos
uv run python scripts/compute_reference.py videos_dir/ --output dataset.json
```
**Key features:**
- **Canny edge detection**: Creates edge-based reference videos
- **In-place editing**: Updates existing dataset JSON files
- **Customizable**: Modify the `compute_reference()` function for different conditions (depth, pose, etc.)
> [!TIP]
> You can edit this script to generate other types of reference videos for IC-LoRA training,
> such as depth maps, segmentation masks, or any custom video transformation.
## 🔍 Debugging and Verification Scripts
### Latents Decoding
The `scripts/decode_latents.py` script decodes precomputed video latents back into video files for visual inspection.
```bash
# Basic usage
uv run python scripts/decode_latents.py /path/to/latents/dir \
--output-dir /path/to/output \
--model-path /path/to/ltx-2-model.safetensors
# With VAE tiling for large videos
uv run python scripts/decode_latents.py /path/to/latents/dir \
--output-dir /path/to/output \
--model-path /path/to/ltx-2-model.safetensors \
--vae-tiling
# Decode both video and audio latents
uv run python scripts/decode_latents.py /path/to/latents/dir \
--output-dir /path/to/output \
--model-path /path/to/ltx-2-model.safetensors \
--with-audio
```
**The script will:**
1. **Load the VAE model** from the specified path
2. **Process all `.pt` latent files** in the input directory
3. **Decode each latent** back into a video using the VAE
4. **Save resulting videos** as MP4 files in the output directory
**When to use:**
- **Verify preprocessing quality**: Check that your videos were encoded correctly
- **Debug training data**: Visualize what the model actually sees during training
- **Quality assessment**: Ensure latent encoding preserves important visual details
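For example, to spot-check a handful of samples without decoding an entire cache directory (the paths are placeholders and the three-file limit is arbitrary):
```bash
# Copy a few latent files into a scratch directory
mkdir -p /tmp/latent_check
find /path/to/latents/dir -name "*.pt" | head -3 | xargs -I{} cp {} /tmp/latent_check/

# Decode just those samples for visual inspection
uv run python scripts/decode_latents.py /tmp/latent_check \
    --output-dir /tmp/latent_check_out \
    --model-path /path/to/ltx-2-model.safetensors
```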
### Inference Script
The `scripts/inference.py` script runs inference with a trained model.
> [!TIP]
> For production inference, consider using the [`ltx-pipelines`](../../ltx-pipelines/) package which provides optimized,
> feature-rich pipelines for various use cases:
> - **Text/Image-to-Video**: `TI2VidOneStagePipeline`, `TI2VidTwoStagesPipeline`
> - **Distilled (fast) inference**: `DistilledPipeline`
> - **IC-LoRA video-to-video**: `ICLoraPipeline`
> - **Keyframe interpolation**: `KeyframeInterpolationPipeline`
>
> All pipelines support loading custom LoRAs trained with this trainer.
```bash
# Text-to-video inference (with audio by default)
# By default, uses CFG scale 3.0 and STG scale 1.0 with block 29
uv run python scripts/inference.py \
--checkpoint /path/to/model.safetensors \
--text-encoder-path /path/to/gemma \
--prompt "A cat playing with a ball" \
--output output.mp4
# Video-only (skip audio generation)
uv run python scripts/inference.py \
--checkpoint /path/to/model.safetensors \
--text-encoder-path /path/to/gemma \
--prompt "A cat playing with a ball" \
--skip-audio \
--output output.mp4
# Image-to-video with conditioning image
uv run python scripts/inference.py \
--checkpoint /path/to/model.safetensors \
--text-encoder-path /path/to/gemma \
--prompt "A cat walking" \
--condition-image first_frame.png \
--output output.mp4
# Custom guidance settings
uv run python scripts/inference.py \
--checkpoint /path/to/model.safetensors \
--text-encoder-path /path/to/gemma \
--prompt "A cat playing with a ball" \
--guidance-scale 3.0 \
--stg-scale 1.0 \
--stg-blocks 29 \
--output output.mp4
# Disable STG (CFG only)
uv run python scripts/inference.py \
--checkpoint /path/to/model.safetensors \
--text-encoder-path /path/to/gemma \
--prompt "A cat playing with a ball" \
--stg-scale 0.0 \
--output output.mp4
```
**Guidance parameters:**
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--guidance-scale` | 3.0 | CFG (Classifier-Free Guidance) scale |
| `--stg-scale` | 1.0 | STG (Spatio-Temporal Guidance) scale. 0.0 disables STG |
| `--stg-blocks` | 29 | Transformer block(s) to perturb for STG |
| `--stg-mode` | stg_av | `stg_av` perturbs both audio and video, `stg_v` video only |
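For example, to perturb only the video stream during STG (using the `--stg-mode` values listed in the table above):
```bash
# Video-only STG perturbation; the audio stream is guided by CFG alone
uv run python scripts/inference.py \
    --checkpoint /path/to/model.safetensors \
    --text-encoder-path /path/to/gemma \
    --prompt "A cat playing with a ball" \
    --stg-mode stg_v \
    --output output.mp4
```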
## 🚀 Training Scripts
### Basic and Distributed Training
Use `scripts/train.py` for both single-GPU and multi-GPU runs:
```bash
# Single-GPU training
uv run python scripts/train.py configs/ltx2_av_lora.yaml
# Multi-GPU (uses your accelerate config)
uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml
# Override number of processes
uv run accelerate launch --num_processes 4 scripts/train.py configs/ltx2_av_lora.yaml
```
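If you have not set up Accelerate yet, you can generate a config interactively (this is standard Accelerate usage, not specific to this trainer):
```bash
# Answer the interactive prompts once; later `accelerate launch` calls reuse the config
uv run accelerate config
```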
For detailed usage, see the [Training Guide](training-guide.md).
## 💡 Tips for Using Utility Scripts
- **Start with `--help`**: Always check available options for each script
- **Test on small datasets**: Verify workflows with a few files before processing large datasets
- **Use decode verification**: Always decode a few samples to verify preprocessing quality
- **Monitor VRAM usage**: Use `--use-8bit` or quantization flags when running into memory issues
- **Keep backups**: Make copies of important dataset files before running conversion scripts