Spaces:

Lightricks
/

ltx-2

Paused

App Files Files Community

ltx-2 / packages /ltx-trainer /docs /dataset-preparation.md

linoy

inital commit

ebfc6b3 4 months ago

preview code

raw

history blame contribute delete

11.9 kB

A newer version of the Gradio SDK is available: 6.13.0

Upgrade

Dataset Preparation Guide

This guide covers the complete workflow for preparing and preprocessing your dataset for training.

📋 Overview

The general dataset preparation workflow is:

(Optional) Split long videos into scenes using split_scenes.py
(Optional) Generate captions for your videos using caption_videos.py
Preprocess your dataset using process_dataset.py to compute and cache video/audio latents and text embeddings
Run the trainer with your preprocessed dataset

🎬 Step 1: Split Scenes

If you're starting with raw, long-form videos (e.g., downloaded from YouTube), you should first split them into shorter, coherent scenes.

uv run python scripts/split_scenes.py input.mp4 scenes_output_dir/ \
    --filter-shorter-than 5s

This will create multiple video clips in scenes_output_dir. These clips will be the input for the captioning step, if you choose to use it.

The script supports many configuration options for scene detection (detector algorithms, thresholds, minimum scene lengths, etc.):

uv run python scripts/split_scenes.py --help

📝 Step 2: Caption Videos

If your dataset doesn't include captions, you can automatically generate them using multimodal models that understand both video and audio.

uv run python scripts/caption_videos.py scenes_output_dir/ \
    --output scenes_output_dir/dataset.json

If you're running into VRAM issues, try enabling 8-bit quantization to reduce memory usage:

uv run python scripts/caption_videos.py scenes_output_dir/ \
    --output scenes_output_dir/dataset.json \
    --use-8bit

This will create a dataset.json file containing video paths and their captions.

Captioning options:

Option	Description
`--captioner-type`	`qwen_omni` (default, local) or `gemini_flash` (API)
`--use-8bit`	Enable 8-bit quantization for lower VRAM usage
`--no-audio`	Disable audio processing (video-only captions)
`--override`	Re-caption files that already have captions
`--api-key`	API key for Gemini Flash (or set `GOOGLE_API_KEY` env var)

Caption format:

The captioner produces structured captions with sections for:

Visual content: People, objects, actions, settings, colors, movements
Speech transcription: Word-for-word transcription of spoken content
Sounds: Music, ambient sounds, sound effects
On-screen text: Any visible text overlays

The automatically generated captions may contain inaccuracies or hallucinated content. We recommend reviewing and correcting the generated captions in your dataset.json file before proceeding to preprocessing.

⚡ Step 3: Dataset Preprocessing

This step preprocesses your video dataset by:

Resizing and cropping videos to fit specified resolution buckets
Computing and caching video latent representations
Computing and caching text embeddings for captions
(Optional) Computing and caching audio latents

Basic Usage

uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model

With Audio Processing

For audio-video training, add the --with-audio flag:

uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model \
    --with-audio

📊 Dataset Format

The trainer supports either videos or single images. Note that your dataset must be homogeneous - either all videos or all images, mixing is not supported.

Image Datasets: When using images, follow the same preprocessing steps and format requirements as with videos, but use 1 for the frame count in the resolution bucket (e.g., 960x544x1).

The dataset must be a CSV, JSON, or JSONL metadata file with columns for captions and video paths:

JSON format example:

[
  {
    "caption": "A cat playing with a ball of yarn",
    "media_path": "videos/cat_playing.mp4"
  },
  {
    "caption": "A dog running in the park",
    "media_path": "videos/dog_running.mp4"
  }
]

JSONL format example:

{"caption": "A cat playing with a ball of yarn", "media_path": "videos/cat_playing.mp4"}
{"caption": "A dog running in the park", "media_path": "videos/dog_running.mp4"}

CSV format example:

caption,media_path
"A cat playing with a ball of yarn","videos/cat_playing.mp4"
"A dog running in the park","videos/dog_running.mp4"

📐 Resolution Buckets

Videos are organized into "buckets" of specific dimensions (width × height × frames). Each video is assigned to the nearest matching bucket. You can preprocess with one or multiple resolution buckets. When training with multiple resolution buckets, you must use a batch size of 1.

The dimensions of each bucket must follow these constraints due to LTX-2's VAE architecture:

Spatial dimensions (width and height) must be multiples of 32
Number of frames must satisfy frames % 8 == 1 (e.g., 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 121, etc.)

Guidelines for choosing training resolution:

For high-quality, detailed videos: use larger spatial dimensions (e.g. 768x448) with fewer frames (e.g. 89)
For longer, motion-focused videos: use smaller spatial dimensions (512×512) with more frames (121)
Memory usage increases with both spatial and temporal dimensions

Example usage:

uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model

Multiple buckets are supported by separating entries with ;:

uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49;512x512x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model

Video processing workflow:

Videos are resized maintaining aspect ratio until either width or height matches the target
The larger dimension is center cropped to match the bucket's dimensions
Only the first X frames are taken to match the bucket's frame count, remaining frames are ignored

The sequence length processed by the transformer model can be calculated as:
sequence_length = (H/32) * (W/32) * ((F-1)/8 + 1)
Where:

H = Height of video

W = Width of video

F = Number of frames

32 = VAE's spatial downsampling factor

8 = VAE's temporal downsampling factor

For example, a 768×448×89 video would have sequence length:
(768/32) * (448/32) * ((89-1)/8 + 1) = 24 * 14 * 12 = 4,032
Keep this in mind when choosing video dimensions, as longer sequences require more GPU memory.

When training with multiple resolution buckets, you must use a batch size of 1 (i.e., set optimization.batch_size: 1 in your training config).

📁 Output Structure

The preprocessed data is saved in a .precomputed directory:

dataset/
└── .precomputed/
    ├── latents/            # Cached video latents
    ├── conditions/         # Cached text embeddings
    ├── audio_latents/      # (only if --with-audio) Cached audio latents
    └── reference_latents/  # (only for IC-LoRA) Cached reference video latents

🪄 IC-LoRA Reference Video Preprocessing

For IC-LoRA training, you need to preprocess datasets that include reference videos. Reference videos provide the conditioning input while target videos represent the desired transformed output.

Dataset Format with Reference Videos

JSON format:

[
  {
    "caption": "A cat playing with a ball of yarn",
    "media_path": "videos/cat_playing.mp4",
    "reference_path": "references/cat_playing_depth.mp4"
  }
]

JSONL format:

{"caption": "A cat playing with a ball of yarn", "media_path": "videos/cat_playing.mp4", "reference_path": "references/cat_playing_depth.mp4"}
{"caption": "A dog running in the park", "media_path": "videos/dog_running.mp4", "reference_path": "references/dog_running_depth.mp4"}

Preprocessing with Reference Videos

To preprocess a dataset with reference videos, add the --reference-column argument specifying the name of the field in your dataset JSON/JSONL/CSV that contains the reference video paths:

uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model \
    --reference-column "reference_path"

This will create an additional reference_latents/ directory containing the preprocessed reference video latents.

Generating Reference Videos

Dataset Requirements for IC-LoRA:

Your dataset must contain paired videos where each target video has a corresponding reference video
Reference and target videos must have identical resolution and length
Both reference and target videos should be preprocessed together using the same resolution buckets

We provide an example script, scripts/compute_reference.py, to generate reference videos for a given dataset. The default implementation generates Canny edge reference videos.

uv run python scripts/compute_reference.py scenes_output_dir/ \
    --output scenes_output_dir/dataset.json

The script accepts a JSON file as the dataset configuration and updates it in-place by adding the filenames of the generated reference videos.

If you want to generate a different type of condition (depth maps, pose skeletons, etc.), modify or replace the compute_reference() function within this script.

Example Dataset

For reference, see our Canny Control Dataset which demonstrates proper IC-LoRA dataset structure with paired videos and Canny edge maps.

🎯 LoRA Trigger Words

When training a LoRA, you can specify a trigger token that will be prepended to all captions:

uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model \
    --lora-trigger "MYTRIGGER"

This acts as a trigger word that activates the LoRA during inference when you include the same token in your prompts.

There is no need to manually insert the trigger word into your dataset JSON/JSONL/CSV file. The trigger word specified with --lora-trigger is automatically prepended to each caption during preprocessing.

🔍 Decoding Videos for Verification

If you add the --decode flag, the script will VAE-decode the precomputed latents and save the resulting videos in .precomputed/decoded_videos. When audio preprocessing is enabled (--with-audio), audio latents will also be decoded and saved to .precomputed/decoded_audio. This allows you to visually and audibly inspect the processed data.

uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets "960x544x49" \
    --model-path /path/to/ltx-2-model.safetensors \
    --text-encoder-path /path/to/gemma-model \
    --decode

For single-frame images, the decoded latents will be saved as PNG files rather than MP4 videos.

🚀 Next Steps

Once your dataset is preprocessed, you can proceed to:

Configure your training parameters in Configuration Reference
Choose your training approach in Training Modes
Start training with the Training Guide