Spaces:
Paused
A newer version of the Gradio SDK is available: 6.13.0
Dataset Preparation Guide
This guide covers the complete workflow for preparing and preprocessing your dataset for training.
π Overview
The general dataset preparation workflow is:
- (Optional) Split long videos into scenes using
split_scenes.py - (Optional) Generate captions for your videos using
caption_videos.py - Preprocess your dataset using
process_dataset.pyto compute and cache video/audio latents and text embeddings - Run the trainer with your preprocessed dataset
π¬ Step 1: Split Scenes
If you're starting with raw, long-form videos (e.g., downloaded from YouTube), you should first split them into shorter, coherent scenes.
uv run python scripts/split_scenes.py input.mp4 scenes_output_dir/ \
--filter-shorter-than 5s
This will create multiple video clips in scenes_output_dir.
These clips will be the input for the captioning step, if you choose to use it.
The script supports many configuration options for scene detection (detector algorithms, thresholds, minimum scene lengths, etc.):
uv run python scripts/split_scenes.py --help
π Step 2: Caption Videos
If your dataset doesn't include captions, you can automatically generate them using multimodal models that understand both video and audio.
uv run python scripts/caption_videos.py scenes_output_dir/ \
--output scenes_output_dir/dataset.json
If you're running into VRAM issues, try enabling 8-bit quantization to reduce memory usage:
uv run python scripts/caption_videos.py scenes_output_dir/ \
--output scenes_output_dir/dataset.json \
--use-8bit
This will create a dataset.json file containing video paths and their captions.
Captioning options:
| Option | Description |
|---|---|
--captioner-type |
qwen_omni (default, local) or gemini_flash (API) |
--use-8bit |
Enable 8-bit quantization for lower VRAM usage |
--no-audio |
Disable audio processing (video-only captions) |
--override |
Re-caption files that already have captions |
--api-key |
API key for Gemini Flash (or set GOOGLE_API_KEY env var) |
Caption format:
The captioner produces structured captions with sections for:
- Visual content: People, objects, actions, settings, colors, movements
- Speech transcription: Word-for-word transcription of spoken content
- Sounds: Music, ambient sounds, sound effects
- On-screen text: Any visible text overlays
The automatically generated captions may contain inaccuracies or hallucinated content. We recommend reviewing and correcting the generated captions in your
dataset.jsonfile before proceeding to preprocessing.
β‘ Step 3: Dataset Preprocessing
This step preprocesses your video dataset by:
- Resizing and cropping videos to fit specified resolution buckets
- Computing and caching video latent representations
- Computing and caching text embeddings for captions
- (Optional) Computing and caching audio latents
Basic Usage
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model
With Audio Processing
For audio-video training, add the --with-audio flag:
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model \
--with-audio
π Dataset Format
The trainer supports either videos or single images. Note that your dataset must be homogeneous - either all videos or all images, mixing is not supported.
Image Datasets: When using images, follow the same preprocessing steps and format requirements as with videos, but use
1for the frame count in the resolution bucket (e.g.,960x544x1).
The dataset must be a CSV, JSON, or JSONL metadata file with columns for captions and video paths:
JSON format example:
[
{
"caption": "A cat playing with a ball of yarn",
"media_path": "videos/cat_playing.mp4"
},
{
"caption": "A dog running in the park",
"media_path": "videos/dog_running.mp4"
}
]
JSONL format example:
{"caption": "A cat playing with a ball of yarn", "media_path": "videos/cat_playing.mp4"}
{"caption": "A dog running in the park", "media_path": "videos/dog_running.mp4"}
CSV format example:
caption,media_path
"A cat playing with a ball of yarn","videos/cat_playing.mp4"
"A dog running in the park","videos/dog_running.mp4"
π Resolution Buckets
Videos are organized into "buckets" of specific dimensions (width Γ height Γ frames). Each video is assigned to the nearest matching bucket. You can preprocess with one or multiple resolution buckets. When training with multiple resolution buckets, you must use a batch size of 1.
The dimensions of each bucket must follow these constraints due to LTX-2's VAE architecture:
- Spatial dimensions (width and height) must be multiples of 32
- Number of frames must satisfy
frames % 8 == 1(e.g., 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 121, etc.)
Guidelines for choosing training resolution:
- For high-quality, detailed videos: use larger spatial dimensions (e.g. 768x448) with fewer frames (e.g. 89)
- For longer, motion-focused videos: use smaller spatial dimensions (512Γ512) with more frames (121)
- Memory usage increases with both spatial and temporal dimensions
Example usage:
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model
Multiple buckets are supported by separating entries with ;:
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49;512x512x49" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model
Video processing workflow:
- Videos are resized maintaining aspect ratio until either width or height matches the target
- The larger dimension is center cropped to match the bucket's dimensions
- Only the first X frames are taken to match the bucket's frame count, remaining frames are ignored
The sequence length processed by the transformer model can be calculated as:
sequence_length = (H/32) * (W/32) * ((F-1)/8 + 1)Where:
- H = Height of video
- W = Width of video
- F = Number of frames
- 32 = VAE's spatial downsampling factor
- 8 = VAE's temporal downsampling factor
For example, a 768Γ448Γ89 video would have sequence length:
(768/32) * (448/32) * ((89-1)/8 + 1) = 24 * 14 * 12 = 4,032Keep this in mind when choosing video dimensions, as longer sequences require more GPU memory.
When training with multiple resolution buckets, you must use a batch size of 1 (i.e., set
optimization.batch_size: 1in your training config).
π Output Structure
The preprocessed data is saved in a .precomputed directory:
dataset/
βββ .precomputed/
βββ latents/ # Cached video latents
βββ conditions/ # Cached text embeddings
βββ audio_latents/ # (only if --with-audio) Cached audio latents
βββ reference_latents/ # (only for IC-LoRA) Cached reference video latents
πͺ IC-LoRA Reference Video Preprocessing
For IC-LoRA training, you need to preprocess datasets that include reference videos. Reference videos provide the conditioning input while target videos represent the desired transformed output.
Dataset Format with Reference Videos
JSON format:
[
{
"caption": "A cat playing with a ball of yarn",
"media_path": "videos/cat_playing.mp4",
"reference_path": "references/cat_playing_depth.mp4"
}
]
JSONL format:
{"caption": "A cat playing with a ball of yarn", "media_path": "videos/cat_playing.mp4", "reference_path": "references/cat_playing_depth.mp4"}
{"caption": "A dog running in the park", "media_path": "videos/dog_running.mp4", "reference_path": "references/dog_running_depth.mp4"}
Preprocessing with Reference Videos
To preprocess a dataset with reference videos, add the --reference-column argument specifying the name of the field
in your dataset JSON/JSONL/CSV that contains the reference video paths:
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model \
--reference-column "reference_path"
This will create an additional reference_latents/ directory containing the preprocessed reference video latents.
Generating Reference Videos
Dataset Requirements for IC-LoRA:
- Your dataset must contain paired videos where each target video has a corresponding reference video
- Reference and target videos must have identical resolution and length
- Both reference and target videos should be preprocessed together using the same resolution buckets
We provide an example script, scripts/compute_reference.py, to generate reference
videos for a given dataset. The default implementation generates Canny edge reference videos.
uv run python scripts/compute_reference.py scenes_output_dir/ \
--output scenes_output_dir/dataset.json
The script accepts a JSON file as the dataset configuration and updates it in-place by adding the filenames of the generated reference videos.
If you want to generate a different type of condition (depth maps, pose skeletons, etc.), modify or replace the compute_reference() function within this script.
Example Dataset
For reference, see our Canny Control Dataset which demonstrates proper IC-LoRA dataset structure with paired videos and Canny edge maps.
π― LoRA Trigger Words
When training a LoRA, you can specify a trigger token that will be prepended to all captions:
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model \
--lora-trigger "MYTRIGGER"
This acts as a trigger word that activates the LoRA during inference when you include the same token in your prompts.
There is no need to manually insert the trigger word into your dataset JSON/JSONL/CSV file. The trigger word specified with
--lora-triggeris automatically prepended to each caption during preprocessing.
π Decoding Videos for Verification
If you add the --decode flag, the script will VAE-decode the precomputed latents and save the resulting videos
in .precomputed/decoded_videos. When audio preprocessing is enabled (--with-audio), audio latents will also be
decoded and saved to .precomputed/decoded_audio. This allows you to visually and audibly inspect the processed data.
uv run python scripts/process_dataset.py dataset.json \
--resolution-buckets "960x544x49" \
--model-path /path/to/ltx-2-model.safetensors \
--text-encoder-path /path/to/gemma-model \
--decode
For single-frame images, the decoded latents will be saved as PNG files rather than MP4 videos.
π Next Steps
Once your dataset is preprocessed, you can proceed to:
- Configure your training parameters in Configuration Reference
- Choose your training approach in Training Modes
- Start training with the Training Guide