# Configuration Reference
The trainer uses structured Pydantic models for configuration, making it easy to customize training parameters. This guide covers all available configuration options and their usage.
## Overview
The main configuration class is `LtxTrainerConfig`, which includes the following sub-configurations:

- `ModelConfig`: Base model and training mode settings
- `LoraConfig`: LoRA training parameters
- `TrainingStrategyConfig`: Training strategy settings (text-to-video or video-to-video)
- `OptimizationConfig`: Learning rate, batch sizes, and scheduler settings
- `AccelerationConfig`: Mixed precision and quantization settings
- `DataConfig`: Data loading parameters
- `ValidationConfig`: Validation and inference settings
- `CheckpointsConfig`: Checkpoint saving frequency and retention settings
- `HubConfig`: Hugging Face Hub integration settings
- `WandbConfig`: Weights & Biases logging settings
- `FlowMatchingConfig`: Timestep sampling parameters
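
Each sub-configuration maps to a top-level section of a single YAML config file. As a minimal sketch (paths are placeholders; every field shown is documented in the sections below):

```yaml
model:
  model_path: "/path/to/ltx-2-model.safetensors"
  text_encoder_path: "/path/to/gemma-model"
  training_mode: "lora"

lora:
  rank: 32
  alpha: 32

training_strategy:
  name: "text_to_video"

optimization:
  learning_rate: 1e-4
  steps: 2000

data:
  preprocessed_data_root: "/path/to/preprocessed/data"
```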
## Example Configuration Files
Check out our example configurations in the `configs` directory:

- Audio-Video LoRA Training - Joint audio-video generation training
- IC-LoRA Training - Video-to-video transformation training
## Configuration Sections
### ModelConfig
Controls the base model and training mode settings.

```yaml
model:
  model_path: "/path/to/ltx-2-model.safetensors"  # Local path to model checkpoint
  text_encoder_path: "/path/to/gemma-model"       # Path to Gemma text encoder directory
  training_mode: "lora"                           # "lora" or "full"
  load_checkpoint: null                           # Path to checkpoint to resume from
```
Key parameters:

| Parameter | Description |
|---|---|
| `model_path` | **Required.** Local path to the LTX-2 model checkpoint (`.safetensors` file). URLs are not supported. |
| `text_encoder_path` | **Required.** Path to the Gemma text encoder model directory. Download from HuggingFace. |
| `training_mode` | Training approach: `"lora"` for LoRA training or `"full"` for full-rank fine-tuning. |
| `load_checkpoint` | Optional path to resume training from a checkpoint file or directory. |
LTX-2 requires both a model checkpoint and a Gemma text encoder. Both must be local paths.
### LoraConfig

LoRA-specific fine-tuning parameters (only used when `training_mode: "lora"`).

```yaml
lora:
  rank: 32           # LoRA rank (higher = more parameters)
  alpha: 32          # LoRA alpha scaling factor
  dropout: 0.0       # Dropout probability (0.0-1.0)
  target_modules:    # Modules to apply LoRA to
    - "to_k"
    - "to_q"
    - "to_v"
    - "to_out.0"
```
Key parameters:

| Parameter | Description |
|---|---|
| `rank` | LoRA rank - higher values mean more trainable parameters (typical range: 8-128) |
| `alpha` | Alpha scaling factor - typically set equal to rank |
| `dropout` | Dropout probability for regularization |
| `target_modules` | List of transformer modules to apply LoRA adapters to (see below) |
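
For intuition on why `alpha` is typically set equal to `rank`: in the standard LoRA formulation (general background, not specific to this trainer), the adapter update is scaled by $\alpha / r$, so keeping $\alpha = r$ holds the update's magnitude roughly constant when you change the rank:

$$
W' = W + \frac{\alpha}{r}\,BA, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k}
$$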
#### Understanding Target Modules
The LTX-2 transformer has separate attention and feed-forward blocks for video and audio, as well as cross-attention
modules that enable the two modalities to exchange information. Choosing the right target_modules is critical for
achieving good results, especially when training with audio.
Video-only modules:

| Module Pattern | Description |
|---|---|
| `attn1.to_k`, `attn1.to_q`, `attn1.to_v`, `attn1.to_out.0` | Video self-attention |
| `attn2.to_k`, `attn2.to_q`, `attn2.to_v`, `attn2.to_out.0` | Video cross-attention (to text) |
| `ff.net.0.proj`, `ff.net.2` | Video feed-forward network |
Audio-only modules:

| Module Pattern | Description |
|---|---|
| `audio_attn1.to_k`, `audio_attn1.to_q`, `audio_attn1.to_v`, `audio_attn1.to_out.0` | Audio self-attention |
| `audio_attn2.to_k`, `audio_attn2.to_q`, `audio_attn2.to_v`, `audio_attn2.to_out.0` | Audio cross-attention (to text) |
| `audio_ff.net.0.proj`, `audio_ff.net.2` | Audio feed-forward network |
Audio-video cross-attention modules:
These modules enable bidirectional information flow between the audio and video modalities:

| Module Pattern | Description |
|---|---|
| `audio_to_video_attn.to_k`, `audio_to_video_attn.to_q`, `audio_to_video_attn.to_v`, `audio_to_video_attn.to_out.0` | Video attends to audio (Q from video, K/V from audio) |
| `video_to_audio_attn.to_k`, `video_to_audio_attn.to_q`, `video_to_audio_attn.to_v`, `video_to_audio_attn.to_out.0` | Audio attends to video (Q from audio, K/V from video) |
Recommended configurations:
For video-only training, target the video attention layers:

```yaml
target_modules:
  - "attn1.to_k"
  - "attn1.to_q"
  - "attn1.to_v"
  - "attn1.to_out.0"
  - "attn2.to_k"
  - "attn2.to_q"
  - "attn2.to_v"
  - "attn2.to_out.0"
```
For audio-video training, use patterns that match both branches:

```yaml
target_modules:
  - "to_k"
  - "to_q"
  - "to_v"
  - "to_out.0"
```
Using shorter patterns like `"to_k"` will match all attention modules, including `attn1.to_k`, `audio_attn1.to_k`, `audio_to_video_attn.to_k`, and `video_to_audio_attn.to_k`, effectively training the video, audio, and cross-modal attention branches together.
You can also target the feed-forward (FFN) modules (`ff.net.0.proj` and `ff.net.2` for video, `audio_ff.net.0.proj` and `audio_ff.net.2` for audio) to increase the LoRA's capacity and potentially help it capture the target distribution better.
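
Combining the two notes above, a sketch of an audio-video recipe that also adapts the feed-forward branches (assuming the same substring matching applies, the short FFN patterns should match the `audio_ff.*` modules as well):

```yaml
target_modules:
  # Matches video, audio, and cross-modal attention projections
  - "to_k"
  - "to_q"
  - "to_v"
  - "to_out.0"
  # By substring matching, these should cover both ff.net.* and audio_ff.net.*
  - "ff.net.0.proj"
  - "ff.net.2"
```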
### TrainingStrategyConfig

Configures the training strategy. This replaces the legacy `ConditioningConfig`.
#### Text-to-Video Strategy

```yaml
training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1     # Probability of first-frame conditioning
  with_audio: false                   # Enable joint audio-video training
  audio_latents_dir: "audio_latents"  # Directory for audio latents (when with_audio: true)
```
#### Video-to-Video Strategy (IC-LoRA)

```yaml
training_strategy:
  name: "video_to_video"
  first_frame_conditioning_p: 0.1
  reference_latents_dir: "reference_latents"  # Directory for reference video latents
```
Key parameters:

| Parameter | Description |
|---|---|
| `name` | Strategy type: `"text_to_video"` or `"video_to_video"` |
| `first_frame_conditioning_p` | Probability of using the first frame as conditioning (0.0-1.0) |
| `with_audio` | (`text_to_video` only) Enable joint audio-video training |
| `audio_latents_dir` | (`text_to_video` only) Directory name for audio latents |
| `reference_latents_dir` | (`video_to_video` only) Directory name for reference video latents |
### OptimizationConfig
Training optimization parameters including learning rates, batch sizes, and schedulers.

```yaml
optimization:
  learning_rate: 1e-4                  # Learning rate
  steps: 2000                          # Total training steps
  batch_size: 1                        # Batch size per GPU
  gradient_accumulation_steps: 1       # Steps to accumulate gradients
  max_grad_norm: 1.0                   # Gradient clipping threshold
  optimizer_type: "adamw"              # "adamw" or "adamw8bit"
  scheduler_type: "linear"             # Scheduler type
  scheduler_params: {}                 # Additional scheduler parameters
  enable_gradient_checkpointing: true  # Memory optimization
```
Key parameters:

| Parameter | Description |
|---|---|
| `learning_rate` | Learning rate for optimization (typical range: 1e-5 to 1e-3) |
| `steps` | Total number of training steps |
| `batch_size` | Batch size per GPU (reduce if running out of memory) |
| `gradient_accumulation_steps` | Accumulate gradients over multiple steps |
| `scheduler_type` | LR scheduler: `"constant"`, `"linear"`, `"cosine"`, `"cosine_with_restarts"`, `"polynomial"` |
| `enable_gradient_checkpointing` | Trade training speed for GPU memory savings (recommended for large models) |
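
When GPU memory is tight, keep the per-GPU batch small and recover the effective batch size through accumulation: in general, effective batch size = `batch_size` × `gradient_accumulation_steps` × number of GPUs. A minimal sketch for a single GPU:

```yaml
optimization:
  batch_size: 1                        # Small enough to fit in memory
  gradient_accumulation_steps: 8       # 1 x 8 x 1 GPU = effective batch size of 8
  enable_gradient_checkpointing: true  # Trades speed for additional memory savings
```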
### AccelerationConfig
Hardware acceleration and compute optimization settings.

```yaml
acceleration:
  mixed_precision_mode: "bf16"      # "no", "fp16", or "bf16"
  quantization: null                # Quantization options
  load_text_encoder_in_8bit: false  # Load text encoder in 8-bit
```
Key parameters:

| Parameter | Description |
|---|---|
| `mixed_precision_mode` | Precision mode - `"bf16"` recommended for modern GPUs |
| `quantization` | Model quantization: `null`, `"int8-quanto"`, `"int4-quanto"`, `"fp8-quanto"`, etc. |
| `load_text_encoder_in_8bit` | Load the Gemma text encoder in 8-bit to save GPU memory |
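
These options compose. A sketch of a memory-lean setup (whether a given quantization mode suits your GPU and training mode is worth verifying empirically):

```yaml
acceleration:
  mixed_precision_mode: "bf16"     # Recommended on modern GPUs
  quantization: "int8-quanto"      # Quantize the model to reduce memory
  load_text_encoder_in_8bit: true  # Shrink the Gemma text encoder as well
```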
### DataConfig
Data loading and processing configuration.

```yaml
data:
  preprocessed_data_root: "/path/to/preprocessed/data"  # Path to precomputed dataset
  num_dataloader_workers: 2                             # Background data loading workers
```
Key parameters:

| Parameter | Description |
|---|---|
| `preprocessed_data_root` | Path to your preprocessed dataset (contains `latents/`, `conditions/`, etc.) |
| `num_dataloader_workers` | Number of parallel data loading processes (0 = synchronous loading, useful when debugging) |
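
As a rough sketch of what `preprocessed_data_root` holds (the subdirectory names below come from this guide; the exact layout is produced by the preprocessing step, so treat this as illustrative):

```yaml
data:
  # /path/to/preprocessed/data/
  #   latents/             # precomputed video latents
  #   conditions/          # precomputed text-encoder outputs
  #   audio_latents/       # only when training_strategy.with_audio is true
  #   reference_latents/   # only for the video_to_video strategy
  preprocessed_data_root: "/path/to/preprocessed/data"
  num_dataloader_workers: 2
```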
### ValidationConfig
Validation and inference settings for monitoring training progress.

```yaml
validation:
  prompts:                            # Validation prompts
    - "A cat playing with a ball"
    - "A dog running in a field"
  negative_prompt: "worst quality, inconsistent motion, blurry, jittery, distorted"
  images: null                        # Optional image paths for image-to-video
  reference_videos: null              # Reference video paths (IC-LoRA only)
  video_dims: [576, 576, 89]          # Video dimensions [width, height, frames]
  frame_rate: 25.0                    # Frame rate for generated videos
  seed: 42                            # Random seed for reproducibility
  inference_steps: 30                 # Number of inference steps
  interval: 100                       # Steps between validation runs
  videos_per_prompt: 1                # Videos generated per prompt
  guidance_scale: 3.0                 # CFG guidance strength
  stg_scale: 1.0                      # STG guidance strength (0.0 to disable)
  stg_blocks: [29]                    # Transformer blocks to perturb for STG
  stg_mode: "stg_av"                  # "stg_av" or "stg_v" (video only)
  generate_audio: true                # Whether to generate audio
  skip_initial_validation: false      # Skip validation at step 0
  include_reference_in_output: false  # Include reference video side-by-side (IC-LoRA)
```
Key parameters:

| Parameter | Description |
|---|---|
| `prompts` | List of text prompts for validation video generation |
| `images` | List of image paths for image-to-video validation (must match number of prompts) |
| `reference_videos` | List of reference video paths for IC-LoRA validation (must match number of prompts) |
| `video_dims` | Output dimensions [width, height, frames]. Width/height must be divisible by 32; frames must satisfy `frames % 8 == 1` |
| `interval` | Steps between validation runs (set to `null` to disable) |
| `guidance_scale` | CFG (Classifier-Free Guidance) scale. Recommended: 3.0 |
| `stg_scale` | STG (Spatio-Temporal Guidance) scale. 0.0 disables STG. Recommended: 1.0 |
| `stg_blocks` | Transformer blocks to perturb for STG. Recommended: `[29]` (single block) |
| `stg_mode` | STG mode: `"stg_av"` perturbs both audio and video, `"stg_v"` perturbs video only |
| `generate_audio` | Whether to generate audio in validation samples |
| `include_reference_in_output` | For IC-LoRA: concatenate reference video side-by-side with output |
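
Checking a candidate `video_dims` against these constraints is quick arithmetic. For example:

```yaml
# 576 x 576, 89 frames: 576 / 32 = 18 (divisible), 89 % 8 = 1 (valid)
video_dims: [576, 576, 89]
# A hypothetical 704 x 480, 121-frame setting would also pass:
# 704 / 32 = 22, 480 / 32 = 15, 121 % 8 = 1
```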
### CheckpointsConfig
Model checkpointing configuration.

```yaml
checkpoints:
  interval: 250   # Steps between checkpoint saves (null = disabled)
  keep_last_n: 3  # Number of recent checkpoints to retain
```
Key parameters:

| Parameter | Description |
|---|---|
| `interval` | Steps between intermediate checkpoint saves (set to `null` to disable) |
| `keep_last_n` | Number of most recent checkpoints to keep (-1 = keep all) |
### HubConfig
Hugging Face Hub integration for automatic model uploads.

```yaml
hub:
  push_to_hub: false                   # Enable Hub uploading
  hub_model_id: "username/model-name"  # Hub repository ID
```
Key parameters:

| Parameter | Description |
|---|---|
| `push_to_hub` | Whether to automatically push trained models to Hugging Face Hub |
| `hub_model_id` | Repository ID in format `"username/repository-name"` |
### WandbConfig
Weights & Biases logging configuration.

```yaml
wandb:
  enabled: false               # Enable W&B logging
  project: "ltx-2-trainer"     # W&B project name
  entity: null                 # W&B username or team
  tags: []                     # Tags for the run
  log_validation_videos: true  # Log validation videos to W&B
```
Key parameters:

| Parameter | Description |
|---|---|
| `enabled` | Whether to enable W&B logging |
| `project` | W&B project name |
| `entity` | W&B username or team (`null` uses default account) |
| `log_validation_videos` | Whether to log validation videos to W&B |
### FlowMatchingConfig
Flow matching training configuration for timestep sampling.

```yaml
flow_matching:
  timestep_sampling_mode: "shifted_logit_normal"  # Timestep sampling strategy
  timestep_sampling_params: {}                    # Additional sampling parameters
```
Key parameters:

| Parameter | Description |
|---|---|
| `timestep_sampling_mode` | Sampling strategy: `"uniform"` or `"shifted_logit_normal"` |
| `timestep_sampling_params` | Additional parameters for the sampling strategy |
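
The exact parameterization of `"shifted_logit_normal"` lives in the code and `timestep_sampling_params`. As intuition only (a common formulation in flow-matching trainers, not a statement about this codebase), logit-normal sampling draws a normal sample, squashes it through a sigmoid, and applies a shift factor $s$:

$$
x \sim \mathcal{N}(\mu, \sigma^2), \qquad t = \operatorname{sigmoid}(x), \qquad t_{\text{shifted}} = \frac{s\,t}{1 + (s - 1)\,t}
$$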
## Next Steps
Once you've configured your training parameters:
- Set up your dataset using Dataset Preparation
- Choose your training approach in Training Modes
- Start training with the Training Guide