ltx-2 / packages /ltx-trainer /docs /configuration-reference.md
linoy
inital commit
ebfc6b3

A newer version of the Gradio SDK is available: 6.3.0

Upgrade

Configuration Reference

The trainer uses structured Pydantic models for configuration, making it easy to customize training parameters. This guide covers all available configuration options and their usage.

πŸ“‹ Overview

The main configuration class is LtxTrainerConfig, which includes the following sub-configurations:

  • ModelConfig: Base model and training mode settings
  • LoraConfig: LoRA training parameters
  • TrainingStrategyConfig: Training strategy settings (text-to-video or video-to-video)
  • OptimizationConfig: Learning rate, batch sizes, and scheduler settings
  • AccelerationConfig: Mixed precision and quantization settings
  • DataConfig: Data loading parameters
  • ValidationConfig: Validation and inference settings
  • CheckpointsConfig: Checkpoint saving frequency and retention settings
  • HubConfig: Hugging Face Hub integration settings
  • WandbConfig: Weights & Biases logging settings
  • FlowMatchingConfig: Timestep sampling parameters

πŸ“„ Example Configuration Files

Check out our example configurations in the configs directory:

βš™οΈ Configuration Sections

ModelConfig

Controls the base model and training mode settings.

model:
  model_path: "/path/to/ltx-2-model.safetensors"  # Local path to model checkpoint
  text_encoder_path: "/path/to/gemma-model"       # Path to Gemma text encoder directory
  training_mode: "lora"                           # "lora" or "full"
  load_checkpoint: null                           # Path to checkpoint to resume from

Key parameters:

Parameter Description
model_path Required. Local path to the LTX-2 model checkpoint (.safetensors file). URLs are not supported.
text_encoder_path Required. Path to the Gemma text encoder model directory. Download from HuggingFace.
training_mode Training approach - "lora" for LoRA training or "full" for full-rank fine-tuning.
load_checkpoint Optional path to resume training from a checkpoint file or directory.

LTX-2 requires both a model checkpoint and a Gemma text encoder. Both must be local paths.

LoraConfig

LoRA-specific fine-tuning parameters (only used when training_mode: "lora").

lora:
  rank: 32                        # LoRA rank (higher = more parameters)
  alpha: 32                       # LoRA alpha scaling factor
  dropout: 0.0                    # Dropout probability (0.0-1.0)
  target_modules:                 # Modules to apply LoRA to
    - "to_k"
    - "to_q"
    - "to_v"
    - "to_out.0"

Key parameters:

Parameter Description
rank LoRA rank - higher values mean more trainable parameters (typical range: 8-128)
alpha Alpha scaling factor - typically set equal to rank
dropout Dropout probability for regularization
target_modules List of transformer modules to apply LoRA adapters to (see below)

Understanding Target Modules

The LTX-2 transformer has separate attention and feed-forward blocks for video and audio, as well as cross-attention modules that enable the two modalities to exchange information. Choosing the right target_modules is critical for achieving good results, especially when training with audio.

Video-only modules:

Module Pattern Description
attn1.to_k, attn1.to_q, attn1.to_v, attn1.to_out.0 Video self-attention
attn2.to_k, attn2.to_q, attn2.to_v, attn2.to_out.0 Video cross-attention (to text)
ff.net.0.proj, ff.net.2 Video feed-forward network

Audio-only modules:

Module Pattern Description
audio_attn1.to_k, audio_attn1.to_q, audio_attn1.to_v, audio_attn1.to_out.0 Audio self-attention
audio_attn2.to_k, audio_attn2.to_q, audio_attn2.to_v, audio_attn2.to_out.0 Audio cross-attention (to text)
audio_ff.net.0.proj, audio_ff.net.2 Audio feed-forward network

Audio-video cross-attention modules:

These modules enable bidirectional information flow between the audio and video modalities:

Module Pattern Description
audio_to_video_attn.to_k, audio_to_video_attn.to_q, audio_to_video_attn.to_v, audio_to_video_attn.to_out.0 Video attends to audio (Q from video, K/V from audio)
video_to_audio_attn.to_k, video_to_audio_attn.to_q, video_to_audio_attn.to_v, video_to_audio_attn.to_out.0 Audio attends to video (Q from audio, K/V from video)

Recommended configurations:

For video-only training, target the video attention layers:

target_modules:
  - "attn1.to_k"
  - "attn1.to_q"
  - "attn1.to_v"
  - "attn1.to_out.0"
  - "attn2.to_k"
  - "attn2.to_q"
  - "attn2.to_v"
  - "attn2.to_out.0"

For audio-video training, use patterns that match both branches:

target_modules:
  - "to_k"
  - "to_q"
  - "to_v"
  - "to_out.0"

Using shorter patterns like "to_k" will match all attention modules including attn1.to_k, audio_attn1.to_k, audio_to_video_attn.to_k, and video_to_audio_attn.to_k, effectively training video, audio, and cross-modal attention branches together.

You can also target the feed-forward (FFN) modules (ff.net.0.proj, ff.net.2 for video, audio_ff.net.0.proj, audio_ff.net.2 for audio) to increase the LoRA's capacity and potentially help it capture the target distribution better.

TrainingStrategyConfig

Configures the training strategy. This replaces the legacy ConditioningConfig.

Text-to-Video Strategy

training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1   # Probability of first-frame conditioning
  with_audio: false                 # Enable joint audio-video training
  audio_latents_dir: "audio_latents"  # Directory for audio latents (when with_audio: true)

Video-to-Video Strategy (IC-LoRA)

training_strategy:
  name: "video_to_video"
  first_frame_conditioning_p: 0.1
  reference_latents_dir: "reference_latents"  # Directory for reference video latents

Key parameters:

Parameter Description
name Strategy type: "text_to_video" or "video_to_video"
first_frame_conditioning_p Probability of using first frame as conditioning (0.0-1.0)
with_audio (text_to_video only) Enable joint audio-video training
audio_latents_dir (text_to_video only) Directory name for audio latents
reference_latents_dir (video_to_video only) Directory name for reference video latents

OptimizationConfig

Training optimization parameters including learning rates, batch sizes, and schedulers.

optimization:
  learning_rate: 1e-4              # Learning rate
  steps: 2000                      # Total training steps
  batch_size: 1                    # Batch size per GPU
  gradient_accumulation_steps: 1   # Steps to accumulate gradients
  max_grad_norm: 1.0               # Gradient clipping threshold
  optimizer_type: "adamw"          # "adamw" or "adamw8bit"
  scheduler_type: "linear"         # Scheduler type
  scheduler_params: {}             # Additional scheduler parameters
  enable_gradient_checkpointing: true  # Memory optimization

Key parameters:

Parameter Description
learning_rate Learning rate for optimization (typical range: 1e-5 to 1e-3)
steps Total number of training steps
batch_size Batch size per GPU (reduce if running out of memory)
gradient_accumulation_steps Accumulate gradients over multiple steps
scheduler_type LR scheduler: "constant", "linear", "cosine", "cosine_with_restarts", "polynomial"
enable_gradient_checkpointing Trade training speed for GPU memory savings (recommended for large models)

AccelerationConfig

Hardware acceleration and compute optimization settings.

acceleration:
  mixed_precision_mode: "bf16"     # "no", "fp16", or "bf16"
  quantization: null               # Quantization options
  load_text_encoder_in_8bit: false # Load text encoder in 8-bit

Key parameters:

Parameter Description
mixed_precision_mode Precision mode - "bf16" recommended for modern GPUs
quantization Model quantization: null, "int8-quanto", "int4-quanto", "fp8-quanto", etc.
load_text_encoder_in_8bit Load the Gemma text encoder in 8-bit to save GPU memory

DataConfig

Data loading and processing configuration.

data:
  preprocessed_data_root: "/path/to/preprocessed/data"  # Path to precomputed dataset
  num_dataloader_workers: 2                             # Background data loading workers

Key parameters:

Parameter Description
preprocessed_data_root Path to your preprocessed dataset (contains latents/, conditions/, etc.)
num_dataloader_workers Number of parallel data loading processes (0 = synchronous loading, useful when debugging)

ValidationConfig

Validation and inference settings for monitoring training progress.

validation:
  prompts:                         # Validation prompts
    - "A cat playing with a ball"
    - "A dog running in a field"
  negative_prompt: "worst quality, inconsistent motion, blurry, jittery, distorted"
  images: null                     # Optional image paths for image-to-video
  reference_videos: null           # Reference video paths (IC-LoRA only)
  video_dims: [576, 576, 89]       # Video dimensions [width, height, frames]
  frame_rate: 25.0                 # Frame rate for generated videos
  seed: 42                         # Random seed for reproducibility
  inference_steps: 30              # Number of inference steps
  interval: 100                    # Steps between validation runs
  videos_per_prompt: 1             # Videos generated per prompt
  guidance_scale: 3.0              # CFG guidance strength
  stg_scale: 1.0                   # STG guidance strength (0.0 to disable)
  stg_blocks: [29]                 # Transformer blocks to perturb for STG
  stg_mode: "stg_av"               # "stg_av" or "stg_v" (video only)
  generate_audio: true             # Whether to generate audio
  skip_initial_validation: false   # Skip validation at step 0
  include_reference_in_output: false  # Include reference video side-by-side (IC-LoRA)

Key parameters:

Parameter Description
prompts List of text prompts for validation video generation
images List of image paths for image-to-video validation (must match number of prompts)
reference_videos List of reference video paths for IC-LoRA validation (must match number of prompts)
video_dims Output dimensions [width, height, frames]. Width/height must be divisible by 32, frames must satisfy frames % 8 == 1
interval Steps between validation runs (set to null to disable)
guidance_scale CFG (Classifier-Free Guidance) scale. Recommended: 3.0
stg_scale STG (Spatio-Temporal Guidance) scale. 0.0 disables STG. Recommended: 1.0
stg_blocks Transformer blocks to perturb for STG. Recommended: [29] (single block)
stg_mode STG mode: "stg_av" perturbs both audio and video, "stg_v" perturbs video only
generate_audio Whether to generate audio in validation samples
include_reference_in_output For IC-LoRA: concatenate reference video side-by-side with output

CheckpointsConfig

Model checkpointing configuration.

checkpoints:
  interval: 250       # Steps between checkpoint saves (null = disabled)
  keep_last_n: 3      # Number of recent checkpoints to retain

Key parameters:

Parameter Description
interval Steps between intermediate checkpoint saves (set to null to disable)
keep_last_n Number of most recent checkpoints to keep (-1 = keep all)

HubConfig

Hugging Face Hub integration for automatic model uploads.

hub:
  push_to_hub: false                    # Enable Hub uploading
  hub_model_id: "username/model-name"   # Hub repository ID

Key parameters:

Parameter Description
push_to_hub Whether to automatically push trained models to Hugging Face Hub
hub_model_id Repository ID in format "username/repository-name"

WandbConfig

Weights & Biases logging configuration.

wandb:
  enabled: false              # Enable W&B logging
  project: "ltx-2-trainer"    # W&B project name
  entity: null                # W&B username or team
  tags: []                    # Tags for the run
  log_validation_videos: true # Log validation videos to W&B

Key parameters:

Parameter Description
enabled Whether to enable W&B logging
project W&B project name
entity W&B username or team (null uses default account)
log_validation_videos Whether to log validation videos to W&B

FlowMatchingConfig

Flow matching training configuration for timestep sampling.

flow_matching:
  timestep_sampling_mode: "shifted_logit_normal"  # Timestep sampling strategy
  timestep_sampling_params: {}                     # Additional sampling parameters

Key parameters:

Parameter Description
timestep_sampling_mode Sampling strategy: "uniform" or "shifted_logit_normal"
timestep_sampling_params Additional parameters for the sampling strategy

πŸš€ Next Steps

Once you've configured your training parameters: