# Configuration Reference
The trainer uses structured Pydantic models for configuration, making it easy to customize training parameters. This guide covers all available configuration options and their usage.
## Overview
The main configuration class is `LtxTrainerConfig`, which includes the following sub-configurations:

- `ModelConfig`: Base model and training mode settings
- `LoraConfig`: LoRA training parameters
- `TrainingStrategyConfig`: Training strategy settings (text-to-video or video-to-video)
- `OptimizationConfig`: Learning rate, batch sizes, and scheduler settings
- `AccelerationConfig`: Mixed precision and quantization settings
- `DataConfig`: Data loading parameters
- `ValidationConfig`: Validation and inference settings
- `CheckpointsConfig`: Checkpoint saving frequency and retention settings
- `HubConfig`: Hugging Face Hub integration settings
- `WandbConfig`: Weights & Biases logging settings
- `FlowMatchingConfig`: Timestep sampling parameters
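
Each sub-configuration maps to a top-level section of a single YAML config file. As a minimal sketch (paths are placeholders; every field shown is documented in the sections below):

```yaml
model:
  model_path: "/path/to/ltx-2-model.safetensors"
  text_encoder_path: "/path/to/gemma-model"
  training_mode: "lora"

lora:
  rank: 32
  alpha: 32

training_strategy:
  name: "text_to_video"

optimization:
  learning_rate: 1e-4
  steps: 2000

data:
  preprocessed_data_root: "/path/to/preprocessed/data"
```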
## Example Configuration Files
Check out our example configurations in the `configs` directory:

- Audio-Video LoRA Training - Joint audio-video generation training
- IC-LoRA Training - Video-to-video transformation training
## Configuration Sections
### ModelConfig
Controls the base model and training mode settings.

```yaml
model:
  model_path: "/path/to/ltx-2-model.safetensors"  # Local path to model checkpoint
  text_encoder_path: "/path/to/gemma-model"       # Path to Gemma text encoder directory
  training_mode: "lora"                           # "lora" or "full"
  load_checkpoint: null                           # Path to checkpoint to resume from
```
Key parameters:

| Parameter | Description |
|---|---|
| `model_path` | **Required.** Local path to the LTX-2 model checkpoint (`.safetensors` file). URLs are not supported. |
| `text_encoder_path` | **Required.** Path to the Gemma text encoder model directory. Download from HuggingFace. |
| `training_mode` | Training approach: `"lora"` for LoRA training or `"full"` for full-rank fine-tuning. |
| `load_checkpoint` | Optional path to resume training from a checkpoint file or directory. |
LTX-2 requires both a model checkpoint and a Gemma text encoder. Both must be local paths.
### LoraConfig

LoRA-specific fine-tuning parameters (only used when `training_mode: "lora"`).

```yaml
lora:
  rank: 32           # LoRA rank (higher = more parameters)
  alpha: 32          # LoRA alpha scaling factor
  dropout: 0.0       # Dropout probability (0.0-1.0)
  target_modules:    # Modules to apply LoRA to
    - "to_k"
    - "to_q"
    - "to_v"
    - "to_out.0"
```
Key parameters:

| Parameter | Description |
|---|---|
| `rank` | LoRA rank - higher values mean more trainable parameters (typical range: 8-128) |
| `alpha` | Alpha scaling factor - typically set equal to rank |
| `dropout` | Dropout probability for regularization |
| `target_modules` | List of transformer modules to apply LoRA adapters to (see below) |
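
For intuition on why `alpha` is typically set equal to `rank`: in the standard LoRA formulation (general background, not specific to this trainer), the adapter update is scaled by $\alpha / r$, so keeping $\alpha = r$ holds the update's magnitude roughly constant when you change the rank:

$$
W' = W + \frac{\alpha}{r}\,BA, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k}
$$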
#### Understanding Target Modules
The LTX-2 transformer has separate attention and feed-forward blocks for video and audio, as well as cross-attention
modules that enable the two modalities to exchange information. Choosing the right target_modules is critical for
achieving good results, especially when training with audio.
Video-only modules:

| Module Pattern | Description |
|---|---|
| `attn1.to_k`, `attn1.to_q`, `attn1.to_v`, `attn1.to_out.0` | Video self-attention |
| `attn2.to_k`, `attn2.to_q`, `attn2.to_v`, `attn2.to_out.0` | Video cross-attention (to text) |
| `ff.net.0.proj`, `ff.net.2` | Video feed-forward network |
Audio-only modules:

| Module Pattern | Description |
|---|---|
| `audio_attn1.to_k`, `audio_attn1.to_q`, `audio_attn1.to_v`, `audio_attn1.to_out.0` | Audio self-attention |
| `audio_attn2.to_k`, `audio_attn2.to_q`, `audio_attn2.to_v`, `audio_attn2.to_out.0` | Audio cross-attention (to text) |
| `audio_ff.net.0.proj`, `audio_ff.net.2` | Audio feed-forward network |
Audio-video cross-attention modules:
These modules enable bidirectional information flow between the audio and video modalities:

| Module Pattern | Description |
|---|---|
| `audio_to_video_attn.to_k`, `audio_to_video_attn.to_q`, `audio_to_video_attn.to_v`, `audio_to_video_attn.to_out.0` | Video attends to audio (Q from video, K/V from audio) |
| `video_to_audio_attn.to_k`, `video_to_audio_attn.to_q`, `video_to_audio_attn.to_v`, `video_to_audio_attn.to_out.0` | Audio attends to video (Q from audio, K/V from video) |
Recommended configurations:
For video-only training, target the video attention layers:

```yaml
target_modules:
  - "attn1.to_k"
  - "attn1.to_q"
  - "attn1.to_v"
  - "attn1.to_out.0"
  - "attn2.to_k"
  - "attn2.to_q"
  - "attn2.to_v"
  - "attn2.to_out.0"
```
For audio-video training, use patterns that match both branches:

```yaml
target_modules:
  - "to_k"
  - "to_q"
  - "to_v"
  - "to_out.0"
```
Using shorter patterns like `"to_k"` will match all attention modules, including `attn1.to_k`, `audio_attn1.to_k`, `audio_to_video_attn.to_k`, and `video_to_audio_attn.to_k`, effectively training the video, audio, and cross-modal attention branches together.
You can also target the feed-forward (FFN) modules (`ff.net.0.proj` and `ff.net.2` for video, `audio_ff.net.0.proj` and `audio_ff.net.2` for audio) to increase the LoRA's capacity and potentially help it capture the target distribution better.
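
Combining the two notes above, a sketch of an audio-video recipe that also adapts the feed-forward branches (assuming the same substring matching applies, the short FFN patterns should match the `audio_ff.*` modules as well):

```yaml
target_modules:
  # Matches video, audio, and cross-modal attention projections
  - "to_k"
  - "to_q"
  - "to_v"
  - "to_out.0"
  # By substring matching, these should cover both ff.net.* and audio_ff.net.*
  - "ff.net.0.proj"
  - "ff.net.2"
```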
### TrainingStrategyConfig

Configures the training strategy. This replaces the legacy `ConditioningConfig`.
#### Text-to-Video Strategy

```yaml
training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1     # Probability of first-frame conditioning
  with_audio: false                   # Enable joint audio-video training
  audio_latents_dir: "audio_latents"  # Directory for audio latents (when with_audio: true)
```
#### Video-to-Video Strategy (IC-LoRA)

```yaml
training_strategy:
  name: "video_to_video"
  first_frame_conditioning_p: 0.1
  reference_latents_dir: "reference_latents"  # Directory for reference video latents
```
Key parameters:

| Parameter | Description |
|---|---|
| `name` | Strategy type: `"text_to_video"` or `"video_to_video"` |
| `first_frame_conditioning_p` | Probability of using the first frame as conditioning (0.0-1.0) |
| `with_audio` | (`text_to_video` only) Enable joint audio-video training |
| `audio_latents_dir` | (`text_to_video` only) Directory name for audio latents |
| `reference_latents_dir` | (`video_to_video` only) Directory name for reference video latents |
### OptimizationConfig
Training optimization parameters including learning rates, batch sizes, and schedulers.

```yaml
optimization:
  learning_rate: 1e-4                  # Learning rate
  steps: 2000                          # Total training steps
  batch_size: 1                        # Batch size per GPU
  gradient_accumulation_steps: 1       # Steps to accumulate gradients
  max_grad_norm: 1.0                   # Gradient clipping threshold
  optimizer_type: "adamw"              # "adamw" or "adamw8bit"
  scheduler_type: "linear"             # Scheduler type
  scheduler_params: {}                 # Additional scheduler parameters
  enable_gradient_checkpointing: true  # Memory optimization
```
Key parameters:

| Parameter | Description |
|---|---|
| `learning_rate` | Learning rate for optimization (typical range: 1e-5 to 1e-3) |
| `steps` | Total number of training steps |
| `batch_size` | Batch size per GPU (reduce if running out of memory) |
| `gradient_accumulation_steps` | Accumulate gradients over multiple steps |
| `scheduler_type` | LR scheduler: `"constant"`, `"linear"`, `"cosine"`, `"cosine_with_restarts"`, `"polynomial"` |
| `enable_gradient_checkpointing` | Trade training speed for GPU memory savings (recommended for large models) |
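
When GPU memory is tight, keep the per-GPU batch small and recover the effective batch size through accumulation: in general, effective batch size = `batch_size` × `gradient_accumulation_steps` × number of GPUs. A minimal sketch for a single GPU:

```yaml
optimization:
  batch_size: 1                        # Small enough to fit in memory
  gradient_accumulation_steps: 8       # 1 x 8 x 1 GPU = effective batch size of 8
  enable_gradient_checkpointing: true  # Trades speed for additional memory savings
```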
### AccelerationConfig
Hardware acceleration and compute optimization settings.

```yaml
acceleration:
  mixed_precision_mode: "bf16"      # "no", "fp16", or "bf16"
  quantization: null                # Quantization options
  load_text_encoder_in_8bit: false  # Load text encoder in 8-bit
```
Key parameters:

| Parameter | Description |
|---|---|
| `mixed_precision_mode` | Precision mode - `"bf16"` recommended for modern GPUs |
| `quantization` | Model quantization: `null`, `"int8-quanto"`, `"int4-quanto"`, `"fp8-quanto"`, etc. |
| `load_text_encoder_in_8bit` | Load the Gemma text encoder in 8-bit to save GPU memory |
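
These options compose. A sketch of a memory-lean setup (whether a given quantization mode suits your GPU and training mode is worth verifying empirically):

```yaml
acceleration:
  mixed_precision_mode: "bf16"     # Recommended on modern GPUs
  quantization: "int8-quanto"      # Quantize the model to reduce memory
  load_text_encoder_in_8bit: true  # Shrink the Gemma text encoder as well
```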
### DataConfig
Data loading and processing configuration.

```yaml
data:
  preprocessed_data_root: "/path/to/preprocessed/data"  # Path to precomputed dataset
  num_dataloader_workers: 2                             # Background data loading workers
```
Key parameters:

| Parameter | Description |
|---|---|
| `preprocessed_data_root` | Path to your preprocessed dataset (contains `latents/`, `conditions/`, etc.) |
| `num_dataloader_workers` | Number of parallel data loading processes (0 = synchronous loading, useful when debugging) |
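
As a rough sketch of what `preprocessed_data_root` holds (the subdirectory names below come from this guide; the exact layout is produced by the preprocessing step, so treat this as illustrative):

```yaml
data:
  # /path/to/preprocessed/data/
  #   latents/             # precomputed video latents
  #   conditions/          # precomputed text-encoder outputs
  #   audio_latents/       # only when training_strategy.with_audio is true
  #   reference_latents/   # only for the video_to_video strategy
  preprocessed_data_root: "/path/to/preprocessed/data"
  num_dataloader_workers: 2
```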
### ValidationConfig
Validation and inference settings for monitoring training progress.

```yaml
validation:
  prompts:                            # Validation prompts
    - "A cat playing with a ball"
    - "A dog running in a field"
  negative_prompt: "worst quality, inconsistent motion, blurry, jittery, distorted"
  images: null                        # Optional image paths for image-to-video
  reference_videos: null              # Reference video paths (IC-LoRA only)
  video_dims: [576, 576, 89]          # Video dimensions [width, height, frames]
  frame_rate: 25.0                    # Frame rate for generated videos
  seed: 42                            # Random seed for reproducibility
  inference_steps: 30                 # Number of inference steps
  interval: 100                       # Steps between validation runs
  videos_per_prompt: 1                # Videos generated per prompt
  guidance_scale: 3.0                 # CFG guidance strength
  stg_scale: 1.0                      # STG guidance strength (0.0 to disable)
  stg_blocks: [29]                    # Transformer blocks to perturb for STG
  stg_mode: "stg_av"                  # "stg_av" or "stg_v" (video only)
  generate_audio: true                # Whether to generate audio
  skip_initial_validation: false      # Skip validation at step 0
  include_reference_in_output: false  # Include reference video side-by-side (IC-LoRA)
```
Key parameters:

| Parameter | Description |
|---|---|
| `prompts` | List of text prompts for validation video generation |
| `images` | List of image paths for image-to-video validation (must match number of prompts) |
| `reference_videos` | List of reference video paths for IC-LoRA validation (must match number of prompts) |
| `video_dims` | Output dimensions [width, height, frames]. Width/height must be divisible by 32; frames must satisfy `frames % 8 == 1` |
| `interval` | Steps between validation runs (set to `null` to disable) |
| `guidance_scale` | CFG (Classifier-Free Guidance) scale. Recommended: 3.0 |
| `stg_scale` | STG (Spatio-Temporal Guidance) scale. 0.0 disables STG. Recommended: 1.0 |
| `stg_blocks` | Transformer blocks to perturb for STG. Recommended: `[29]` (single block) |
| `stg_mode` | STG mode: `"stg_av"` perturbs both audio and video, `"stg_v"` perturbs video only |
| `generate_audio` | Whether to generate audio in validation samples |
| `include_reference_in_output` | For IC-LoRA: concatenate reference video side-by-side with output |
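
Checking a candidate `video_dims` against these constraints is quick arithmetic. For example:

```yaml
# 576 x 576, 89 frames: 576 / 32 = 18 (divisible), 89 % 8 = 1 (valid)
video_dims: [576, 576, 89]
# A hypothetical 704 x 480, 121-frame setting would also pass:
# 704 / 32 = 22, 480 / 32 = 15, 121 % 8 = 1
```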
### CheckpointsConfig
Model checkpointing configuration.

```yaml
checkpoints:
  interval: 250   # Steps between checkpoint saves (null = disabled)
  keep_last_n: 3  # Number of recent checkpoints to retain
```
Key parameters:

| Parameter | Description |
|---|---|
| `interval` | Steps between intermediate checkpoint saves (set to `null` to disable) |
| `keep_last_n` | Number of most recent checkpoints to keep (-1 = keep all) |
### HubConfig
Hugging Face Hub integration for automatic model uploads.

```yaml
hub:
  push_to_hub: false                   # Enable Hub uploading
  hub_model_id: "username/model-name"  # Hub repository ID
```
Key parameters:

| Parameter | Description |
|---|---|
| `push_to_hub` | Whether to automatically push trained models to Hugging Face Hub |
| `hub_model_id` | Repository ID in format `"username/repository-name"` |
### WandbConfig
Weights & Biases logging configuration.

```yaml
wandb:
  enabled: false               # Enable W&B logging
  project: "ltx-2-trainer"     # W&B project name
  entity: null                 # W&B username or team
  tags: []                     # Tags for the run
  log_validation_videos: true  # Log validation videos to W&B
```
Key parameters:

| Parameter | Description |
|---|---|
| `enabled` | Whether to enable W&B logging |
| `project` | W&B project name |
| `entity` | W&B username or team (`null` uses default account) |
| `log_validation_videos` | Whether to log validation videos to W&B |
### FlowMatchingConfig
Flow matching training configuration for timestep sampling.

```yaml
flow_matching:
  timestep_sampling_mode: "shifted_logit_normal"  # Timestep sampling strategy
  timestep_sampling_params: {}                    # Additional sampling parameters
```
Key parameters:

| Parameter | Description |
|---|---|
| `timestep_sampling_mode` | Sampling strategy: `"uniform"` or `"shifted_logit_normal"` |
| `timestep_sampling_params` | Additional parameters for the sampling strategy |
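
The exact parameterization of `"shifted_logit_normal"` lives in the code and `timestep_sampling_params`. As intuition only (a common formulation in flow-matching trainers, not a statement about this codebase), logit-normal sampling draws a normal sample, squashes it through a sigmoid, and applies a shift factor $s$:

$$
x \sim \mathcal{N}(\mu, \sigma^2), \qquad t = \operatorname{sigmoid}(x), \qquad t_{\text{shifted}} = \frac{s\,t}{1 + (s - 1)\,t}
$$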
## Next Steps
Once you've configured your training parameters:
- Set up your dataset using Dataset Preparation
- Choose your training approach in Training Modes
- Start training with the Training Guide