# Configuration Reference

The trainer uses structured Pydantic models for configuration, making it easy to customize training parameters.
This guide covers all available configuration options and their usage.

## 📋 Overview

The main configuration class is [`LtxTrainerConfig`](../src/ltx_trainer/config.py), which includes the following
sub-configurations:

- **ModelConfig**: Base model and training mode settings
- **LoraConfig**: LoRA training parameters
- **TrainingStrategyConfig**: Training strategy settings (text-to-video or video-to-video)
- **OptimizationConfig**: Learning rate, batch sizes, and scheduler settings
- **AccelerationConfig**: Mixed precision and quantization settings
- **DataConfig**: Data loading parameters
- **ValidationConfig**: Validation and inference settings
- **CheckpointsConfig**: Checkpoint saving frequency and retention settings
- **HubConfig**: Hugging Face Hub integration settings
- **WandbConfig**: Weights & Biases logging settings
- **FlowMatchingConfig**: Timestep sampling parameters

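Each sub-configuration corresponds to one top-level key in the YAML file. As a rough illustration (not the trainer's own validation, which is handled by the Pydantic models), the sketch below uses the section keys that appear in the snippets in this guide to flag misspelled top-level sections in an already-parsed config. The helper name `unknown_sections` is hypothetical.

```python
# Top-level YAML keys as used in the snippets in this guide, mapped to the
# sub-configuration each one populates.
SECTION_TO_CONFIG = {
    "model": "ModelConfig",
    "lora": "LoraConfig",
    "training_strategy": "TrainingStrategyConfig",
    "optimization": "OptimizationConfig",
    "acceleration": "AccelerationConfig",
    "data": "DataConfig",
    "validation": "ValidationConfig",
    "checkpoints": "CheckpointsConfig",
    "hub": "HubConfig",
    "wandb": "WandbConfig",
    "flow_matching": "FlowMatchingConfig",
}


def unknown_sections(config: dict) -> list[str]:
    """Return top-level keys that do not correspond to a known section."""
    return sorted(key for key in config if key not in SECTION_TO_CONFIG)


# A parsed YAML file is a plain nested dict; "optimisation" is a deliberate typo.
parsed = {"model": {"training_mode": "lora"}, "optimisation": {"steps": 2000}}
print(unknown_sections(parsed))
```
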
## 📄 Example Configuration Files

Check out our example configurations in the `configs` directory:

- 📄 [Audio-Video LoRA Training](../configs/ltx2_av_lora.yaml) - Joint audio-video generation training
- 📄 [Audio-Video LoRA Training (Low VRAM)](../configs/ltx2_av_lora_low_vram.yaml) - Memory-optimized config for 32GB
  GPUs (uses 8-bit optimizer, INT8 quantization, and reduced LoRA rank)
- 📄 [IC-LoRA Training](../configs/ltx2_v2v_ic_lora.yaml) - Video-to-video transformation training

## ⚙️ Configuration Sections

### ModelConfig

Controls the base model and training mode settings.

```yaml
model:
  model_path: "/path/to/ltx-2-model.safetensors"  # Local path to model checkpoint
  text_encoder_path: "/path/to/gemma-model"       # Path to Gemma text encoder directory
  training_mode: "lora"                           # "lora" or "full"
  load_checkpoint: null                           # Path to checkpoint to resume from
```

**Key parameters:**

| Parameter           | Description |
|---------------------|-------------|
| `model_path`        | **Required.** Local path to the LTX-2 model checkpoint (`.safetensors` file). URLs are not supported. |
| `text_encoder_path` | **Required.** Path to the Gemma text encoder model directory. Download from [HuggingFace](https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-unquantized/). |
| `training_mode`     | Training approach - `"lora"` for LoRA training or `"full"` for full-rank fine-tuning. |
| `load_checkpoint`   | Optional path to resume training from a checkpoint file or directory. |

> [!NOTE]
> LTX-2 requires both a model checkpoint and a Gemma text encoder. Both must be local paths.

### LoraConfig

LoRA-specific fine-tuning parameters (only used when `training_mode: "lora"`).

```yaml
lora:
  rank: 32          # LoRA rank (higher = more parameters)
  alpha: 32         # LoRA alpha scaling factor
  dropout: 0.0      # Dropout probability (0.0-1.0)
  target_modules:   # Modules to apply LoRA to
    - "to_k"
    - "to_q"
    - "to_v"
    - "to_out.0"
```

**Key parameters:**

| Parameter        | Description |
|------------------|-------------|
| `rank`           | LoRA rank - higher values mean more trainable parameters (typical range: 8-128) |
| `alpha`          | Alpha scaling factor - typically set equal to rank |
| `dropout`        | Dropout probability for regularization |
| `target_modules` | List of transformer modules to apply LoRA adapters to (see below) |

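To make the effect of `rank` and `alpha` concrete, here is a small sketch of the standard LoRA arithmetic: each adapted linear layer gains two low-rank matrices with `rank * (in_features + out_features)` parameters in total, and the update is scaled by `alpha / rank`. The layer width used below is a hypothetical value chosen for illustration, not the actual LTX-2 dimension.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer (A: rank x in, B: out x rank)."""
    return rank * (in_features + out_features)


def lora_scale(alpha: float, rank: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


HIDDEN = 4096  # hypothetical layer width, not the actual LTX-2 dimension

for rank in (8, 32, 128):
    added = lora_param_count(HIDDEN, HIDDEN, rank)
    # alpha is set equal to rank here, so the scale stays at 1.0
    print(f"rank={rank:>3}: {added:,} params per layer, scale={lora_scale(rank, rank)}")
```

Setting `alpha` equal to `rank` keeps the scale at 1.0, so changing the rank alters capacity without also changing the magnitude of the update.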
#### Understanding Target Modules

The LTX-2 transformer has separate attention and feed-forward blocks for video and audio, as well as cross-attention
modules that enable the two modalities to exchange information. Choosing the right `target_modules` is critical for
achieving good results, especially when training with audio.

**Video-only modules:**

| Module Pattern                                             | Description |
|------------------------------------------------------------|-------------|
| `attn1.to_k`, `attn1.to_q`, `attn1.to_v`, `attn1.to_out.0` | Video self-attention |
| `attn2.to_k`, `attn2.to_q`, `attn2.to_v`, `attn2.to_out.0` | Video cross-attention (to text) |
| `ff.net.0.proj`, `ff.net.2`                                | Video feed-forward network |

**Audio-only modules:**

| Module Pattern                                                                     | Description |
|------------------------------------------------------------------------------------|-------------|
| `audio_attn1.to_k`, `audio_attn1.to_q`, `audio_attn1.to_v`, `audio_attn1.to_out.0` | Audio self-attention |
| `audio_attn2.to_k`, `audio_attn2.to_q`, `audio_attn2.to_v`, `audio_attn2.to_out.0` | Audio cross-attention (to text) |
| `audio_ff.net.0.proj`, `audio_ff.net.2`                                            | Audio feed-forward network |

**Audio-video cross-attention modules:**

These modules enable bidirectional information flow between the audio and video modalities:

| Module Pattern | Description |
|----------------|-------------|
| `audio_to_video_attn.to_k`, `audio_to_video_attn.to_q`, `audio_to_video_attn.to_v`, `audio_to_video_attn.to_out.0` | Video attends to audio (Q from video, K/V from audio) |
| `video_to_audio_attn.to_k`, `video_to_audio_attn.to_q`, `video_to_audio_attn.to_v`, `video_to_audio_attn.to_out.0` | Audio attends to video (Q from audio, K/V from video) |

**Recommended configurations:**

For **video-only training**, target the video attention layers:

```yaml
target_modules:
  - "attn1.to_k"
  - "attn1.to_q"
  - "attn1.to_v"
  - "attn1.to_out.0"
  - "attn2.to_k"
  - "attn2.to_q"
  - "attn2.to_v"
  - "attn2.to_out.0"
```

For **audio-video training**, use patterns that match both branches:

```yaml
target_modules:
  - "to_k"
  - "to_q"
  - "to_v"
  - "to_out.0"
```

> [!NOTE]
> Using shorter patterns like `"to_k"` will match all attention modules including `attn1.to_k`, `audio_attn1.to_k`,
> `audio_to_video_attn.to_k`, and `video_to_audio_attn.to_k`, effectively training video, audio, and cross-modal
> attention branches together.

> [!TIP]
> You can also target the feed-forward (FFN) modules (`ff.net.0.proj`, `ff.net.2` for video,
> `audio_ff.net.0.proj`, `audio_ff.net.2` for audio) to increase the LoRA's capacity and potentially
> help it capture the target distribution better.

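The matching behaviour described in the note above can be sketched as follows. This assumes dot-delimited suffix matching, the convention used by LoRA libraries such as PEFT; the trainer's actual matching logic may differ. The fully qualified module names (including the `transformer_blocks.0.` prefix) are hypothetical.

```python
def matches(module_name: str, pattern: str) -> bool:
    """True if the pattern equals the name or is a dot-delimited suffix of it."""
    return module_name == pattern or module_name.endswith("." + pattern)


# Hypothetical fully qualified module names; real prefixes may differ.
MODULES = [
    "transformer_blocks.0.attn1.to_k",
    "transformer_blocks.0.attn2.to_k",
    "transformer_blocks.0.audio_attn1.to_k",
    "transformer_blocks.0.audio_to_video_attn.to_k",
    "transformer_blocks.0.video_to_audio_attn.to_k",
    "transformer_blocks.0.ff.net.0.proj",
]


def select(patterns: list[str]) -> list[str]:
    """Return the modules that at least one pattern matches."""
    return [m for m in MODULES if any(matches(m, p) for p in patterns)]


print(select(["to_k"]))        # every attention key projection listed above
print(select(["attn1.to_k"]))  # only the video self-attention projection
```

Under this assumption, `"attn1.to_k"` does not match `audio_attn1.to_k`, because the character before `attn1` in that name is an underscore rather than a dot.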
### TrainingStrategyConfig

Configures the training strategy. The trainer includes two built-in strategies described below.
For custom use cases, see [Implementing Custom Training Strategies](custom-training-strategies.md).

#### Text-to-Video Strategy

```yaml
training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1      # Probability of first-frame conditioning
  with_audio: false                    # Enable joint audio-video training
  audio_latents_dir: "audio_latents"   # Directory for audio latents (when with_audio: true)
```

#### Video-to-Video Strategy (IC-LoRA)

```yaml
training_strategy:
  name: "video_to_video"
  first_frame_conditioning_p: 0.1
  reference_latents_dir: "reference_latents"  # Directory for reference video latents
```

**Key parameters:**

| Parameter                    | Description |
|------------------------------|-------------|
| `name`                       | Strategy type: `"text_to_video"` or `"video_to_video"` |
| `first_frame_conditioning_p` | Probability of using first frame as conditioning (0.0-1.0) |
| `with_audio`                 | (text_to_video only) Enable joint audio-video training |
| `audio_latents_dir`          | (text_to_video only) Directory name for audio latents |
| `reference_latents_dir`      | (video_to_video only) Directory name for reference video latents |

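Because some fields apply to only one strategy, it can help to sanity-check a `training_strategy` section before launching a run. The sketch below encodes the field table above in plain Python. The helper `check_strategy` is hypothetical and is not part of the trainer, and treating a missing `first_frame_conditioning_p` as `0.0` is an assumption made for the example.

```python
# Fields that this guide documents for each built-in strategy.
COMMON_FIELDS = {"name", "first_frame_conditioning_p"}
STRATEGY_FIELDS = {
    "text_to_video": {"with_audio", "audio_latents_dir"},
    "video_to_video": {"reference_latents_dir"},
}


def check_strategy(section: dict) -> list[str]:
    """Return human-readable problems found in a training_strategy section."""
    name = section.get("name")
    if name not in STRATEGY_FIELDS:
        return [f"unknown strategy name: {name!r}"]
    problems = []
    allowed = COMMON_FIELDS | STRATEGY_FIELDS[name]
    for key in sorted(set(section) - allowed):
        problems.append(f"{key!r} does not apply to {name!r}")
    # Assumption for this sketch: a missing probability is treated as 0.0.
    p = section.get("first_frame_conditioning_p", 0.0)
    if not 0.0 <= p <= 1.0:
        problems.append("first_frame_conditioning_p must be within [0.0, 1.0]")
    return problems


print(check_strategy({"name": "video_to_video", "with_audio": True}))
```
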
### OptimizationConfig

Training optimization parameters including learning rates, batch sizes, and schedulers.

```yaml
optimization:
  learning_rate: 1e-4                   # Learning rate
  steps: 2000                           # Total training steps
  batch_size: 1                         # Batch size per GPU
  gradient_accumulation_steps: 1        # Steps to accumulate gradients
  max_grad_norm: 1.0                    # Gradient clipping threshold
  optimizer_type: "adamw"               # "adamw" or "adamw8bit"
  scheduler_type: "linear"              # Scheduler type
  scheduler_params: { }                 # Additional scheduler parameters
  enable_gradient_checkpointing: true   # Memory optimization
```

**Key parameters:**

| Parameter                       | Description |
|---------------------------------|-------------|
| `learning_rate`                 | Learning rate for optimization (typical range: 1e-5 to 1e-3) |
| `steps`                         | Total number of training steps |
| `batch_size`                    | Batch size per GPU (reduce if running out of memory) |
| `gradient_accumulation_steps`   | Accumulate gradients over multiple steps |
| `scheduler_type`                | LR scheduler: `"constant"`, `"linear"`, `"cosine"`, `"cosine_with_restarts"`, `"polynomial"` |
| `enable_gradient_checkpointing` | Trade training speed for GPU memory savings (recommended for large models) |

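Since `batch_size` is per GPU, the number of samples behind each optimizer update also depends on gradient accumulation and the number of GPUs. The following sketch shows the arithmetic. It assumes that `steps` counts optimizer updates, which this guide does not state explicitly, and `num_gpus` is an illustrative argument rather than a config field.

```python
def effective_batch_size(batch_size: int, gradient_accumulation_steps: int, num_gpus: int = 1) -> int:
    """Samples contributing to one optimizer update across all GPUs."""
    return batch_size * gradient_accumulation_steps * num_gpus


def samples_seen(steps: int, batch_size: int, gradient_accumulation_steps: int, num_gpus: int = 1) -> int:
    """Total samples processed, ASSUMING `steps` counts optimizer updates."""
    return steps * effective_batch_size(batch_size, gradient_accumulation_steps, num_gpus)


# batch_size=1 with 4 accumulation steps on 2 GPUs
print(effective_batch_size(1, 4, num_gpus=2))
print(samples_seen(2000, 1, 4, num_gpus=2))
```

When `batch_size` has to be reduced to fit in memory, raising `gradient_accumulation_steps` by the same factor keeps the effective batch size unchanged.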
### AccelerationConfig

Hardware acceleration and compute optimization settings.

```yaml
acceleration:
  mixed_precision_mode: "bf16"       # "no", "fp16", or "bf16"
  quantization: null                 # Quantization options
  load_text_encoder_in_8bit: false   # Load text encoder in 8-bit
```

**Key parameters:**

| Parameter                   | Description |
|-----------------------------|-------------|
| `mixed_precision_mode`      | Precision mode - `"bf16"` recommended for modern GPUs |
| `quantization`              | Model quantization: `null`, `"int8-quanto"`, `"int4-quanto"`, `"fp8-quanto"`, etc. |
| `load_text_encoder_in_8bit` | Load the Gemma text encoder in 8-bit to save GPU memory |

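As a back-of-the-envelope illustration of why quantization helps on smaller GPUs, the sketch below estimates the memory taken by model weights alone under each option listed above. It assumes unquantized weights are held in bf16 (2 bytes per parameter) and ignores quantization scales, activations, gradients, and optimizer state. The parameter count is a hypothetical placeholder, not the size of LTX-2.

```python
# Approximate bytes per weight; None stands for `quantization: null`,
# ASSUMED here to mean bf16 weights.
BYTES_PER_PARAM = {
    None: 2.0,
    "int8-quanto": 1.0,
    "fp8-quanto": 1.0,
    "int4-quanto": 0.5,
}


def weight_memory_gib(num_params: int, quantization=None) -> float:
    """Rough weight-only memory footprint in GiB."""
    return num_params * BYTES_PER_PARAM[quantization] / 2**30


NUM_PARAMS = 10 * 10**9  # hypothetical parameter count for illustration only

for option in (None, "int8-quanto", "int4-quanto"):
    print(f"{option}: {weight_memory_gib(NUM_PARAMS, option):.1f} GiB")
```
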
### DataConfig

Data loading and processing configuration.

```yaml
data:
  preprocessed_data_root: "/path/to/preprocessed/data"  # Path to precomputed dataset
  num_dataloader_workers: 2                             # Background data loading workers
```

**Key parameters:**

| Parameter                | Description |
|--------------------------|-------------|
| `preprocessed_data_root` | Path to your preprocessed dataset (contains `latents/`, `conditions/`, etc.) |
| `num_dataloader_workers` | Number of parallel data loading processes (0 = synchronous loading, useful when debugging) |

### ValidationConfig

Validation and inference settings for monitoring training progress.

```yaml
validation:
  prompts:                              # Validation prompts
    - "A cat playing with a ball"
    - "A dog running in a field"
  negative_prompt: "worst quality, inconsistent motion, blurry, jittery, distorted"
  images: null                          # Optional image paths for image-to-video
  reference_videos: null                # Reference video paths (IC-LoRA only)
  video_dims: [ 576, 576, 89 ]          # Video dimensions [width, height, frames]
  frame_rate: 25.0                      # Frame rate for generated videos
  seed: 42                              # Random seed for reproducibility
  inference_steps: 30                   # Number of inference steps
  interval: 100                         # Steps between validation runs
  videos_per_prompt: 1                  # Videos generated per prompt
  guidance_scale: 4.0                   # CFG guidance strength
  stg_scale: 1.0                        # STG guidance strength (0.0 to disable)
  stg_blocks: [ 29 ]                    # Transformer blocks to perturb for STG
  stg_mode: "stg_av"                    # "stg_av" or "stg_v" (video only)
  generate_audio: true                  # Whether to generate audio
  skip_initial_validation: false        # Skip validation at step 0
  include_reference_in_output: false    # Include reference video side-by-side (IC-LoRA)
```

**Key parameters:**

| Parameter                     | Description |
|-------------------------------|-------------|
| `prompts`                     | List of text prompts for validation video generation |
| `images`                      | List of image paths for image-to-video validation (must match number of prompts) |
| `reference_videos`            | List of reference video paths for IC-LoRA validation (must match number of prompts) |
| `video_dims`                  | Output dimensions `[width, height, frames]`. Width/height must be divisible by 32, frames must satisfy `frames % 8 == 1` |
| `interval`                    | Steps between validation runs (set to `null` to disable) |
| `guidance_scale`              | CFG (Classifier-Free Guidance) scale. Recommended: 4.0 |
| `stg_scale`                   | STG (Spatio-Temporal Guidance) scale. 0.0 disables STG. Recommended: 1.0 |
| `stg_blocks`                  | Transformer blocks to perturb for STG. Recommended: `[29]` (single block) |
| `stg_mode`                    | STG mode: `"stg_av"` perturbs both audio and video, `"stg_v"` perturbs video only |
| `generate_audio`              | Whether to generate audio in validation samples |
| `include_reference_in_output` | For IC-LoRA: concatenate reference video side-by-side with output |

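The `video_dims` constraints are easy to get wrong when picking a custom resolution or clip length. The sketch below checks the two rules stated in the table (width and height divisible by 32, `frames % 8 == 1`) and rounds a frame count down to the nearest valid value. Both helper names are hypothetical and not part of the trainer.

```python
def check_video_dims(video_dims: list[int]) -> list[str]:
    """Return the constraint violations for [width, height, frames]."""
    width, height, frames = video_dims
    problems = []
    if width % 32 != 0:
        problems.append(f"width {width} is not divisible by 32")
    if height % 32 != 0:
        problems.append(f"height {height} is not divisible by 32")
    if frames % 8 != 1:
        problems.append(f"frames {frames} does not satisfy frames % 8 == 1")
    return problems


def floor_valid_frames(frames: int) -> int:
    """Round down to the closest frame count of the form 8*k + 1 (minimum 1)."""
    return max(1, ((frames - 1) // 8) * 8 + 1)


print(check_video_dims([576, 576, 89]))  # the example values used above
print(check_video_dims([500, 576, 90]))
print(floor_valid_frames(96))
```
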
### CheckpointsConfig

Model checkpointing configuration.

```yaml
checkpoints:
  interval: 250         # Steps between checkpoint saves (null = disabled)
  keep_last_n: 3        # Number of recent checkpoints to retain
  precision: bfloat16   # Precision for saved weights (bfloat16 or float32)
```

**Key parameters:**

| Parameter     | Description |
|---------------|-------------|
| `interval`    | Steps between intermediate checkpoint saves (set to `null` to disable) |
| `keep_last_n` | Number of most recent checkpoints to keep (-1 = keep all) |
| `precision`   | Precision for saved checkpoint weights: `"bfloat16"` (default) or `"float32"` |

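To illustrate how `interval` and `keep_last_n` interact, the sketch below computes which checkpoints a retention policy of this kind would remove. It is a simplified model of the behaviour described in the table, not the trainer's actual cleanup code, and it identifies checkpoints by their training step rather than by file name.

```python
def checkpoints_to_delete(saved_steps: list[int], keep_last_n: int) -> list[int]:
    """Return the steps whose checkpoints fall outside the retention window."""
    if keep_last_n < 0:  # -1 means keep every checkpoint
        return []
    ordered = sorted(saved_steps)
    return ordered[: max(0, len(ordered) - keep_last_n)]


# interval=250 over 1250 steps produces five intermediate checkpoints.
saved = list(range(250, 1251, 250))
print(checkpoints_to_delete(saved, keep_last_n=3))
print(checkpoints_to_delete(saved, keep_last_n=-1))
```
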
### HubConfig

Hugging Face Hub integration for automatic model uploads.

```yaml
hub:
  push_to_hub: false                    # Enable Hub uploading
  hub_model_id: "username/model-name"   # Hub repository ID
```

**Key parameters:**

| Parameter      | Description |
|----------------|-------------|
| `push_to_hub`  | Whether to automatically push trained models to Hugging Face Hub |
| `hub_model_id` | Repository ID in format `"username/repository-name"` |

### WandbConfig

Weights & Biases logging configuration.

```yaml
wandb:
  enabled: false                # Enable W&B logging
  project: "ltx-2-trainer"      # W&B project name
  entity: null                  # W&B username or team
  tags: [ ]                     # Tags for the run
  log_validation_videos: true   # Log validation videos to W&B
```

**Key parameters:**

| Parameter               | Description |
|-------------------------|-------------|
| `enabled`               | Whether to enable W&B logging |
| `project`               | W&B project name |
| `entity`                | W&B username or team (null uses default account) |
| `log_validation_videos` | Whether to log validation videos to W&B |

### FlowMatchingConfig

Flow matching training configuration for timestep sampling.

```yaml
flow_matching:
  timestep_sampling_mode: "shifted_logit_normal"  # Timestep sampling strategy
  timestep_sampling_params: { }                   # Additional sampling parameters
```

**Key parameters:**

| Parameter                  | Description |
|----------------------------|-------------|
| `timestep_sampling_mode`   | Sampling strategy: `"uniform"` or `"shifted_logit_normal"` |
| `timestep_sampling_params` | Additional parameters for the sampling strategy |

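For intuition about the two sampling modes, here is a generic sketch: uniform sampling draws timesteps evenly from the unit interval, while a logit-normal draw passes a Gaussian sample through a sigmoid, concentrating timesteps around a region controlled by the Gaussian's mean. The `shift` and `scale` arguments are hypothetical names for this illustration. How the trainer computes its shift, and which keys `timestep_sampling_params` accepts, are not covered here.

```python
import math
import random


def sample_timestep(mode: str, rng: random.Random, shift: float = 0.0, scale: float = 1.0) -> float:
    """Draw one timestep in the unit interval (illustrative only)."""
    if mode == "uniform":
        return rng.random()
    if mode == "shifted_logit_normal":
        z = rng.gauss(shift, scale)        # Gaussian sample with a shifted mean
        return 1.0 / (1.0 + math.exp(-z))  # sigmoid maps it into (0, 1)
    raise ValueError(f"unknown timestep sampling mode: {mode!r}")


rng = random.Random(0)
for mode in ("uniform", "shifted_logit_normal"):
    draws = [sample_timestep(mode, rng, shift=1.0) for _ in range(10_000)]
    print(f"{mode}: mean timestep = {sum(draws) / len(draws):.3f}")
```

With a positive `shift` in this sketch, the logit-normal draws cluster above 0.5, whereas uniform draws average close to 0.5.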
## 🚀 Next Steps

Once you've configured your training parameters:

- Set up your dataset using [Dataset Preparation](dataset-preparation.md)
- Choose your training approach in [Training Modes](training-modes.md)
- Start training with the [Training Guide](training-guide.md)