	# Configuration Reference

	The trainer uses structured Pydantic models for configuration, making it easy to customize training parameters.
	This guide covers all available configuration options and their usage.

	## 📋 Overview

	The main configuration class is [`LtxTrainerConfig`](../src/ltx_trainer/config.py), which includes the following
	sub-configurations:

	- ModelConfig: Base model and training mode settings
	- LoraConfig: LoRA training parameters
	- TrainingStrategyConfig: Training strategy settings (text-to-video or video-to-video)
	- OptimizationConfig: Learning rate, batch sizes, and scheduler settings
	- AccelerationConfig: Mixed precision and quantization settings
	- DataConfig: Data loading parameters
	- ValidationConfig: Validation and inference settings
	- CheckpointsConfig: Checkpoint saving frequency and retention settings
	- HubConfig: Hugging Face Hub integration settings
	- WandbConfig: Weights & Biases logging settings
	- FlowMatchingConfig: Timestep sampling parameters

	## 📄 Example Configuration Files

	Check out our example configurations in the `configs` directory:

	- 📄 [Audio-Video LoRA Training](../configs/ltx2_av_lora.yaml) - Joint audio-video generation training
	- 📄 [Audio-Video LoRA Training (Low VRAM)](../configs/ltx2_av_lora_low_vram.yaml) - Memory-optimized config for 32GB
	GPUs (uses 8-bit optimizer, INT8 quantization, and reduced LoRA rank)
	- 📄 [IC-LoRA Training](../configs/ltx2_v2v_ic_lora.yaml) - Video-to-video transformation training

	## ⚙️ Configuration Sections

	### ModelConfig

	Controls the base model and training mode settings.

	```yaml
	model:
	  model_path: "/path/to/ltx-2-model.safetensors" # Local path to model checkpoint
	  text_encoder_path: "/path/to/gemma-model" # Path to Gemma text encoder directory
	  training_mode: "lora" # "lora" or "full"
	  load_checkpoint: null # Path to checkpoint to resume from
	```

	Key parameters:

	| Parameter | Description |
	|---------------------|-------------|
	| `model_path` | Required. Local path to the LTX-2 model checkpoint (`.safetensors` file). URLs are not supported. |
	| `text_encoder_path` | Required. Path to the Gemma text encoder model directory. Download from [HuggingFace](https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-unquantized/). |
	| `training_mode` | Training approach - `"lora"` for LoRA training or `"full"` for full-rank fine-tuning. |
	| `load_checkpoint` | Optional path to resume training from a checkpoint file or directory. |

	> [!NOTE]
	> LTX-2 requires both a model checkpoint and a Gemma text encoder. Both must be local paths.
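
The path rules above can be checked before launching a run. The following is a minimal sketch, not the trainer's own validation: `check_model_paths` is a hypothetical helper, and the exact checks the trainer performs may differ.

```python
from pathlib import Path
from urllib.parse import urlparse


def check_model_paths(model_path: str, text_encoder_path: str) -> list[str]:
    """Return a list of problems found with the two ModelConfig paths (hypothetical helper)."""
    problems = []
    # URLs are not supported for model_path, so reject web schemes up front.
    if urlparse(model_path).scheme in ("http", "https"):
        problems.append("model_path must be a local path, not a URL")
    elif not model_path.endswith(".safetensors"):
        problems.append("model_path should point to a .safetensors file")
    elif not Path(model_path).is_file():
        problems.append(f"model_path does not exist: {model_path}")
    # The text encoder is expected to be a local directory.
    if not Path(text_encoder_path).is_dir():
        problems.append(f"text_encoder_path is not a directory: {text_encoder_path}")
    return problems


for problem in check_model_paths("https://example.com/model.safetensors", "/path/to/gemma-model"):
    print(problem)
```

An empty list means both paths look usable; any returned string describes one rule from the table that the path violates.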

	### LoraConfig

	LoRA-specific fine-tuning parameters (only used when `training_mode: "lora"`).

	```yaml
	lora:
	  rank: 32 # LoRA rank (higher = more parameters)
	  alpha: 32 # LoRA alpha scaling factor
	  dropout: 0.0 # Dropout probability (0.0-1.0)
	  target_modules: # Modules to apply LoRA to
	    - "to_k"
	    - "to_q"
	    - "to_v"
	    - "to_out.0"
	```

	Key parameters:

	| Parameter | Description |
	|------------------|-------------|
	| `rank` | LoRA rank - higher values mean more trainable parameters (typical range: 8-128) |
	| `alpha` | Alpha scaling factor - typically set equal to rank |
	| `dropout` | Dropout probability for regularization |
	| `target_modules` | List of transformer modules to apply LoRA adapters to (see below) |
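
To make the `rank` and `alpha` trade-off concrete, the sketch below uses the standard LoRA formulation, in which the low-rank update is scaled by `alpha / rank` and each adapted linear layer adds `rank * (in_features + out_features)` trainable parameters. The 4096-wide layer is an illustrative assumption, not a value taken from the LTX-2 architecture, and the trainer may apply a different scaling convention.

```python
def lora_scale(rank: int, alpha: float) -> float:
    """Scaling applied to the low-rank update in standard LoRA (alpha / rank)."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return alpha / rank


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Trainable parameters added to one linear layer: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)


# Hypothetical 4096 x 4096 projection; real LTX-2 layer sizes may differ.
for rank in (8, 32, 128):
    scale = lora_scale(rank, alpha=rank)  # alpha == rank keeps the scale at 1.0
    params = lora_param_count(4096, 4096, rank)
    print(f"rank={rank} scale={scale} params_per_layer={params}")
```

Setting `alpha` equal to `rank` keeps the update scale at 1.0, which is why changing the rank alone then changes capacity without also changing the effective step size of the adapter.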

	#### Understanding Target Modules

	The LTX-2 transformer has separate attention and feed-forward blocks for video and audio, as well as cross-attention
	modules that enable the two modalities to exchange information. Choosing the right `target_modules` is critical for
	achieving good results, especially when training with audio.

	Video-only modules:

	| Module Pattern | Description |
	|------------------------------------------------------------|---------------------------------|
	| `attn1.to_k`, `attn1.to_q`, `attn1.to_v`, `attn1.to_out.0` | Video self-attention |
	| `attn2.to_k`, `attn2.to_q`, `attn2.to_v`, `attn2.to_out.0` | Video cross-attention (to text) |
	| `ff.net.0.proj`, `ff.net.2` | Video feed-forward network |

	Audio-only modules:

	| Module Pattern | Description |
	|------------------------------------------------------------------------------------|---------------------------------|
	| `audio_attn1.to_k`, `audio_attn1.to_q`, `audio_attn1.to_v`, `audio_attn1.to_out.0` | Audio self-attention |
	| `audio_attn2.to_k`, `audio_attn2.to_q`, `audio_attn2.to_v`, `audio_attn2.to_out.0` | Audio cross-attention (to text) |
	| `audio_ff.net.0.proj`, `audio_ff.net.2` | Audio feed-forward network |

	Audio-video cross-attention modules:

	These modules enable bidirectional information flow between the audio and video modalities:

	| Module Pattern | Description |
	|----------------|-------------|
	| `audio_to_video_attn.to_k`, `audio_to_video_attn.to_q`, `audio_to_video_attn.to_v`, `audio_to_video_attn.to_out.0` | Video attends to audio (Q from video, K/V from audio) |
	| `video_to_audio_attn.to_k`, `video_to_audio_attn.to_q`, `video_to_audio_attn.to_v`, `video_to_audio_attn.to_out.0` | Audio attends to video (Q from audio, K/V from video) |

	Recommended configurations:

	For video-only training, target the video attention layers:

	```yaml
	target_modules:
	- "attn1.to_k"
	- "attn1.to_q"
	- "attn1.to_v"
	- "attn1.to_out.0"
	- "attn2.to_k"
	- "attn2.to_q"
	- "attn2.to_v"
	- "attn2.to_out.0"
	```

	For audio-video training, use patterns that match both branches:

	```yaml
	target_modules:
	- "to_k"
	- "to_q"
	- "to_v"
	- "to_out.0"
	```

	> [!NOTE]
	> Using shorter patterns like `"to_k"` will match all attention modules including `attn1.to_k`, `audio_attn1.to_k`,
	> `audio_to_video_attn.to_k`, and `video_to_audio_attn.to_k`, effectively training video, audio, and cross-modal
	> attention branches together.

	> [!TIP]
	> You can also target the feed-forward (FFN) modules (`ff.net.0.proj`, `ff.net.2` for video,
	> `audio_ff.net.0.proj`, `audio_ff.net.2` for audio) to increase the LoRA's capacity and potentially
	> help it capture the target distribution better.
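
The difference between short and qualified patterns can be checked with a small sketch. It assumes PEFT-style suffix matching (a module is selected when its full name equals a pattern or ends with `.` followed by the pattern). The fully qualified module names are hypothetical and only mirror the patterns listed in the tables above.

```python
def matches_target(module_name: str, target_modules: list[str]) -> bool:
    """Return True if the name equals a pattern or ends with '.<pattern>' (assumed matching rule)."""
    return any(
        module_name == target or module_name.endswith("." + target)
        for target in target_modules
    )


# Hypothetical fully qualified names for the modules of one transformer block.
modules = [
    "transformer_blocks.0.attn1.to_k",
    "transformer_blocks.0.attn2.to_k",
    "transformer_blocks.0.audio_attn1.to_k",
    "transformer_blocks.0.audio_to_video_attn.to_k",
    "transformer_blocks.0.video_to_audio_attn.to_k",
    "transformer_blocks.0.ff.net.0.proj",
]

# The short pattern selects every attention branch that has a to_k projection.
short = [m for m in modules if matches_target(m, ["to_k"])]
# Qualified patterns select only the video self- and cross-attention projections.
video_only = [m for m in modules if matches_target(m, ["attn1.to_k", "attn2.to_k"])]

print(len(short), len(video_only))
```

Under this rule `attn1.to_k` does not select `audio_attn1.to_k`, because the match must start at a `.` boundary in the module name.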

	### TrainingStrategyConfig

	Configures the training strategy. The trainer includes two built-in strategies described below.
	For custom use cases, see [Implementing Custom Training Strategies](custom-training-strategies.md).

	#### Text-to-Video Strategy

	```yaml
	training_strategy:
	  name: "text_to_video"
	  first_frame_conditioning_p: 0.1 # Probability of first-frame conditioning
	  with_audio: false # Enable joint audio-video training
	  audio_latents_dir: "audio_latents" # Directory for audio latents (when with_audio: true)
	```

	#### Video-to-Video Strategy (IC-LoRA)

	```yaml
	training_strategy:
	  name: "video_to_video"
	  first_frame_conditioning_p: 0.1
	  reference_latents_dir: "reference_latents" # Directory for reference video latents
	```

	Key parameters:

	| Parameter | Description |
	|------------------------------|------------------------------------------------------------------|
	| `name` | Strategy type: `"text_to_video"` or `"video_to_video"` |
	| `first_frame_conditioning_p` | Probability of using first frame as conditioning (0.0-1.0) |
	| `with_audio` | (text_to_video only) Enable joint audio-video training |
	| `audio_latents_dir` | (text_to_video only) Directory name for audio latents |
	| `reference_latents_dir` | (video_to_video only) Directory name for reference video latents |
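
As a rough illustration of what a conditioning probability means, the sketch below draws an independent yes/no decision with probability `first_frame_conditioning_p`. Whether the trainer makes this decision per sample or per batch is not stated here, so the per-draw behaviour and the helper name `use_first_frame_conditioning` are assumptions.

```python
import random


def use_first_frame_conditioning(p: float, rng: random.Random) -> bool:
    """Decide whether one draw uses first-frame conditioning (assumed Bernoulli draw)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("first_frame_conditioning_p must be in [0.0, 1.0]")
    return rng.random() < p


rng = random.Random(42)
draws = [use_first_frame_conditioning(0.1, rng) for _ in range(10_000)]
# With p = 0.1, roughly one draw in ten is conditioned on the first frame.
print(sum(draws) / len(draws))
```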

	### OptimizationConfig

	Training optimization parameters including learning rates, batch sizes, and schedulers.

	```yaml
	optimization:
	  learning_rate: 1e-4 # Learning rate
	  steps: 2000 # Total training steps
	  batch_size: 1 # Batch size per GPU
	  gradient_accumulation_steps: 1 # Steps to accumulate gradients
	  max_grad_norm: 1.0 # Gradient clipping threshold
	  optimizer_type: "adamw" # "adamw" or "adamw8bit"
	  scheduler_type: "linear" # Scheduler type
	  scheduler_params: { } # Additional scheduler parameters
	  enable_gradient_checkpointing: true # Memory optimization
	```

	Key parameters:

	| Parameter | Description |
	|---------------------------------|-------------|
	| `learning_rate` | Learning rate for optimization (typical range: 1e-5 to 1e-3) |
	| `steps` | Total number of training steps |
	| `batch_size` | Batch size per GPU (reduce if running out of memory) |
	| `gradient_accumulation_steps` | Number of steps over which gradients are accumulated before each optimizer update |
	| `scheduler_type` | LR scheduler: `"constant"`, `"linear"`, `"cosine"`, `"cosine_with_restarts"`, `"polynomial"` |
	| `enable_gradient_checkpointing` | Trade training speed for GPU memory savings (recommended for large models) |
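
Because `batch_size` is per GPU, the number of samples contributing to each optimizer update also depends on gradient accumulation and on how many GPUs are used. A minimal sketch of that arithmetic follows; `num_gpus` is a hypothetical argument for illustration, not a configuration field.

```python
def effective_batch_size(batch_size: int, gradient_accumulation_steps: int, num_gpus: int = 1) -> int:
    """Samples contributing to one optimizer update across all GPUs."""
    if min(batch_size, gradient_accumulation_steps, num_gpus) < 1:
        raise ValueError("all arguments must be >= 1")
    return batch_size * gradient_accumulation_steps * num_gpus


# A per-GPU batch of 1 with 4 accumulation steps on 2 GPUs behaves like a batch of 8.
print(effective_batch_size(batch_size=1, gradient_accumulation_steps=4, num_gpus=2))
```

When `batch_size` has to be reduced to fit in memory, raising `gradient_accumulation_steps` by the same factor keeps the effective batch size unchanged.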

	### AccelerationConfig

	Hardware acceleration and compute optimization settings.

	```yaml
	acceleration:
	  mixed_precision_mode: "bf16" # "no", "fp16", or "bf16"
	  quantization: null # Quantization options
	  load_text_encoder_in_8bit: false # Load text encoder in 8-bit
	```

	Key parameters:

	| Parameter | Description |
	|-----------------------------|------------------------------------------------------------------------------------|
	| `mixed_precision_mode` | Precision mode - `"bf16"` recommended for modern GPUs |
	| `quantization` | Model quantization: `null`, `"int8-quanto"`, `"int4-quanto"`, `"fp8-quanto"`, etc. |
	| `load_text_encoder_in_8bit` | Load the Gemma text encoder in 8-bit to save GPU memory |

	### DataConfig

	Data loading and processing configuration.

	```yaml
	data:
	  preprocessed_data_root: "/path/to/preprocessed/data" # Path to precomputed dataset
	  num_dataloader_workers: 2 # Background data loading workers
	```

	Key parameters:

	| Parameter | Description |
	|--------------------------|--------------------------------------------------------------------------------------------|
	| `preprocessed_data_root` | Path to your preprocessed dataset (contains `latents/`, `conditions/`, etc.) |
	| `num_dataloader_workers` | Number of parallel data loading processes (0 = synchronous loading, useful when debugging) |

	### ValidationConfig

	Validation and inference settings for monitoring training progress.

	```yaml
	validation:
	  prompts: # Validation prompts
	    - "A cat playing with a ball"
	    - "A dog running in a field"
	  negative_prompt: "worst quality, inconsistent motion, blurry, jittery, distorted"
	  images: null # Optional image paths for image-to-video
	  reference_videos: null # Reference video paths (IC-LoRA only)
	  video_dims: [ 576, 576, 89 ] # Video dimensions [width, height, frames]
	  frame_rate: 25.0 # Frame rate for generated videos
	  seed: 42 # Random seed for reproducibility
	  inference_steps: 30 # Number of inference steps
	  interval: 100 # Steps between validation runs
	  videos_per_prompt: 1 # Videos generated per prompt
	  guidance_scale: 4.0 # CFG guidance strength
	  stg_scale: 1.0 # STG guidance strength (0.0 to disable)
	  stg_blocks: [ 29 ] # Transformer blocks to perturb for STG
	  stg_mode: "stg_av" # "stg_av" or "stg_v" (video only)
	  generate_audio: true # Whether to generate audio
	  skip_initial_validation: false # Skip validation at step 0
	  include_reference_in_output: false # Include reference video side-by-side (IC-LoRA)
	```

	Key parameters:

	| Parameter | Description |
	|-------------------------------|-------------|
	| `prompts` | List of text prompts for validation video generation |
	| `images` | List of image paths for image-to-video validation (must match number of prompts) |
	| `reference_videos` | List of reference video paths for IC-LoRA validation (must match number of prompts) |
	| `video_dims` | Output dimensions `[width, height, frames]`. Width/height must be divisible by 32, frames must satisfy `frames % 8 == 1` |
	| `interval` | Steps between validation runs (set to `null` to disable) |
	| `guidance_scale` | CFG (Classifier-Free Guidance) scale. Recommended: 4.0 |
	| `stg_scale` | STG (Spatio-Temporal Guidance) scale. 0.0 disables STG. Recommended: 1.0 |
	| `stg_blocks` | Transformer blocks to perturb for STG. Recommended: `[29]` (single block) |
	| `stg_mode` | STG mode: `"stg_av"` perturbs both audio and video, `"stg_v"` perturbs video only |
	| `generate_audio` | Whether to generate audio in validation samples |
	| `include_reference_in_output` | For IC-LoRA: concatenate reference video side-by-side with output |
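
The `video_dims` constraints are easy to get wrong when picking a frame count by hand. The sketch below checks a `[width, height, frames]` triple against the two rules in the table and rounds a frame count down to the nearest valid value. Both helper names are hypothetical and are not part of the trainer.

```python
def validate_video_dims(video_dims: list[int]) -> list[str]:
    """Return a list of rule violations for a [width, height, frames] triple."""
    width, height, frames = video_dims
    errors = []
    if width % 32 != 0:
        errors.append(f"width {width} is not divisible by 32")
    if height % 32 != 0:
        errors.append(f"height {height} is not divisible by 32")
    if frames % 8 != 1:
        errors.append(f"frames {frames} does not satisfy frames % 8 == 1")
    return errors


def largest_valid_frame_count(frames: int) -> int:
    """Largest value <= frames that satisfies frames % 8 == 1 (requires frames >= 1)."""
    if frames < 1:
        raise ValueError("frames must be >= 1")
    return (frames - 1) // 8 * 8 + 1


print(validate_video_dims([576, 576, 89]))   # the example config passes both rules
print(validate_video_dims([500, 576, 90]))   # width and frame count are both invalid
print(largest_valid_frame_count(90))
```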

	### CheckpointsConfig

	Model checkpointing configuration.

	```yaml
	checkpoints:
	  interval: 250 # Steps between checkpoint saves (null = disabled)
	  keep_last_n: 3 # Number of recent checkpoints to retain
	  precision: bfloat16 # Precision for saved weights (bfloat16 or float32)
	```

	Key parameters:

	| Parameter | Description |
	|---------------|-------------------------------------------------------------------------------|
	| `interval` | Steps between intermediate checkpoint saves (set to `null` to disable) |
	| `keep_last_n` | Number of most recent checkpoints to keep (-1 = keep all) |
	| `precision` | Precision for saved checkpoint weights: `"bfloat16"` (default) or `"float32"` |
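
The interaction between `interval` and `keep_last_n` can be illustrated with a short sketch of a retention rule. This is an assumed model of the behaviour described in the table (keep the N most recent checkpoints, `-1` keeps everything), not the trainer's actual cleanup code, and `checkpoints_to_delete` is a hypothetical helper.

```python
def checkpoints_to_delete(saved_steps: list[int], keep_last_n: int) -> list[int]:
    """Return the checkpoint steps that fall outside the retention window."""
    if keep_last_n == -1:
        return []  # -1 means keep every checkpoint
    if keep_last_n < 0:
        raise ValueError("keep_last_n must be -1 or a non-negative integer")
    ordered = sorted(saved_steps)
    # Everything except the last keep_last_n entries is eligible for deletion.
    return ordered[: max(len(ordered) - keep_last_n, 0)]


# With interval: 250 and keep_last_n: 3, five saves leave the three newest on disk.
saved = [250, 500, 750, 1000, 1250]
print(checkpoints_to_delete(saved, keep_last_n=3))
```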

	### HubConfig

	Hugging Face Hub integration for automatic model uploads.

	```yaml
	hub:
	  push_to_hub: false # Enable Hub uploading
	  hub_model_id: "username/model-name" # Hub repository ID
	```

	Key parameters:

	| Parameter | Description |
	|----------------|------------------------------------------------------------------|
	| `push_to_hub` | Whether to automatically push trained models to Hugging Face Hub |
	| `hub_model_id` | Repository ID in format `"username/repository-name"` |

	### WandbConfig

	Weights & Biases logging configuration.

	```yaml
	wandb:
	  enabled: false # Enable W&B logging
	  project: "ltx-2-trainer" # W&B project name
	  entity: null # W&B username or team
	  tags: [ ] # Tags for the run
	  log_validation_videos: true # Log validation videos to W&B
	```

	Key parameters:

	| Parameter | Description |
	|-------------------------|--------------------------------------------------|
	| `enabled` | Whether to enable W&B logging |
	| `project` | W&B project name |
	| `entity` | W&B username or team (null uses default account) |
	| `log_validation_videos` | Whether to log validation videos to W&B |

	### FlowMatchingConfig

	Flow matching training configuration for timestep sampling.

	```yaml
	flow_matching:
	timestep_sampling_mode: "shifted_logit_normal" # Timestep sampling strategy
	timestep_sampling_params: { } # Additional sampling parameters
	```

	Key parameters:

	| Parameter | Description |
	|----------------------------|------------------------------------------------------------|
	| `timestep_sampling_mode` | Sampling strategy: `"uniform"` or `"shifted_logit_normal"` |
	| `timestep_sampling_params` | Additional parameters for the sampling strategy |
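
For intuition about the two sampling modes, the sketch below contrasts uniform sampling with a generic logit-normal sampler whose mean is shifted. It is an illustration of the general technique only: the parameter names `shift` and `std` are assumptions, and the trainer's actual `shifted_logit_normal` implementation and its accepted `timestep_sampling_params` may differ.

```python
import math
import random


def sample_timestep(mode: str, rng: random.Random, shift: float = 0.0, std: float = 1.0) -> float:
    """Draw one timestep in [0, 1] using the named strategy (illustrative only)."""
    if mode == "uniform":
        return rng.random()
    if mode == "shifted_logit_normal":
        # Sample in logit space around a shifted mean, then squash with a sigmoid.
        z = rng.gauss(shift, std)
        return 1.0 / (1.0 + math.exp(-z))
    raise ValueError(f"unknown timestep_sampling_mode: {mode}")


rng = random.Random(0)
for shift in (0.0, 1.0, 2.0):
    samples = [sample_timestep("shifted_logit_normal", rng, shift=shift) for _ in range(5000)]
    # A larger (hypothetical) shift moves the average sampled timestep toward 1.
    print(shift, round(sum(samples) / len(samples), 3))
```

Compared with uniform sampling, a logit-normal sampler concentrates training on a band of timesteps, and shifting the mean moves that band along the noise schedule.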

	## 🚀 Next Steps

	Once you've configured your training parameters:

	- Set up your dataset using [Dataset Preparation](dataset-preparation.md)
	- Choose your training approach in [Training Modes](training-modes.md)
	- Start training with the [Training Guide](training-guide.md)