# Training Modes Guide

The trainer supports several training modes, each suited to different use cases and requirements.

## 🎯 Standard LoRA Training (Video-Only)

Standard LoRA (Low-Rank Adaptation) training fine-tunes the model by adding small, trainable adapter layers while keeping the base model frozen. This approach:

- **Requires significantly less memory and compute** than full fine-tuning
- **Produces small, portable weight files** (typically hundreds of MB, depending on rank)
- **Is ideal for learning specific styles, effects, or concepts**
- **Can be easily combined with other LoRAs** during inference

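To make the mechanism concrete, here is a minimal, framework-agnostic sketch of the low-rank update LoRA adds to a frozen linear layer (illustrative only; the trainer's actual adapter implementation may differ):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base.requires_grad_(False)  # base weights stay frozen
        # Only these two small matrices are trained: rank * (in + out) parameters.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + scale * B(A x); only the low-rank term learns.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Because only `A` and `B` are saved, the exported file stays small and adapters can be stacked or swapped at inference time.
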
Configure standard LoRA training with:

```yaml
model:
  training_mode: "lora"
training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
  with_audio: false  # Video-only training
```

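`first_frame_conditioning_p` presumably controls how often a training sample keeps its first frame as clean conditioning, so the adapter also learns image-to-video behavior. A rough sketch of how such a per-sample probability is typically applied (an assumption about the mechanism, not the trainer's actual code):

```python
import torch

def sample_first_frame_conditioning(batch_size: int, p: float = 0.1) -> torch.Tensor:
    # True -> keep that sample's first-frame latent clean and condition on it;
    # False -> train as pure text-to-video.
    return torch.rand(batch_size) < p
```
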
## 🔊 Audio-Video LoRA Training

LTX-2 supports joint audio-video generation. You can train LoRA adapters that affect both video and audio output:

- **Synchronized audio-video generation** - Audio matches the visual content
- **Same efficient LoRA approach** - Just enable audio training
- **Requires audio latents** - Dataset must include preprocessed audio

Configure audio-video training with:

```yaml
model:
  training_mode: "lora"
training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
  with_audio: true  # Enable audio training
  audio_latents_dir: "audio_latents"  # Directory containing audio latents
```

**Example configuration file:**

- 📄 [Audio-Video LoRA Training](../configs/ltx2_av_lora.yaml)

**Dataset structure for audio-video training:**

```
preprocessed_data_root/
├── latents/        # Video latents
├── conditions/     # Text embeddings
└── audio_latents/  # Audio latents (required when with_audio: true)
```

> [!IMPORTANT]
> When training audio-video LoRAs, ensure your `target_modules` configuration captures video, audio, and
> cross-modal attention branches. Use patterns like `"to_k"` instead of `"attn1.to_k"` to match:
> - Video modules: `attn1.to_k`, `attn2.to_k`
> - Audio modules: `audio_attn1.to_k`, `audio_attn2.to_k`
> - Cross-modal modules: `audio_to_video_attn.to_k`, `video_to_audio_attn.to_k`
>
> The cross-modal attention modules (`audio_to_video_attn` and `video_to_audio_attn`) enable bidirectional
> information flow between audio and video, which is critical for synchronized audiovisual generation.
> See [Understanding Target Modules](configuration-reference.md#understanding-target-modules) for detailed guidance.

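To see what a given pattern will catch, here is a sketch of the suffix-matching rule most LoRA tooling (e.g., PEFT) uses; the trainer's exact matcher and the `blocks.0.` prefixes below are assumptions for illustration:

```python
def matches(module_name: str, pattern: str) -> bool:
    # Common LoRA convention: a pattern matches an exact name or a dotted suffix.
    return module_name == pattern or module_name.endswith("." + pattern)

names = [
    "blocks.0.attn1.to_k", "blocks.0.attn2.to_k",               # video
    "blocks.0.audio_attn1.to_k", "blocks.0.audio_attn2.to_k",   # audio
    "blocks.0.audio_to_video_attn.to_k",                        # cross-modal
    "blocks.0.video_to_audio_attn.to_k",
]
for pattern in ("attn1.to_k", "to_k"):
    hits = [n for n in names if matches(n, pattern)]
    print(f"{pattern!r} matches {len(hits)}/{len(names)} modules")
# 'attn1.to_k' matches 1/6 (video branch only); 'to_k' matches 6/6 (all branches).
```
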
> [!NOTE]
> You can generate audio during validation even if you're not training the audio branch.
> Set `validation.generate_audio: true` independently of `training_strategy.with_audio`.

## 🔥 Full Model Fine-tuning

Full model fine-tuning updates all parameters of the base model, providing maximum flexibility but requiring substantial computational resources and larger training datasets:

- **Offers the highest potential quality and capability improvements**
- **Requires multiple GPUs** and distributed training techniques (e.g., FSDP)
- **Produces large checkpoint files** (tens of GB)
- **Best for major model adaptations** or when LoRA limitations are reached

Configure full fine-tuning with:

```yaml
model:
  training_mode: "full"
training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
```

> [!IMPORTANT]
> Full fine-tuning of LTX-2 requires multiple high-end GPUs (e.g., 4-8× H100 80GB) and distributed
> training with FSDP. See [Training Guide](training-guide.md) for multi-GPU setup instructions.

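For orientation, this is roughly what the distributed setup amounts to under the hood: a minimal PyTorch FSDP sketch meant to run under `torchrun`, where `load_ltx2_transformer` is a hypothetical stand-in for the actual model loading (the trainer configures all of this for you):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")  # one process per GPU when launched via torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = load_ltx2_transformer()  # hypothetical loader for the base model
model = FSDP(model, device_id=torch.cuda.current_device())
# FSDP shards parameters, gradients, and optimizer state across ranks,
# which is what lets full fine-tuning fit across multiple 80 GB GPUs.
```
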
## 🔄 In-Context LoRA (IC-LoRA) Training

IC-LoRA is a specialized training mode for video-to-video transformations. Unlike the standard modes, which learn from individual videos, IC-LoRA learns transformations from pairs of videos.

This enables a wide range of advanced video-to-video applications, such as:

- **Control adapters** (e.g., depth, pose): Learn to map from a control signal (like a depth map or pose skeleton) to a target video
- **Video deblurring**: Transform blurry input videos into sharp, high-quality outputs
- **Style transfer**: Apply the style of a reference video to a target video sequence
- **Colorization**: Convert grayscale reference videos into colorized outputs
- **Restoration and enhancement**: Denoise, upscale, or restore old or degraded videos

By providing paired reference and target videos, IC-LoRA can learn complex transformations that go beyond caption-based conditioning.

IC-LoRA training fundamentally differs from standard LoRA and full fine-tuning:

- **Reference videos** provide clean, unnoised conditioning input showing the "before" state
- **Target videos** are noised during training and represent the desired "after" state
- **The model learns transformations** from reference videos to target videos
- **Loss is applied only to the target portion**, not the reference
- **Training and inference time increase significantly** due to the doubled sequence length

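In pseudocode, the training step looks roughly like the sketch below. This is illustrative only: `add_noise` is a hypothetical stand-in for the trainer's actual noise schedule, and the real objective may differ in form:

```python
import torch
import torch.nn.functional as F

def ic_lora_step(model, reference, target, t):
    # reference: clean conditioning latents ("before"); target: latents to learn ("after").
    noise = torch.randn_like(target)
    noised_target = add_noise(target, noise, t)             # hypothetical noise-schedule helper
    tokens = torch.cat([reference, noised_target], dim=1)   # doubled sequence length
    pred = model(tokens, t)
    pred_target = pred[:, reference.shape[1]:]              # discard the reference portion
    return F.mse_loss(pred_target, noise)                   # loss on the target half only
```
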
To enable IC-LoRA training, configure your YAML file with:

```yaml
model:
  training_mode: "lora"  # Required: IC-LoRA uses LoRA mode
training_strategy:
  name: "video_to_video"
  first_frame_conditioning_p: 0.1
  reference_latents_dir: "reference_latents"  # Directory for reference video latents
```

**Example configuration file:**

- 📄 [IC-LoRA Training](../configs/ltx2_v2v_ic_lora.yaml) - Video-to-video transformation training

### Dataset Requirements for IC-LoRA

- Your dataset must contain **paired videos**, where each target video has a corresponding reference video
- Reference and target videos must have **identical resolution and length**
- Both reference and target videos should be **preprocessed together** using the same resolution buckets

**Dataset structure for IC-LoRA training:**

```
preprocessed_data_root/
├── latents/            # Target video latents (what the model learns to generate)
├── conditions/         # Text embeddings for each video
└── reference_latents/  # Reference video latents (conditioning input)
```

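A quick way to verify the pairing before launching a run, assuming one latent file per clip with matching filenames across the two directories (adjust paths to your setup):

```python
from pathlib import Path

root = Path("preprocessed_data_root")
targets = {p.name for p in (root / "latents").iterdir()}
references = {p.name for p in (root / "reference_latents").iterdir()}

missing = sorted(targets - references)
if missing:
    raise SystemExit(f"{len(missing)} target latents have no reference, e.g. {missing[:5]}")
print("Every target latent has a paired reference latent.")
```
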
### Generating Reference Videos

We provide an example script to generate reference videos (e.g., Canny edge maps) for a given dataset.
The script takes a JSON file as input (e.g., the output of `caption_videos.py`) and updates it with the generated reference video paths.

```bash
uv run python scripts/compute_reference.py scenes_output_dir/ \
  --output scenes_output_dir/dataset.json
```

To compute a different condition (depth maps, pose skeletons, etc.), modify the `compute_reference()` function in the script.

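As an illustration, a Canny-style `compute_reference()` might look like the sketch below; the real script's signature and I/O handling may differ, so treat this as a shape-level example only:

```python
import cv2
import numpy as np

def compute_reference(frames: np.ndarray) -> np.ndarray:
    """Map (T, H, W, 3) uint8 RGB frames to 3-channel Canny edge maps."""
    edge_frames = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        edges = cv2.Canny(gray, threshold1=100, threshold2=200)
        edge_frames.append(np.repeat(edges[:, :, None], 3, axis=2))  # back to 3 channels
    return np.stack(edge_frames)
```
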
### Configuration Requirements for IC-LoRA

- You **must** provide `reference_videos` in your validation configuration when using IC-LoRA training
- The number of reference videos must match the number of validation prompts

Example validation configuration for IC-LoRA:

```yaml
validation:
  prompts:
    - "First prompt describing the desired output"
    - "Second prompt describing the desired output"
  reference_videos:
    - "/path/to/reference1.mp4"
    - "/path/to/reference2.mp4"
  include_reference_in_output: true  # Show reference side-by-side with output
```

## 📊 Training Mode Comparison

| Aspect               | LoRA       | Audio-Video LoRA | Full Fine-tuning | IC-LoRA        |
|----------------------|------------|------------------|------------------|----------------|
| **Memory Usage**     | Low        | Low-Medium       | High             | Medium         |
| **Training Speed**   | Fast       | Fast             | Slow             | Medium         |
| **Output Size**      | 100 MB to a few GB (depends on rank) | 100 MB to a few GB (depends on rank) | Tens of GB | 100 MB to a few GB (depends on rank) |
| **Flexibility**      | Medium     | Medium           | High             | Specialized    |
| **Audio Support**    | Optional   | Yes              | Optional         | No             |
| **Reference Videos** | No         | No               | No               | Yes (required) |

## 🎬 Using Trained Models for Inference

After training, use the [`ltx-pipelines`](../../ltx-pipelines/) package for production inference with your trained LoRAs:

| Training Mode           | Recommended Pipeline                                  |
|-------------------------|-------------------------------------------------------|
| LoRA / Audio-Video LoRA | `TI2VidOneStagePipeline` or `TI2VidTwoStagesPipeline` |
| IC-LoRA                 | `ICLoraPipeline`                                      |

All pipelines support loading custom LoRAs via the `loras` parameter. See the [`ltx-pipelines`](../../ltx-pipelines/) package documentation for detailed usage instructions.

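A rough usage sketch is below; the import path and constructor arguments are assumptions (only the pipeline names and the `loras` parameter are documented above), so consult the `ltx-pipelines` docs for the real signatures:

```python
from ltx_pipelines import TI2VidOneStagePipeline  # assumed import path

pipeline = TI2VidOneStagePipeline(...)  # constructor args: see the ltx-pipelines docs
video = pipeline(
    prompt="A description of the desired clip",
    loras=["/path/to/your_lora.safetensors"],  # documented: pipelines accept custom LoRAs here
)
```
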
## 🚀 Next Steps

Once you've chosen your training mode:

- Set up your dataset using [Dataset Preparation](dataset-preparation.md)
- Configure your training parameters in [Configuration Reference](configuration-reference.md)
- Start training with the [Training Guide](training-guide.md)