# Training Modes Guide

The trainer supports several training modes, each suited to different use cases and requirements.

## 🎯 Standard LoRA Training (Video-Only)

Standard LoRA (Low-Rank Adaptation) training fine-tunes the model by adding small, trainable adapter layers while keeping the base model frozen. This approach:

- **Requires significantly less memory and compute** than full fine-tuning
- **Produces small, portable weight files** (typically hundreds of MB, depending on rank)
- **Is ideal for learning specific styles, effects, or concepts**
- **Can be easily combined with other LoRAs** during inference

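To make the mechanism concrete, here is a minimal, framework-agnostic sketch of the low-rank update LoRA adds to a frozen linear layer (illustrative only; the trainer's actual adapter implementation may differ):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base.requires_grad_(False)  # base weights stay frozen
        # Only these two small matrices are trained: rank * (in + out) parameters.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + scale * B(A x); only the low-rank term learns.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Because only `A` and `B` are saved, the exported file stays small and adapters can be stacked or swapped at inference time.
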
Configure standard LoRA training with:

```yaml
model:
  training_mode: "lora"
training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
  with_audio: false  # Video-only training
```

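`first_frame_conditioning_p` presumably controls how often a training sample keeps its first frame as clean conditioning, so the adapter also learns image-to-video behavior. A rough sketch of how such a per-sample probability is typically applied (an assumption about the mechanism, not the trainer's actual code):

```python
import torch

def sample_first_frame_conditioning(batch_size: int, p: float = 0.1) -> torch.Tensor:
    # True -> keep that sample's first-frame latent clean and condition on it;
    # False -> train as pure text-to-video.
    return torch.rand(batch_size) < p
```
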
## 🔊 Audio-Video LoRA Training

LTX-2 supports joint audio-video generation. You can train LoRA adapters that affect both video and audio output:

- **Synchronized audio-video generation** - Audio matches the visual content
- **Same efficient LoRA approach** - Just enable audio training
- **Requires audio latents** - Dataset must include preprocessed audio

Configure audio-video training with:

```yaml
model:
  training_mode: "lora"
training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
  with_audio: true  # Enable audio training
  audio_latents_dir: "audio_latents"  # Directory containing audio latents
```

**Example configuration file:**

- 📄 [Audio-Video LoRA Training](../configs/ltx2_av_lora.yaml)

**Dataset structure for audio-video training:**

```
preprocessed_data_root/
├── latents/        # Video latents
├── conditions/     # Text embeddings
└── audio_latents/  # Audio latents (required when with_audio: true)
```

> [!IMPORTANT]
> When training audio-video LoRAs, ensure your `target_modules` configuration captures video, audio, and
> cross-modal attention branches. Use patterns like `"to_k"` instead of `"attn1.to_k"` to match:
> - Video modules: `attn1.to_k`, `attn2.to_k`
> - Audio modules: `audio_attn1.to_k`, `audio_attn2.to_k`
> - Cross-modal modules: `audio_to_video_attn.to_k`, `video_to_audio_attn.to_k`
>
> The cross-modal attention modules (`audio_to_video_attn` and `video_to_audio_attn`) enable bidirectional
> information flow between audio and video, which is critical for synchronized audiovisual generation.
> See [Understanding Target Modules](configuration-reference.md#understanding-target-modules) for detailed guidance.

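To see what a given pattern will catch, here is a sketch of the suffix-matching rule most LoRA tooling (e.g., PEFT) uses; the trainer's exact matcher and the `blocks.0.` prefixes below are assumptions for illustration:

```python
def matches(module_name: str, pattern: str) -> bool:
    # Common LoRA convention: a pattern matches an exact name or a dotted suffix.
    return module_name == pattern or module_name.endswith("." + pattern)

names = [
    "blocks.0.attn1.to_k", "blocks.0.attn2.to_k",               # video
    "blocks.0.audio_attn1.to_k", "blocks.0.audio_attn2.to_k",   # audio
    "blocks.0.audio_to_video_attn.to_k",                        # cross-modal
    "blocks.0.video_to_audio_attn.to_k",
]
for pattern in ("attn1.to_k", "to_k"):
    hits = [n for n in names if matches(n, pattern)]
    print(f"{pattern!r} matches {len(hits)}/{len(names)} modules")
# 'attn1.to_k' matches 1/6 (video branch only); 'to_k' matches 6/6 (all branches).
```
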
> [!NOTE]
> You can generate audio during validation even if you're not training the audio branch.
> Set `validation.generate_audio: true` independently of `training_strategy.with_audio`.

## 🔥 Full Model Fine-tuning

Full model fine-tuning updates all parameters of the base model, providing maximum flexibility but requiring substantial computational resources and larger training datasets:

- **Offers the highest potential quality and capability improvements**
- **Requires multiple GPUs** and distributed training techniques (e.g., FSDP)
- **Produces large checkpoint files** (tens of GB)
- **Best for major model adaptations** or when LoRA limitations are reached

Configure full fine-tuning with:

```yaml
model:
  training_mode: "full"
training_strategy:
  name: "text_to_video"
  first_frame_conditioning_p: 0.1
```

> [!IMPORTANT]
> Full fine-tuning of LTX-2 requires multiple high-end GPUs (e.g., 4-8× H100 80GB) and distributed
> training with FSDP. See [Training Guide](training-guide.md) for multi-GPU setup instructions.

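For orientation, this is roughly what the distributed setup amounts to under the hood: a minimal PyTorch FSDP sketch meant to run under `torchrun`, where `load_ltx2_transformer` is a hypothetical stand-in for the actual model loading (the trainer configures all of this for you):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")  # one process per GPU when launched via torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = load_ltx2_transformer()  # hypothetical loader for the base model
model = FSDP(model, device_id=torch.cuda.current_device())
# FSDP shards parameters, gradients, and optimizer state across ranks,
# which is what lets full fine-tuning fit across multiple 80 GB GPUs.
```
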
## 🔄 In-Context LoRA (IC-LoRA) Training

IC-LoRA is a specialized training mode for video-to-video transformations. Unlike the standard modes, which learn from individual videos, IC-LoRA learns transformations from pairs of videos.

This enables a wide range of advanced video-to-video applications, such as:

- **Control adapters** (e.g., depth, pose): Learn to map from a control signal (like a depth map or pose skeleton) to a target video
- **Video deblurring**: Transform blurry input videos into sharp, high-quality outputs
- **Style transfer**: Apply the style of a reference video to a target video sequence
- **Colorization**: Convert grayscale reference videos into colorized outputs
- **Restoration and enhancement**: Denoise, upscale, or restore old or degraded videos

By providing paired reference and target videos, IC-LoRA can learn complex transformations that go beyond caption-based conditioning.

IC-LoRA training fundamentally differs from standard LoRA and full fine-tuning:

- **Reference videos** provide clean, unnoised conditioning input showing the "before" state
- **Target videos** are noised during training and represent the desired "after" state
- **The model learns transformations** from reference videos to target videos
- **Loss is applied only to the target portion**, not the reference
- **Training and inference time increase significantly** due to the doubled sequence length

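In pseudocode, the training step looks roughly like the sketch below. This is illustrative only: `add_noise` is a hypothetical stand-in for the trainer's actual noise schedule, and the real objective may differ in form:

```python
import torch
import torch.nn.functional as F

def ic_lora_step(model, reference, target, t):
    # reference: clean conditioning latents ("before"); target: latents to learn ("after").
    noise = torch.randn_like(target)
    noised_target = add_noise(target, noise, t)             # hypothetical noise-schedule helper
    tokens = torch.cat([reference, noised_target], dim=1)   # doubled sequence length
    pred = model(tokens, t)
    pred_target = pred[:, reference.shape[1]:]              # discard the reference portion
    return F.mse_loss(pred_target, noise)                   # loss on the target half only
```
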
To enable IC-LoRA training, configure your YAML file with:

```yaml
model:
  training_mode: "lora"  # Required: IC-LoRA uses LoRA mode
training_strategy:
  name: "video_to_video"
  first_frame_conditioning_p: 0.1
  reference_latents_dir: "reference_latents"  # Directory for reference video latents
```

**Example configuration file:**

- 📄 [IC-LoRA Training](../configs/ltx2_v2v_ic_lora.yaml) - Video-to-video transformation training

### Dataset Requirements for IC-LoRA

- Your dataset must contain **paired videos**, where each target video has a corresponding reference video
- Reference and target videos must have **identical resolution and length**
- Both reference and target videos should be **preprocessed together** using the same resolution buckets

**Dataset structure for IC-LoRA training:**

```
preprocessed_data_root/
├── latents/            # Target video latents (what the model learns to generate)
├── conditions/         # Text embeddings for each video
└── reference_latents/  # Reference video latents (conditioning input)
```

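A quick way to verify the pairing before launching a run, assuming one latent file per clip with matching filenames across the two directories (adjust paths to your setup):

```python
from pathlib import Path

root = Path("preprocessed_data_root")
targets = {p.name for p in (root / "latents").iterdir()}
references = {p.name for p in (root / "reference_latents").iterdir()}

missing = sorted(targets - references)
if missing:
    raise SystemExit(f"{len(missing)} target latents have no reference, e.g. {missing[:5]}")
print("Every target latent has a paired reference latent.")
```
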
### Generating Reference Videos

We provide an example script to generate reference videos (e.g., Canny edge maps) for a given dataset.
The script takes a JSON file as input (e.g., the output of `caption_videos.py`) and updates it with the generated reference video paths.

```bash
uv run python scripts/compute_reference.py scenes_output_dir/ \
  --output scenes_output_dir/dataset.json
```

To compute a different condition (depth maps, pose skeletons, etc.), modify the `compute_reference()` function in the script.

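As an illustration, a Canny-style `compute_reference()` might look like the sketch below; the real script's signature and I/O handling may differ, so treat this as a shape-level example only:

```python
import cv2
import numpy as np

def compute_reference(frames: np.ndarray) -> np.ndarray:
    """Map (T, H, W, 3) uint8 RGB frames to 3-channel Canny edge maps."""
    edge_frames = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        edges = cv2.Canny(gray, threshold1=100, threshold2=200)
        edge_frames.append(np.repeat(edges[:, :, None], 3, axis=2))  # back to 3 channels
    return np.stack(edge_frames)
```
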
### Configuration Requirements for IC-LoRA

- You **must** provide `reference_videos` in your validation configuration when using IC-LoRA training
- The number of reference videos must match the number of validation prompts

Example validation configuration for IC-LoRA:

```yaml
validation:
  prompts:
    - "First prompt describing the desired output"
    - "Second prompt describing the desired output"
  reference_videos:
    - "/path/to/reference1.mp4"
    - "/path/to/reference2.mp4"
  include_reference_in_output: true  # Show reference side-by-side with output
```

## 📊 Training Mode Comparison

| Aspect               | LoRA       | Audio-Video LoRA | Full Fine-tuning | IC-LoRA        |
|----------------------|------------|------------------|------------------|----------------|
| **Memory Usage**     | Low        | Low-Medium       | High             | Medium         |
| **Training Speed**   | Fast       | Fast             | Slow             | Medium         |
| **Output Size**      | 100 MB to a few GB (depends on rank) | 100 MB to a few GB (depends on rank) | Tens of GB | 100 MB to a few GB (depends on rank) |
| **Flexibility**      | Medium     | Medium           | High             | Specialized    |
| **Audio Support**    | Optional   | Yes              | Optional         | No             |
| **Reference Videos** | No         | No               | No               | Yes (required) |

## 🎬 Using Trained Models for Inference

After training, use the [`ltx-pipelines`](../../ltx-pipelines/) package for production inference with your trained LoRAs:

| Training Mode           | Recommended Pipeline                                  |
|-------------------------|-------------------------------------------------------|
| LoRA / Audio-Video LoRA | `TI2VidOneStagePipeline` or `TI2VidTwoStagesPipeline` |
| IC-LoRA                 | `ICLoraPipeline`                                      |

All pipelines support loading custom LoRAs via the `loras` parameter. See the [`ltx-pipelines`](../../ltx-pipelines/) package documentation for detailed usage instructions.

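A rough usage sketch is below; the import path and constructor arguments are assumptions (only the pipeline names and the `loras` parameter are documented above), so consult the `ltx-pipelines` docs for the real signatures:

```python
from ltx_pipelines import TI2VidOneStagePipeline  # assumed import path

pipeline = TI2VidOneStagePipeline(...)  # constructor args: see the ltx-pipelines docs
video = pipeline(
    prompt="A description of the desired clip",
    loras=["/path/to/your_lora.safetensors"],  # documented: pipelines accept custom LoRAs here
)
```
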
## 🚀 Next Steps

Once you've chosen your training mode:

- Set up your dataset using [Dataset Preparation](dataset-preparation.md)
- Configure your training parameters in [Configuration Reference](configuration-reference.md)
- Start training with the [Training Guide](training-guide.md)