# Training Guide

This guide covers how to run training jobs, from basic single-GPU training to advanced distributed setups and automatic model uploads.

## ⚡ Basic Training (Single GPU)

After preprocessing your dataset and preparing a configuration file, you can start training using the trainer script:

```bash
uv run python scripts/train.py configs/ltx2_av_lora.yaml
```

The trainer will:

1. **Load your configuration** and validate all parameters
2. **Initialize models** and apply optimizations
3. **Run the training loop** with progress tracking
4. **Generate validation videos** (if configured)
5. **Save the trained weights** in your output directory

### Output Files

**For LoRA training:**

- `lora_weights.safetensors` - Main LoRA weights file
- `training_config.yaml` - Copy of training configuration
- `validation_samples/` - Generated validation videos (if enabled)

**For full model fine-tuning:**

- `model_weights.safetensors` - Full model weights
- `training_config.yaml` - Copy of training configuration
- `validation_samples/` - Generated validation videos (if enabled)

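
To sanity-check the output before running inference, you can list the tensors stored in the weights file. A minimal sketch, assuming the `safetensors` package is available in the environment and that your output directory is `output/` (adjust the path to match your configuration):

```bash
uv run python -c "
from safetensors import safe_open

# Open the LoRA weights file and print a summary of its contents
with safe_open('output/lora_weights.safetensors', framework='pt') as f:
    keys = list(f.keys())
    print(len(keys), 'tensors')
    print(keys[:5])
"
```
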
## 🖥️ Distributed / Multi-GPU Training

We use Hugging Face 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) for multi-GPU DDP and FSDP.

### Configure Accelerate

Run the interactive wizard once to set up your environment (DDP / FSDP, GPU count, etc.):

```bash
uv run accelerate config
```

This stores your preferences in `~/.cache/huggingface/accelerate/default_config.yaml`.

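
You can verify the stored settings at any time with `accelerate env`, which prints the active configuration along with your hardware and library versions:

```bash
uv run accelerate env
```
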
### Use the Provided Accelerate Configs (Recommended)

We include ready-to-use Accelerate config files in `configs/accelerate/`:

- [ddp.yaml](../configs/accelerate/ddp.yaml) – Standard DDP
- [ddp_compile.yaml](../configs/accelerate/ddp_compile.yaml) – DDP with `torch.compile` (Inductor)
- [fsdp.yaml](../configs/accelerate/fsdp.yaml) – Standard FSDP (auto-wraps `BasicAVTransformerBlock`)
- [fsdp_compile.yaml](../configs/accelerate/fsdp_compile.yaml) – FSDP with `torch.compile` (Inductor)

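
For orientation, a DDP Accelerate config typically has the shape sketched below. This is illustrative only, not the contents of the repo's files, so treat `configs/accelerate/ddp.yaml` as the source of truth:

```yaml
# Typical Accelerate DDP config shape (illustrative sketch)
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16   # assumption; check the provided file
num_machines: 1
num_processes: 2        # overridable at launch with --num_processes
gpu_ids: all
```
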
Launch with a specific config using `--config_file`:

```bash
# DDP (2 GPUs shown as example)
CUDA_VISIBLE_DEVICES=0,1 \
uv run accelerate launch --config_file configs/accelerate/ddp.yaml \
scripts/train.py configs/ltx2_av_lora.yaml

# DDP + torch.compile
CUDA_VISIBLE_DEVICES=0,1 \
uv run accelerate launch --config_file configs/accelerate/ddp_compile.yaml \
scripts/train.py configs/ltx2_av_lora.yaml

# FSDP (4 GPUs shown as example)
CUDA_VISIBLE_DEVICES=0,1,2,3 \
uv run accelerate launch --config_file configs/accelerate/fsdp.yaml \
scripts/train.py configs/ltx2_av_lora.yaml

# FSDP + torch.compile
CUDA_VISIBLE_DEVICES=0,1,2,3 \
uv run accelerate launch --config_file configs/accelerate/fsdp_compile.yaml \
scripts/train.py configs/ltx2_av_lora.yaml
```

**Notes:**

- The number of processes is taken from the Accelerate config (`num_processes`). Override it with `--num_processes X` or restrict visible GPUs with `CUDA_VISIBLE_DEVICES` (see the example after this list).
- The compile variants enable `torch.compile` with the Inductor backend via Accelerate's `dynamo_config`.
- FSDP configs auto-wrap the transformer blocks (`fsdp_transformer_layer_cls_to_wrap: BasicAVTransformerBlock`).

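
For example, to run the provided DDP config on four GPUs instead of the process count baked into the file:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 \
uv run accelerate launch --config_file configs/accelerate/ddp.yaml --num_processes 4 \
scripts/train.py configs/ltx2_av_lora.yaml
```
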
### Launch with Your Default Accelerate Config

If you prefer to use your default Accelerate profile:

```bash
# Use settings from your default accelerate config
uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml

# Override the number of processes on the fly (e.g., 2 GPUs)
uv run accelerate launch --num_processes 2 scripts/train.py configs/ltx2_av_lora.yaml

# Select specific GPUs
CUDA_VISIBLE_DEVICES=0,1 uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml
```

> [!TIP]
> You can disable the in-terminal progress bars by passing the `--disable_progress_bars` flag to the trainer CLI.

### Benefits of Distributed Training

- **Faster training**: Distribute the workload across multiple GPUs
- **Larger effective batch sizes**: Combine gradients from multiple GPUs
- **Memory efficiency**: Each GPU handles a portion of the batch

> [!NOTE]
> Distributed training requires that all GPUs have sufficient memory for the model and batch size. The effective batch
> size becomes `batch_size × num_processes`; for example, `batch_size: 2` on 4 GPUs yields an effective batch size of 8.

## 🤗 Pushing Models to Hugging Face Hub

You can automatically push your trained models to the Hugging Face Hub by adding the following to your configuration:

```yaml
hub:
  push_to_hub: true
  hub_model_id: "your-username/your-model-name"
```

### Prerequisites

Before pushing, make sure you:

1. **Have a Hugging Face account** - Sign up at [huggingface.co](https://huggingface.co)
2. **Are logged in** via `huggingface-cli login` or have set the `HUGGING_FACE_HUB_TOKEN` environment variable
3. **Have write access** to the specified repository (it will be created if it doesn't exist)

### Login Options

**Option 1: Interactive login**

```bash
uv run huggingface-cli login
```

**Option 2: Environment variable**

```bash
export HUGGING_FACE_HUB_TOKEN="your_token_here"
```

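
Either way, you can confirm the authentication worked before launching a long training run:

```bash
uv run huggingface-cli whoami
```
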
### What Gets Uploaded

The trainer will automatically:

- **Create a model card** with training details and sample outputs
- **Upload model weights**
- **Push sample videos as GIFs** in the model card
- **Include training configuration and prompts**

## 📊 Weights & Biases Logging

Enable experiment tracking with W&B by adding to your configuration:

```yaml
wandb:
  enabled: true
  project: "ltx-2-trainer"
  entity: null # Your W&B username or team
  tags: [ "ltx2", "lora" ]
  log_validation_videos: true
```

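
W&B needs an authenticated session before the first run. Assuming the `wandb` CLI is installed alongside the trainer's dependencies, log in once with:

```bash
uv run wandb login
```
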
This will log:

- Training loss and learning rate
- Validation videos
- Model configuration
- Training progress

## 🚀 Next Steps

After training completes:

- **Run inference with your trained LoRA** - The [`ltx-pipelines`](../../ltx-pipelines/) package provides production-ready inference pipelines that support loading custom LoRAs, including text-to-video, image-to-video, IC-LoRA video-to-video, and more. See that package for usage details.
- **Test your model** with validation prompts
- **Iterate and improve** based on validation results
- **Share your results** by pushing to Hugging Face Hub

## 💡 Tips for Successful Training

- **Start small**: Begin with a small dataset and a few hundred steps to verify everything works
- **Monitor validation**: Keep an eye on validation samples to catch overfitting
- **Adjust learning rate**: Lower learning rates often produce better results
- **Use gradient checkpointing**: Essential for training with limited GPU memory
- **Save checkpoints**: Regular checkpoints help recover from interruptions (see the config sketch after this list)

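
As a rough illustration of the last three tips in config form: the field names below (`learning_rate`, `gradient_checkpointing`, `checkpointing_steps`) are hypothetical placeholders, not the trainer's confirmed schema, so check `configs/ltx2_av_lora.yaml` for the real keys:

```yaml
# Hypothetical keys for illustration only -- consult your actual
# training config for the schema this trainer uses.
training:
  learning_rate: 1e-4            # lower rates often train more stably
  gradient_checkpointing: true   # trades extra compute for less GPU memory
  checkpointing_steps: 500       # save a recovery checkpoint every N steps
```
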
## Need Help?

If you encounter issues during training, see the [Troubleshooting Guide](troubleshooting.md).

Join our [Discord community](https://discord.gg/2mafsHjJ) for real-time help!