# Training Guide
This guide covers how to run training jobs, from basic single-GPU training to advanced distributed setups and automatic
model uploads.
## ⚡ Basic Training (Single GPU)
After preprocessing your dataset and preparing a configuration file, you can start training using the trainer script:
```bash
uv run python scripts/train.py configs/ltx2_av_lora.yaml
```
The trainer will:
1. **Load your configuration** and validate all parameters
2. **Initialize models** and apply optimizations
3. **Run the training loop** with progress tracking
4. **Generate validation videos** (if configured)
5. **Save the trained weights** in your output directory
### Output Files
**For LoRA training:**
- `lora_weights.safetensors` - Main LoRA weights file
- `training_config.yaml` - Copy of training configuration
- `validation_samples/` - Generated validation videos (if enabled)
**For full model fine-tuning:**
- `model_weights.safetensors` - Full model weights
- `training_config.yaml` - Copy of training configuration
- `validation_samples/` - Generated validation videos (if enabled)
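For example, after a LoRA run with a hypothetical `output_dir` of `outputs/my_lora` (the directory name comes from your training config), the output would look like:

```
outputs/my_lora/
├── lora_weights.safetensors     # main LoRA weights
├── training_config.yaml         # copy of the training configuration
└── validation_samples/          # generated validation videos (if enabled)
```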
## 🖥️ Distributed / Multi-GPU Training
We use Hugging Face 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) for multi-GPU DDP and FSDP.
### Configure Accelerate
Run the interactive wizard once to set up your environment (DDP / FSDP, GPU count, etc.):
```bash
uv run accelerate config
```
This stores your preferences in `~/.cache/huggingface/accelerate/default_config.yaml`.
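For reference, a generated config for a single-machine, 2-GPU DDP setup looks roughly like the sketch below; your answers to the wizard (GPU count, mixed precision, etc.) will produce different values:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
machine_rank: 0
num_machines: 1
num_processes: 2
main_training_function: main
use_cpu: false
```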
### Use the Provided Accelerate Configs (Recommended)
We include ready-to-use Accelerate config files in `configs/accelerate/`:
- [ddp.yaml](../configs/accelerate/ddp.yaml) – Standard DDP
- [ddp_compile.yaml](../configs/accelerate/ddp_compile.yaml) – DDP with `torch.compile` (Inductor)
- [fsdp.yaml](../configs/accelerate/fsdp.yaml) – Standard FSDP (auto-wraps `BasicAVTransformerBlock`)
- [fsdp_compile.yaml](../configs/accelerate/fsdp_compile.yaml) – FSDP with `torch.compile` (Inductor)
Launch with a specific config using `--config_file`:
```bash
# DDP (2 GPUs shown as example)
CUDA_VISIBLE_DEVICES=0,1 \
uv run accelerate launch --config_file configs/accelerate/ddp.yaml \
scripts/train.py configs/ltx2_av_lora.yaml
# DDP + torch.compile
CUDA_VISIBLE_DEVICES=0,1 \
uv run accelerate launch --config_file configs/accelerate/ddp_compile.yaml \
scripts/train.py configs/ltx2_av_lora.yaml
# FSDP (4 GPUs shown as example)
CUDA_VISIBLE_DEVICES=0,1,2,3 \
uv run accelerate launch --config_file configs/accelerate/fsdp.yaml \
scripts/train.py configs/ltx2_av_lora.yaml
# FSDP + torch.compile
CUDA_VISIBLE_DEVICES=0,1,2,3 \
uv run accelerate launch --config_file configs/accelerate/fsdp_compile.yaml \
scripts/train.py configs/ltx2_av_lora.yaml
```
**Notes:**
- The number of processes is taken from the Accelerate config (`num_processes`). Override with `--num_processes X` or
restrict GPUs with `CUDA_VISIBLE_DEVICES`.
- The compile variants enable `torch.compile` with the Inductor backend via Accelerate's `dynamo_config`.
- FSDP configs auto-wrap the transformer blocks (`fsdp_transformer_layer_cls_to_wrap: BasicAVTransformerBlock`).
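As a condensed sketch of what the FSDP + compile variants set (the files in `configs/accelerate/` are authoritative; this just highlights the relevant keys):

```yaml
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: BasicAVTransformerBlock
dynamo_config:
  dynamo_backend: INDUCTOR
```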
### Launch with Your Default Accelerate Config
If you prefer to use your default Accelerate profile:
```bash
# Use settings from your default accelerate config
uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml
# Override number of processes on the fly (e.g., 2 GPUs)
uv run accelerate launch --num_processes 2 scripts/train.py configs/ltx2_av_lora.yaml
# Select specific GPUs
CUDA_VISIBLE_DEVICES=0,1 uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml
```
> [!TIP]
> You can disable the in-terminal progress bars by passing the `--disable_progress_bars` flag to the trainer CLI.
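>
> For example (the flag is appended to the trainer invocation; adjust the launch command to your setup):
>
> ```bash
> uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml --disable_progress_bars
> ```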
### Benefits of Distributed Training
- **Faster training**: Distribute workload across multiple GPUs
- **Larger effective batch sizes**: Combine gradients from multiple GPUs
- **Memory efficiency**: Each GPU handles a portion of the batch
> [!NOTE]
> Distributed training requires that all GPUs have sufficient memory for the model and batch size. The effective batch
> size becomes `batch_size × num_processes`; for example, `batch_size: 2` on 4 GPUs yields an effective batch size of 8.
## 🤗 Pushing Models to Hugging Face Hub
You can automatically push your trained models to the Hugging Face Hub by adding the following to your configuration:
```yaml
hub:
push_to_hub: true
hub_model_id: "your-username/your-model-name"
```
### Prerequisites
Before pushing, make sure you:
1. **Have a Hugging Face account** - Sign up at [huggingface.co](https://huggingface.co)
2. **Are logged in** via `huggingface-cli login` or have set the `HUGGING_FACE_HUB_TOKEN` environment variable
3. **Have write access** to the specified repository (it will be created if it doesn't exist)
### Login Options
**Option 1: Interactive login**
```bash
uv run huggingface-cli login
```
**Option 2: Environment variable**
```bash
export HUGGING_FACE_HUB_TOKEN="your_token_here"
```
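With either option, you can confirm that your credentials are picked up before launching a run:

```bash
# Prints your username if authentication succeeded
uv run huggingface-cli whoami
```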
### What Gets Uploaded
The trainer will automatically:
- **Create a model card** with training details and sample outputs
- **Upload model weights**
- **Embed sample videos as GIFs** in the model card
- **Include training configuration and prompts**
## 📊 Weights & Biases Logging
Enable experiment tracking with W&B by adding to your configuration:
```yaml
wandb:
enabled: true
project: "ltx-2-trainer"
entity: null # Your W&B username or team
tags: [ "ltx2", "lora" ]
log_validation_videos: true
```
This will log:
- Training loss and learning rate
- Validation videos
- Model configuration
- Training progress
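W&B also needs an API key on the training machine. If you haven't authenticated yet, a typical setup (assuming `wandb` is installed in the project environment) is:

```bash
# Option 1: interactive login (stores the key locally)
uv run wandb login

# Option 2: environment variable (useful for remote machines and CI)
export WANDB_API_KEY="your_api_key_here"
```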
## 🚀 Next Steps
After training completes:
- **Run inference with your trained LoRA** - The [`ltx-pipelines`](../../ltx-pipelines/) package provides production-ready inference
pipelines that support loading custom LoRAs, including text-to-video, image-to-video, IC-LoRA video-to-video, and more;
see that package for usage details.
- **Test your model** with validation prompts
- **Iterate and improve** based on validation results
- **Share your results** by pushing to Hugging Face Hub
## 💡 Tips for Successful Training
- **Start small**: Begin with a small dataset and a few hundred steps to verify everything works
- **Monitor validation**: Keep an eye on validation samples to catch overfitting
- **Adjust learning rate**: Lower learning rates often produce better results
- **Use gradient checkpointing**: Essential for training with limited GPU memory
- **Save checkpoints**: Regular checkpoints help recover from interruptions
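Several of these tips map directly to training-config values. The sketch below is illustrative only; the key names are hypothetical, so check your actual config (e.g. `configs/ltx2_av_lora.yaml`) for the exact schema:

```yaml
# Hypothetical key names for illustration; verify against your config schema
optimizer:
  learning_rate: 1.0e-4      # try lowering this if results look unstable
training:
  max_steps: 500             # start small to verify the pipeline end-to-end
  gradient_checkpointing: true  # trades compute for a large memory saving
checkpointing:
  save_every_n_steps: 250    # regular checkpoints allow recovery from interruptions
```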
## Need Help?
If you encounter issues during training, see the [Troubleshooting Guide](troubleshooting.md).
Join our [Discord community](https://discord.gg/2mafsHjJ) for real-time help!