ltx-2 / packages /ltx-trainer /docs /training-guide.md
linoy
inital commit
ebfc6b3
# Training Guide
This guide covers how to run training jobs, from basic single-GPU training to advanced distributed setups and automatic
model uploads.
## ⚑ Basic Training (Single GPU)
After preprocessing your dataset and preparing a configuration file, you can start training using the trainer script:
```bash
uv run python scripts/train.py configs/ltx2_av_lora.yaml
```
The trainer will:
1. **Load your configuration** and validate all parameters
2. **Initialize models** and apply optimizations
3. **Run the training loop** with progress tracking
4. **Generate validation videos** (if configured)
5. **Save the trained weights** in your output directory
### Output Files
**For LoRA training:**
- `lora_weights.safetensors` - Main LoRA weights file
- `training_config.yaml` - Copy of training configuration
- `validation_samples/` - Generated validation videos (if enabled)
**For full model fine-tuning:**
- `model_weights.safetensors` - Full model weights
- `training_config.yaml` - Copy of training configuration
- `validation_samples/` - Generated validation videos (if enabled)
## πŸ–₯️ Distributed / Multi-GPU Training
We use Hugging Face πŸ€— [Accelerate](https://huggingface.co/docs/accelerate/index) for multi-GPU DDP and FSDP.
### Configure Accelerate
Run the interactive wizard once to set up your environment (DDP / FSDP, GPU count, etc.):
```bash
uv run accelerate config
```
This stores your preferences in `~/.cache/huggingface/accelerate/default_config.yaml`.
### Use the Provided Accelerate Configs (Recommended)
We include ready-to-use Accelerate config files in `configs/accelerate/`:
- [ddp.yaml](../configs/accelerate/ddp.yaml) β€” Standard DDP
- [ddp_compile.yaml](../configs/accelerate/ddp_compile.yaml) β€” DDP with `torch.compile` (Inductor)
- [fsdp.yaml](../configs/accelerate/fsdp.yaml) β€” Standard FSDP (auto-wraps `BasicAVTransformerBlock`)
- [fsdp_compile.yaml](../configs/accelerate/fsdp_compile.yaml) β€” FSDP with `torch.compile` (Inductor)
Launch with a specific config using `--config_file`:
```bash
# DDP (2 GPUs shown as example)
CUDA_VISIBLE_DEVICES=0,1 \
uv run accelerate launch --config_file configs/accelerate/ddp.yaml \
scripts/train.py configs/ltx2_av_lora.yaml
# DDP + torch.compile
CUDA_VISIBLE_DEVICES=0,1 \
uv run accelerate launch --config_file configs/accelerate/ddp_compile.yaml \
scripts/train.py configs/ltx2_av_lora.yaml
# FSDP (4 GPUs shown as example)
CUDA_VISIBLE_DEVICES=0,1,2,3 \
uv run accelerate launch --config_file configs/accelerate/fsdp.yaml \
scripts/train.py configs/ltx2_av_lora.yaml
# FSDP + torch.compile
CUDA_VISIBLE_DEVICES=0,1,2,3 \
uv run accelerate launch --config_file configs/accelerate/fsdp_compile.yaml \
scripts/train.py configs/ltx2_av_lora.yaml
```
**Notes:**
- The number of processes is taken from the Accelerate config (`num_processes`). Override with `--num_processes X` or
restrict GPUs with `CUDA_VISIBLE_DEVICES`.
- The compile variants enable `torch.compile` with the Inductor backend via Accelerate's `dynamo_config`.
- FSDP configs auto-wrap the transformer blocks (`fsdp_transformer_layer_cls_to_wrap: BasicAVTransformerBlock`).
### Launch with Your Default Accelerate Config
If you prefer to use your default Accelerate profile:
```bash
# Use settings from your default accelerate config
uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml
# Override number of processes on the fly (e.g., 2 GPUs)
uv run accelerate launch --num_processes 2 scripts/train.py configs/ltx2_av_lora.yaml
# Select specific GPUs
CUDA_VISIBLE_DEVICES=0,1 uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml
```
> [!TIP]
> You can disable the in-terminal progress bars with `--disable_progress_bars` flag in the trainer CLI if desired.
### Benefits of Distributed Training
- **Faster training**: Distribute workload across multiple GPUs
- **Larger effective batch sizes**: Combine gradients from multiple GPUs
- **Memory efficiency**: Each GPU handles a portion of the batch
> [!NOTE]
> Distributed training requires that all GPUs have sufficient memory for the model and batch size. The effective batch
> size becomes `batch_size Γ— num_processes`.
## πŸ€— Pushing Models to Hugging Face Hub
You can automatically push your trained models to the Hugging Face Hub by adding the following to your configuration:
```yaml
hub:
push_to_hub: true
hub_model_id: "your-username/your-model-name"
```
### Prerequisites
Before pushing, make sure you:
1. **Have a Hugging Face account** - Sign up at [huggingface.co](https://huggingface.co)
2. **Are logged in** via `huggingface-cli login` or have set the `HUGGING_FACE_HUB_TOKEN` environment variable
3. **Have write access** to the specified repository (it will be created if it doesn't exist)
### Login Options
**Option 1: Interactive login**
```bash
uv run huggingface-cli login
```
**Option 2: Environment variable**
```bash
export HUGGING_FACE_HUB_TOKEN="your_token_here"
```
### What Gets Uploaded
The trainer will automatically:
- **Create a model card** with training details and sample outputs
- **Upload model weights**
- **Push sample videos as GIFs** in the model card
- **Include training configuration and prompts**
## πŸ“Š Weights & Biases Logging
Enable experiment tracking with W&B by adding to your configuration:
```yaml
wandb:
enabled: true
project: "ltx-2-trainer"
entity: null # Your W&B username or team
tags: [ "ltx2", "lora" ]
log_validation_videos: true
```
This will log:
- Training loss and learning rate
- Validation videos
- Model configuration
- Training progress
## πŸš€ Next Steps
After training completes:
- **Run inference with your trained LoRA** - The [`ltx-pipelines`](../../ltx-pipelines/) package provides production-ready inference
pipelines that support loading custom LoRAs. Available pipelines include text-to-video, image-to-video,
IC-LoRA video-to-video, and more. See the [`ltx-pipelines`](../../ltx-pipelines/) package for usage details.
- **Test your model** with validation prompts
- **Iterate and improve** based on validation results
- **Share your results** by pushing to Hugging Face Hub
## πŸ’‘ Tips for Successful Training
- **Start small**: Begin with a small dataset and a few hundred steps to verify everything works
- **Monitor validation**: Keep an eye on validation samples to catch overfitting
- **Adjust learning rate**: Lower learning rates often produce better results
- **Use gradient checkpointing**: Essential for training with limited GPU memory
- **Save checkpoints**: Regular checkpoints help recover from interruptions
## Need Help?
If you encounter issues during training, see the [Troubleshooting Guide](troubleshooting.md).
Join our [Discord community](https://discord.gg/2mafsHjJ) for real-time help!