# Troubleshooting Guide
This guide covers common issues and solutions when training with the LTX-2 trainer.
## 🔧 VRAM and Memory Issues
Memory management is crucial for successful training with LTX-2.
### Memory Optimization Techniques
#### 1. Enable Gradient Checkpointing
Gradient checkpointing trades training speed for memory savings. **Highly recommended** for most training runs:
```yaml
optimization:
  enable_gradient_checkpointing: true
```
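For intuition, here is a minimal PyTorch sketch of the mechanism (an illustration only, not the trainer's internals):

```python
# Minimal illustration of gradient checkpointing (not the trainer's code).
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
x = torch.randn(8, 512, requires_grad=True)

# Activations inside `block` are not stored during the forward pass; they
# are recomputed during backward, lowering peak memory at the cost of compute.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```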
#### 2. Enable 8-bit Text Encoder
Load the Gemma text encoder in 8-bit precision to save GPU memory:
```yaml
acceleration:
  load_text_encoder_in_8bit: true
```
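Under the hood this is standard 8-bit weight loading. A rough sketch of the equivalent `transformers` call (assuming a bitsandbytes-backed setup; the trainer's exact wiring may differ):

```python
# Hypothetical sketch of 8-bit loading via transformers + bitsandbytes;
# the trainer handles this for you when the flag above is set.
from transformers import AutoModel, BitsAndBytesConfig

encoder = AutoModel.from_pretrained(
    "/path/to/gemma-model/",  # same directory as text_encoder_path
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
```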
#### 3. Reduce Batch Size
Lower the batch size if you encounter out-of-memory errors:
```yaml
optimization:
  batch_size: 1 # Start with 1 and increase gradually
```
Use gradient accumulation to maintain a larger effective batch size:
```yaml
optimization:
  batch_size: 1
  gradient_accumulation_steps: 4 # Effective batch size = 4
```
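What this setting does, shown as a self-contained sketch (illustrative only, not the trainer's loop): gradients from several micro-batches are summed before a single optimizer step, so the update approximates one larger batch.

```python
# Illustrative gradient-accumulation loop (not the trainer's actual code).
import torch

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
accumulation_steps = 4  # effective batch = batch_size * accumulation_steps

optimizer.zero_grad()
for step in range(8):
    batch = torch.randn(1, 16)  # micro-batch of size 1
    loss = model(batch).pow(2).mean() / accumulation_steps  # scale so the sum matches one large batch
    loss.backward()  # gradients accumulate in .grad across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()  # one update per accumulation_steps micro-batches
        optimizer.zero_grad()
```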
#### 4. Use Lower Resolution
Reduce spatial or temporal dimensions to save memory:
```bash
# Smaller spatial resolution
uv run python scripts/process_dataset.py dataset.json \
  --resolution-buckets "512x512x49" \
  --model-path /path/to/model.safetensors \
  --text-encoder-path /path/to/gemma

# Fewer frames
uv run python scripts/process_dataset.py dataset.json \
  --resolution-buckets "960x544x25" \
  --model-path /path/to/model.safetensors \
  --text-encoder-path /path/to/gemma
```
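Memory pressure scales roughly with the total pixel-frame volume of a bucket. A back-of-envelope comparison (illustrative arithmetic, assuming a hypothetical starting bucket of 960x544x49):

```python
# Back-of-envelope: activation cost scales roughly with width*height*frames.
def relative_cost(w, h, f, base=(960, 544, 49)):
    bw, bh, bf = base
    return (w * h * f) / (bw * bh * bf)

print(f"512x512x49 is ~{relative_cost(512, 512, 49):.2f}x the base bucket")  # ~0.50x
print(f"960x544x25 is ~{relative_cost(960, 544, 25):.2f}x the base bucket")  # ~0.51x
```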
#### 5. Enable Model Quantization
Use quantization to reduce memory usage:
```yaml
acceleration:
  quantization: "int8-quanto" # Options: int8-quanto, int4-quanto, fp8-quanto
```
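For reference, the quanto path boils down to something like the following (a sketch assuming the trainer delegates to optimum-quanto; the actual integration may differ):

```python
# Hypothetical sketch of int8 weight quantization with optimum-quanto;
# the trainer applies this automatically when `quantization` is set.
import torch
from optimum.quanto import quantize, freeze, qint8

model = torch.nn.Linear(64, 64)  # stand-in for the real transformer
quantize(model, weights=qint8)   # replace fp weights with int8 + scales
freeze(model)                    # materialize the quantized weights
```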
#### 6. Use 8-bit Optimizer
The 8-bit AdamW optimizer keeps its optimizer state in 8-bit precision, using substantially less memory than standard AdamW:
```yaml
optimization:
  optimizer_type: "adamw8bit"
```
---
## ⚠️ Common Usage Issues
### Issue: "No module named 'ltx_trainer'" Error
**Solution:**
Ensure you've installed the dependencies and are using `uv run` to execute scripts:
```bash
# From the repository root
uv sync
cd packages/ltx-trainer
uv run python scripts/train.py configs/ltx2_av_lora.yaml
```
> [!TIP]
> Always use `uv run` to execute Python scripts. This automatically uses the correct virtual environment
> without requiring manual activation.
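A quick way to confirm the package resolves inside the uv-managed environment (a simple check, assuming the importable package is named `ltx_trainer` as the error message suggests):

```python
# Run as: uv run python check_install.py
import ltx_trainer

print("ltx_trainer found at:", ltx_trainer.__file__)
```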
### Issue: "Gemma model path is not a directory" Error
**Solution:**
The `text_encoder_path` must point to a directory containing the Gemma model, not a file:
```yaml
model:
  model_path: "/path/to/ltx-2-model.safetensors" # File path
  text_encoder_path: "/path/to/gemma-model/" # Directory path
```
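To catch this before launching a run, a quick sanity check (a hypothetical standalone snippet, not part of the trainer):

```python
# Verify both paths have the expected types before training starts.
from pathlib import Path

assert Path("/path/to/ltx-2-model.safetensors").is_file(), "model_path must be a file"
assert Path("/path/to/gemma-model/").is_dir(), "text_encoder_path must be a directory"
```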
### Issue: "Model path does not exist" Error
**Solution:**
LTX-2 requires local model paths. URLs are not supported:
```yaml
# ✅ Correct - local path
model:
  model_path: "/path/to/ltx-2-model.safetensors"
# ❌ Wrong - URL not supported
model:
  model_path: "https://huggingface.co/..."
```
### Issue: "Frames must satisfy frames % 8 == 1" Error
**Solution:**
LTX-2 requires the number of frames to satisfy `frames % 8 == 1`:
- ✅ Valid: 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 121
- ❌ Invalid: 24, 32, 48, 64, 100
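If your source clips have arbitrary lengths, you can snap a frame count down to the nearest valid value (a small helper sketch, not part of the trainer's CLI):

```python
# Snap an arbitrary frame count down to the nearest value with n % 8 == 1.
def nearest_valid_frames(n: int) -> int:
    return max(1, ((n - 1) // 8) * 8 + 1)

for n in (24, 32, 48, 64, 100):
    print(n, "->", nearest_valid_frames(n))  # 17, 25, 41, 57, 97
```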
### Issue: Slow Training Speed
**Optimizations:**
1. **Disable gradient checkpointing** (if you have enough VRAM):
```yaml
optimization:
  enable_gradient_checkpointing: false
```
2. **Use `torch.compile`** via Accelerate:
```bash
uv run accelerate launch --config_file configs/accelerate/ddp_compile.yaml \
  scripts/train.py configs/ltx2_av_lora.yaml
```
### Issue: Poor Quality Validation Outputs
**Solutions:**
1. **Use Image-to-Video Validation:**
For more reliable validation, use image-to-video (first-frame conditioning) rather than pure text-to-video:
```yaml
validation:
  prompts:
    - "a professional portrait video of a person"
  images:
    - "/path/to/first_frame.png" # One image per prompt
```
2. **Increase inference steps:**
```yaml
validation:
  inference_steps: 50 # Default is 30
```
3. **Adjust guidance settings:**
```yaml
validation:
  guidance_scale: 3.0 # CFG scale (recommended: 3.0)
  stg_scale: 1.0 # STG scale for temporal coherence (recommended: 1.0)
  stg_blocks: [29] # Transformer block to perturb
```
4. **Check caption quality:**
Review and manually edit captions for accuracy if using auto-generated captions.
LTX-2 prefers long, detailed captions that describe both visual content and audio (e.g., ambient sounds, speech, music).
5. **Check target modules:**
Ensure your `target_modules` configuration matches your training goals. For audio-video training,
use patterns that match both branches (e.g., `"to_k"` instead of `"attn1.to_k"`).
See [Understanding Target Modules](configuration-reference.md#understanding-target-modules) for details.
6. **Adjust LoRA rank:**
Try a higher rank to give the adapter more capacity (see the sizing sketch after this list):
```yaml
lora:
  rank: 64 # Or 128 for more capacity
```
7. **Increase training steps:**
```yaml
optimization:
  steps: 3000
```
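As a rough sizing intuition for the rank setting above (back-of-envelope arithmetic with hypothetical layer dimensions, not the trainer's accounting): each LoRA-adapted weight of shape `d_out x d_in` adds `rank * (d_in + d_out)` trainable parameters, so doubling the rank doubles adapter capacity and checkpoint size.

```python
# Back-of-envelope LoRA parameter count for one adapted projection.
d_in, d_out = 4096, 4096  # hypothetical attention projection dims

for rank in (32, 64, 128):
    lora_params = rank * (d_in + d_out)  # A: rank x d_in, B: d_out x rank
    print(f"rank {rank:>3}: {lora_params:,} params per adapted matrix")
```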
---
## 🔍 Debugging Tools
### Monitor GPU Memory Usage
Track memory usage during training:
```bash
# Watch GPU memory in real-time
watch -n 1 nvidia-smi
# Log memory usage to file
nvidia-smi --query-gpu=memory.used,memory.total --format=csv --loop=5 > memory_log.csv
```
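For measurements from inside the training process rather than from `nvidia-smi`, PyTorch exposes its CUDA allocator statistics directly (a minimal sketch you could drop into your own debugging code):

```python
# Query PyTorch's CUDA allocator from inside a Python process.
import torch

if torch.cuda.is_available():
    used = torch.cuda.memory_allocated() / 1024**3      # tensors currently allocated
    peak = torch.cuda.max_memory_allocated() / 1024**3  # high-water mark since start
    print(f"allocated: {used:.2f} GiB, peak: {peak:.2f} GiB")
```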
### Verify Preprocessed Data
Decode latents to visualize the preprocessed videos:
```bash
uv run python scripts/decode_latents.py dataset/.precomputed/latents debug_output \
  --model-path /path/to/model.safetensors
```
To also decode audio latents, add the `--with-audio` flag:
```bash
uv run python scripts/decode_latents.py dataset/.precomputed/latents debug_output \
  --model-path /path/to/model.safetensors \
  --with-audio
```
Compare decoded videos and audio with originals to ensure quality.
---
## 💡 Best Practices
### Before Training
- [ ] Test preprocessing with a small subset first
- [ ] Verify all video files are accessible
- [ ] Check available GPU memory
- [ ] Review configuration against hardware capabilities
- [ ] Ensure model and text encoder paths are correct
### During Training
- [ ] Monitor GPU memory usage
- [ ] Check loss convergence regularly
- [ ] Review validation samples periodically
- [ ] Save checkpoints frequently
### After Training
- [ ] Test trained model with diverse prompts
- [ ] Document training parameters and results
- [ ] Archive training data and configs
## 🆘 Getting Help
If you're still experiencing issues:
1. **Check logs:** Review console output for error details
2. **Search issues:** Look through GitHub issues for similar problems
3. **Provide details:** When reporting issues, include:
   - Hardware specifications (GPU model, VRAM)
   - Configuration file used
   - Complete error message
   - Steps to reproduce the issue
---
## 🤝 Join the Community
Have questions, want to share your results, or need real-time help?
Join our [community Discord server](https://discord.gg/2mafsHjJ) to connect with other users and the development team!
- Get troubleshooting help
- Share your training results and workflows
- Stay up to date with announcements and updates
We look forward to seeing you there!