# Troubleshooting Guide

This guide covers common issues and solutions when training with the LTX-2 trainer.

## 🔧 VRAM and Memory Issues

Memory management is crucial for successful training with LTX-2.

### Memory Optimization Techniques

#### 1. Enable Gradient Checkpointing

Gradient checkpointing trades training speed for memory savings. **Highly recommended** for most training runs:

```yaml
optimization:
  enable_gradient_checkpointing: true
```
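For context, gradient checkpointing recomputes intermediate activations during the backward pass instead of storing them. A minimal PyTorch sketch of the general technique (illustrative only, not the trainer's internal implementation):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Stand-in for a stack of transformer blocks
blocks = torch.nn.ModuleList([torch.nn.Linear(512, 512) for _ in range(4)])

x = torch.randn(2, 512, requires_grad=True)
for block in blocks:
    # Activations inside `block` are not kept; they are recomputed
    # during backward, trading extra compute for lower memory.
    x = checkpoint(block, x, use_reentrant=False)
x.sum().backward()
```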
#### 2. Enable 8-bit Text Encoder

Load the Gemma text encoder in 8-bit precision to save GPU memory:

```yaml
acceleration:
  load_text_encoder_in_8bit: true
```
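For reference, loading a Hugging Face model in 8-bit precision generally looks like this with `transformers` and bitsandbytes (a sketch of the general technique, not necessarily how the trainer loads it; the path is a placeholder):

```python
from transformers import AutoModel, BitsAndBytesConfig

# Weights are stored as int8 via bitsandbytes, roughly halving memory vs. fp16
text_encoder = AutoModel.from_pretrained(
    "/path/to/gemma-model/",  # placeholder directory path
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
```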
#### 3. Reduce Batch Size

Lower the batch size if you encounter out-of-memory errors:

```yaml
optimization:
  batch_size: 1  # Start with 1 and increase gradually
```

Use gradient accumulation to maintain a larger effective batch size:

```yaml
optimization:
  batch_size: 1
  gradient_accumulation_steps: 4  # Effective batch size = 4
```
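Conceptually, gradient accumulation runs several micro-batches before each optimizer step, so gradients average over a larger effective batch. A minimal PyTorch sketch (illustrative; `model`, `optimizer`, `loss_fn`, and `dataloader` are assumed to exist):

```python
accum_steps = 4  # effective batch size = batch_size * accum_steps

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    loss = loss_fn(model(inputs), targets) / accum_steps  # scale so gradients average
    loss.backward()  # gradients accumulate in .grad across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # one update per accum_steps micro-batches
        optimizer.zero_grad()
```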
#### 4. Use Lower Resolution

Reduce spatial or temporal dimensions to save memory:

```bash
# Smaller spatial resolution
uv run python scripts/process_dataset.py dataset.json \
  --resolution-buckets "512x512x49" \
  --model-path /path/to/model.safetensors \
  --text-encoder-path /path/to/gemma

# Fewer frames
uv run python scripts/process_dataset.py dataset.json \
  --resolution-buckets "960x544x25" \
  --model-path /path/to/model.safetensors \
  --text-encoder-path /path/to/gemma
```
#### 5. Enable Model Quantization

Use quantization to reduce memory usage:

```yaml
acceleration:
  quantization: "int8-quanto"  # Options: int8-quanto, int4-quanto, fp8-quanto
```
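The `*-quanto` options refer to the Optimum Quanto library. For context, standalone int8 weight quantization with quanto looks roughly like this (a sketch of the library's API, not necessarily how the trainer wires it up; `transformer` is a stand-in for an already-loaded module):

```python
from optimum.quanto import freeze, qint8, quantize

quantize(transformer, weights=qint8)  # replace weights with int8 quantized tensors
freeze(transformer)                   # materialize the quantized weights
```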
#### 6. Use 8-bit Optimizer

The 8-bit AdamW optimizer uses less memory:

```yaml
optimization:
  optimizer_type: "adamw8bit"
```
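`adamw8bit` corresponds to the bitsandbytes 8-bit AdamW, which keeps optimizer state (momentum and variance) in 8-bit rather than fp32. Standalone usage looks like this (illustrative sketch; the learning rate is a placeholder):

```python
import bitsandbytes as bnb

# Optimizer states are stored in 8-bit, cutting optimizer memory roughly 4x vs. fp32
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)
```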
---

## ⚠️ Common Usage Issues

### Issue: "No module named 'ltx_trainer'" Error

**Solution:**

Ensure you've installed the dependencies and are using `uv run` to execute scripts:

```bash
# From the repository root
uv sync
cd packages/ltx-trainer
uv run python scripts/train.py configs/ltx2_av_lora.yaml
```

> [!TIP]
> Always use `uv run` to execute Python scripts. This automatically uses the correct virtual environment
> without requiring manual activation.
| ### Issue: "Gemma model path is not a directory" Error | |
| **Solution:** | |
| The `text_encoder_path` must point to a directory containing the Gemma model, not a file: | |
| ```yaml | |
| model: | |
| model_path: "/path/to/ltx-2-model.safetensors" # File path | |
| text_encoder_path: "/path/to/gemma-model/" # Directory path | |
| ``` | |
| ### Issue: "Model path does not exist" Error | |
| **Solution:** | |
| LTX-2 requires local model paths. URLs are not supported: | |
| ```yaml | |
| # ✅ Correct - local path | |
| model: | |
| model_path: "/path/to/ltx-2-model.safetensors" | |
| # ❌ Wrong - URL not supported | |
| model: | |
| model_path: "https://huggingface.co/..." | |
| ``` | |
| ### Issue: "Frames must satisfy frames % 8 == 1" Error | |
| **Solution:** | |
| LTX-2 requires the number of frames to satisfy `frames % 8 == 1`: | |
| - ✅ Valid: 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 121 | |
| - ❌ Invalid: 24, 32, 48, 64, 100 | |
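If your source clips have arbitrary lengths, you can round a frame count down to the nearest valid value. A small helper (a convenience sketch, not part of the trainer's CLI):

```python
def nearest_valid_frame_count(n: int) -> int:
    """Round down to the nearest frame count satisfying n % 8 == 1."""
    return max(1, ((n - 1) // 8) * 8 + 1)

assert nearest_valid_frame_count(100) == 97
assert nearest_valid_frame_count(49) == 49   # already valid
assert nearest_valid_frame_count(24) == 17
```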
### Issue: Slow Training Speed

**Optimizations:**

1. **Disable gradient checkpointing** (if you have enough VRAM):

   ```yaml
   optimization:
     enable_gradient_checkpointing: false
   ```

2. **Use torch.compile** via Accelerate:

   ```bash
   uv run accelerate launch --config_file configs/accelerate/ddp_compile.yaml \
     scripts/train.py configs/ltx2_av_lora.yaml
   ```
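For context, `torch.compile` JIT-compiles the model's forward pass into optimized kernels: the first few steps are slower while compilation runs, then steady-state throughput improves. Outside of Accelerate, plain PyTorch usage is a one-liner (illustrative; `model` is assumed to exist):

```python
import torch

model = torch.compile(model)  # subsequent forward/backward passes use compiled kernels
```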
### Issue: Poor Quality Validation Outputs

**Solutions:**

1. **Use image-to-video validation:**

   For more reliable validation, use image-to-video (first-frame conditioning) rather than pure text-to-video:

   ```yaml
   validation:
     prompts:
       - "a professional portrait video of a person"
     images:
       - "/path/to/first_frame.png"  # One image per prompt
   ```

2. **Increase inference steps:**

   ```yaml
   validation:
     inference_steps: 50  # Default is 30
   ```

3. **Adjust guidance settings:**

   ```yaml
   validation:
     guidance_scale: 3.0  # CFG scale (recommended: 3.0)
     stg_scale: 1.0       # STG scale for temporal coherence (recommended: 1.0)
     stg_blocks: [29]     # Transformer block to perturb
   ```
4. **Check caption quality:**

   Review and manually edit captions for accuracy if using auto-generated captions.
   LTX-2 prefers long, detailed captions that describe both visual content and audio (e.g., ambient sounds, speech, music).

5. **Check target modules:**

   Ensure your `target_modules` configuration matches your training goals. For audio-video training,
   use patterns that match both branches (e.g., `"to_k"` instead of `"attn1.to_k"`), as in the sketch below.
   See [Understanding Target Modules](configuration-reference.md#understanding-target-modules) for details.
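   For instance, a configuration like the following uses bare projection names so the pattern matches attention layers in both the video and audio branches (illustrative example; verify the exact module names against your model):

   ```yaml
   lora:
     # Illustrative - bare names match more modules than prefixed ones like "attn1.to_k"
     target_modules: ["to_k", "to_q", "to_v"]
   ```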
6. **Adjust LoRA rank:**

   Try higher values for more capacity:

   ```yaml
   lora:
     rank: 64  # Or 128 for more capacity
   ```

7. **Increase training steps:**

   ```yaml
   optimization:
     steps: 3000
   ```
---

## 🔍 Debugging Tools

### Monitor GPU Memory Usage

Track memory usage during training:

```bash
# Watch GPU memory in real-time
watch -n 1 nvidia-smi

# Log memory usage to file
nvidia-smi --query-gpu=memory.used,memory.total --format=csv --loop=5 > memory_log.csv
```
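You can also measure peak usage from inside a Python process with PyTorch's built-in counters, which is handy for comparing memory settings step by step (a sketch to drop into your own debugging code):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one training step here ...
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak allocated GPU memory: {peak_gib:.2f} GiB")
```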
### Verify Preprocessed Data

Decode latents to visualize the preprocessed videos:

```bash
uv run python scripts/decode_latents.py dataset/.precomputed/latents debug_output \
  --model-path /path/to/model.safetensors
```

To also decode audio latents, add the `--with-audio` flag:

```bash
uv run python scripts/decode_latents.py dataset/.precomputed/latents debug_output \
  --model-path /path/to/model.safetensors \
  --with-audio
```

Compare decoded videos and audio with originals to ensure quality.
---

## 💡 Best Practices

### Before Training

- [ ] Test preprocessing with a small subset first (see the sketch after this list)
- [ ] Verify all video files are accessible
- [ ] Check available GPU memory
- [ ] Review configuration against hardware capabilities
- [ ] Ensure model and text encoder paths are correct
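One way to build a small test subset, assuming your `dataset.json` is a JSON array of samples (adjust if your dataset file uses a different layout):

```python
import json

# Take the first 10 entries for a quick preprocessing smoke test
with open("dataset.json") as f:
    samples = json.load(f)

with open("dataset_subset.json", "w") as f:
    json.dump(samples[:10], f, indent=2)
```

Then run `scripts/process_dataset.py` against `dataset_subset.json` before committing to the full run.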
### During Training

- [ ] Monitor GPU memory usage
- [ ] Check loss convergence regularly
- [ ] Review validation samples periodically
- [ ] Save checkpoints frequently

### After Training

- [ ] Test trained model with diverse prompts
- [ ] Document training parameters and results
- [ ] Archive training data and configs

## 🆘 Getting Help

If you're still experiencing issues:

1. **Check logs:** Review console output for error details
2. **Search issues:** Look through GitHub issues for similar problems
3. **Provide details:** When reporting issues, include:
   - Hardware specifications (GPU model, VRAM)
   - Configuration file used
   - Complete error message
   - Steps to reproduce the issue

---

## 🤝 Join the Community

Have questions, want to share your results, or need real-time help?
Join our [community Discord server](https://discord.gg/2mafsHjJ) to connect with other users and the development team!

- Get troubleshooting help
- Share your training results and workflows
- Stay up to date with announcements and updates

We look forward to seeing you there!