Fabrice-TIERCELIN committed on
Commit 8788929 · verified · 1 Parent(s): f267cb0

Upload 4 files
packages/ltx-trainer/AGENTS.md ADDED
@@ -0,0 +1,352 @@
# AGENTS.md

This file provides guidance to AI coding assistants (Claude, Cursor, etc.) when working with code in this repository.

## Project Overview

**LTX-2 Trainer** is a training toolkit for fine-tuning the Lightricks LTX-2 audio-video generation model. It supports:

- **LoRA training** - Efficient fine-tuning with adapters
- **Full fine-tuning** - Complete model training
- **Audio-video training** - Joint audio and video generation
- **IC-LoRA training** - In-context control adapters for video-to-video transformations

**Key Dependencies:**

- **[`ltx-core`](../ltx-core/)** - Core model implementations (transformer, VAE, text encoder)
- **[`ltx-pipelines`](../ltx-pipelines/)** - Inference pipeline components

> **Important:** This trainer only supports **LTX-2** (the audio-video model). The older LTXV models are not supported.

## Architecture Overview

### Package Structure

```
packages/ltx-trainer/
├── src/ltx_trainer/              # Main training module
│   ├── config.py                 # Pydantic configuration models
│   ├── trainer.py                # Main training orchestration with Accelerate
│   ├── model_loader.py           # Model loading using ltx-core
│   ├── validation_sampler.py     # Inference for validation samples
│   ├── datasets.py               # PrecomputedDataset for latent-based training
│   ├── training_strategies/      # Strategy pattern for different training modes
│   │   ├── __init__.py           # Factory function: get_training_strategy()
│   │   ├── base_strategy.py      # TrainingStrategy ABC, ModelInputs, TrainingStrategyConfigBase
│   │   ├── text_to_video.py      # TextToVideoStrategy, TextToVideoConfig
│   │   └── video_to_video.py     # VideoToVideoStrategy, VideoToVideoConfig
│   ├── timestep_samplers.py      # Flow matching timestep sampling
│   ├── captioning.py             # Video captioning utilities
│   ├── video_utils.py            # Video processing utilities
│   └── hf_hub_utils.py           # HuggingFace Hub integration
├── scripts/                      # User-facing CLI tools
│   ├── train.py                  # Main training script
│   ├── process_dataset.py        # Dataset preprocessing
│   ├── process_videos.py         # Video latent encoding
│   ├── process_captions.py       # Text embedding computation
│   ├── caption_videos.py         # Automatic video captioning
│   ├── decode_latents.py         # Latent decoding for debugging
│   ├── inference.py              # Inference with trained models
│   ├── compute_reference.py      # Generate IC-LoRA reference videos
│   └── split_scenes.py           # Scene detection and splitting
├── configs/                      # Example training configurations
│   ├── ltx2_av_lora.yaml         # Audio-video LoRA training
│   ├── ltx2_v2v_ic_lora.yaml     # IC-LoRA video-to-video
│   └── accelerate/               # Accelerate configs for distributed training
└── docs/                         # Documentation
```
### Key Architectural Patterns

**Model Loading:**

- `ltx_trainer.model_loader` provides component loaders built on `ltx-core`
- Individual loaders: `load_transformer()`, `load_video_vae_encoder()`, `load_video_vae_decoder()`, `load_text_encoder()`, etc.
- Combined loader: `load_model()` returns an `LtxModelComponents` dataclass
- Uses `SingleGPUModelBuilder` from ltx-core internally
**Training Flow:**

1. Configuration is loaded via Pydantic models in `config.py`
2. The `Trainer` class orchestrates the training loop
3. Training strategies (`TextToVideoStrategy`, `VideoToVideoStrategy`) prepare inputs and compute the loss
4. Accelerate handles distributed training and device placement
5. Data flows as precomputed latents through `PrecomputedDataset`

**Model Interface (Modality-based):**

```python
from ltx_core.model.transformer.modality import Modality

# Create modality objects for video and audio
video = Modality(
    enabled=True,
    latent=video_latents,        # [B, seq_len, 128]
    timesteps=video_timesteps,   # [B, seq_len] per-token
    positions=video_positions,   # [B, 3, seq_len, 2]
    context=video_embeds,
    context_mask=None,
)
audio = Modality(
    enabled=True,
    latent=audio_latents,
    timesteps=audio_timesteps,
    positions=audio_positions,   # [B, 1, seq_len, 2]
    context=audio_embeds,
    context_mask=None,
)

# Forward pass returns predictions for both modalities
video_pred, audio_pred = model(video=video, audio=audio, perturbations=None)
```
> **Note:** `Modality` is immutable (a frozen dataclass). Use `dataclasses.replace()` to modify it.

**Configuration System:**

- All configuration lives in `src/ltx_trainer/config.py`
- Main class: `LtxTrainerConfig`
- Training strategy configs: `TextToVideoConfig`, `VideoToVideoConfig`
- Uses Pydantic field validators and model validators
- Config files live in the `configs/` directory
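The strictness of this Pydantic-based setup is worth seeing once. The sketch below uses an invented `MiniTrainerConfig` stand-in (its fields are illustrative assumptions, not the real `LtxTrainerConfig` schema) to show how `extra="forbid"` rejects unknown keys while validators and defaults still apply:

```python
# Illustrative sketch only: MiniTrainerConfig is a stand-in, NOT the real
# LtxTrainerConfig schema from src/ltx_trainer/config.py.
from pydantic import BaseModel, ConfigDict, ValidationError


class MiniTrainerConfig(BaseModel):
    model_config = ConfigDict(extra="forbid")  # unknown keys raise ValidationError

    model_path: str
    learning_rate: float = 1e-4  # defaults fill in missing keys


# Missing optional keys fall back to their defaults
cfg = MiniTrainerConfig.model_validate({"model_path": "/models/ltx2.safetensors"})
print(cfg.learning_rate)

# A typo'd key is rejected outright instead of being silently ignored
try:
    MiniTrainerConfig.model_validate({"model_path": "x", "learning_ratee": 1e-3})
except ValidationError:
    print("unknown field rejected")
```

This "fail loudly on unknown fields" behavior is what produces the validation errors mentioned in the Debugging Tips section when a YAML config contains a misspelled key.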
## Development Commands

### Setup and Installation

```bash
# From the repository root
uv sync
cd packages/ltx-trainer
```

### Code Quality

```bash
# Run ruff linting and formatting
uv run ruff check .
uv run ruff format .

# Run pre-commit checks
uv run pre-commit run --all-files
```

### Running Tests

```bash
cd packages/ltx-trainer
uv run pytest
```

### Running Training

```bash
# Single GPU
uv run python scripts/train.py configs/ltx2_av_lora.yaml

# Multi-GPU with Accelerate
uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml
```
## Code Standards

### Type Hints

- **Always use type hints** for all function arguments and return values
- Use Python 3.10+ syntax: `list[str]` not `List[str]`, `str | Path` not `Union[str, Path]`
- Use `pathlib.Path` for file operations

### Class Methods

- Mark methods as `@staticmethod` if they don't access instance or class state
- Use `@classmethod` for alternative constructors

### AI/ML Specific

- Use `@torch.inference_mode()` for inference (preferred over `@torch.no_grad()`)
- Use `accelerator.device` for distributed compatibility
- Support mixed precision (`bfloat16` via dtype parameters)
- Use gradient checkpointing for memory-intensive training

### Logging

- Use `from ltx_trainer import logger` for all messages
- Avoid print statements in production code
## Important Files & Modules

### Configuration (CRITICAL)

**`src/ltx_trainer/config.py`** - Master config definitions

Key classes:

- `LtxTrainerConfig` - Main configuration container
- `ModelConfig` - Model paths and training mode
- `TrainingStrategyConfig` - Union of `TextToVideoConfig` | `VideoToVideoConfig`
- `LoraConfig` - LoRA hyperparameters
- `OptimizationConfig` - Learning rate, batch size, etc.
- `ValidationConfig` - Validation settings
- `WandbConfig` - W&B logging settings

**⚠️ When modifying config.py:**

1. Update ALL config files in `configs/`
2. Update `docs/configuration-reference.md`
3. Test that all configs remain valid
### Training Core

**`src/ltx_trainer/trainer.py`** - Main training loop

- Implements distributed training with Accelerate
- Handles mixed precision, gradient accumulation, and checkpointing
- Delegates mode-specific logic to training strategies

**`src/ltx_trainer/training_strategies/`** - Strategy pattern

- `base_strategy.py`: `TrainingStrategy` ABC, `ModelInputs` dataclass
- `text_to_video.py`: Standard text-to-video (with optional audio)
- `video_to_video.py`: IC-LoRA video-to-video transformations

Key methods each strategy implements:

- `get_data_sources()` - Required data directories
- `prepare_training_inputs()` - Convert a batch to `ModelInputs`
- `compute_loss()` - Calculate the training loss
- `requires_audio` property - Whether audio components are needed
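The contract above can be sketched as a toy strategy. This is a schematic mirror of the interface, not the real base class: the actual `TrainingStrategy` and `ModelInputs` live in `base_strategy.py` and operate on torch tensors, whereas the placeholder types and the mean-squared-error loss here are invented for illustration:

```python
# Schematic mirror of the TrainingStrategy contract. The real ABC is in
# src/ltx_trainer/training_strategies/base_strategy.py; names prefixed
# "Toy" are placeholders invented for this sketch.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ToyModelInputs:  # stand-in for the real ModelInputs dataclass
    latents: list[float]
    targets: list[float]


class ToyStrategy(ABC):
    @abstractmethod
    def get_data_sources(self) -> list[str]:
        """Names of the precomputed data directories this strategy needs."""

    @abstractmethod
    def prepare_training_inputs(self, batch: dict) -> ToyModelInputs:
        """Convert a raw batch into model inputs."""

    @abstractmethod
    def compute_loss(self, inputs: ToyModelInputs, preds: list[float]) -> float:
        """Compute the training loss from model predictions."""

    @property
    def requires_audio(self) -> bool:
        return False  # strategies override when audio components are needed


class ToyTextToVideo(ToyStrategy):
    def get_data_sources(self) -> list[str]:
        return ["latents", "conditions"]

    def prepare_training_inputs(self, batch: dict) -> ToyModelInputs:
        return ToyModelInputs(latents=batch["latents"], targets=batch["targets"])

    def compute_loss(self, inputs: ToyModelInputs, preds: list[float]) -> float:
        # mean squared error as a stand-in for the real flow-matching loss
        return sum((p - t) ** 2 for p, t in zip(preds, inputs.targets)) / len(preds)
```

The trainer only ever talks to a strategy through these methods, which is what keeps mode-specific logic out of `trainer.py`.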
**`src/ltx_trainer/model_loader.py`** - Model loading

Component loaders:

- `load_transformer()` → `LTXModel`
- `load_video_vae_encoder()` → `VideoVAEEncoder`
- `load_video_vae_decoder()` → `VideoVAEDecoder`
- `load_audio_vae_decoder()` → `AudioVAEDecoder`
- `load_vocoder()` → `Vocoder`
- `load_text_encoder()` → `AVGemmaTextEncoderModel`
- `load_model()` → `LtxModelComponents` (convenience wrapper)

**`src/ltx_trainer/validation_sampler.py`** - Inference for validation

Uses ltx-core components for denoising:

- `LTX2Scheduler` for sigma scheduling
- `EulerDiffusionStep` for diffusion steps
- `CFGGuider` for classifier-free guidance

### Data

**`src/ltx_trainer/datasets.py`** - Dataset handling

- `PrecomputedDataset` loads pre-computed VAE latents
- Supports video latents, audio latents, text embeddings, and reference latents
## Common Development Tasks

### Adding a New Configuration Parameter

1. Add the field to the appropriate config class in `src/ltx_trainer/config.py`
2. Add a validator if needed
3. Update ALL config files in `configs/`
4. Update `docs/configuration-reference.md`
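Steps 1-2 might look like the following. Both the `noise_scale` field and the `ExampleOptimizationConfig` class are hypothetical, invented purely to illustrate the field-plus-validator pattern; they are not part of the real config schema:

```python
# Hypothetical illustration of adding a validated field. "noise_scale" and
# ExampleOptimizationConfig are invented for this sketch and do NOT exist
# in src/ltx_trainer/config.py.
from pydantic import BaseModel, field_validator


class ExampleOptimizationConfig(BaseModel):
    learning_rate: float = 1e-4
    noise_scale: float = 1.0  # step 1: the new (hypothetical) field

    # step 2: a validator constraining the new field
    @field_validator("noise_scale")
    @classmethod
    def _check_noise_scale(cls, v: float) -> float:
        if not 0.0 < v <= 2.0:
            raise ValueError("noise_scale must be in (0, 2]")
        return v
```

Steps 3-4 are then a matter of adding the new key to every YAML in `configs/` and documenting it in `docs/configuration-reference.md`.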
### Implementing a New Training Strategy

1. Create a new file in `src/ltx_trainer/training_strategies/`
2. Create a config class inheriting `TrainingStrategyConfigBase`
3. Create a strategy class inheriting `TrainingStrategy`
4. Implement `get_data_sources()`, `prepare_training_inputs()`, and `compute_loss()`
5. Add it to `__init__.py`: import it, add it to the `TrainingStrategyConfig` union, and update the factory
6. Add a discriminator tag to `TrainingStrategyConfig` in `config.py`
7. Create an example config file in `configs/`
### Working with Modalities

```python
from dataclasses import replace

from ltx_core.model.transformer.modality import Modality

# Create a modality
video = Modality(
    enabled=True,
    latent=latents,
    timesteps=timesteps,
    positions=positions,
    context=context,
    context_mask=None,
)

# Update (immutable - must use replace)
video = replace(video, latent=new_latent, timesteps=new_timesteps)

# Disable a modality
audio = replace(audio, enabled=False)
```
## Debugging Tips

**Training Issues:**

- Check the logs first (the rich logger provides context)
- GPU memory: look for OOM errors; enable `enable_gradient_checkpointing: true`
- Distributed training: check `accelerator.state` and device placement

**Model Loading:**

- Ensure `model_path` points to a local `.safetensors` file
- Ensure `text_encoder_path` points to a Gemma model directory
- URLs are NOT supported for model paths

**Configuration:**

- Validation errors: check the validators in `config.py`
- Unknown fields: the config uses `extra="forbid"`, so every field must be defined
- Strategy validation: IC-LoRA requires `reference_videos` in the validation config
## Key Constraints

### LTX-2 Frame Requirements

Frame counts must satisfy `frames % 8 == 1`:

- ✅ Valid: 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 121
- ❌ Invalid: 24, 32, 48, 64, 100

### Resolution Requirements

Width and height must each be divisible by 32.
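Both constraints are cheap to check before launching a long preprocessing or training run. The helper names below are illustrative, not part of the trainer's API:

```python
# Illustrative helpers for the two constraints above; these functions are
# not part of the trainer's public API.
def is_valid_frame_count(frames: int) -> bool:
    """LTX-2 requires frames % 8 == 1 (e.g. 1, 9, 17, 25, ...)."""
    return frames > 0 and frames % 8 == 1


def is_valid_resolution(width: int, height: int) -> bool:
    """Width and height must each be divisible by 32."""
    return width % 32 == 0 and height % 32 == 0


print(is_valid_frame_count(121))      # True  (121 = 15 * 8 + 1)
print(is_valid_frame_count(100))      # False (100 % 8 == 4)
print(is_valid_resolution(768, 512))  # True  (both divisible by 32)
```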
### Model Paths

- Must be local paths (URLs are not supported)
- `model_path`: path to a `.safetensors` checkpoint
- `text_encoder_path`: path to a Gemma model directory

### Platform Requirements

- Linux is required (the toolkit depends on `triton`, which is Linux-only)
- A CUDA GPU with 24GB+ VRAM is recommended

## Reference: ltx-core Key Components

```
packages/ltx-core/src/ltx_core/
├── model/
│   ├── transformer/
│   │   ├── model.py                # LTXModel
│   │   ├── modality.py             # Modality dataclass
│   │   └── transformer.py          # BasicAVTransformerBlock
│   ├── video_vae/
│   │   └── video_vae.py            # Encoder, Decoder
│   ├── audio_vae/
│   │   ├── audio_vae.py            # Decoder
│   │   └── vocoder.py              # Vocoder
│   └── clip/gemma/
│       └── encoders/av_encoder.py  # AVGemmaTextEncoderModel
├── pipeline/
│   ├── components/
│   │   ├── schedulers.py           # LTX2Scheduler
│   │   ├── diffusion_steps.py      # EulerDiffusionStep
│   │   ├── guiders.py              # CFGGuider
│   │   └── patchifiers.py          # VideoLatentPatchifier, AudioPatchifier
│   └── conditioning/               # VideoLatentTools, AudioLatentTools
└── loader/
    ├── single_gpu_model_builder.py # SingleGPUModelBuilder
    └── sd_ops.py                   # Key remapping (SDOps)
```
packages/ltx-trainer/CLAUDE.md ADDED
@@ -0,0 +1 @@
AGENTS.md
packages/ltx-trainer/README.md ADDED
@@ -0,0 +1,61 @@
# LTX-2 Trainer

This package provides tools and scripts for training and fine-tuning
Lightricks' **LTX-2** audio-video generation model. It enables LoRA training, full
fine-tuning, and training of video-to-video transformations (IC-LoRA) on custom datasets.

---

## 📖 Documentation

All detailed guides and technical documentation are in the [docs](./docs/) directory:

- [⚡ Quick Start Guide](docs/quick-start.md)
- [🎬 Dataset Preparation](docs/dataset-preparation.md)
- [🛠️ Training Modes](docs/training-modes.md)
- [⚙️ Configuration Reference](docs/configuration-reference.md)
- [🚀 Training Guide](docs/training-guide.md)
- [🔧 Utility Scripts](docs/utility-scripts.md)
- [📚 LTX-Core API Guide](docs/ltx-core-api-guide.md)
- [🛡️ Troubleshooting Guide](docs/troubleshooting.md)

---

## 🔧 Requirements

- **LTX-2 Model Checkpoint** - Local `.safetensors` file
- **Gemma Text Encoder** - Local Gemma model directory (required for LTX-2)
- **Linux with CUDA** - CUDA 13+ recommended for optimal performance
- **Nvidia GPU with 80GB+ VRAM** - Highly recommended; lower VRAM may work with gradient checkpointing and lower
  resolutions

---

## 🤝 Contributing

We welcome contributions from the community! Here's how you can help:

- **Share Your Work**: If you've trained interesting LoRAs or achieved cool results, please share them with the
  community.
- **Report Issues**: Found a bug or have a suggestion? Open an issue on GitHub.
- **Submit PRs**: Help improve the codebase with bug fixes or general improvements.
- **Feature Requests**: Have ideas for new features? Let us know through GitHub issues.

---

## 💬 Join the Community

Have questions, want to share your results, or need real-time help?

Join our [community Discord server](https://discord.gg/2mafsHjJ) to connect with other users and the development team!

- Get troubleshooting help
- Share your training results and workflows
- Collaborate on new ideas and features
- Stay up to date with announcements and updates

We look forward to seeing you there!

---

Happy training! 🎉
packages/ltx-trainer/pyproject.toml ADDED
@@ -0,0 +1,89 @@
[project]
name = "ltx-trainer"
version = "0.1.0"
description = "LTX-2 training, democratized."
readme = "README.md"
authors = [
    { name = "Matan Ben-Yosef", email = "mbyosef@lightricks.com" }
]
requires-python = ">=3.12"
dependencies = [
    "ltx-core",
    "accelerate>=1.2.1",
    "av>=14.2.1",
    "bitsandbytes>=0.45.2; sys_platform == 'linux'",
    "diffusers>=0.32.1",
    "huggingface-hub[hf-xet]>=0.31.4",
    "imageio>=2.37.0",
    "imageio-ffmpeg>=0.6.0",
    "opencv-python>=4.11.0.86",
    "optimum-quanto>=0.2.6",
    "pandas>=2.2.3",
    "peft>=0.14.0",
    "pillow-heif>=0.21.0",
    "pydantic>=2.10.4",
    "rich>=13.9.4",
    "safetensors>=0.5.0",
    "scenedetect>=0.6.5.2",
    "sentencepiece>=0.2.0",
    "torch>=2.6.0",
    "torchaudio>=2.9.0",
    "torchcodec>=0.8.1",
    "torchvision>=0.21.0",
    "typer>=0.15.1",
    "wandb>=0.19.11",
]

[dependency-groups]
dev = [
    "pre-commit>=4.0.1",
    "ruff>=0.8.6",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.ruff]
target-version = "py311"
line-length = 120

[tool.ruff.lint]
select = [
    "E",    # pycodestyle
    "F",    # pyflakes
    "W",    # pycodestyle (warnings)
    "I",    # isort
    "N",    # pep8-naming
    "ANN",  # flake8-annotations
    "B",    # flake8-bugbear
    "A",    # flake8-builtins
    "COM",  # flake8-commas
    "C4",   # flake8-comprehensions
    "DTZ",  # flake8-datetimez
    "EXE",  # flake8-executable
    "PIE",  # flake8-pie
    "T20",  # flake8-print
    "PT",   # flake8-pytest-style
    "SIM",  # flake8-simplify
    "ARG",  # flake8-unused-arguments
    "PTH",  # flake8-use-pathlib
    "ERA",  # eradicate (commented-out code)
    "RUF",  # Ruff-specific rules
    "PL",   # pylint
]
ignore = [
    "ANN002",   # Missing type annotation for *args
    "ANN003",   # Missing type annotation for **kwargs
    "ANN204",   # Missing type annotation for special method
    "COM812",   # Missing trailing comma
    "PTH123",   # `open()` should be replaced by `Path.open()`
    "PLR2004",  # Magic value used in comparison, consider replacing with a constant
]

[tool.ruff.lint.pylint]
max-args = 10

[tool.ruff.lint.isort]
known-first-party = ["ltx_trainer", "ltx_core", "ltx_pipelines"]