AGENTS.md
This file provides guidance to AI coding assistants (Claude, Cursor, etc.) when working with code in this repository.
Project Overview
LTX-2 Trainer is a training toolkit for fine-tuning the Lightricks LTX-2 audio-video generation model. It supports:
- LoRA training - Efficient fine-tuning with adapters
- Full fine-tuning - Complete model training
- Audio-video training - Joint audio and video generation
- IC-LoRA training - In-context control adapters for video-to-video transformations
Key Dependencies:
- `ltx-core` - Core model implementations (transformer, VAE, text encoder)
- `ltx-pipelines` - Inference pipeline components
Important: This trainer only supports LTX-2 (the audio-video model). The older LTXV models are not supported.
Architecture Overview
Package Structure
```
packages/ltx-trainer/
├── src/ltx_trainer/              # Main training module
│   ├── config.py                 # Pydantic configuration models
│   ├── trainer.py                # Main training orchestration with Accelerate
│   ├── model_loader.py           # Model loading using ltx-core
│   ├── validation_sampler.py     # Inference for validation samples
│   ├── datasets.py               # PrecomputedDataset for latent-based training
│   ├── training_strategies/      # Strategy pattern for different training modes
│   │   ├── __init__.py           # Factory function: get_training_strategy()
│   │   ├── base_strategy.py      # TrainingStrategy ABC, ModelInputs, TrainingStrategyConfigBase
│   │   ├── text_to_video.py      # TextToVideoStrategy, TextToVideoConfig
│   │   └── video_to_video.py     # VideoToVideoStrategy, VideoToVideoConfig
│   ├── timestep_samplers.py      # Flow matching timestep sampling
│   ├── captioning.py             # Video captioning utilities
│   ├── video_utils.py            # Video processing utilities
│   └── hf_hub_utils.py           # HuggingFace Hub integration
├── scripts/                      # User-facing CLI tools
│   ├── train.py                  # Main training script
│   ├── process_dataset.py        # Dataset preprocessing
│   ├── process_videos.py         # Video latent encoding
│   ├── process_captions.py       # Text embedding computation
│   ├── caption_videos.py         # Automatic video captioning
│   ├── decode_latents.py         # Latent decoding for debugging
│   ├── inference.py              # Inference with trained models
│   ├── compute_reference.py      # Generate IC-LoRA reference videos
│   └── split_scenes.py           # Scene detection and splitting
├── configs/                      # Example training configurations
│   ├── ltx2_av_lora.yaml         # Audio-video LoRA training
│   ├── ltx2_v2v_ic_lora.yaml     # IC-LoRA video-to-video
│   └── accelerate/               # Accelerate configs for distributed training
└── docs/                         # Documentation
```
Key Architectural Patterns
Model Loading:
- `ltx_trainer.model_loader` provides component loaders using `ltx-core`
- Individual loaders: `load_transformer()`, `load_video_vae_encoder()`, `load_video_vae_decoder()`, `load_text_encoder()`, etc.
- Combined loader: `load_model()` returns the `LtxModelComponents` dataclass
- Uses `SingleGPUModelBuilder` from ltx-core internally
Training Flow:
- Configuration is loaded via Pydantic models in `config.py`
- The `Trainer` class orchestrates the training loop
- Training strategies (`TextToVideoStrategy`, `VideoToVideoStrategy`) prepare inputs and compute loss
- Accelerate handles distributed training and device placement
- Data flows as precomputed latents through `PrecomputedDataset`
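A minimal sketch of how these pieces fit together (the factory call and the loop methods shown in comments are assumptions for illustration, not the trainer's exact API):

```python
# Conceptual sketch only -- check trainer.py and training_strategies/ for the real signatures.
import yaml

from ltx_trainer.config import LtxTrainerConfig
from ltx_trainer.training_strategies import get_training_strategy

with open("configs/ltx2_av_lora.yaml") as f:
    config = LtxTrainerConfig(**yaml.safe_load(f))  # Pydantic validates every field

strategy = get_training_strategy(config)  # argument assumed; returns a TrainingStrategy subclass

# Inside the Trainer loop (simplified pseudocode):
#   inputs = strategy.prepare_training_inputs(batch)  # precomputed latents -> ModelInputs
#   loss = strategy.compute_loss(...)                 # mode-specific flow-matching loss
#   accelerator.backward(loss)                        # Accelerate handles devices and precision
```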
Model Interface (Modality-based):
```python
from ltx_core.model.transformer.modality import Modality

# Create modality objects for video and audio
video = Modality(
    enabled=True,
    latent=video_latents,        # [B, seq_len, 128]
    timesteps=video_timesteps,   # [B, seq_len] per-token
    positions=video_positions,   # [B, 3, seq_len, 2]
    context=video_embeds,
    context_mask=None,
)

audio = Modality(
    enabled=True,
    latent=audio_latents,
    timesteps=audio_timesteps,
    positions=audio_positions,   # [B, 1, seq_len, 2]
    context=audio_embeds,
    context_mask=None,
)

# Forward pass returns predictions for both modalities
video_pred, audio_pred = model(video=video, audio=audio, perturbations=None)
```
Note: `Modality` is immutable (a frozen dataclass). Use `dataclasses.replace()` to modify it.
Configuration System:
- All config lives in `src/ltx_trainer/config.py`
- Main class: `LtxTrainerConfig`
- Training strategy configs: `TextToVideoConfig`, `VideoToVideoConfig`
- Uses Pydantic field validators and model validators
- Config files live in the `configs/` directory
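Because the config classes are plain Pydantic models, the example configs can be validated directly. A small hypothetical check (not a script shipped with the repo):

```python
from pathlib import Path

import yaml

from ltx_trainer.config import LtxTrainerConfig

# Validate every example config; raises pydantic.ValidationError on bad fields.
for cfg_path in sorted(Path("configs").glob("*.yaml")):
    LtxTrainerConfig(**yaml.safe_load(cfg_path.read_text()))
    print(f"OK: {cfg_path.name}")
```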
Development Commands
Setup and Installation
```bash
# From the repository root
uv sync
cd packages/ltx-trainer
```
Code Quality
```bash
# Run ruff linting and formatting
uv run ruff check .
uv run ruff format .

# Run pre-commit checks
uv run pre-commit run --all-files
```
Running Tests
```bash
cd packages/ltx-trainer
uv run pytest
```
Running Training
```bash
# Single GPU
uv run python scripts/train.py configs/ltx2_av_lora.yaml

# Multi-GPU with Accelerate
uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml
```
Code Standards
Type Hints
- Always use type hints for all function arguments and return values
- Use Python 3.10+ syntax: `list[str]` not `List[str]`, `str | Path` not `Union[str, Path]`
- Use `pathlib.Path` for file operations
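For example (an illustrative helper, not part of the codebase):

```python
from pathlib import Path


def load_prompts(prompt_file: str | Path, limit: int | None = None) -> list[str]:
    """Read one prompt per line, using 3.10+ builtin generics and pathlib."""
    lines = Path(prompt_file).read_text().splitlines()
    return lines[:limit] if limit is not None else lines
```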
Class Methods
- Mark methods as `@staticmethod` if they don't access instance or class state
- Use `@classmethod` for alternative constructors
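A short illustration of both rules (the class is made up for this example):

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SampleSpec:  # hypothetical class, used only to illustrate the conventions
    prompt: str
    num_frames: int

    @staticmethod
    def is_valid_frame_count(frames: int) -> bool:
        # Needs no instance or class state -> staticmethod
        return frames % 8 == 1

    @classmethod
    def from_prompt_file(cls, path: Path) -> "SampleSpec":
        # Alternative constructor -> classmethod
        return cls(prompt=path.read_text().strip(), num_frames=121)
```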
AI/ML Specific
- Use `@torch.inference_mode()` for inference (prefer over `@torch.no_grad()`)
- Use `accelerator.device` for distributed compatibility
- Support mixed precision (`bfloat16` via dtype parameters)
- Use gradient checkpointing for memory-intensive training
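A sketch combining these conventions (illustrative only; the real transformer takes `Modality` objects rather than a raw tensor):

```python
import torch
from accelerate import Accelerator


@torch.inference_mode()  # preferred over torch.no_grad() for inference-only code
def run_validation_forward(
    model: torch.nn.Module, batch: dict[str, torch.Tensor], accelerator: Accelerator
) -> torch.Tensor:
    # Use accelerator.device for placement and bfloat16 for mixed precision.
    latents = batch["latents"].to(accelerator.device, dtype=torch.bfloat16)
    return model(latents)
```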
Logging
- Use `from ltx_trainer import logger` for all messages
- Avoid `print` statements in production code
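For example (assuming the logger exposes the standard `info`/`warning` methods):

```python
from ltx_trainer import logger

logger.info("Starting dataset preprocessing")  # instead of print(...)
```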
Important Files & Modules
Configuration (CRITICAL)
src/ltx_trainer/config.py - Master config definitions
Key classes:
- `LtxTrainerConfig` - Main configuration container
- `ModelConfig` - Model paths and training mode
- `TrainingStrategyConfig` - Union of `TextToVideoConfig | VideoToVideoConfig`
- `LoraConfig` - LoRA hyperparameters
- `OptimizationConfig` - Learning rate, batch size, etc.
- `ValidationConfig` - Validation settings
- `WandbConfig` - W&B logging settings
⚠️ When modifying config.py:
- Update ALL config files in `configs/`
- Update `docs/configuration-reference.md`
- Test that all configs remain valid
Training Core
src/ltx_trainer/trainer.py - Main training loop
- Implements distributed training with Accelerate
- Handles mixed precision, gradient accumulation, checkpointing
- Uses training strategies for mode-specific logic
src/ltx_trainer/training_strategies/ - Strategy pattern
- `base_strategy.py`: `TrainingStrategy` ABC, `ModelInputs` dataclass
- `text_to_video.py`: Standard text-to-video (with optional audio)
- `video_to_video.py`: IC-LoRA video-to-video transformations
Key methods each strategy implements:
- `get_data_sources()` - Required data directories
- `prepare_training_inputs()` - Convert batch to `ModelInputs`
- `compute_loss()` - Calculate training loss
- `requires_audio` property - Whether audio components are needed
src/ltx_trainer/model_loader.py - Model loading
Component loaders:
- `load_transformer()` → `LTXModel`
- `load_video_vae_encoder()` → `VideoVAEEncoder`
- `load_video_vae_decoder()` → `VideoVAEDecoder`
- `load_audio_vae_decoder()` → `AudioVAEDecoder`
- `load_vocoder()` → `Vocoder`
- `load_text_encoder()` → `AVGemmaTextEncoderModel`
- `load_model()` → `LtxModelComponents` (convenience wrapper)
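A hedged usage sketch; the keyword arguments mirror the config field names, but check the actual loader signatures in model_loader.py:

```python
from ltx_trainer.model_loader import load_model

# Argument and attribute names below are assumptions based on the config fields.
components = load_model(
    model_path="/path/to/ltx2.safetensors",
    text_encoder_path="/path/to/gemma-model-dir",
)
transformer = components.transformer  # LtxModelComponents attribute name assumed
```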
src/ltx_trainer/validation_sampler.py - Inference for validation
Uses ltx-core components for denoising:
- `LTX2Scheduler` for sigma scheduling
- `EulerDiffusionStep` for diffusion steps
- `CFGGuider` for classifier-free guidance
Data
src/ltx_trainer/datasets.py - Dataset handling
- `PrecomputedDataset` loads pre-computed VAE latents
- Supports video latents, audio latents, text embeddings, and reference latents
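A minimal sketch of wiring the dataset into a DataLoader (the constructor argument is an assumption; see datasets.py for the real signature):

```python
from torch.utils.data import DataLoader

from ltx_trainer.datasets import PrecomputedDataset

dataset = PrecomputedDataset("data/preprocessed")  # directory of precomputed latents (assumed)
loader = DataLoader(dataset, batch_size=1, shuffle=True)
```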
Common Development Tasks
Adding a New Configuration Parameter
- Add the field to the appropriate config class in `src/ltx_trainer/config.py` (see the sketch below)
- Add a validator if needed
- Update ALL config files in `configs/`
- Update `docs/configuration-reference.md`
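The field/validator pattern looks roughly like this (the class and field are hypothetical; in practice the field is added to one of the existing classes in config.py):

```python
from pydantic import BaseModel, Field, field_validator


class ExampleConfigSection(BaseModel):  # stand-in for an existing config class
    my_new_param: float = Field(default=0.1, description="What this parameter controls")

    @field_validator("my_new_param")
    @classmethod
    def _check_range(cls, value: float) -> float:
        if not 0.0 <= value <= 1.0:
            raise ValueError("my_new_param must be in [0, 1]")
        return value
```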
Implementing a New Training Strategy
- Create a new file in `src/ltx_trainer/training_strategies/`
- Create a config class inheriting `TrainingStrategyConfigBase`
- Create a strategy class inheriting `TrainingStrategy` (skeleton below)
- Implement: `get_data_sources()`, `prepare_training_inputs()`, `compute_loss()`
- Add to `__init__.py`: import the strategy, add it to the `TrainingStrategyConfig` union, update the factory
- Add a discriminator tag to config.py's `TrainingStrategyConfig`
- Create an example config file in `configs/`
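A skeleton of the pattern (method signatures are assumptions; mirror the existing strategies when implementing a real one):

```python
from ltx_trainer.training_strategies.base_strategy import (
    ModelInputs,
    TrainingStrategy,
    TrainingStrategyConfigBase,
)


class MyStrategyConfig(TrainingStrategyConfigBase):
    """Config for the new strategy; add it to the TrainingStrategyConfig union."""


class MyStrategy(TrainingStrategy):
    def get_data_sources(self) -> list[str]:
        # Names of the precomputed data directories this strategy expects (return type assumed).
        return ["latents", "conditions"]

    def prepare_training_inputs(self, batch) -> ModelInputs:
        raise NotImplementedError  # convert a batch of precomputed latents to ModelInputs

    def compute_loss(self, *args, **kwargs):
        raise NotImplementedError  # mode-specific training loss

    @property
    def requires_audio(self) -> bool:
        return False  # set True if the strategy needs audio components
```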
Working with Modalities
```python
from dataclasses import replace

from ltx_core.model.transformer.modality import Modality

# Create a modality
video = Modality(
    enabled=True,
    latent=latents,
    timesteps=timesteps,
    positions=positions,
    context=context,
    context_mask=None,
)

# Update (immutable - must use replace)
video = replace(video, latent=new_latent, timesteps=new_timesteps)

# Disable a modality
audio = replace(audio, enabled=False)
```
Debugging Tips
Training Issues:
- Check logs first (rich logger provides context)
- GPU memory: look for OOM errors; enable `enable_gradient_checkpointing: true`
- Distributed training: check `accelerator.state` and device placement
Model Loading:
- Ensure `model_path` points to a local `.safetensors` file
- Ensure `text_encoder_path` points to a Gemma model directory
- URLs are NOT supported for model paths
Configuration:
- Validation errors: check the validators in `config.py`
- Unknown fields: the config uses `extra="forbid"`, so all fields must be defined
- Strategy validation: IC-LoRA requires `reference_videos` in the validation config
Key Constraints
LTX-2 Frame Requirements
Frames must satisfy `frames % 8 == 1`:
- ✅ Valid: 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 121
- ❌ Invalid: 24, 32, 48, 64, 100
Resolution Requirements
Width and height must be divisible by 32.
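Both constraints can be checked up front with a small helper (not part of the trainer):

```python
def check_ltx2_dims(frames: int, width: int, height: int) -> None:
    """Raise if the requested video shape violates the LTX-2 constraints."""
    if frames % 8 != 1:
        raise ValueError(f"frames must satisfy frames % 8 == 1, got {frames}")
    if width % 32 != 0 or height % 32 != 0:
        raise ValueError(f"width and height must be divisible by 32, got {width}x{height}")


check_ltx2_dims(frames=121, width=768, height=512)    # OK
# check_ltx2_dims(frames=120, width=768, height=512)  # raises ValueError
```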
Model Paths
- Must be local paths (URLs are not supported)
- `model_path`: Path to the `.safetensors` checkpoint
- `text_encoder_path`: Path to the Gemma model directory
Platform Requirements
- Linux is required (uses `triton`, which is Linux-only)
- A CUDA GPU with 24GB+ VRAM is recommended
Reference: ltx-core Key Components
```
packages/ltx-core/src/ltx_core/
├── model/
│   ├── transformer/
│   │   ├── model.py                  # LTXModel
│   │   ├── modality.py               # Modality dataclass
│   │   └── transformer.py            # BasicAVTransformerBlock
│   ├── video_vae/
│   │   └── video_vae.py              # Encoder, Decoder
│   ├── audio_vae/
│   │   ├── audio_vae.py              # Decoder
│   │   └── vocoder.py                # Vocoder
│   └── clip/gemma/
│       └── encoders/av_encoder.py    # AVGemmaTextEncoderModel
├── pipeline/
│   ├── components/
│   │   ├── schedulers.py             # LTX2Scheduler
│   │   ├── diffusion_steps.py        # EulerDiffusionStep
│   │   ├── guiders.py                # CFGGuider
│   │   └── patchifiers.py            # VideoLatentPatchifier, AudioPatchifier
│   └── conditioning/                 # VideoLatentTools, AudioLatentTools
└── loader/
    ├── single_gpu_model_builder.py   # SingleGPUModelBuilder
    └── sd_ops.py                     # Key remapping (SDOps)
```