# LTX-Core Model API Guide

This guide explains the core concepts and APIs used in the LTX-2 Audio-Video diffusion model. Understanding these concepts is essential for training, fine-tuning, and running inference with LTX models.

## Table of Contents

1. [Overview](#overview)
2. [Core Concepts](#core-concepts)
   - [Modality](#modality---the-input-container)
   - [Patchifiers](#patchifiers---format-conversion)
   - [Latent Tools](#latent-tools---preparing-inputs)
   - [Conditioning Items](#conditioning-items---adding-constraints)
   - [Perturbations](#perturbations---fine-grained-control)
3. [Model Architecture](#model-architecture)
4. [Usage Patterns](#usage-patterns)
   - [Text-to-Video Generation](#text-to-video-generation)
   - [Image-to-Video Generation](#image-to-video-generation)
   - [Video-to-Video (IC-LoRA)](#video-to-video-ic-lora)
   - [Audio-Video Generation](#audio-video-generation)
5. [Common Pitfalls](#common-pitfalls)

---
## Overview

The LTX-2 model is a **joint Audio-Video diffusion transformer**. Unlike traditional models that handle one modality at a time, LTX-2 processes **video and audio simultaneously** in a unified architecture, enabling cross-modal attention between them.

Key characteristics:

- **Dual-stream architecture**: Separate processing paths for video and audio that interact via cross-attention
- **Per-token timesteps**: Different tokens can have different noise levels (enables advanced conditioning; see the sketch below)
- **Flexible conditioning**: Supports text, image, and video conditioning
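Per-token timesteps are central to how LTX-2 does conditioning, so a concrete picture helps. A minimal sketch of what a per-token timestep tensor looks like (the values are illustrative; the usage patterns later in this guide build this tensor as `sigma * denoise_mask`):

```python
import torch

# A batch with 8 tokens: the first 2 are conditioning tokens (kept clean at
# sigma = 0), the remaining 6 are denoised at the current sigma.
denoise_mask = torch.tensor([[0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]])  # (B, seq_len)
sigma = 0.7
timesteps = sigma * denoise_mask  # per-token noise levels; conditioned tokens stay at 0
```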
---

## Core Concepts

### Modality - The Input Container

The `Modality` dataclass wraps all information needed to process either video or audio:

```python
from dataclasses import dataclass

import torch

from ltx_core.model.transformer.modality import Modality


# The definition, for reference:
@dataclass
class Modality:
    enabled: bool                      # Whether this modality should be processed
    latent: torch.Tensor               # Shape: (B, seq_len, D) - patchified tokens
    timesteps: torch.Tensor            # Shape: (B, seq_len) - noise level per token
    positions: torch.Tensor            # Shape: (B, dims, seq_len, 2) - spatial/temporal coordinates
    context: torch.Tensor              # Text embeddings
    context_mask: torch.Tensor | None  # Optional attention mask for the context
```
**Field descriptions:**

| Field | Description |
|-------|-------------|
| `enabled` | Set to `False` to skip processing this modality |
| `latent` | Sequence of tokens in patchified format (not spatial `[B,C,F,H,W]`) |
| `timesteps` | Per-token noise levels (sigma values). Enables token-level conditioning |
| `positions` | Coordinates for RoPE (Rotary Position Embeddings). Video: `[B, 3, seq, 2]`, Audio: `[B, 1, seq, 2]` |
| `context` | Text prompt embeddings from the Gemma encoder |
| `context_mask` | Optional attention mask for the context |
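The text-to-video example later in this guide passes a `disabled_audio` modality so that only video is processed. A minimal sketch of such a placeholder, assuming the tensor fields are ignored when `enabled=False` (the tiny shapes here are illustrative, not an API requirement):

```python
import torch

# Hypothetical placeholder: skip audio entirely during a video-only forward pass.
disabled_audio = Modality(
    enabled=False,                  # model should skip this modality
    latent=torch.zeros(1, 1, 1),    # assumed ignored when enabled=False
    timesteps=torch.zeros(1, 1),
    positions=torch.zeros(1, 1, 1, 2),
    context=torch.zeros(1, 1, 1),
    context_mask=None,
)
```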
### Patchifiers - Format Conversion

Patchifiers convert between spatial format and sequence format:

```python
from ltx_core.pipeline.components.patchifiers import (
    VideoLatentPatchifier,
    AudioPatchifier,
    VideoLatentShape,
    AudioLatentShape,
)

# Video patchification
video_patchifier = VideoLatentPatchifier(patch_size=1)

# Spatial to sequence: [B, C, F, H, W] → [B, F*H*W, C]
patchified = video_patchifier.patchify(video_latent)

# Sequence to spatial: [B, seq_len, C] → [B, C, F, H, W]
spatial = video_patchifier.unpatchify(
    patchified,
    output_shape=VideoLatentShape(
        batch=1, channels=128, frames=7, height=16, width=24
    ),
)

# Audio patchification
audio_patchifier = AudioPatchifier(patch_size=1)

# [B, C, T, mel_bins] → [B, T, C*mel_bins]
patchified_audio = audio_patchifier.patchify(audio_latent)
```
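Since `patchify` and `unpatchify` are inverses, a quick shape round-trip is a handy sanity check. A short sketch using the same shapes as above (the random tensor stands in for real VAE output):

```python
import torch

# Dummy latent with the shapes used above: 128 channels, 7 latent frames, 16x24 grid
video_latent = torch.randn(1, 128, 7, 16, 24)

tokens = video_patchifier.patchify(video_latent)  # [1, 7*16*24, 128] = [1, 2688, 128]
restored = video_patchifier.unpatchify(
    tokens,
    output_shape=VideoLatentShape(batch=1, channels=128, frames=7, height=16, width=24),
)
assert restored.shape == video_latent.shape  # round-trip preserves the spatial shape
```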
### Latent Tools - Preparing Inputs

Latent tools handle the setup of initial latents, masks, and positions. Combined with conditioning items, they provide flexible input preparation:

```python
from ltx_core.pipeline.conditioning.tools import (
    VideoLatentTools,
    AudioLatentTools,
    LatentState,
)
from ltx_core.pipeline.components.patchifiers import VideoLatentShape, AudioLatentShape
from ltx_core.pipeline.components.protocols import VideoPixelShape

# Create video latent tools
pixel_shape = VideoPixelShape(
    batch=1,
    frames=49,  # Must be k*8 + 1 (e.g., 49, 97, 121)
    height=512,
    width=768,
    fps=25.0,
)
video_tools = VideoLatentTools(
    patchifier=video_patchifier,
    target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
    fps=25.0,
)

# Create an empty latent state (zeros with positions computed)
video_state = video_tools.create_initial_state(device=device, dtype=torch.bfloat16)
# video_state.latent:       [B, seq_len, 128]  - zeros (will be replaced with noise)
# video_state.denoise_mask: [B, seq_len, 1]    - ones (all tokens to denoise)
# video_state.positions:    [B, 3, seq_len, 2] - pixel coordinates for RoPE

# Audio latent tools (similar pattern)
audio_tools = AudioLatentTools(
    patchifier=audio_patchifier,
    target_shape=AudioLatentShape.from_duration(
        batch=1,
        duration=2.0,  # seconds
        channels=8,
        mel_bins=16,
    ),
)
audio_state = audio_tools.create_initial_state(device, dtype)
```
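For reference, the latent grid behind these states follows directly from the pixel shape: the VAE compresses 8× temporally (plus one causal frame) and 32× spatially, which matches the `VideoLatentShape(frames=7, height=16, width=24)` used in the patchifier example above. A sketch of the arithmetic (the 8× and 32× factors come from the constraints listed under Common Pitfalls):

```python
num_frames, height, width = 49, 512, 768

latent_frames = (num_frames - 1) // 8 + 1  # 7 (the "+1" is the causal frame)
latent_height = height // 32               # 16
latent_width = width // 32                 # 24

seq_len = latent_frames * latent_height * latent_width  # 2688 video tokens
```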
### Conditioning Items - Adding Constraints

Conditioning items modify latent states to add constraints like first-frame conditioning:

```python
from ltx_core.pipeline.conditioning.types.latent_cond import VideoConditionByLatentIndex
from ltx_core.pipeline.conditioning.types.keyframe_cond import VideoConditionByKeyframeIndex

# Option 1: Condition by latent index (replaces tokens in-place)
first_frame_cond = VideoConditionByLatentIndex(
    latent=encoded_image,  # VAE-encoded image [B, C, 1, H, W]
    strength=1.0,          # 1.0 = fully conditioned, 0.0 = fully denoised
    latent_idx=0,          # Which latent frame to condition
)
video_state = first_frame_cond.apply_to(video_state, video_tools)

# Option 2: Condition by keyframe (appends conditioning tokens)
keyframe_cond = VideoConditionByKeyframeIndex(
    keyframes=encoded_image,  # VAE-encoded keyframe(s)
    frame_idx=0,              # Target frame index
    strength=1.0,
)
video_state = keyframe_cond.apply_to(video_state, video_tools)
```

**Key concepts:**

- `LatentState` is a frozen dataclass containing `latent`, `denoise_mask`, and `positions`
- `denoise_mask` values: `1.0` = denoise this token, `0.0` = keep this token fixed
- Conditioning items return a new `LatentState` (immutable pattern); the mask-to-timestep relationship is sketched below
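The denoise mask is what becomes the per-token timesteps inside the sampling loop: at every step the current sigma is broadcast only onto tokens the mask marks as denoisable, exactly as the usage patterns below do. A minimal sketch:

```python
# After first-frame conditioning, denoise_mask is 0.0 on the conditioned tokens.
sigma = 0.7
timesteps = sigma * video_state.denoise_mask
# Conditioned tokens get timestep 0 (clean signal); all others get the current sigma.
```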
### Perturbations - Fine-Grained Control

Perturbations allow you to selectively skip operations at the per-sample, per-block level:

```python
from ltx_core.guidance.perturbations import (
    Perturbation,
    PerturbationType,
    PerturbationConfig,
    BatchedPerturbationConfig,
)

# Available perturbation types
PerturbationType.SKIP_A2V_CROSS_ATTN   # Skip audio→video cross attention
PerturbationType.SKIP_V2A_CROSS_ATTN   # Skip video→audio cross attention
PerturbationType.SKIP_VIDEO_SELF_ATTN  # Skip video self attention
PerturbationType.SKIP_AUDIO_SELF_ATTN  # Skip audio self attention

# Example: Skip audio→video attention in specific blocks
perturbation = Perturbation(
    type=PerturbationType.SKIP_A2V_CROSS_ATTN,
    blocks=[0, 1, 2, 3],  # Skip in blocks 0-3, or None for all blocks
)
config = PerturbationConfig(perturbations=[perturbation])

# For batched inputs
batched_config = BatchedPerturbationConfig([config, config])  # batch_size=2

# Or use an empty config for normal operation
batched_config = BatchedPerturbationConfig.empty(batch_size=2)
```

**Use cases for perturbations:**

- **STG (Spatio-Temporal Guidance)**: Skip self-attention in block 29 to improve video quality
- Ablation studies (disable specific attention paths)
- Custom guidance strategies
- Debugging model behavior
**STG (Spatio-Temporal Guidance) Example:**

STG uses perturbations to improve video generation quality by running an additional forward pass with self-attention skipped:

```python
from ltx_core.guidance.perturbations import (
    Perturbation, PerturbationType, PerturbationConfig, BatchedPerturbationConfig
)
from ltx_core.pipeline.components.guiders import STGGuider

# Create the STG perturbation config (recommended: block 29)
stg_perturbation = Perturbation(
    type=PerturbationType.SKIP_VIDEO_SELF_ATTN,
    blocks=[29],  # Recommended: single block 29
)
stg_config = BatchedPerturbationConfig([PerturbationConfig([stg_perturbation])])

# In your denoising loop:
stg_guider = STGGuider(scale=1.0)  # Recommended scale

# Normal forward pass
pos_video, pos_audio = model(video=video, audio=audio, perturbations=None)

# Perturbed forward pass (for STG)
perturbed_video, perturbed_audio = model(video=video, audio=audio, perturbations=stg_config)

# Apply STG guidance
denoised_video = pos_video + stg_guider.delta(pos_video, perturbed_video)
```
---

## Model Architecture

The LTX-2 transformer consists of 48 blocks, each with the following structure:

```
┌────────────────────────────────────────────────────────────────────┐
│  VIDEO STREAM                          AUDIO STREAM                │
│  ────────────                          ────────────                │
│                                                                    │
│  1. Video Self-Attention               1. Audio Self-Attention     │
│     (attends to all video)                (attends to all audio)   │
│                                                                    │
│  2. Video Cross-Attention              2. Audio Cross-Attention    │
│     (attends to text prompt)              (attends to text prompt) │
│                                                                    │
│        ┌────────────────────────────────────┐                      │
│        │  3. AUDIO-VIDEO CROSS ATTENTION    │                      │
│        │                                    │                      │
│        │  • Audio-to-Video (A→V):           │                      │
│        │    Video queries, Audio keys/vals  │                      │
│        │                                    │                      │
│        │  • Video-to-Audio (V→A):           │                      │
│        │    Audio queries, Video keys/vals  │                      │
│        └────────────────────────────────────┘                      │
│                                                                    │
│  4. Video Feed-Forward                 4. Audio Feed-Forward       │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```

**Key insight**: Video and audio "talk" to each other through bidirectional cross-attention in every block, enabling synchronized audio-video generation.
### Forward Pass

```python
from ltx_core.model.transformer.model import LTXModel

# The transformer takes both modalities and returns predictions for both
video_velocity, audio_velocity = model(
    video=video_modality,
    audio=audio_modality,
    perturbations=None,  # or BatchedPerturbationConfig
)
# Returns velocity predictions used in the Euler diffusion step
```
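These velocities feed the pipeline's `EulerDiffusionStep`. Conceptually, an Euler step moves the latent along the predicted velocity by the difference between consecutive sigmas; a minimal sketch of that update (illustrative only; the actual component may differ in details):

```python
def euler_step(latent, velocity, sigmas, step_idx):
    # Step from sigmas[step_idx] toward sigmas[step_idx + 1] along the velocity
    dt = sigmas[step_idx + 1] - sigmas[step_idx]  # negative, since sigmas decrease
    return latent + dt * velocity
```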
---

## Usage Patterns

### Text-to-Video Generation

Basic text-to-video generation flow:

```python
from dataclasses import replace

from ltx_core.pipeline.components.schedulers import LTX2Scheduler
from ltx_core.pipeline.components.diffusion_steps import EulerDiffusionStep
from ltx_core.pipeline.components.guiders import CFGGuider
from ltx_core.pipeline.conditioning.tools import VideoLatentTools
from ltx_core.pipeline.components.patchifiers import VideoLatentShape
from ltx_core.pipeline.components.protocols import VideoPixelShape

# 1. Encode text prompt
video_context, audio_context, mask = text_encoder(prompt)

# 2. Create video latent tools and initial state
pixel_shape = VideoPixelShape(batch=1, frames=49, height=512, width=768, fps=25.0)
video_tools = VideoLatentTools(
    patchifier=video_patchifier,
    target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
    fps=25.0,
)
video_state = video_tools.create_initial_state(device, dtype)

# 3. Add noise to the latent
noise = torch.randn_like(video_state.latent)
noised_latent = noise  # Start from pure noise

# 4. Create video modality
video = Modality(
    enabled=True,
    latent=noised_latent,
    timesteps=video_state.denoise_mask,  # Will be updated each step
    positions=video_state.positions,
    context=video_context,
    context_mask=None,
)

# 5. Setup scheduler and diffusion components
scheduler = LTX2Scheduler()
sigmas = scheduler.execute(steps=30).to(device)
stepper = EulerDiffusionStep()

# 6. Denoising loop (disabled_audio is a Modality with enabled=False; see Core Concepts)
for step_idx, sigma in enumerate(sigmas[:-1]):
    # Update timesteps with current sigma (use replace for immutable Modality)
    video = replace(video, timesteps=sigma * video_state.denoise_mask)

    # Forward pass
    video_vel, _ = model(video=video, audio=disabled_audio, perturbations=None)

    # Euler step
    new_latent = stepper.step(video.latent, video_vel, sigmas, step_idx)
    video = replace(video, latent=new_latent)

# 7. Decode to pixels
video_spatial = video_tools.unpatchify(
    replace(video_state, latent=video.latent)
).latent  # [B, C, F, H, W]
video_pixels = vae_decoder(video_spatial)  # [B, 3, F, H, W]
```
### Image-to-Video Generation

Condition the first frame with an image:

```python
from ltx_core.pipeline.conditioning.types.latent_cond import VideoConditionByLatentIndex

# Encode the conditioning image
image_latent = vae_encoder(image)  # [B, C, 1, H, W]

# Create video tools and initial state
pixel_shape = VideoPixelShape(batch=1, frames=49, height=512, width=768, fps=25.0)
video_tools = VideoLatentTools(
    patchifier=video_patchifier,
    target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
    fps=25.0,
)
video_state = video_tools.create_initial_state(device, dtype)

# Apply first-frame conditioning
first_frame_cond = VideoConditionByLatentIndex(
    latent=image_latent,
    strength=1.0,  # 1.0 = fully conditioned (no denoising on first frame)
    latent_idx=0,  # Condition frame 0
)
video_state = first_frame_cond.apply_to(video_state, video_tools)
# The denoise_mask will be 0.0 for first-frame tokens, 1.0 for the rest

# Proceed with denoising as usual...
```
### Video-to-Video (IC-LoRA)

IC-LoRA enables video-to-video transformation by conditioning on a reference video. The key insight is that reference tokens are included in the sequence but kept at `timestep=0` (clean, no denoising).

```python
from dataclasses import replace

from ltx_core.pipeline.conditioning.tools import VideoLatentTools
from ltx_core.pipeline.components.patchifiers import VideoLatentShape
from ltx_core.pipeline.components.protocols import VideoPixelShape

# 1. Create video tools for the target
pixel_shape = VideoPixelShape(batch=1, frames=49, height=512, width=768, fps=25.0)
video_tools = VideoLatentTools(
    patchifier=video_patchifier,
    target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
    fps=25.0,
)

# 2. Encode reference video to latents and patchify
ref_latents = vae_encoder(reference_video)  # [B, C, F, H, W]
patchified_ref = video_patchifier.patchify(ref_latents)  # [B, ref_seq_len, C]
ref_seq_len = patchified_ref.shape[1]

# 3. Create target video state (positions computed automatically)
target_state = video_tools.create_initial_state(device, dtype)

# 4. Compute positions for the reference (SAME grid as the target!)
# Reference positions are identical to the target's - this tells the model they correspond
ref_positions = target_state.positions.clone()

# 5. CONCATENATE reference + target
combined_latent = torch.cat([patchified_ref, torch.randn_like(target_state.latent)], dim=1)
combined_positions = torch.cat([ref_positions, target_state.positions], dim=2)

# 6. Create denoise mask: 0 for reference (keep clean), 1 for target (denoise)
ref_denoise_mask = torch.zeros(1, ref_seq_len, 1, device=device)
combined_denoise_mask = torch.cat([ref_denoise_mask, target_state.denoise_mask], dim=1)

# 7. Create the modality with combined inputs
video = Modality(
    enabled=True,
    latent=combined_latent,
    timesteps=combined_denoise_mask,  # Will be updated with sigma
    positions=combined_positions,
    context=video_context,
    context_mask=None,
)

# 8. Denoising loop - only update the target portion
for step_idx, sigma in enumerate(sigmas[:-1]):
    # Timesteps: 0 for reference, sigma for target
    ref_timesteps = torch.zeros(1, ref_seq_len, 1, device=device)
    target_timesteps = sigma * target_state.denoise_mask
    new_timesteps = torch.cat([ref_timesteps, target_timesteps], dim=1)
    video = replace(video, timesteps=new_timesteps)

    # Forward pass
    video_vel, _ = model(video=video, audio=audio, perturbations=None)

    # Euler step - ONLY update the target portion
    target_latent = video.latent[:, ref_seq_len:]
    target_vel = video_vel[:, ref_seq_len:]
    updated_target = stepper.step(target_latent, target_vel, sigmas, step_idx)

    # Reconstruct (reference stays fixed)
    new_latent = torch.cat([patchified_ref, updated_target], dim=1)
    video = replace(video, latent=new_latent)

# 9. Extract and decode only the target portion
final_target = video.latent[:, ref_seq_len:]
target_state_with_output = replace(target_state, latent=final_target)
target_spatial = video_tools.unpatchify(target_state_with_output).latent
video_pixels = vae_decoder(target_spatial)
```

**Why this works:**

- Self-attention sees both reference and target tokens
- Reference tokens have `timestep=0` (clean signal), so the model learns to "copy" from them
- Shared positions tell the model "frame N of the reference = frame N of the target"
- Only the target portion is updated during denoising
### Audio-Video Generation

Generate synchronized audio and video:

```python
from dataclasses import replace

from ltx_core.pipeline.conditioning.tools import VideoLatentTools, AudioLatentTools
from ltx_core.pipeline.components.patchifiers import VideoLatentShape, AudioLatentShape
from ltx_core.pipeline.components.protocols import VideoPixelShape

# Create latent tools for both modalities
pixel_shape = VideoPixelShape(batch=1, frames=49, height=512, width=768, fps=25.0)
video_tools = VideoLatentTools(
    patchifier=video_patchifier,
    target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
    fps=25.0,
)
audio_tools = AudioLatentTools(
    patchifier=audio_patchifier,
    target_shape=AudioLatentShape.from_duration(batch=1, duration=2.0, channels=8, mel_bins=16),
)

# Create initial states
video_state = video_tools.create_initial_state(device, dtype)
audio_state = audio_tools.create_initial_state(device, dtype)

# Encode text (returns separate embeddings for each modality)
video_context, audio_context, mask = text_encoder(prompt)

# Create both modalities with noise
video = Modality(
    enabled=True,
    latent=torch.randn_like(video_state.latent),
    timesteps=video_state.denoise_mask,
    positions=video_state.positions,
    context=video_context,
    context_mask=None,
)
audio = Modality(
    enabled=True,
    latent=torch.randn_like(audio_state.latent),
    timesteps=audio_state.denoise_mask,
    positions=audio_state.positions,
    context=audio_context,
    context_mask=None,
)

# Denoising loop - update both (use replace for immutable Modality)
for step_idx, sigma in enumerate(sigmas[:-1]):
    video = replace(video, timesteps=sigma * video_state.denoise_mask)
    audio = replace(audio, timesteps=sigma * audio_state.denoise_mask)

    # Forward pass returns both predictions
    video_vel, audio_vel = model(video=video, audio=audio, perturbations=None)

    # Update both latents
    video = replace(video, latent=stepper.step(video.latent, video_vel, sigmas, step_idx))
    audio = replace(audio, latent=stepper.step(audio.latent, audio_vel, sigmas, step_idx))

# Decode both
video_spatial = video_tools.unpatchify(replace(video_state, latent=video.latent)).latent
video_pixels = vae_decoder(video_spatial)

audio_spatial = audio_tools.unpatchify(replace(audio_state, latent=audio.latent)).latent
audio_mel = audio_decoder(audio_spatial)
audio_waveform = vocoder(audio_mel)
```
---

## Common Pitfalls

### 1. Frame Count Constraints

Video frame count must satisfy `num_frames % 8 == 1`:

- ✅ Valid: 49, 97, 121, 145
- ❌ Invalid: 48, 50, 100

```python
# The "+1" accounts for causal padding in the VAE
latent_frames = (num_frames - 1) // 8 + 1
```
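If incoming frame counts are arbitrary, a small helper can snap them to the nearest valid value. `nearest_valid_frames` is a hypothetical convenience, not part of the API:

```python
def nearest_valid_frames(num_frames: int) -> int:
    # Round to the nearest count satisfying num_frames % 8 == 1
    return round((num_frames - 1) / 8) * 8 + 1

assert nearest_valid_frames(50) == 49
assert nearest_valid_frames(100) == 97
```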
### 2. Resolution Constraints

Height and width must be divisible by 32:

- ✅ Valid: 512×768, 768×1024
- ❌ Invalid: 500×750
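The same snap-to-grid trick works for resolutions. `nearest_valid_resolution` is likewise hypothetical:

```python
def nearest_valid_resolution(height: int, width: int) -> tuple[int, int]:
    # Round each dimension to the nearest multiple of 32
    return round(height / 32) * 32, round(width / 32) * 32

assert nearest_valid_resolution(500, 750) == (512, 736)
```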
### 3. Position Tensor Shapes

Different modalities have different position tensor shapes:

- Video: `[B, 3, seq_len, 2]` - 3 dimensions for (time, height, width)
- Audio: `[B, 1, seq_len, 2]` - 1 dimension for time only
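A cheap guard before the forward pass catches the most common mix-up (the assertions are illustrative):

```python
# Video positions carry (time, height, width); audio positions carry time only.
assert video.positions.shape[1] == 3, "video positions must be [B, 3, seq_len, 2]"
assert audio.positions.shape[1] == 1, "audio positions must be [B, 1, seq_len, 2]"
```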
### 4. Separate Context Embeddings

Video and audio modalities receive **different** context embeddings from the text encoder:

```python
# The text encoder returns separate embeddings
video_context, audio_context, mask = text_encoder(prompt)

# Use the appropriate one for each modality
video = Modality(context=video_context, ...)  # NOT audio_context!
audio = Modality(context=audio_context, ...)  # NOT video_context!
```
### 5. Immutable Modality

The `Modality` dataclass is **frozen** (immutable). Use `dataclasses.replace()` to create modified copies:

```python
from dataclasses import replace

# ❌ Wrong - will raise an error
video.latent = new_latent

# ✅ Correct - create a new Modality with the updated field
video = replace(video, latent=new_latent)

# ✅ Update multiple fields at once
video = replace(video, latent=new_latent, timesteps=new_timesteps)
```
---

## Additional Resources

- [Training Guide](./training-guide.md) - How to fine-tune LTX-2 models
- [Configuration Reference](./configuration-reference.md) - All configuration options
- [Training Modes](./training-modes.md) - LoRA, audio-video, and IC-LoRA training