# LTX-Core Model API Guide
This guide explains the core concepts and APIs used in the LTX-2 Audio-Video diffusion model. Understanding these concepts is essential for training, fine-tuning, and running inference with LTX models.
## Table of Contents
1. [Overview](#overview)
2. [Core Concepts](#core-concepts)
- [Modality](#modality---the-input-container)
- [Patchifiers](#patchifiers---format-conversion)
- [Latent Tools](#latent-tools---preparing-inputs)
- [Conditioning Items](#conditioning-items---adding-constraints)
- [Perturbations](#perturbations---fine-grained-control)
3. [Model Architecture](#model-architecture)
4. [Usage Patterns](#usage-patterns)
- [Text-to-Video Generation](#text-to-video-generation)
- [Image-to-Video Generation](#image-to-video-generation)
- [Video-to-Video (IC-LoRA)](#video-to-video-ic-lora)
- [Audio-Video Generation](#audio-video-generation)
5. [Common Pitfalls](#common-pitfalls)
---
## Overview
The LTX-2 model is a **joint Audio-Video diffusion transformer**. Unlike traditional models that handle one modality at a time, LTX-2 processes **video and audio simultaneously** in a unified architecture, enabling cross-modal attention between them.
Key characteristics:
- **Dual-stream architecture**: Separate processing paths for video and audio that interact via cross-attention
- **Per-token timesteps**: Different tokens can have different noise levels (enables advanced conditioning)
- **Flexible conditioning**: Supports text, image, and video conditioning
---
## Core Concepts
### Modality - The Input Container
The `Modality` dataclass wraps all information needed to process either video or audio:
```python
from ltx_core.model.transformer.modality import Modality
@dataclass
class Modality:
enabled: bool # Whether this modality should be processed
latent: torch.Tensor # Shape: (B, seq_len, D) - patchified tokens
timesteps: torch.Tensor # Shape: (B, seq_len) - noise level per token
positions: torch.Tensor # Shape: (B, dims, seq_len, 2) - spatial/temporal coordinates
context: torch.Tensor # Text embeddings
context_mask: torch.Tensor | None
```
**Field descriptions:**
| Field | Description |
|-------|-------------|
| `enabled` | Set to `False` to skip processing this modality |
| `latent` | Sequence of tokens in patchified format (not spatial `[B,C,F,H,W]`) |
| `timesteps` | Per-token noise levels (sigma values). Enables token-level conditioning |
| `positions` | Coordinates for RoPE (Rotary Position Embeddings). Video: `[B, 3, seq, 2]`, Audio: `[B, 1, seq, 2]` |
| `context` | Text prompt embeddings from the Gemma encoder |
| `context_mask` | Optional attention mask for the context |
### Patchifiers - Format Conversion
Patchifiers convert between spatial format and sequence format:
```python
from ltx_core.pipeline.components.patchifiers import (
VideoLatentPatchifier,
AudioPatchifier,
VideoLatentShape,
AudioLatentShape,
)
# Video patchification
video_patchifier = VideoLatentPatchifier(patch_size=1)
# Spatial to sequence: [B, C, F, H, W] → [B, F*H*W, C]
patchified = video_patchifier.patchify(video_latent)
# Sequence to spatial: [B, seq_len, C] → [B, C, F, H, W]
spatial = video_patchifier.unpatchify(
patchified,
output_shape=VideoLatentShape(
batch=1, channels=128, frames=7, height=16, width=24
)
)
# Audio patchification
audio_patchifier = AudioPatchifier(patch_size=1)
# [B, C, T, mel_bins] → [B, T, C*mel_bins]
patchified_audio = audio_patchifier.patchify(audio_latent)
```
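For intuition about sequence lengths: with `patch_size=1`, every latent voxel becomes one token. Below is a worked example, assuming the compression factors implied elsewhere in this guide (8× temporal with one extra causal frame, 32× spatial):
```python
# Worked example (assumed factors: 8x temporal + 1 causal frame, 32x spatial)
num_frames, height, width = 49, 512, 768
latent_frames = (num_frames - 1) // 8 + 1  # 7
latent_height = height // 32               # 16
latent_width = width // 32                 # 24
seq_len = latent_frames * latent_height * latent_width
print(seq_len)  # 2688 tokens -> patchified video latent is [B, 2688, 128]
```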
### Latent Tools - Preparing Inputs
Latent tools handle the setup of initial latents, masks, and positions. Combined with conditioning items, they provide flexible input preparation:
```python
from ltx_core.pipeline.conditioning.tools import (
VideoLatentTools,
AudioLatentTools,
LatentState,
)
from ltx_core.pipeline.components.patchifiers import VideoLatentShape, AudioLatentShape
from ltx_core.pipeline.components.protocols import VideoPixelShape
# Create video latent tools
pixel_shape = VideoPixelShape(
batch=1,
frames=49, # Must be k*8 + 1 (e.g., 49, 97, 121)
height=512,
width=768,
fps=25.0,
)
video_tools = VideoLatentTools(
patchifier=video_patchifier,
target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
fps=25.0,
)
# Create an empty latent state (zeros with positions computed)
video_state = video_tools.create_initial_state(device=device, dtype=torch.bfloat16)
# video_state.latent: [B, seq_len, 128] - zeros (will be replaced with noise)
# video_state.denoise_mask: [B, seq_len, 1] - ones (all tokens to denoise)
# video_state.positions: [B, 3, seq_len, 2] - pixel coordinates for RoPE
# Audio latent tools (similar pattern)
audio_tools = AudioLatentTools(
patchifier=audio_patchifier,
target_shape=AudioLatentShape.from_duration(
batch=1,
duration=2.0, # seconds
channels=8,
mel_bins=16,
),
)
audio_state = audio_tools.create_initial_state(device, dtype)
```
### Conditioning Items - Adding Constraints
Conditioning items modify latent states to add constraints like first-frame conditioning:
```python
from ltx_core.pipeline.conditioning.types.latent_cond import VideoConditionByLatentIndex
from ltx_core.pipeline.conditioning.types.keyframe_cond import VideoConditionByKeyframeIndex
# Option 1: Condition by latent index (replaces tokens in-place)
first_frame_cond = VideoConditionByLatentIndex(
latent=encoded_image, # VAE-encoded image [B, C, 1, H, W]
strength=1.0, # 1.0 = fully conditioned, 0.0 = fully denoised
latent_idx=0, # Which latent frame to condition
)
video_state = first_frame_cond.apply_to(video_state, video_tools)
# Option 2: Condition by keyframe (appends conditioning tokens)
keyframe_cond = VideoConditionByKeyframeIndex(
keyframes=encoded_image, # VAE-encoded keyframe(s)
frame_idx=0, # Target frame index
strength=1.0,
)
video_state = keyframe_cond.apply_to(video_state, video_tools)
```
**Key concepts:**
- `LatentState` is a frozen dataclass containing `latent`, `denoise_mask`, and `positions`
- `denoise_mask` values: `1.0` = denoise this token, `0.0` = keep this token fixed
- Conditioning items return a new `LatentState` (immutable pattern)
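During sampling, the mask is multiplied into the current sigma to produce per-token timesteps, which is exactly what the usage patterns below do. A minimal sketch of that relationship:
```python
import torch

# Tokens 0-2 are conditioned (mask = 0.0), tokens 3-5 are denoised (mask = 1.0)
denoise_mask = torch.tensor([[[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]]])  # [B, seq, 1]
sigma = 0.8  # current noise level
timesteps = sigma * denoise_mask
# Conditioned tokens end up with timestep 0.0 (treated as clean signal);
# all other tokens carry the current sigma.
```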
### Perturbations - Fine-Grained Control
Perturbations allow you to selectively skip operations at the per-sample, per-block level:
```python
from ltx_core.guidance.perturbations import (
Perturbation,
PerturbationType,
PerturbationConfig,
BatchedPerturbationConfig,
)
# Available perturbation types
PerturbationType.SKIP_A2V_CROSS_ATTN   # Skip audio→video cross attention
PerturbationType.SKIP_V2A_CROSS_ATTN   # Skip video→audio cross attention
PerturbationType.SKIP_VIDEO_SELF_ATTN  # Skip video self attention
PerturbationType.SKIP_AUDIO_SELF_ATTN  # Skip audio self attention
# Example: Skip audio→video attention in specific blocks
perturbation = Perturbation(
type=PerturbationType.SKIP_A2V_CROSS_ATTN,
blocks=[0, 1, 2, 3], # Skip in blocks 0-3, or None for all blocks
)
config = PerturbationConfig(perturbations=[perturbation])
# For batched inputs
batched_config = BatchedPerturbationConfig([config, config]) # batch_size=2
# Or use empty config for normal operation
batched_config = BatchedPerturbationConfig.empty(batch_size=2)
```
**Use cases for perturbations:**
- **STG (Spatio-Temporal Guidance)**: Skip self-attention in block 29 to improve video quality
- Ablation studies (disable specific attention paths)
- Custom guidance strategies
- Debugging model behavior
**STG (Spatio-Temporal Guidance) Example:**
STG uses perturbations to improve video generation quality by running an additional forward pass with self-attention skipped:
```python
from ltx_core.guidance.perturbations import (
Perturbation, PerturbationType, PerturbationConfig, BatchedPerturbationConfig
)
from ltx_core.pipeline.components.guiders import STGGuider
# Create STG perturbation config (recommended: block 29)
stg_perturbation = Perturbation(
type=PerturbationType.SKIP_VIDEO_SELF_ATTN,
blocks=[29], # Recommended: single block 29
)
stg_config = BatchedPerturbationConfig([PerturbationConfig([stg_perturbation])])
# In your denoising loop:
stg_guider = STGGuider(scale=1.0) # Recommended scale
# Normal forward pass
pos_video, pos_audio = model(video=video, audio=audio, perturbations=None)
# Perturbed forward pass (for STG)
perturbed_video, perturbed_audio = model(video=video, audio=audio, perturbations=stg_config)
# Apply STG guidance
denoised_video = pos_video + stg_guider.delta(pos_video, perturbed_video)
```
---
## Model Architecture
The LTX-2 transformer consists of 48 blocks, each with the following structure:
```
┌──────────────────────────────────────────────────────────────────┐
│  VIDEO STREAM                        AUDIO STREAM                │
│  ────────────                        ────────────                │
│                                                                  │
│  1. Video Self-Attention             1. Audio Self-Attention     │
│     (attends to all video)              (attends to all audio)   │
│                                                                  │
│  2. Video Cross-Attention            2. Audio Cross-Attention    │
│     (attends to text prompt)            (attends to text prompt) │
│                                                                  │
│        ┌─────────────────────────────────────┐                   │
│        │   3. AUDIO-VIDEO CROSS ATTENTION    │                   │
│        │                                     │                   │
│        │   • Audio-to-Video (A→V):           │                   │
│        │     Video queries, Audio keys/vals  │                   │
│        │                                     │                   │
│        │   • Video-to-Audio (V→A):           │                   │
│        │     Audio queries, Video keys/vals  │                   │
│        └─────────────────────────────────────┘                   │
│                                                                  │
│  4. Video Feed-Forward               4. Audio Feed-Forward       │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```
**Key insight**: Video and audio "talk" to each other through bidirectional cross-attention in every block, enabling synchronized audio-video generation.
### Forward Pass
```python
from ltx_core.model.transformer.model import LTXModel
# The transformer takes both modalities and returns predictions for both
video_velocity, audio_velocity = model(
video=video_modality,
audio=audio_modality,
perturbations=None, # or BatchedPerturbationConfig
)
# Returns velocity predictions used in the Euler diffusion step
```
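`EulerDiffusionStep.step` consumes these velocities. For intuition, here is a sketch of a standard rectified-flow Euler update; this is an assumption about what the stepper does internally, not its actual implementation:
```python
def euler_step_sketch(latent, velocity, sigmas, step_idx):
    # Hypothetical illustration only; the real EulerDiffusionStep may differ
    # (per-token sigmas, dtype handling, etc.).
    dt = sigmas[step_idx + 1] - sigmas[step_idx]  # negative, since sigma decreases
    return latent + dt * velocity
```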
---
## Usage Patterns
### Text-to-Video Generation
Basic text-to-video generation flow:
```python
from dataclasses import replace
from ltx_core.pipeline.components.schedulers import LTX2Scheduler
from ltx_core.pipeline.components.diffusion_steps import EulerDiffusionStep
from ltx_core.pipeline.components.guiders import CFGGuider
from ltx_core.pipeline.conditioning.tools import VideoLatentTools
from ltx_core.pipeline.components.patchifiers import VideoLatentShape
# 1. Encode text prompt
video_context, audio_context, mask = text_encoder(prompt)
# 2. Create video latent tools and initial state
pixel_shape = VideoPixelShape(batch=1, frames=49, height=512, width=768, fps=25.0)
video_tools = VideoLatentTools(
patchifier=video_patchifier,
target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
fps=25.0,
)
video_state = video_tools.create_initial_state(device, dtype)
# 3. Add noise to the latent
noise = torch.randn_like(video_state.latent)
noised_latent = noise # Start from pure noise
# 4. Create video modality
video = Modality(
enabled=True,
latent=noised_latent,
timesteps=video_state.denoise_mask, # Will be updated each step
positions=video_state.positions,
context=video_context,
context_mask=None,
)
# 5. Setup scheduler and diffusion components
scheduler = LTX2Scheduler()
sigmas = scheduler.execute(steps=30).to(device)
stepper = EulerDiffusionStep()
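# `disabled_audio` used below is an audio Modality with enabled=False, so the
# audio stream is skipped. A hypothetical construction (the exact placeholder
# values a disabled modality accepts may differ); audio_tools comes from the
# Latent Tools section above:
audio_state = audio_tools.create_initial_state(device, dtype)
disabled_audio = Modality(
    enabled=False,
    latent=audio_state.latent,
    timesteps=audio_state.denoise_mask,
    positions=audio_state.positions,
    context=audio_context,
    context_mask=None,
)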
# 6. Denoising loop
for step_idx, sigma in enumerate(sigmas[:-1]):
# Update timesteps with current sigma (use replace for immutable Modality)
video = replace(video, timesteps=sigma * video_state.denoise_mask)
# Forward pass
video_vel, _ = model(video=video, audio=disabled_audio, perturbations=None)
# Euler step
new_latent = stepper.step(video.latent, video_vel, sigmas, step_idx)
video = replace(video, latent=new_latent)
# 7. Decode to pixels
video_spatial = video_tools.unpatchify(
replace(video_state, latent=video.latent)
).latent # [B, C, F, H, W]
video_pixels = vae_decoder(video_spatial) # [B, 3, F, H, W]
```
### Image-to-Video Generation
Condition the first frame with an image:
```python
from ltx_core.pipeline.conditioning.types.latent_cond import VideoConditionByLatentIndex
# Encode the conditioning image
image_latent = vae_encoder(image) # [B, C, 1, H, W]
# Create video tools and initial state
pixel_shape = VideoPixelShape(batch=1, frames=49, height=512, width=768, fps=25.0)
video_tools = VideoLatentTools(
patchifier=video_patchifier,
target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
fps=25.0,
)
video_state = video_tools.create_initial_state(device, dtype)
# Apply first-frame conditioning
first_frame_cond = VideoConditionByLatentIndex(
latent=image_latent,
strength=1.0, # 1.0 = fully conditioned (no denoising on first frame)
latent_idx=0, # Condition frame 0
)
video_state = first_frame_cond.apply_to(video_state, video_tools)
# The denoise_mask will be 0.0 for first-frame tokens, 1.0 for the rest
# Proceed with denoising as usual...
```
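The denoising loop itself is the same as for text-to-video, except that the mask keeps the first-frame tokens clean. Below is a sketch of one way to honor the mask explicitly (an assumption; the pipeline's own components may already handle this), reusing `video_context`, `disabled_audio`, `sigmas`, and `stepper` from the text-to-video example:
```python
from dataclasses import replace

mask = video_state.denoise_mask   # 0.0 on the first-frame tokens, 1.0 elsewhere
clean = video_state.latent        # holds the encoded first frame

# Mix noise into the tokens that will be denoised; keep conditioned tokens clean
video = Modality(
    enabled=True,
    latent=mask * torch.randn_like(clean) + (1 - mask) * clean,
    timesteps=mask,               # placeholder; replaced each step
    positions=video_state.positions,
    context=video_context,
    context_mask=None,
)
for step_idx, sigma in enumerate(sigmas[:-1]):
    video = replace(video, timesteps=sigma * mask)  # conditioned tokens stay at timestep 0
    video_vel, _ = model(video=video, audio=disabled_audio, perturbations=None)
    stepped = stepper.step(video.latent, video_vel, sigmas, step_idx)
    # Re-apply the clean conditioned tokens so they are never overwritten
    video = replace(video, latent=mask * stepped + (1 - mask) * clean)
```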
### Video-to-Video (IC-LoRA)
IC-LoRA enables video-to-video transformation by conditioning on a reference video. The key insight is that reference tokens are included in the sequence but kept at timestep=0 (clean, no denoising).
```python
from dataclasses import replace
from ltx_core.pipeline.conditioning.tools import VideoLatentTools
from ltx_core.pipeline.components.patchifiers import VideoLatentShape
from ltx_core.pipeline.components.protocols import VideoPixelShape
# 1. Create video tools for target
pixel_shape = VideoPixelShape(batch=1, frames=49, height=512, width=768, fps=25.0)
video_tools = VideoLatentTools(
patchifier=video_patchifier,
target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
fps=25.0,
)
# 2. Encode reference video to latents and patchify
ref_latents = vae_encoder(reference_video) # [B, C, F, H, W]
patchified_ref = video_patchifier.patchify(ref_latents) # [B, ref_seq_len, C]
ref_seq_len = patchified_ref.shape[1]
# 3. Create target video state (positions computed automatically)
target_state = video_tools.create_initial_state(device, dtype)
# 4. Compute positions for reference (SAME grid as target!)
# Reference positions are identical to target - this tells the model they correspond
ref_positions = target_state.positions.clone()
# 5. CONCATENATE reference + target
combined_latent = torch.cat([patchified_ref, torch.randn_like(target_state.latent)], dim=1)
combined_positions = torch.cat([ref_positions, target_state.positions], dim=2)
# 6. Create denoise mask: 0 for reference (keep clean), 1 for target (denoise)
ref_denoise_mask = torch.zeros(1, ref_seq_len, 1, device=device)
combined_denoise_mask = torch.cat([ref_denoise_mask, target_state.denoise_mask], dim=1)
# 7. Create modality with combined inputs
video = Modality(
enabled=True,
latent=combined_latent,
timesteps=combined_denoise_mask, # Will be updated with sigma
positions=combined_positions,
context=video_context,
context_mask=None,
)
# 8. Denoising loop - only update target portion
for step_idx, sigma in enumerate(sigmas[:-1]):
# Timesteps: 0 for reference, sigma for target
ref_timesteps = torch.zeros(1, ref_seq_len, 1, device=device)
target_timesteps = sigma * target_state.denoise_mask
new_timesteps = torch.cat([ref_timesteps, target_timesteps], dim=1)
video = replace(video, timesteps=new_timesteps)
# Forward pass
video_vel, _ = model(video=video, audio=audio, perturbations=None)
# Euler step - ONLY update target portion
target_latent = video.latent[:, ref_seq_len:]
target_vel = video_vel[:, ref_seq_len:]
updated_target = stepper.step(target_latent, target_vel, sigmas, step_idx)
# Reconstruct (reference stays fixed)
new_latent = torch.cat([patchified_ref, updated_target], dim=1)
video = replace(video, latent=new_latent)
# 9. Extract and decode only the target portion
final_target = video.latent[:, ref_seq_len:]
target_state_with_output = replace(target_state, latent=final_target)
target_spatial = video_tools.unpatchify(target_state_with_output).latent
video_pixels = vae_decoder(target_spatial)
```
**Why this works:**
- Self-attention sees both reference and target tokens
- Reference tokens have `timestep=0` (clean signal) - model learns to "copy" from them
- Shared positions tell the model "frame N of reference = frame N of target"
- Only target portion is updated during denoising
### Audio-Video Generation
Generate synchronized audio and video:
```python
from dataclasses import replace
from ltx_core.pipeline.conditioning.tools import VideoLatentTools, AudioLatentTools
from ltx_core.pipeline.components.patchifiers import VideoLatentShape, AudioLatentShape
from ltx_core.pipeline.components.protocols import VideoPixelShape
# Create latent tools for both modalities
pixel_shape = VideoPixelShape(batch=1, frames=49, height=512, width=768, fps=25.0)
video_tools = VideoLatentTools(
patchifier=video_patchifier,
target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
fps=25.0,
)
audio_tools = AudioLatentTools(
patchifier=audio_patchifier,
target_shape=AudioLatentShape.from_duration(batch=1, duration=2.0, channels=8, mel_bins=16),
)
# Create initial states
video_state = video_tools.create_initial_state(device, dtype)
audio_state = audio_tools.create_initial_state(device, dtype)
# Encode text (returns separate embeddings for each modality)
video_context, audio_context, mask = text_encoder(prompt)
# Create both modalities with noise
video = Modality(
enabled=True,
latent=torch.randn_like(video_state.latent),
timesteps=video_state.denoise_mask,
positions=video_state.positions,
context=video_context,
context_mask=None,
)
audio = Modality(
enabled=True,
latent=torch.randn_like(audio_state.latent),
timesteps=audio_state.denoise_mask,
positions=audio_state.positions,
context=audio_context,
context_mask=None,
)
# Denoising loop - update both (use replace for immutable Modality)
for step_idx, sigma in enumerate(sigmas[:-1]):
video = replace(video, timesteps=sigma * video_state.denoise_mask)
audio = replace(audio, timesteps=sigma * audio_state.denoise_mask)
# Forward pass returns both predictions
video_vel, audio_vel = model(video=video, audio=audio, perturbations=None)
# Update both latents
video = replace(video, latent=stepper.step(video.latent, video_vel, sigmas, step_idx))
audio = replace(audio, latent=stepper.step(audio.latent, audio_vel, sigmas, step_idx))
# Decode both
video_spatial = video_tools.unpatchify(replace(video_state, latent=video.latent)).latent
video_pixels = vae_decoder(video_spatial)
audio_spatial = audio_tools.unpatchify(replace(audio_state, latent=audio.latent)).latent
audio_mel = audio_decoder(audio_spatial)
audio_waveform = vocoder(audio_mel)
```
---
## Common Pitfalls
### 1. Frame Count Constraints
Video frame count must satisfy `num_frames % 8 == 1`:
- ✅ Valid: 49, 97, 121, 145
- ❌ Invalid: 48, 50, 100
```python
# The "+1" accounts for causal padding in the VAE
latent_frames = (num_frames - 1) // 8 + 1
```
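If your clip has an arbitrary length, snap it to a valid count first. A small hypothetical helper (not part of `ltx_core`):
```python
def nearest_valid_frame_count(num_frames: int) -> int:
    """Round down to the closest count satisfying num_frames % 8 == 1."""
    return max(1, ((num_frames - 1) // 8) * 8 + 1)

print(nearest_valid_frame_count(50))   # 49
print(nearest_valid_frame_count(100))  # 97
```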
### 2. Resolution Constraints
Height and width must be divisible by 32:
- ✅ Valid: 512×768, 768×1024
- ❌ Invalid: 500×750
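Arbitrary resolutions can likewise be snapped down to the nearest multiple of 32 (hypothetical helper, not part of `ltx_core`):
```python
def nearest_valid_resolution(height: int, width: int) -> tuple[int, int]:
    """Round both dimensions down to the nearest multiple of 32."""
    return (height // 32) * 32, (width // 32) * 32

print(nearest_valid_resolution(500, 750))  # (480, 736)
```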
### 3. Position Tensor Shapes
Different modalities have different position tensor shapes:
- Video: `[B, 3, seq_len, 2]` - 3 dimensions for (time, height, width)
- Audio: `[B, 1, seq_len, 2]` - 1 dimension for time only
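A quick sanity check before the forward pass catches mixed-up position tensors early (assertions based on the shapes above, assuming `video` and `audio` are the `Modality` instances you constructed):
```python
assert video.positions.dim() == 4 and video.positions.shape[1] == 3, "video positions must be [B, 3, seq, 2]"
assert audio.positions.dim() == 4 and audio.positions.shape[1] == 1, "audio positions must be [B, 1, seq, 2]"
```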
### 4. Separate Context Embeddings
Video and audio modalities receive **different** context embeddings from the text encoder:
```python
# The text encoder returns separate embeddings
video_context, audio_context, mask = text_encoder(prompt)
# Use the appropriate one for each modality
video = Modality(context=video_context, ...) # NOT audio_context!
audio = Modality(context=audio_context, ...) # NOT video_context!
```
### 5. Immutable Modality
The `Modality` dataclass is **frozen** (immutable). Use `dataclasses.replace()` to create modified copies:
```python
from dataclasses import replace
# ❌ Wrong - will raise an error
video.latent = new_latent
# ✅ Correct - create a new Modality with updated field
video = replace(video, latent=new_latent)
# ✅ Update multiple fields at once
video = replace(video, latent=new_latent, timesteps=new_timesteps)
```
---
## Additional Resources
- [Training Guide](./training-guide.md) - How to fine-tune LTX-2 models
- [Configuration Reference](./configuration-reference.md) - All configuration options
- [Training Modes](./training-modes.md) - LoRA, audio-video, and IC-LoRA training