# LTX-Core Model API Guide

This guide explains the core concepts and APIs used in the LTX-2 Audio-Video diffusion model. Understanding these concepts is essential for training, fine-tuning, and running inference with LTX models.

## Table of Contents

1. [Overview](#overview)
2. [Core Concepts](#core-concepts)
   - [Modality](#modality---the-input-container)
   - [Patchifiers](#patchifiers---format-conversion)
   - [Latent Tools](#latent-tools---preparing-inputs)
   - [Conditioning Items](#conditioning-items---adding-constraints)
   - [Perturbations](#perturbations---fine-grained-control)
3. [Model Architecture](#model-architecture)
4. [Usage Patterns](#usage-patterns)
   - [Text-to-Video Generation](#text-to-video-generation)
   - [Image-to-Video Generation](#image-to-video-generation)
   - [Video-to-Video (IC-LoRA)](#video-to-video-ic-lora)
   - [Audio-Video Generation](#audio-video-generation)
5. [Common Pitfalls](#common-pitfalls)

---

## Overview

The LTX-2 model is a **joint Audio-Video diffusion transformer**. Unlike traditional models that handle one modality at a time, LTX-2 processes **video and audio simultaneously** in a unified architecture, enabling cross-modal attention between them.

Key characteristics:
- **Dual-stream architecture**: Separate processing paths for video and audio that interact via cross-attention
- **Per-token timesteps**: Different tokens can carry different noise levels, which enables advanced conditioning (see the sketch below)
- **Flexible conditioning**: Supports text, image, and video conditioning
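
The per-token timestep design is what most of the conditioning patterns in this guide build on. Here is a minimal sketch of the idea using plain tensors (illustrative only, not real model inputs):

```python
import torch

batch, seq_len = 1, 8

# One noise level per token: conditioning tokens stay clean (sigma = 0),
# while tokens being generated carry the current denoising sigma.
sigma = 0.7
timesteps = torch.full((batch, seq_len), sigma)
timesteps[:, :2] = 0.0  # e.g. the first two tokens are fixed conditioning

print(timesteps)  # tensor([[0.0000, 0.0000, 0.7000, ..., 0.7000]])
```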

---

## Core Concepts

### Modality - The Input Container

The `Modality` dataclass wraps all information needed to process either video or audio:

```python
from ltx_core.model.transformer.modality import Modality

@dataclass
class Modality:
    enabled: bool           # Whether this modality should be processed
    latent: torch.Tensor    # Shape: (B, seq_len, D) - patchified tokens
    timesteps: torch.Tensor # Shape: (B, seq_len) - noise level per token
    positions: torch.Tensor # Shape: (B, dims, seq_len, 2) - spatial/temporal coordinates
    context: torch.Tensor   # Text embeddings
    context_mask: torch.Tensor | None
```

**Field descriptions:**

| Field | Description |
|-------|-------------|
| `enabled` | Set to `False` to skip processing this modality |
| `latent` | Sequence of tokens in patchified format (not spatial `[B,C,F,H,W]`) |
| `timesteps` | Per-token noise levels (sigma values). Enables token-level conditioning |
| `positions` | Coordinates for RoPE (Rotary Position Embeddings). Video: `[B, 3, seq, 2]`, Audio: `[B, 1, seq, 2]` |
| `context` | Text prompt embeddings from the Gemma encoder |
| `context_mask` | Optional attention mask for the context |
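
Putting the fields together, a video `Modality` can be assembled as follows. This is an illustrative sketch: the sequence length, context width, and position values are placeholders, not values produced by the real encoders.

```python
import torch

from ltx_core.model.transformer.modality import Modality

batch, seq_len, dim = 1, 2688, 128

video = Modality(
    enabled=True,
    latent=torch.randn(batch, seq_len, dim),      # patchified tokens, (B, seq_len, D)
    timesteps=torch.full((batch, seq_len), 0.7),  # one sigma per token
    positions=torch.zeros(batch, 3, seq_len, 2),  # RoPE coordinates (video: 3 dims)
    context=torch.randn(batch, 128, 3840),        # placeholder for Gemma text embeddings
    context_mask=None,
)
```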

### Patchifiers - Format Conversion

Patchifiers convert between spatial format and sequence format:

```python
from ltx_core.pipeline.components.patchifiers import (
    VideoLatentPatchifier,
    AudioPatchifier,
    VideoLatentShape,
    AudioLatentShape,
)

# Video patchification
video_patchifier = VideoLatentPatchifier(patch_size=1)

# Spatial to sequence: [B, C, F, H, W] → [B, F*H*W, C]
patchified = video_patchifier.patchify(video_latent)

# Sequence to spatial: [B, seq_len, C] → [B, C, F, H, W]
spatial = video_patchifier.unpatchify(
    patchified,
    output_shape=VideoLatentShape(
        batch=1, channels=128, frames=7, height=16, width=24
    )
)

# Audio patchification
audio_patchifier = AudioPatchifier(patch_size=1)

# [B, C, T, mel_bins] → [B, T, C*mel_bins]
patchified_audio = audio_patchifier.patchify(audio_latent)
```

### Latent Tools - Preparing Inputs

Latent tools handle the setup of initial latents, masks, and positions. Combined with conditioning items, they provide flexible input preparation:

```python
from ltx_core.pipeline.conditioning.tools import (
    VideoLatentTools,
    AudioLatentTools,
    LatentState,
)
from ltx_core.pipeline.components.patchifiers import VideoLatentShape, AudioLatentShape
from ltx_core.pipeline.components.protocols import VideoPixelShape

# Create video latent tools
pixel_shape = VideoPixelShape(
    batch=1,
    frames=49,   # Must be k*8 + 1 (e.g., 49, 97, 121)
    height=512,
    width=768,
    fps=25.0,
)
video_tools = VideoLatentTools(
    patchifier=video_patchifier,
    target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
    fps=25.0,
)

# Create an empty latent state (zeros with positions computed)
video_state = video_tools.create_initial_state(device=device, dtype=torch.bfloat16)
# video_state.latent: [B, seq_len, 128] - zeros (will be replaced with noise)
# video_state.denoise_mask: [B, seq_len, 1] - ones (all tokens to denoise)
# video_state.positions: [B, 3, seq_len, 2] - pixel coordinates for RoPE

# Audio latent tools (similar pattern)
audio_tools = AudioLatentTools(
    patchifier=audio_patchifier,
    target_shape=AudioLatentShape.from_duration(
        batch=1,
        duration=2.0,  # seconds
        channels=8,
        mel_bins=16,
    ),
)
audio_state = audio_tools.create_initial_state(device, dtype)
```

### Conditioning Items - Adding Constraints

Conditioning items modify latent states to add constraints like first-frame conditioning:

```python
from ltx_core.pipeline.conditioning.types.latent_cond import VideoConditionByLatentIndex
from ltx_core.pipeline.conditioning.types.keyframe_cond import VideoConditionByKeyframeIndex

# Option 1: Condition by latent index (replaces tokens in-place)
first_frame_cond = VideoConditionByLatentIndex(
    latent=encoded_image,  # VAE-encoded image [B, C, 1, H, W]
    strength=1.0,          # 1.0 = fully conditioned, 0.0 = fully denoised
    latent_idx=0,          # Which latent frame to condition
)
video_state = first_frame_cond.apply_to(video_state, video_tools)

# Option 2: Condition by keyframe (appends conditioning tokens)
keyframe_cond = VideoConditionByKeyframeIndex(
    keyframes=encoded_image,  # VAE-encoded keyframe(s)
    frame_idx=0,              # Target frame index
    strength=1.0,
)
video_state = keyframe_cond.apply_to(video_state, video_tools)
```

**Key concepts:**
- `LatentState` is a frozen dataclass containing `latent`, `denoise_mask`, and `positions`
- `denoise_mask` values: `1.0` = denoise this token, `0.0` = keep this token fixed
- Conditioning items return a new `LatentState` (immutable pattern)
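
Continuing the snippet above, the immutable pattern looks like this in practice (a sketch assuming `strength=1.0`, so conditioned tokens end up with `denoise_mask = 0.0`):

```python
# apply_to returns a NEW LatentState; the input state is untouched.
conditioned = first_frame_cond.apply_to(video_state, video_tools)

assert float(video_state.denoise_mask.min()) == 1.0  # original: all tokens denoised
assert float(conditioned.denoise_mask.min()) == 0.0  # new state: first-frame tokens fixed
```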

### Perturbations - Fine-Grained Control

Perturbations allow you to selectively skip operations at the per-sample, per-block level:

```python
from ltx_core.guidance.perturbations import (
    Perturbation,
    PerturbationType,
    PerturbationConfig,
    BatchedPerturbationConfig,
)

# Available perturbation types
PerturbationType.SKIP_A2V_CROSS_ATTN  # Skip audio→video cross attention
PerturbationType.SKIP_V2A_CROSS_ATTN  # Skip video→audio cross attention
PerturbationType.SKIP_VIDEO_SELF_ATTN # Skip video self attention
PerturbationType.SKIP_AUDIO_SELF_ATTN # Skip audio self attention

# Example: Skip audio→video attention in specific blocks
perturbation = Perturbation(
    type=PerturbationType.SKIP_A2V_CROSS_ATTN,
    blocks=[0, 1, 2, 3],  # Skip in blocks 0-3, or None for all blocks
)
config = PerturbationConfig(perturbations=[perturbation])

# For batched inputs
batched_config = BatchedPerturbationConfig([config, config])  # batch_size=2

# Or use empty config for normal operation
batched_config = BatchedPerturbationConfig.empty(batch_size=2)
```

**Use cases for perturbations:**
- **STG (Spatio-Temporal Guidance)**: Skip self-attention in block 29 to improve video quality
- Ablation studies (disable specific attention paths)
- Custom guidance strategies
- Debugging model behavior

**STG (Spatio-Temporal Guidance) Example:**

STG uses perturbations to improve video generation quality by running an additional forward pass with self-attention skipped:

```python
from ltx_core.guidance.perturbations import (
    Perturbation, PerturbationType, PerturbationConfig, BatchedPerturbationConfig
)
from ltx_core.pipeline.components.guiders import STGGuider

# Create STG perturbation config (recommended: block 29)
stg_perturbation = Perturbation(
    type=PerturbationType.SKIP_VIDEO_SELF_ATTN,
    blocks=[29],  # Recommended: single block 29
)
stg_config = BatchedPerturbationConfig([PerturbationConfig([stg_perturbation])])

# In your denoising loop:
stg_guider = STGGuider(scale=1.0)  # Recommended scale

# Normal forward pass
pos_video, pos_audio = model(video=video, audio=audio, perturbations=None)

# Perturbed forward pass (for STG)
perturbed_video, perturbed_audio = model(video=video, audio=audio, perturbations=stg_config)

# Apply STG guidance
denoised_video = pos_video + stg_guider.delta(pos_video, perturbed_video)
```

---

## Model Architecture

The LTX-2 transformer consists of 48 blocks, each with the following structure:

```
┌─────────────────────────────────────────────────────────────┐
│  VIDEO STREAM                    AUDIO STREAM               │
│  ────────────                    ────────────               │
│                                                             │
│  1. Video Self-Attention         1. Audio Self-Attention    │
│     (attends to all video)          (attends to all audio)  │
│                                                             │
│  2. Video Cross-Attention        2. Audio Cross-Attention   │
│     (attends to text prompt)        (attends to text prompt)│
│                                                             │
│           ╔═══════════════════════════════════╗             │
│           ║  3. AUDIO-VIDEO CROSS ATTENTION   ║             │
│           ║                                   ║             │
│           ║  • Audio-to-Video (A→V):          ║             │
│           ║    Video queries, Audio keys/vals ║             │
│           ║                                   ║             │
│           ║  • Video-to-Audio (V→A):          ║             │
│           ║    Audio queries, Video keys/vals ║             │
│           ╚═══════════════════════════════════╝             │
│                                                             │
│  4. Video Feed-Forward           4. Audio Feed-Forward      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

**Key insight**: Video and audio "talk" to each other through bidirectional cross-attention in every block, enabling synchronized audio-video generation.

### Forward Pass

```python
from ltx_core.model.transformer.model import LTXModel

# The transformer takes both modalities and returns predictions for both
video_velocity, audio_velocity = model(
    video=video_modality,
    audio=audio_modality,
    perturbations=None,  # or BatchedPerturbationConfig
)
# Returns velocity predictions used in the Euler diffusion step
```
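
For reference, the Euler update that consumes these velocity predictions reduces to the standard flow-matching step. A sketch of what `EulerDiffusionStep.step` is expected to compute (the real class may differ in details):

```python
import torch

def euler_step(latent: torch.Tensor, velocity: torch.Tensor,
               sigmas: torch.Tensor, step_idx: int) -> torch.Tensor:
    # Move the sample along the predicted velocity from sigma_t to sigma_{t+1}.
    dt = sigmas[step_idx + 1] - sigmas[step_idx]
    return latent + dt * velocity
```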

---

## Usage Patterns

### Text-to-Video Generation

Basic text-to-video generation flow:

```python
from dataclasses import replace
from ltx_core.pipeline.components.schedulers import LTX2Scheduler
from ltx_core.pipeline.components.diffusion_steps import EulerDiffusionStep
from ltx_core.pipeline.components.guiders import CFGGuider
from ltx_core.pipeline.conditioning.tools import VideoLatentTools
from ltx_core.pipeline.components.patchifiers import VideoLatentShape

# 1. Encode text prompt
video_context, audio_context, mask = text_encoder(prompt)

# 2. Create video latent tools and initial state
pixel_shape = VideoPixelShape(batch=1, frames=49, height=512, width=768, fps=25.0)
video_tools = VideoLatentTools(
    patchifier=video_patchifier,
    target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
    fps=25.0,
)
video_state = video_tools.create_initial_state(device, dtype)

# 3. Start from pure noise (the zero-initialized latent is fully replaced)
noised_latent = torch.randn_like(video_state.latent)

# 4. Create video modality
video = Modality(
    enabled=True,
    latent=noised_latent,
    timesteps=video_state.denoise_mask,  # Will be updated each step
    positions=video_state.positions,
    context=video_context,
    context_mask=None,
)

# 5. Setup scheduler and diffusion components
scheduler = LTX2Scheduler()
sigmas = scheduler.execute(steps=30).to(device)
stepper = EulerDiffusionStep()
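
# Video-only generation still passes an audio Modality, with enabled=False.
# (Sketch: the tensor fields of a disabled modality are empty placeholders
# here - check your ltx_core version for the exact convention.)
disabled_audio = Modality(
    enabled=False,
    latent=torch.zeros(1, 0, 1, device=device, dtype=dtype),
    timesteps=torch.zeros(1, 0, device=device),
    positions=torch.zeros(1, 1, 0, 2, device=device),
    context=audio_context,
    context_mask=None,
)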

# 6. Denoising loop
for step_idx, sigma in enumerate(sigmas[:-1]):
    # Update timesteps with current sigma (use replace for immutable Modality)
    video = replace(video, timesteps=sigma * video_state.denoise_mask)

    # Forward pass
    video_vel, _ = model(video=video, audio=disabled_audio, perturbations=None)

    # Euler step
    new_latent = stepper.step(video.latent, video_vel, sigmas, step_idx)
    video = replace(video, latent=new_latent)

# 7. Decode to pixels
video_spatial = video_tools.unpatchify(
    replace(video_state, latent=video.latent)
).latent  # [B, C, F, H, W]
video_pixels = vae_decoder(video_spatial)  # [B, 3, F, H, W]
```

### Image-to-Video Generation

Condition the first frame with an image:

```python
from ltx_core.pipeline.conditioning.types.latent_cond import VideoConditionByLatentIndex

# Encode the conditioning image
image_latent = vae_encoder(image)  # [B, C, 1, H, W]

# Create video tools and initial state
pixel_shape = VideoPixelShape(batch=1, frames=49, height=512, width=768, fps=25.0)
video_tools = VideoLatentTools(
    patchifier=video_patchifier,
    target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
    fps=25.0,
)
video_state = video_tools.create_initial_state(device, dtype)

# Apply first-frame conditioning
first_frame_cond = VideoConditionByLatentIndex(
    latent=image_latent,
    strength=1.0,   # 1.0 = fully conditioned (no denoising on first frame)
    latent_idx=0,   # Condition frame 0
)
video_state = first_frame_cond.apply_to(video_state, video_tools)
# The denoise_mask will be 0.0 for first-frame tokens, 1.0 for the rest

# Proceed with denoising as usual...
```

### Video-to-Video (IC-LoRA)

IC-LoRA enables video-to-video transformation by conditioning on a reference video. The key insight is that reference tokens are included in the sequence but kept at timestep=0 (clean, no denoising).

```python
from dataclasses import replace
from ltx_core.pipeline.conditioning.tools import VideoLatentTools
from ltx_core.pipeline.components.patchifiers import VideoLatentShape
from ltx_core.pipeline.components.protocols import VideoPixelShape

# 1. Create video tools for target
pixel_shape = VideoPixelShape(batch=1, frames=49, height=512, width=768, fps=25.0)
video_tools = VideoLatentTools(
    patchifier=video_patchifier,
    target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
    fps=25.0,
)

# 2. Encode reference video to latents and patchify
ref_latents = vae_encoder(reference_video)  # [B, C, F, H, W]
patchified_ref = video_patchifier.patchify(ref_latents)  # [B, ref_seq_len, C]
ref_seq_len = patchified_ref.shape[1]

# 3. Create target video state (positions computed automatically)
target_state = video_tools.create_initial_state(device, dtype)

# 4. Compute positions for reference (SAME grid as target!)
# Reference positions are identical to target - this tells the model they correspond
ref_positions = target_state.positions.clone()

# 5. CONCATENATE reference + target
combined_latent = torch.cat([patchified_ref, torch.randn_like(target_state.latent)], dim=1)
combined_positions = torch.cat([ref_positions, target_state.positions], dim=2)

# 6. Create denoise mask: 0 for reference (keep clean), 1 for target (denoise)
ref_denoise_mask = torch.zeros(1, ref_seq_len, 1, device=device)
combined_denoise_mask = torch.cat([ref_denoise_mask, target_state.denoise_mask], dim=1)

# 7. Create modality with combined inputs
video = Modality(
    enabled=True,
    latent=combined_latent,
    timesteps=combined_denoise_mask,  # Will be updated with sigma
    positions=combined_positions,
    context=video_context,
    context_mask=None,
)

# 8. Denoising loop - only update target portion
for step_idx, sigma in enumerate(sigmas[:-1]):
    # Timesteps: 0 for reference, sigma for target
    ref_timesteps = torch.zeros(1, ref_seq_len, 1, device=device)
    target_timesteps = sigma * target_state.denoise_mask
    new_timesteps = torch.cat([ref_timesteps, target_timesteps], dim=1)
    video = replace(video, timesteps=new_timesteps)

    # Forward pass
    video_vel, _ = model(video=video, audio=audio, perturbations=None)

    # Euler step - ONLY update target portion
    target_latent = video.latent[:, ref_seq_len:]
    target_vel = video_vel[:, ref_seq_len:]
    updated_target = stepper.step(target_latent, target_vel, sigmas, step_idx)

    # Reconstruct (reference stays fixed)
    new_latent = torch.cat([patchified_ref, updated_target], dim=1)
    video = replace(video, latent=new_latent)

# 9. Extract and decode only the target portion
final_target = video.latent[:, ref_seq_len:]
target_state_with_output = replace(target_state, latent=final_target)
target_spatial = video_tools.unpatchify(target_state_with_output).latent
video_pixels = vae_decoder(target_spatial)
```

**Why this works:**
- Self-attention sees both reference and target tokens
- Reference tokens have `timestep=0` (clean signal) - model learns to "copy" from them
- Shared positions tell the model "frame N of reference = frame N of target"
- Only target portion is updated during denoising

### Audio-Video Generation

Generate synchronized audio and video:

```python
from dataclasses import replace
from ltx_core.pipeline.conditioning.tools import VideoLatentTools, AudioLatentTools
from ltx_core.pipeline.components.patchifiers import VideoLatentShape, AudioLatentShape
from ltx_core.pipeline.components.protocols import VideoPixelShape

# Create latent tools for both modalities
pixel_shape = VideoPixelShape(batch=1, frames=49, height=512, width=768, fps=25.0)
video_tools = VideoLatentTools(
    patchifier=video_patchifier,
    target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
    fps=25.0,
)
audio_tools = AudioLatentTools(
    patchifier=audio_patchifier,
    target_shape=AudioLatentShape.from_duration(batch=1, duration=2.0, channels=8, mel_bins=16),
)

# Create initial states
video_state = video_tools.create_initial_state(device, dtype)
audio_state = audio_tools.create_initial_state(device, dtype)

# Encode text (returns separate embeddings for each modality)
video_context, audio_context, mask = text_encoder(prompt)

# Create both modalities with noise
video = Modality(
    enabled=True,
    latent=torch.randn_like(video_state.latent),
    timesteps=video_state.denoise_mask,
    positions=video_state.positions,
    context=video_context,
    context_mask=None,
)
audio = Modality(
    enabled=True,
    latent=torch.randn_like(audio_state.latent),
    timesteps=audio_state.denoise_mask,
    positions=audio_state.positions,
    context=audio_context,
    context_mask=None,
)

# Denoising loop - update both (use replace for immutable Modality)
for step_idx, sigma in enumerate(sigmas[:-1]):
    video = replace(video, timesteps=sigma * video_state.denoise_mask)
    audio = replace(audio, timesteps=sigma * audio_state.denoise_mask)

    # Forward pass returns both predictions
    video_vel, audio_vel = model(video=video, audio=audio, perturbations=None)

    # Update both latents
    video = replace(video, latent=stepper.step(video.latent, video_vel, sigmas, step_idx))
    audio = replace(audio, latent=stepper.step(audio.latent, audio_vel, sigmas, step_idx))

# Decode both
video_spatial = video_tools.unpatchify(replace(video_state, latent=video.latent)).latent
video_pixels = vae_decoder(video_spatial)
audio_spatial = audio_tools.unpatchify(replace(audio_state, latent=audio.latent)).latent
audio_mel = audio_decoder(audio_spatial)
audio_waveform = vocoder(audio_mel)
```

---

## Common Pitfalls

### 1. Frame Count Constraints

Video frame count must satisfy `num_frames % 8 == 1`:
- ✅ Valid: 49, 97, 121, 145
- ❌ Invalid: 48, 50, 100

```python
# The "+1" accounts for causal padding in the VAE
latent_frames = (num_frames - 1) // 8 + 1
```

### 2. Resolution Constraints

Height and width must be divisible by 32:
- ✅ Valid: 512×768, 768×1024
- ❌ Invalid: 500×750
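
A small up-front check catches both this constraint and the frame-count rule above (a hypothetical helper, not part of `ltx_core`):

```python
def validate_video_shape(num_frames: int, height: int, width: int) -> None:
    # Frame count must be k*8 + 1 for the causal VAE.
    if num_frames % 8 != 1:
        raise ValueError(f"num_frames must satisfy n % 8 == 1, got {num_frames}")
    # Height and width must be divisible by 32.
    for name, dim in (("height", height), ("width", width)):
        if dim % 32 != 0:
            raise ValueError(f"{name} must be divisible by 32, got {dim}")

validate_video_shape(49, 512, 768)    # OK
# validate_video_shape(48, 500, 750)  # raises ValueError
```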

### 3. Position Tensor Shapes

Different modalities have different position tensor shapes:
- Video: `[B, 3, seq_len, 2]` - 3 dimensions for (time, height, width)
- Audio: `[B, 1, seq_len, 2]` - 1 dimension for time only
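
When constructing modalities by hand (as in the IC-LoRA example above), a quick assertion avoids confusing RoPE shape errors downstream (a sketch):

```python
import torch

def check_positions(positions: torch.Tensor, is_video: bool) -> None:
    expected = 3 if is_video else 1  # (time, height, width) vs. time only
    assert positions.ndim == 4 and positions.shape[1] == expected, (
        f"expected [B, {expected}, seq_len, 2], got {tuple(positions.shape)}"
    )
```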

### 4. Separate Context Embeddings

Video and audio modalities receive **different** context embeddings from the text encoder:

```python
# The text encoder returns separate embeddings
video_context, audio_context, mask = text_encoder(prompt)

# Use the appropriate one for each modality
video = Modality(context=video_context, ...)  # NOT audio_context!
audio = Modality(context=audio_context, ...)  # NOT video_context!
```

### 5. Immutable Modality

The `Modality` dataclass is **frozen** (immutable). Use `dataclasses.replace()` to create modified copies:

```python
from dataclasses import replace

# ❌ Wrong - will raise an error
video.latent = new_latent

# ✅ Correct - create a new Modality with the updated field
video = replace(video, latent=new_latent)

# ✅ Update multiple fields at once
video = replace(video, latent=new_latent, timesteps=new_timesteps)
```

---

## Additional Resources

- [Training Guide](./training-guide.md) - How to fine-tune LTX-2 models
- [Configuration Reference](./configuration-reference.md) - All configuration options
- [Training Modes](./training-modes.md) - LoRA, audio-video, and IC-LoRA training