# LTX-Core Model API Guide

This guide explains the core concepts and APIs used in the LTX-2 Audio-Video diffusion model. Understanding these concepts is essential for training, fine-tuning, and running inference with LTX models.

## Table of Contents

1. [Overview](#overview)
2. [Core Concepts](#core-concepts)
   - [Modality](#modality---the-input-container)
   - [Patchifiers](#patchifiers---format-conversion)
   - [Latent Tools](#latent-tools---preparing-inputs)
   - [Conditioning Items](#conditioning-items---adding-constraints)
   - [Perturbations](#perturbations---fine-grained-control)
3. [Model Architecture](#model-architecture)
4. [Usage Patterns](#usage-patterns)
   - [Text-to-Video Generation](#text-to-video-generation)
   - [Image-to-Video Generation](#image-to-video-generation)
   - [Video-to-Video (IC-LoRA)](#video-to-video-ic-lora)
   - [Audio-Video Generation](#audio-video-generation)
5. [Common Pitfalls](#common-pitfalls)

---
## Overview

The LTX-2 model is a **joint Audio-Video diffusion transformer**. Unlike traditional models that handle one modality at a time, LTX-2 processes **video and audio simultaneously** in a unified architecture, enabling cross-modal attention between them.

Key characteristics:

- **Dual-stream architecture**: Separate processing paths for video and audio that interact via cross-attention
- **Per-token timesteps**: Different tokens can have different noise levels (enables advanced conditioning; see the sketch below)
- **Flexible conditioning**: Supports text, image, and video conditioning
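Per-token timesteps are central to how LTX-2 does conditioning, so a concrete picture helps. A minimal sketch of what a per-token timestep tensor looks like (the values are illustrative; the usage patterns later in this guide build this tensor as `sigma * denoise_mask`):

```python
import torch

# A batch with 8 tokens: the first 2 are conditioning tokens (kept clean at
# sigma = 0), the remaining 6 are denoised at the current sigma.
denoise_mask = torch.tensor([[0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]])  # (B, seq_len)
sigma = 0.7
timesteps = sigma * denoise_mask  # per-token noise levels; conditioned tokens stay at 0
```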
---

## Core Concepts

### Modality - The Input Container

The `Modality` dataclass wraps all information needed to process either video or audio:

```python
from dataclasses import dataclass

import torch

from ltx_core.model.transformer.modality import Modality


# The definition, for reference:
@dataclass
class Modality:
    enabled: bool                      # Whether this modality should be processed
    latent: torch.Tensor               # Shape: (B, seq_len, D) - patchified tokens
    timesteps: torch.Tensor            # Shape: (B, seq_len) - noise level per token
    positions: torch.Tensor            # Shape: (B, dims, seq_len, 2) - spatial/temporal coordinates
    context: torch.Tensor              # Text embeddings
    context_mask: torch.Tensor | None  # Optional attention mask for the context
```
**Field descriptions:**

| Field | Description |
|-------|-------------|
| `enabled` | Set to `False` to skip processing this modality |
| `latent` | Sequence of tokens in patchified format (not spatial `[B,C,F,H,W]`) |
| `timesteps` | Per-token noise levels (sigma values). Enables token-level conditioning |
| `positions` | Coordinates for RoPE (Rotary Position Embeddings). Video: `[B, 3, seq, 2]`, Audio: `[B, 1, seq, 2]` |
| `context` | Text prompt embeddings from the Gemma encoder |
| `context_mask` | Optional attention mask for the context |
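The text-to-video example later in this guide passes a `disabled_audio` modality so that only video is processed. A minimal sketch of such a placeholder, assuming the tensor fields are ignored when `enabled=False` (the tiny shapes here are illustrative, not an API requirement):

```python
import torch

# Hypothetical placeholder: skip audio entirely during a video-only forward pass.
disabled_audio = Modality(
    enabled=False,                  # model should skip this modality
    latent=torch.zeros(1, 1, 1),    # assumed ignored when enabled=False
    timesteps=torch.zeros(1, 1),
    positions=torch.zeros(1, 1, 1, 2),
    context=torch.zeros(1, 1, 1),
    context_mask=None,
)
```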
### Patchifiers - Format Conversion

Patchifiers convert between spatial format and sequence format:

```python
from ltx_core.pipeline.components.patchifiers import (
    VideoLatentPatchifier,
    AudioPatchifier,
    VideoLatentShape,
    AudioLatentShape,
)

# Video patchification
video_patchifier = VideoLatentPatchifier(patch_size=1)

# Spatial to sequence: [B, C, F, H, W] → [B, F*H*W, C]
patchified = video_patchifier.patchify(video_latent)

# Sequence to spatial: [B, seq_len, C] → [B, C, F, H, W]
spatial = video_patchifier.unpatchify(
    patchified,
    output_shape=VideoLatentShape(
        batch=1, channels=128, frames=7, height=16, width=24
    ),
)

# Audio patchification
audio_patchifier = AudioPatchifier(patch_size=1)

# [B, C, T, mel_bins] → [B, T, C*mel_bins]
patchified_audio = audio_patchifier.patchify(audio_latent)
```
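Since `patchify` and `unpatchify` are inverses, a quick shape round-trip is a handy sanity check. A short sketch using the same shapes as above (the random tensor stands in for real VAE output):

```python
import torch

# Dummy latent with the shapes used above: 128 channels, 7 latent frames, 16x24 grid
video_latent = torch.randn(1, 128, 7, 16, 24)

tokens = video_patchifier.patchify(video_latent)  # [1, 7*16*24, 128] = [1, 2688, 128]
restored = video_patchifier.unpatchify(
    tokens,
    output_shape=VideoLatentShape(batch=1, channels=128, frames=7, height=16, width=24),
)
assert restored.shape == video_latent.shape  # round-trip preserves the spatial shape
```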
### Latent Tools - Preparing Inputs

Latent tools handle the setup of initial latents, masks, and positions. Combined with conditioning items, they provide flexible input preparation:

```python
from ltx_core.pipeline.conditioning.tools import (
    VideoLatentTools,
    AudioLatentTools,
    LatentState,
)
from ltx_core.pipeline.components.patchifiers import VideoLatentShape, AudioLatentShape
from ltx_core.pipeline.components.protocols import VideoPixelShape

# Create video latent tools
pixel_shape = VideoPixelShape(
    batch=1,
    frames=49,  # Must be k*8 + 1 (e.g., 49, 97, 121)
    height=512,
    width=768,
    fps=25.0,
)
video_tools = VideoLatentTools(
    patchifier=video_patchifier,
    target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
    fps=25.0,
)

# Create an empty latent state (zeros with positions computed)
video_state = video_tools.create_initial_state(device=device, dtype=torch.bfloat16)
# video_state.latent:       [B, seq_len, 128]  - zeros (will be replaced with noise)
# video_state.denoise_mask: [B, seq_len, 1]    - ones (all tokens to denoise)
# video_state.positions:    [B, 3, seq_len, 2] - pixel coordinates for RoPE

# Audio latent tools (similar pattern)
audio_tools = AudioLatentTools(
    patchifier=audio_patchifier,
    target_shape=AudioLatentShape.from_duration(
        batch=1,
        duration=2.0,  # seconds
        channels=8,
        mel_bins=16,
    ),
)
audio_state = audio_tools.create_initial_state(device, dtype)
```
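For reference, the latent grid behind these states follows directly from the pixel shape: the VAE compresses 8× temporally (plus one causal frame) and 32× spatially, which matches the `VideoLatentShape(frames=7, height=16, width=24)` used in the patchifier example above. A sketch of the arithmetic (the 8× and 32× factors come from the constraints listed under Common Pitfalls):

```python
num_frames, height, width = 49, 512, 768

latent_frames = (num_frames - 1) // 8 + 1  # 7 (the "+1" is the causal frame)
latent_height = height // 32               # 16
latent_width = width // 32                 # 24

seq_len = latent_frames * latent_height * latent_width  # 2688 video tokens
```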
### Conditioning Items - Adding Constraints

Conditioning items modify latent states to add constraints like first-frame conditioning:

```python
from ltx_core.pipeline.conditioning.types.latent_cond import VideoConditionByLatentIndex
from ltx_core.pipeline.conditioning.types.keyframe_cond import VideoConditionByKeyframeIndex

# Option 1: Condition by latent index (replaces tokens in-place)
first_frame_cond = VideoConditionByLatentIndex(
    latent=encoded_image,  # VAE-encoded image [B, C, 1, H, W]
    strength=1.0,          # 1.0 = fully conditioned, 0.0 = fully denoised
    latent_idx=0,          # Which latent frame to condition
)
video_state = first_frame_cond.apply_to(video_state, video_tools)

# Option 2: Condition by keyframe (appends conditioning tokens)
keyframe_cond = VideoConditionByKeyframeIndex(
    keyframes=encoded_image,  # VAE-encoded keyframe(s)
    frame_idx=0,              # Target frame index
    strength=1.0,
)
video_state = keyframe_cond.apply_to(video_state, video_tools)
```

**Key concepts:**

- `LatentState` is a frozen dataclass containing `latent`, `denoise_mask`, and `positions`
- `denoise_mask` values: `1.0` = denoise this token, `0.0` = keep this token fixed
- Conditioning items return a new `LatentState` (immutable pattern); the mask-to-timestep relationship is sketched below
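The denoise mask is what becomes the per-token timesteps inside the sampling loop: at every step the current sigma is broadcast only onto tokens the mask marks as denoisable, exactly as the usage patterns below do. A minimal sketch:

```python
# After first-frame conditioning, denoise_mask is 0.0 on the conditioned tokens.
sigma = 0.7
timesteps = sigma * video_state.denoise_mask
# Conditioned tokens get timestep 0 (clean signal); all others get the current sigma.
```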
### Perturbations - Fine-Grained Control

Perturbations allow you to selectively skip operations at the per-sample, per-block level:

```python
from ltx_core.guidance.perturbations import (
    Perturbation,
    PerturbationType,
    PerturbationConfig,
    BatchedPerturbationConfig,
)

# Available perturbation types
PerturbationType.SKIP_A2V_CROSS_ATTN   # Skip audio→video cross attention
PerturbationType.SKIP_V2A_CROSS_ATTN   # Skip video→audio cross attention
PerturbationType.SKIP_VIDEO_SELF_ATTN  # Skip video self attention
PerturbationType.SKIP_AUDIO_SELF_ATTN  # Skip audio self attention

# Example: Skip audio→video attention in specific blocks
perturbation = Perturbation(
    type=PerturbationType.SKIP_A2V_CROSS_ATTN,
    blocks=[0, 1, 2, 3],  # Skip in blocks 0-3, or None for all blocks
)
config = PerturbationConfig(perturbations=[perturbation])

# For batched inputs
batched_config = BatchedPerturbationConfig([config, config])  # batch_size=2

# Or use an empty config for normal operation
batched_config = BatchedPerturbationConfig.empty(batch_size=2)
```

**Use cases for perturbations:**

- **STG (Spatio-Temporal Guidance)**: Skip self-attention in block 29 to improve video quality
- Ablation studies (disable specific attention paths)
- Custom guidance strategies
- Debugging model behavior
**STG (Spatio-Temporal Guidance) Example:**

STG uses perturbations to improve video generation quality by running an additional forward pass with self-attention skipped:

```python
from ltx_core.guidance.perturbations import (
    Perturbation, PerturbationType, PerturbationConfig, BatchedPerturbationConfig
)
from ltx_core.pipeline.components.guiders import STGGuider

# Create the STG perturbation config (recommended: block 29)
stg_perturbation = Perturbation(
    type=PerturbationType.SKIP_VIDEO_SELF_ATTN,
    blocks=[29],  # Recommended: single block 29
)
stg_config = BatchedPerturbationConfig([PerturbationConfig([stg_perturbation])])

# In your denoising loop:
stg_guider = STGGuider(scale=1.0)  # Recommended scale

# Normal forward pass
pos_video, pos_audio = model(video=video, audio=audio, perturbations=None)

# Perturbed forward pass (for STG)
perturbed_video, perturbed_audio = model(video=video, audio=audio, perturbations=stg_config)

# Apply STG guidance
denoised_video = pos_video + stg_guider.delta(pos_video, perturbed_video)
```
---

## Model Architecture

The LTX-2 transformer consists of 48 blocks, each with the following structure:

```
┌────────────────────────────────────────────────────────────────────┐
│  VIDEO STREAM                          AUDIO STREAM                │
│  ────────────                          ────────────                │
│                                                                    │
│  1. Video Self-Attention               1. Audio Self-Attention     │
│     (attends to all video)                (attends to all audio)   │
│                                                                    │
│  2. Video Cross-Attention              2. Audio Cross-Attention    │
│     (attends to text prompt)              (attends to text prompt) │
│                                                                    │
│        ┌────────────────────────────────────┐                      │
│        │  3. AUDIO-VIDEO CROSS ATTENTION    │                      │
│        │                                    │                      │
│        │  • Audio-to-Video (A→V):           │                      │
│        │    Video queries, Audio keys/vals  │                      │
│        │                                    │                      │
│        │  • Video-to-Audio (V→A):           │                      │
│        │    Audio queries, Video keys/vals  │                      │
│        └────────────────────────────────────┘                      │
│                                                                    │
│  4. Video Feed-Forward                 4. Audio Feed-Forward       │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```

**Key insight**: Video and audio "talk" to each other through bidirectional cross-attention in every block, enabling synchronized audio-video generation.
### Forward Pass

```python
from ltx_core.model.transformer.model import LTXModel

# The transformer takes both modalities and returns predictions for both
video_velocity, audio_velocity = model(
    video=video_modality,
    audio=audio_modality,
    perturbations=None,  # or BatchedPerturbationConfig
)
# Returns velocity predictions used in the Euler diffusion step
```
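These velocities feed the pipeline's `EulerDiffusionStep`. Conceptually, an Euler step moves the latent along the predicted velocity by the difference between consecutive sigmas; a minimal sketch of that update (illustrative only; the actual component may differ in details):

```python
def euler_step(latent, velocity, sigmas, step_idx):
    # Step from sigmas[step_idx] toward sigmas[step_idx + 1] along the velocity
    dt = sigmas[step_idx + 1] - sigmas[step_idx]  # negative, since sigmas decrease
    return latent + dt * velocity
```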
---

## Usage Patterns

### Text-to-Video Generation

Basic text-to-video generation flow:

```python
from dataclasses import replace

from ltx_core.pipeline.components.schedulers import LTX2Scheduler
from ltx_core.pipeline.components.diffusion_steps import EulerDiffusionStep
from ltx_core.pipeline.components.guiders import CFGGuider
from ltx_core.pipeline.conditioning.tools import VideoLatentTools
from ltx_core.pipeline.components.patchifiers import VideoLatentShape
from ltx_core.pipeline.components.protocols import VideoPixelShape

# 1. Encode text prompt
video_context, audio_context, mask = text_encoder(prompt)

# 2. Create video latent tools and initial state
pixel_shape = VideoPixelShape(batch=1, frames=49, height=512, width=768, fps=25.0)
video_tools = VideoLatentTools(
    patchifier=video_patchifier,
    target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
    fps=25.0,
)
video_state = video_tools.create_initial_state(device, dtype)

# 3. Add noise to the latent
noise = torch.randn_like(video_state.latent)
noised_latent = noise  # Start from pure noise

# 4. Create video modality
video = Modality(
    enabled=True,
    latent=noised_latent,
    timesteps=video_state.denoise_mask,  # Will be updated each step
    positions=video_state.positions,
    context=video_context,
    context_mask=None,
)

# 5. Setup scheduler and diffusion components
scheduler = LTX2Scheduler()
sigmas = scheduler.execute(steps=30).to(device)
stepper = EulerDiffusionStep()

# 6. Denoising loop (disabled_audio is a Modality with enabled=False; see Core Concepts)
for step_idx, sigma in enumerate(sigmas[:-1]):
    # Update timesteps with current sigma (use replace for immutable Modality)
    video = replace(video, timesteps=sigma * video_state.denoise_mask)

    # Forward pass
    video_vel, _ = model(video=video, audio=disabled_audio, perturbations=None)

    # Euler step
    new_latent = stepper.step(video.latent, video_vel, sigmas, step_idx)
    video = replace(video, latent=new_latent)

# 7. Decode to pixels
video_spatial = video_tools.unpatchify(
    replace(video_state, latent=video.latent)
).latent  # [B, C, F, H, W]
video_pixels = vae_decoder(video_spatial)  # [B, 3, F, H, W]
```
### Image-to-Video Generation

Condition the first frame with an image:

```python
from ltx_core.pipeline.conditioning.types.latent_cond import VideoConditionByLatentIndex

# Encode the conditioning image
image_latent = vae_encoder(image)  # [B, C, 1, H, W]

# Create video tools and initial state
pixel_shape = VideoPixelShape(batch=1, frames=49, height=512, width=768, fps=25.0)
video_tools = VideoLatentTools(
    patchifier=video_patchifier,
    target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
    fps=25.0,
)
video_state = video_tools.create_initial_state(device, dtype)

# Apply first-frame conditioning
first_frame_cond = VideoConditionByLatentIndex(
    latent=image_latent,
    strength=1.0,  # 1.0 = fully conditioned (no denoising on first frame)
    latent_idx=0,  # Condition frame 0
)
video_state = first_frame_cond.apply_to(video_state, video_tools)
# The denoise_mask will be 0.0 for first-frame tokens, 1.0 for the rest

# Proceed with denoising as usual...
```
### Video-to-Video (IC-LoRA)

IC-LoRA enables video-to-video transformation by conditioning on a reference video. The key insight is that reference tokens are included in the sequence but kept at `timestep=0` (clean, no denoising).

```python
from dataclasses import replace

from ltx_core.pipeline.conditioning.tools import VideoLatentTools
from ltx_core.pipeline.components.patchifiers import VideoLatentShape
from ltx_core.pipeline.components.protocols import VideoPixelShape

# 1. Create video tools for the target
pixel_shape = VideoPixelShape(batch=1, frames=49, height=512, width=768, fps=25.0)
video_tools = VideoLatentTools(
    patchifier=video_patchifier,
    target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
    fps=25.0,
)

# 2. Encode reference video to latents and patchify
ref_latents = vae_encoder(reference_video)  # [B, C, F, H, W]
patchified_ref = video_patchifier.patchify(ref_latents)  # [B, ref_seq_len, C]
ref_seq_len = patchified_ref.shape[1]

# 3. Create target video state (positions computed automatically)
target_state = video_tools.create_initial_state(device, dtype)

# 4. Compute positions for the reference (SAME grid as the target!)
# Reference positions are identical to the target's - this tells the model they correspond
ref_positions = target_state.positions.clone()

# 5. CONCATENATE reference + target
combined_latent = torch.cat([patchified_ref, torch.randn_like(target_state.latent)], dim=1)
combined_positions = torch.cat([ref_positions, target_state.positions], dim=2)

# 6. Create denoise mask: 0 for reference (keep clean), 1 for target (denoise)
ref_denoise_mask = torch.zeros(1, ref_seq_len, 1, device=device)
combined_denoise_mask = torch.cat([ref_denoise_mask, target_state.denoise_mask], dim=1)

# 7. Create the modality with combined inputs
video = Modality(
    enabled=True,
    latent=combined_latent,
    timesteps=combined_denoise_mask,  # Will be updated with sigma
    positions=combined_positions,
    context=video_context,
    context_mask=None,
)

# 8. Denoising loop - only update the target portion
for step_idx, sigma in enumerate(sigmas[:-1]):
    # Timesteps: 0 for reference, sigma for target
    ref_timesteps = torch.zeros(1, ref_seq_len, 1, device=device)
    target_timesteps = sigma * target_state.denoise_mask
    new_timesteps = torch.cat([ref_timesteps, target_timesteps], dim=1)
    video = replace(video, timesteps=new_timesteps)

    # Forward pass
    video_vel, _ = model(video=video, audio=audio, perturbations=None)

    # Euler step - ONLY update the target portion
    target_latent = video.latent[:, ref_seq_len:]
    target_vel = video_vel[:, ref_seq_len:]
    updated_target = stepper.step(target_latent, target_vel, sigmas, step_idx)

    # Reconstruct (reference stays fixed)
    new_latent = torch.cat([patchified_ref, updated_target], dim=1)
    video = replace(video, latent=new_latent)

# 9. Extract and decode only the target portion
final_target = video.latent[:, ref_seq_len:]
target_state_with_output = replace(target_state, latent=final_target)
target_spatial = video_tools.unpatchify(target_state_with_output).latent
video_pixels = vae_decoder(target_spatial)
```

**Why this works:**

- Self-attention sees both reference and target tokens
- Reference tokens have `timestep=0` (clean signal), so the model learns to "copy" from them
- Shared positions tell the model "frame N of the reference = frame N of the target"
- Only the target portion is updated during denoising
### Audio-Video Generation

Generate synchronized audio and video:

```python
from dataclasses import replace

from ltx_core.pipeline.conditioning.tools import VideoLatentTools, AudioLatentTools
from ltx_core.pipeline.components.patchifiers import VideoLatentShape, AudioLatentShape
from ltx_core.pipeline.components.protocols import VideoPixelShape

# Create latent tools for both modalities
pixel_shape = VideoPixelShape(batch=1, frames=49, height=512, width=768, fps=25.0)
video_tools = VideoLatentTools(
    patchifier=video_patchifier,
    target_shape=VideoLatentShape.from_pixel_shape(shape=pixel_shape),
    fps=25.0,
)
audio_tools = AudioLatentTools(
    patchifier=audio_patchifier,
    target_shape=AudioLatentShape.from_duration(batch=1, duration=2.0, channels=8, mel_bins=16),
)

# Create initial states
video_state = video_tools.create_initial_state(device, dtype)
audio_state = audio_tools.create_initial_state(device, dtype)

# Encode text (returns separate embeddings for each modality)
video_context, audio_context, mask = text_encoder(prompt)

# Create both modalities with noise
video = Modality(
    enabled=True,
    latent=torch.randn_like(video_state.latent),
    timesteps=video_state.denoise_mask,
    positions=video_state.positions,
    context=video_context,
    context_mask=None,
)
audio = Modality(
    enabled=True,
    latent=torch.randn_like(audio_state.latent),
    timesteps=audio_state.denoise_mask,
    positions=audio_state.positions,
    context=audio_context,
    context_mask=None,
)

# Denoising loop - update both (use replace for immutable Modality)
for step_idx, sigma in enumerate(sigmas[:-1]):
    video = replace(video, timesteps=sigma * video_state.denoise_mask)
    audio = replace(audio, timesteps=sigma * audio_state.denoise_mask)

    # Forward pass returns both predictions
    video_vel, audio_vel = model(video=video, audio=audio, perturbations=None)

    # Update both latents
    video = replace(video, latent=stepper.step(video.latent, video_vel, sigmas, step_idx))
    audio = replace(audio, latent=stepper.step(audio.latent, audio_vel, sigmas, step_idx))

# Decode both
video_spatial = video_tools.unpatchify(replace(video_state, latent=video.latent)).latent
video_pixels = vae_decoder(video_spatial)

audio_spatial = audio_tools.unpatchify(replace(audio_state, latent=audio.latent)).latent
audio_mel = audio_decoder(audio_spatial)
audio_waveform = vocoder(audio_mel)
```
---

## Common Pitfalls

### 1. Frame Count Constraints

Video frame count must satisfy `num_frames % 8 == 1`:

- ✅ Valid: 49, 97, 121, 145
- ❌ Invalid: 48, 50, 100

```python
# The "+1" accounts for causal padding in the VAE
latent_frames = (num_frames - 1) // 8 + 1
```
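If incoming frame counts are arbitrary, a small helper can snap them to the nearest valid value. `nearest_valid_frames` is a hypothetical convenience, not part of the API:

```python
def nearest_valid_frames(num_frames: int) -> int:
    # Round to the nearest count satisfying num_frames % 8 == 1
    return round((num_frames - 1) / 8) * 8 + 1

assert nearest_valid_frames(50) == 49
assert nearest_valid_frames(100) == 97
```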
### 2. Resolution Constraints

Height and width must be divisible by 32:

- ✅ Valid: 512×768, 768×1024
- ❌ Invalid: 500×750
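The same snap-to-grid trick works for resolutions. `nearest_valid_resolution` is likewise hypothetical:

```python
def nearest_valid_resolution(height: int, width: int) -> tuple[int, int]:
    # Round each dimension to the nearest multiple of 32
    return round(height / 32) * 32, round(width / 32) * 32

assert nearest_valid_resolution(500, 750) == (512, 736)
```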
### 3. Position Tensor Shapes

Different modalities have different position tensor shapes:

- Video: `[B, 3, seq_len, 2]` - 3 dimensions for (time, height, width)
- Audio: `[B, 1, seq_len, 2]` - 1 dimension for time only
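A cheap guard before the forward pass catches the most common mix-up (the assertions are illustrative):

```python
# Video positions carry (time, height, width); audio positions carry time only.
assert video.positions.shape[1] == 3, "video positions must be [B, 3, seq_len, 2]"
assert audio.positions.shape[1] == 1, "audio positions must be [B, 1, seq_len, 2]"
```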
### 4. Separate Context Embeddings

Video and audio modalities receive **different** context embeddings from the text encoder:

```python
# The text encoder returns separate embeddings
video_context, audio_context, mask = text_encoder(prompt)

# Use the appropriate one for each modality
video = Modality(context=video_context, ...)  # NOT audio_context!
audio = Modality(context=audio_context, ...)  # NOT video_context!
```
### 5. Immutable Modality

The `Modality` dataclass is **frozen** (immutable). Use `dataclasses.replace()` to create modified copies:

```python
from dataclasses import replace

# ❌ Wrong - will raise an error
video.latent = new_latent

# ✅ Correct - create a new Modality with the updated field
video = replace(video, latent=new_latent)

# ✅ Update multiple fields at once
video = replace(video, latent=new_latent, timesteps=new_timesteps)
```
---

## Additional Resources

- [Training Guide](./training-guide.md) - How to fine-tune LTX-2 models
- [Configuration Reference](./configuration-reference.md) - All configuration options
- [Training Modes](./training-modes.md) - LoRA, audio-video, and IC-LoRA training