Fabrice-TIERCELIN committed
Commit 745c566 · verified · 1 Parent(s): d542a91

Upload 2 files

packages/ltx-core/README.md CHANGED
# LTX-Core

The foundational library for the LTX-2 Audio-Video generation model. This package contains the raw model definitions, component implementations, and loading logic used by `ltx-pipelines` and `ltx-trainer`.

## 📦 What's Inside?

- **`components/`**: Modular diffusion components (Schedulers, Guiders, Noisers, Patchifiers) following standard protocols
- **`conditioning/`**: Tools for preparing latent states and applying conditioning (image, video, keyframes)
- **`guidance/`**: Perturbation system for fine-grained control over attention mechanisms
- **`loader/`**: Utilities for loading weights from `.safetensors`, fusing LoRAs, and managing memory
- **`model/`**: PyTorch implementations of the LTX-2 Transformer, Video VAE, Audio VAE, Vocoder, and Upscaler
- **`text_encoders/gemma`**: Gemma text encoder implementation with tokenizers, feature extractors, and separate encoders for audio-video and video-only generation

## 🚀 Quick Start

`ltx-core` provides the building blocks (models, components, and utilities) needed to construct inference flows. For ready-made inference pipelines, use [`ltx-pipelines`](../ltx-pipelines/); for training, use [`ltx-trainer`](../ltx-trainer/).

## 🔧 Installation

```bash
# From the repository root
uv sync --frozen

# Or install as an editable package
pip install -e packages/ltx-core
```

## Building Blocks Overview

`ltx-core` provides modular components that can be combined to build custom inference flows:

### Core Models

- **Transformer** ([`model/transformer/`](src/ltx_core/model/transformer/)): The 48-layer LTX-2 transformer with cross-modal attention for joint audio-video processing. Expects inputs in [`Modality`](src/ltx_core/model/transformer/modality.py) format
- **Video VAE** ([`model/video_vae/`](src/ltx_core/model/video_vae/)): Encodes/decodes video pixels to/from latent space with temporal and spatial compression
- **Audio VAE** ([`model/audio_vae/`](src/ltx_core/model/audio_vae/)): Encodes/decodes audio spectrograms to/from latent space
- **Vocoder** ([`model/audio_vae/`](src/ltx_core/model/audio_vae/)): Neural vocoder that converts mel spectrograms to audio waveforms
- **Text Encoder** ([`text_encoders/`](src/ltx_core/text_encoders/)): Gemma-based encoder that produces separate embeddings for video and audio conditioning
- **Spatial Upscaler** ([`model/upsampler/`](src/ltx_core/model/upsampler/)): Upsamples latent representations for higher-resolution generation

### Diffusion Components

- **Schedulers** ([`components/schedulers.py`](src/ltx_core/components/schedulers.py)): Noise schedules (LTX2Scheduler, LinearQuadratic, Beta) that control the denoising process
- **Guiders** ([`components/guiders.py`](src/ltx_core/components/guiders.py)): Guidance strategies (CFG, STG, APG) for controlling generation quality and adherence to prompts
- **Noisers** ([`components/noisers.py`](src/ltx_core/components/noisers.py)): Add noise to latents according to the diffusion schedule
- **Patchifiers** ([`components/patchifiers.py`](src/ltx_core/components/patchifiers.py)): Convert between spatial latents `[B, C, F, H, W]` and sequence format `[B, seq_len, dim]` for transformer processing

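The patchifier's layout change can be illustrated with a small NumPy sketch (NumPy stands in for the actual PyTorch implementation; the function name and patch sizes here are illustrative, not the package's API):

```python
import numpy as np

def patchify(latents: np.ndarray, pf: int = 1, ph: int = 2, pw: int = 2) -> np.ndarray:
    """Convert [B, C, F, H, W] latents to [B, seq_len, dim] tokens.

    Each (pf x ph x pw) group of latent voxels becomes one token of
    dimension C * pf * ph * pw. Patch sizes are illustrative only.
    """
    b, c, f, h, w = latents.shape
    x = latents.reshape(b, c, f // pf, pf, h // ph, ph, w // pw, pw)
    # Move the patch axes next to the channel axis, flatten the rest into a sequence.
    x = x.transpose(0, 2, 4, 6, 1, 3, 5, 7)  # [B, F', H', W', C, pf, ph, pw]
    return x.reshape(b, (f // pf) * (h // ph) * (w // pw), c * pf * ph * pw)

latents = np.zeros((1, 128, 5, 16, 16))
print(patchify(latents).shape)  # (1, 320, 512): 5*8*8 tokens of dim 128*1*2*2
```

Unpatchification is the exact inverse: reshape the sequence back into the patch grid and undo the transpose.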
### Conditioning & Control

- **Conditioning** ([`conditioning/`](src/ltx_core/conditioning/)): Tools for preparing and applying various conditioning types (image, video, keyframes)
- **Guidance** ([`guidance/`](src/ltx_core/guidance/)): Perturbation system for fine-grained control over attention mechanisms (e.g., skipping specific attention layers)

### Utilities

- **Loader** ([`loader/`](src/ltx_core/loader/)): Model loading from `.safetensors`, LoRA fusion, weight remapping, and memory management

For complete, production-ready pipeline implementations that combine these building blocks, see the [`ltx-pipelines`](../ltx-pipelines/) package.

---

# Architecture Overview

This section provides a deep dive into the internal architecture of the LTX-2 Audio-Video generation model.

## Table of Contents

1. [High-Level Architecture](#high-level-architecture)
2. [The Transformer](#the-transformer)
3. [Video VAE](#video-vae)
4. [Audio VAE](#audio-vae)
5. [Text Encoding (Gemma)](#text-encoding-gemma)
6. [Spatial Upscaler](#spatial-upscaler)
7. [Data Flow](#data-flow)

---

## High-Level Architecture

LTX-2 is a **joint Audio-Video diffusion transformer** that processes both modalities simultaneously in a unified architecture. Unlike traditional models that handle video and audio separately, LTX-2 uses cross-modal attention to enable natural synchronization.

```text
┌──────────────────────────────────────────────────────────────┐
│                      INPUT PREPARATION                       │
│                                                              │
│  Video Pixels   → Video VAE Encoder → Video Latents          │
│  Audio Waveform → Audio VAE Encoder → Audio Latents          │
│  Text Prompt    → Gemma Encoder     → Text Embeddings        │
└──────────────────────────────────────────────────────────────┘
                               ↓
┌──────────────────────────────────────────────────────────────┐
│                LTX-2 TRANSFORMER (48 Blocks)                 │
│                                                              │
│   ┌──────────────┐        ┌──────────────┐                   │
│   │ Video Stream │        │ Audio Stream │                   │
│   │              │        │              │                   │
│   │  Self-Attn   │        │  Self-Attn   │                   │
│   │  Cross-Attn  │        │  Cross-Attn  │                   │
│   │              │◄──────►│              │                   │
│   │  A↔V Cross   │        │  A↔V Cross   │                   │
│   │ Feed-Forward │        │ Feed-Forward │                   │
│   └──────────────┘        └──────────────┘                   │
└──────────────────────────────────────────────────────────────┘
                               ↓
┌──────────────────────────────────────────────────────────────┐
│                       OUTPUT DECODING                        │
│                                                              │
│  Video Latents   → Video VAE Decoder → Video Pixels          │
│  Audio Latents   → Audio VAE Decoder → Mel Spectrogram       │
│  Mel Spectrogram → Vocoder           → Audio Waveform        │
└──────────────────────────────────────────────────────────────┘
```

---

## The Transformer

The core of LTX-2 is a 48-layer transformer that processes both video and audio tokens simultaneously.

### Model Structure

**Source**: [`src/ltx_core/model/transformer/model.py`](src/ltx_core/model/transformer/model.py)

The `LTXModel` class implements the transformer. It supports both video-only and audio-video generation modes. For actual usage, see the [`ltx-pipelines`](../ltx-pipelines/) package, which handles model loading and initialization.

### Transformer Block Architecture

**Source**: [`src/ltx_core/model/transformer/transformer.py`](src/ltx_core/model/transformer/transformer.py)

```text
┌──────────────────────────────────────────────────────────┐
│                    TRANSFORMER BLOCK                     │
│                                                          │
│  VIDEO PATH:                                             │
│  Input → RMSNorm → AdaLN → Self-Attn (attn1)             │
│        → RMSNorm → Cross-Attn (attn2, text)              │
│        → RMSNorm → AdaLN → A↔V Cross-Attn                │
│        → RMSNorm → AdaLN → Feed-Forward (ff) → Output    │
│                                                          │
│  AUDIO PATH:                                             │
│  Input → RMSNorm → AdaLN → Self-Attn (audio_attn1)       │
│        → RMSNorm → Cross-Attn (audio_attn2, text)        │
│        → RMSNorm → AdaLN → A↔V Cross-Attn                │
│        → RMSNorm → AdaLN → Feed-Forward (audio_ff)       │
│                                                          │
│  AdaLN (Adaptive Layer Normalization):                   │
│    - Uses scale_shift_table (6 params) for video/audio   │
│    - Uses scale_shift_table_a2v_ca (5 params) for A↔V CA │
│    - Conditioned on per-token timestep embeddings        │
└──────────────────────────────────────────────────────────┘
```

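The AdaLN modulation in the diagram can be sketched numerically; this is a minimal NumPy illustration of scale/shift conditioning on normalized activations, with hypothetical names, not the package's API:

```python
import numpy as np

def adaln_modulate(x, scale, shift):
    """Apply adaptive layer-norm style modulation.

    x:     [B, seq_len, dim] RMS-normalized hidden states
    scale: [B, seq_len, dim] per-token scale derived from timestep embeddings
    shift: [B, seq_len, dim] per-token shift derived from timestep embeddings
    """
    return x * (1.0 + scale) + shift

x = np.ones((1, 4, 8))
out = adaln_modulate(x, np.zeros_like(x), np.zeros_like(x))
print(np.allclose(out, x))  # True: zero scale/shift leaves activations unchanged
```

In the real block, the scale/shift (and gate) parameters come from the `scale_shift_table` entries combined with the timestep embedding, so each token can be modulated according to its own noise level.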
### Perturbations

The transformer supports [**perturbations**](src/ltx_core/guidance/perturbations.py) that selectively skip attention operations.

Perturbations let you disable specific attention mechanisms during inference, which is useful for guidance techniques such as STG (Spatio-Temporal Guidance).

**Supported Perturbation Types**:

- `SKIP_VIDEO_SELF_ATTN`: Skip video self-attention
- `SKIP_AUDIO_SELF_ATTN`: Skip audio self-attention
- `SKIP_A2V_CROSS_ATTN`: Skip audio-to-video cross-attention
- `SKIP_V2A_CROSS_ATTN`: Skip video-to-audio cross-attention

Perturbations are applied internally by the guidance mechanisms; for usage examples, see the [`ltx-pipelines`](../ltx-pipelines/) package.

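Conceptually, a perturbation acts as a switch in a block's forward pass. A toy sketch (the enum values mirror the list above, but the block structure and function names are simplified and hypothetical):

```python
from enum import Enum, auto

class Perturbation(Enum):
    SKIP_VIDEO_SELF_ATTN = auto()
    SKIP_AUDIO_SELF_ATTN = auto()
    SKIP_A2V_CROSS_ATTN = auto()
    SKIP_V2A_CROSS_ATTN = auto()

def video_self_attention(x):
    # Stand-in for the real attention operation.
    return [v * 2 for v in x]

def block_forward(x, perturbations=frozenset()):
    """Toy block: apply video self-attention unless it is perturbed away."""
    if Perturbation.SKIP_VIDEO_SELF_ATTN not in perturbations:
        x = video_self_attention(x)
    return x

print(block_forward([1, 2]))                                       # [2, 4]
print(block_forward([1, 2], {Perturbation.SKIP_VIDEO_SELF_ATTN}))  # [1, 2]
```

Guidance techniques like STG run a second, perturbed forward pass and contrast it with the unperturbed prediction.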
---

## Video VAE

The Video VAE ([`src/ltx_core/model/video_vae/`](src/ltx_core/model/video_vae/)) encodes video pixels into latent representations and decodes them back.

### Architecture

- **Encoder**: Compresses `[B, 3, F, H, W]` pixels → `[B, 128, F', H/32, W/32]` latents
  - Where `F' = 1 + (F-1)/8` (frame count must satisfy `(F-1) % 8 == 0`)
  - Example: `[B, 3, 33, 512, 512]` → `[B, 128, 5, 16, 16]`
- **Decoder**: Expands `[B, 128, F, H, W]` latents → `[B, 3, F', H*32, W*32]` pixels
  - Where `F' = 1 + (F-1)*8`
  - Example: `[B, 128, 5, 16, 16]` → `[B, 3, 33, 512, 512]`

The Video VAE is used internally by pipelines for encoding video pixels to latents and decoding latents back to pixels. For usage examples, see the [`ltx-pipelines`](../ltx-pipelines/) package.

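The shape arithmetic above is easy to check with a small helper (illustrative only, not part of the package):

```python
def video_latent_shape(b, f, h, w):
    """Map pixel shape [B, 3, F, H, W] to latent shape [B, 128, F', H/32, W/32]."""
    assert (f - 1) % 8 == 0, "frame count must satisfy (F-1) % 8 == 0"
    assert h % 32 == 0 and w % 32 == 0, "spatial dims must be divisible by 32"
    return (b, 128, 1 + (f - 1) // 8, h // 32, w // 32)

print(video_latent_shape(1, 33, 512, 512))  # (1, 128, 5, 16, 16)
```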
---

## Audio VAE

The Audio VAE ([`src/ltx_core/model/audio_vae/`](src/ltx_core/model/audio_vae/)) processes audio spectrograms.

### Audio VAE Architecture

- **Encoder**: Compresses mel spectrogram `[B, mel_bins, T]` → `[B, 8, T/4, 16]` latents
  - Temporal downsampling: 4× (`LATENT_DOWNSAMPLE_FACTOR = 4`)
  - Frequency bins: fixed 16 mel bins in latent space
  - Latent channels: 8
- **Decoder**: Expands `[B, 8, T, 16]` latents → mel spectrogram `[B, mel_bins, T*4]`
- **Vocoder**: Converts mel spectrogram → audio waveform

**Downsampling**:

- Temporal: 4× (time steps)
- Frequency: variable (input `mel_bins` → fixed 16 in latent space)

The Audio VAE is used internally by pipelines for encoding mel spectrograms to latents and decoding latents back to mel spectrograms. The vocoder converts mel spectrograms to audio waveforms. For usage examples, see the [`ltx-pipelines`](../ltx-pipelines/) package.

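As with the Video VAE, the latent shape follows directly from the factors above (an illustrative helper, not the package's API):

```python
LATENT_DOWNSAMPLE_FACTOR = 4  # temporal downsampling factor of the encoder

def audio_latent_shape(b, mel_bins, t):
    """Map mel spectrogram [B, mel_bins, T] to latent shape [B, 8, T/4, 16]."""
    assert t % LATENT_DOWNSAMPLE_FACTOR == 0, "T must be divisible by 4"
    # mel_bins is absorbed into the fixed 16 latent frequency bins.
    return (b, 8, t // LATENT_DOWNSAMPLE_FACTOR, 16)

print(audio_latent_shape(1, 128, 256))  # (1, 8, 64, 16)
```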
---

## Text Encoding (Gemma)

LTX-2 uses **Gemma** (Google's open LLM) as the text encoder, located in [`src/ltx_core/text_encoders/gemma/`](src/ltx_core/text_encoders/gemma/).

### Text Encoder Architecture

- **Tokenizer**: Converts text → token IDs
- **Gemma Model**: Processes tokens → embeddings
- **Text Projection**: Uses `PixArtAlphaTextProjection` to project caption embeddings
  - Two-layer MLP with GELU (tanh approximation) or SiLU activation
  - Projects from caption channels (3840) to model dimensions
- **Feature Extractor**: Extracts video/audio-specific embeddings
- **Separate Encoders**:
  - `AVEncoder`: For audio-video generation (outputs separate video and audio contexts)
  - `VideoOnlyEncoder`: For video-only generation

### System Prompts

System prompts are used to enhance the user's prompt before encoding:

- **Text-to-Video**: [`gemma_t2v_system_prompt.txt`](src/ltx_core/text_encoders/gemma/encoders/prompts/gemma_t2v_system_prompt.txt)
- **Image-to-Video**: [`gemma_i2v_system_prompt.txt`](src/ltx_core/text_encoders/gemma/encoders/prompts/gemma_i2v_system_prompt.txt)

**Important**: Video and audio receive **different** context embeddings, even from the same prompt. This allows better modality-specific conditioning.

**Output Format**:

- Video context: `[B, seq_len, 4096]` - video-specific text embeddings
- Audio context: `[B, seq_len, 2048]` - audio-specific text embeddings

The text encoder is used internally by pipelines. For usage examples, see the [`ltx-pipelines`](../ltx-pipelines/) package.

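The projection step can be sketched as a two-layer MLP. This NumPy illustration shows only the shape transformation implied by the text above (3840 caption channels to a 4096-dimensional video context); the weights are random and the function names are hypothetical:

```python
import numpy as np

def tanh_gelu(x):
    # GELU with the tanh approximation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def project_captions(emb, w1, w2):
    """Two-layer MLP: [B, seq, 3840] caption embeddings -> [B, seq, dim]."""
    return tanh_gelu(emb @ w1) @ w2

rng = np.random.default_rng(0)
emb = rng.normal(size=(1, 7, 3840))        # caption embeddings from Gemma
w1 = rng.normal(size=(3840, 4096)) * 0.01  # caption channels -> hidden
w2 = rng.normal(size=(4096, 4096)) * 0.01  # hidden -> video context dim
print(project_captions(emb, w1, w2).shape)  # (1, 7, 4096)
```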
---

## Spatial Upscaler

The Spatial Upscaler ([`src/ltx_core/model/upsampler/`](src/ltx_core/model/upsampler/)) upsamples latent representations for higher-resolution output.

It is used internally by two-stage pipelines (e.g., [`TI2VidTwoStagesPipeline`](../ltx-pipelines/src/ltx_pipelines/ti2vid_two_stages.py), [`ICLoraPipeline`](../ltx-pipelines/src/ltx_pipelines/ic_lora.py)) to upsample low-resolution latents before final VAE decoding. For usage examples, see the [`ltx-pipelines`](../ltx-pipelines/) package.

---

## Data Flow

### Complete Generation Pipeline

Here's how all the components work together conceptually ([`src/ltx_core/components/`](src/ltx_core/components/)):

**Pipeline Steps**:

1. **Text Encoding**: Text prompt → Gemma encoder → separate video/audio embeddings
2. **Latent Initialization**: Initialize noise latents in spatial format `[B, C, F, H, W]`
3. **Patchification**: Convert spatial latents to sequence format `[B, seq_len, dim]` for the transformer
4. **Sigma Schedule**: Generate the noise schedule (adapts to token count)
5. **Denoising Loop**: Iteratively denoise using transformer predictions
   - Create `Modality` inputs with per-token timesteps and RoPE positions
   - Forward pass through the transformer (conditional and unconditional for CFG)
   - Apply guidance (CFG, STG, etc.)
   - Update latents using a diffusion step (Euler, etc.)
6. **Unpatchification**: Convert the sequence back to spatial format
7. **VAE Decoding**: Decode latents to pixel space (with optional upsampling for two-stage)

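The denoising loop in step 5 can be sketched as follows: a toy NumPy Euler loop with classifier-free guidance around a dummy denoiser. The real loop calls the transformer, guiders, and schedulers from this package; every name below is illustrative:

```python
import numpy as np

def dummy_denoiser(latents, context_scale):
    # Stand-in for the transformer's prediction (conditional vs. unconditional).
    return latents * context_scale

def denoise(latents, sigmas, cfg_scale=3.0):
    """Euler denoising with classifier-free guidance (CFG)."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        cond = dummy_denoiser(latents, 1.0)    # conditional forward pass
        uncond = dummy_denoiser(latents, 0.5)  # unconditional forward pass
        # CFG: push the prediction away from the unconditional branch.
        pred = uncond + cfg_scale * (cond - uncond)
        # Euler step toward the next noise level.
        latents = latents + (sigma_next - sigma) * pred
    return latents

sigmas = np.linspace(1.0, 0.0, 5)  # toy sigma schedule
latents = np.random.default_rng(0).normal(size=(1, 320, 512))
print(denoise(latents, sigmas).shape)  # (1, 320, 512)
```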
These steps are combined end-to-end by the pipelines in [`ltx-pipelines`](../ltx-pipelines/):

- [`TI2VidTwoStagesPipeline`](../ltx-pipelines/src/ltx_pipelines/ti2vid_two_stages.py) - Two-stage text-to-video (recommended)
- [`ICLoraPipeline`](../ltx-pipelines/src/ltx_pipelines/ic_lora.py) - Video-to-video with IC-LoRA control
- [`DistilledPipeline`](../ltx-pipelines/src/ltx_pipelines/distilled.py) - Fast inference with a distilled model
- [`KeyframeInterpolationPipeline`](../ltx-pipelines/src/ltx_pipelines/keyframe_interpolation.py) - Keyframe-based interpolation

See the [ltx-pipelines README](../ltx-pipelines/README.md) for usage examples.

## 🔗 Related Projects

- **[ltx-pipelines](../ltx-pipelines/)** - High-level pipeline implementations for text-to-video, image-to-video, and video-to-video
- **[ltx-trainer](../ltx-trainer/)** - Training and fine-tuning tools
packages/ltx-core/pyproject.toml CHANGED

```diff
@@ -1,9 +1,9 @@
 [project]
 name = "ltx-core"
-version = "0.1.0"
+version = "1.0.0"
 description = "Core implementation of Lightricks' LTX-2 model"
 readme = "README.md"
-requires-python = ">=3.12"
+requires-python = ">=3.10"
 dependencies = [
     "torch~=2.7",
     "torchaudio",
@@ -16,7 +16,6 @@ dependencies = [
 ]

 [project.optional-dependencies]
-flashpack = ["flashpack==0.1.2"]
 xformers = ["xformers"]
```