ofirbibi committed
Commit c95d1d1 · 1 Parent(s): 998921d

Docs: Update readme.

Files changed (1):
  1. README.md +8 -94

README.md CHANGED
@@ -39,7 +39,10 @@ demo: https://app.ltx.studio/ltx-2-playground/i2v
  # LTX-2.3 Model Card

  This model card focuses on the LTX-2.3 model, which is a significant update to the [LTX-2 model](https://huggingface.co/Lightricks/LTX-2) with improved audio and visual quality as well as enhanced prompt adherence.
- LTX-2 was presented in the paper [LTX-2: Efficient Joint Audio-Visual Foundation Model](https://huggingface.co/papers/2601.03233). The codebase is available [here](https://github.com/Lightricks/LTX-2).

  LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.

@@ -49,9 +52,9 @@ LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchro

  | Name                               | Notes                                                                                                                |
  |------------------------------------|----------------------------------------------------------------------------------------------------------------------|
- | ltx-2.3-20b-dev                    | The full model, flexible and trainable in bf16                                                                       |
- | ltx-2.3-20b-distilled              | The distilled version of the full model, 8 steps, CFG=1                                                              |
- | ltx-2.3-20b-distilled-lora-384     | A LoRA version of the distilled model applicable to the full model                                                   |
  | ltx-2.3-spatial-upscaler-x2-1.0    | An x2 spatial upscaler for the ltx-2.3 latents, used in multi-stage (multiscale) pipelines for higher resolution     |
  | ltx-2.3-spatial-upscaler-x1.5-1.0  | An x1.5 spatial upscaler for the ltx-2.3 latents, used in multi-stage (multiscale) pipelines for higher resolution   |
  | ltx-2.3-temporal-upscaler-x2-1.0   | An x2 temporal upscaler for the ltx-2.3 latents, used in multi-stage (multiscale) pipelines for higher FPS           |
@@ -97,96 +100,7 @@ To use our model, please follow the instructions in our [ltx-pipelines](https://

  ## Diffusers 🧨

- LTX-2 is supported in the [Diffusers Python library](https://huggingface.co/docs/diffusers/main/en/index) for text & image-to-video generation.
- Read more on LTX-2 with diffusers [here](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx2#diffusers.LTX2Pipeline.__call__.example).
-
- ### Use with diffusers
- To achieve production-quality generation, it is recommended to use the two-stage generation pipeline.
- Example of two-stage text-to-video inference:
- ```python
- import torch
- from diffusers import FlowMatchEulerDiscreteScheduler
- from diffusers.pipelines.ltx2 import LTX2Pipeline, LTX2LatentUpsamplePipeline
- from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
- from diffusers.pipelines.ltx2.utils import STAGE_2_DISTILLED_SIGMA_VALUES
- from diffusers.pipelines.ltx2.export_utils import encode_video
-
- device = "cuda:0"
- width = 768
- height = 512
-
- pipe = LTX2Pipeline.from_pretrained(
-     "Lightricks/LTX-2", torch_dtype=torch.bfloat16
- )
- pipe.enable_sequential_cpu_offload(device=device)
-
- prompt = "A beautiful sunset over the ocean"
- negative_prompt = "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static."
-
- # Stage 1 default (non-distilled) inference
- frame_rate = 24.0
- video_latent, audio_latent = pipe(
-     prompt=prompt,
-     negative_prompt=negative_prompt,
-     width=width,
-     height=height,
-     num_frames=121,
-     frame_rate=frame_rate,
-     num_inference_steps=40,
-     sigmas=None,
-     guidance_scale=4.0,
-     output_type="latent",
-     return_dict=False,
- )
-
- latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
-     "Lightricks/LTX-2",
-     subfolder="latent_upsampler",
-     torch_dtype=torch.bfloat16,
- )
- upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
- upsample_pipe.enable_model_cpu_offload(device=device)
- upscaled_video_latent = upsample_pipe(
-     latents=video_latent,
-     output_type="latent",
-     return_dict=False,
- )[0]
-
- # Load the Stage 2 distilled LoRA
- pipe.load_lora_weights(
-     "Lightricks/LTX-2", adapter_name="stage_2_distilled", weight_name="ltx-2-19b-distilled-lora-384.safetensors"
- )
- pipe.set_adapters("stage_2_distilled", 1.0)
- # VAE tiling is usually necessary to avoid OOM errors during VAE decoding
- pipe.vae.enable_tiling()
- # Change the scheduler to use the Stage 2 distilled sigmas as-is
- new_scheduler = FlowMatchEulerDiscreteScheduler.from_config(
-     pipe.scheduler.config, use_dynamic_shifting=False, shift_terminal=None
- )
- pipe.scheduler = new_scheduler
- # Stage 2 inference with the distilled LoRA and sigmas
- video, audio = pipe(
-     latents=upscaled_video_latent,
-     audio_latents=audio_latent,
-     prompt=prompt,
-     negative_prompt=negative_prompt,
-     num_inference_steps=3,
-     noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0],  # renoise with the first sigma value: https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/ti2vid_two_stages.py#L218
-     sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
-     guidance_scale=1.0,
-     output_type="np",
-     return_dict=False,
- )
-
- encode_video(
-     video[0],
-     fps=frame_rate,
-     audio=audio[0].float().cpu(),
-     audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
-     output_path="ltx2_lora_distilled_sample.mp4",
- )
- ```
- For more inference examples, including generation with the distilled checkpoint, visit [here](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx2#diffusers.LTX2Pipeline.__call__.example).

  ## General tips:
  * Width & height settings must be divisible by 32. Frame count must be a multiple of 8 plus 1 (e.g., 121).
 
  # LTX-2.3 Model Card

  This model card focuses on the LTX-2.3 model, which is a significant update to the [LTX-2 model](https://huggingface.co/Lightricks/LTX-2) with improved audio and visual quality as well as enhanced prompt adherence.
+
+ ## If you want to dive right into the code, it is available [here](https://github.com/Lightricks/LTX-2).
+
+ LTX-2 was presented in the paper [LTX-2: Efficient Joint Audio-Visual Foundation Model](https://huggingface.co/papers/2601.03233).

  LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
 
  | Name                               | Notes                                                                                                                |
  |------------------------------------|----------------------------------------------------------------------------------------------------------------------|
+ | ltx-2.3-22b-dev                    | The full model, flexible and trainable in bf16                                                                       |
+ | ltx-2.3-22b-distilled              | The distilled version of the full model, 8 steps, CFG=1                                                              |
+ | ltx-2.3-22b-distilled-lora-384     | A LoRA version of the distilled model applicable to the full model                                                   |
  | ltx-2.3-spatial-upscaler-x2-1.0    | An x2 spatial upscaler for the ltx-2.3 latents, used in multi-stage (multiscale) pipelines for higher resolution     |
  | ltx-2.3-spatial-upscaler-x1.5-1.0  | An x1.5 spatial upscaler for the ltx-2.3 latents, used in multi-stage (multiscale) pipelines for higher resolution   |
  | ltx-2.3-temporal-upscaler-x2-1.0   | An x2 temporal upscaler for the ltx-2.3 latents, used in multi-stage (multiscale) pipelines for higher FPS           |
 
  ## Diffusers 🧨

+ LTX-2.3 support in the [Diffusers Python library](https://huggingface.co/docs/diffusers/main/en/index) is coming soon!

  ## General tips:
  * Width & height settings must be divisible by 32. Frame count must be a multiple of 8 plus 1 (e.g., 121).
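
The dimension constraints above can be checked mechanically. As a minimal sketch, a hypothetical helper (`snap_dimensions` is not part of the LTX-2 codebase) that rounds requested settings to valid values might look like this:

```python
# Hypothetical helper (not part of the LTX-2 codebase) that snaps
# generation settings to the constraints above: width/height to
# multiples of 32, frame count to the form 8*k + 1.

def snap_dimensions(width: int, height: int, num_frames: int) -> tuple[int, int, int]:
    snapped_w = max(32, (width // 32) * 32)              # round down, at least 32
    snapped_h = max(32, (height // 32) * 32)
    snapped_f = max(1, ((num_frames - 1) // 8) * 8 + 1)  # nearest 8*k + 1 at or below
    return snapped_w, snapped_h, snapped_f

print(snap_dimensions(770, 520, 121))  # -> (768, 512, 121)
```

For example, a requested 770x520 clip of 124 frames would snap to 768x512 with 121 frames, which satisfies both rules.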