Lightricks/LTX-2.3

@@ -39,7 +39,10 @@ demo: https://app.ltx.studio/ltx-2-playground/i2v
 # LTX-2.3 Model Card
 This model card focuses on the LTX-2.3 model, which is a significant update to the [LTX-2 model](https://huggingface.co/Lightricks/LTX-2) with improved audio and visual quality as well as enhanced prompt adherence.
-LTX-2 was presented in the paper [LTX-2: Efficient Joint Audio-Visual Foundation Model](https://huggingface.co/papers/2601.03233). The codebase is available [here](https://github.com/Lightricks/LTX-2).
 LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
@@ -49,9 +52,9 @@ LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchro
 | Name                               | Notes                                                                                                              |
 |------------------------------------|--------------------------------------------------------------------------------------------------------------------|
-| ltx-2.3-20b-dev                    | The full model, flexible and trainable in bf16                                                                     |
-| ltx-2.3-20b-distilled              | The distilled version of the full model, 8 steps, CFG=1                                                            |
-| ltx-2.3-20b-distilled-lora-384     | A LoRA version of the distilled model applicable to the full model                                                 |
 | ltx-2.3-spatial-upscaler-x2-1.0    | An x2 spatial upscaler for the ltx-2.3 latents, used in multi stage (multiscale) pipelines for higher resolution   |
 | ltx-2.3-spatial-upscaler-x1.5-1.0  | An x1.5 spatial upscaler for the ltx-2.3 latents, used in multi stage (multiscale) pipelines for higher resolution |
 | ltx-2.3-temporal-upscaler-x2-1.0   | An x2 temporal upscaler for the ltx-2.3 latents, used in multi stage (multiscale) pipelines for higher FPS         |
@@ -97,96 +100,7 @@ To use our model, please follow the instructions in our [ltx-pipelines](https://
 ## Diffusers 🧨
-LTX-2 is supported in the [Diffusers Python library](https://huggingface.co/docs/diffusers/main/en/index) for text & image-to-video generation.
-Read more on LTX-2 with diffusers [here](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx2#diffusers.LTX2Pipeline.__call__.example).
-### Use with diffusers
-To achieve production quality generation, it's recommended to use the two-stage generation pipeline.
-Example for 2-stage inference of text-to-video:
-```python
-import torch
-from diffusers import FlowMatchEulerDiscreteScheduler
-from diffusers.pipelines.ltx2 import LTX2Pipeline, LTX2LatentUpsamplePipeline
-from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
-from diffusers.pipelines.ltx2.utils import STAGE_2_DISTILLED_SIGMA_VALUES
-from diffusers.pipelines.ltx2.export_utils import encode_video
-device = "cuda:0"
-width = 768
-height = 512
-pipe = LTX2Pipeline.from_pretrained(
-    "Lightricks/LTX-2", torch_dtype=torch.bfloat16
-)
-pipe.enable_sequential_cpu_offload(device=device)
-prompt = "A beautiful sunset over the ocean"
-negative_prompt = "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static."
-# Stage 1 default (non-distilled) inference
-frame_rate = 24.0
-video_latent, audio_latent = pipe(
-    prompt=prompt,
-    negative_prompt=negative_prompt,
-    width=width,
-    height=height,
-    num_frames=121,
-    frame_rate=frame_rate,
-    num_inference_steps=40,
-    sigmas=None,
-    guidance_scale=4.0,
-    output_type="latent",
-    return_dict=False,
-)
-latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
-    "Lightricks/LTX-2",
-    subfolder="latent_upsampler",
-    torch_dtype=torch.bfloat16,
-)
-upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
-upsample_pipe.enable_model_cpu_offload(device=device)
-upscaled_video_latent = upsample_pipe(
-    latents=video_latent,
-    output_type="latent",
-    return_dict=False,
-)[0]
-# Load Stage 2 distilled LoRA
-pipe.load_lora_weights(
-    "Lightricks/LTX-2", adapter_name="stage_2_distilled", weight_name="ltx-2-19b-distilled-lora-384.safetensors"
-)
-pipe.set_adapters("stage_2_distilled", 1.0)
-# VAE tiling is usually necessary to avoid OOM error when VAE decoding
-pipe.vae.enable_tiling()
-# Change scheduler to use Stage 2 distilled sigmas as is
-new_scheduler = FlowMatchEulerDiscreteScheduler.from_config(
-    pipe.scheduler.config, use_dynamic_shifting=False, shift_terminal=None
-)
-pipe.scheduler = new_scheduler
-# Stage 2 inference with distilled LoRA and sigmas
-video, audio = pipe(
-    latents=upscaled_video_latent,
-    audio_latents=audio_latent,
-    prompt=prompt,
-    negative_prompt=negative_prompt,
-    num_inference_steps=3,
-    noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0], # renoise with first sigma value https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/ti2vid_two_stages.py#L218
-    sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
-    guidance_scale=1.0,
-    output_type="np",
-    return_dict=False,
-)
-encode_video(
-    video[0],
-    fps=frame_rate,
-    audio=audio[0].float().cpu(),
-    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
-    output_path="ltx2_lora_distilled_sample.mp4",
-)
-```
-For more inference examples, including generation with the distilled checkpoint, visit [here](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx2#diffusers.LTX2Pipeline.__call__.example).
 ## General tips:
 * Width & height settings must be divisible by 32. Frame count must be a multiple of 8 plus 1 (that is, 8k + 1, e.g. 121).

 # LTX-2.3 Model Card
 This model card focuses on the LTX-2.3 model, which is a significant update to the [LTX-2 model](https://huggingface.co/Lightricks/LTX-2) with improved audio and visual quality as well as enhanced prompt adherence.
+## If you want to dive right into the code, it is available [here](https://github.com/Lightricks/LTX-2).
+LTX-2 was presented in the paper [LTX-2: Efficient Joint Audio-Visual Foundation Model](https://huggingface.co/papers/2601.03233).
 LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
 | Name                               | Notes                                                                                                              |
 |------------------------------------|--------------------------------------------------------------------------------------------------------------------|
+| ltx-2.3-22b-dev                    | The full model, flexible and trainable in bf16                                                                     |
+| ltx-2.3-22b-distilled              | The distilled version of the full model, 8 steps, CFG=1                                                            |
+| ltx-2.3-22b-distilled-lora-384     | A LoRA version of the distilled model applicable to the full model                                                 |
 | ltx-2.3-spatial-upscaler-x2-1.0    | An x2 spatial upscaler for the ltx-2.3 latents, used in multi stage (multiscale) pipelines for higher resolution   |
 | ltx-2.3-spatial-upscaler-x1.5-1.0  | An x1.5 spatial upscaler for the ltx-2.3 latents, used in multi stage (multiscale) pipelines for higher resolution |
 | ltx-2.3-temporal-upscaler-x2-1.0   | An x2 temporal upscaler for the ltx-2.3 latents, used in multi stage (multiscale) pipelines for higher FPS         |
 ## Diffusers 🧨
+LTX-2.3 support in the [Diffusers Python library](https://huggingface.co/docs/diffusers/main/en/index) is coming soon!
 ## General tips:
 * Width & height settings must be divisible by 32. Frame count must be a multiple of 8 plus 1 (that is, 8k + 1, e.g. 121).
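
The size constraints in the tip above can be checked before launching a generation. The following is a minimal sketch, not part of the LTX-2 codebase: the helper names `check_ltx_dimensions` and `floor_to_valid_frame_count` are hypothetical, and the frame rule is read as "frame count equals 8k + 1", which matches the `num_frames=121` value used in the removed Diffusers example.

```python
def check_ltx_dimensions(width: int, height: int, num_frames: int) -> list[str]:
    """Return a list of problems; an empty list means the settings pass.

    Hypothetical helper: encodes only the two rules stated in the tips
    (width/height divisible by 32, frame count of the form 8*k + 1).
    """
    problems = []
    if width <= 0 or width % 32 != 0:
        problems.append(f"width={width} is not a positive multiple of 32")
    if height <= 0 or height % 32 != 0:
        problems.append(f"height={height} is not a positive multiple of 32")
    if num_frames < 1 or (num_frames - 1) % 8 != 0:
        problems.append(f"num_frames={num_frames} is not of the form 8*k + 1")
    return problems


def floor_to_valid_frame_count(num_frames: int) -> int:
    """Largest count of the form 8*k + 1 that does not exceed num_frames (minimum 1)."""
    k = max(0, (num_frames - 1) // 8)
    return 8 * k + 1


if __name__ == "__main__":
    # Values taken from the removed Diffusers example: 768x512, 121 frames.
    print(check_ltx_dimensions(768, 512, 121))
    # A request for 120 frames is rounded down to the nearest valid count.
    print(floor_to_valid_frame_count(120))
```

Rounding down rather than to the nearest value is a design choice made here so the adjusted request never exceeds what was asked for; rounding up would be equally valid under the stated rule.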