LTX-2.3 (Diffusers)

Diffusers-format weights for Lightricks/LTX-2.3 — a DiT-based foundation model that jointly generates synchronized video and audio.

A distilled variant that samples in 8 steps with classifier-free guidance disabled (CFG=1) is available at diffusers/LTX-2.3-Distilled-Diffusers.

Usage

Requires a recent build of diffusers with LTX-2 support:

pip install -U git+https://github.com/huggingface/diffusers

Text-to-video + audio

import torch
from diffusers import LTX2Pipeline
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT

pipe = LTX2Pipeline.from_pretrained(
    "diffusers/LTX-2.3-Diffusers", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()

prompt = "A flowing river in a forest at golden hour, gentle wind in the leaves."
frame_rate = 24.0

video, audio = pipe(
    prompt=prompt,
    negative_prompt=DEFAULT_NEGATIVE_PROMPT,
    width=768,
    height=512,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=30,
    guidance_scale=3.0,
    output_type="np",
    return_dict=False,
)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_t2v.mp4",
)
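The call above requests 121 frames at 24 fps, which is roughly 5 seconds of video. As a rough sketch, a frame count for a target duration can be derived from the 8k + 1 constraint listed under Notes. The helper name frames_for_duration is hypothetical and is not part of diffusers.

```python
import math


def frames_for_duration(seconds: float, fps: float) -> int:
    """Return the frame count of the form 8k + 1 closest to seconds * fps.

    Hypothetical helper for illustration; not a diffusers API.
    """
    if seconds < 0 or fps <= 0:
        raise ValueError("seconds must be >= 0 and fps must be > 0")
    # Round the number of 8-frame groups to the nearest integer (half rounds up).
    k = math.floor(seconds * fps / 8 + 0.5)
    return 8 * k + 1


# Example: a 5-second clip at 24 fps.
num_frames = frames_for_duration(5.0, 24.0)
print(num_frames, num_frames / 24.0)
```

The result can be passed as num_frames to the pipeline call; the actual clip length is num_frames / frame_rate, so it may differ from the requested duration by a few frames.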

First-last-frame-to-video (FLF2V)

import torch
from diffusers import LTX2ConditionPipeline
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT
from diffusers.utils import load_image

pipe = LTX2ConditionPipeline.from_pretrained(
    "diffusers/LTX-2.3-Diffusers", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()

first_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
last_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")

# Pin the first image to frame 0 and the last image to the final frame.
conditions = [
    LTX2VideoCondition(frames=first_image, index=0, strength=1.0),
    LTX2VideoCondition(frames=last_image, index=-1, strength=1.0),
]

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings."
frame_rate = 24.0

# With return_dict=False the call returns a tuple, unpacked here as in the
# text-to-video example.
video, audio = pipe(
    conditions=conditions,
    prompt=prompt,
    negative_prompt=DEFAULT_NEGATIVE_PROMPT,
    width=768,
    height=512,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=40,
    guidance_scale=4.0,
    output_type="np",
    return_dict=False,
)

# Save the result the same way as the other examples.
encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_flf2v.mp4",
)

IC-LoRA (camera control)

import torch
from diffusers import LTX2InContextPipeline
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT

pipe = LTX2InContextPipeline.from_pretrained(
    "diffusers/LTX-2.3-Diffusers", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
pipe.load_lora_weights(
    "Lightricks/LTX-2-19b-LoRA-Camera-Control-Dolly-In",
    adapter_name="ic_lora",
    weight_name="ltx-2-19b-lora-camera-control-dolly-in.safetensors",
)
pipe.set_adapters("ic_lora", 1.0)

prompt = "A flowing river in a forest"
frame_rate = 24.0

video, audio = pipe(
    prompt=prompt,
    negative_prompt=DEFAULT_NEGATIVE_PROMPT,
    width=768,
    height=512,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=30,
    guidance_scale=3.0,
    output_type="np",
    return_dict=False,
)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_ic_lora.mp4",
)

Notes

  • width and height must each be divisible by 32, and num_frames must have the form 8k + 1 for a non-negative integer k (for example, 121 = 8 × 15 + 1).
  • See the Diffusers LTX-2 docs for multimodal guidance, prompt enhancement, and the upscaling/refinement pipeline.
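
The shape constraints in the first note can be checked before calling a pipeline. The following is a minimal sketch in plain Python; the helper names check_ltx2_shape and snap_ltx2_shape are hypothetical and are not part of diffusers, and rounding down to the nearest valid value is just one possible policy.

```python
def check_ltx2_shape(width: int, height: int, num_frames: int) -> list[str]:
    """Return a list of problems; an empty list means the shape is valid.

    Hypothetical helper for illustration; not a diffusers API.
    """
    problems = []
    if width <= 0 or width % 32 != 0:
        problems.append(f"width={width} is not a positive multiple of 32")
    if height <= 0 or height % 32 != 0:
        problems.append(f"height={height} is not a positive multiple of 32")
    if num_frames < 1 or (num_frames - 1) % 8 != 0:
        problems.append(f"num_frames={num_frames} is not of the form 8k + 1")
    return problems


def snap_ltx2_shape(width: int, height: int, num_frames: int) -> tuple[int, int, int]:
    """Round each value down to the nearest valid one (assumed policy)."""
    snapped_width = max(32, width // 32 * 32)
    snapped_height = max(32, height // 32 * 32)
    snapped_frames = max(1, (num_frames - 1) // 8 * 8 + 1)
    return snapped_width, snapped_height, snapped_frames


# The settings used in the examples are already valid.
print(check_ltx2_shape(768, 512, 121))
# An invalid request, and the values it snaps to.
print(check_ltx2_shape(770, 500, 120))
print(snap_ltx2_shape(770, 500, 120))
```

Snapping changes the aspect ratio and duration slightly, so it may be preferable to raise an error on invalid input instead of adjusting it silently.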

License

These weights are released under the LTX Video 2 Open Source License.
