LTX-2.3 Model Card

This model card focuses on the LTX-2.3 model, which is a significant update to the LTX-2 model with improved audio and visual quality as well as enhanced prompt adherence. LTX-2 was presented in the paper LTX-2: Efficient Joint Audio-Visual Foundation Model.

💻💻 If you want to dive right into the code, it is available here. 💾💾

LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.

Model Checkpoints

Name	Notes
ltx-2.3-22b-dev	The full model, flexible and trainable in bf16
ltx-2.3-22b-distilled	The distilled version of the full model, 8 steps, CFG=1
ltx-2.3-22b-distilled-1.1	The distilled v1.1 version of the full model, 8 steps, CFG=1 - A different aesthetic experience and improved audio compared to v1.0
ltx-2.3-22b-distilled-lora-384	A LoRA version of the distilled model applicable to the full model
ltx-2.3-22b-distilled-lora-384-1.1	A LoRA version of the v1.1 distilled model applicable to the full model
ltx-2.3-spatial-upscaler-x2-1.1	An x2 spatial upscaler for the ltx-2.3 latents, used in multi stage (multiscale) pipelines for higher resolution
ltx-2.3-spatial-upscaler-x1.5-1.0	An x1.5 spatial upscaler for the ltx-2.3 latents, used in multi stage (multiscale) pipelines for higher resolution
ltx-2.3-temporal-upscaler-x2-1.0	An x2 temporal upscaler for the ltx-2.3 latents, used in multi stage (multiscale) pipelines for higher FPS

Model Details

Developed by: Lightricks
Model type: Diffusion-based audio-video foundation model
Language(s): English

Online demo

LTX-2.3 is accessible right away via the API Playground.

Run locally

Direct use license

You can use the models (full, distilled, upscalers, and any derivatives of the models) for the purposes permitted under the license.

ComfyUI

We recommend you use the built-in LTXVideo nodes that can be found in the ComfyUI Manager. For manual installation information, please refer to our documentation site.

PyTorch codebase

The LTX-2 codebase is a monorepo with several packages, from the model definition in 'ltx-core' to pipelines in 'ltx-pipelines' and training capabilities in 'ltx-trainer'. The codebase was tested with Python >=3.12 and CUDA version >12.7, and supports PyTorch ~= 2.7.
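
Before installing, it can help to check an existing environment against the versions listed above. The snippet below is a minimal sketch, not part of the LTX-2 codebase: the helper names (`parse_version`, `python_ok`, `torch_ok`, `cuda_ok`) are hypothetical, and `~= 2.7` is read as the usual compatible-release range (at least 2.7, below 3.0).

```python
import sys


def parse_version(text):
    """Turn a string such as '2.7.1+cu128' into (2, 7, 1).

    Any local build suffix after '+' and any non-numeric tail are ignored.
    """
    core = text.split("+", 1)[0]
    parts = []
    for piece in core.split("."):
        leading = ""
        for ch in piece:
            if not ch.isdigit():
                break
            leading += ch
        if not leading:
            break
        parts.append(int(leading))
    return tuple(parts)


def python_ok(version_info=sys.version_info):
    """Python >= 3.12."""
    return tuple(version_info[:2]) >= (3, 12)


def torch_ok(version):
    """PyTorch ~= 2.7, read as: at least 2.7 and still in the 2.x series."""
    parsed = parse_version(version)
    return (2, 7) <= parsed[:2] and parsed[0] == 2


def cuda_ok(version):
    """CUDA strictly newer than 12.7; None means a CPU-only PyTorch build."""
    if version is None:
        return False
    return parse_version(version)[:2] > (12, 7)


if __name__ == "__main__":
    print("Python OK:", python_ok())
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print("PyTorch OK:", torch_ok(torch.__version__))
        print("CUDA OK:", cuda_ok(torch.version.cuda))
```

The checks only read version strings, so they run without a GPU.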

Installation

git clone https://github.com/Lightricks/LTX-2.git
cd LTX-2

# From the repository root
uv sync
source .venv/bin/activate

Inference

To use our model, please follow the instructions in our ltx-pipelines package.

Diffusers 🧨

LTX-2.3 support in the Diffusers Python library is coming soon!

General tips:

Width and height must be divisible by 32. The frame count must be a multiple of 8 plus 1 (that is, of the form 8k + 1, such as 97 or 121).
If the resolution or number of frames does not meet these constraints, the input should be padded with -1 and then cropped to the desired resolution and number of frames.
For tips on writing effective prompts, please visit our Prompting guide
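
The size constraints above can be sketched as a small helper that rounds a requested size up to the nearest valid one. This is an illustrative sketch only: the function names are hypothetical and not part of ltx-pipelines, and the actual padding with -1 and the final crop are left to the pipeline.

```python
def round_up_to_multiple(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def valid_frame_count(frames):
    """Smallest frame count of the form 8*k + 1 that is >= frames."""
    if frames <= 1:
        return 1
    return round_up_to_multiple(frames - 1, 8) + 1


def padded_shape(width, height, frames):
    """Return the (width, height, frames) to pad to before generation.

    Width and height are rounded up to multiples of 32 and the frame
    count to the next value of the form 8*k + 1. The result is then
    cropped back to the requested size.
    """
    return (
        round_up_to_multiple(width, 32),
        round_up_to_multiple(height, 32),
        valid_frame_count(frames),
    )


if __name__ == "__main__":
    # 720 is not divisible by 32 and 120 is not of the form 8*k + 1.
    print(padded_shape(1280, 720, 120))  # (1280, 736, 121)
```

Rounding up rather than down keeps every requested pixel and frame inside the padded volume, so the crop can always recover the requested size.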

Limitations

This model is not intended or able to provide factual information.
As a statistical model this checkpoint might amplify existing societal biases.
The model may fail to generate videos that match the prompts perfectly.
Prompt following is heavily influenced by the prompting style.
The model may generate content that is inappropriate or offensive.
When generating audio without speech, the audio may be of lower quality.

Train the model

The base (dev) model is fully trainable.

It's extremely easy to reproduce the LoRAs and IC-LoRAs we publish with the model by following the instructions on the LTX-2 Trainer Readme.

Training for motion, style or likeness (sound+appearance) can take less than an hour in many settings.

Citation

@article{hacohen2025ltx2,
  title={LTX-2: Efficient Joint Audio-Visual Foundation Model},
  author={HaCohen, Yoav and Brazowski, Benny and Chiprut, Nisan and Bitterman, Yaki and Kvochko, Andrew and Berkowitz, Avishai and Shalem, Daniel and Lifschitz, Daphna and Moshe, Dudu and Porat, Eitan and Richardson, Eitan and Guy Shiran and Itay Chachy and Jonathan Chetboun and Michael Finkelson and Michael Kupchick and Nir Zabari and Nitzan Guetta and Noa Kotler and Ofir Bibi and Ori Gordon and Poriya Panet and Roi Benita and Shahar Armon and Victor Kulikov and Yaron Inger and Yonatan Shiftan and Zeev Melumian and Zeev Farbman},
  journal={arXiv preprint arXiv:2601.03233},
  year={2025}
}

Sulphur 2

An uncensored video generation model based on LTX 2.3 that natively supports both T2V and I2V, as well as all of the other LTX 2.3 formats.

Join our Discord

Support the next version of the project. Even just a few dollars would go a long way: Kofi

Get Started: To get started with the model, I recommend downloading either of the dev versions (fp8mixed or bf16) along with the provided distill LoRA. I'm aware the workflows currently contain sulphur_final. Use either the LoRA or the full models, not both at the same time.

This model includes a prompt enhancer. The easiest way to get started with it is through LM Studio. Go to your model folder inside LM Studio and open it in your file explorer. Create a folder named "Sulphur", then a folder inside it called "promptenhancer", and place the gguf file and the mmproj file in that folder. Once you've done that, you should be able to load the prompt enhancer in LM Studio. There is no system prompt for it; just send the text (and an image) you'd like to be enhanced.
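
The folder layout described above can also be created with a short script. This is a minimal sketch under stated assumptions: `install_prompt_enhancer` is a hypothetical helper, and the model folder path and both file names must be supplied by you, since they depend on your LM Studio setup and on the files you downloaded.

```python
import shutil
from pathlib import Path


def install_prompt_enhancer(models_dir, gguf_file, mmproj_file):
    """Copy the prompt enhancer files into <models_dir>/Sulphur/promptenhancer.

    models_dir  -- the LM Studio model folder (assumption: you look this
                   up yourself, as it varies between installations)
    gguf_file   -- path to the downloaded gguf file
    mmproj_file -- path to the downloaded mmproj file
    Returns the folder the files were copied into.
    """
    target = Path(models_dir) / "Sulphur" / "promptenhancer"
    target.mkdir(parents=True, exist_ok=True)
    for source in (gguf_file, mmproj_file):
        shutil.copy2(source, target / Path(source).name)
    return target


# Example call with placeholder paths (replace all three with your own):
# install_prompt_enhancer(
#     "/path/to/lmstudio/models",
#     "/path/to/downloads/enhancer.gguf",
#     "/path/to/downloads/enhancer-mmproj.gguf",
# )
```

Both files are copied rather than moved, so the original downloads stay in place.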

*As a note, this readme will soon contain better setup instructions and a guide to training on top of the model.

Links

(CivitAI Base Model) -
(CivitAI Quant Model) -

Credits

(TenStrip) — Testing & model merging (His i2v merge of sulphur 2, highly recommend for i2v)
@s1lv3rc01n — Testing & model merging/quantizing (silveroxides)
@mov7162 — Musubi Tuner guidance
And many others. If you'd like to be in the credits and I didn't list you here, message me; I likely assumed you didn't want to be listed.

Funders

Anonymous funder #1 — Supported the original Sulphur
Anonymous funder #2 — Made Sulphur 2 possible; this model wouldn't exist without them

Thank you to everyone who contributed.

10 Eros

v1.2 Changelog: Leveraged tuned connector data to reduce face drift and aid long prompts/director. Also uses sulphur EXP weights on top of v1 to hone the most explicit motions. All common issues from v1, such as mistaken extra anatomy, subtitles, and unexpected transitions, are still present.

https://huggingface.co/TenStrip/LTX2.3-10Eros_Workflows

Quants: https://huggingface.co/vantagewithai/LTX2.3-10Eros-GGUF/tree/main

Nodes: https://github.com/TenStrip/10S-Comfy-nodes

Reliant on https://huggingface.co/SulphurAI/Sulphur-2-base

This is a different merge attempt aimed at ideal I2V use. It uses layer-scaled merges of different steps; it is not a straight weight merge. It behaves much better than loading a LoRA and respects the prompt. Prompts should be enhanced: LTX has very little self-reasoning and input when it is conditioned, so the first frame and all following motions, evolutions, and audio must be commanded. You will get nothing if you don't ask for it.

BF16 loads as a checkpoint with clip and VAEs.

Fp8_mixed_learned is the better FP8 version and is a full checkpoint as well, quantized by S1LV3RC01N.

The Kijai split files are for the 10Eros FP8 Transformer version, but it has a different structure and variance. That one goes inside diffusion_models: https://huggingface.co/Kijai/LTX2.3_comfy/tree/main
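
As a rough sketch of where each variant would go in a standard ComfyUI install: the notes above only name diffusion_models explicitly, so the `checkpoints` folder for the two full checkpoints is an assumption based on ComfyUI's default `models/` layout, and the variant keys and helper name are hypothetical.

```python
from pathlib import Path

# Assumed mapping from variant to ComfyUI models subfolder.
# "checkpoints" is inferred from "loads as a checkpoint" and is not
# stated outright in the notes above.
MODEL_SUBFOLDERS = {
    "bf16": "checkpoints",
    "fp8_mixed_learned": "checkpoints",
    "kijai_fp8_transformer": "diffusion_models",
}


def comfy_destination(comfy_root, variant):
    """Return the folder under <comfy_root>/models where a variant belongs."""
    try:
        subfolder = MODEL_SUBFOLDERS[variant]
    except KeyError:
        raise ValueError(f"unknown variant: {variant!r}") from None
    return Path(comfy_root) / "models" / subfolder


if __name__ == "__main__":
    for name in MODEL_SUBFOLDERS:
        print(name, "->", comfy_destination("ComfyUI", name))
```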

!!! Larger distilled LoRAs will harm the model's fine-tune; try the cond_safe ones: https://huggingface.co/TenStrip/LTX2.3_Distilled_Lora_1.1_Experiments/tree/main

For prompt enhancement, try this foreword in Grok or an uncensored LLM:

Generate a video scene script with a description based on the attached image for an LLM that has a tokenizer that uses interleaved attention to support long-context understanding that is fed into a multimodal video model. Strict specification, follow up to the word: No timestamps. No unnecessary embellishment. Output only plain English text and make it a copy box.

First, describe the image initial scene in concise natural language; subject(s), subject(s) appearance, subject(s) composition and pose, background, and context.

Next, formulate a naturally evolving scenario that would take place describing every moving body part, composition change, and manipulation from the uploaded initial frame that would be reflected in the video models post-latent evolution output. If the image is explicit or sexual in nature, use full anatomical terminology and spice it up slightly with visually representable erotic themes.

Center the prompt around this basic idea: [ concept ]

Interweave this dialogue or sound concept into the scene with descriptions of voice tone followed by the lines delivered in quotations, in a temporal sequence between or during motions. Dialogue should be concise and non-rambling, as rambling dialogue will take away from video quality: [ dialogue ]

Inside that prompt describe only notable audio and audio cues, both normal and explicit; background noise as well as foley and natural sounds. In a temporal sequence paired with coinciding motions. In the case of absent dialogue or soundscapes and only if background music is fitting; describe a fitting genre and melodic tone with matching mood.

Output only text following above instruction. Follow-up suggestions should be on the topic of expanding or changing motion or dialogue from the output text.

LTX 2.3 Music Video Creator V5.1

ComfyUI workflows for creating music videos with LTX 2.3. This release includes a prompt-creation workflow plus both text-to-video and image-to-video music video workflows.

These workflows are designed for creators who want a fast and almost fully automated setup for building cinematic music video clips, generating scene prompts, adding optional LoRAs, and controlling advanced prompt details.

Included Workflows

Important: You must run the Prompt Creator workflow first before using the T2V or I2V video workflows.

LTX2.3_Music_Video_Creator_Prompt_Creator_V5.json
LTX2.3_Music_Video_Creator_T2V_V5.1.json
LTX2.3_Music_Video_Creator_I2V_V5.1.json

A full walkthrough video covers the entire process. Please watch the full video and follow along.

Sample Videos

These samples were created with the LTX 2.3 Music Video Creator workflows.

Sample 1 - Text to video using my Lux_Sensual Style LoRA. Light Post-Editing

This sample includes light post-editing in CapCut. I ran the workflow a few times to get different shots, then edited the final version together.