---
pipeline_tag: image-to-video
tags:
- image-to-video
- text-to-video
- video-to-video
- image-text-to-video
- audio-to-video
- text-to-audio
- video-to-audio
- audio-to-audio
- text-to-audio-video
- image-to-audio-video
- image-text-to-audio-video
- ltx-2
- ltx-video
- ltxv
- lightricks
pinned: true
language:
- en
- de
- es
- fr
- ja
- ko
- zh
- it
- pt
license: other
license_name: ltx-2-community-license-agreement
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
library_name: diffusers
demo: https://app.ltx.studio/ltx-2-playground/i2v
---
# LTX-2 Model Card
This model card focuses on the LTX-2 model; the codebase is available [here](https://github.com/Lightricks/LTX-2).
LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
[![LTX-2 Open Source](https://img.youtube.com/vi/8fWAJXZJbRA/maxresdefault.jpg)](https://www.youtube.com/watch?v=8fWAJXZJbRA)
# Model Checkpoints
| Name | Notes |
|--------------------------------|----------------------------------------------------------------------------------------------------------------|
| ltx-2-19b-dev | The full model, flexible and trainable in bf16 |
| ltx-2-19b-dev-fp8 | The full model in fp8 quantization |
| ltx-2-19b-dev-fp4 | The full model in nvfp4 quantization |
| ltx-2-19b-distilled | The distilled version of the full model, 8 steps, CFG=1 |
| ltx-2-19b-distilled-lora-384 | A LoRA version of the distilled model applicable to the full model |
| ltx-2-spatial-upscaler-x2-1.0 | An x2 spatial upscaler for the ltx-2 latents, used in multi stage (multiscale) pipelines for higher resolution |
| ltx-2-temporal-upscaler-x2-1.0 | An x2 temporal upscaler for the ltx-2 latents, used in multi stage (multiscale) pipelines for higher FPS |
## Model Details
- **Developed by:** Lightricks
- **Model type:** Diffusion-based audio-video foundation model
- **Language(s):** English
# Online demo
LTX-2 is accessible right away via the following links:
- [LTX-Studio text-to-video](https://app.ltx.studio/ltx-2-playground/t2v)
- [LTX-Studio image-to-video](https://app.ltx.studio/ltx-2-playground/i2v)
# Run locally
## Direct use license
You can use the models (full, distilled, upscalers, and any derivatives of them) for the purposes permitted under the [license](./LICENSE).
## ComfyUI
We recommend using the built-in LTXVideo nodes, available through the ComfyUI Manager.
For manual installation information, please refer to our [documentation site](https://docs.ltx.video/open-source-model/integration-tools/comfy-ui).
## PyTorch codebase
The [LTX-2 codebase](https://github.com/Lightricks/LTX-2) is a monorepo with several packages, from the model definition in `ltx-core` to pipelines in `ltx-pipelines` and training capabilities in `ltx-trainer`.
The codebase was tested with Python >= 3.12 and CUDA > 12.7, and supports PyTorch ~= 2.7.
### Installation
```bash
git clone https://github.com/Lightricks/LTX-2.git
cd LTX-2
# From the repository root
uv sync
source .venv/bin/activate
```
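Once the environment is active, a quick sanity check against the tested versions can save debugging time. This is a convenience sketch, not part of the LTX-2 codebase:
```python
# Check the environment against the versions the codebase was tested with.
# Convenience sketch only; not part of the LTX-2 codebase.
import sys

import torch

assert sys.version_info >= (3, 12), "LTX-2 was tested with Python >= 3.12"
print("PyTorch:", torch.__version__)      # tested with PyTorch ~= 2.7
print("CUDA build:", torch.version.cuda)  # tested with CUDA > 12.7
print("CUDA available:", torch.cuda.is_available())
```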
### Inference
To use our model, please follow the instructions in our [ltx-pipelines](https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/README.md) package.
## Diffusers 🧨
LTX-2 is supported in the [Diffusers Python library](https://huggingface.co/docs/diffusers/main/en/index) for image-to-video generation.
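Below is a minimal image-to-video sketch. The repo id `Lightricks/LTX-2`, the generic `DiffusionPipeline` loader, and the argument names (`prompt`, `image`, `width`, `height`, `num_frames`) are assumptions that follow the usual Diffusers conventions; check the Diffusers documentation for the exact pipeline class and signature.
```python
# Minimal image-to-video sketch via Diffusers. Assumptions to verify against
# the Diffusers docs: the repo id "Lightricks/LTX-2" and the standard
# image-to-video argument names used below.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipe = DiffusionPipeline.from_pretrained(
    "Lightricks/LTX-2", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = load_image("input.png")
frames = pipe(
    prompt="A small boat drifting across a calm lake at sunrise",
    image=image,
    width=768,        # divisible by 32
    height=512,       # divisible by 32
    num_frames=121,   # a multiple of 8 plus 1
).frames[0]
export_to_video(frames, "output.mp4", fps=24)
```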
## General tips
* Width & height settings must be divisible by 32. Frame count must be divisible by 8 + 1.
* In case the resolution or number of frames are not divisible by 32 or 8 + 1, the input should be padded with -1 and then cropped to the desired resolution and number of frames.
* For tips on writing effective prompts, please visit our [Prompting guide](https://ltx.video/blog/how-to-prompt-for-ltx-2)
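The helpers below (an illustrative sketch with hypothetical names, not part of the LTX-2 API) round a requested size down to the nearest valid settings:
```python
# Round requested dimensions and frame counts down to valid LTX-2 settings.
# Illustrative helpers only; hypothetical names, not part of the LTX-2 API.

def valid_dimension(size: int) -> int:
    """Nearest width/height <= size that is divisible by 32."""
    return max(32, (size // 32) * 32)

def valid_frame_count(frames: int) -> int:
    """Nearest frame count <= frames of the form 8k + 1."""
    return max(9, ((frames - 1) // 8) * 8 + 1)

print(valid_dimension(1080))   # 1056
print(valid_frame_count(120))  # 113
```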
### Limitations
- This model is not intended or able to provide factual information.
- As a statistical model, this checkpoint might amplify existing societal biases.
- The model may fail to generate videos that match the prompt perfectly.
- Prompt following is heavily influenced by the prompting style.
- The model may generate content that is inappropriate or offensive.
- When generating audio without speech, the audio may be of lower quality.
# Train the model
The base (dev) model is fully trainable.
The LoRAs and IC-LoRAs we publish with the model are straightforward to reproduce by following the instructions in the [LTX-2 Trainer README](https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-trainer/README.md).
Training for motion, style, or likeness (sound + appearance) can take less than an hour in many settings.