LTX Video Spatial Upscaler 0.9.7 Model Card
This model card focuses on the LTX Video Spatial Upscaler 0.9.7, a component model designed to work in conjunction with the LTX-Video generation models. The main LTX-Video codebase is available here.
LTX-Video is the first DiT-based video generation model capable of generating high-quality videos in real-time. It produces 30 FPS videos at a 1216×704 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content. We provide a model for both text-to-video and image+text-to-video use cases.
The LTX Video Spatial Upscaler is a diffusion-based model that enhances the spatial resolution of videos. It is specifically trained to upscale the latent representations of videos generated by LTX Video models.
This upscaler model is compatible with both of the following models and can be used to improve the quality of the videos they generate:
- Lightricks/LTX-Video-0.9.7-dev
- Lightricks/LTX-Video-0.9.7-distilled
Model Details
- Developed by: Lightricks
- Model type: Latent Diffusion Video Spatial Upscaler
- Input: Latent video frames from an LTX Video model.
- Output: Higher-resolution latent video frames.
- Compatibility: can be used with Lightricks/LTX-Video-0.9.7-dev and Lightricks/LTX-Video-0.9.7-distilled.
Usage
Direct use
You can use the model for purposes under the license:
- 2B version 0.9 license
- 2B version 0.9.1 license
- 2B version 0.9.5 license
- 2B version 0.9.6-dev license
- 2B version 0.9.6-distilled license
- 13B version 0.9.7-dev license
- 13B version 0.9.7-dev-fp8 license
- 13B version 0.9.7-distilled license
- 13B version 0.9.7-distilled-fp8 license
- 13B version 0.9.7-distilled-lora128 license
- Temporal upscaler version 0.9.7 license
- Spatial upscaler version 0.9.7 license
General tips:
- The model works on resolutions that are divisible by 32 and on frame counts of the form 8k + 1 (e.g. 257). If the resolution or number of frames does not meet these constraints, the input will be padded with -1 and then cropped to the desired resolution and number of frames.
- The model works best on resolutions under 720 x 1280 and with fewer than 257 frames.
- Prompts should be in English. The more elaborate, the better. A good prompt looks like:
The turquoise waves crash against the dark, jagged rocks of the shore, sending white foam spraying into the air. The scene is dominated by the stark contrast between the bright blue water and the dark, almost black rocks. The water is a clear, turquoise color, and the waves are capped with white foam. The rocks are dark and jagged, and they are covered in patches of green moss. The shore is lined with lush green vegetation, including trees and bushes. In the background, there are rolling hills covered in dense forest. The sky is cloudy, and the light is dim.
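The resolution and frame-count constraints from the first tip can be checked before calling the model. The following is a minimal sketch: the helper names `snap_resolution` and `snap_num_frames` are hypothetical and not part of the LTX-Video codebase, and rounding down (instead of relying on the padding and cropping described above) is an assumption made for illustration.

```python
def snap_resolution(height: int, width: int, multiple: int = 32) -> tuple[int, int]:
    """Round height and width down to the nearest multiple of `multiple`."""
    if height < multiple or width < multiple:
        raise ValueError(f"height and width must be at least {multiple}")
    return height - height % multiple, width - width % multiple


def snap_num_frames(num_frames: int, stride: int = 8) -> int:
    """Round a frame count down to the nearest value of the form stride * k + 1."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return (num_frames - 1) // stride * stride + 1


print(snap_resolution(720, 1280))  # (704, 1280)
print(snap_num_frames(260))        # 257
```

Values that already satisfy the constraints, such as 161 frames, pass through unchanged.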
Online demo
The model is accessible right away via the following links:
- LTX-Studio image-to-video
- Fal.ai text-to-video
- Fal.ai image-to-video
- Replicate text-to-video and image-to-video
ComfyUI
To use our model with ComfyUI, please follow the instructions at a dedicated ComfyUI repo.
Run locally
Installation
The codebase was tested with Python 3.10.5, CUDA version 12.2, and supports PyTorch >= 2.1.2.
git clone https://github.com/Lightricks/LTX-Video.git
cd LTX-Video
# create env
python -m venv env
source env/bin/activate
python -m pip install -e .\[inference-script\]
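Once the environment is set up, a quick check against the tested PyTorch version can save debugging time. This is a minimal sketch: `version_tuple` and `meets_minimum` are hypothetical helpers written for this example, and the check only covers PyTorch and CUDA availability, not the other dependencies.

```python
import re


def version_tuple(version: str) -> tuple[int, ...]:
    """Turn a string such as '2.1.2+cu121' into (2, 1, 2), ignoring build suffixes."""
    return tuple(int(part) for part in re.findall(r"\d+", version.split("+")[0])[:3])


def meets_minimum(version: str, minimum: str = "2.1.2") -> bool:
    """Return True if `version` is at least `minimum`."""
    return version_tuple(version) >= version_tuple(minimum)


try:
    import torch

    status = "OK" if meets_minimum(torch.__version__) else "older than 2.1.2"
    print("PyTorch", torch.__version__, status)
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch is not installed in this environment")
```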
Inference
To use our model, please follow the inference code in inference.py.
Diffusers 🧨
LTX Video is compatible with the Diffusers Python library. It supports both text-to-video and image-to-video generation.
Make sure you install diffusers before trying out the examples below.
pip install -U git+https://github.com/huggingface/diffusers
The LTX Video Spatial Upscaler is used via the LTXLatentUpsamplePipeline in the diffusers library. It is intended to be part of a multi-stage generation process.
Below is an example demonstrating how to use the spatial upsampler with a base LTX Video model (either the 'dev' or 'distilled' version).
import torch
from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
from diffusers.utils import export_to_video, load_video
# Choose your base LTX Video model:
# base_model_id = "Lightricks/LTX-Video-0.9.7-dev"
base_model_id = "Lightricks/LTX-Video-0.9.7-distilled" # Using distilled for this example
# 0. Load base model and upsampler
pipe = LTXConditionPipeline.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained(
    "Lightricks/ltxv-spatial-upscaler-0.9.7",
    vae=pipe.vae,
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")
pipe_upsample.to("cuda")
def round_to_nearest_resolution_acceptable_by_vae(height, width):
    # Height and width must be multiples of the VAE's spatial compression ratio
    height = height - (height % pipe.vae_spatial_compression_ratio)
    width = width - (width % pipe.vae_spatial_compression_ratio)
    return height, width
video = load_video(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
)[:21]  # Use only the first 21 frames as conditioning
condition1 = LTXVideoCondition(video=video, frame_index=0)
prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
expected_height, expected_width = 768, 1152
downscale_factor = 2 / 3
num_frames = 161
# Part 1. Generate video at smaller resolution
downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
latents = pipe(
    conditions=[condition1],
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=downscaled_width,
    height=downscaled_height,
    num_frames=num_frames,
    num_inference_steps=30,
    generator=torch.Generator().manual_seed(0),
    output_type="latent",
).frames
# Part 2. Upscale generated video using latent upsampler with fewer inference steps
# The available latent upsampler upscales the height/width by 2x
upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
upscaled_latents = pipe_upsample(
    latents=latents,
    output_type="latent"
).frames
# Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
video = pipe(
    conditions=[condition1],
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=upscaled_width,
    height=upscaled_height,
    num_frames=num_frames,
    denoise_strength=0.4,  # Effectively, 4 inference steps out of 10
    num_inference_steps=10,
    latents=upscaled_latents,
    decode_timestep=0.05,
    image_cond_noise_scale=0.025,
    generator=torch.Generator().manual_seed(0),
    output_type="pil",
).frames[0]
# Part 4. Downscale the video to the expected resolution
video = [frame.resize((expected_width, expected_height)) for frame in video]
export_to_video(video, "output.mp4", fps=24)
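To make the size arithmetic in the example concrete, the sketch below traces the dimensions through each stage. It assumes the VAE compresses height and width by 32 and time by 8, which is consistent with the divisibility rules in the general tips but is not stated explicitly in this card. The function name `plan_stages` is hypothetical, and integer arithmetic stands in for the floating-point `2 / 3` factor to sidestep rounding concerns.

```python
def plan_stages(expected_height, expected_width, num_frames,
                spatial_ratio=32, temporal_ratio=8):
    """Trace pixel and latent sizes through the generate -> upscale stages."""
    # Part 1: generate at roughly 2/3 of the target size, snapped down to the VAE grid.
    low_h = expected_height * 2 // 3
    low_w = expected_width * 2 // 3
    low_h -= low_h % spatial_ratio
    low_w -= low_w % spatial_ratio

    # Part 2: the latent upsampler doubles height and width; the frame count is unchanged.
    up_h, up_w = low_h * 2, low_w * 2

    # Assumed latent layout: (frames, height, width) after VAE compression.
    latent_frames = (num_frames - 1) // temporal_ratio + 1
    return {
        "generate": (low_h, low_w),
        "upscaled": (up_h, up_w),
        "latent_before": (latent_frames, low_h // spatial_ratio, low_w // spatial_ratio),
        "latent_after": (latent_frames, up_h // spatial_ratio, up_w // spatial_ratio),
    }


plan = plan_stages(768, 1152, 161)
print(plan["generate"])  # (512, 768)
print(plan["upscaled"])  # (1024, 1536)
```

With the example's settings, the video is generated at 512×768, upscaled to 1024×1536, and resized down to 768×1152 in Part 4.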
For more details and inference examples using 🧨 diffusers, check out the diffusers documentation.
Diffusers also supports directly loading from the original LTX checkpoints using the from_single_file() method. Check out this section to learn more.
To learn more, check out the official documentation.
Limitations
- This model is not intended or able to provide factual information.
- As a statistical model this checkpoint might amplify existing societal biases.
- The model may fail to generate videos that match the prompts perfectly.
- Prompt following is heavily influenced by the prompting style.