Update README.md

README.md

We provide a model for both text-to-video as well as image+text-to-video use cases.

<img src="./media/trailer.gif" alt="trailer" width="512">

### Image-to-video examples
| | | |
|:---:|:---:|:---:|
|  |  |  |
|  |  |  |
|  |  |  |

# Models & Workflows

The model is accessible right away via the following links:
- [LTX-Studio image-to-video (13B-mix)](https://app.ltx.studio/motion-workspace?videoModel=ltxv-13b)
- [LTX-Studio image-to-video (13B distilled)](https://app.ltx.studio/motion-workspace?videoModel=ltxv)
- [Fal.ai image-to-video (13B full)](https://fal.ai/models/fal-ai/ltx-video-13b-dev/image-to-video)
- [Fal.ai image-to-video (13B distilled)](https://fal.ai/models/fal-ai/ltx-video-13b-distilled/image-to-video)
- [Replicate text-to-video and image-to-video](https://replicate.com/lightricks/ltx-video)

### ComfyUI

To use our model, please follow the inference code in [inference.py](https://github.com/Lightricks/LTX-Video/blob/main/inference.py):

#### For image-to-video generation:

```bash
python inference.py --prompt "PROMPT" --input_image_path IMAGE_PATH --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml
```
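
For instance, a hypothetical invocation with the placeholders filled in (the prompt, image path, resolution, frame count, and seed below are illustrative values only, not recommendations from the authors):

```bash
python inference.py --prompt "A penguin picks up a book and starts reading it" \
    --input_image_path penguin.png \
    --height 480 --width 832 --num_frames 121 --seed 42 \
    --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml
```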

#### For video generation with multiple conditions:

You can now generate a video conditioned on a set of images and/or short video segments. Simply provide a list of paths to the images or video segments you want to condition on, along with their target frame numbers in the generated video. You can also specify the conditioning strength for each item (default: 1.0).

```bash
python inference.py --prompt "PROMPT" --conditioning_media_paths IMAGE_OR_VIDEO_PATH_1 IMAGE_OR_VIDEO_PATH_2 --conditioning_start_frames TARGET_FRAME_1 TARGET_FRAME_2 --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml
```
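
As a concrete sketch, conditioning on an image placed at the first frame and on a short clip starting at frame 60 might look as follows (all paths, frame numbers, and sizes here are hypothetical):

```bash
python inference.py --prompt "PROMPT" \
    --conditioning_media_paths first_frame.png mid_scene_clip.mp4 \
    --conditioning_start_frames 0 60 \
    --height 480 --width 832 --num_frames 121 --seed 42 \
    --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml
```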

### Diffusers 🧨

Make sure you have the latest Diffusers installed first: `pip install -U git+https://github.com/huggingface/diffusers`

Now, you can run the examples below (note that the upsampling stage is optional but recommended):

### For image-to-video:

```py
import torch
from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
from diffusers.utils import export_to_video, load_image, load_video

pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
# ...

def round_to_nearest_resolution_acceptable_by_vae(height, width):
    height = height - (height % pipe.vae_spatial_compression_ratio)
    width = width - (width % pipe.vae_spatial_compression_ratio)
    return height, width

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png")
video = load_video(export_to_video([image]))  # compress the image using video compression as the model was trained on videos
condition1 = LTXVideoCondition(video=video, frame_index=0)

prompt = "A cute little penguin takes out a book and starts reading it"
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
expected_height, expected_width = 480, 832
downscale_factor = 2 / 3
num_frames = 96

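# A concrete walk-through of the arithmetic below, assuming (not stated in
# this snippet) that the VAE's spatial compression ratio is 32 and that the
# latent upsampler doubles the resolution: 480x832 scaled by 2/3 gives
# 320x554, the rounding helper brings that to 320x544, upsampling yields
# 640x1088, and the final frames are resized back to the expected 480x832.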
# Part 1. Generate video at smaller resolution
downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
latents = pipe(
    conditions=[condition1],
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=downscaled_width,
    # ...
)

# Part 2. Upscale the latents with the spatial upsampler
upscaled_latents = pipe_upsample(
    # ...
)

# Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
video = pipe(
    conditions=[condition1],
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=upscaled_width,
    # ...
)

# Resize back down to the expected resolution and export
video = [frame.resize((expected_width, expected_height)) for frame in video]

export_to_video(video, "output.mp4", fps=24)
```
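
The CLI section above shows that generation can also be conditioned on several images or video segments at once. Since the Diffusers pipeline receives its conditions as a list, the same idea can be sketched by extending the example above; the clip path and frame index are hypothetical, and the remaining arguments follow the usual Diffusers video-pipeline signature:

```py
# Continuing the image-to-video example: add a second condition taken from a
# short clip (hypothetical local file) that should start at frame 60.
clip = load_video("path/to/short_clip.mp4")
condition2 = LTXVideoCondition(video=clip, frame_index=60)

latents = pipe(
    conditions=[condition1, condition2],  # conditions are passed as a list
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=downscaled_width,
    height=downscaled_height,
    num_frames=num_frames,
    output_type="latent",
).frames
```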

### For text-to-video:

```py
import torch
from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
from diffusers.utils import export_to_video

pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
# ...

def round_to_nearest_resolution_acceptable_by_vae(height, width):
    height = height - (height % pipe.vae_spatial_compression_ratio)
    width = width - (width % pipe.vae_spatial_compression_ratio)
    return height, width

prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
expected_height, expected_width = 512, 704
downscale_factor = 2 / 3
num_frames = 121

# Part 1. Generate video at smaller resolution
downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
latents = pipe(
    conditions=None,  # no image or video conditioning for pure text-to-video
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=downscaled_width,
    # ...
)

# Part 2. Upscale the latents with the spatial upsampler
upscaled_latents = pipe_upsample(
    # ...
)

# Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=upscaled_width,
    # ...
)
```
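
As in the image-to-video example, the generated frames can then be resized back to the expected resolution and exported. A minimal sketch, assuming the text-to-video flow ends the same way:

```py
# Mirroring the image-to-video example: resize to the target resolution and save.
video = [frame.resize((expected_width, expected_height)) for frame in video]
export_to_video(video, "output.mp4", fps=24)
```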