jacobitterman committed on
Commit
9cafd79
·
verified ·
1 Parent(s): 8d597b9

Update README.md

Files changed (1)
  1. README.md +35 -31
README.md CHANGED
@@ -18,13 +18,12 @@ We provide a model for both text-to-video as well as image+text-to-video usecase
18
 
19
  <img src="./media/trailer.gif" alt="trailer" width="512">
20
 
21
-
22
- | | | | |
23
- |:---:|:---:|:---:|:---:|
24
- | ![example1](./media/ltx-video_example_00001.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A woman with long brown hair and light skin smiles at another woman...</summary>A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage.</details> | ![example2](./media/ltx-video_example_00002.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A woman walks away from a white Jeep parked on a city street at night...</summary>A woman walks away from a white Jeep parked on a city street at night, then ascends a staircase and knocks on a door. The woman, wearing a dark jacket and jeans, walks away from the Jeep parked on the left side of the street, her back to the camera; she walks at a steady pace, her arms swinging slightly by her sides; the street is dimly lit, with streetlights casting pools of light on the wet pavement; a man in a dark jacket and jeans walks past the Jeep in the opposite direction; the camera follows the woman from behind as she walks up a set of stairs towards a building with a green door; she reaches the top of the stairs and turns left, continuing to walk towards the building; she reaches the door and knocks on it with her right hand; the camera remains stationary, focused on the doorway; the scene is captured in real-life footage.</details> | ![example3](./media/ltx-video_example_00003.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A woman with blonde hair styled up, wearing a black dress...</summary>A woman with blonde hair styled up, wearing a black dress with sequins and pearl earrings, looks down with a sad expression on her face. 
The camera remains stationary, focused on the woman's face. The lighting is dim, casting soft shadows on her face. The scene appears to be from a movie or TV show.</details> | ![example4](./media/ltx-video_example_00004.gif)<br><details style="max-width: 300px; margin: auto;"><summary>The camera pans over a snow-covered mountain range...</summary>The camera pans over a snow-covered mountain range, revealing a vast expanse of snow-capped peaks and valleys.The mountains are covered in a thick layer of snow, with some areas appearing almost white while others have a slightly darker, almost grayish hue. The peaks are jagged and irregular, with some rising sharply into the sky while others are more rounded. The valleys are deep and narrow, with steep slopes that are also covered in snow. The trees in the foreground are mostly bare, with only a few leaves remaining on their branches. The sky is overcast, with thick clouds obscuring the sun. The overall impression is one of peace and tranquility, with the snow-covered mountains standing as a testament to the power and beauty of nature.</details> |
25
- | ![example5](./media/ltx-video_example_00005.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A woman with light skin, wearing a blue jacket and a black hat...</summary>A woman with light skin, wearing a blue jacket and a black hat with a veil, looks down and to her right, then back up as she speaks; she has brown hair styled in an updo, light brown eyebrows, and is wearing a white collared shirt under her jacket; the camera remains stationary on her face as she speaks; the background is out of focus, but shows trees and people in period clothing; the scene is captured in real-life footage.</details> | ![example6](./media/ltx-video_example_00006.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A man in a dimly lit room talks on a vintage telephone...</summary>A man in a dimly lit room talks on a vintage telephone, hangs up, and looks down with a sad expression. He holds the black rotary phone to his right ear with his right hand, his left hand holding a rocks glass with amber liquid. He wears a brown suit jacket over a white shirt, and a gold ring on his left ring finger. His short hair is neatly combed, and he has light skin with visible wrinkles around his eyes. The camera remains stationary, focused on his face and upper body. The room is dark, lit only by a warm light source off-screen to the left, casting shadows on the wall behind him. The scene appears to be from a movie.</details> | ![example7](./media/ltx-video_example_00007.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A prison guard unlocks and opens a cell door...</summary>A prison guard unlocks and opens a cell door to reveal a young man sitting at a table with a woman. The guard, wearing a dark blue uniform with a badge on his left chest, unlocks the cell door with a key held in his right hand and pulls it open; he has short brown hair, light skin, and a neutral expression. 
The young man, wearing a black and white striped shirt, sits at a table covered with a white tablecloth, facing the woman; he has short brown hair, light skin, and a neutral expression. The woman, wearing a dark blue shirt, sits opposite the young man, her face turned towards him; she has short blonde hair and light skin. The camera remains stationary, capturing the scene from a medium distance, positioned slightly to the right of the guard. The room is dimly lit, with a single light fixture illuminating the table and the two figures. The walls are made of large, grey concrete blocks, and a metal door is visible in the background. The scene is captured in real-life footage.</details> | ![example8](./media/ltx-video_example_00008.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A woman with blood on her face and a white tank top...</summary>A woman with blood on her face and a white tank top looks down and to her right, then back up as she speaks. She has dark hair pulled back, light skin, and her face and chest are covered in blood. The camera angle is a close-up, focused on the woman's face and upper torso. The lighting is dim and blue-toned, creating a somber and intense atmosphere. The scene appears to be from a movie or TV show.</details> |
26
- | ![example9](./media/ltx-video_example_00009.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A man with graying hair, a beard, and a gray shirt...</summary>A man with graying hair, a beard, and a gray shirt looks down and to his right, then turns his head to the left. The camera angle is a close-up, focused on the man's face. The lighting is dim, with a greenish tint. The scene appears to be real-life footage. Step</details> | ![example10](./media/ltx-video_example_00010.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A clear, turquoise river flows through a rocky canyon...</summary>A clear, turquoise river flows through a rocky canyon, cascading over a small waterfall and forming a pool of water at the bottom.The river is the main focus of the scene, with its clear water reflecting the surrounding trees and rocks. The canyon walls are steep and rocky, with some vegetation growing on them. The trees are mostly pine trees, with their green needles contrasting with the brown and gray rocks. The overall tone of the scene is one of peace and tranquility.</details> | ![example11](./media/ltx-video_example_00011.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A man in a suit enters a room and speaks to two women...</summary>A man in a suit enters a room and speaks to two women sitting on a couch. The man, wearing a dark suit with a gold tie, enters the room from the left and walks towards the center of the frame. He has short gray hair, light skin, and a serious expression. He places his right hand on the back of a chair as he approaches the couch. Two women are seated on a light-colored couch in the background. The woman on the left wears a light blue sweater and has short blonde hair. The woman on the right wears a white sweater and has short blonde hair. The camera remains stationary, focusing on the man as he enters the room. The room is brightly lit, with warm tones reflecting off the walls and furniture. 
The scene appears to be from a film or television show.</details> | ![example12](./media/ltx-video_example_00012.gif)<br><details style="max-width: 300px; margin: auto;"><summary>The waves crash against the jagged rocks of the shoreline...</summary>The waves crash against the jagged rocks of the shoreline, sending spray high into the air.The rocks are a dark gray color, with sharp edges and deep crevices. The water is a clear blue-green, with white foam where the waves break against the rocks. The sky is a light gray, with a few white clouds dotting the horizon.</details> |
27
- | ![example13](./media/ltx-video_example_00013.gif)<br><details style="max-width: 300px; margin: auto;"><summary>The camera pans across a cityscape of tall buildings...</summary>The camera pans across a cityscape of tall buildings with a circular building in the center. The camera moves from left to right, showing the tops of the buildings and the circular building in the center. The buildings are various shades of gray and white, and the circular building has a green roof. The camera angle is high, looking down at the city. The lighting is bright, with the sun shining from the upper left, casting shadows from the buildings. The scene is computer-generated imagery.</details> | ![example14](./media/ltx-video_example_00014.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A man walks towards a window, looks out, and then turns around...</summary>A man walks towards a window, looks out, and then turns around. He has short, dark hair, dark skin, and is wearing a brown coat over a red and gray scarf. He walks from left to right towards a window, his gaze fixed on something outside. The camera follows him from behind at a medium distance. The room is brightly lit, with white walls and a large window covered by a white curtain. As he approaches the window, he turns his head slightly to the left, then back to the right. He then turns his entire body to the right, facing the window. The camera remains stationary as he stands in front of the window. The scene is captured in real-life footage.</details> | ![example15](./media/ltx-video_example_00015.gif)<br><details style="max-width: 300px; margin: auto;"><summary>Two police officers in dark blue uniforms and matching hats...</summary>Two police officers in dark blue uniforms and matching hats enter a dimly lit room through a doorway on the left side of the frame. The first officer, with short brown hair and a mustache, steps inside first, followed by his partner, who has a shaved head and a goatee. 
Both officers have serious expressions and maintain a steady pace as they move deeper into the room. The camera remains stationary, capturing them from a slightly low angle as they enter. The room has exposed brick walls and a corrugated metal ceiling, with a barred window visible in the background. The lighting is low-key, casting shadows on the officers' faces and emphasizing the grim atmosphere. The scene appears to be from a film or television show.</details> | ![example16](./media/ltx-video_example_00016.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A woman with short brown hair, wearing a maroon sleeveless top...</summary>A woman with short brown hair, wearing a maroon sleeveless top and a silver necklace, walks through a room while talking, then a woman with pink hair and a white shirt appears in the doorway and yells. The first woman walks from left to right, her expression serious; she has light skin and her eyebrows are slightly furrowed. The second woman stands in the doorway, her mouth open in a yell; she has light skin and her eyes are wide. The room is dimly lit, with a bookshelf visible in the background. The camera follows the first woman as she walks, then cuts to a close-up of the second woman's face. The scene is captured in real-life footage.</details> |
28
 
29
  # Models & Workflows
30
 
@@ -72,8 +71,8 @@ You can use the model for purposes under the license:
72
  The model is accessible right away via the following links:
73
  - [LTX-Studio image-to-video (13B-mix)](https://app.ltx.studio/motion-workspace?videoModel=ltxv-13b)
74
  - [LTX-Studio image-to-video (13B distilled)](https://app.ltx.studio/motion-workspace?videoModel=ltxv)
75
- - [Fal.ai text-to-video](https://fal.ai/models/fal-ai/ltx-video)
76
- - [Fal.ai image-to-video](https://fal.ai/models/fal-ai/ltx-video/image-to-video)
77
  - [Replicate text-to-video and image-to-video](https://replicate.com/lightricks/ltx-video)
78
 
79
  ### ComfyUI
@@ -99,16 +98,20 @@ python -m pip install -e .\[inference-script\]
99
 
100
  To use our model, please follow the inference code in [inference.py](https://github.com/Lightricks/LTX-Video/blob/main/inference.py):
101
 
102
- ##### For text-to-video generation:
 
103
 
104
  ```bash
105
- python inference.py --prompt "PROMPT" --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml
106
  ```
107
 
108
- ##### For image-to-video generation:
 
 
 
109
 
110
  ```bash
111
- python inference.py --prompt "PROMPT" --input_image_path IMAGE_PATH --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml
112
  ```
113
 
114
  ### Diffusers 🧨
@@ -123,12 +126,14 @@ pip install -U git+https://github.com/huggingface/diffusers
123
 
124
  Now, you can run the examples below (note that the upsampling stage is optional but reccomeneded):
125
 
126
- ### text-to-video:
 
 
127
  ```py
128
  import torch
129
  from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
130
  from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
131
- from diffusers.utils import export_to_video
132
 
133
  pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
134
  pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
@@ -141,17 +146,21 @@ def round_to_nearest_resolution_acceptable_by_vae(height, width):
141
  width = width - (width % pipe.vae_spatial_compression_ratio)
142
  return height, width
143
 
144
- prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
 
 
 
 
145
  negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
146
- expected_height, expected_width = 512, 704
147
  downscale_factor = 2 / 3
148
- num_frames = 121
149
 
150
  # Part 1. Generate video at smaller resolution
151
  downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
152
  downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
153
  latents = pipe(
154
- conditions=None,
155
  prompt=prompt,
156
  negative_prompt=negative_prompt,
157
  width=downscaled_width,
@@ -172,6 +181,7 @@ upscaled_latents = pipe_upsample(
172
 
173
  # Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
174
  video = pipe(
 
175
  prompt=prompt,
176
  negative_prompt=negative_prompt,
177
  width=upscaled_width,
@@ -192,13 +202,12 @@ video = [frame.resize((expected_width, expected_height)) for frame in video]
192
  export_to_video(video, "output.mp4", fps=24)
193
  ```
194
 
195
- ### For image-to-video:
196
-
197
  ```py
198
  import torch
199
  from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
200
  from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
201
- from diffusers.utils import export_to_video, load_image, load_video
202
 
203
  pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
204
  pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
@@ -211,21 +220,17 @@ def round_to_nearest_resolution_acceptable_by_vae(height, width):
211
  width = width - (width % pipe.vae_spatial_compression_ratio)
212
  return height, width
213
 
214
- image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png")
215
- video = load_video(export_to_video([image])) # compress the image using video compression as the model was trained on videos
216
- condition1 = LTXVideoCondition(video=video, frame_index=0)
217
-
218
- prompt = "A cute little penguin takes out a book and starts reading it"
219
  negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
220
- expected_height, expected_width = 480, 832
221
  downscale_factor = 2 / 3
222
- num_frames = 96
223
 
224
  # Part 1. Generate video at smaller resolution
225
  downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
226
  downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
227
  latents = pipe(
228
- conditions=[condition1],
229
  prompt=prompt,
230
  negative_prompt=negative_prompt,
231
  width=downscaled_width,
@@ -246,7 +251,6 @@ upscaled_latents = pipe_upsample(
246
 
247
  # Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
248
  video = pipe(
249
- conditions=[condition1],
250
  prompt=prompt,
251
  negative_prompt=negative_prompt,
252
  width=upscaled_width,
 
18
 
19
  <img src="./media/trailer.gif" alt="trailer" width="512">
20
 
21
+ ### Image-to-video examples
22
+ | | | |
23
+ |:---:|:---:|:---:|
24
+ | ![example1](./media/ltx-video_i2v_example_00001.gif) | ![example2](./media/ltx-video_i2v_example_00002.gif) | ![example3](./media/ltx-video_i2v_example_00003.gif) |
25
+ | ![example4](./media/ltx-video_i2v_example_00004.gif) | ![example5](./media/ltx-video_i2v_example_00005.gif) | ![example6](./media/ltx-video_i2v_example_00006.gif) |
26
+ | ![example7](./media/ltx-video_i2v_example_00007.gif) | ![example8](./media/ltx-video_i2v_example_00008.gif) | ![example9](./media/ltx-video_i2v_example_00009.gif) |
 
27
 
28
  # Models & Workflows
29
 
 
71
  The model is accessible right away via the following links:
72
  - [LTX-Studio image-to-video (13B-mix)](https://app.ltx.studio/motion-workspace?videoModel=ltxv-13b)
73
  - [LTX-Studio image-to-video (13B distilled)](https://app.ltx.studio/motion-workspace?videoModel=ltxv)
74
+ - [Fal.ai image-to-video (13B full)](https://fal.ai/models/fal-ai/ltx-video-13b-dev/image-to-video)
75
+ - [Fal.ai image-to-video (13B distilled)](https://fal.ai/models/fal-ai/ltx-video-13b-distilled/image-to-video)
76
  - [Replicate text-to-video and image-to-video](https://replicate.com/lightricks/ltx-video)
77
 
78
  ### ComfyUI
 
98
 
99
  To use our model, please follow the inference code in [inference.py](https://github.com/Lightricks/LTX-Video/blob/main/inference.py):
100
 
101
+
102
+ #### For image-to-video generation:
103
 
104
  ```bash
105
+ python inference.py --prompt "PROMPT" --input_image_path IMAGE_PATH --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml
106
  ```
107
 
108
+ #### For video generation with multiple conditions:
109
+
110
+ You can now generate a video conditioned on a set of images and/or short video segments.
111
+ Simply provide a list of paths to the images or video segments you want to condition on, along with their target frame numbers in the generated video. You can also specify the conditioning strength for each item (default: 1.0).
112
 
113
  ```bash
114
+ python inference.py --prompt "PROMPT" --conditioning_media_paths IMAGE_OR_VIDEO_PATH_1 IMAGE_OR_VIDEO_PATH_2 --conditioning_start_frames TARGET_FRAME_1 TARGET_FRAME_2 --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml
115
  ```
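
The multi-condition command pairs each entry of `--conditioning_media_paths` with one entry of `--conditioning_start_frames`, so the two lists need the same length. Below is a minimal Python sketch that assembles such a command and checks the pairing first. The helper name `build_inference_command` and the range check on start frames are assumptions made for illustration and are not part of `inference.py`; only the flags shown in the command above are used, and the per-item conditioning strength option is left out because its flag name is not shown here.

```python
import shlex


def build_inference_command(prompt, media_paths, start_frames, height, width,
                            num_frames, seed,
                            pipeline_config="configs/ltxv-13b-0.9.7-distilled.yaml"):
    """Hypothetical helper: build the argv list for a multi-condition run."""
    # Each conditioning item needs exactly one target frame.
    if len(media_paths) != len(start_frames):
        raise ValueError("need one start frame per conditioning media path")
    # Assumed sanity check: target frames should fall inside the generated video.
    for frame in start_frames:
        if not 0 <= frame < num_frames:
            raise ValueError(f"start frame {frame} is outside 0..{num_frames - 1}")
    return [
        "python", "inference.py",
        "--prompt", prompt,
        "--conditioning_media_paths", *media_paths,
        "--conditioning_start_frames", *[str(f) for f in start_frames],
        "--height", str(height),
        "--width", str(width),
        "--num_frames", str(num_frames),
        "--seed", str(seed),
        "--pipeline_config", pipeline_config,
    ]


# Placeholder file names; replace them with real image or video paths.
cmd = build_inference_command(
    "A penguin reads a book", ["first.png", "clip.mp4"], [0, 48],
    height=480, width=832, num_frames=96, seed=42,
)
print(shlex.join(cmd))  # shell-quoted command line, ready to copy
```

The returned list can also be passed directly to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues with long prompts.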
116
 
117
  ### Diffusers 🧨
 
126
 
127
  Now, you can run the examples below (note that the upsampling stage is optional but reccomeneded):
128
 
129
+
130
+ ### For image-to-video:
131
+
132
  ```py
133
  import torch
134
  from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
135
  from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
136
+ from diffusers.utils import export_to_video, load_image, load_video
137
 
138
  pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
139
  pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
 
146
  width = width - (width % pipe.vae_spatial_compression_ratio)
147
  return height, width
148
 
149
+ image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png")
150
+ video = load_video(export_to_video([image])) # compress the image using video compression as the model was trained on videos
151
+ condition1 = LTXVideoCondition(video=video, frame_index=0)
152
+
153
+ prompt = "A cute little penguin takes out a book and starts reading it"
154
  negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
155
+ expected_height, expected_width = 480, 832
156
  downscale_factor = 2 / 3
157
+ num_frames = 96
158
 
159
  # Part 1. Generate video at smaller resolution
160
  downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
161
  downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
162
  latents = pipe(
163
+ conditions=[condition1],
164
  prompt=prompt,
165
  negative_prompt=negative_prompt,
166
  width=downscaled_width,
 
181
 
182
  # Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
183
  video = pipe(
184
+ conditions=[condition1],
185
  prompt=prompt,
186
  negative_prompt=negative_prompt,
187
  width=upscaled_width,
 
202
  export_to_video(video, "output.mp4", fps=24)
203
  ```
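
To make the resolution arithmetic in Part 1 concrete, here is a small standalone sketch that runs without loading the pipeline. It assumes a spatial compression ratio of 32 in place of `pipe.vae_spatial_compression_ratio`, and it assumes the height is rounded the same way as the width line shown in the helper; the function names are illustrative only.

```python
def round_down_to_multiple(height, width, ratio=32):
    # Mirrors the README helper: drop the remainder so both sides divide evenly.
    # ratio=32 is an assumed stand-in for pipe.vae_spatial_compression_ratio.
    height = height - (height % ratio)
    width = width - (width % ratio)
    return height, width


def downscaled_resolution(expected_height, expected_width,
                          downscale_factor=2 / 3, ratio=32):
    # Scale first (truncating to int), then snap down to a valid multiple.
    height = int(expected_height * downscale_factor)
    width = int(expected_width * downscale_factor)
    return round_down_to_multiple(height, width, ratio)


# Image-to-video example: 480x832 -> 320x554 after scaling -> 320x544.
print(downscaled_resolution(480, 832))  # (320, 544)
# Text-to-video example: 512x704 -> 341x469 after scaling -> 320x448.
print(downscaled_resolution(512, 704))  # (320, 448)
```

Because the first pass is rounded down, the final frames are resized back to `expected_width` and `expected_height` before export, as the last lines of each example do.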
204
 
205
+ ### text-to-video:
 
206
  ```py
207
  import torch
208
  from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
209
  from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
210
+ from diffusers.utils import export_to_video
211
 
212
  pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
213
  pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
 
220
  width = width - (width % pipe.vae_spatial_compression_ratio)
221
  return height, width
222
 
223
+ prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
 
 
 
 
224
  negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
225
+ expected_height, expected_width = 512, 704
226
  downscale_factor = 2 / 3
227
+ num_frames = 121
228
 
229
  # Part 1. Generate video at smaller resolution
230
  downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
231
  downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
232
  latents = pipe(
233
+ conditions=None,
234
  prompt=prompt,
235
  negative_prompt=negative_prompt,
236
  width=downscaled_width,
 
251
 
252
  # Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
253
  video = pipe(
 
254
  prompt=prompt,
255
  negative_prompt=negative_prompt,
256
  width=upscaled_width,