<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
<Tip warning={true}>

🧪 This pipeline is for research purposes only.

</Tip>
# Text-to-video

[ModelScope Text-to-Video Technical Report](https://arxiv.org/abs/2308.06571) is by Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang.

The abstract from the paper is:

*This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model could adapt to varying frame numbers during training and inference, rendering it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), totally comprising 1.7 billion parameters, in which 0.5 billion parameters are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.*

You can find additional information about Text-to-Video on the [project page](https://modelscope.cn/models/damo/text-to-video-synthesis/summary), [original codebase](https://github.com/modelscope/modelscope/), and try it out in a [demo](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis). Official checkpoints can be found at [damo-vilab](https://huggingface.co/damo-vilab) and [cerspense](https://huggingface.co/cerspense).
## Usage example

### `text-to-video-ms-1.7b`

Let's start by generating a short video with the default length of 16 frames (2s at 8 fps):

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe = pipe.to("cuda")

prompt = "Spiderman is surfing"
video_frames = pipe(prompt).frames[0]
video_path = export_to_video(video_frames)
video_path
```
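
By default, `export_to_video` writes the frames to a temporary `.mp4` file and returns its path. If you prefer to choose the filename and frame rate yourself, the helper also accepts an output path and an `fps` argument (the filename below is just an example):

```python
# Hypothetical output filename; fps=8 matches the default frame rate mentioned above.
video_path = export_to_video(video_frames, "spiderman_surfing.mp4", fps=8)
```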
Diffusers supports different optimization techniques to improve the latency
and memory footprint of a pipeline. Since videos are often more memory-heavy than images,
we can enable CPU offloading and VAE slicing to keep the memory footprint at bay.

Let's generate a video of 8 seconds (64 frames) on the same GPU using CPU offloading and VAE slicing:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.enable_model_cpu_offload()

# memory optimization
pipe.enable_vae_slicing()

prompt = "Darth Vader surfing a wave"
video_frames = pipe(prompt, num_frames=64).frames[0]
video_path = export_to_video(video_frames)
video_path
```
It takes just **7 GB of GPU memory** to generate the 64 video frames using PyTorch 2.0, `fp16` precision, and the techniques mentioned above.
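
If you want to check the peak memory usage on your own hardware, a minimal sketch using standard PyTorch CUDA utilities looks like this (it reuses the pipeline from the previous snippet, and the exact number will vary with your GPU, PyTorch version, and generation settings):

```python
import torch

# Reset the peak-allocation counter, run generation, then read the high-water mark.
torch.cuda.reset_peak_memory_stats()
video_frames = pipe("Darth Vader surfing a wave", num_frames=64).frames[0]
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```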
We can also use a different scheduler easily, using the same method we'd use for Stable Diffusion:

```python
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "Spiderman is surfing"
video_frames = pipe(prompt, num_inference_steps=25).frames[0]
video_path = export_to_video(video_frames)
video_path
```
Here are some sample outputs:

<table>
    <tr>
        <td><center>
        An astronaut riding a horse.
        <br>
        <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astr.gif"
            alt="An astronaut riding a horse."
            style="width: 300px;" />
        </center></td>
        <td><center>
        Darth Vader surfing in waves.
        <br>
        <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/vader.gif"
            alt="Darth Vader surfing in waves."
            style="width: 300px;" />
        </center></td>
    </tr>
</table>
### `cerspense/zeroscope_v2_576w` & `cerspense/zeroscope_v2_XL`

Zeroscope models are watermark-free and have been trained on specific resolutions such as `576x320` and `1024x576`.
One should first generate a video using the lower-resolution checkpoint [`cerspense/zeroscope_v2_576w`](https://huggingface.co/cerspense/zeroscope_v2_576w) with [`TextToVideoSDPipeline`],
which can then be upscaled using [`VideoToVideoSDPipeline`] and [`cerspense/zeroscope_v2_XL`](https://huggingface.co/cerspense/zeroscope_v2_XL).

```py
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video
from PIL import Image

pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

# memory optimization
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()

prompt = "Darth Vader surfing a wave"
video_frames = pipe(prompt, num_frames=24).frames[0]
video_path = export_to_video(video_frames)
video_path
```
Now the video can be upscaled:

```py
pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_XL", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# memory optimization
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()

video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]

video_frames = pipe(prompt, video=video, strength=0.6).frames[0]
video_path = export_to_video(video_frames)
video_path
```
Here are some sample outputs:

<table>
    <tr>
        <td><center>
        Darth Vader surfing in waves.
        <br>
        <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/darthvader_cerpense.gif"
            alt="Darth Vader surfing in waves."
            style="width: 576px;" />
        </center></td>
    </tr>
</table>
## Tips

Video generation is memory-intensive, and one way to reduce your memory usage is to call `enable_forward_chunking` on the pipeline's UNet so you don't run the entire feedforward layer at once. Breaking it up into chunks in a loop is more efficient, as shown in the sketch below.
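
For example, the same settings used in the Zeroscope examples above chunk the feed-forward computation one frame at a time (the `disable_forward_chunking` call afterwards is optional, and assumes you want to switch chunking off again once the memory-heavy call is done):

```python
# Chunk the UNet's feed-forward layers over the frame dimension (dim=1), one chunk at a time.
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)

# Turn chunking off again after generation if you no longer need the memory savings.
pipe.unet.disable_forward_chunking()
```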
Check out the [Text or image-to-video](text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage.
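
As a rough sketch (assuming a pipeline loaded as in the examples above; the values are illustrative, not tuned), the most commonly adjusted call arguments look like this:

```python
import torch

# Illustrative values only; see the guide linked above for how each parameter affects the output.
video_frames = pipe(
    prompt="Darth Vader surfing a wave",
    negative_prompt="low quality, blurry",  # concepts to steer away from
    num_frames=24,  # clip length in frames
    num_inference_steps=25,  # fewer steps are faster but may reduce quality
    guidance_scale=9.0,  # how closely to follow the prompt
    generator=torch.Generator(device="cuda").manual_seed(0),  # reproducible results
).frames[0]
```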
<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>
## TextToVideoSDPipeline

[[autodoc]] TextToVideoSDPipeline
  - all
  - __call__

## VideoToVideoSDPipeline

[[autodoc]] VideoToVideoSDPipeline
  - all
  - __call__

## TextToVideoSDPipelineOutput

[[autodoc]] pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput