---
license: mit
library_name: diffusers
pipeline_tag: text-to-video
---

This repository contains a pruned and isolated pipeline for Stage 2 of [StreamingT2V](https://streamingt2v.github.io/), dubbed "VidXTend."

This model's primary purpose is extending 16-frame, 256x256-pixel animations by 8 frames at a time (one second at 8 fps).

```bibtex
@article{henschel2024streamingt2v,
  title={StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text},
  author={Henschel, Roberto and Khachatryan, Levon and Hayrapetyan, Daniil and Poghosyan, Hayk and Tadevosyan, Vahram and Wang, Zhangyang and Navasardyan, Shant and Shi, Humphrey},
  journal={arXiv preprint arXiv:2403.14773},
  year={2024}
}
```

Code: https://github.com/Picsart-AI-Research/StreamingT2V

# Usage

## Installation

First, install the VidXTend package into your Python environment. If you're creating a new environment for VidXTend, be sure to also install the version of torch you want with CUDA support; otherwise, inference will run on CPU only.

```sh
pip install git+https://github.com/painebenjamin/vidxtend.git
```

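For example, a fresh CUDA-enabled environment might be set up as below. This is only a sketch: the `cu121` index URL is illustrative, and you should substitute the wheel index matching your CUDA version.

```sh
# Illustrative only: install a CUDA-enabled torch build first, then vidxtend.
# Replace cu121 with the index matching your installed CUDA version.
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/painebenjamin/vidxtend.git
```
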
## Command-Line

A command-line utility `vidxtend` is installed with the package.

```sh
Usage: vidxtend [OPTIONS] VIDEO PROMPT

  Run VidXtend on a video file, concatenating the generated frames to the end
  of the video.

Options:
  -fps, --frame-rate INTEGER      Video FPS. Will default to the input FPS.
  -s, --seconds FLOAT             The total number of seconds to add to the
                                  video. Multiply this number by frame rate to
                                  determine total number of new frames
                                  generated.  [default: 1.0]
  -np, --negative-prompt TEXT     Negative prompt for the diffusion process.
  -cfg, --guidance-scale FLOAT    Guidance scale for the diffusion process.
                                  [default: 7.5]
  -ns, --num-inference-steps INTEGER
                                  Number of diffusion steps.  [default: 50]
  -r, --seed INTEGER              Random seed.
  -m, --model TEXT                HuggingFace model name.
  -nh, --no-half                  Do not use half precision.
  -no, --no-offload               Do not offload to the CPU to preserve GPU
                                  memory.
  -ns, --no-slicing               Do not use VAE slicing.
  -g, --gpu-id INTEGER            GPU ID to use.
  -sf, --model-single-file        Download and use a single file instead of a
                                  directory.
  -cf, --config-file TEXT         Config file to use when using the model-
                                  single-file option. Accepts a path or a
                                  filename in the same directory as the single
                                  file. Will download from the repository
                                  passed in the model option if not provided.
                                  [default: config.json]
  -mf, --model-filename TEXT      The model file to download when using the
                                  model-single-file option.  [default:
                                  vidxtend.safetensors]
  -rs, --remote-subfolder TEXT    Remote subfolder to download from when using
                                  the model-single-file option.
  -cd, --cache-dir DIRECTORY      Cache directory to download to. Default uses
                                  the huggingface cache.
  -o, --output FILE               Output file.  [default: output.mp4]
  -f, --fit [actual|cover|contain|stretch]
                                  Image fit mode.  [default: cover]
  -a, --anchor [top-left|top-center|top-right|center-left|center-center|center-right|bottom-left|bottom-center|bottom-right]
                                  Image anchor point.  [default: top-left]
  --help                          Show this message and exit.
```
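
For example, a typical invocation might look like the following; the input file and prompt are placeholders:

```sh
# Extend input.mp4 by two seconds (16 new frames at 8 fps).
vidxtend input.mp4 "a timelapse of clouds rolling over mountains" \
  --seconds 2.0 \
  --guidance-scale 7.5 \
  --output extended.mp4
```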

## Python

You can create the pipeline, automatically pulling the weights from this repository, either as individual models:

```py
import torch

from vidxtend import VidXTendPipeline

pipeline = VidXTendPipeline.from_pretrained(
    "benjamin-paine/vidxtend",
    torch_dtype=torch.float16,
    variant="fp16",
)
```

Or, as a single file:

```py
import torch

from vidxtend import VidXTendPipeline

pipeline = VidXTendPipeline.from_single_file(
    "benjamin-paine/vidxtend",
    torch_dtype=torch.float16,
    variant="fp16",
)
```

Use these methods to reduce memory usage and improve performance:

```py
pipeline.enable_model_cpu_offload()  # move idle submodels to CPU between steps
pipeline.enable_vae_slicing()  # decode the VAE output in slices to save memory
pipeline.set_use_memory_efficient_attention_xformers()  # xformers attention
```

Usage is as follows:

```py
# Assumes `pipeline` from above, a text `prompt`, and `images`, a list of
# PIL images containing the video so far.
new_frames = pipeline(
    prompt=prompt,
    negative_prompt=None,  # optionally pass a negative prompt
    image=images[-8:],  # condition on the final 8 frames of the video
    input_frames_conditioning=images[:1],  # condition on the first frame
    eta=1.0,
    guidance_scale=7.5,
    output_type="pil",
).frames[8:]  # drop the first 8 output frames; they repeat the 8 guide frames
```