---
license: mit
library_name: diffusers
pipeline_tag: text-to-video
---

This repository contains a pruned and isolated pipeline for Stage 2 of [StreamingT2V](https://streamingt2v.github.io/), dubbed "VidXTend."

This model's primary purpose is extending 16-frame, 256x256-pixel animations by 8 frames at a time (one second at 8 fps).

```bibtex
@article{henschel2024streamingt2v,
  title={StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text},
  author={Henschel, Roberto and Khachatryan, Levon and Hayrapetyan, Daniil and Poghosyan, Hayk and Tadevosyan, Vahram and Wang, Zhangyang and Navasardyan, Shant and Shi, Humphrey},
  journal={arXiv preprint arXiv:2403.14773},
  year={2024}
}
```

Code: https://github.com/Picsart-AI-Research/StreamingT2V

# Usage

## Installation

First, install the VidXTend package into your Python environment. If you're creating a new environment for VidXTend, be sure to also install the version of torch you want with CUDA support; otherwise, inference will run on CPU only.

```sh
pip install git+https://github.com/painebenjamin/vidxtend.git
```

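For example, a fresh CUDA-enabled environment might be set up as below. This is only a sketch: the `cu121` index URL is illustrative, and you should substitute the wheel index matching your CUDA version.

```sh
# Illustrative only: install a CUDA-enabled torch build first, then vidxtend.
# Replace cu121 with the index matching your installed CUDA version.
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/painebenjamin/vidxtend.git
```
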
## Command-Line

A command-line utility `vidxtend` is installed with the package.

```sh
Usage: vidxtend [OPTIONS] VIDEO PROMPT

  Run VidXtend on a video file, concatenating the generated frames to the end
  of the video.

Options:
  -fps, --frame-rate INTEGER      Video FPS. Will default to the input FPS.
  -s, --seconds FLOAT             The total number of seconds to add to the
                                  video. Multiply this number by frame rate to
                                  determine total number of new frames
                                  generated.  [default: 1.0]
  -np, --negative-prompt TEXT     Negative prompt for the diffusion process.
  -cfg, --guidance-scale FLOAT    Guidance scale for the diffusion process.
                                  [default: 7.5]
  -ns, --num-inference-steps INTEGER
                                  Number of diffusion steps.  [default: 50]
  -r, --seed INTEGER              Random seed.
  -m, --model TEXT                HuggingFace model name.
  -nh, --no-half                  Do not use half precision.
  -no, --no-offload               Do not offload to the CPU to preserve GPU
                                  memory.
  -ns, --no-slicing               Do not use VAE slicing.
  -g, --gpu-id INTEGER            GPU ID to use.
  -sf, --model-single-file        Download and use a single file instead of a
                                  directory.
  -cf, --config-file TEXT         Config file to use when using the model-
                                  single-file option. Accepts a path or a
                                  filename in the same directory as the single
                                  file. Will download from the repository
                                  passed in the model option if not provided.
                                  [default: config.json]
  -mf, --model-filename TEXT      The model file to download when using the
                                  model-single-file option.  [default:
                                  vidxtend.safetensors]
  -rs, --remote-subfolder TEXT    Remote subfolder to download from when using
                                  the model-single-file option.
  -cd, --cache-dir DIRECTORY      Cache directory to download to. Default uses
                                  the huggingface cache.
  -o, --output FILE               Output file.  [default: output.mp4]
  -f, --fit [actual|cover|contain|stretch]
                                  Image fit mode.  [default: cover]
  -a, --anchor [top-left|top-center|top-right|center-left|center-center|center-right|bottom-left|bottom-center|bottom-right]
                                  Image anchor point.  [default: top-left]
  --help                          Show this message and exit.
```
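
For example, a typical invocation might look like the following; the input file and prompt are placeholders:

```sh
# Extend input.mp4 by two seconds (16 new frames at 8 fps).
vidxtend input.mp4 "a timelapse of clouds rolling over mountains" \
  --seconds 2.0 \
  --guidance-scale 7.5 \
  --output extended.mp4
```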

## Python

You can create the pipeline, automatically pulling the weights from this repository, either as individual models:

```py
import torch

from vidxtend import VidXTendPipeline

pipeline = VidXTendPipeline.from_pretrained(
    "benjamin-paine/vidxtend",
    torch_dtype=torch.float16,
    variant="fp16",
)
```

Or, as a single file:

```py
import torch

from vidxtend import VidXTendPipeline

pipeline = VidXTendPipeline.from_single_file(
    "benjamin-paine/vidxtend",
    torch_dtype=torch.float16,
    variant="fp16",
)
```

Use these methods to reduce memory usage and improve performance:

```py
pipeline.enable_model_cpu_offload()  # move idle submodels to CPU between steps
pipeline.enable_vae_slicing()  # decode the VAE output in slices to save memory
pipeline.set_use_memory_efficient_attention_xformers()  # xformers attention
```

Usage is as follows:

```py
# Assumes `pipeline` from above, a text `prompt`, and `images`, a list of
# PIL images containing the video so far.
new_frames = pipeline(
    prompt=prompt,
    negative_prompt=None,  # optionally pass a negative prompt
    image=images[-8:],  # condition on the final 8 frames of the video
    input_frames_conditioning=images[:1],  # condition on the first frame
    eta=1.0,
    guidance_scale=7.5,
    output_type="pil",
).frames[8:]  # drop the first 8 output frames; they repeat the 8 guide frames
```