
	#
	# Licensed under the Apache License, Version 2.0 (the "License");
	# you may not use this file except in compliance with the License.
	# You may obtain a copy of the License at
	#
	# http://www.apache.org/licenses/LICENSE-2.0
	#
	# Unless required by applicable law or agreed to in writing, software
	# distributed under the License is distributed on an "AS IS" BASIS,
	# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	# See the License for the specific language governing permissions and
	# limitations under the License.
	-->

	# EasyAnimate
	[EasyAnimate](https://github.com/aigc-apps/EasyAnimate) by Alibaba PAI.

	The description from its GitHub page:
	EasyAnimate is a pipeline based on the transformer architecture, designed for generating AI images and videos, and for training baseline models and Lora models for Diffusion Transformer. We support direct prediction from pre-trained EasyAnimate models, allowing for the generation of videos with various resolutions, approximately 6 seconds in length, at 8fps (EasyAnimateV5.1, 1 to 49 frames). Additionally, users can train their own baseline and Lora models for specific style transformations.

	This pipeline was contributed by [bubbliiiing](https://github.com/bubbliiiing). The original codebase can be found [here](https://github.com/aigc-apps/EasyAnimate). The original weights can be found under [hf.co/alibaba-pai](https://huggingface.co/alibaba-pai).

	There are two official EasyAnimate checkpoints for text-to-video and video-to-video.

	\| checkpoints \| recommended inference dtype \|
	\|:---:\|:---:\|
	\| [`alibaba-pai/EasyAnimateV5.1-12b-zh`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh) \| torch.float16 \|
	\| [`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-InP) \| torch.float16 \|

	There is one official EasyAnimate checkpoint available for image-to-video and video-to-video.

	\| checkpoints \| recommended inference dtype \|
	\|:---:\|:---:\|
	\| [`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-InP) \| torch.float16 \|

	There are two official EasyAnimate checkpoints available for control-to-video.

	\| checkpoints \| recommended inference dtype \|
	\|:---:\|:---:\|
	\| [`alibaba-pai/EasyAnimateV5.1-12b-zh-Control`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-Control) \| torch.float16 \|
	\| [`alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera) \| torch.float16 \|

	For the EasyAnimateV5.1 series:
	- Text-to-video (T2V) and image-to-video (I2V) work at multiple resolutions. The width and height can each vary from 256 to 1024.
	- Both T2V and I2V models support generation with 1 to 49 frames and work best within this range. Exporting videos at 8 FPS is recommended.
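The ranges above can be checked before launching a long generation. The helper below is a minimal sketch, not part of diffusers: `check_generation_args` is a hypothetical name, and it only encodes the limits stated in this section (256 to 1024 pixels per side, 1 to 49 frames, 8 FPS export).

```python
def check_generation_args(height: int, width: int, num_frames: int, fps: int = 8) -> float:
    """Validate arguments against the documented ranges and return the clip length in seconds."""
    for name, value in (("height", height), ("width", width)):
        if not 256 <= value <= 1024:
            raise ValueError(f"{name}={value} is outside the documented 256 to 1024 range")
    if not 1 <= num_frames <= 49:
        raise ValueError(f"num_frames={num_frames} is outside the documented 1 to 49 range")
    # Clip duration is simply the frame count divided by the export frame rate.
    return num_frames / fps


duration = check_generation_args(height=512, width=512, num_frames=49)
print(f"{duration:.3f} s")  # 6.125 s
```

At the maximum of 49 frames and 8 FPS, this gives the roughly 6-second clip length mentioned in the project description.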

	## Quantization

	Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.

	Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [EasyAnimatePipeline](/docs/diffusers/pr_12229/en/api/pipelines/easyanimate#diffusers.EasyAnimatePipeline) for inference with bitsandbytes.

	```py
	import torch
	from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, EasyAnimateTransformer3DModel, EasyAnimatePipeline
	from diffusers.utils import export_to_video

	quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
	transformer_8bit = EasyAnimateTransformer3DModel.from_pretrained(
	    "alibaba-pai/EasyAnimateV5.1-12b-zh",
	    subfolder="transformer",
	    quantization_config=quant_config,
	    torch_dtype=torch.float16,
	)

	pipeline = EasyAnimatePipeline.from_pretrained(
	    "alibaba-pai/EasyAnimateV5.1-12b-zh",
	    transformer=transformer_8bit,
	    torch_dtype=torch.float16,
	    device_map="balanced",
	)

	prompt = "A cat walks on the grass, realistic style."
	negative_prompt = "bad detailed"
	video = pipeline(prompt=prompt, negative_prompt=negative_prompt, num_frames=49, num_inference_steps=30).frames[0]
	export_to_video(video, "cat.mp4", fps=8)
	```
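To get a rough sense of why lower-precision weights reduce memory, the sketch below estimates the storage needed for the transformer weights alone. It assumes the "12b" in the checkpoint name means about 12 billion parameters, and it ignores activations, the text encoder, the VAE, and any quantization overhead, so treat the numbers as a back-of-the-envelope estimate only.

```python
# Bytes needed to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 12e9  # assumption: "12b" is read as roughly 12 billion parameters

for dtype in ("float16", "int8"):
    print(f"{dtype}: {weight_memory_gb(NUM_PARAMS, dtype):.1f} GB")
# float16: 24.0 GB
# int8: 12.0 GB
```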

	## EasyAnimatePipeline[[diffusers.EasyAnimatePipeline]]

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class diffusers.EasyAnimatePipeline</name><anchor>diffusers.EasyAnimatePipeline</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/easyanimate/pipeline_easyanimate.py#L186</source><parameters>[{"name": "vae", "val": ": AutoencoderKLMagvit"}, {"name": "text_encoder", "val": ": typing.Union[transformers.models.qwen2_vl.modeling_qwen2_vl.Qwen2VLForConditionalGeneration, transformers.models.bert.modeling_bert.BertModel]"}, {"name": "tokenizer", "val": ": typing.Union[transformers.models.qwen2.tokenization_qwen2.Qwen2Tokenizer, transformers.models.bert.tokenization_bert.BertTokenizer]"}, {"name": "transformer", "val": ": EasyAnimateTransformer3DModel"}, {"name": "scheduler", "val": ": FlowMatchEulerDiscreteScheduler"}]</parameters><paramsdesc>- vae ([AutoencoderKLMagvit](/docs/diffusers/pr_12229/en/api/models/autoencoderkl_magvit#diffusers.AutoencoderKLMagvit)) --
	Variational Auto-Encoder (VAE) Model to encode and decode video to and from latent representations.
	- text_encoder (Optional[`~transformers.Qwen2VLForConditionalGeneration`, `~transformers.BertModel`]) --
	EasyAnimate uses [qwen2 vl](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) in V5.1.
	- tokenizer (Optional[`~transformers.Qwen2Tokenizer`, `~transformers.BertTokenizer`]) --
	A `Qwen2Tokenizer` or `BertTokenizer` to tokenize text.
	- transformer ([EasyAnimateTransformer3DModel](/docs/diffusers/pr_12229/en/api/models/easyanimate_transformer3d#diffusers.EasyAnimateTransformer3DModel)) --
	The EasyAnimate model designed by EasyAnimate Team.
	- scheduler ([FlowMatchEulerDiscreteScheduler](/docs/diffusers/pr_12229/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler)) --
	A scheduler to be used in combination with EasyAnimate to denoise the encoded image latents.</paramsdesc><paramgroups>0</paramgroups></docstring>

	Pipeline for text-to-video generation using EasyAnimate.

	This model inherits from [DiffusionPipeline](/docs/diffusers/pr_12229/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the
	library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).

	EasyAnimate uses one text encoder [qwen2 vl](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) in V5.1.





	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>__call__</name><anchor>diffusers.EasyAnimatePipeline.__call__</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/easyanimate/pipeline_easyanimate.py#L524</source><parameters>[{"name": "prompt", "val": ": typing.Union[str, typing.List[str]] = None"}, {"name": "num_frames", "val": ": typing.Optional[int] = 49"}, {"name": "height", "val": ": typing.Optional[int] = 512"}, {"name": "width", "val": ": typing.Optional[int] = 512"}, {"name": "num_inference_steps", "val": ": typing.Optional[int] = 50"}, {"name": "guidance_scale", "val": ": typing.Optional[float] = 5.0"}, {"name": "negative_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "num_images_per_prompt", "val": ": typing.Optional[int] = 1"}, {"name": "eta", "val": ": typing.Optional[float] = 0.0"}, {"name": "generator", "val": ": typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None"}, {"name": "latents", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "timesteps", "val": ": typing.Optional[typing.List[int]] = None"}, {"name": "negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "output_type", "val": ": typing.Optional[str] = 'pil'"}, {"name": "return_dict", "val": ": bool = True"}, {"name": "callback_on_step_end", "val": ": typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None"}, {"name": "callback_on_step_end_tensor_inputs", "val": ": typing.List[str] = ['latents']"}, {"name": "guidance_rescale", "val": ": float = 0.0"}]</parameters><rettype>[EasyAnimatePipelineOutput](/docs/diffusers/pr_12229/en/api/pipelines/easyanimate#diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput) or `tuple`</rettype><retdesc>If `return_dict` is `True`, [EasyAnimatePipelineOutput](/docs/diffusers/pr_12229/en/api/pipelines/easyanimate#diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput) is returned,
	otherwise a `tuple` is returned where the first element contains the generated video frames.</retdesc></docstring>

	Generates images or video using the EasyAnimate pipeline based on the provided prompts.


	<ExampleCodeBlock anchor="diffusers.EasyAnimatePipeline.__call__.example">

	Examples:
	```python
	>>> import torch
	>>> from diffusers import EasyAnimatePipeline
	>>> from diffusers.utils import export_to_video

	>>> # Models: "alibaba-pai/EasyAnimateV5.1-12b-zh"
	>>> pipe = EasyAnimatePipeline.from_pretrained(
	...     "alibaba-pai/EasyAnimateV5.1-7b-zh-diffusers", torch_dtype=torch.float16
	... ).to("cuda")
	>>> prompt = (
	...     "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
	...     "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
	...     "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
	...     "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
	...     "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
	...     "atmosphere of this unique musical performance."
	... )
	>>> sample_size = (512, 512)
	>>> video = pipe(
	...     prompt=prompt,
	...     guidance_scale=6,
	...     negative_prompt="bad detailed",
	...     height=sample_size[0],
	...     width=sample_size[1],
	...     num_inference_steps=50,
	... ).frames[0]
	>>> export_to_video(video, "output.mp4", fps=8)
	```

	</ExampleCodeBlock>
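The `guidance_scale=6` argument in the example above controls classifier-free guidance. The sketch below shows the standard guidance formula on plain Python lists; it is an illustration under the assumption that the pipeline follows the usual formulation, and `apply_cfg` is a hypothetical helper, not a diffusers function.

```python
from typing import List


def apply_cfg(noise_uncond: List[float], noise_text: List[float], guidance_scale: float) -> List[float]:
    """Move the unconditional prediction toward the text-conditioned one, scaled by guidance_scale."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0]  # prediction for the negative (or empty) prompt
text = [1.0, 3.0]    # prediction for the actual prompt

print(apply_cfg(uncond, text, guidance_scale=1.0))  # [1.0, 3.0], identical to the text prediction
print(apply_cfg(uncond, text, guidance_scale=6.0))  # [6.0, 13.0], pushed further along the same direction
```

A scale of 1 leaves the text-conditioned prediction unchanged, while larger values extrapolate past it, which is why very high scales can trade quality for prompt adherence.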

	- prompt (`str` or `List[str]`, optional) -- Text prompts to guide the image or video generation. If not provided, use `prompt_embeds` instead.
	- num_frames (`int`, optional, defaults to 49) -- Length of the generated video (in frames).
	- height (`int`, optional, defaults to 512) -- Height of the generated image in pixels.
	- width (`int`, optional, defaults to 512) -- Width of the generated image in pixels.
	- num_inference_steps (`int`, optional, defaults to 50) -- Number of denoising steps during generation. More steps generally yield higher quality images but slow down inference.
	- guidance_scale (`float`, optional, defaults to 5.0) -- Encourages the model to align outputs with prompts. A higher value may decrease image quality.
	- negative_prompt (`str` or `List[str]`, optional) -- Prompts indicating what to exclude in generation. If not specified, use `negative_prompt_embeds`.
	- num_images_per_prompt (`int`, optional, defaults to 1) -- Number of images to generate for each prompt.
	- eta (`float`, optional, defaults to 0.0) -- Applies to DDIM scheduling. Controlled by the eta parameter from the related literature.
	- generator (`torch.Generator` or `List[torch.Generator]`, optional) -- A generator to ensure reproducibility in image generation.
	- latents (`torch.Tensor`, optional) -- Predefined latent tensors to condition generation.
	- prompt_embeds (`torch.Tensor`, optional) -- Text embeddings for the prompts. Overrides prompt string inputs for more flexibility.
	- timesteps (`List[int]`, optional) -- Custom timesteps to use for the denoising process, for schedulers that support them.
	- negative_prompt_embeds (`torch.Tensor`, optional) -- Embeddings for negative prompts. Overrides string inputs if defined.
	- prompt_attention_mask (`torch.Tensor`, optional) -- Attention mask for the primary prompt embeddings.
	- negative_prompt_attention_mask (`torch.Tensor`, optional) -- Attention mask for negative prompt embeddings.
	- output_type (`str`, optional, defaults to `"pil"`) -- Format of the generated output, either as a PIL image or as a NumPy array.
	- return_dict (`bool`, optional, defaults to `True`) -- If `True`, returns a structured output. Otherwise returns a simple tuple.
	- callback_on_step_end (`Callable`, optional) -- Functions called at the end of each denoising step.
	- callback_on_step_end_tensor_inputs (`List[str]`, optional, defaults to `["latents"]`) -- Tensor names to be included in callback function calls.
	- guidance_rescale (`float`, optional, defaults to 0.0) -- Adjusts noise levels based on guidance scale.
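As an illustration of `callback_on_step_end`, the sketch below defines a callback that records each step. It assumes the general diffusers callback convention (the callback receives the pipeline, the step index, the timestep, and a dict of the requested tensors, and returns that dict); the loop is a stand-in for the denoising loop, not the pipeline's actual code, and `record_step` is a hypothetical name.

```python
from typing import Any, Dict, List, Tuple

seen_steps: List[Tuple[int, int, List[str]]] = []


def record_step(pipeline: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Record the step index, timestep, and which tensors were passed in."""
    seen_steps.append((step, timestep, sorted(callback_kwargs)))
    # Return the dict so the caller can pick up any modified tensors.
    return callback_kwargs


# Stand-in for the denoising loop: three steps with made-up timesteps.
for step, timestep in enumerate([999, 500, 1]):
    record_step(None, step, timestep, {"latents": [0.0]})

print(seen_steps)
# [(0, 999, ['latents']), (1, 500, ['latents']), (2, 1, ['latents'])]
```

With a loaded pipeline, the same function would be passed as `callback_on_step_end=record_step`, with `callback_on_step_end_tensor_inputs=["latents"]` selecting which tensors appear in `callback_kwargs`.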






	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>encode_prompt</name><anchor>diffusers.EasyAnimatePipeline.encode_prompt</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/easyanimate/pipeline_easyanimate.py#L241</source><parameters>[{"name": "prompt", "val": ": typing.Union[str, typing.List[str]]"}, {"name": "num_images_per_prompt", "val": ": int = 1"}, {"name": "do_classifier_free_guidance", "val": ": bool = True"}, {"name": "negative_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "device", "val": ": typing.Optional[torch.device] = None"}, {"name": "dtype", "val": ": typing.Optional[torch.dtype] = None"}, {"name": "max_sequence_length", "val": ": int = 256"}]</parameters><paramsdesc>- prompt (`str` or `List[str]`, optional) --
	prompt to be encoded
	- device (`torch.device`) --
	torch device
	- dtype (`torch.dtype`) --
	torch dtype
	- num_images_per_prompt (`int`) --
	number of images that should be generated per prompt
	- do_classifier_free_guidance (`bool`) --
	whether to use classifier free guidance or not
	- negative_prompt (`str` or `List[str]`, optional) --
	The prompt or prompts not to guide the image generation. If not defined, one has to pass
	`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
	less than `1`).
	- prompt_embeds (`torch.Tensor`, optional) --
	Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not
	provided, text embeddings will be generated from `prompt` input argument.
	- negative_prompt_embeds (`torch.Tensor`, optional) --
	Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt
	weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
	argument.
	- prompt_attention_mask (`torch.Tensor`, optional) --
	Attention mask for the prompt. Required when `prompt_embeds` is passed directly.
	- negative_prompt_attention_mask (`torch.Tensor`, optional) --
	Attention mask for the negative prompt. Required when `negative_prompt_embeds` is passed directly.
	- max_sequence_length (`int`, optional) -- maximum sequence length to use for the prompt.</paramsdesc><paramgroups>0</paramgroups></docstring>

	Encodes the prompt into text encoder hidden states.
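The relationship between `max_sequence_length` and the attention masks can be pictured as padding or truncating the token sequence to a fixed length. The sketch below is a simplified stand-in on plain lists, not the tokenizer the pipeline actually uses; `pad_to_max_length` and `pad_id` are hypothetical names chosen for illustration.

```python
from typing import List, Tuple


def pad_to_max_length(
    token_ids: List[int], max_sequence_length: int = 256, pad_id: int = 0
) -> Tuple[List[int], List[int]]:
    """Truncate or right-pad token ids and build the matching attention mask."""
    ids = token_ids[:max_sequence_length]      # drop tokens beyond the limit
    mask = [1] * len(ids)                      # 1 marks a real token
    n_pad = max_sequence_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


ids, mask = pad_to_max_length([5, 6, 7], max_sequence_length=5)
print(ids)   # [5, 6, 7, 0, 0]
print(mask)  # [1, 1, 1, 0, 0]
```

This is why an attention mask has to accompany precomputed `prompt_embeds`: without it, padded positions cannot be told apart from real tokens.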




	</div></div>

	## EasyAnimatePipelineOutput[[diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput]]

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput</name><anchor>diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/easyanimate/pipeline_output.py#L9</source><parameters>[{"name": "frames", "val": ": Tensor"}]</parameters><paramsdesc>- frames (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]) --
	List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing
	denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape
	`(batch_size, num_frames, channels, height, width)`.

	Output class for EasyAnimate pipelines.




	</div>
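The nested-list layout of `frames` explains why the examples on this page index the result with `.frames[0]`. The sketch below mimics that layout with a stand-in dataclass and strings in place of PIL images; `FakeVideoOutput` is hypothetical and only mirrors the `frames` field described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeVideoOutput:
    """Stand-in for the pipeline output: one list of frames per generated video."""
    frames: List[List[str]]


batch_size, num_frames = 2, 49
output = FakeVideoOutput(
    frames=[[f"video{b}_frame{f}" for f in range(num_frames)] for b in range(batch_size)]
)

first_video = output.frames[0]  # frames of the first video in the batch
print(len(output.frames), len(first_video))  # 2 49
```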

