Buckets:

rtrm's picture
|
download
raw
17.7 kB
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->
# EasyAnimate
[EasyAnimate](https://github.com/aigc-apps/EasyAnimate) by Alibaba PAI.
The description from it's GitHub page:
*EasyAnimate is a pipeline based on the transformer architecture, designed for generating AI images and videos, and for training baseline models and Lora models for Diffusion Transformer. We support direct prediction from pre-trained EasyAnimate models, allowing for the generation of videos with various resolutions, approximately 6 seconds in length, at 8fps (EasyAnimateV5.1, 1 to 49 frames). Additionally, users can train their own baseline and Lora models for specific style transformations.*
This pipeline was contributed by [bubbliiiing](https://github.com/bubbliiiing). The original codebase can be found [here](https://huggingface.co/alibaba-pai). The original weights can be found under [hf.co/alibaba-pai](https://huggingface.co/alibaba-pai).
There are two official EasyAnimate checkpoints for text-to-video and video-to-video.
| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`alibaba-pai/EasyAnimateV5.1-12b-zh`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh) | torch.float16 |
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-InP) | torch.float16 |
There is one official EasyAnimate checkpoints available for image-to-video and video-to-video.
| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-InP) | torch.float16 |
There are two official EasyAnimate checkpoints available for control-to-video.
| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-Control`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-Control) | torch.float16 |
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera) | torch.float16 |
For the EasyAnimateV5.1 series:
- Text-to-video (T2V) and Image-to-video (I2V) works for multiple resolutions. The width and height can vary from 256 to 1024.
- Both T2V and I2V models support generation with 1~49 frames and work best at this value. Exporting videos at 8 FPS is recommended.
## Quantization
Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [EasyAnimatePipeline](/docs/diffusers/pr_12229/en/api/pipelines/easyanimate#diffusers.EasyAnimatePipeline) for inference with bitsandbytes.
```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, EasyAnimateTransformer3DModel, EasyAnimatePipeline
from diffusers.utils import export_to_video
quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = EasyAnimateTransformer3DModel.from_pretrained(
"alibaba-pai/EasyAnimateV5.1-12b-zh",
subfolder="transformer",
quantization_config=quant_config,
torch_dtype=torch.float16,
)
pipeline = EasyAnimatePipeline.from_pretrained(
"alibaba-pai/EasyAnimateV5.1-12b-zh",
transformer=transformer_8bit,
torch_dtype=torch.float16,
device_map="balanced",
)
prompt = "A cat walks on the grass, realistic style."
negative_prompt = "bad detailed"
video = pipeline(prompt=prompt, negative_prompt=negative_prompt, num_frames=49, num_inference_steps=30).frames[0]
export_to_video(video, "cat.mp4", fps=8)
```
## EasyAnimatePipeline[[diffusers.EasyAnimatePipeline]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>class diffusers.EasyAnimatePipeline</name><anchor>diffusers.EasyAnimatePipeline</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/easyanimate/pipeline_easyanimate.py#L186</source><parameters>[{"name": "vae", "val": ": AutoencoderKLMagvit"}, {"name": "text_encoder", "val": ": typing.Union[transformers.models.qwen2_vl.modeling_qwen2_vl.Qwen2VLForConditionalGeneration, transformers.models.bert.modeling_bert.BertModel]"}, {"name": "tokenizer", "val": ": typing.Union[transformers.models.qwen2.tokenization_qwen2.Qwen2Tokenizer, transformers.models.bert.tokenization_bert.BertTokenizer]"}, {"name": "transformer", "val": ": EasyAnimateTransformer3DModel"}, {"name": "scheduler", "val": ": FlowMatchEulerDiscreteScheduler"}]</parameters><paramsdesc>- **vae** ([AutoencoderKLMagvit](/docs/diffusers/pr_12229/en/api/models/autoencoderkl_magvit#diffusers.AutoencoderKLMagvit)) --
Variational Auto-Encoder (VAE) Model to encode and decode video to and from latent representations.
- **text_encoder** (Optional[`~transformers.Qwen2VLForConditionalGeneration`, `~transformers.BertModel`]) --
EasyAnimate uses [qwen2 vl](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) in V5.1.
- **tokenizer** (Optional[`~transformers.Qwen2Tokenizer`, `~transformers.BertTokenizer`]) --
A `Qwen2Tokenizer` or `BertTokenizer` to tokenize text.
- **transformer** ([EasyAnimateTransformer3DModel](/docs/diffusers/pr_12229/en/api/models/easyanimate_transformer3d#diffusers.EasyAnimateTransformer3DModel)) --
The EasyAnimate model designed by EasyAnimate Team.
- **scheduler** ([FlowMatchEulerDiscreteScheduler](/docs/diffusers/pr_12229/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler)) --
A scheduler to be used in combination with EasyAnimate to denoise the encoded image latents.</paramsdesc><paramgroups>0</paramgroups></docstring>
Pipeline for text-to-video generation using EasyAnimate.
This model inherits from [DiffusionPipeline](/docs/diffusers/pr_12229/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
EasyAnimate uses one text encoder [qwen2 vl](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) in V5.1.
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>__call__</name><anchor>diffusers.EasyAnimatePipeline.__call__</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/easyanimate/pipeline_easyanimate.py#L524</source><parameters>[{"name": "prompt", "val": ": typing.Union[str, typing.List[str]] = None"}, {"name": "num_frames", "val": ": typing.Optional[int] = 49"}, {"name": "height", "val": ": typing.Optional[int] = 512"}, {"name": "width", "val": ": typing.Optional[int] = 512"}, {"name": "num_inference_steps", "val": ": typing.Optional[int] = 50"}, {"name": "guidance_scale", "val": ": typing.Optional[float] = 5.0"}, {"name": "negative_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "num_images_per_prompt", "val": ": typing.Optional[int] = 1"}, {"name": "eta", "val": ": typing.Optional[float] = 0.0"}, {"name": "generator", "val": ": typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None"}, {"name": "latents", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "timesteps", "val": ": typing.Optional[typing.List[int]] = None"}, {"name": "negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "output_type", "val": ": typing.Optional[str] = 'pil'"}, {"name": "return_dict", "val": ": bool = True"}, {"name": "callback_on_step_end", "val": ": typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None"}, {"name": "callback_on_step_end_tensor_inputs", "val": ": typing.List[str] = ['latents']"}, {"name": "guidance_rescale", "val": ": float = 0.0"}]</parameters><rettype>[StableDiffusionPipelineOutput](/docs/diffusers/pr_12229/en/api/pipelines/stable_diffusion/text2img#diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput) or `tuple`</rettype><retdesc>If `return_dict` is `True`, [StableDiffusionPipelineOutput](/docs/diffusers/pr_12229/en/api/pipelines/stable_diffusion/text2img#diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput) is returned,
otherwise a `tuple` is returned where the first element is a list with the generated images and the
second element is a list of `bool`s indicating whether the corresponding generated image contains
"not-safe-for-work" (nsfw) content.</retdesc></docstring>
Generates images or video using the EasyAnimate pipeline based on the provided prompts.
<ExampleCodeBlock anchor="diffusers.EasyAnimatePipeline.__call__.example">
Examples:
```python
>>> import torch
>>> from diffusers import EasyAnimatePipeline
>>> from diffusers.utils import export_to_video
>>> # Models: "alibaba-pai/EasyAnimateV5.1-12b-zh"
>>> pipe = EasyAnimatePipeline.from_pretrained(
... "alibaba-pai/EasyAnimateV5.1-7b-zh-diffusers", torch_dtype=torch.float16
... ).to("cuda")
>>> prompt = (
... "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
... "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
... "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
... "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
... "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
... "atmosphere of this unique musical performance."
... )
>>> sample_size = (512, 512)
>>> video = pipe(
... prompt=prompt,
... guidance_scale=6,
... negative_prompt="bad detailed",
... height=sample_size[0],
... width=sample_size[1],
... num_inference_steps=50,
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=8)
```
</ExampleCodeBlock>
prompt (`str` or `List[str]`, *optional*):
Text prompts to guide the image or video generation. If not provided, use `prompt_embeds` instead.
num_frames (`int`, *optional*):
Length of the generated video (in frames).
height (`int`, *optional*):
Height of the generated image in pixels.
width (`int`, *optional*):
Width of the generated image in pixels.
num_inference_steps (`int`, *optional*, defaults to 50):
Number of denoising steps during generation. More steps generally yield higher quality images but slow
down inference.
guidance_scale (`float`, *optional*, defaults to 5.0):
Encourages the model to align outputs with prompts. A higher value may decrease image quality.
negative_prompt (`str` or `List[str]`, *optional*):
Prompts indicating what to exclude in generation. If not specified, use `negative_prompt_embeds`.
num_images_per_prompt (`int`, *optional*, defaults to 1):
Number of images to generate for each prompt.
eta (`float`, *optional*, defaults to 0.0):
Applies to DDIM scheduling. Controlled by the eta parameter from the related literature.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
A generator to ensure reproducibility in image generation.
latents (`torch.Tensor`, *optional*):
Predefined latent tensors to condition generation.
prompt_embeds (`torch.Tensor`, *optional*):
Text embeddings for the prompts. Overrides prompt string inputs for more flexibility.
negative_prompt_embeds (`torch.Tensor`, *optional*):
Embeddings for negative prompts. Overrides string inputs if defined.
prompt_attention_mask (`torch.Tensor`, *optional*):
Attention mask for the primary prompt embeddings.
negative_prompt_attention_mask (`torch.Tensor`, *optional*):
Attention mask for negative prompt embeddings.
output_type (`str`, *optional*, defaults to "latent"):
Format of the generated output, either as a PIL image or as a NumPy array.
return_dict (`bool`, *optional*, defaults to `True`):
If `True`, returns a structured output. Otherwise returns a simple tuple.
callback_on_step_end (`Callable`, *optional*):
Functions called at the end of each denoising step.
callback_on_step_end_tensor_inputs (`List[str]`, *optional*):
Tensor names to be included in callback function calls.
guidance_rescale (`float`, *optional*, defaults to 0.0):
Adjusts noise levels based on guidance scale.
original_size (`Tuple[int, int]`, *optional*, defaults to `(1024, 1024)`):
Original dimensions of the output.
target_size (`Tuple[int, int]`, *optional*):
Desired output dimensions for calculations.
crops_coords_top_left (`Tuple[int, int]`, *optional*, defaults to `(0, 0)`):
Coordinates for cropping.
</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>encode_prompt</name><anchor>diffusers.EasyAnimatePipeline.encode_prompt</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/easyanimate/pipeline_easyanimate.py#L241</source><parameters>[{"name": "prompt", "val": ": typing.Union[str, typing.List[str]]"}, {"name": "num_images_per_prompt", "val": ": int = 1"}, {"name": "do_classifier_free_guidance", "val": ": bool = True"}, {"name": "negative_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "device", "val": ": typing.Optional[torch.device] = None"}, {"name": "dtype", "val": ": typing.Optional[torch.dtype] = None"}, {"name": "max_sequence_length", "val": ": int = 256"}]</parameters><paramsdesc>- **prompt** (`str` or `List[str]`, *optional*) --
prompt to be encoded
- **device** -- (`torch.device`):
torch device
- **dtype** (`torch.dtype`) --
torch dtype
- **num_images_per_prompt** (`int`) --
number of images that should be generated per prompt
- **do_classifier_free_guidance** (`bool`) --
whether to use classifier free guidance or not
- **negative_prompt** (`str` or `List[str]`, *optional*) --
The prompt or prompts not to guide the image generation. If not defined, one has to pass
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
less than `1`).
- **prompt_embeds** (`torch.Tensor`, *optional*) --
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
provided, text embeddings will be generated from `prompt` input argument.
- **negative_prompt_embeds** (`torch.Tensor`, *optional*) --
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
argument.
- **prompt_attention_mask** (`torch.Tensor`, *optional*) --
Attention mask for the prompt. Required when `prompt_embeds` is passed directly.
- **negative_prompt_attention_mask** (`torch.Tensor`, *optional*) --
Attention mask for the negative prompt. Required when `negative_prompt_embeds` is passed directly.
- **max_sequence_length** (`int`, *optional*) -- maximum sequence length to use for the prompt.</paramsdesc><paramgroups>0</paramgroups></docstring>
Encodes the prompt into text encoder hidden states.
</div></div>
## EasyAnimatePipelineOutput[[diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>class diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput</name><anchor>diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/easyanimate/pipeline_output.py#L9</source><parameters>[{"name": "frames", "val": ": Tensor"}]</parameters><paramsdesc>- **frames** (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]) --
List of video outputs - It can be a nested list of length `batch_size,` with each sub-list containing
denoised PIL image sequences of length `num_frames.` It can also be a NumPy array or Torch tensor of shape
`(batch_size, num_frames, channels, height, width)`.</paramsdesc><paramgroups>0</paramgroups></docstring>
Output class for EasyAnimate pipelines.
</div>
<EditOnGithub source="https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/easyanimate.md" />

Xet Storage Details

Size:
17.7 kB
·
Xet hash:
ea2a2deedc665421a0a6383c5d283bf3b1294f19b27749f0970a170a5356a088

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.