---
library_name: diffusers
tags:
- modular-diffusers
- diffusers
- helios-pyramid
---
This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

**Pipeline Type**: HeliosPyramidAutoBlocks

**Description**: Auto modular pipeline for pyramid progressive video generation with Helios, covering text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V).

This pipeline uses a 4-block architecture that can be customized and extended.

## Example Usage

[TODO]

## Pipeline Architecture

This modular pipeline is composed of the following blocks:

1. **text_encoder** (`HeliosTextEncoderStep`)
   - Text Encoder step that generates text embeddings to guide the video generation
2. **vae_encoder** (`HeliosPyramidAutoVaeEncoderStep`)
   - Encoder step that encodes video or image inputs. This is an auto pipeline block.
3. **denoise** (`HeliosPyramidAutoCoreDenoiseStep`)
   - Pyramid core denoise step that selects the appropriate denoising block.
4. **decode** (`HeliosDecodeStep`)
   - Decodes all chunk latents with the VAE, concatenates them, trims to the target frame count, and postprocesses into the final video output. 
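The four blocks above run in sequence, each reading from and adding to a shared state. The following is a toy sketch of that flow in plain Python, not the Diffusers implementation: the functions, the `run_blocks` helper, and the placeholder string values are all hypothetical, and the assumption that `vae_encoder` does nothing when neither `image` nor `video` is supplied is an illustration rather than documented behavior.

```python
from typing import Callable, Dict, List, Tuple

# Shared state passed from block to block (hypothetical stand-in).
State = Dict[str, object]


def text_encoder(state: State) -> State:
    # Stand-in for HeliosTextEncoderStep: turn the prompt into embeddings.
    return {**state, "prompt_embeds": f"embeds({state['prompt']})"}


def vae_encoder(state: State) -> State:
    # Stand-in for HeliosPyramidAutoVaeEncoderStep: encode whichever
    # visual input is present. Assumed to be a no-op for text-only input.
    out = dict(state)
    if state.get("image") is not None:
        out["image_latents"] = "encoded(image)"
    if state.get("video") is not None:
        out["video_latents"] = "encoded(video)"
    return out


def denoise(state: State) -> State:
    # Stand-in for HeliosPyramidAutoCoreDenoiseStep.
    return {**state, "latents": "denoised latents"}


def decode(state: State) -> State:
    # Stand-in for HeliosDecodeStep: produce the final `videos` output.
    return {**state, "videos": ["decoded video"]}


BLOCKS: List[Tuple[str, Callable[[State], State]]] = [
    ("text_encoder", text_encoder),
    ("vae_encoder", vae_encoder),
    ("denoise", denoise),
    ("decode", decode),
]


def run_blocks(state: State, blocks=BLOCKS) -> Tuple[State, List[str]]:
    """Run each block in order and record the order of execution."""
    trace = []
    for name, block in blocks:
        state = block(state)
        trace.append(name)
    return state, trace


final_state, trace = run_blocks({"prompt": "a red kite over a beach"})
print(trace)
print(sorted(final_state))
```

Because every block only reads and extends the state, a block can be swapped out or a new one inserted without touching the others, which is the property the "customized and extended" claim relies on.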

## Model Components

1. text_encoder (`UMT5EncoderModel`)
2. tokenizer (`AutoTokenizer`)
3. guider (`ClassifierFreeGuidance`)
4. vae (`AutoencoderKLWan`)
5. video_processor (`VideoProcessor`)
6. transformer (`HeliosTransformer3DModel`)
7. scheduler (`HeliosScheduler`) 

## Workflow Input Specification

<details>
<summary><strong>text2video</strong></summary>

- `prompt` (`str`): The prompt or prompts to guide video generation.

</details>

<details>
<summary><strong>image2video</strong></summary>

- `prompt` (`str`): The prompt or prompts to guide video generation.
- `image` (`Image | list`): Reference image(s) for denoising. Can be a single image or list of images.

</details>

<details>
<summary><strong>video2video</strong></summary>

- `prompt` (`str`): The prompt or prompts to guide video generation.
- `video` (`None`): Input video for video-to-video generation.

</details>
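
The three workflows differ only in which visual input accompanies the prompt. The following is a minimal sketch of that mapping, assuming the workflow is determined by which of `image` and `video` is provided. The `select_workflow` function is hypothetical and not part of Diffusers. How the real auto blocks resolve a call that passes both inputs is not stated here, so the sketch rejects that case instead of guessing a precedence.

```python
from typing import Optional


def select_workflow(
    prompt: str,
    image: Optional[object] = None,
    video: Optional[object] = None,
) -> str:
    """Return the workflow name implied by the supplied inputs."""
    if not prompt:
        raise ValueError("`prompt` is required by every workflow")
    if image is not None and video is not None:
        # Assumption: refuse ambiguous input rather than pick a winner.
        raise ValueError("pass either `image` or `video`, not both")
    if video is not None:
        return "video2video"
    if image is not None:
        return "image2video"
    return "text2video"


print(select_workflow("a red kite over a beach"))
print(select_workflow("a red kite over a beach", image="frame.png"))
print(select_workflow("a red kite over a beach", video="clip.mp4"))
```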


## Input/Output Specification

**Inputs:**

- `prompt` (`str`): The prompt or prompts to guide video generation.
- `negative_prompt` (`str`, *optional*): The prompt or prompts describing what to avoid in the generated video.
- `max_sequence_length` (`int`, *optional*, defaults to `512`): Maximum sequence length for prompt encoding.
- `video` (`None`, *optional*): Input video for video-to-video generation.
- `height` (`int`, *optional*, defaults to `384`): The height in pixels of the generated video.
- `width` (`int`, *optional*, defaults to `640`): The width in pixels of the generated video.
- `num_latent_frames_per_chunk` (`int`, *optional*, defaults to `9`): Number of latent frames per temporal chunk.
- `generator` (`Generator`, *optional*): Torch generator for deterministic generation.
- `image` (`Image | list`, *optional*): Reference image(s) for denoising. Can be a single image or list of images.
- `num_videos_per_prompt` (`int`, *optional*, defaults to `1`): Number of videos to generate per prompt.
- `image_latents` (`Tensor`, *optional*): Image latents used to guide video generation. Can be produced by the `vae_encoder` step.
- `video_latents` (`Tensor`, *optional*): Encoded video latents for V2V generation.
- `image_noise_sigma_min` (`float`, *optional*, defaults to `0.111`): Minimum sigma for image latent noise.
- `image_noise_sigma_max` (`float`, *optional*, defaults to `0.135`): Maximum sigma for image latent noise.
- `video_noise_sigma_min` (`float`, *optional*, defaults to `0.111`): Minimum sigma for video latent noise.
- `video_noise_sigma_max` (`float`, *optional*, defaults to `0.135`): Maximum sigma for video latent noise.
- `num_frames` (`int`, *optional*, defaults to `132`): Total number of video frames to generate.
- `history_sizes` (`list`): Sizes of long/mid/short history buffers for temporal context.
- `keep_first_frame` (`bool`, *optional*, defaults to `True`): Whether to keep the first frame as a prefix in history.
- `pyramid_num_inference_steps_list` (`list`, *optional*, defaults to `[10, 10, 10]`): Number of denoising steps per pyramid stage.
- `latents` (`Tensor`, *optional*): Pre-generated noisy latents for video generation.
- `**denoiser_input_fields` (`None`, *optional*): Conditional model inputs for the denoiser, e.g. `prompt_embeds`, `negative_prompt_embeds`.
- `attention_kwargs` (`dict`, *optional*): Additional kwargs for attention processors.
- `fake_image_latents` (`Tensor`, *optional*): Fake image latents used as history seed for I2V generation.
- `output_type` (`str`, *optional*, defaults to `np`): Output format, one of `pil`, `np`, or `pt`.

**Outputs:**

- `videos` (`list`): The generated videos.
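
To make the input list easier to work with, the documented defaults can be gathered in one place and merged with per-call overrides. The following is a minimal sketch in plain Python: `DOCUMENTED_DEFAULTS` copies the default values listed above, while the `build_call_kwargs` helper and its validation rules are hypothetical and not part of Diffusers. `history_sizes` has no documented default here, so the sketch leaves it to the caller.

```python
import copy

# Defaults copied from the input specification above.
DOCUMENTED_DEFAULTS = {
    "max_sequence_length": 512,
    "height": 384,
    "width": 640,
    "num_latent_frames_per_chunk": 9,
    "num_videos_per_prompt": 1,
    "image_noise_sigma_min": 0.111,
    "image_noise_sigma_max": 0.135,
    "video_noise_sigma_min": 0.111,
    "video_noise_sigma_max": 0.135,
    "num_frames": 132,
    "keep_first_frame": True,
    "pyramid_num_inference_steps_list": [10, 10, 10],
    "output_type": "np",
}

# Documented inputs that have no default value.
NO_DEFAULT = {
    "negative_prompt", "video", "generator", "image", "image_latents",
    "video_latents", "history_sizes", "latents", "attention_kwargs",
    "fake_image_latents",
}


def build_call_kwargs(prompt: str, **overrides) -> dict:
    """Merge overrides into the documented defaults (hypothetical helper)."""
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS) - NO_DEFAULT
    if unknown:
        raise TypeError(f"undocumented inputs: {sorted(unknown)}")

    # Deep copy so the shared default list is never mutated by callers.
    kwargs = copy.deepcopy(DOCUMENTED_DEFAULTS)
    kwargs.update(overrides)
    kwargs["prompt"] = prompt

    if kwargs["output_type"] not in {"pil", "np", "pt"}:
        raise ValueError("output_type must be 'pil', 'np', or 'pt'")
    for kind in ("image", "video"):
        lo = kwargs[f"{kind}_noise_sigma_min"]
        hi = kwargs[f"{kind}_noise_sigma_max"]
        if lo > hi:  # assumed sanity check, not documented behavior
            raise ValueError(f"{kind}_noise_sigma_min exceeds the max")
    return kwargs


kwargs = build_call_kwargs("a red kite over a beach", output_type="pil")
print(kwargs["height"], kwargs["width"], kwargs["output_type"])
```

Rejecting undocumented names early catches typos such as `num_frame` before any model weights are involved.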