YiYiXu (HF Staff) committed · Commit 29d265a · verified · 1 Parent(s): 3352350

Upload HeliosModularPipeline

Files changed (2)
  1. README.md +89 -0
  2. modular_model_index.json +75 -0
README.md ADDED
---
library_name: diffusers
tags:
- modular-diffusers
- diffusers
- helios
- text-to-video
---
This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

**Pipeline Type**: HeliosAutoBlocks

**Description**: Auto Modular pipeline for text-to-video, image-to-video, and video-to-video tasks using Helios.

This pipeline uses a 4-block architecture that can be customized and extended.

## Example Usage

[TODO]

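While the usage section above is still TODO, here is a minimal sketch of how a Modular Diffusers pipeline is typically loaded and run. The repo id is a placeholder, and the exact loading helpers (`from_pretrained` with `trust_remote_code`, `load_components`) may differ across diffusers versions, so treat this as an assumption rather than a verified recipe.

```python
import torch
from diffusers import ModularPipeline

# Placeholder repo id -- substitute this repository's actual id.
pipe = ModularPipeline.from_pretrained("<repo-id>", trust_remote_code=True)

# Materialize the model components (text encoder, VAE, transformer, ...).
# The helper name is an assumption; check the Modular Diffusers docs
# for your installed version.
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Text-to-video: only `prompt` is required; num_frames / height / width
# fall back to the defaults listed in the input spec below (132 / 384 / 640).
result = pipe(
    prompt="A sailboat gliding across a calm sea at sunset",
    num_inference_steps=50,
    output_type="np",
)
# The pipeline state exposes generated outputs by name; access may vary.
videos = result.values["videos"]
```

For image-to-video or video-to-video, pass `image=...` or `video=...` respectively; the auto blocks select the matching encoder and denoise path.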
## Pipeline Architecture

This modular pipeline is composed of the following blocks:

1. **text_encoder** (`HeliosTextEncoderStep`)
   - Text Encoder step that generates text embeddings to guide the video generation.
2. **vae_encoder** (`HeliosAutoVaeEncoderStep`)
   - Encoder step that encodes video or image inputs. This is an auto pipeline block.
   - *video_encoder*: `HeliosVideoVaeEncoderStep`
     - Video Encoder step that encodes an input video into VAE latent space, producing `image_latents` (first frame) and `video_latents` (chunked video frames) for video-to-video generation.
   - *image_encoder*: `HeliosImageVaeEncoderStep`
     - Image Encoder step that encodes an input image into VAE latent space, producing `image_latents` (first-frame prefix) and `fake_image_latents` (history seed) for image-to-video generation.
3. **denoise** (`HeliosAutoCoreDenoiseStep`)
   - Core denoise step that selects the appropriate denoising block.
   - *video2video*: `HeliosV2VCoreDenoiseStep`
     - V2V denoise block that seeds history with video latents and uses I2V-aware chunk preparation.
   - *image2video*: `HeliosI2VCoreDenoiseStep`
     - I2V denoise block that seeds history with image latents and uses I2V-aware chunk preparation.
   - *text2video*: `HeliosCoreDenoiseStep`
     - Denoise block that takes encoded conditions and runs the chunk-based denoising process.
4. **decode** (`HeliosDecodeStep`)
   - Decodes all chunk latents with the VAE, concatenates them, trims to the target frame count, and postprocesses into the final video output.

## Model Components

1. text_encoder (`UMT5EncoderModel`)
2. tokenizer (`AutoTokenizer`)
3. guider (`ClassifierFreeGuidance`)
4. vae (`AutoencoderKLWan`)
5. video_processor (`VideoProcessor`)
6. transformer (`HeliosTransformer3DModel`)
7. scheduler (`HeliosScheduler`)

## Input/Output Specification

### Inputs

**Required:**

- `prompt` (`str`): The prompt or prompts to guide video generation.
- `history_sizes` (`list`): Sizes of the long/mid/short history buffers for temporal context.
- `sigmas` (`list`): Custom sigmas for the denoising process.

**Optional:**

- `negative_prompt` (`str`): The prompt or prompts not to guide video generation.
- `max_sequence_length` (`int`), default: `512`: Maximum sequence length for prompt encoding.
- `video` (`Any`): Input video for video-to-video generation.
- `height` (`int`), default: `384`: The height in pixels of the generated video.
- `width` (`int`), default: `640`: The width in pixels of the generated video.
- `num_latent_frames_per_chunk` (`int`), default: `9`: Number of latent frames per temporal chunk.
- `generator` (`Generator`): Torch generator for deterministic generation.
- `image` (`PIL.Image.Image | list[PIL.Image.Image]`): Reference image(s) for denoising; a single image or a list of images.
- `num_videos_per_prompt` (`int`), default: `1`: Number of videos to generate per prompt.
- `image_latents` (`Tensor`): Image latents used to guide generation; can be produced by the vae_encoder step.
- `video_latents` (`Tensor`): Encoded video latents for V2V generation.
- `image_noise_sigma_min` (`float`), default: `0.111`: Minimum sigma for image latent noise.
- `image_noise_sigma_max` (`float`), default: `0.135`: Maximum sigma for image latent noise.
- `video_noise_sigma_min` (`float`), default: `0.111`: Minimum sigma for video latent noise.
- `video_noise_sigma_max` (`float`), default: `0.135`: Maximum sigma for video latent noise.
- `num_frames` (`int`), default: `132`: Total number of video frames to generate.
- `keep_first_frame` (`bool`), default: `True`: Whether to keep the first frame as a prefix in history.
- `num_inference_steps` (`int`), default: `50`: The number of denoising steps.
- `latents` (`Tensor`): Pre-generated noisy latents for video generation.
- `timesteps` (`Tensor`): Timesteps for the denoising process.
- Additional conditional model inputs for the denoiser, e.g. `prompt_embeds`, `negative_prompt_embeds`, etc.
- `attention_kwargs` (`dict`): Additional kwargs for attention processors.
- `fake_image_latents` (`Tensor`): Fake image latents used as the history seed for I2V generation.
- `output_type` (`str`), default: `np`: Output format: `'pil'`, `'np'`, or `'pt'`.

### Outputs

- `videos` (`list`): The generated videos.
modular_model_index.json ADDED
{
  "_blocks_class_name": "HeliosAutoBlocks",
  "_class_name": "HeliosModularPipeline",
  "_diffusers_version": "0.37.0.dev0",
  "scheduler": [
    "diffusers",
    "HeliosScheduler",
    {
      "pretrained_model_name_or_path": null,
      "revision": null,
      "subfolder": "",
      "type_hint": [
        "diffusers",
        "HeliosScheduler"
      ],
      "variant": null
    }
  ],
  "text_encoder": [
    null,
    null,
    {
      "pretrained_model_name_or_path": null,
      "revision": null,
      "subfolder": "",
      "type_hint": [
        "transformers",
        "UMT5EncoderModel"
      ],
      "variant": null
    }
  ],
  "tokenizer": [
    null,
    null,
    {
      "pretrained_model_name_or_path": null,
      "revision": null,
      "subfolder": "",
      "type_hint": [
        "transformers",
        "AutoTokenizer"
      ],
      "variant": null
    }
  ],
  "transformer": [
    null,
    null,
    {
      "pretrained_model_name_or_path": null,
      "revision": null,
      "subfolder": "",
      "type_hint": [
        "diffusers",
        "HeliosTransformer3DModel"
      ],
      "variant": null
    }
  ],
  "vae": [
    null,
    null,
    {
      "pretrained_model_name_or_path": null,
      "revision": null,
      "subfolder": "",
      "type_hint": [
        "diffusers",
        "AutoencoderKLWan"
      ],
      "variant": null
    }
  ]
}
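Each component entry in `modular_model_index.json` is a `[library, class, loading-spec]` triple; a `null` library/class means the class is resolved from the spec's `type_hint` instead. As a quick sanity check, the structure can be walked with the standard-library `json` module (the excerpt below reproduces two of the entries shown above):

```python
import json

# Excerpt of the component specs from modular_model_index.json above.
raw = '''
{
  "_class_name": "HeliosModularPipeline",
  "scheduler": ["diffusers", "HeliosScheduler",
    {"pretrained_model_name_or_path": null, "revision": null,
     "subfolder": "", "type_hint": ["diffusers", "HeliosScheduler"],
     "variant": null}],
  "text_encoder": [null, null,
    {"pretrained_model_name_or_path": null, "revision": null,
     "subfolder": "", "type_hint": ["transformers", "UMT5EncoderModel"],
     "variant": null}]
}
'''

index = json.loads(raw)

components = {}
for name, spec in index.items():
    if name.startswith("_"):        # _class_name etc. are metadata, not components
        continue
    library, cls, loading = spec    # the [library, class, loading-spec] triple
    hint_lib, hint_cls = loading["type_hint"]
    # Fall back to the type hint when library/class are null.
    components[name] = (library or hint_lib, cls or hint_cls)

print(components)
# {'scheduler': ('diffusers', 'HeliosScheduler'),
#  'text_encoder': ('transformers', 'UMT5EncoderModel')}
```

The same fallback logic explains why `text_encoder` can ship as `[null, null, ...]` here: its concrete class (`UMT5EncoderModel` from `transformers`) is still recoverable from `type_hint`.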