Upload HeliosModularPipeline
- README.md (+89 lines)
- modular_model_index.json (+75 lines)
README.md
ADDED
---
library_name: diffusers
tags:
- modular-diffusers
- diffusers
- helios
- text-to-image
---

This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

**Pipeline Type**: HeliosAutoBlocks

**Description**: Auto modular pipeline for text-to-video, image-to-video, and video-to-video tasks using Helios.

This pipeline uses a 4-block architecture that can be customized and extended.
## Example Usage

[TODO]
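Until the usage section above is filled in, here is a hedged sketch of what inference likely looks like. `ModularPipeline` is the generic loader in recent modular-diffusers releases, but the repo id (`<your-org>/HeliosModularPipeline`), the `load_default_components` helper, and the call signature are assumptions, not verified against this checkpoint:

```python
# Hedged sketch only — loading API and repo id are assumptions.
# The generation kwargs mirror the defaults documented in this README.
GEN_KWARGS = dict(
    prompt="a red fox running through fresh snow",
    num_frames=132,            # README default
    height=384,                # README default
    width=640,                 # README default
    num_inference_steps=50,    # README default
    output_type="np",          # README default
)

if __name__ == "__main__":
    # Imported lazily so the sketch stays importable without diffusers installed.
    from diffusers import ModularPipeline

    pipe = ModularPipeline.from_pretrained("<your-org>/HeliosModularPipeline")  # placeholder id
    pipe.load_default_components()  # assumed helper; check the modular-diffusers docs
    videos = pipe(**GEN_KWARGS).videos
```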
## Pipeline Architecture

This modular pipeline is composed of the following blocks:

1. **text_encoder** (`HeliosTextEncoderStep`)
   - Text encoder step that generates text embeddings to guide the video generation.
2. **vae_encoder** (`HeliosAutoVaeEncoderStep`)
   - Encoder step that encodes video or image inputs. This is an auto pipeline block.
   - *video_encoder*: `HeliosVideoVaeEncoderStep`
     - Video encoder step that encodes an input video into VAE latent space, producing `image_latents` (first frame) and `video_latents` (chunked video frames) for video-to-video generation.
   - *image_encoder*: `HeliosImageVaeEncoderStep`
     - Image encoder step that encodes an input image into VAE latent space, producing `image_latents` (first-frame prefix) and `fake_image_latents` (history seed) for image-to-video generation.
3. **denoise** (`HeliosAutoCoreDenoiseStep`)
   - Core denoise step that selects the appropriate denoising block.
   - *video2video*: `HeliosV2VCoreDenoiseStep`
     - V2V denoise block that seeds history with video latents and uses I2V-aware chunk preparation.
   - *image2video*: `HeliosI2VCoreDenoiseStep`
     - I2V denoise block that seeds history with image latents and uses I2V-aware chunk preparation.
   - *text2video*: `HeliosCoreDenoiseStep`
     - Denoise block that takes encoded conditions and runs the chunk-based denoising process.
4. **decode** (`HeliosDecodeStep`)
   - Decodes all chunk latents with the VAE, concatenates them, trims to the target frame count, and postprocesses into the final video output.
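To make the auto-block selection concrete, here is a simplified sketch of how the denoise branch is chosen from the inputs that are present. The function name is illustrative and the real `HeliosAutoCoreDenoiseStep` dispatch may differ in detail:

```python
# Illustrative dispatch logic — a simplification, not the actual
# HeliosAutoBlocks implementation.
def select_denoise_branch(video=None, image=None,
                          video_latents=None, image_latents=None) -> str:
    """Pick a workflow the way the auto blocks do: from which inputs exist."""
    if video is not None or video_latents is not None:
        return "video2video"   # HeliosV2VCoreDenoiseStep
    if image is not None or image_latents is not None:
        return "image2video"   # HeliosI2VCoreDenoiseStep
    return "text2video"        # HeliosCoreDenoiseStep
```

With no conditioning media the text-to-video branch runs; supplying a video (or precomputed video latents) takes precedence over an image.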
## Model Components

1. text_encoder (`UMT5EncoderModel`)
2. tokenizer (`AutoTokenizer`)
3. guider (`ClassifierFreeGuidance`)
4. vae (`AutoencoderKLWan`)
5. video_processor (`VideoProcessor`)
6. transformer (`HeliosTransformer3DModel`)
7. scheduler (`HeliosScheduler`)
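The `guider` component applies classifier-free guidance. The standard combine rule it implements is sketched below on scalars for clarity; the actual guider applies it elementwise to noise-prediction tensors, with the scale taken from the guider config:

```python
# Standard classifier-free guidance combine rule, shown on scalars.
def cfg_combine(cond: float, uncond: float, guidance_scale: float) -> float:
    # guidance_scale = 1.0 reduces to the conditional prediction alone;
    # larger scales push the result further toward the prompt.
    return uncond + guidance_scale * (cond - uncond)
```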
## Input/Output Specification

### Inputs

**Required:**

- `prompt` (`str`): The prompt or prompts to guide video generation.
- `history_sizes` (`list`): Sizes of the long/mid/short history buffers for temporal context.
- `sigmas` (`list`): Custom sigmas for the denoising process.

**Optional:**

- `negative_prompt` (`str`): The prompt or prompts not to guide video generation.
- `max_sequence_length` (`int`), default `512`: Maximum sequence length for prompt encoding.
- `video` (`Any`): Input video for video-to-video generation.
- `height` (`int`), default `384`: The height in pixels of the generated video.
- `width` (`int`), default `640`: The width in pixels of the generated video.
- `num_latent_frames_per_chunk` (`int`), default `9`: Number of latent frames per temporal chunk.
- `generator` (`Generator`): Torch generator for deterministic generation.
- `image` (`PIL.Image.Image | list[PIL.Image.Image]`): Reference image(s) for denoising; a single image or a list of images.
- `num_videos_per_prompt` (`int`), default `1`: Number of videos to generate per prompt.
- `image_latents` (`Tensor`): Image latents used to guide generation; can be produced by the vae_encoder step.
- `video_latents` (`Tensor`): Encoded video latents for V2V generation.
- `image_noise_sigma_min` (`float`), default `0.111`: Minimum sigma for image latent noise.
- `image_noise_sigma_max` (`float`), default `0.135`: Maximum sigma for image latent noise.
- `video_noise_sigma_min` (`float`), default `0.111`: Minimum sigma for video latent noise.
- `video_noise_sigma_max` (`float`), default `0.135`: Maximum sigma for video latent noise.
- `num_frames` (`int`), default `132`: Total number of video frames to generate.
- `keep_first_frame` (`bool`), default `True`: Whether to keep the first frame as a prefix in history.
- `num_inference_steps` (`int`), default `50`: The number of denoising steps.
- `latents` (`Tensor`): Pre-generated noisy latents.
- `timesteps` (`Tensor`): Timesteps for the denoising process.
- Additional conditional model inputs for the denoiser, e.g. `prompt_embeds`, `negative_prompt_embeds`.
- `attention_kwargs` (`dict`): Additional kwargs for attention processors.
- `fake_image_latents` (`Tensor`): Fake image latents used as the history seed for I2V generation.
- `output_type` (`str`), default `np`: Output format: `'pil'`, `'np'`, or `'pt'`.

### Outputs

- `videos` (`list`): The generated videos.
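The frame defaults above are related through the VAE's temporal compression. A hedged back-of-envelope, assuming the Wan-style factor of 4 (1 + 4·(n−1) pixel frames for n latent frames) — an assumption, not verified against the Helios implementation:

```python
# Hedged arithmetic sketch relating num_frames, latent frames, and chunks.
# temporal_compression = 4 is an assumption borrowed from Wan-style VAEs.
def num_latent_frames(num_frames: int, temporal_compression: int = 4) -> int:
    return (num_frames - 1) // temporal_compression + 1

def num_chunks(latent_frames: int, per_chunk: int = 9) -> int:
    return -(-latent_frames // per_chunk)  # ceiling division

lf = num_latent_frames(132)  # README default num_frames
chunks = num_chunks(lf)      # README default num_latent_frames_per_chunk = 9
```

Under that assumption the default 132 frames encode to 33 latent frames, which the chunked denoiser would process in 4 chunks of up to 9 latent frames each.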
modular_model_index.json
ADDED
{
  "_blocks_class_name": "HeliosAutoBlocks",
  "_class_name": "HeliosModularPipeline",
  "_diffusers_version": "0.37.0.dev0",
  "scheduler": [
    "diffusers",
    "HeliosScheduler",
    {
      "pretrained_model_name_or_path": null,
      "revision": null,
      "subfolder": "",
      "type_hint": [
        "diffusers",
        "HeliosScheduler"
      ],
      "variant": null
    }
  ],
  "text_encoder": [
    null,
    null,
    {
      "pretrained_model_name_or_path": null,
      "revision": null,
      "subfolder": "",
      "type_hint": [
        "transformers",
        "UMT5EncoderModel"
      ],
      "variant": null
    }
  ],
  "tokenizer": [
    null,
    null,
    {
      "pretrained_model_name_or_path": null,
      "revision": null,
      "subfolder": "",
      "type_hint": [
        "transformers",
        "AutoTokenizer"
      ],
      "variant": null
    }
  ],
  "transformer": [
    null,
    null,
    {
      "pretrained_model_name_or_path": null,
      "revision": null,
      "subfolder": "",
      "type_hint": [
        "diffusers",
        "HeliosTransformer3DModel"
      ],
      "variant": null
    }
  ],
  "vae": [
    null,
    null,
    {
      "pretrained_model_name_or_path": null,
      "revision": null,
      "subfolder": "",
      "type_hint": [
        "diffusers",
        "AutoencoderKLWan"
      ],
      "variant": null
    }
  ]
}
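Each component entry in the index is a `[library, class_name, loading_spec]` triple; the `null` library/class pairs appear to mark components whose class is resolved from the `type_hint` in the spec rather than pinned up front (an inference from the file, not documented behavior). A minimal read of one entry:

```python
import json

# Parse a single component entry of the [library, class_name, loading_spec]
# shape used in modular_model_index.json (JSON null becomes Python None).
entry = json.loads("""
["diffusers", "HeliosScheduler",
 {"pretrained_model_name_or_path": null, "revision": null, "subfolder": "",
  "type_hint": ["diffusers", "HeliosScheduler"], "variant": null}]
""")
library, class_name, spec = entry
hint_library, hint_class = spec["type_hint"]
```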