YiYiXu (HF Staff) committed · Commit 29d265a · verified · 1 Parent(s): 3352350

Upload HeliosModularPipeline

Files changed (2)
  1. README.md +89 -0
  2. modular_model_index.json +75 -0
README.md ADDED
---
library_name: diffusers
tags:
- modular-diffusers
- diffusers
- helios
- text-to-video
---
This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

**Pipeline Type**: HeliosAutoBlocks

**Description**: Auto Modular pipeline for text-to-video, image-to-video, and video-to-video tasks using Helios.

This pipeline uses a 4-block architecture that can be customized and extended.

## Example Usage

[TODO]

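While the usage section above is still TODO, here is a minimal sketch of how a Modular Diffusers pipeline is typically loaded and run. The repo id is a placeholder, and the exact loading helpers (`from_pretrained` with `trust_remote_code`, `load_components`) may differ across diffusers versions, so treat this as an assumption rather than a verified recipe.

```python
import torch
from diffusers import ModularPipeline

# Placeholder repo id -- substitute this repository's actual id.
pipe = ModularPipeline.from_pretrained("<repo-id>", trust_remote_code=True)

# Materialize the model components (text encoder, VAE, transformer, ...).
# The helper name is an assumption; check the Modular Diffusers docs
# for your installed version.
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Text-to-video: only `prompt` is required; num_frames / height / width
# fall back to the defaults listed in the input spec below (132 / 384 / 640).
result = pipe(
    prompt="A sailboat gliding across a calm sea at sunset",
    num_inference_steps=50,
    output_type="np",
)
# The pipeline state exposes generated outputs by name; access may vary.
videos = result.values["videos"]
```

For image-to-video or video-to-video, pass `image=...` or `video=...` respectively; the auto blocks select the matching encoder and denoise path.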
## Pipeline Architecture

This modular pipeline is composed of the following blocks:

1. **text_encoder** (`HeliosTextEncoderStep`)
   - Text Encoder step that generates text embeddings to guide the video generation.
2. **vae_encoder** (`HeliosAutoVaeEncoderStep`)
   - Encoder step that encodes video or image inputs. This is an auto pipeline block.
   - *video_encoder*: `HeliosVideoVaeEncoderStep`
     - Video Encoder step that encodes an input video into VAE latent space, producing `image_latents` (first frame) and `video_latents` (chunked video frames) for video-to-video generation.
   - *image_encoder*: `HeliosImageVaeEncoderStep`
     - Image Encoder step that encodes an input image into VAE latent space, producing `image_latents` (first-frame prefix) and `fake_image_latents` (history seed) for image-to-video generation.
3. **denoise** (`HeliosAutoCoreDenoiseStep`)
   - Core denoise step that selects the appropriate denoising block.
   - *video2video*: `HeliosV2VCoreDenoiseStep`
     - V2V denoise block that seeds history with video latents and uses I2V-aware chunk preparation.
   - *image2video*: `HeliosI2VCoreDenoiseStep`
     - I2V denoise block that seeds history with image latents and uses I2V-aware chunk preparation.
   - *text2video*: `HeliosCoreDenoiseStep`
     - Denoise block that takes encoded conditions and runs the chunk-based denoising process.
4. **decode** (`HeliosDecodeStep`)
   - Decodes all chunk latents with the VAE, concatenates them, trims to the target frame count, and postprocesses into the final video output.

## Model Components

1. text_encoder (`UMT5EncoderModel`)
2. tokenizer (`AutoTokenizer`)
3. guider (`ClassifierFreeGuidance`)
4. vae (`AutoencoderKLWan`)
5. video_processor (`VideoProcessor`)
6. transformer (`HeliosTransformer3DModel`)
7. scheduler (`HeliosScheduler`)

## Input/Output Specification

### Inputs

**Required:**

- `prompt` (`str`): The prompt or prompts to guide video generation.
- `history_sizes` (`list`): Sizes of the long/mid/short history buffers for temporal context.
- `sigmas` (`list`): Custom sigmas for the denoising process.

**Optional:**

- `negative_prompt` (`str`): The prompt or prompts not to guide video generation.
- `max_sequence_length` (`int`), default: `512`: Maximum sequence length for prompt encoding.
- `video` (`Any`): Input video for video-to-video generation.
- `height` (`int`), default: `384`: The height in pixels of the generated video.
- `width` (`int`), default: `640`: The width in pixels of the generated video.
- `num_latent_frames_per_chunk` (`int`), default: `9`: Number of latent frames per temporal chunk.
- `generator` (`Generator`): Torch generator for deterministic generation.
- `image` (`PIL.Image.Image | list[PIL.Image.Image]`): Reference image(s) for denoising; a single image or a list of images.
- `num_videos_per_prompt` (`int`), default: `1`: Number of videos to generate per prompt.
- `image_latents` (`Tensor`): Image latents used to guide generation; can be produced by the vae_encoder step.
- `video_latents` (`Tensor`): Encoded video latents for V2V generation.
- `image_noise_sigma_min` (`float`), default: `0.111`: Minimum sigma for image latent noise.
- `image_noise_sigma_max` (`float`), default: `0.135`: Maximum sigma for image latent noise.
- `video_noise_sigma_min` (`float`), default: `0.111`: Minimum sigma for video latent noise.
- `video_noise_sigma_max` (`float`), default: `0.135`: Maximum sigma for video latent noise.
- `num_frames` (`int`), default: `132`: Total number of video frames to generate.
- `keep_first_frame` (`bool`), default: `True`: Whether to keep the first frame as a prefix in history.
- `num_inference_steps` (`int`), default: `50`: The number of denoising steps.
- `latents` (`Tensor`): Pre-generated noisy latents for video generation.
- `timesteps` (`Tensor`): Timesteps for the denoising process.
- Additional conditional model inputs for the denoiser, e.g. `prompt_embeds`, `negative_prompt_embeds`, etc.
- `attention_kwargs` (`dict`): Additional kwargs for attention processors.
- `fake_image_latents` (`Tensor`): Fake image latents used as the history seed for I2V generation.
- `output_type` (`str`), default: `np`: Output format: `'pil'`, `'np'`, or `'pt'`.

### Outputs

- `videos` (`list`): The generated videos.
modular_model_index.json ADDED
{
  "_blocks_class_name": "HeliosAutoBlocks",
  "_class_name": "HeliosModularPipeline",
  "_diffusers_version": "0.37.0.dev0",
  "scheduler": [
    "diffusers",
    "HeliosScheduler",
    {
      "pretrained_model_name_or_path": null,
      "revision": null,
      "subfolder": "",
      "type_hint": [
        "diffusers",
        "HeliosScheduler"
      ],
      "variant": null
    }
  ],
  "text_encoder": [
    null,
    null,
    {
      "pretrained_model_name_or_path": null,
      "revision": null,
      "subfolder": "",
      "type_hint": [
        "transformers",
        "UMT5EncoderModel"
      ],
      "variant": null
    }
  ],
  "tokenizer": [
    null,
    null,
    {
      "pretrained_model_name_or_path": null,
      "revision": null,
      "subfolder": "",
      "type_hint": [
        "transformers",
        "AutoTokenizer"
      ],
      "variant": null
    }
  ],
  "transformer": [
    null,
    null,
    {
      "pretrained_model_name_or_path": null,
      "revision": null,
      "subfolder": "",
      "type_hint": [
        "diffusers",
        "HeliosTransformer3DModel"
      ],
      "variant": null
    }
  ],
  "vae": [
    null,
    null,
    {
      "pretrained_model_name_or_path": null,
      "revision": null,
      "subfolder": "",
      "type_hint": [
        "diffusers",
        "AutoencoderKLWan"
      ],
      "variant": null
    }
  ]
}
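Each component entry in `modular_model_index.json` is a `[library, class, loading-spec]` triple; a `null` library/class means the class is resolved from the spec's `type_hint` instead. As a quick sanity check, the structure can be walked with the standard-library `json` module (the excerpt below reproduces two of the entries shown above):

```python
import json

# Excerpt of the component specs from modular_model_index.json above.
raw = '''
{
  "_class_name": "HeliosModularPipeline",
  "scheduler": ["diffusers", "HeliosScheduler",
    {"pretrained_model_name_or_path": null, "revision": null,
     "subfolder": "", "type_hint": ["diffusers", "HeliosScheduler"],
     "variant": null}],
  "text_encoder": [null, null,
    {"pretrained_model_name_or_path": null, "revision": null,
     "subfolder": "", "type_hint": ["transformers", "UMT5EncoderModel"],
     "variant": null}]
}
'''

index = json.loads(raw)

components = {}
for name, spec in index.items():
    if name.startswith("_"):        # _class_name etc. are metadata, not components
        continue
    library, cls, loading = spec    # the [library, class, loading-spec] triple
    hint_lib, hint_cls = loading["type_hint"]
    # Fall back to the type hint when library/class are null.
    components[name] = (library or hint_lib, cls or hint_cls)

print(components)
# {'scheduler': ('diffusers', 'HeliosScheduler'),
#  'text_encoder': ('transformers', 'UMT5EncoderModel')}
```

The same fallback logic explains why `text_encoder` can ship as `[null, null, ...]` here: its concrete class (`UMT5EncoderModel` from `transformers`) is still recoverable from `type_hint`.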