---
library_name: diffusers
tags:
- modular-diffusers
- diffusers
- helios-pyramid
---

This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

**Pipeline Type**: HeliosPyramidAutoBlocks

**Description**: Auto Modular pipeline for pyramid progressive generation (T2V/I2V/V2V) using Helios.

This pipeline uses a 4-block architecture that can be customized and extended.

## Example Usage

[TODO]
## Pipeline Architecture

This modular pipeline is composed of the following blocks:

1. **text_encoder** (`HeliosTextEncoderStep`)
   - Text encoder step that generates the text embeddings used to guide video generation.
2. **vae_encoder** (`HeliosPyramidAutoVaeEncoderStep`)
   - Encoder step that encodes video or image inputs. This is an auto pipeline block.
3. **denoise** (`HeliosPyramidAutoCoreDenoiseStep`)
   - Pyramid core denoise step that selects the appropriate denoising block.
4. **decode** (`HeliosDecodeStep`)
   - Decodes all chunk latents with the VAE, concatenates them, trims to the target frame count, and postprocesses the result into the final video output.
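The decode step's concatenate-and-trim behavior can be sketched in plain NumPy (illustrative only: the chunk count, the tiny spatial dimensions, and the frame count per decoded chunk below are assumptions, not the pipeline's actual tensors):

```python
import numpy as np

# Hypothetical decoded chunks, each (frames, height, width, channels).
# In the real pipeline these come from the VAE decoding each chunk's latents;
# spatial dims are downscaled here just to keep the sketch small.
chunks = [np.zeros((36, 12, 20, 3), dtype=np.float32) for _ in range(4)]

num_frames = 132  # target frame count from the pipeline inputs

# Concatenate all chunks along the time axis, then trim to the target length.
video = np.concatenate(chunks, axis=0)[:num_frames]

print(video.shape)  # (132, 12, 20, 3)
```

Trimming is needed because chunked decoding can overshoot the requested `num_frames` (here 4 × 36 = 144 decoded frames for a 132-frame target).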
## Model Components

1. `text_encoder` (`UMT5EncoderModel`)
2. `tokenizer` (`AutoTokenizer`)
3. `guider` (`ClassifierFreeGuidance`)
4. `vae` (`AutoencoderKLWan`)
5. `video_processor` (`VideoProcessor`)
6. `transformer` (`HeliosTransformer3DModel`)
7. `scheduler` (`HeliosScheduler`)

## Workflow Input Specification
<details>
<summary><strong>text2video</strong></summary>

- `prompt` (`str`): The prompt or prompts to guide video generation.

</details>

<details>
<summary><strong>image2video</strong></summary>

- `prompt` (`str`): The prompt or prompts to guide video generation.
- `image` (`Image | list`): Reference image(s) for denoising. Can be a single image or a list of images.

</details>

<details>
<summary><strong>video2video</strong></summary>

- `prompt` (`str`): The prompt or prompts to guide video generation.
- `video` (`None`): Input video for video-to-video generation.

</details>
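The auto blocks choose a workflow based on which of these optional inputs the caller provides. A minimal sketch of that selection logic (the function name and the exact precedence rule are assumptions for illustration, not the pipeline's actual code):

```python
def select_workflow(image=None, video=None):
    """Hypothetical dispatch mirroring the auto blocks: the workflow is
    inferred from which optional inputs are present alongside `prompt`."""
    if video is not None:
        return "video2video"
    if image is not None:
        return "image2video"
    return "text2video"

print(select_workflow())                    # text2video
print(select_workflow(image="ref.png"))     # image2video
print(select_workflow(video=["f0", "f1"]))  # video2video
```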
## Input/Output Specification

**Inputs:**
- `prompt` (`str`): The prompt or prompts to guide video generation.
- `negative_prompt` (`str`, *optional*): The prompt or prompts not to guide the video generation.
- `max_sequence_length` (`int`, *optional*, defaults to `512`): Maximum sequence length for prompt encoding.
- `video` (`None`, *optional*): Input video for video-to-video generation.
- `height` (`int`, *optional*, defaults to `384`): The height in pixels of the generated video.
- `width` (`int`, *optional*, defaults to `640`): The width in pixels of the generated video.
- `num_latent_frames_per_chunk` (`int`, *optional*, defaults to `9`): Number of latent frames per temporal chunk.
- `generator` (`Generator`, *optional*): Torch generator for deterministic generation.
- `image` (`Image | list`, *optional*): Reference image(s) for denoising. Can be a single image or a list of images.
- `num_videos_per_prompt` (`int`, *optional*, defaults to `1`): Number of videos to generate per prompt.
- `image_latents` (`Tensor`, *optional*): Image latents used to guide generation. Can be produced by the vae_encoder step.
- `video_latents` (`Tensor`, *optional*): Encoded video latents for V2V generation.
- `image_noise_sigma_min` (`float`, *optional*, defaults to `0.111`): Minimum sigma for image latent noise.
- `image_noise_sigma_max` (`float`, *optional*, defaults to `0.135`): Maximum sigma for image latent noise.
- `video_noise_sigma_min` (`float`, *optional*, defaults to `0.111`): Minimum sigma for video latent noise.
- `video_noise_sigma_max` (`float`, *optional*, defaults to `0.135`): Maximum sigma for video latent noise.
- `num_frames` (`int`, *optional*, defaults to `132`): Total number of video frames to generate.
- `history_sizes` (`list`): Sizes of the long/mid/short history buffers for temporal context.
- `keep_first_frame` (`bool`, *optional*, defaults to `True`): Whether to keep the first frame as a prefix in history.
- `pyramid_num_inference_steps_list` (`list`, *optional*, defaults to `[10, 10, 10]`): Number of denoising steps per pyramid stage.
- `latents` (`Tensor`, *optional*): Pre-generated noisy latents for video generation.
- `**denoiser_input_fields` (*optional*): Conditional model inputs for the denoiser, e.g. `prompt_embeds`, `negative_prompt_embeds`, etc.
- `attention_kwargs` (`dict`, *optional*): Additional kwargs for attention processors.
- `fake_image_latents` (`Tensor`, *optional*): Fake image latents used as the history seed for I2V generation.
- `output_type` (`str`, *optional*, defaults to `np`): Output format: `'pil'`, `'np'`, or `'pt'`.
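The paired `*_noise_sigma_min`/`*_noise_sigma_max` inputs define a band of noise levels for the conditioning latents. One way to read them is as bounds for a per-sample draw; a seeded sketch (uniform sampling is an assumption here — the pipeline's actual noise schedule may differ):

```python
import numpy as np

def sample_noise_sigma(rng, sigma_min=0.111, sigma_max=0.135):
    """Draw a noise level inside the configured [sigma_min, sigma_max] band.
    Uniform sampling is an illustrative assumption, not the pipeline's code."""
    return float(rng.uniform(sigma_min, sigma_max))

rng = np.random.default_rng(0)  # seeded for reproducibility
sigma = sample_noise_sigma(rng)
print(0.111 <= sigma <= 0.135)  # True
```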
|
| | **Outputs:** |
| |
|
| | - `videos` (`list`): The generated videos. |
| |
|