YiYiXu HF Staff

Upload HeliosModularPipeline

d4e5429 verified 13 days ago

	---
	library_name: diffusers
	tags:
	- modular-diffusers
	- diffusers
	- helios
	- text-to-image
	---
	This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

	Pipeline Type: HeliosAutoBlocks

	Description: Auto Modular pipeline for text-to-video, image-to-video, and video-to-video tasks using Helios.

	This pipeline uses a 4-block architecture that can be customized and extended.

	## Example Usage

	[TODO]

	## Pipeline Architecture

	This modular pipeline is composed of the following blocks:

	1. text_encoder (`HeliosTextEncoderStep`)
	   - Text Encoder step that generates text embeddings to guide the video generation.
	2. vae_encoder (`HeliosAutoVaeEncoderStep`)
	   - Encoder step that encodes video or image inputs. This is an auto pipeline block.
	   - video_encoder: `HeliosVideoVaeEncoderStep`
	     - Video Encoder step that encodes an input video into VAE latent space, producing `image_latents` (first frame) and `video_latents` (chunked video frames) for video-to-video generation.
	   - image_encoder: `HeliosImageVaeEncoderStep`
	     - Image Encoder step that encodes an input image into VAE latent space, producing `image_latents` (first frame prefix) and `fake_image_latents` (history seed) for image-to-video generation.
	3. denoise (`HeliosAutoCoreDenoiseStep`)
	   - Core denoise step that selects the appropriate denoising block.
	   - video2video: `HeliosV2VCoreDenoiseStep`
	     - V2V denoise block that seeds history with video latents and uses I2V-aware chunk preparation.
	   - image2video: `HeliosI2VCoreDenoiseStep`
	     - I2V denoise block that seeds history with image latents and uses I2V-aware chunk preparation.
	   - text2video: `HeliosCoreDenoiseStep`
	     - Denoise block that takes encoded conditions and runs the chunk-based denoising process.
	4. decode (`HeliosDecodeStep`)
	   - Decodes all chunk latents with the VAE, concatenates them, trims to the target frame count, and postprocesses into the final video output.
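
The `denoise` auto block picks one sub-block depending on which conditioning inputs are present. A minimal Python sketch of that selection follows. The trigger inputs (`video_latents`, `image_latents`) and their priority order are assumptions inferred from the block descriptions, and `select_denoise_block` is a hypothetical helper, not part of Diffusers. Because the video encoder produces both `image_latents` and `video_latents`, the sketch checks `video_latents` first.

```python
from typing import Any, Mapping

# Assumed trigger inputs, checked in priority order. The real
# HeliosAutoCoreDenoiseStep may use different triggers or ordering.
DENOISE_TRIGGERS = (
    ("video_latents", "video2video"),
    ("image_latents", "image2video"),
)
DEFAULT_DENOISE_BLOCK = "text2video"


def select_denoise_block(inputs: Mapping[str, Any]) -> str:
    """Return the name of the denoise sub-block implied by the inputs."""
    for trigger, block_name in DENOISE_TRIGGERS:
        # A trigger counts as present only if it is set to a non-None value.
        if inputs.get(trigger) is not None:
            return block_name
    # No conditioning latents were supplied, so fall back to text-to-video.
    return DEFAULT_DENOISE_BLOCK
```

With only a `prompt` the sketch returns `text2video`; adding `image_latents` switches it to `image2video`, and adding `video_latents` switches it to `video2video`.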

	## Model Components

	1. text_encoder (`UMT5EncoderModel`)
	2. tokenizer (`AutoTokenizer`)
	3. guider (`ClassifierFreeGuidance`)
	4. vae (`AutoencoderKLWan`)
	5. video_processor (`VideoProcessor`)
	6. transformer (`HeliosTransformer3DModel`)
	7. scheduler (`HeliosScheduler`)

	## Input/Output Specification

	### Inputs

	Required:

	- `prompt` (`str`): The prompt or prompts to guide video generation.
	- `history_sizes` (`list`): Sizes of long/mid/short history buffers for temporal context.
	- `sigmas` (`list`): Custom sigmas for the denoising process.

	Optional:

	- `negative_prompt` (`str`): The prompt or prompts that the video generation should steer away from.
	- `max_sequence_length` (`int`), default: `512`: Maximum sequence length for prompt encoding.
	- `video` (`Any`): Input video for video-to-video generation.
	- `height` (`int`), default: `384`: The height in pixels of the generated video.
	- `width` (`int`), default: `640`: The width in pixels of the generated video.
	- `num_latent_frames_per_chunk` (`int`), default: `9`: Number of latent frames per temporal chunk.
	- `generator` (`Generator`): Torch generator for deterministic generation.
	- `image` (`PIL.Image.Image | list[PIL.Image.Image]`): Reference image(s) for denoising. Can be a single image or a list of images.
	- `num_videos_per_prompt` (`int`), default: `1`: Number of videos to generate per prompt.
	- `image_latents` (`Tensor`): Image latents used to guide video generation. Can be generated by the `vae_encoder` step.
	- `video_latents` (`Tensor`): Encoded video latents for V2V generation.
	- `image_noise_sigma_min` (`float`), default: `0.111`: Minimum sigma for image latent noise.
	- `image_noise_sigma_max` (`float`), default: `0.135`: Maximum sigma for image latent noise.
	- `video_noise_sigma_min` (`float`), default: `0.111`: Minimum sigma for video latent noise.
	- `video_noise_sigma_max` (`float`), default: `0.135`: Maximum sigma for video latent noise.
	- `num_frames` (`int`), default: `132`: Total number of video frames to generate.
	- `keep_first_frame` (`bool`), default: `True`: Whether to keep the first frame as a prefix in history.
	- `num_inference_steps` (`int`), default: `50`: The number of denoising steps.
	- `latents` (`Tensor`): Pre-generated noisy latents for video generation.
	- `timesteps` (`Tensor`): Timesteps for the denoising process.
	- `None` (`Any`): Conditional model inputs for the denoiser, e.g. `prompt_embeds` and `negative_prompt_embeds`.
	- `attention_kwargs` (`dict`): Additional kwargs for attention processors.
	- `fake_image_latents` (`Tensor`): Fake image latents used as history seed for I2V generation.
	- `output_type` (`str`), default: `np`: Output format, one of `pil`, `np`, or `pt`.
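
The defaults for `num_frames` and `num_latent_frames_per_chunk` interact through the VAE's temporal compression. The rough sketch below estimates how many chunks a run needs. It assumes a temporal compression factor of 4, with the first latent frame mapping to a single pixel frame, which is common for Wan-style VAEs but is not stated in this card. `frames_per_chunk` and `num_chunks` are hypothetical helpers, not Diffusers functions.

```python
import math


def frames_per_chunk(num_latent_frames_per_chunk: int = 9,
                     temporal_compression: int = 4) -> int:
    """Pixel frames covered by one chunk of latent frames.

    Assumption: the first latent frame maps to one pixel frame and every
    further latent frame maps to `temporal_compression` pixel frames.
    """
    return (num_latent_frames_per_chunk - 1) * temporal_compression + 1


def num_chunks(num_frames: int = 132,
               num_latent_frames_per_chunk: int = 9,
               temporal_compression: int = 4) -> int:
    """Chunks needed to cover `num_frames`, rounding up.

    Any surplus frames in the last chunk would be trimmed after decoding.
    """
    per_chunk = frames_per_chunk(num_latent_frames_per_chunk, temporal_compression)
    return max(1, math.ceil(num_frames / per_chunk))


print(frames_per_chunk())  # 33
print(num_chunks())        # 4
```

Under these assumptions the defaults divide evenly: 4 chunks of 33 frames give exactly 132 frames, so nothing is trimmed. A request for 100 frames would still need 4 chunks, with the surplus frames trimmed by the decode step.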

	### Outputs

	- `videos` (`list`): The generated videos.