---
library_name: diffusers
tags:
- modular-diffusers
- diffusers
- qwenimage-layered
- text-to-image
---
This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.
|
|
**Pipeline Type**: QwenImageLayeredAutoBlocks
|
|
**Description**: Auto Modular pipeline for layered denoising tasks using QwenImage-Layered.
|
|
This pipeline uses a 4-block architecture that can be customized and extended.
|
|
## Example Usage
|
|
[TODO]
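Until an official example is added, here is a minimal sketch of how a Modular Diffusers pipeline is typically loaded and called. The repository id, device, and input image are placeholders, and the exact loading API (`ModularPipeline.from_pretrained` plus `load_components`) is an assumption based on the standard modular-pipelines workflow, not taken from this repository:

```python
import torch
from diffusers import ModularPipeline

# Load the block definitions from this repository (repo id is a placeholder).
pipe = ModularPipeline.from_pretrained("<this-repo-id>", trust_remote_code=True)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = ...  # a PIL.Image to decompose into layers

output = pipe(
    image=image,
    # prompt is optional: it is auto-captioned from the image if omitted
    resolution=640,          # target area: 640 or 1024
    layers=4,                # number of layers to extract
    num_inference_steps=50,
    output="images",
)
```

See the Input/Output Specification below for the full list of accepted arguments.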
|
|
## Pipeline Architecture
|
|
This modular pipeline is composed of the following blocks:
|
|
1. **text_encoder** (`QwenImageLayeredTextEncoderStep`)
   - Text encoder step that encodes the text prompt; if no prompt is provided, one is generated from the input image.
   - *resize*: `QwenImageLayeredResizeStep`
     - Image resize step that resizes the image to a target area (defined by the user-provided `resolution` parameter) while maintaining the aspect ratio.
   - *get_image_prompt*: `QwenImageLayeredGetImagePromptStep`
     - Auto-caption step that generates a text prompt from the input image if none is provided.
   - *encode*: `QwenImageTextEncoderStep`
     - Text encoder step that generates text embeddings to guide the image generation.
2. **vae_encoder** (`QwenImageLayeredVaeEncoderStep`)
   - VAE encoder step that encodes the image inputs into their latent representations.
   - *resize*: `QwenImageLayeredResizeStep`
     - Image resize step that resizes the image to a target area (defined by the user-provided `resolution` parameter) while maintaining the aspect ratio.
   - *preprocess*: `QwenImageEditProcessImagesInputStep`
     - Image preprocessing step. Images need to be resized first.
   - *encode*: `QwenImageVaeEncoderStep`
     - VAE encoder step that converts `processed_image` into its latent representation `image_latents`.
   - *permute*: `QwenImageLayeredPermuteLatentsStep`
     - Permutes image latents from (B, C, 1, H, W) to (B, 1, C, H, W) for layered packing.
3. **denoise** (`QwenImageLayeredCoreDenoiseStep`)
   - Core denoising workflow for the QwenImage-Layered img2img task.
   - *input*: `QwenImageLayeredInputStep`
     - Input step that prepares the inputs for the layered denoising step.
   - *prepare_latents*: `QwenImageLayeredPrepareLatentsStep`
     - Prepares initial random noise of shape (B, layers+1, C, H, W) for the generation process.
   - *set_timesteps*: `QwenImageLayeredSetTimestepsStep`
     - Sets the timesteps for QwenImage-Layered, with a custom `mu` calculation based on `image_latents`.
   - *prepare_rope_inputs*: `QwenImageLayeredRoPEInputsStep`
     - Prepares the RoPE inputs for the denoising process. Should be placed after the prepare_latents step.
   - *denoise*: `QwenImageLayeredDenoiseStep`
     - Denoising step that iteratively denoises the latents.
   - *after_denoise*: `QwenImageLayeredAfterDenoiseStep`
     - Unpacks latents from (B, seq, C*4) to (B, C, layers+1, H, W) after denoising.
4. **decode** (`QwenImageLayeredDecoderStep`)
   - Decodes unpacked latents (B, C, layers+1, H, W) into layer images.
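The shape bookkeeping in the *permute* and *after_denoise* steps above can be sketched with plain tensor ops. The shapes come from the block descriptions; the packing itself is assumed here to be the usual 2×2 patchify (so `seq = (layers+1) * H/2 * W/2` with `C*4` channels per token), which is a simplification of the actual pachifier:

```python
import torch

B, C, layers, H, W = 1, 16, 4, 64, 64
L = layers + 1  # composite image plus one latent per layer

# vae_encoder "permute": (B, C, 1, H, W) -> (B, 1, C, H, W)
image_latents = torch.randn(B, C, 1, H, W)
image_latents = image_latents.permute(0, 2, 1, 3, 4)
assert image_latents.shape == (B, 1, C, H, W)

# pack for the transformer: (B, L, C, H, W) -> (B, seq, C*4),
# assuming a 2x2 patchify, i.e. seq = L * (H/2) * (W/2)
latents = torch.randn(B, L, C, H, W)
packed = (
    latents.reshape(B, L, C, H // 2, 2, W // 2, 2)
    .permute(0, 1, 3, 5, 2, 4, 6)
    .reshape(B, L * (H // 2) * (W // 2), C * 4)
)

# "after_denoise" unpack: (B, seq, C*4) -> (B, C, layers+1, H, W)
unpacked = (
    packed.reshape(B, L, H // 2, W // 2, C, 2, 2)
    .permute(0, 4, 1, 2, 5, 3, 6)
    .reshape(B, C, L, H, W)
)
assert unpacked.shape == (B, C, layers + 1, H, W)
```

Note that packing then unpacking is a lossless round trip: `unpacked` equals the original `latents` with the layer and channel axes swapped.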
## Model Components

1. image_resize_processor (`VaeImageProcessor`)
2. text_encoder (`Qwen2_5_VLForConditionalGeneration`)
3. processor (`Qwen2VLProcessor`)
4. tokenizer (`Qwen2Tokenizer`)
5. guider (`ClassifierFreeGuidance`)
6. image_processor (`VaeImageProcessor`)
7. vae (`AutoencoderKLQwenImage`)
8. pachifier (`QwenImageLayeredPachifier`)
9. scheduler (`FlowMatchEulerDiscreteScheduler`)
10. transformer (`QwenImageTransformer2DModel`)

## Input/Output Specification
**Inputs:**

- `image` (`Image | list`): Reference image(s) for denoising; a single image or a list of images.
- `resolution` (`int`, *optional*, defaults to `640`): Target area (`resolution`×`resolution`) to resize the image to; can be `1024` or `640`.
- `prompt` (`str`, *optional*): The prompt or prompts to guide image generation.
- `use_en_prompt` (`bool`, *optional*, defaults to `False`): Whether to use the English prompt template.
- `negative_prompt` (`str`, *optional*): The prompt or prompts not to guide the image generation.
- `max_sequence_length` (`int`, *optional*, defaults to `1024`): Maximum sequence length for prompt encoding.
- `generator` (`Generator`, *optional*): Torch generator for deterministic generation.
- `num_images_per_prompt` (`int`, *optional*, defaults to `1`): The number of images to generate per prompt.
- `latents` (`Tensor`, *optional*): Pre-generated noisy latents for image generation.
- `layers` (`int`, *optional*, defaults to `4`): Number of layers to extract from the image.
- `num_inference_steps` (`int`, *optional*, defaults to `50`): The number of denoising steps.
- `sigmas` (`list`, *optional*): Custom sigmas for the denoising process.
- `attention_kwargs` (`dict`, *optional*): Additional kwargs for attention processors.
- `**denoiser_input_fields` (*optional*): Additional conditional model inputs for the denoiser, e.g. `prompt_embeds`, `negative_prompt_embeds`.
- `output_type` (`str`, *optional*, defaults to `pil`): Output format: `"pil"`, `"np"`, or `"pt"`.
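The `resolution` input sets a target *area* of `resolution`×`resolution` pixels; the resize steps then scale the image to roughly that area while keeping its aspect ratio. A rough sketch of that computation (the helper name and the rounding to a multiple of 32 are assumptions for illustration, not taken from the actual `QwenImageLayeredResizeStep`):

```python
import math

def target_size(width: int, height: int, resolution: int = 640, multiple: int = 32):
    """Scale (width, height) to roughly resolution**2 pixels, keeping aspect ratio."""
    scale = math.sqrt(resolution**2 / (width * height))
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    return new_w, new_h

# A 1920x1080 image at resolution=640 is scaled down to about 640*640 pixels
# while staying 16:9: target_size(1920, 1080, 640) -> (864, 480)
```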
**Outputs:**

- `images` (`list`): Generated images.
|
|