---
library_name: diffusers
tags:
- modular-diffusers
- diffusers
- qwenimage-layered
- text-to-image
---
This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.
|
|
**Pipeline Type**: QwenImageLayeredAutoBlocks
|
|
**Description**: Auto Modular pipeline for layered denoising tasks using QwenImage-Layered.
|
|
This pipeline uses a 4-block architecture that can be customized and extended.
|
|
## Example Usage
|
|
[TODO]
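Until an official example is added, here is a minimal sketch of how a Modular Diffusers pipeline is typically loaded and called. The repository id, device, and input image are placeholders, and the exact loading API (`ModularPipeline.from_pretrained` plus `load_components`) is an assumption based on the standard modular-pipelines workflow, not taken from this repository:

```python
import torch
from diffusers import ModularPipeline

# Load the block definitions from this repository (repo id is a placeholder).
pipe = ModularPipeline.from_pretrained("<this-repo-id>", trust_remote_code=True)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = ...  # a PIL.Image to decompose into layers

output = pipe(
    image=image,
    # prompt is optional: it is auto-captioned from the image if omitted
    resolution=640,          # target area: 640 or 1024
    layers=4,                # number of layers to extract
    num_inference_steps=50,
    output="images",
)
```

See the Input/Output Specification below for the full list of accepted arguments.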
|
|
## Pipeline Architecture
|
|
This modular pipeline is composed of the following blocks:
|
|
1. **text_encoder** (`QwenImageLayeredTextEncoderStep`)
   - Text encoder step that encodes the text prompt; if no prompt is provided, one is generated from the input image.
   - *resize*: `QwenImageLayeredResizeStep`
     - Image resize step that resizes the image to a target area (defined by the user-provided `resolution` parameter) while maintaining the aspect ratio.
   - *get_image_prompt*: `QwenImageLayeredGetImagePromptStep`
     - Auto-caption step that generates a text prompt from the input image if none is provided.
   - *encode*: `QwenImageTextEncoderStep`
     - Text encoder step that generates text embeddings to guide the image generation.
2. **vae_encoder** (`QwenImageLayeredVaeEncoderStep`)
   - VAE encoder step that encodes the image inputs into their latent representations.
   - *resize*: `QwenImageLayeredResizeStep`
     - Image resize step that resizes the image to a target area (defined by the user-provided `resolution` parameter) while maintaining the aspect ratio.
   - *preprocess*: `QwenImageEditProcessImagesInputStep`
     - Image preprocessing step. Images need to be resized first.
   - *encode*: `QwenImageVaeEncoderStep`
     - VAE encoder step that converts `processed_image` into its latent representation `image_latents`.
   - *permute*: `QwenImageLayeredPermuteLatentsStep`
     - Permutes image latents from (B, C, 1, H, W) to (B, 1, C, H, W) for layered packing.
3. **denoise** (`QwenImageLayeredCoreDenoiseStep`)
   - Core denoising workflow for the QwenImage-Layered img2img task.
   - *input*: `QwenImageLayeredInputStep`
     - Input step that prepares the inputs for the layered denoising step.
   - *prepare_latents*: `QwenImageLayeredPrepareLatentsStep`
     - Prepares initial random noise of shape (B, layers+1, C, H, W) for the generation process.
   - *set_timesteps*: `QwenImageLayeredSetTimestepsStep`
     - Sets the timesteps for QwenImage-Layered, with a custom `mu` calculation based on `image_latents`.
   - *prepare_rope_inputs*: `QwenImageLayeredRoPEInputsStep`
     - Prepares the RoPE inputs for the denoising process. Should be placed after the prepare_latents step.
   - *denoise*: `QwenImageLayeredDenoiseStep`
     - Denoising step that iteratively denoises the latents.
   - *after_denoise*: `QwenImageLayeredAfterDenoiseStep`
     - Unpacks latents from (B, seq, C*4) to (B, C, layers+1, H, W) after denoising.
4. **decode** (`QwenImageLayeredDecoderStep`)
   - Decodes unpacked latents (B, C, layers+1, H, W) into layer images.
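The shape bookkeeping in the *permute* and *after_denoise* steps above can be sketched with plain tensor ops. The shapes come from the block descriptions; the packing itself is assumed here to be the usual 2×2 patchify (so `seq = (layers+1) * H/2 * W/2` with `C*4` channels per token), which is a simplification of the actual pachifier:

```python
import torch

B, C, layers, H, W = 1, 16, 4, 64, 64
L = layers + 1  # composite image plus one latent per layer

# vae_encoder "permute": (B, C, 1, H, W) -> (B, 1, C, H, W)
image_latents = torch.randn(B, C, 1, H, W)
image_latents = image_latents.permute(0, 2, 1, 3, 4)
assert image_latents.shape == (B, 1, C, H, W)

# pack for the transformer: (B, L, C, H, W) -> (B, seq, C*4),
# assuming a 2x2 patchify, i.e. seq = L * (H/2) * (W/2)
latents = torch.randn(B, L, C, H, W)
packed = (
    latents.reshape(B, L, C, H // 2, 2, W // 2, 2)
    .permute(0, 1, 3, 5, 2, 4, 6)
    .reshape(B, L * (H // 2) * (W // 2), C * 4)
)

# "after_denoise" unpack: (B, seq, C*4) -> (B, C, layers+1, H, W)
unpacked = (
    packed.reshape(B, L, H // 2, W // 2, C, 2, 2)
    .permute(0, 4, 1, 2, 5, 3, 6)
    .reshape(B, C, L, H, W)
)
assert unpacked.shape == (B, C, layers + 1, H, W)
```

Note that packing then unpacking is a lossless round trip: `unpacked` equals the original `latents` with the layer and channel axes swapped.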
## Model Components

1. image_resize_processor (`VaeImageProcessor`)
2. text_encoder (`Qwen2_5_VLForConditionalGeneration`)
3. processor (`Qwen2VLProcessor`)
4. tokenizer (`Qwen2Tokenizer`)
5. guider (`ClassifierFreeGuidance`)
6. image_processor (`VaeImageProcessor`)
7. vae (`AutoencoderKLQwenImage`)
8. pachifier (`QwenImageLayeredPachifier`)
9. scheduler (`FlowMatchEulerDiscreteScheduler`)
10. transformer (`QwenImageTransformer2DModel`)

## Input/Output Specification
**Inputs:**

- `image` (`Image | list`): Reference image(s) for denoising; a single image or a list of images.
- `resolution` (`int`, *optional*, defaults to `640`): Target area (`resolution`×`resolution`) to resize the image to; can be `1024` or `640`.
- `prompt` (`str`, *optional*): The prompt or prompts to guide image generation.
- `use_en_prompt` (`bool`, *optional*, defaults to `False`): Whether to use the English prompt template.
- `negative_prompt` (`str`, *optional*): The prompt or prompts not to guide the image generation.
- `max_sequence_length` (`int`, *optional*, defaults to `1024`): Maximum sequence length for prompt encoding.
- `generator` (`Generator`, *optional*): Torch generator for deterministic generation.
- `num_images_per_prompt` (`int`, *optional*, defaults to `1`): The number of images to generate per prompt.
- `latents` (`Tensor`, *optional*): Pre-generated noisy latents for image generation.
- `layers` (`int`, *optional*, defaults to `4`): Number of layers to extract from the image.
- `num_inference_steps` (`int`, *optional*, defaults to `50`): The number of denoising steps.
- `sigmas` (`list`, *optional*): Custom sigmas for the denoising process.
- `attention_kwargs` (`dict`, *optional*): Additional kwargs for attention processors.
- `**denoiser_input_fields` (*optional*): Additional conditional model inputs for the denoiser, e.g. `prompt_embeds`, `negative_prompt_embeds`.
- `output_type` (`str`, *optional*, defaults to `pil`): Output format: `"pil"`, `"np"`, or `"pt"`.
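The `resolution` input sets a target *area* of `resolution`×`resolution` pixels; the resize steps then scale the image to roughly that area while keeping its aspect ratio. A rough sketch of that computation (the helper name and the rounding to a multiple of 32 are assumptions for illustration, not taken from the actual `QwenImageLayeredResizeStep`):

```python
import math

def target_size(width: int, height: int, resolution: int = 640, multiple: int = 32):
    """Scale (width, height) to roughly resolution**2 pixels, keeping aspect ratio."""
    scale = math.sqrt(resolution**2 / (width * height))
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    return new_w, new_h

# A 1920x1080 image at resolution=640 is scaled down to about 640*640 pixels
# while staying 16:9: target_size(1920, 1080, 640) -> (864, 480)
```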
**Outputs:**

- `images` (`list`): Generated images.
|
|