This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

Pipeline Type: QwenImageAutoBlocks

Description: An auto modular pipeline for text-to-image, image-to-image, inpainting, and ControlNet tasks using QwenImage.

  • For image-to-image generation, provide image.
  • For inpainting, provide mask_image and image; optionally, provide padding_mask_crop.
  • To run the ControlNet workflow, provide control_image.
  • For text-to-image generation, all you need to provide is prompt.

This pipeline uses a 5-block architecture that can be customized and extended.

Example Usage

[TODO]
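A minimal, hypothetical sketch of loading and running this pipeline with Modular Diffusers. The repo id, dtype, and device are placeholders, and the exact loading calls may differ between Diffusers versions; consult the Modular Diffusers documentation for the current API.

```python
import torch
from diffusers.modular_pipelines import ModularPipeline

# "<repo-id>" is a placeholder; replace with this repository's actual id.
pipe = ModularPipeline.from_pretrained("<repo-id>", trust_remote_code=True)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Text-to-image: only a prompt is required.
image = pipe(
    prompt="A serene mountain lake at dawn",
    num_inference_steps=50,
    output="images",
)[0]
image.save("t2i.png")
```

Passing image, mask_image, or control_image in the same call would instead trigger the img2img, inpainting, or ControlNet workflow, as described above.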

Pipeline Architecture

This modular pipeline is composed of the following blocks:

  1. text_encoder (QwenImageAutoTextEncoderStep)
    • Text encoder step that encodes the text prompt into a text embedding. This is an auto pipeline block.
    • text_encoder: QwenImageTextEncoderStep
      • Text Encoder step that generates text embeddings to guide the image generation.
  2. vae_encoder (QwenImageAutoVaeEncoderStep)
    • VAE encoder step that encodes the image inputs into their latent representations.
    • inpaint: QwenImageInpaintVaeEncoderStep
      • Processes image and mask inputs for inpainting tasks.
    • img2img: QwenImageImg2ImgVaeEncoderStep
      • VAE encoder step that preprocesses and encodes the image inputs into their latent representations.
  3. controlnet_vae_encoder (QwenImageOptionalControlNetVaeEncoderStep)
    • VAE encoder step that encodes the image inputs into their latent representations.
    • controlnet: QwenImageControlNetVaeEncoderStep
      • VAE Encoder step that converts control_image into latent representations control_image_latents.
  4. denoise (QwenImageAutoCoreDenoiseStep)
    • Core step that performs the denoising process.
    • text2image: QwenImageCoreDenoiseStep
      • Denoises noise into an image for the text-to-image task. It prepares the inputs (timesteps, latents, RoPE inputs, etc.) and runs the denoise loop.
    • inpaint: QwenImageInpaintCoreDenoiseStep
      • Prepares the inputs (timesteps, latents, RoPE inputs, etc.) and runs the denoise loop for the inpainting task.
    • img2img: QwenImageImg2ImgCoreDenoiseStep
      • Prepares the inputs (timesteps, latents, RoPE inputs, etc.) and runs the denoise loop for the img2img task.
    • controlnet_text2image: QwenImageControlNetCoreDenoiseStep
      • Denoises noise into an image for the text-to-image task with ControlNet conditioning. It prepares the inputs (timesteps, latents, RoPE inputs, etc.) and runs the denoise loop.
    • controlnet_inpaint: QwenImageControlNetInpaintCoreDenoiseStep
      • Prepares the inputs (timesteps, latents, RoPE inputs, etc.) and runs the denoise loop for the inpainting task with ControlNet conditioning.
    • controlnet_img2img: QwenImageControlNetImg2ImgCoreDenoiseStep
      • Prepares the inputs (timesteps, latents, RoPE inputs, etc.) and runs the denoise loop for the img2img task with ControlNet conditioning.
  5. decode (QwenImageAutoDecodeStep)
    • Decode step that decodes the latents into images.
    • inpaint_decode: QwenImageInpaintDecodeStep
      • Decode step that decodes the latents into images and postprocesses the generated image, optionally applying the mask overlay to the original image.
    • decode: QwenImageDecodeStep
      • Decode step that decodes the latents into images and postprocesses the generated image.
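The core denoise steps above drive a flow-matching Euler scheduler (FlowMatchEulerDiscreteScheduler). As a toy illustration only (not the pipeline's API), an Euler loop integrates dx/dσ = v(x, σ) along a decreasing sigma schedule; with the constant rectified-flow velocity v = noise − x0, it recovers the clean sample exactly:

```python
def euler_denoise_loop(x, sigmas, velocity_fn):
    """Euler integration of dx/dsigma = v(x, sigma) over a decreasing sigma schedule."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + (sigma_next - sigma) * velocity_fn(x, sigma)
    return x

# Toy demo: x_t = (1 - sigma) * x0 + sigma * noise, so dx/dsigma = noise - x0.
x0, noise = 2.0, -1.0
sigmas = [1.0, 0.75, 0.5, 0.25, 0.0]  # sigma=1 is pure noise, sigma=0 is data
recovered = euler_denoise_loop(noise, sigmas, lambda x, s: noise - x0)
```

In the real pipeline, velocity_fn is the transformer's prediction and the sigma schedule comes from the scheduler (optionally overridden via the sigmas input).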

Conditional Execution

This pipeline contains blocks that are selected at runtime based on inputs:

  • Trigger Inputs: control_image, control_image_latents, image, image_latents, mask, mask_image, processed_mask_image, prompt
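A hypothetical sketch of how these trigger inputs could select a denoise sub-block at runtime. The real selection logic lives inside QwenImageAutoBlocks; the names below simply mirror the lists above:

```python
def select_denoise_block(**inputs):
    """Pick a denoise sub-block name from the provided trigger inputs."""
    has = lambda *names: any(inputs.get(n) is not None for n in names)
    controlnet = has("control_image", "control_image_latents")
    if has("mask", "mask_image", "processed_mask_image"):
        task = "inpaint"
    elif has("image", "image_latents"):
        task = "img2img"
    else:
        task = "text2image"
    return f"controlnet_{task}" if controlnet else task
```

For example, providing both control_image and image (but no mask) would route to the controlnet_img2img branch.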

Model Components

  1. text_encoder (Qwen2_5_VLForConditionalGeneration): The text encoder to use
  2. tokenizer (Qwen2Tokenizer): The tokenizer to use
  3. guider (ClassifierFreeGuidance)
  4. image_mask_processor (InpaintProcessor)
  5. vae (AutoencoderKLQwenImage)
  6. image_processor (VaeImageProcessor)
  7. controlnet (QwenImageControlNetModel)
  8. control_image_processor (VaeImageProcessor)
  9. pachifier (QwenImagePachifier)
  10. scheduler (FlowMatchEulerDiscreteScheduler)
  11. transformer (QwenImageTransformer2DModel)
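The pachifier packs spatial VAE latents into the token sequence the transformer consumes. A NumPy sketch of a 2×2 pack/unpack round trip; the exact layout used by QwenImagePachifier may differ, so treat this as illustrative:

```python
import numpy as np

def pachify(latents, patch_size=2):
    """Pack (C, H, W) latents into a (H/p * W/p, C * p * p) token sequence."""
    c, h, w = latents.shape
    assert h % patch_size == 0 and w % patch_size == 0
    x = latents.reshape(c, h // patch_size, patch_size, w // patch_size, patch_size)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape((h // patch_size) * (w // patch_size), c * patch_size * patch_size)

def unpachify(tokens, c, h, w, patch_size=2):
    """Inverse of pachify: restore the (C, H, W) latent grid from tokens."""
    x = tokens.reshape(h // patch_size, w // patch_size, c, patch_size, patch_size)
    x = x.transpose(2, 0, 3, 1, 4)  # (C, H/p, p, W/p, p)
    return x.reshape(c, h, w)
```

Packing 2×2 latent patches quarters the sequence length the transformer attends over, at the cost of a wider per-token channel dimension.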

Input/Output Specification

Inputs Required:

  • prompt_embeds (Tensor): Text embeddings used to guide the image generation. Can be generated from the text_encoder step.
  • prompt_embeds_mask (Tensor): Mask for the text embeddings. Can be generated from the text_encoder step.
  • latents (Tensor): Pre-generated noisy latents for image generation.
  • num_inference_steps (int): The number of denoising steps.

Optional:

  • prompt (str): The prompt or prompts to guide image generation.
  • negative_prompt (str): The prompt or prompts not to guide the image generation.
  • max_sequence_length (int), default: 1024: Maximum sequence length for prompt encoding.
  • mask_image (Image): Mask image for inpainting.
  • image (Union): Reference image(s) for denoising. Can be a single image or list of images.
  • height (int): The height in pixels of the generated image.
  • width (int): The width in pixels of the generated image.
  • padding_mask_crop (int): Padding for mask cropping in inpainting.
  • generator (Generator): Torch generator for deterministic generation.
  • control_image (Image): Control image for ControlNet conditioning.
  • num_images_per_prompt (int), default: 1: The number of images to generate per prompt.
  • negative_prompt_embeds (Tensor): Negative text embeddings used to guide the image generation. Can be generated from the text_encoder step.
  • negative_prompt_embeds_mask (Tensor): Mask for the negative text embeddings. Can be generated from the text_encoder step.
  • sigmas (List): Custom sigmas for the denoising process.
  • attention_kwargs (Dict): Additional kwargs for attention processors.
  • Additional conditional model inputs for the denoiser (Any), e.g. prompt_embeds, negative_prompt_embeds, etc.
  • image_latents (Tensor): Image latents used to guide the image generation. Can be generated from the vae_encoder step.
  • processed_mask_image (Tensor): The processed mask image.
  • strength (float), default: 0.9: Strength for img2img/inpainting.
  • control_image_latents (Tensor): The control image latents to use for the denoising process. Can be generated in controlnet vae encoder step.
  • control_guidance_start (float), default: 0.0: When to start applying ControlNet, as a fraction of the total steps.
  • control_guidance_end (float), default: 1.0: When to stop applying ControlNet, as a fraction of the total steps.
  • controlnet_conditioning_scale (float), default: 1.0: Scale for ControlNet conditioning.
  • output_type (str), default: pil: Output format: 'pil', 'np', 'pt'.
  • mask_overlay_kwargs (Dict): The kwargs for the postprocess step that applies the mask overlay. Generated in InpaintProcessImagesInputStep.
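Illustrative arithmetic for how strength and the control_guidance_start/end window are commonly interpreted in Diffusers-style pipelines. This is an assumption for explanation, not this pipeline's exact code:

```python
def img2img_start(num_inference_steps, strength):
    """Steps actually run for img2img/inpainting, plus the starting step index.

    A strength of 1.0 denoises from pure noise; lower values skip early steps
    and keep more of the input image.
    """
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    return num_inference_steps - steps_to_run, steps_to_run

def controlnet_active(step_index, num_steps, start=0.0, end=1.0):
    """Whether ControlNet conditioning applies at a given denoising step."""
    return not (step_index / num_steps < start or (step_index + 1) / num_steps > end)
```

Under this reading, the default strength of 0.9 with 50 steps runs 45 denoising steps starting at index 5, and the default (0.0, 1.0) window keeps ControlNet active for every step.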

Outputs:

  • images (List): The generated images.
