This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

Pipeline Type: QwenImageAutoBlocks

Description: An auto modular pipeline for text-to-image, image-to-image, inpainting, and ControlNet tasks using QwenImage.

  • For image-to-image generation, provide image.
  • For inpainting, provide mask_image and image; optionally, provide padding_mask_crop.
  • To run the ControlNet workflow, provide control_image.
  • For text-to-image generation, all you need to provide is prompt.

This pipeline uses a 5-block architecture that can be customized and extended.

Example Usage

[TODO]
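A minimal, hypothetical sketch of loading and running this pipeline with Modular Diffusers. The repo id, dtype, and device are placeholders, and the exact loading calls may differ between Diffusers versions; consult the Modular Diffusers documentation for the current API.

```python
import torch
from diffusers.modular_pipelines import ModularPipeline

# "<repo-id>" is a placeholder; replace with this repository's actual id.
pipe = ModularPipeline.from_pretrained("<repo-id>", trust_remote_code=True)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Text-to-image: only a prompt is required.
image = pipe(
    prompt="A serene mountain lake at dawn",
    num_inference_steps=50,
    output="images",
)[0]
image.save("t2i.png")
```

Passing image, mask_image, or control_image in the same call would instead trigger the img2img, inpainting, or ControlNet workflow, as described above.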

Pipeline Architecture

This modular pipeline is composed of the following blocks:

  1. text_encoder (QwenImageAutoTextEncoderStep)
    • Text encoder step that encodes the text prompt into a text embedding. This is an auto pipeline block.
    • text_encoder: QwenImageTextEncoderStep
      • Text Encoder step that generates text embeddings to guide the image generation.
  2. vae_encoder (QwenImageAutoVaeEncoderStep)
    • VAE encoder step that encodes the image inputs into their latent representations.
    • inpaint: QwenImageInpaintVaeEncoderStep
      • Processes image and mask inputs for inpainting tasks.
    • img2img: QwenImageImg2ImgVaeEncoderStep
      • VAE encoder step that preprocesses and encodes the image inputs into their latent representations.
  3. controlnet_vae_encoder (QwenImageOptionalControlNetVaeEncoderStep)
    • VAE encoder step that encodes the image inputs into their latent representations.
    • controlnet: QwenImageControlNetVaeEncoderStep
      • VAE Encoder step that converts control_image into latent representations control_image_latents.
  4. denoise (QwenImageAutoCoreDenoiseStep)
    • Core step that performs the denoising process.
    • text2image: QwenImageCoreDenoiseStep
      • Denoises noise into an image for the text-to-image task. It prepares the inputs (timesteps, latents, RoPE inputs, etc.) and runs the denoise loop.
    • inpaint: QwenImageInpaintCoreDenoiseStep
      • Prepares the inputs (timesteps, latents, RoPE inputs, etc.) and runs the denoise loop for the inpainting task.
    • img2img: QwenImageImg2ImgCoreDenoiseStep
      • Prepares the inputs (timesteps, latents, RoPE inputs, etc.) and runs the denoise loop for the img2img task.
    • controlnet_text2image: QwenImageControlNetCoreDenoiseStep
      • Denoises noise into an image for the text-to-image task with ControlNet conditioning. It prepares the inputs (timesteps, latents, RoPE inputs, etc.) and runs the denoise loop.
    • controlnet_inpaint: QwenImageControlNetInpaintCoreDenoiseStep
      • Prepares the inputs (timesteps, latents, RoPE inputs, etc.) and runs the denoise loop for the inpainting task with ControlNet conditioning.
    • controlnet_img2img: QwenImageControlNetImg2ImgCoreDenoiseStep
      • Prepares the inputs (timesteps, latents, RoPE inputs, etc.) and runs the denoise loop for the img2img task with ControlNet conditioning.
  5. decode (QwenImageAutoDecodeStep)
    • Decode step that decodes the latents into images.
    • inpaint_decode: QwenImageInpaintDecodeStep
      • Decode step that decodes the latents into images and postprocesses the generated image, optionally applying the mask overlay to the original image.
    • decode: QwenImageDecodeStep
      • Decode step that decodes the latents into images and postprocesses the generated image.
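The core denoise steps above drive a flow-matching Euler scheduler (FlowMatchEulerDiscreteScheduler). As a toy illustration only (not the pipeline's API), an Euler loop integrates dx/dσ = v(x, σ) along a decreasing sigma schedule; with the constant rectified-flow velocity v = noise − x0, it recovers the clean sample exactly:

```python
def euler_denoise_loop(x, sigmas, velocity_fn):
    """Euler integration of dx/dsigma = v(x, sigma) over a decreasing sigma schedule."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + (sigma_next - sigma) * velocity_fn(x, sigma)
    return x

# Toy demo: x_t = (1 - sigma) * x0 + sigma * noise, so dx/dsigma = noise - x0.
x0, noise = 2.0, -1.0
sigmas = [1.0, 0.75, 0.5, 0.25, 0.0]  # sigma=1 is pure noise, sigma=0 is data
recovered = euler_denoise_loop(noise, sigmas, lambda x, s: noise - x0)
```

In the real pipeline, velocity_fn is the transformer's prediction and the sigma schedule comes from the scheduler (optionally overridden via the sigmas input).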

Conditional Execution

This pipeline contains blocks that are selected at runtime based on inputs:

  • Trigger Inputs: control_image, control_image_latents, image, image_latents, mask, mask_image, processed_mask_image, prompt
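A hypothetical sketch of how these trigger inputs could select a denoise sub-block at runtime. The real selection logic lives inside QwenImageAutoBlocks; the names below simply mirror the lists above:

```python
def select_denoise_block(**inputs):
    """Pick a denoise sub-block name from the provided trigger inputs."""
    has = lambda *names: any(inputs.get(n) is not None for n in names)
    controlnet = has("control_image", "control_image_latents")
    if has("mask", "mask_image", "processed_mask_image"):
        task = "inpaint"
    elif has("image", "image_latents"):
        task = "img2img"
    else:
        task = "text2image"
    return f"controlnet_{task}" if controlnet else task
```

For example, providing both control_image and image (but no mask) would route to the controlnet_img2img branch.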

Model Components

  1. text_encoder (Qwen2_5_VLForConditionalGeneration): The text encoder to use
  2. tokenizer (Qwen2Tokenizer): The tokenizer to use
  3. guider (ClassifierFreeGuidance)
  4. image_mask_processor (InpaintProcessor)
  5. vae (AutoencoderKLQwenImage)
  6. image_processor (VaeImageProcessor)
  7. controlnet (QwenImageControlNetModel)
  8. control_image_processor (VaeImageProcessor)
  9. pachifier (QwenImagePachifier)
  10. scheduler (FlowMatchEulerDiscreteScheduler)
  11. transformer (QwenImageTransformer2DModel)
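The pachifier packs spatial VAE latents into the token sequence the transformer consumes. A NumPy sketch of a 2×2 pack/unpack round trip; the exact layout used by QwenImagePachifier may differ, so treat this as illustrative:

```python
import numpy as np

def pachify(latents, patch_size=2):
    """Pack (C, H, W) latents into a (H/p * W/p, C * p * p) token sequence."""
    c, h, w = latents.shape
    assert h % patch_size == 0 and w % patch_size == 0
    x = latents.reshape(c, h // patch_size, patch_size, w // patch_size, patch_size)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape((h // patch_size) * (w // patch_size), c * patch_size * patch_size)

def unpachify(tokens, c, h, w, patch_size=2):
    """Inverse of pachify: restore the (C, H, W) latent grid from tokens."""
    x = tokens.reshape(h // patch_size, w // patch_size, c, patch_size, patch_size)
    x = x.transpose(2, 0, 3, 1, 4)  # (C, H/p, p, W/p, p)
    return x.reshape(c, h, w)
```

Packing 2×2 latent patches quarters the sequence length the transformer attends over, at the cost of a wider per-token channel dimension.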

Input/Output Specification

Inputs Required:

  • prompt_embeds (Tensor): Text embeddings used to guide the image generation. Can be generated from the text_encoder step.
  • prompt_embeds_mask (Tensor): Mask for the text embeddings. Can be generated from the text_encoder step.
  • latents (Tensor): Pre-generated noisy latents for image generation.
  • num_inference_steps (int): The number of denoising steps.

Optional:

  • prompt (str): The prompt or prompts to guide image generation.
  • negative_prompt (str): The prompt or prompts not to guide the image generation.
  • max_sequence_length (int), default: 1024: Maximum sequence length for prompt encoding.
  • mask_image (Image): Mask image for inpainting.
  • image (Union): Reference image(s) for denoising. Can be a single image or list of images.
  • height (int): The height in pixels of the generated image.
  • width (int): The width in pixels of the generated image.
  • padding_mask_crop (int): Padding for mask cropping in inpainting.
  • generator (Generator): Torch generator for deterministic generation.
  • control_image (Image): Control image for ControlNet conditioning.
  • num_images_per_prompt (int), default: 1: The number of images to generate per prompt.
  • negative_prompt_embeds (Tensor): Negative text embeddings used to guide the image generation. Can be generated from the text_encoder step.
  • negative_prompt_embeds_mask (Tensor): Mask for the negative text embeddings. Can be generated from the text_encoder step.
  • sigmas (List): Custom sigmas for the denoising process.
  • attention_kwargs (Dict): Additional kwargs for attention processors.
  • Additional conditional model inputs for the denoiser (Any), e.g. prompt_embeds, negative_prompt_embeds, etc.
  • image_latents (Tensor): Image latents used to guide the image generation. Can be generated from the vae_encoder step.
  • processed_mask_image (Tensor): The processed mask image.
  • strength (float), default: 0.9: Strength for img2img/inpainting.
  • control_image_latents (Tensor): The control image latents to use for the denoising process. Can be generated in controlnet vae encoder step.
  • control_guidance_start (float), default: 0.0: When to start applying ControlNet, as a fraction of the total steps.
  • control_guidance_end (float), default: 1.0: When to stop applying ControlNet, as a fraction of the total steps.
  • controlnet_conditioning_scale (float), default: 1.0: Scale for ControlNet conditioning.
  • output_type (str), default: pil: Output format: 'pil', 'np', 'pt'.
  • mask_overlay_kwargs (Dict): The kwargs for the postprocess step that applies the mask overlay. Generated in InpaintProcessImagesInputStep.
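Illustrative arithmetic for how strength and the control_guidance_start/end window are commonly interpreted in Diffusers-style pipelines. This is an assumption for explanation, not this pipeline's exact code:

```python
def img2img_start(num_inference_steps, strength):
    """Steps actually run for img2img/inpainting, plus the starting step index.

    A strength of 1.0 denoises from pure noise; lower values skip early steps
    and keep more of the input image.
    """
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    return num_inference_steps - steps_to_run, steps_to_run

def controlnet_active(step_index, num_steps, start=0.0, end=1.0):
    """Whether ControlNet conditioning applies at a given denoising step."""
    return not (step_index / num_steps < start or (step_index + 1) / num_steps > end)
```

Under this reading, the default strength of 0.9 with 50 steps runs 45 denoising steps starting at index 5, and the default (0.0, 1.0) window keeps ControlNet active for every step.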

Outputs:

  • images (List): The generated images.
