This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.
**Pipeline type:** `QwenImageAutoBlocks`

**Description:** Auto modular pipeline for text-to-image, image-to-image, inpainting, and ControlNet tasks using QwenImage.
- For text-to-image generation, all you need to provide is `prompt`.
- For image-to-image generation, you need to provide `image`.
- For inpainting, you need to provide `mask_image` and `image`; optionally, you can provide `padding_mask_crop`.
- To run the ControlNet workflow, you need to provide `control_image`.
This pipeline uses a 5-block architecture that can be customized and extended.
## Example Usage
[TODO]
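Pending an official example, here is a hypothetical usage sketch. It assumes the standard Modular Diffusers entry points (`init_pipeline`, `load_default_components`) apply to these blocks; the repo id below is a placeholder, not a real checkpoint name.

```python
# Hypothetical usage sketch: assumes the standard Modular Diffusers entry
# points apply to QwenImageAutoBlocks; "<modular-repo-id>" is a placeholder.
import torch
from diffusers.modular_pipelines import QwenImageAutoBlocks

blocks = QwenImageAutoBlocks()
pipe = blocks.init_pipeline("<modular-repo-id>")  # placeholder repo id
pipe.load_default_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Text-to-image: only `prompt` is required. Pass `image`, `mask_image`, or
# `control_image` to trigger the img2img, inpainting, or ControlNet workflows.
image = pipe(
    prompt="A cat sitting on a windowsill",
    num_inference_steps=50,
    output="images",
)[0]
image.save("qwenimage_t2i.png")
```

Requires the model weights and a CUDA device, so treat it as a starting point rather than a verified recipe.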
## Pipeline Architecture
This modular pipeline is composed of the following blocks:
- **text_encoder** (`QwenImageAutoTextEncoderStep`): Text encoder step that encodes the text prompt into text embeddings. This is an auto pipeline block.
  - **text_encoder** (`QwenImageTextEncoderStep`): Text encoder step that generates text embeddings to guide the image generation.
- **vae_encoder** (`QwenImageAutoVaeEncoderStep`): VAE encoder step that encodes the image inputs into their latent representations.
  - **inpaint** (`QwenImageInpaintVaeEncoderStep`): Processes the image and mask inputs for inpainting tasks.
  - **img2img** (`QwenImageImg2ImgVaeEncoderStep`): VAE encoder step that preprocesses and encodes the image inputs into their latent representations.
- **controlnet_vae_encoder** (`QwenImageOptionalControlNetVaeEncoderStep`): VAE encoder step that encodes the control image inputs into their latent representations.
  - **controlnet** (`QwenImageControlNetVaeEncoderStep`): VAE encoder step that converts `control_image` into the latent representation `control_image_latents`.
- **denoise** (`QwenImageAutoCoreDenoiseStep`): Core step that performs the denoising process.
  - **text2image** (`QwenImageCoreDenoiseStep`): Denoises noise into an image for the text-to-image task. Includes the denoise loop as well as input preparation (timesteps, latents, RoPE inputs, etc.).
  - **inpaint** (`QwenImageInpaintCoreDenoiseStep`): Prepares the inputs (timesteps, latents, RoPE inputs, etc.) for the denoise step for the inpainting task.
  - **img2img** (`QwenImageImg2ImgCoreDenoiseStep`): Prepares the inputs (timesteps, latents, RoPE inputs, etc.) for the denoise step for the img2img task.
  - **controlnet_text2image** (`QwenImageControlNetCoreDenoiseStep`): Denoises noise into an image for the ControlNet text-to-image task. Includes the denoise loop as well as input preparation (timesteps, latents, RoPE inputs, etc.).
  - **controlnet_inpaint** (`QwenImageControlNetInpaintCoreDenoiseStep`): Prepares the inputs for the denoise step for the ControlNet inpainting task.
  - **controlnet_img2img** (`QwenImageControlNetImg2ImgCoreDenoiseStep`): Prepares the inputs for the denoise step for the ControlNet img2img task.
- **decode** (`QwenImageAutoDecodeStep`): Decode step that decodes the latents into images.
  - **inpaint_decode** (`QwenImageInpaintDecodeStep`): Decodes the latents to images and postprocesses the generated image, optionally applying the mask overlay to the original image.
  - **decode** (`QwenImageDecodeStep`): Decodes the latents to images and postprocesses the generated image.
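The five top-level blocks run in sequence, each reading inputs from and writing outputs to a shared state. A toy sketch of that pattern (illustrative only, not Diffusers code; the string "latents" here stand in for real tensors):

```python
# Toy sketch of the sequential-blocks pattern: each block reads from and
# writes to a shared state dict. Strings stand in for tensors.
def text_encoder(state):
    state["prompt_embeds"] = f"embeds({state['prompt']})"

def vae_encoder(state):
    if "image" in state:
        state["image_latents"] = f"latents({state['image']})"

def controlnet_vae_encoder(state):
    if "control_image" in state:
        state["control_image_latents"] = f"latents({state['control_image']})"

def denoise(state):
    cond = state["prompt_embeds"]
    init = state.get("image_latents", "noise")  # img2img starts from image latents
    state["latents"] = f"denoised({init}|{cond})"

def decode(state):
    state["images"] = [f"img({state['latents']})"]

# The five top-level blocks, in pipeline order.
PIPELINE = [text_encoder, vae_encoder, controlnet_vae_encoder, denoise, decode]

def run(state):
    for block in PIPELINE:
        block(state)
    return state["images"]

print(run({"prompt": "a cat"}))
```

Because blocks only communicate through the shared state, any block can be swapped or extended without touching the others, which is the point of the modular design.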
## Conditional Execution
This pipeline contains blocks that are selected at runtime based on inputs:
- **Trigger inputs:** `control_image`, `control_image_latents`, `image`, `image_latents`, `mask`, `mask_image`, `processed_mask_image`, `prompt`
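An auto block picks one of its sub-blocks at runtime by checking which trigger inputs are present, falling back to a default when none match. An illustrative sketch of how the denoise block's selection can be thought of (not the actual Diffusers dispatch logic):

```python
# Illustrative sketch (not Diffusers code): choose a denoise sub-block from
# the trigger inputs present in the call.
def select_denoise_block(state: dict) -> str:
    controlnet = "control_image" in state or "control_image_latents" in state
    if "mask" in state or "mask_image" in state or "processed_mask_image" in state:
        task = "inpaint"
    elif "image" in state or "image_latents" in state:
        task = "img2img"
    else:
        task = "text2image"  # default: only `prompt` provided
    return f"controlnet_{task}" if controlnet else task

print(select_denoise_block({"prompt": "a cat"}))
print(select_denoise_block({"prompt": "a cat", "image": "...", "mask_image": "..."}))
```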
## Model Components
- **text_encoder** (`Qwen2_5_VLForConditionalGeneration`): The text encoder to use.
- **tokenizer** (`Qwen2Tokenizer`): The tokenizer to use.
- **guider** (`ClassifierFreeGuidance`)
- **image_mask_processor** (`InpaintProcessor`)
- **vae** (`AutoencoderKLQwenImage`)
- **image_processor** (`VaeImageProcessor`)
- **controlnet** (`QwenImageControlNetModel`)
- **control_image_processor** (`VaeImageProcessor`)
- **pachifier** (`QwenImagePachifier`)
- **scheduler** (`FlowMatchEulerDiscreteScheduler`)
- **transformer** (`QwenImageTransformer2DModel`)
## Input/Output Specification
**Required inputs:**
- `prompt_embeds` (Tensor): Text embeddings used to guide the image generation. Can be generated by the text_encoder step.
- `prompt_embeds_mask` (Tensor): Mask for the text embeddings. Can be generated by the text_encoder step.
- `latents` (Tensor): Pre-generated noisy latents for image generation.
- `num_inference_steps` (int): The number of denoising steps.

**Optional inputs:**
- `prompt` (str): The prompt or prompts to guide image generation.
- `negative_prompt` (str): The prompt or prompts not to guide the image generation.
- `max_sequence_length` (int, default `1024`): Maximum sequence length for prompt encoding.
- `mask_image` (Image): Mask image for inpainting.
- `image` (Union): Reference image(s) for denoising. Can be a single image or a list of images.
- `height` (int): The height in pixels of the generated image.
- `width` (int): The width in pixels of the generated image.
- `padding_mask_crop` (int): Padding for mask cropping in inpainting.
- `generator` (Generator): Torch generator for deterministic generation.
- `control_image` (Image): Control image for ControlNet conditioning.
- `num_images_per_prompt` (int, default `1`): The number of images to generate per prompt.
- `negative_prompt_embeds` (Tensor): Negative text embeddings used to guide the image generation. Can be generated by the text_encoder step.
- `negative_prompt_embeds_mask` (Tensor): Mask for the negative text embeddings. Can be generated by the text_encoder step.
- `sigmas` (List): Custom sigmas for the denoising process.
- `attention_kwargs` (Dict): Additional kwargs for attention processors.
- Additional conditional model inputs for the denoiser, e.g. `prompt_embeds`, `negative_prompt_embeds`, etc.
- `image_latents` (Tensor): Image latents used to guide the image generation. Can be generated by the vae_encoder step.
- `processed_mask_image` (Tensor): The processed mask image.
- `strength` (float, default `0.9`): Strength for img2img/inpainting.
- `control_image_latents` (Tensor): The control image latents to use for the denoising process. Can be generated in the controlnet_vae_encoder step.
- `control_guidance_start` (float, default `0.0`): When to start applying ControlNet.
- `control_guidance_end` (float, default `1.0`): When to stop applying ControlNet.
- `controlnet_conditioning_scale` (float, default `1.0`): Scale for ControlNet conditioning.
- `output_type` (str, default `pil`): Output format: `pil`, `np`, or `pt`.
- `mask_overlay_kwargs` (Dict): The kwargs for the postprocess step to apply the mask overlay; generated in `InpaintProcessImagesInputStep`.
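To make the defaults above concrete, here is an illustrative sketch (not Diffusers code) of resolving a call's inputs against the required list and the documented defaults:

```python
# Illustrative sketch (not Diffusers code): fill in the documented defaults
# and reject calls missing the required inputs.
DEFAULTS = {
    "max_sequence_length": 1024,
    "num_images_per_prompt": 1,
    "strength": 0.9,
    "control_guidance_start": 0.0,
    "control_guidance_end": 1.0,
    "controlnet_conditioning_scale": 1.0,
    "output_type": "pil",
}

# Listed as required by the spec; in practice `prompt_embeds` and its mask
# can be produced by the text_encoder step from `prompt`.
REQUIRED = ["prompt_embeds", "prompt_embeds_mask", "latents", "num_inference_steps"]

def resolve_inputs(user_inputs: dict) -> dict:
    missing = [k for k in REQUIRED if k not in user_inputs]
    if missing:
        raise ValueError(f"Missing required inputs: {missing}")
    # User-supplied values override the defaults.
    return {**DEFAULTS, **user_inputs}
```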
**Outputs:**
- `images` (List): Generated images.