This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

Pipeline Type: StableDiffusionXLAutoBlocks

Description: Auto Modular pipeline for text-to-image, image-to-image, inpainting, and controlnet tasks using Stable Diffusion XL.

This pipeline uses a 5-block architecture that can be customized and extended.

Example Usage

[TODO]
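Until an official example is added, the snippet below is a minimal sketch of how an auto-blocks pipeline is typically assembled and run with the Modular Diffusers API. The method names (init_pipeline, load_default_components) and the checkpoint id are assumptions based on the Modular Diffusers docs and may differ across diffusers versions.

```python
import torch
from diffusers.modular_pipelines import StableDiffusionXLAutoBlocks

# Assemble the auto blocks into a runnable pipeline (API names assumed;
# check the Modular Diffusers docs for your diffusers version).
blocks = StableDiffusionXLAutoBlocks()
pipeline = blocks.init_pipeline("stabilityai/stable-diffusion-xl-base-1.0")
pipeline.load_default_components(torch_dtype=torch.float16)
pipeline.to("cuda")

# Text-to-image: with only a prompt, the auto blocks skip the
# img2img / inpainting / ControlNet branches.
image = pipeline(
    prompt="an astronaut riding a horse on the moon",
    num_inference_steps=25,
    output="images",
)[0]
image.save("astronaut.png")
```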

Pipeline Architecture

This modular pipeline is composed of the following blocks:

  1. text_encoder (StableDiffusionXLTextEncoderStep)
    • Text encoder step that generates the text_embeddings used to guide the image generation.
  2. ip_adapter (StableDiffusionXLAutoIPAdapterStep)
    • Runs the IP Adapter step when ip_adapter_image is provided. This step should be placed before the 'input' step.
    • ip_adapter: StableDiffusionXLIPAdapterStep
      • IP Adapter step that prepares the IP Adapter image embeddings.
  3. vae_encoder (StableDiffusionXLAutoVaeEncoderStep)
    • VAE encoder step that encodes the image inputs into their latent representations.
    • inpaint: StableDiffusionXLInpaintVaeEncoderStep
      • VAE encoder step that prepares the image and mask for the inpainting process.
    • img2img: StableDiffusionXLVaeEncoderStep
      • VAE encoder step that encodes the input image into a latent representation.
  4. denoise (StableDiffusionXLCoreDenoiseStep)
    • Core step that performs the denoising process.
    • input: StableDiffusionXLInputStep
      • Input processing step that prepares the pipeline inputs for the denoising loop.
    • before_denoise: StableDiffusionXLAutoBeforeDenoiseStep
      • Before-denoise step that prepares the inputs for the denoise step.
    • controlnet_input: StableDiffusionXLAutoControlNetInputStep
      • ControlNet input step that prepares the ControlNet inputs.
    • denoise: StableDiffusionXLAutoDenoiseStep
      • Denoise step that iteratively denoises the latents. This is an auto pipeline block that works for text2img, img2img, and inpainting tasks, and can be used with or without ControlNet:
        • StableDiffusionXLAutoControlNetDenoiseStep (controlnet_denoise) is used when controlnet_cond is provided (supports ControlNet with text2img, img2img, and inpainting tasks).
        • StableDiffusionXLInpaintDenoiseStep (inpaint_denoise) is used when mask is provided (supports inpainting tasks).
        • StableDiffusionXLDenoiseStep (denoise) is used when neither mask nor controlnet_cond is provided (supports text2img and img2img tasks).
  5. decode (StableDiffusionXLAutoDecodeStep)
    • Decode step that decodes the denoised latents into image outputs.
    • inpaint: StableDiffusionXLInpaintDecodeStep
      • Inpaint decode step that decodes the denoised latents into image outputs.
    • non-inpaint: StableDiffusionXLDecodeStep
      • Step that decodes the denoised latents into images.
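The auto blocks above dispatch on which inputs are present. The selection rule documented for the denoise step can be sketched in plain Python (an illustration of the rule, not the Diffusers implementation):

```python
def select_denoise_block(mask=None, controlnet_cond=None):
    """Mirror the documented dispatch rule of StableDiffusionXLAutoDenoiseStep.

    The returned names are the sub-block names listed above.
    """
    if controlnet_cond is not None:
        # ControlNet path supports text2img, img2img, and inpainting.
        return "controlnet_denoise"
    if mask is not None:
        # Inpainting without ControlNet.
        return "inpaint_denoise"
    # Plain text2img / img2img.
    return "denoise"
```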

Model Components

  1. text_encoder (CLIPTextModel)
  2. text_encoder_2 (CLIPTextModelWithProjection)
  3. tokenizer (CLIPTokenizer)
  4. tokenizer_2 (CLIPTokenizer)
  5. guider (ClassifierFreeGuidance)
  6. image_encoder (CLIPVisionModelWithProjection)
  7. feature_extractor (CLIPImageProcessor)
  8. unet (UNet2DConditionModel)
  9. vae (AutoencoderKL)
  10. image_processor (VaeImageProcessor)
  11. mask_processor (VaeImageProcessor)
  12. scheduler (EulerDiscreteScheduler)
  13. controlnet (ControlNetUnionModel)
  14. control_image_processor (VaeImageProcessor)
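The guider component implements classifier-free guidance. The core update it applies at each denoising step can be sketched as follows (plain-Python illustration of the standard CFG formula, not the ClassifierFreeGuidance implementation):

```python
def cfg_combine(noise_pred_uncond, noise_pred_cond, guidance_scale):
    # Classifier-free guidance: move the unconditional prediction toward the
    # conditional one, scaled by guidance_scale.
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(noise_pred_uncond, noise_pred_cond)
    ]
```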

Configuration Parameters

  1. force_zeros_for_empty_prompt (default: True)
  2. requires_aesthetics_score (default: False)
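A toy sketch of what force_zeros_for_empty_prompt controls (simplified for illustration, not the Diffusers code; the encode callable and embed_dim are hypothetical stand-ins): when the negative prompt is empty and the flag is True, the negative embeddings are zeroed out rather than computed from the empty string.

```python
def make_negative_embeds(encode, negative_prompt,
                         force_zeros_for_empty_prompt=True, embed_dim=4):
    # `encode` stands in for the CLIP text encoders; embed_dim is illustrative.
    if force_zeros_for_empty_prompt and not negative_prompt:
        return [0.0] * embed_dim
    return encode(negative_prompt or "")
```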

Input/Output Specification

Required inputs:

  • latents (Any): No description provided

Optional inputs:

  • prompt (Any): No description provided
  • prompt_2 (Any): No description provided
  • negative_prompt (Any): No description provided
  • negative_prompt_2 (Any): No description provided
  • cross_attention_kwargs (Any): No description provided
  • clip_skip (Any): No description provided
  • ip_adapter_image (PIL.Image.Image | numpy.ndarray | torch.Tensor | list[PIL.Image.Image] | list[numpy.ndarray] | list[torch.Tensor]): The image(s) to be used as the IP Adapter input
  • height (Any): No description provided
  • width (Any): No description provided
  • image (Any): No description provided
  • mask_image (Any): No description provided
  • padding_mask_crop (Any): No description provided
  • dtype (dtype): The dtype of the model inputs
  • generator (Any): No description provided
  • preprocess_kwargs (dict | None): A kwargs dictionary that, if specified, is passed along to the image processor (diffusers.image_processor.VaeImageProcessor) as defined under self.image_processor
  • num_images_per_prompt (Any), default: 1: No description provided
  • ip_adapter_embeds (list): Pre-generated image embeddings for IP-Adapter. Can be generated from ip_adapter step.
  • negative_ip_adapter_embeds (list): Pre-generated negative image embeddings for IP-Adapter. Can be generated from ip_adapter step.
  • num_inference_steps (Any), default: 50: No description provided
  • timesteps (Any): No description provided
  • sigmas (Any): No description provided
  • denoising_end (Any): No description provided
  • strength (Any), default: 0.3: No description provided
  • denoising_start (Any): No description provided
  • image_latents (Tensor): The latents representing the reference image for image-to-image/inpainting generation. Can be generated in vae_encode step.
  • mask (Tensor): The mask for the inpainting generation. Can be generated in vae_encode step.
  • masked_image_latents (Tensor): The masked image latents for the inpainting generation (only for inpainting-specific unet). Can be generated in vae_encode step.
  • original_size (Any): No description provided
  • target_size (Any): No description provided
  • negative_original_size (Any): No description provided
  • negative_target_size (Any): No description provided
  • crops_coords_top_left (Any), default: (0, 0): No description provided
  • negative_crops_coords_top_left (Any), default: (0, 0): No description provided
  • aesthetic_score (Any), default: 6.0: No description provided
  • negative_aesthetic_score (Any), default: 2.0: No description provided
  • control_image (Any): No description provided
  • control_mode (Any): No description provided
  • control_guidance_start (Any), default: 0.0: No description provided
  • control_guidance_end (Any), default: 1.0: No description provided
  • controlnet_conditioning_scale (Any), default: 1.0: No description provided
  • guess_mode (Any), default: False: No description provided
  • crops_coords (tuple[int] | None): The crop coordinates to use for preprocess/postprocess the image and mask, for inpainting task only. Can be generated in vae_encode step.
  • controlnet_cond (Tensor): The control image to use for the denoising process. Can be generated in prepare_controlnet_inputs step.
  • conditioning_scale (float): The controlnet conditioning scale value to use for the denoising process. Can be generated in prepare_controlnet_inputs step.
  • controlnet_keep (list): The controlnet keep values to use for the denoising process. Can be generated in prepare_controlnet_inputs step.
  • denoiser_input_fields (Any): All conditional model inputs that need to be prepared with the guider. Should contain prompt_embeds/negative_prompt_embeds, add_time_ids/negative_add_time_ids, pooled_prompt_embeds/negative_pooled_prompt_embeds, and optionally ip_adapter_embeds/negative_ip_adapter_embeds. Please add kwargs_type=denoiser_input_fields to their parameter spec (OutputParam) when they are created and added to the pipeline state
  • eta (Any), default: 0.0: No description provided
  • output_type (Any), default: pil: No description provided

Outputs:

  • images (list): Generated images.
