This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

Pipeline Type: StableDiffusionXLAutoBlocks

Description: Auto Modular pipeline for text-to-image, image-to-image, inpainting, and controlnet tasks using Stable Diffusion XL.

This pipeline uses a 5-block architecture that can be customized and extended.

Example Usage

[TODO]
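Until an official example is added, the snippet below is a minimal sketch of how an auto-blocks pipeline is typically assembled and run with the Modular Diffusers API. The method names (init_pipeline, load_default_components) and the checkpoint id are assumptions based on the Modular Diffusers docs and may differ across diffusers versions.

```python
import torch
from diffusers.modular_pipelines import StableDiffusionXLAutoBlocks

# Assemble the auto blocks into a runnable pipeline (API names assumed;
# check the Modular Diffusers docs for your diffusers version).
blocks = StableDiffusionXLAutoBlocks()
pipeline = blocks.init_pipeline("stabilityai/stable-diffusion-xl-base-1.0")
pipeline.load_default_components(torch_dtype=torch.float16)
pipeline.to("cuda")

# Text-to-image: with only a prompt, the auto blocks skip the
# img2img / inpainting / ControlNet branches.
image = pipeline(
    prompt="an astronaut riding a horse on the moon",
    num_inference_steps=25,
    output="images",
)[0]
image.save("astronaut.png")
```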

Pipeline Architecture

This modular pipeline is composed of the following blocks:

  1. text_encoder (StableDiffusionXLTextEncoderStep)
    • Text encoder step that generates the text_embeddings used to guide the image generation.
  2. ip_adapter (StableDiffusionXLAutoIPAdapterStep)
    • Runs the IP Adapter step when ip_adapter_image is provided. This step should be placed before the 'input' step.
    • ip_adapter: StableDiffusionXLIPAdapterStep
      • IP Adapter step that prepares the IP Adapter image embeddings.
  3. vae_encoder (StableDiffusionXLAutoVaeEncoderStep)
    • VAE encoder step that encodes the image inputs into their latent representations.
    • inpaint: StableDiffusionXLInpaintVaeEncoderStep
      • VAE encoder step that prepares the image and mask for the inpainting process.
    • img2img: StableDiffusionXLVaeEncoderStep
      • VAE encoder step that encodes the input image into a latent representation.
  4. denoise (StableDiffusionXLCoreDenoiseStep)
    • Core step that performs the denoising process.
    • input: StableDiffusionXLInputStep
      • Input processing step that prepares the pipeline inputs for the denoising loop.
    • before_denoise: StableDiffusionXLAutoBeforeDenoiseStep
      • Before-denoise step that prepares the inputs for the denoise step.
    • controlnet_input: StableDiffusionXLAutoControlNetInputStep
      • ControlNet input step that prepares the ControlNet inputs.
    • denoise: StableDiffusionXLAutoDenoiseStep
      • Denoise step that iteratively denoises the latents. This is an auto pipeline block that works for text2img, img2img, and inpainting tasks, and can be used with or without ControlNet:
        • StableDiffusionXLAutoControlNetDenoiseStep (controlnet_denoise) is used when controlnet_cond is provided (supports ControlNet with text2img, img2img, and inpainting tasks).
        • StableDiffusionXLInpaintDenoiseStep (inpaint_denoise) is used when mask is provided (supports inpainting tasks).
        • StableDiffusionXLDenoiseStep (denoise) is used when neither mask nor controlnet_cond is provided (supports text2img and img2img tasks).
  5. decode (StableDiffusionXLAutoDecodeStep)
    • Decode step that decodes the denoised latents into image outputs.
    • inpaint: StableDiffusionXLInpaintDecodeStep
      • Inpaint decode step that decodes the denoised latents into image outputs.
    • non-inpaint: StableDiffusionXLDecodeStep
      • Step that decodes the denoised latents into images.
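The auto blocks above dispatch on which inputs are present. The selection rule documented for the denoise step can be sketched in plain Python (an illustration of the rule, not the Diffusers implementation):

```python
def select_denoise_block(mask=None, controlnet_cond=None):
    """Mirror the documented dispatch rule of StableDiffusionXLAutoDenoiseStep.

    The returned names are the sub-block names listed above.
    """
    if controlnet_cond is not None:
        # ControlNet path supports text2img, img2img, and inpainting.
        return "controlnet_denoise"
    if mask is not None:
        # Inpainting without ControlNet.
        return "inpaint_denoise"
    # Plain text2img / img2img.
    return "denoise"
```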

Model Components

  1. text_encoder (CLIPTextModel)
  2. text_encoder_2 (CLIPTextModelWithProjection)
  3. tokenizer (CLIPTokenizer)
  4. tokenizer_2 (CLIPTokenizer)
  5. guider (ClassifierFreeGuidance)
  6. image_encoder (CLIPVisionModelWithProjection)
  7. feature_extractor (CLIPImageProcessor)
  8. unet (UNet2DConditionModel)
  9. vae (AutoencoderKL)
  10. image_processor (VaeImageProcessor)
  11. mask_processor (VaeImageProcessor)
  12. scheduler (EulerDiscreteScheduler)
  13. controlnet (ControlNetUnionModel)
  14. control_image_processor (VaeImageProcessor)
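The guider component implements classifier-free guidance. The core update it applies at each denoising step can be sketched as follows (plain-Python illustration of the standard CFG formula, not the ClassifierFreeGuidance implementation):

```python
def cfg_combine(noise_pred_uncond, noise_pred_cond, guidance_scale):
    # Classifier-free guidance: move the unconditional prediction toward the
    # conditional one, scaled by guidance_scale.
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(noise_pred_uncond, noise_pred_cond)
    ]
```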

Configuration Parameters

  1. force_zeros_for_empty_prompt (default: True)
  2. requires_aesthetics_score (default: False)
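A toy sketch of what force_zeros_for_empty_prompt controls (simplified for illustration, not the Diffusers code; the encode callable and embed_dim are hypothetical stand-ins): when the negative prompt is empty and the flag is True, the negative embeddings are zeroed out rather than computed from the empty string.

```python
def make_negative_embeds(encode, negative_prompt,
                         force_zeros_for_empty_prompt=True, embed_dim=4):
    # `encode` stands in for the CLIP text encoders; embed_dim is illustrative.
    if force_zeros_for_empty_prompt and not negative_prompt:
        return [0.0] * embed_dim
    return encode(negative_prompt or "")
```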

Input/Output Specification

Required inputs:

  • latents (Any): No description provided

Optional inputs:

  • prompt (Any): No description provided
  • prompt_2 (Any): No description provided
  • negative_prompt (Any): No description provided
  • negative_prompt_2 (Any): No description provided
  • cross_attention_kwargs (Any): No description provided
  • clip_skip (Any): No description provided
  • ip_adapter_image (PIL.Image.Image | numpy.ndarray | torch.Tensor | list[PIL.Image.Image] | list[numpy.ndarray] | list[torch.Tensor]): The image(s) to be used as the IP Adapter input
  • height (Any): No description provided
  • width (Any): No description provided
  • image (Any): No description provided
  • mask_image (Any): No description provided
  • padding_mask_crop (Any): No description provided
  • dtype (dtype): The dtype of the model inputs
  • generator (Any): No description provided
  • preprocess_kwargs (dict | None): A kwargs dictionary that, if specified, is passed along to the image processor (diffusers.image_processor.VaeImageProcessor) as defined under self.image_processor
  • num_images_per_prompt (Any), default: 1: No description provided
  • ip_adapter_embeds (list): Pre-generated image embeddings for IP-Adapter. Can be generated from ip_adapter step.
  • negative_ip_adapter_embeds (list): Pre-generated negative image embeddings for IP-Adapter. Can be generated from ip_adapter step.
  • num_inference_steps (Any), default: 50: No description provided
  • timesteps (Any): No description provided
  • sigmas (Any): No description provided
  • denoising_end (Any): No description provided
  • strength (Any), default: 0.3: No description provided
  • denoising_start (Any): No description provided
  • image_latents (Tensor): The latents representing the reference image for image-to-image/inpainting generation. Can be generated in vae_encode step.
  • mask (Tensor): The mask for the inpainting generation. Can be generated in vae_encode step.
  • masked_image_latents (Tensor): The masked image latents for the inpainting generation (only for inpainting-specific unet). Can be generated in vae_encode step.
  • original_size (Any): No description provided
  • target_size (Any): No description provided
  • negative_original_size (Any): No description provided
  • negative_target_size (Any): No description provided
  • crops_coords_top_left (Any), default: (0, 0): No description provided
  • negative_crops_coords_top_left (Any), default: (0, 0): No description provided
  • aesthetic_score (Any), default: 6.0: No description provided
  • negative_aesthetic_score (Any), default: 2.0: No description provided
  • control_image (Any): No description provided
  • control_mode (Any): No description provided
  • control_guidance_start (Any), default: 0.0: No description provided
  • control_guidance_end (Any), default: 1.0: No description provided
  • controlnet_conditioning_scale (Any), default: 1.0: No description provided
  • guess_mode (Any), default: False: No description provided
  • crops_coords (tuple[int] | None): The crop coordinates to use for preprocess/postprocess the image and mask, for inpainting task only. Can be generated in vae_encode step.
  • controlnet_cond (Tensor): The control image to use for the denoising process. Can be generated in prepare_controlnet_inputs step.
  • conditioning_scale (float): The controlnet conditioning scale value to use for the denoising process. Can be generated in prepare_controlnet_inputs step.
  • controlnet_keep (list): The controlnet keep values to use for the denoising process. Can be generated in prepare_controlnet_inputs step.
  • denoiser_input_fields (Any): All conditional model inputs that need to be prepared with the guider. Should contain prompt_embeds/negative_prompt_embeds, add_time_ids/negative_add_time_ids, pooled_prompt_embeds/negative_pooled_prompt_embeds, and optionally ip_adapter_embeds/negative_ip_adapter_embeds. Please add kwargs_type=denoiser_input_fields to their parameter spec (OutputParam) when they are created and added to the pipeline state
  • eta (Any), default: 0.0: No description provided
  • output_type (Any), default: pil: No description provided

Outputs:

  • images (list): Generated images.
