This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

Pipeline Type: SequentialPipelineBlocks

Description:

This pipeline uses a 9-block architecture that can be customized and extended.

Example Usage

[TODO]
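Pending the TODO above, here is a minimal usage sketch based on Diffusers' modular pipeline loading API. The repository id, image paths, and prompt are placeholders, and the exact loading calls (`ModularPipeline.from_pretrained`, `load_default_components`) are assumed from the Diffusers modular pipelines documentation, not taken from this card:

```python
import torch
from diffusers import ModularPipeline
from diffusers.utils import load_image

# "<repo-id>" is a placeholder; replace with this model card's repository id.
pipeline = ModularPipeline.from_pretrained("<repo-id>", trust_remote_code=True)
pipeline.load_default_components(torch_dtype=torch.float16)
pipeline.to("cuda")

# `image` is the input the depth block extracts a depth map from;
# `control_image` is the ControlNet conditioning image (both required).
image = load_image("<path-or-url-to-image>")
control_image = load_image("<path-or-url-to-control-image>")

images = pipeline(
    prompt="a photo of an astronaut riding a horse",
    image=image,
    control_image=control_image,
    num_inference_steps=50,
    output="images",
)
images[0].save("result.png")
```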

Pipeline Architecture

This modular pipeline is composed of the following blocks:

  1. depth (DepthProcessorBlock)
  2. text_encoder (StableDiffusionXLTextEncoderStep)
    • Text Encoder step that generates text_embeddings to guide the image generation
  3. denoise.input (StableDiffusionXLInputStep)
    • Input processing step that determines batch_size and dtype from the text embeddings and expands the model inputs to match batch_size * num_images_per_prompt
  4. denoise.before_denoise.set_timesteps (StableDiffusionXLSetTimestepsStep)
    • Step that sets the scheduler's timesteps for inference
  5. denoise.before_denoise.prepare_latents (StableDiffusionXLPrepareLatentsStep)
    • Prepare latents step that prepares the latents for the text-to-image generation process
  6. denoise.before_denoise.prepare_add_cond (StableDiffusionXLPrepareAdditionalConditioningStep)
    • Step that prepares the additional conditioning for the text-to-image generation process
  7. denoise.controlnet_input (StableDiffusionXLControlNetInputStep)
    • Step that prepares the inputs for ControlNet
  8. denoise.denoise (StableDiffusionXLControlNetDenoiseStep)
    • Denoise step that iteratively denoises the latents with ControlNet
    • before_denoiser: StableDiffusionXLLoopBeforeDenoiser
      • Step within the denoising loop that prepares the latent input for the denoiser. This block should be used to compose the sub_blocks attribute of a LoopSequentialPipelineBlocks object (e.g. StableDiffusionXLDenoiseLoopWrapper)
    • denoiser: StableDiffusionXLControlNetLoopDenoiser
      • Step within the denoising loop that denoises the latents with guidance (with ControlNet). This block should be used to compose the sub_blocks attribute of a LoopSequentialPipelineBlocks object (e.g. StableDiffusionXLDenoiseLoopWrapper)
    • after_denoiser: StableDiffusionXLLoopAfterDenoiser
      • Step within the denoising loop that updates the latents. This block should be used to compose the sub_blocks attribute of a LoopSequentialPipelineBlocks object (e.g. StableDiffusionXLDenoiseLoopWrapper)
  9. decode (StableDiffusionXLDecodeStep)
    • Step that decodes the denoised latents into images
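The core idea of SequentialPipelineBlocks is that each block reads from and writes to a shared pipeline state, and blocks run in order. A toy illustration of that pattern (pure Python, not the actual Diffusers classes; the block names are stand-ins):

```python
# Toy sketch of sequential block composition: each block mutates a shared
# state dict, and the pipeline runs the blocks in order.
class TextEncoderBlock:
    def __call__(self, state):
        # Stand-in for StableDiffusionXLTextEncoderStep: turn the prompt
        # into "embeddings" (here, just token lengths).
        state["prompt_embeds"] = [len(tok) for tok in state["prompt"].split()]
        return state

class DenoiseBlock:
    def __call__(self, state):
        # Stand-in for the denoising loop: refine the "latents" once per step.
        latents = state["latents"]
        for _ in range(state["num_inference_steps"]):
            latents = [x * 0.5 for x in latents]
        state["latents"] = latents
        return state

class SequentialBlocks:
    def __init__(self, blocks):
        self.blocks = blocks

    def __call__(self, **inputs):
        state = dict(inputs)
        for block in self.blocks:
            state = block(state)
        return state

pipeline = SequentialBlocks([TextEncoderBlock(), DenoiseBlock()])
out = pipeline(prompt="a photo", latents=[8.0], num_inference_steps=3)
# out carries everything every block produced, mirroring how intermediate
# values (prompt_embeds, latents, ...) flow between the real blocks.
```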

Model Components

  1. depth_processor (DepthPreprocessor) [pretrained_model_name_or_path=depth-anything/Depth-Anything-V2-Large-hf]
  2. text_encoder (CLIPTextModel)
  3. text_encoder_2 (CLIPTextModelWithProjection)
  4. tokenizer (CLIPTokenizer)
  5. tokenizer_2 (CLIPTokenizer)
  6. guider (ClassifierFreeGuidance)
  7. scheduler (EulerDiscreteScheduler)
  8. vae (AutoencoderKL)
  9. unet (UNet2DConditionModel)
  10. controlnet (ControlNetModel)
  11. control_image_processor (VaeImageProcessor)
  12. image_processor (VaeImageProcessor)

Configuration Parameters

force_zeros_for_empty_prompt (default: True)
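In SDXL-style pipelines, `force_zeros_for_empty_prompt` controls whether an empty (negative) prompt is replaced by all-zero embeddings instead of the embeddings of the empty string. A small illustrative sketch (the helper name is hypothetical):

```python
# Hypothetical helper illustrating force_zeros_for_empty_prompt: if the
# prompt is empty and the flag is set, return zeros of the same length
# instead of the encoder's embedding of "".
def maybe_zero_embeds(prompt, embeds, force_zeros_for_empty_prompt=True):
    if force_zeros_for_empty_prompt and prompt.strip() == "":
        return [0.0] * len(embeds)
    return embeds

neg_embeds = maybe_zero_embeds("", [0.3, -0.7, 0.1])
```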

Input/Output Specification

Inputs Required:

  • image (Any): Image(s) to use to extract depth maps
  • control_image (Any): No description provided

Optional:

  • prompt (Any): No description provided
  • prompt_2 (Any): No description provided
  • negative_prompt (Any): No description provided
  • negative_prompt_2 (Any): No description provided
  • cross_attention_kwargs (Any): No description provided
  • clip_skip (Any): No description provided
  • num_images_per_prompt (Any), default: 1: No description provided
  • ip_adapter_embeds (list): Pre-generated image embeddings for IP-Adapter. Can be generated from ip_adapter step.
  • negative_ip_adapter_embeds (list): Pre-generated negative image embeddings for IP-Adapter. Can be generated from ip_adapter step.
  • num_inference_steps (Any), default: 50: No description provided
  • timesteps (Any): No description provided
  • sigmas (Any): No description provided
  • denoising_end (Any): No description provided
  • height (Any): No description provided
  • width (Any): No description provided
  • latents (Any): No description provided
  • generator (Any): No description provided
  • original_size (Any): No description provided
  • target_size (Any): No description provided
  • negative_original_size (Any): No description provided
  • negative_target_size (Any): No description provided
  • crops_coords_top_left (Any), default: (0, 0): No description provided
  • negative_crops_coords_top_left (Any), default: (0, 0): No description provided
  • control_guidance_start (Any), default: 0.0: No description provided
  • control_guidance_end (Any), default: 1.0: No description provided
  • controlnet_conditioning_scale (Any), default: 1.0: No description provided
  • guess_mode (Any), default: False: No description provided
  • crops_coords (tuple[int] | None): The crop coordinates to use when preprocessing/postprocessing the image and mask (inpainting task only). Can be generated in the vae_encode step.
  • None (Any): All conditional model inputs that need to be prepared with the guider. This should contain prompt_embeds/negative_prompt_embeds, add_time_ids/negative_add_time_ids, pooled_prompt_embeds/negative_pooled_prompt_embeds, and optionally ip_adapter_embeds/negative_ip_adapter_embeds. Add kwargs_type=denoiser_input_fields to their parameter spec (OutputParam) when they are created and added to the pipeline state.
  • eta (Any), default: 0.0: No description provided
  • output_type (Any), default: pil: No description provided

Outputs:

  • prompt_embeds (Tensor): text embeddings used to guide the image generation

  • negative_prompt_embeds (Tensor): negative text embeddings used to guide the image generation
  • pooled_prompt_embeds (Tensor): pooled text embeddings used to guide the image generation
  • negative_pooled_prompt_embeds (Tensor): negative pooled text embeddings used to guide the image generation
  • batch_size (int): Number of prompts, the final batch size of model inputs should be batch_size * num_images_per_prompt
  • dtype (dtype): Data type of model tensor inputs (determined by prompt_embeds)
  • ip_adapter_embeds (list): image embeddings for IP-Adapter
  • negative_ip_adapter_embeds (list): negative image embeddings for IP-Adapter
  • timesteps (Tensor): The timesteps to use for inference
  • num_inference_steps (int): The number of denoising steps to perform at inference time
  • latents (Tensor): The initial latents to use for the denoising process
  • add_time_ids (Tensor): The time ids to condition the denoising process
  • negative_add_time_ids (Tensor): The negative time ids to condition the denoising process
  • timestep_cond (Tensor): The timestep cond to use for LCM
  • controlnet_cond (Tensor): The processed control image
  • control_guidance_start (list): The controlnet guidance start values
  • control_guidance_end (list): The controlnet guidance end values
  • conditioning_scale (list): The controlnet conditioning scale values
  • guess_mode (bool): Whether guess mode is used
  • controlnet_keep (list): The controlnet keep values
  • images (list[PIL.Image.Image] | list[torch.Tensor] | list[numpy.ndarray]): The generated images, as PIL images, torch tensors, or NumPy arrays
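The controlnet_keep output determines, per denoising step, whether ControlNet conditioning is applied, based on control_guidance_start/control_guidance_end (expressed as fractions of the schedule). An illustrative computation mirroring the logic used in Diffusers' SDXL ControlNet pipelines (the function name is hypothetical):

```python
# For each denoising step i, keep ControlNet active only while the step
# falls inside [start, end] as fractions of the total schedule.
def compute_controlnet_keep(num_inference_steps, starts, ends):
    keeps = []
    for i in range(num_inference_steps):
        keeps.append([
            1.0 - float(i / num_inference_steps < s or (i + 1) / num_inference_steps > e)
            for s, e in zip(starts, ends)
        ])
    return keeps

# With start=0.0 and end=0.5, ControlNet applies to the first half of the steps.
keep = compute_controlnet_keep(4, starts=[0.0], ends=[0.5])
```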