	# Stable Cascade

	This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture. Its main
	difference from other models such as Stable Diffusion is that it works in a much smaller latent space. Why is this
	important? The smaller the latent space, the faster inference runs and the cheaper training becomes.
	How small is the latent space? Stable Diffusion uses a compression factor of 8, so a 1024x1024 image is
	encoded to 128x128. Stable Cascade achieves a compression factor of 42, so a 1024x1024 image can be encoded
	to 24x24 while maintaining crisp reconstructions. The text-conditional model is then trained in this
	highly compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable
	Diffusion 1.5.
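
The arithmetic behind these numbers can be checked with a minimal sketch. The helper `latent_side` is a hypothetical name introduced here for illustration, and rounding to the nearest integer is an assumption. The sketch only counts spatial latent positions and says nothing about compute cost.

```python
def latent_side(image_side: int, compression_factor: float) -> int:
    """Side length of the latent grid for a square image (nearest integer)."""
    return round(image_side / compression_factor)


sd_side = latent_side(1024, 8)        # Stable Diffusion: compression factor 8
cascade_side = latent_side(1024, 42)  # Stable Cascade: compression factor 42

# Ratio of spatial latent positions between the two grids.
position_ratio = (sd_side * sd_side) / (cascade_side * cascade_side)

print(sd_side, cascade_side, round(position_ratio, 1))  # prints: 128 24 28.4
```

Under these assumptions, the Stable Cascade latent grid has roughly 28 times fewer spatial positions than the Stable Diffusion grid for the same 1024x1024 image.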

	This kind of model is therefore well suited to use cases where efficiency is important. Furthermore, all known extensions
	such as finetuning, LoRA, ControlNet, IP-Adapter, and LCM are possible with this method as well.

	The original codebase can be found at [Stability-AI/StableCascade](https://github.com/Stability-AI/StableCascade).

	## Model Overview
	Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade to generate images,
	hence the name "Stable Cascade".

	Stages A and B are used to compress images, similar to the role of the VAE in Stable Diffusion.
	However, this setup achieves a much higher compression of images. While the Stable Diffusion models use a
	spatial compression factor of 8, encoding a 1024 x 1024 image to 128 x 128, Stable Cascade achieves
	a compression factor of 42. This encodes a 1024 x 1024 image to 24 x 24 while still allowing the image to be
	decoded accurately, which makes training and inference much cheaper. Stage C is responsible
	for generating the small 24 x 24 latents given a text prompt.

	The Stage C model operates on the small 24 x 24 latents and denoises them conditioned on text prompts. It is also the largest component in the Stable Cascade pipeline and is meant to be used with the `StableCascadePriorPipeline`.

	The Stage B and Stage A models are used with the `StableCascadeDecoderPipeline` and are responsible for generating the final image given the small 24 x 24 latents.

	> [!WARNING]
	> There are some restrictions on data types that can be used with the Stable Cascade models. The official checkpoints for the `StableCascadePriorPipeline` do not support the `torch.float16` data type. Please use `torch.bfloat16` instead.
	>
	> In order to use the `torch.bfloat16` data type with the `StableCascadeDecoderPipeline` you need to have PyTorch 2.2.0 or higher installed. This also means that using the `StableCascadeCombinedPipeline` with `torch.bfloat16` requires PyTorch 2.2.0 or higher, since it calls the `StableCascadeDecoderPipeline` internally.
	>
	> If it is not possible to install PyTorch 2.2.0 or higher in your environment, the `StableCascadeDecoderPipeline` can be used on its own with the `torch.float16` data type. You can download the full precision or `bf16` variant weights for the pipeline and cast the weights to `torch.float16`.
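
The data type rules in the warning above can be summarized in a small helper. This is a minimal sketch: `parse_major_minor` and `pick_dtype_names` are hypothetical names introduced here, and the helper returns dtype names as strings so that it runs without PyTorch installed.

```python
def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.1.2+cu121'."""
    core = version.split("+")[0]
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def pick_dtype_names(torch_version: str) -> dict[str, str]:
    """Choose dtype names for the prior and decoder pipelines.

    The prior checkpoints do not support float16, so bfloat16 is always used.
    The decoder uses bfloat16 only when PyTorch is at least 2.2, else float16.
    """
    decoder = "bfloat16" if parse_major_minor(torch_version) >= (2, 2) else "float16"
    return {"prior": "bfloat16", "decoder": decoder}


print(pick_dtype_names("2.1.2+cu121"))
print(pick_dtype_names("2.2.0"))
```

In real code, the returned names could be resolved with `getattr(torch, name)` and passed as `torch_dtype`, with `torch.__version__` as the input string.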

	## Usage example

	```python
	import torch
	from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

	prompt = "an image of a shiba inu, donning a spacesuit and helmet"
	negative_prompt = ""

	# The prior (Stage C) checkpoints do not support float16, so load them in bfloat16.
	prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", variant="bf16", torch_dtype=torch.bfloat16)
	# The decoder (Stages B and A) loads the bf16 weights and casts them to float16.
	decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.float16)

	# Stage C: generate the compressed image embeddings from the text prompt.
	prior.enable_model_cpu_offload()
	prior_output = prior(
	    prompt=prompt,
	    height=1024,
	    width=1024,
	    negative_prompt=negative_prompt,
	    guidance_scale=4.0,
	    num_images_per_prompt=1,
	    num_inference_steps=20,
	)

	# Stages B and A: decode the embeddings (cast to the decoder's dtype) into an image.
	decoder.enable_model_cpu_offload()
	decoder_output = decoder(
	    image_embeddings=prior_output.image_embeddings.to(torch.float16),
	    prompt=prompt,
	    negative_prompt=negative_prompt,
	    guidance_scale=0.0,
	    output_type="pil",
	    num_inference_steps=10,
	).images[0]
	decoder_output.save("cascade.png")
	```

	## Using the Lite Versions of the Stage B and Stage C models

	```python
	import torch
	from diffusers import (
	    StableCascadeDecoderPipeline,
	    StableCascadePriorPipeline,
	    StableCascadeUNet,
	)

	prompt = "an image of a shiba inu, donning a spacesuit and helmet"
	negative_prompt = ""

	prior_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade-prior", subfolder="prior_lite")
	decoder_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade", subfolder="decoder_lite")

	prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet)
	decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet)

	prior.enable_model_cpu_offload()
	prior_output = prior(
	    prompt=prompt,
	    height=1024,
	    width=1024,
	    negative_prompt=negative_prompt,
	    guidance_scale=4.0,
	    num_images_per_prompt=1,
	    num_inference_steps=20,
	)

	decoder.enable_model_cpu_offload()
	decoder_output = decoder(
	    image_embeddings=prior_output.image_embeddings,
	    prompt=prompt,
	    negative_prompt=negative_prompt,
	    guidance_scale=0.0,
	    output_type="pil",
	    num_inference_steps=10,
	).images[0]
	decoder_output.save("cascade.png")
	```

	## Loading original checkpoints with `from_single_file`

	Loading checkpoints in the original format is supported via the `from_single_file` method of `StableCascadeUNet`.

	```python
	import torch
	from diffusers import (
	    StableCascadeDecoderPipeline,
	    StableCascadePriorPipeline,
	    StableCascadeUNet,
	)

	prompt = "an image of a shiba inu, donning a spacesuit and helmet"
	negative_prompt = ""

	prior_unet = StableCascadeUNet.from_single_file(
	    "https://huggingface.co/stabilityai/stable-cascade/resolve/main/stage_c_bf16.safetensors",
	    torch_dtype=torch.bfloat16,
	)
	decoder_unet = StableCascadeUNet.from_single_file(
	    "https://huggingface.co/stabilityai/stable-cascade/blob/main/stage_b_bf16.safetensors",
	    torch_dtype=torch.bfloat16,
	)

	prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet, torch_dtype=torch.bfloat16)
	decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet, torch_dtype=torch.bfloat16)

	prior.enable_model_cpu_offload()
	prior_output = prior(
	    prompt=prompt,
	    height=1024,
	    width=1024,
	    negative_prompt=negative_prompt,
	    guidance_scale=4.0,
	    num_images_per_prompt=1,
	    num_inference_steps=20,
	)

	decoder.enable_model_cpu_offload()
	decoder_output = decoder(
	    image_embeddings=prior_output.image_embeddings,
	    prompt=prompt,
	    negative_prompt=negative_prompt,
	    guidance_scale=0.0,
	    output_type="pil",
	    num_inference_steps=10,
	).images[0]
	decoder_output.save("cascade-single-file.png")
	```

	## Uses

	### Direct Use

	The model is intended for research purposes for now. Possible research areas and tasks include:

	- Research on generative models.
	- Safe deployment of models which have the potential to generate harmful content.
	- Probing and understanding the limitations and biases of generative models.
	- Generation of artworks and use in design and other artistic processes.
	- Applications in educational or creative tools.

	Excluded uses are described below.

	### Out-of-Scope Use

	The model was not trained to produce factual or true representations of people or events,
	and therefore using the model to generate such content is out-of-scope for the abilities of this model.
	The model should not be used in any way that violates Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy).

	## Limitations and Bias

	### Limitations
	- Faces and people in general may not be generated properly.
	- The autoencoding part of the model is lossy.

	## StableCascadeCombinedPipeline[[diffusers.StableCascadeCombinedPipeline]]

	#### diffusers.StableCascadeCombinedPipeline[[diffusers.StableCascadeCombinedPipeline]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13331/src/diffusers/pipelines/stable_cascade/pipeline_stable_cascade_combined.py#L45)

	Combined Pipeline for text-to-image generation using Stable Cascade.

	This model inherits from [DiffusionPipeline](/docs/diffusers/pr_13331/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the
	library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

	#### __call__[[diffusers.StableCascadeCombinedPipeline.__call__]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13331/src/diffusers/pipelines/stable_cascade/pipeline_stable_cascade_combined.py#L158)

	`__call__(prompt: str | list[str] | None = None, images: torch.Tensor | PIL.Image.Image | list[torch.Tensor] | list[PIL.Image.Image] = None, height: int = 512, width: int = 512, prior_num_inference_steps: int = 60, prior_guidance_scale: float = 4.0, num_inference_steps: int = 12, decoder_guidance_scale: float = 0.0, negative_prompt: str | list[str] | None = None, prompt_embeds: torch.Tensor | None = None, prompt_embeds_pooled: torch.Tensor | None = None, negative_prompt_embeds: torch.Tensor | None = None, negative_prompt_embeds_pooled: torch.Tensor | None = None, num_images_per_prompt: int = 1, generator: torch._C.Generator | list[torch._C.Generator] | None = None, latents: torch.Tensor | None = None, output_type: str | None = 'pil', return_dict: bool = True, prior_callback_on_step_end: typing.Optional[typing.Callable[[int, int], NoneType]] = None, prior_callback_on_step_end_tensor_inputs: list = ['latents'], callback_on_step_end: typing.Optional[typing.Callable[[int, int], NoneType]] = None, callback_on_step_end_tensor_inputs: list = ['latents'])`

	- prompt (`str` or `list[str]`) --
	The prompt or prompts to guide the image generation for the prior and decoder.
	- images (`torch.Tensor`, `PIL.Image.Image`, `list[torch.Tensor]`, `list[PIL.Image.Image]`, optional) --
	The images to guide the image generation for the prior.
	- negative_prompt (`str` or `list[str]`, optional) --
	The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
	by a stage whose guidance scale, `prior_guidance_scale` or `decoder_guidance_scale`, is less than `1`).
	- prompt_embeds (`torch.Tensor`, optional) --
	Pre-generated text embeddings for the prior. Can be used to easily tweak text inputs, e.g. prompt
	weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
	- prompt_embeds_pooled (`torch.Tensor`, optional) --
	Pre-generated pooled text embeddings for the prior. Can be used to easily tweak text inputs, e.g. prompt
	weighting. If not provided, pooled text embeddings will be generated from the `prompt` input argument.
	- negative_prompt_embeds (`torch.Tensor`, optional) --
	Pre-generated negative text embeddings for the prior. Can be used to easily tweak text inputs, e.g.
	prompt weighting. If not provided, negative_prompt_embeds will be generated from the `negative_prompt`
	input argument.
	- negative_prompt_embeds_pooled (`torch.Tensor`, optional) --
	Pre-generated negative pooled text embeddings for the prior. Can be used to easily tweak text inputs, e.g.
	prompt weighting. If not provided, negative_prompt_embeds_pooled will be generated from the `negative_prompt`
	input argument.
	- num_images_per_prompt (`int`, optional, defaults to 1) --
	The number of images to generate per prompt.
	- height (`int`, optional, defaults to 512) --
	The height in pixels of the generated image.
	- width (`int`, optional, defaults to 512) --
	The width in pixels of the generated image.
	- prior_guidance_scale (`float`, optional, defaults to 4.0) --
	Guidance scale as defined in [Classifier-Free Diffusion
	Guidance](https://huggingface.co/papers/2207.12598). `prior_guidance_scale` is defined as `w` of
	equation 2 of the [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance is enabled by
	setting `prior_guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are
	closely linked to the text `prompt`, usually at the expense of lower image quality.
	- prior_num_inference_steps (`int \| dict[float, int]`, optional, defaults to 60) --
	The number of prior denoising steps. More denoising steps usually lead to a higher quality image at the
	expense of slower inference. For more specific timestep spacing, you can pass customized
	`prior_timesteps`.
	- num_inference_steps (`int`, optional, defaults to 12) --
	The number of decoder denoising steps. More denoising steps usually lead to a higher quality image at
	the expense of slower inference. For more specific timestep spacing, you can pass customized
	`timesteps`.
	- decoder_guidance_scale (`float`, optional, defaults to 0.0) --
	Guidance scale as defined in [Classifier-Free Diffusion
	Guidance](https://huggingface.co/papers/2207.12598). `decoder_guidance_scale` is defined as `w` of
	equation 2 of the [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance is enabled by setting
	`decoder_guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are
	closely linked to the text `prompt`, usually at the expense of lower image quality.
	- generator (`torch.Generator` or `list[torch.Generator]`, optional) --
	One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
	to make generation deterministic.
	- latents (`torch.Tensor`, optional) --
	Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
	generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
	tensor will be generated by sampling using the supplied random `generator`.
	- output_type (`str`, optional, defaults to `"pil"`) --
	The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
	(`np.array`) or `"pt"` (`torch.Tensor`).
	- return_dict (`bool`, optional, defaults to `True`) --
	Whether or not to return an [ImagePipelineOutput](/docs/diffusers/pr_13331/en/api/pipelines/stable_unclip#diffusers.ImagePipelineOutput) instead of a plain tuple.
	- prior_callback_on_step_end (`Callable`, optional) --
	A function that is called at the end of each denoising step of the prior during inference. The function is
	called with the following arguments: `prior_callback_on_step_end(self: DiffusionPipeline, step: int, timestep:
	int, callback_kwargs: Dict)`.
	- prior_callback_on_step_end_tensor_inputs (`list`, optional) --
	The list of tensor inputs for the `prior_callback_on_step_end` function. The tensors specified in the
	list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in
	the `._callback_tensor_inputs` attribute of your pipeline class.
	- callback_on_step_end (`Callable`, optional) --
	A function that is called at the end of each denoising step during inference. The function is called
	with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
	callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
	`callback_on_step_end_tensor_inputs`.
	- callback_on_step_end_tensor_inputs (`list`, optional) --
	The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
	will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
	`._callback_tensor_inputs` attribute of your pipeline class.

	Returns:

	[ImagePipelineOutput](/docs/diffusers/pr_13331/en/api/pipelines/stable_unclip#diffusers.ImagePipelineOutput) or `tuple`: An [ImagePipelineOutput](/docs/diffusers/pr_13331/en/api/pipelines/stable_unclip#diffusers.ImagePipelineOutput) if `return_dict` is True,
	otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.

	Function invoked when calling the pipeline for generation.

	Examples:
	```py
	>>> import torch
	>>> from diffusers import StableCascadeCombinedPipeline

	>>> pipe = StableCascadeCombinedPipeline.from_pretrained(
	...     "stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.bfloat16
	... )
	>>> pipe.enable_model_cpu_offload()
	>>> prompt = "an image of a shiba inu, donning a spacesuit and helmet"
	>>> images = pipe(prompt=prompt)
	```
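
The step-end callback described in the parameter list above can be sketched as follows. This is a minimal illustration, not the library's own example: `log_step` and `step_log` are hypothetical names, and the sketch assumes the pipeline expects the callback to return the `callback_kwargs` dict with any tensors it modified.

```python
from typing import Any

# Collected (step, tensor names) pairs, filled in by the callback.
step_log: list[tuple[int, list[str]]] = []


def log_step(pipe: Any, step: int, timestep: Any, callback_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Record which tensors were passed at each step and return them unchanged."""
    step_log.append((step, sorted(callback_kwargs)))
    return callback_kwargs


# Simulate a single call the way a pipeline would invoke the callback.
result = log_step(None, 0, 999, {"latents": [0.0, 1.0]})
print(step_log)
```

With a loaded pipeline, the function would be passed as `callback_on_step_end=log_step` (or `prior_callback_on_step_end=log_step` for the prior stage), together with `callback_on_step_end_tensor_inputs=["latents"]`.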

	Parameters:

	tokenizer (`CLIPTokenizer`) : The decoder tokenizer to be used for text inputs.

	text_encoder (`CLIPTextModelWithProjection`) : The decoder text encoder to be used for text inputs.

	decoder (`StableCascadeUNet`) : The decoder model to be used for decoder image generation pipeline.

	scheduler (`DDPMWuerstchenScheduler`) : The scheduler to be used for decoder image generation pipeline.

	vqgan (`PaellaVQModel`) : The VQGAN model to be used for decoder image generation pipeline.

	prior_prior (`StableCascadeUNet`) : The prior model to be used for prior pipeline.

	prior_text_encoder (`CLIPTextModelWithProjection`) : The prior text encoder to be used for text inputs.

	prior_tokenizer (`CLIPTokenizer`) : The prior tokenizer to be used for text inputs.

	prior_scheduler (`DDPMWuerstchenScheduler`) : The scheduler to be used for prior pipeline.

	prior_feature_extractor ([CLIPImageProcessor](https://huggingface.co/docs/transformers/main/en/model_doc/clip#transformers.CLIPImageProcessor)) : Model that extracts features from generated images to be used as inputs for the `image_encoder`.

	prior_image_encoder (`CLIPVisionModelWithProjection`) : Frozen CLIP image-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).

	#### enable_model_cpu_offload[[diffusers.StableCascadeCombinedPipeline.enable_model_cpu_offload]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13331/src/diffusers/pipelines/stable_cascade/pipeline_stable_cascade_combined.py#L130)

	Offloads all models to CPU using Accelerate, reducing memory usage with a low impact on performance. Compared
	to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward`
	method is called, and the model remains on the GPU until the next model runs. Memory savings are lower than with
	`enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`.

	#### enable_sequential_cpu_offload[[diffusers.StableCascadeCombinedPipeline.enable_sequential_cpu_offload]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13331/src/diffusers/pipelines/stable_cascade/pipeline_stable_cascade_combined.py#L140)

	Offloads all models (`unet`, `text_encoder`, `vae`, and `safety checker` state dicts) to CPU using 🤗
	Accelerate, significantly reducing memory usage. Models are moved to a `torch.device('meta')` and loaded on a
	GPU only when their specific submodule's `forward` method is called. Offloading happens on a submodule basis.
	Memory savings are higher than using `enable_model_cpu_offload`, but performance is lower.
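
The trade-off between the two offloading methods can be expressed as a simple selection rule. This is a sketch under stated assumptions: `pick_offload_method` is a hypothetical helper, and the 12 GiB threshold is an arbitrary illustrative value, not a measured requirement of Stable Cascade.

```python
def pick_offload_method(free_gpu_gib: float, model_offload_min_gib: float = 12.0) -> str:
    """Return the name of the offloading method to call on a pipeline.

    Whole-model offloading is faster but saves less memory, so it is chosen
    only when enough GPU memory is free. Otherwise fall back to sequential
    (submodule-level) offloading, which saves more memory but runs slower.
    The default threshold is a hypothetical placeholder.
    """
    if free_gpu_gib >= model_offload_min_gib:
        return "enable_model_cpu_offload"
    return "enable_sequential_cpu_offload"


print(pick_offload_method(24.0))
print(pick_offload_method(6.0))
```

With a loaded pipeline `pipe`, the chosen method could be applied as `getattr(pipe, pick_offload_method(free_gib))()`.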

	## StableCascadePriorPipeline[[diffusers.StableCascadePriorPipeline]]

	#### diffusers.StableCascadePriorPipeline[[diffusers.StableCascadePriorPipeline]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13331/src/diffusers/pipelines/stable_cascade/pipeline_stable_cascade_prior.py#L80)

	Pipeline for generating image prior for Stable Cascade.

	This model inherits from [DiffusionPipeline](/docs/diffusers/pr_13331/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the
	library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

	#### __call__[[diffusers.StableCascadePriorPipeline.__call__]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13331/src/diffusers/pipelines/stable_cascade/pipeline_stable_cascade_prior.py#L375)

	`__call__(prompt: str | list[str] | None = None, images: torch.Tensor | PIL.Image.Image | list[torch.Tensor] | list[PIL.Image.Image] = None, height: int = 1024, width: int = 1024, num_inference_steps: int = 20, timesteps: list = None, guidance_scale: float = 4.0, negative_prompt: str | list[str] | None = None, prompt_embeds: torch.Tensor | None = None, prompt_embeds_pooled: torch.Tensor | None = None, negative_prompt_embeds: torch.Tensor | None = None, negative_prompt_embeds_pooled: torch.Tensor | None = None, image_embeds: torch.Tensor | None = None, num_images_per_prompt: int | None = 1, generator: torch._C.Generator | list[torch._C.Generator] | None = None, latents: torch.Tensor | None = None, output_type: str | None = 'pt', return_dict: bool = True, callback_on_step_end: typing.Optional[typing.Callable[[int, int], NoneType]] = None, callback_on_step_end_tensor_inputs: list = ['latents'])`

	- prompt (`str` or `list[str]`) --
	The prompt or prompts to guide the image generation.
	- height (`int`, optional, defaults to 1024) --
	The height in pixels of the generated image.
	- width (`int`, optional, defaults to 1024) --
	The width in pixels of the generated image.
	- num_inference_steps (`int`, optional, defaults to 20) --
	The number of denoising steps. More denoising steps usually lead to a higher quality image at the
	expense of slower inference.
	- guidance_scale (`float`, optional, defaults to 4.0) --
	Guidance scale as defined in [Classifier-Free Diffusion
	Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of
	equation 2 of the [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance is enabled by
	setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are
	closely linked to the text `prompt`, usually at the expense of lower image quality.
	- negative_prompt (`str` or `list[str]`, optional) --
	The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
	if `guidance_scale` is less than `1`).
	- prompt_embeds (`torch.Tensor`, optional) --
	Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not
	provided, text embeddings will be generated from `prompt` input argument.
	- prompt_embeds_pooled (`torch.Tensor`, optional) --
	Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting.
	If not provided, pooled text embeddings will be generated from `prompt` input argument.
	- negative_prompt_embeds (`torch.Tensor`, optional) --
	Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt
	weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
	argument.
	- negative_prompt_embeds_pooled (`torch.Tensor`, optional) --
	Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt
	weighting. If not provided, negative_prompt_embeds_pooled will be generated from `negative_prompt`
	input argument.
	- image_embeds (`torch.Tensor`, optional) --
	Pre-generated image embeddings. Can be used to easily tweak image inputs. If
	not provided, image embeddings will be generated from the `images` input argument, if given.
	- num_images_per_prompt (`int`, optional, defaults to 1) --
	The number of images to generate per prompt.
	- generator (`torch.Generator` or `list[torch.Generator]`, optional) --
	One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
	to make generation deterministic.
	- latents (`torch.Tensor`, optional) --
	Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
	generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
	tensor will be generated by sampling using the supplied random `generator`.
	- output_type (`str`, optional, defaults to `"pt"`) --
	The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
	(`np.array`) or `"pt"` (`torch.Tensor`).
	- return_dict (`bool`, optional, defaults to `True`) --
	Whether or not to return a `StableCascadePriorPipelineOutput` instead of a plain tuple.
	- callback_on_step_end (`Callable`, optional) --
	A function that is called at the end of each denoising step during inference. The function is called
	with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
	callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
	`callback_on_step_end_tensor_inputs`.
	- callback_on_step_end_tensor_inputs (`list`, optional) --
	The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
	will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
	`._callback_tensor_inputs` attribute of your pipeline class.

	Returns:

	`StableCascadePriorPipelineOutput` or `tuple`: A `StableCascadePriorPipelineOutput` if `return_dict` is
	True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated image
	embeddings.

	Function invoked when calling the pipeline for generation.

	Examples:
	```py
	>>> import torch
	>>> from diffusers import StableCascadePriorPipeline

	>>> prior_pipe = StableCascadePriorPipeline.from_pretrained(
	...     "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
	... ).to("cuda")

	>>> prompt = "an image of a shiba inu, donning a spacesuit and helmet"
	>>> prior_output = prior_pipe(prompt)
	```

	Parameters:

	prior (`StableCascadeUNet`) : The Stable Cascade prior to approximate the image embedding from the text and/or image embedding.

	text_encoder (`CLIPTextModelWithProjection`) : Frozen text-encoder ([laion/CLIP-ViT-bigG-14-laion2B-39B-b160k](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)).

	feature_extractor ([CLIPImageProcessor](https://huggingface.co/docs/transformers/main/en/model_doc/clip#transformers.CLIPImageProcessor)) : Model that extracts features from generated images to be used as inputs for the `image_encoder`.

	image_encoder (`CLIPVisionModelWithProjection`) : Frozen CLIP image-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).

	tokenizer (`CLIPTokenizer`) : Tokenizer of class [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).

	scheduler (`DDPMWuerstchenScheduler`) : A scheduler to be used in combination with `prior` to generate image embedding.

	resolution_multiple (`float`, optional, defaults to 42.67) : Default resolution for multiple images generated.
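
How `resolution_multiple` relates a requested image size to the prior's latent grid can be sketched as below. This is a minimal illustration: `prior_latent_hw` is a hypothetical helper, and rounding up with `ceil` is an assumption about how the pipeline derives the latent size, not something stated on this page.

```python
from math import ceil


def prior_latent_hw(height: int, width: int, resolution_multiple: float = 42.67) -> tuple[int, int]:
    """Latent grid size for an image, assuming each side is divided by
    `resolution_multiple` and rounded up."""
    return ceil(height / resolution_multiple), ceil(width / resolution_multiple)


print(prior_latent_hw(1024, 1024))
print(prior_latent_hw(768, 1536))
```

Under this assumption, the default 1024 x 1024 request maps to the 24 x 24 latents mentioned in the model overview, and non-square sizes scale each side independently.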


	## StableCascadePriorPipelineOutput[[diffusers.pipelines.stable_cascade.pipeline_stable_cascade_prior.StableCascadePriorPipelineOutput]]

	#### diffusers.pipelines.stable_cascade.pipeline_stable_cascade_prior.StableCascadePriorPipelineOutput[[diffusers.pipelines.stable_cascade.pipeline_stable_cascade_prior.StableCascadePriorPipelineOutput]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13331/src/diffusers/pipelines/stable_cascade/pipeline_stable_cascade_prior.py#L60)

	Output class for `StableCascadePriorPipeline`.

	Parameters:

	image_embeddings (`torch.Tensor` or `np.ndarray`) : Prior image embeddings for the text prompt.

	prompt_embeds (`torch.Tensor`) : Text embeddings for the prompt.

	negative_prompt_embeds (`torch.Tensor`) : Text embeddings for the negative prompt.

	## StableCascadeDecoderPipeline[[diffusers.StableCascadeDecoderPipeline]]

	#### diffusers.StableCascadeDecoderPipeline[[diffusers.StableCascadeDecoderPipeline]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13331/src/diffusers/pipelines/stable_cascade/pipeline_stable_cascade.py#L58)

	Pipeline for generating images from the Stable Cascade model.

	This model inherits from [DiffusionPipeline](/docs/diffusers/pr_13331/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the
	library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

	#### __call__[[diffusers.StableCascadeDecoderPipeline.__call__]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13331/src/diffusers/pipelines/stable_cascade/pipeline_stable_cascade.py#L304)

	`__call__(image_embeddings: torch.Tensor | list[torch.Tensor], prompt: str | list[str] = None, num_inference_steps: int = 10, guidance_scale: float = 0.0, negative_prompt: str | list[str] | None = None, prompt_embeds: torch.Tensor | None = None, prompt_embeds_pooled: torch.Tensor | None = None, negative_prompt_embeds: torch.Tensor | None = None, negative_prompt_embeds_pooled: torch.Tensor | None = None, num_images_per_prompt: int = 1, generator: torch._C.Generator | list[torch._C.Generator] | None = None, latents: torch.Tensor | None = None, output_type: str | None = 'pil', return_dict: bool = True, callback_on_step_end: typing.Optional[typing.Callable[[int, int], NoneType]] = None, callback_on_step_end_tensor_inputs: list = ['latents'])`

	- image_embeddings (`torch.Tensor` or `list[torch.Tensor]`) --
	Image Embeddings either extracted from an image or generated by a Prior Model.
	- prompt (`str` or `list[str]`) --
	The prompt or prompts to guide the image generation.
	- num_inference_steps (`int`, optional, defaults to 12) --
	The number of denoising steps. More denoising steps usually lead to a higher quality image at the
	expense of slower inference.
	- guidance_scale (`float`, optional, defaults to 0.0) --
	Guidance scale as defined in [Classifier-Free Diffusion
	Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of
	equation 2 of the [Imagen paper](https://huggingface.co/papers/2205.11487). Guidance is enabled by
	setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are
	closely linked to the text `prompt`, usually at the expense of lower image quality.
	- negative_prompt (`str` or `list[str]`, optional) --
	The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
	if `guidance_scale` is less than `1`).
	- prompt_embeds (`torch.Tensor`, optional) --
	Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not
	provided, text embeddings will be generated from `prompt` input argument.
	- prompt_embeds_pooled (`torch.Tensor`, optional) --
	Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting.
	If not provided, pooled text embeddings will be generated from `prompt` input argument.
	- negative_prompt_embeds (`torch.Tensor`, optional) --
	Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt
	weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
	argument.
	- negative_prompt_embeds_pooled (`torch.Tensor`, optional) --
	Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt
	weighting. If not provided, negative_prompt_embeds_pooled will be generated from `negative_prompt`
	input argument.
	- num_images_per_prompt (`int`, optional, defaults to 1) --
	The number of images to generate per prompt.
	- generator (`torch.Generator` or `list[torch.Generator]`, optional) --
	One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
	to make generation deterministic.
	- latents (`torch.Tensor`, optional) --
	Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
	generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
	tensor will be generated by sampling using the supplied random `generator`.
	- output_type (`str`, optional, defaults to `"pil"`) --
	The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
	(`np.array`) or `"pt"` (`torch.Tensor`).
	- return_dict (`bool`, optional, defaults to `True`) --
	Whether or not to return an [ImagePipelineOutput](/docs/diffusers/pr_13331/en/api/pipelines/stable_unclip#diffusers.ImagePipelineOutput) instead of a plain tuple.
	- callback_on_step_end (`Callable`, optional) --
	A function that is called at the end of each denoising step during inference. The function is called
	with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
	callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
	`callback_on_step_end_tensor_inputs`.
	- callback_on_step_end_tensor_inputs (`list`, optional) --
	The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
	will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
	`._callback_tensor_inputs` attribute of your pipeline class.
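The effect of `guidance_scale` can be sketched with plain Python floats. The blend below follows the `w` weighting of equation 2 in the Imagen paper; the helper name `apply_guidance` and the toy values are hypothetical, and the pipeline itself works on tensors, so treat this as an illustration of the arithmetic rather than the pipeline's actual code.

```python
def apply_guidance(uncond, cond, guidance_scale):
    """Blend predictions elementwise as w * cond + (1 - w) * uncond."""
    w = guidance_scale
    return [w * c + (1.0 - w) * u for u, c in zip(uncond, cond)]


# Toy stand-ins for the unconditional and text-conditioned model predictions.
uncond = [0.0, 1.0]
cond = [1.0, 3.0]

same_as_cond = apply_guidance(uncond, cond, 1.0)  # w = 1 reproduces the conditional prediction
pushed = apply_guidance(uncond, cond, 2.0)  # w > 1 extrapolates away from the unconditional one
```

With `w = 1` the blend returns the conditional prediction unchanged, which is consistent with guidance only taking effect for values above 1.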

	Function invoked when calling the pipeline for generation.

	Examples:
	```py
	>>> import torch
	>>> from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

	>>> prior_pipe = StableCascadePriorPipeline.from_pretrained(
	...     "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
	... ).to("cuda")
	>>> gen_pipe = StableCascadeDecoderPipeline.from_pretrained(
	...     "stabilityai/stable-cascade", torch_dtype=torch.float16
	... ).to("cuda")

	>>> prompt = "an image of a shiba inu, donning a spacesuit and helmet"
	>>> prior_output = prior_pipe(prompt)
	>>> # Cast the bfloat16 prior output to the decoder's float16 dtype.
	>>> images = gen_pipe(prior_output.image_embeddings.to(torch.float16), prompt=prompt).images
	```
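A `callback_on_step_end` function could look like the following minimal sketch. The names `step_log` and `record_step` are hypothetical. Returning the `callback_kwargs` dict is an assumption based on how callbacks behave in other diffusers pipelines, where the pipeline reads any modified tensors back from the returned dict.

```python
step_log = []  # hypothetical container for per-step records


def record_step(pipe, step, timestep, callback_kwargs):
    """Record the step index and latent shape, then hand the tensors back unchanged."""
    latents = callback_kwargs["latents"]
    step_log.append((step, tuple(latents.shape)))
    # Return the dict so the pipeline can pick up any tensors the callback modified.
    return callback_kwargs


# Assumed usage with the pipelines from the example above:
# images = gen_pipe(
#     prior_output.image_embeddings,
#     prompt=prompt,
#     callback_on_step_end=record_step,
#     callback_on_step_end_tensor_inputs=["latents"],
# ).images
```

Only tensors named in `callback_on_step_end_tensor_inputs` appear in `callback_kwargs`, so the callback should not expect keys beyond those it requested.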

	Parameters:

	tokenizer (`CLIPTokenizer`) : The CLIP tokenizer.

	text_encoder (`CLIPTextModelWithProjection`) : The CLIP text encoder.

	decoder (`StableCascadeUNet`) : The Stable Cascade decoder unet.

	vqgan (`PaellaVQModel`) : The VQGAN model.

	scheduler (`DDPMWuerstchenScheduler`) : A scheduler to be used in combination with `decoder` to generate image latents.

	latent_dim_scale (`float`, optional, defaults to 10.67) : Multiplier to determine the VQ latent space size from the image embeddings. If the image embeddings are height=24 and width=24, the VQ latent shape needs to be height=`int(24 * 10.67) = 256` and width=`int(24 * 10.67) = 256` in order to match the training conditions.
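The arithmetic behind `latent_dim_scale` can be sketched as follows. The helper name `vq_latent_size` is hypothetical and only mirrors the calculation described above; it is not part of the pipeline's API.

```python
def vq_latent_size(embedding_height, embedding_width, latent_dim_scale=10.67):
    """Return (height, width) of the VQ latents for a given image-embedding size."""
    return (
        int(embedding_height * latent_dim_scale),
        int(embedding_width * latent_dim_scale),
    )


# 24 * 10.67 is roughly 256.08, which int() truncates to 256.
size = vq_latent_size(24, 24)
```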

	Returns:

	[ImagePipelineOutput](/docs/diffusers/pr_13331/en/api/pipelines/stable_unclip#diffusers.ImagePipelineOutput) or `tuple`: an [ImagePipelineOutput](/docs/diffusers/pr_13331/en/api/pipelines/stable_unclip#diffusers.ImagePipelineOutput) if `return_dict` is `True`,
	otherwise a `tuple`. When returning a tuple, the first element is a list with the generated
	images.
