| <!-- |
|
|
| Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
|
|
| http://www.apache.org/licenses/LICENSE-2.0 |
|
|
| Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on |
| an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations under the License. |
| --> |
|
|
| # Stable unCLIP |
|
|
| Stable unCLIP checkpoints are fine-tuned from [Stable Diffusion 2.1](./stable_diffusion_2) checkpoints to condition on CLIP image embeddings. |
| Stable unCLIP still conditions on text embeddings as well. Given these two separate conditionings, Stable unCLIP can be used |
| for text-guided image variation. When combined with an unCLIP prior, it can also be used for full text-to-image generation. |
|
|
| To learn more about the unCLIP process, see the following paper: |
|
|
| [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. |
|
|
| ## Tips |
|
|
| Stable unCLIP takes a `noise_level` as input during inference. `noise_level` determines how much noise is added |
| to the image embeddings. A higher `noise_level` increases variation in the final denoised images. By default, |
| no additional noise is added to the image embeddings, i.e. `noise_level = 0`. |
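
One way to get a feel for this parameter is to run the same input at several noise levels and compare the outputs. The helper below is a minimal sketch of that idea: the name `sweep_noise_levels` and the default levels are illustrative assumptions, not part of diffusers, and `pipe` is assumed to be an already loaded `StableUnCLIPImg2ImgPipeline` (or any callable with the same interface).

```python
def sweep_noise_levels(pipe, init_image, noise_levels=(0, 100, 500), prompt=None):
    """Return {noise_level: image} for each requested level.

    Hypothetical helper: `pipe` is assumed to accept an image as its first
    argument plus `prompt` and `noise_level` keyword arguments.
    """
    results = {}
    for level in noise_levels:
        # Higher levels add more noise to the image embeddings,
        # so outputs should vary more from the input image.
        output = pipe(init_image, prompt=prompt, noise_level=level)
        results[level] = output.images[0]
    return results
```

With a loaded pipeline and an input image, `sweep_noise_levels(pipe, init_image)` would return one image per level, which can then be saved or viewed side by side.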
|
|
| ### Available checkpoints: |
|
|
| * Image variation |
|     * [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip) |
|     * [stabilityai/stable-diffusion-2-1-unclip-small](https://hf.co/stabilityai/stable-diffusion-2-1-unclip-small) |
| * Text-to-image |
|     * [stabilityai/stable-diffusion-2-1-unclip-small](https://hf.co/stabilityai/stable-diffusion-2-1-unclip-small) |
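
If a script needs to choose a checkpoint by task, the list above could be encoded as a small lookup table. This is only a sketch: the checkpoint IDs are copied from the list, while the task keys, the `CHECKPOINTS` dict, and the `checkpoints_for` helper are hypothetical names introduced for illustration.

```python
# Checkpoint IDs copied from the list above; the task keys are illustrative.
CHECKPOINTS = {
    "image-variation": [
        "stabilityai/stable-diffusion-2-1-unclip",
        "stabilityai/stable-diffusion-2-1-unclip-small",
    ],
    "text-to-image": [
        "stabilityai/stable-diffusion-2-1-unclip-small",
    ],
}


def checkpoints_for(task):
    """Return the checkpoint IDs listed for `task` (hypothetical helper)."""
    if task not in CHECKPOINTS:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(CHECKPOINTS)}")
    return CHECKPOINTS[task]
```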
|
|
| ### Text-to-Image Generation |
| Stable unCLIP can be used for text-to-image generation by pipelining it with the prior model of KakaoBrain's open-source DALL-E 2 replication, [Karlo](https://huggingface.co/kakaobrain/karlo-v1-alpha). |
|
|
| ```python |
| import torch |
| from diffusers import UnCLIPScheduler, DDPMScheduler, StableUnCLIPPipeline |
| from diffusers.models import PriorTransformer |
| from transformers import CLIPTokenizer, CLIPTextModelWithProjection |
|
|
| # Load the Karlo prior, which produces CLIP image embeddings from text. |
| prior_model_id = "kakaobrain/karlo-v1-alpha" |
| data_type = torch.float16 |
| prior = PriorTransformer.from_pretrained(prior_model_id, subfolder="prior", torch_dtype=data_type) |
|
|
| # Load the CLIP ViT-L/14 tokenizer and text encoder used by the prior. |
| prior_text_model_id = "openai/clip-vit-large-patch14" |
| prior_tokenizer = CLIPTokenizer.from_pretrained(prior_text_model_id) |
| prior_text_model = CLIPTextModelWithProjection.from_pretrained(prior_text_model_id, torch_dtype=data_type) |
| # Build a DDPMScheduler from the config of the prior's UnCLIPScheduler. |
| prior_scheduler = UnCLIPScheduler.from_pretrained(prior_model_id, subfolder="prior_scheduler") |
| prior_scheduler = DDPMScheduler.from_config(prior_scheduler.config) |
|
|
| stable_unclip_model_id = "stabilityai/stable-diffusion-2-1-unclip-small" |
|
|
| # Assemble the full text-to-image pipeline from the prior components and the Stable unCLIP checkpoint. |
| pipe = StableUnCLIPPipeline.from_pretrained( |
|     stable_unclip_model_id, |
|     torch_dtype=data_type, |
|     variant="fp16", |
|     prior_tokenizer=prior_tokenizer, |
|     prior_text_encoder=prior_text_model, |
|     prior=prior, |
|     prior_scheduler=prior_scheduler, |
| ) |
|
|
| pipe = pipe.to("cuda") |
| wave_prompt = "dramatic wave, the Oceans roar, Strong wave spiral across the oceans as the waves unfurl into roaring crests; perfect wave form; perfect wave shape; dramatic wave shape; wave shape unbelievable; wave; wave shape spectacular" |
|
|
| images = pipe(prompt=wave_prompt).images |
| images[0].save("waves.png") |
| ``` |
| <Tip warning={true}> |
|
|
| For text-to-image we use `stabilityai/stable-diffusion-2-1-unclip-small` because it was trained on CLIP ViT-L/14 embeddings, the same as the Karlo model prior. [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip) was trained on OpenCLIP ViT-H, so we don't recommend using it with the Karlo prior. |
|
|
| </Tip> |
|
|
| ### Text-guided Image-to-Image Variation |
|
|
| ```python |
| from diffusers import StableUnCLIPImg2ImgPipeline |
| from diffusers.utils import load_image |
| import torch |
|
|
| pipe = StableUnCLIPImg2ImgPipeline.from_pretrained( |
|     "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16" |
| ) |
| pipe = pipe.to("cuda") |
|
|
| url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png" |
| init_image = load_image(url) |
|
|
| images = pipe(init_image).images |
| images[0].save("variation_image.png") |
| ``` |
|
|
| Optionally, you can also pass a prompt to `pipe` such as: |
|
|
| ```python |
| prompt = "A fantasy landscape, trending on artstation" |
|
|
| images = pipe(init_image, prompt=prompt).images |
| images[0].save("variation_image_two.png") |
| ``` |
|
|
| ### Memory optimization |
|
|
| If you are short on GPU memory, you can enable model CPU offloading so that models that are not needed |
| immediately for a computation are kept on the CPU: |
|
|
| ```python |
| from diffusers import StableUnCLIPImg2ImgPipeline |
| from diffusers.utils import load_image |
| import torch |
|
|
| pipe = StableUnCLIPImg2ImgPipeline.from_pretrained( |
|     "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16" |
| ) |
| # Offload to CPU. |
| pipe.enable_model_cpu_offload() |
|
|
| url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png" |
| init_image = load_image(url) |
|
|
| images = pipe(init_image).images |
| images[0] |
| ``` |
|
|
| Further memory optimizations are possible by enabling VAE slicing on the pipeline: |
|
|
| ```python |
| from diffusers import StableUnCLIPImg2ImgPipeline |
| from diffusers.utils import load_image |
| import torch |
|
|
| pipe = StableUnCLIPImg2ImgPipeline.from_pretrained( |
|     "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16" |
| ) |
| pipe.enable_model_cpu_offload() |
| pipe.enable_vae_slicing() |
|
|
| url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png" |
| init_image = load_image(url) |
|
|
| images = pipe(init_image).images |
| images[0] |
| ``` |
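
The two options above can be toggled independently. As a minimal sketch, a small wrapper could apply them conditionally; `apply_memory_optimizations` and its flag names are hypothetical and not part of diffusers, while `enable_model_cpu_offload` and `enable_vae_slicing` are the pipeline methods used in the examples above.

```python
def apply_memory_optimizations(pipe, cpu_offload=True, vae_slicing=True):
    """Enable the memory savers described above on a pipeline (hypothetical helper)."""
    if cpu_offload:
        # Keeps sub-models on the CPU and moves each to the GPU only when it runs.
        pipe.enable_model_cpu_offload()
    if vae_slicing:
        # Decodes batched latents in slices to lower peak memory use.
        pipe.enable_vae_slicing()
    return pipe
```

As in the examples above, the pipeline is not moved to the GPU with `pipe.to("cuda")` when offloading is enabled.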
|
|
| ### StableUnCLIPPipeline |
|
|
| [[autodoc]] StableUnCLIPPipeline |
|     - all |
|     - __call__ |
|     - enable_attention_slicing |
|     - disable_attention_slicing |
|     - enable_vae_slicing |
|     - disable_vae_slicing |
|     - enable_xformers_memory_efficient_attention |
|     - disable_xformers_memory_efficient_attention |
|
|
|
|
| ### StableUnCLIPImg2ImgPipeline |
|
|
| [[autodoc]] StableUnCLIPImg2ImgPipeline |
|     - all |
|     - __call__ |
|     - enable_attention_slicing |
|     - disable_attention_slicing |
|     - enable_vae_slicing |
|     - disable_vae_slicing |
|     - enable_xformers_memory_efficient_attention |
|     - disable_xformers_memory_efficient_attention |
| |