<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DeepFloyd IF

## Overview

DeepFloyd IF is a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding.
The model is modular, composed of a frozen text encoder and three cascaded pixel diffusion modules:

- Stage 1: a base model that generates a 64x64 px image based on a text prompt,
- Stage 2: a 64x64 px => 256x256 px super-resolution model, and
- Stage 3: a 256x256 px => 1024x1024 px super-resolution model

Stage 1 and Stage 2 utilize a frozen text encoder based on the T5 transformer to extract text embeddings,
which are then fed into a UNet architecture enhanced with cross-attention and attention pooling.
Stage 3 is [Stability AI's x4 Upscaling model](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler).
The result is a highly efficient model that outperforms current state-of-the-art models, achieving a zero-shot FID score of 6.66 on the COCO dataset.
Our work underscores the potential of larger UNet architectures in the first stage of cascaded diffusion models and depicts a promising future for text-to-image synthesis.
## Usage

Before you can use IF, you need to accept its usage conditions. To do so:

1. Make sure to have a [Hugging Face account](https://huggingface.co/join) and be logged in.
2. Accept the license on the model card of [DeepFloyd/IF-I-XL-v1.0](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0). Accepting the license on the Stage 1 model card automatically accepts it for the other IF models.
3. Make sure to log in locally. Install `huggingface_hub`:

```sh
pip install huggingface_hub --upgrade
```

run the login function in a Python shell:

```py
from huggingface_hub import login

login()
```

and enter your [Hugging Face Hub access token](https://huggingface.co/docs/hub/security-tokens#what-are-user-access-tokens).

Next we install `diffusers` and its dependencies:

```sh
pip install diffusers accelerate transformers safetensors
```

The following sections give more detailed examples of how to use IF. Specifically:

- [Text-to-Image Generation](#text-to-image-generation)
- [Image-to-Image Generation](#text-guided-image-to-image-generation)
- [Inpainting](#text-guided-inpainting-generation)
- [Reusing model weights](#converting-between-different-pipelines)
- [Speed optimization](#optimizing-for-speed)
- [Memory optimization](#optimizing-for-memory)
**Available checkpoints**

- *Stage 1*
  - [DeepFloyd/IF-I-XL-v1.0](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0)
  - [DeepFloyd/IF-I-L-v1.0](https://huggingface.co/DeepFloyd/IF-I-L-v1.0)
  - [DeepFloyd/IF-I-M-v1.0](https://huggingface.co/DeepFloyd/IF-I-M-v1.0)
- *Stage 2*
  - [DeepFloyd/IF-II-L-v1.0](https://huggingface.co/DeepFloyd/IF-II-L-v1.0)
  - [DeepFloyd/IF-II-M-v1.0](https://huggingface.co/DeepFloyd/IF-II-M-v1.0)
- *Stage 3*
  - [stabilityai/stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler)

**Demo**

[DeepFloyd IF demo Space](https://huggingface.co/spaces/DeepFloyd/IF)

**Google Colab**

[Free-tier Google Colab notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb)
### Text-to-Image Generation

By default diffusers makes use of [model cpu offloading](https://huggingface.co/docs/diffusers/optimization/fp16#model-offloading-for-fast-inference-and-memory-savings)
to run the whole IF pipeline with as little as 14 GB of VRAM.

```python
from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch

# stage 1
stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# stage 2
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# stage 3
safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
generator = torch.manual_seed(1)

# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# stage 1
image = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
).images
pt_to_pil(image)[0].save("./if_stage_I.png")

# stage 2
image = stage_2(
    image=image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_II.png")

# stage 3
image = stage_3(prompt=prompt, image=image, noise_level=100, generator=generator).images
image[0].save("./if_stage_III.png")
```
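With `output_type="pt"`, the intermediate stages return images as tensors in the pipelines' internal `[-1, 1]` value range, which `pt_to_pil` converts to PIL images for saving. A minimal NumPy sketch of the idea behind that conversion (illustrative only; the actual diffusers helper operates on torch tensors):

```python
import numpy as np
from PIL import Image


def tensor_batch_to_pil(batch):
    # batch: float array shaped (N, C, H, W) with values in [-1, 1],
    # the range diffusion pipelines use internally
    arr = np.clip(batch / 2 + 0.5, 0.0, 1.0)  # rescale to [0, 1]
    arr = (arr * 255).round().astype("uint8")  # quantize to 8-bit
    arr = arr.transpose(0, 2, 3, 1)  # NCHW -> NHWC
    return [Image.fromarray(img) for img in arr]
```

The batch dimension becomes a list of images, so `tensor_batch_to_pil(batch)[0]` mirrors the `pt_to_pil(image)[0]` indexing used above.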
### Text Guided Image-to-Image Generation

The same IF model weights can be used for text-guided image-to-image translation or image variation.
In this case just make sure to load the weights using the [`IFImg2ImgPipeline`] and [`IFImg2ImgSuperResolutionPipeline`] pipelines.

**Note**: You can also directly move the weights of the text-to-image pipelines to the image-to-image pipelines
without loading them twice by making use of the [`~DiffusionPipeline.components`] property as explained [here](#converting-between-different-pipelines).
```python
from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch

from PIL import Image
import requests
from io import BytesIO

# download image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
original_image = Image.open(BytesIO(response.content)).convert("RGB")
original_image = original_image.resize((768, 512))

# stage 1
stage_1 = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# stage 2
stage_2 = IFImg2ImgSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# stage 3
safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = "A fantasy landscape in style minecraft"
generator = torch.manual_seed(1)

# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# stage 1
image = stage_1(
    image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_I.png")

# stage 2
image = stage_2(
    image=image,
    original_image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_II.png")

# stage 3
image = stage_3(prompt=prompt, image=image, generator=generator, noise_level=100).images
image[0].save("./if_stage_III.png")
```
### Text Guided Inpainting Generation

The same IF model weights can also be used for text-guided inpainting.
In this case just make sure to load the weights using the [`IFInpaintingPipeline`] and [`IFInpaintingSuperResolutionPipeline`] pipelines.

**Note**: You can also directly move the weights of the text-to-image pipelines to the inpainting pipelines
without loading them twice by making use of the [`~DiffusionPipeline.components`] property as explained [here](#converting-between-different-pipelines).

```python
from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch

from PIL import Image
import requests
from io import BytesIO

# download image
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png"
response = requests.get(url)
original_image = Image.open(BytesIO(response.content)).convert("RGB")

# download mask
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png"
response = requests.get(url)
mask_image = Image.open(BytesIO(response.content))

# stage 1
stage_1 = IFInpaintingPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# stage 2
stage_2 = IFInpaintingSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# stage 3
safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = "blue sunglasses"
generator = torch.manual_seed(1)

# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# stage 1
image = stage_1(
    image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_I.png")

# stage 2
image = stage_2(
    image=image,
    original_image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_II.png")

# stage 3
image = stage_3(prompt=prompt, image=image, generator=generator, noise_level=100).images
image[0].save("./if_stage_III.png")
```
### Converting between different pipelines

In addition to being loaded with `from_pretrained`, pipelines can also be loaded directly from each other.

```python
from diffusers import IFPipeline, IFSuperResolutionPipeline

pipe_1 = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0")
pipe_2 = IFSuperResolutionPipeline.from_pretrained("DeepFloyd/IF-II-L-v1.0")


from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline

pipe_1 = IFImg2ImgPipeline(**pipe_1.components)
pipe_2 = IFImg2ImgSuperResolutionPipeline(**pipe_2.components)


from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline

pipe_1 = IFInpaintingPipeline(**pipe_1.components)
pipe_2 = IFInpaintingSuperResolutionPipeline(**pipe_2.components)
```
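This works because `pipe.components` returns the pipeline's loaded modules as a dictionary, so the new pipeline wraps the same objects in memory instead of re-loading the weights from disk. A toy sketch of the pattern (hypothetical classes, not the diffusers API):

```python
class ToyPipeline:
    """Minimal stand-in for a diffusion pipeline holding two modules."""

    def __init__(self, unet, text_encoder):
        self.unet = unet
        self.text_encoder = text_encoder

    @property
    def components(self):
        # Expose loaded modules as a dict, analogous to DiffusionPipeline.components
        return {"unet": self.unet, "text_encoder": self.text_encoder}


class ToyImg2ImgPipeline(ToyPipeline):
    """A second task-specific pipeline sharing the same constructor signature."""


base = ToyPipeline(unet=object(), text_encoder=object())
img2img = ToyImg2ImgPipeline(**base.components)

# The modules are shared, not copied -- no second set of weights in memory
assert img2img.unet is base.unet
assert img2img.text_encoder is base.text_encoder
```

Because only references are passed, converting between task pipelines this way costs essentially no extra memory.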
### Optimizing for speed

The simplest optimization to run IF faster is to move all model components to the GPU.

```py
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")
```

You can also run the diffusion process for a smaller number of timesteps.
This can either be done with the `num_inference_steps` argument:

```py
pipe("<prompt>", num_inference_steps=30)
```

Or with the `timesteps` argument:

```py
from diffusers.pipelines.deepfloyd_if import fast27_timesteps

pipe("<prompt>", timesteps=fast27_timesteps)
```

When doing image variation or inpainting, you can also decrease the number of timesteps
with the `strength` argument. The `strength` argument is the amount of noise to add to
the input image, which also determines how many steps to run in the denoising process.
A smaller number will vary the image less but run faster.

```py
pipe = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

image = pipe(image=image, prompt="<prompt>", strength=0.3).images
```
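As a rough mental model, img2img-style pipelines skip the early part of the denoising schedule, so the number of steps actually run scales with `strength`. A sketch of that relationship (illustrative arithmetic, not the exact diffusers scheduling code):

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    # With strength s, roughly s * num_inference_steps denoising steps are run:
    # strength=1.0 runs the full schedule, strength=0.0 runs (almost) none.
    return min(int(num_inference_steps * strength), num_inference_steps)


# strength=0.3 with 100 scheduled steps runs about 30 denoising steps
print(effective_steps(100, 0.3))  # 30
```

This is why lowering `strength` both preserves more of the input image and finishes faster.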
You can also use [`torch.compile`](../../optimization/torch2.0). Note that we have not exhaustively tested `torch.compile`
with IF and it might not give expected results.

```py
import torch

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

pipe.text_encoder = torch.compile(pipe.text_encoder)
pipe.unet = torch.compile(pipe.unet)
```
### Optimizing for memory

When optimizing for GPU memory, we can use the standard diffusers CPU offloading APIs.

Either the model-based CPU offloading,

```py
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
```

or the more aggressive layer-based CPU offloading.

```py
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
```

Additionally, T5 can be loaded in 8-bit precision:

```py
from transformers import T5EncoderModel

text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
)

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    text_encoder=text_encoder,  # pass the previously instantiated 8bit text encoder
    unet=None,
    device_map="auto",
)

prompt_embeds, negative_embeds = pipe.encode_prompt("<prompt>")
```
For CPU RAM constrained machines like the Google Colab free tier, where we can't load all
model components to the CPU at once, we can manually load the pipeline with only
the text encoder or only the UNet when the respective model component is needed.

```py
from diffusers import DiffusionPipeline, IFPipeline, IFSuperResolutionPipeline
import torch
import gc
from transformers import T5EncoderModel
from diffusers.utils import pt_to_pil

text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
)

# text to image
pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    text_encoder=text_encoder,  # pass the previously instantiated 8bit text encoder
    unet=None,
    device_map="auto",
)

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

# Remove the pipeline so we can re-load the pipeline with the unet
del text_encoder
del pipe
gc.collect()
torch.cuda.empty_cache()

pipe = IFPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
)

generator = torch.Generator().manual_seed(0)
image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
).images

pt_to_pil(image)[0].save("./if_stage_I.png")

# Remove the pipeline so we can load the super-resolution pipeline
del pipe
gc.collect()
torch.cuda.empty_cache()

# First super resolution
pipe = IFSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
)

generator = torch.Generator().manual_seed(0)
image = pipe(
    image=image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
).images

pt_to_pil(image)[0].save("./if_stage_II.png")
```
## Available Pipelines

| Pipeline | Tasks | Colab |
|---|---|:---:|
| [pipeline_if.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if.py) | *Text-to-Image Generation* | - |
| [pipeline_if_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_superresolution.py) | *Text-to-Image Super-Resolution* | - |
| [pipeline_if_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img.py) | *Image-to-Image Generation* | - |
| [pipeline_if_img2img_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img_superresolution.py) | *Image-to-Image Super-Resolution* | - |
| [pipeline_if_inpainting.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting.py) | *Inpainting* | - |
| [pipeline_if_inpainting_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting_superresolution.py) | *Inpainting Super-Resolution* | - |
## IFPipeline

[[autodoc]] IFPipeline
	- all
	- __call__

## IFSuperResolutionPipeline

[[autodoc]] IFSuperResolutionPipeline
	- all
	- __call__

## IFImg2ImgPipeline

[[autodoc]] IFImg2ImgPipeline
	- all
	- __call__

## IFImg2ImgSuperResolutionPipeline

[[autodoc]] IFImg2ImgSuperResolutionPipeline
	- all
	- __call__

## IFInpaintingPipeline

[[autodoc]] IFInpaintingPipeline
	- all
	- __call__

## IFInpaintingSuperResolutionPipeline

[[autodoc]] IFInpaintingSuperResolutionPipeline
	- all
	- __call__