---
license: openrail
base_model: runwayml/stable-diffusion-v1-5
tags:
- art
- controlnet
- stable-diffusion
---

# ControlNet - *Normal Map Version*

ControlNet is a neural network structure to control diffusion models by adding extra conditions.
This checkpoint corresponds to the ControlNet conditioned on **Normal Map Estimation**.

It can be used in combination with [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img).
## Model Details
- **Developed by:** Lvmin Zhang, Maneesh Agrawala
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s):** English
- **License:** [The CreativeML OpenRAIL M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) is an [Open RAIL M license](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses), adapted from the work that [BigScience](https://bigscience.huggingface.co/) and [the RAIL Initiative](https://www.licenses.ai/) are jointly carrying out in the area of responsible AI licensing. See also [the article about the BLOOM Open RAIL license](https://bigscience.huggingface.co/blog/the-bigscience-rail-license) on which our license is based.
- **Resources for more information:** [GitHub Repository](https://github.com/lllyasviel/ControlNet), [Paper](https://arxiv.org/abs/2302.05543).
- **Cite as:**

    @misc{zhang2023adding,
      title={Adding Conditional Control to Text-to-Image Diffusion Models},
      author={Lvmin Zhang and Maneesh Agrawala},
      year={2023},
      eprint={2302.05543},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }
## Introduction

ControlNet was proposed in [*Adding Conditional Control to Text-to-Image Diffusion Models*](https://arxiv.org/abs/2302.05543) by
Lvmin Zhang and Maneesh Agrawala.

The abstract reads as follows:

*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions.
The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k).
Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on personal devices.
Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data.
We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc.
This may enrich the methods to control large diffusion models and further facilitate related applications.*
## Released Checkpoints

The authors released 8 different checkpoints, each trained with [Stable Diffusion v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5)
on a different type of conditioning:

| Model Name | Control Image Overview| Control Image Example | Generated Image Example |
|---|---|---|---|
|[lllyasviel/sd-controlnet-canny](https://huggingface.co/lllyasviel/sd-controlnet-canny)<br/> *Trained with Canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_bird_canny.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_bird_canny.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_canny_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_canny_1.png"/></a>|
|[lllyasviel/sd-controlnet-depth](https://huggingface.co/lllyasviel/sd-controlnet-depth)<br/> *Trained with Midas depth estimation* |A grayscale image with black representing deep areas and white representing shallow areas.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_vermeer_depth.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_vermeer_depth.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_depth_2.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_depth_2.png"/></a>|
|[lllyasviel/sd-controlnet-hed](https://huggingface.co/lllyasviel/sd-controlnet-hed)<br/> *Trained with HED edge detection (soft edge)* |A monochrome image with white soft edges on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_bird_hed.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_bird_hed.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_hed_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_hed_1.png"/></a> |
|[lllyasviel/sd-controlnet-mlsd](https://huggingface.co/lllyasviel/sd-controlnet-mlsd)<br/> *Trained with M-LSD line detection* |A monochrome image composed only of white straight lines on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_room_mlsd.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_room_mlsd.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_mlsd_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_mlsd_0.png"/></a>|
|[lllyasviel/sd-controlnet-normal](https://huggingface.co/lllyasviel/sd-controlnet-normal)<br/> *Trained with normal map* |A [normal mapped](https://en.wikipedia.org/wiki/Normal_mapping) image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_human_normal.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_human_normal.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_normal_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_normal_1.png"/></a>|
|[lllyasviel/sd-controlnet-openpose](https://huggingface.co/lllyasviel/sd-controlnet-openpose)<br/> *Trained with OpenPose bone image* |An [OpenPose bone](https://github.com/CMU-Perceptual-Computing-Lab/openpose) image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_human_openpose.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_human_openpose.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_openpose_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_openpose_0.png"/></a>|
|[lllyasviel/sd-controlnet-scribble](https://huggingface.co/lllyasviel/sd-controlnet-scribble)<br/> *Trained with human scribbles* |A hand-drawn monochrome image with white outlines on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_vermeer_scribble.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_vermeer_scribble.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_scribble_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_scribble_0.png"/></a> |
|[lllyasviel/sd-controlnet-seg](https://huggingface.co/lllyasviel/sd-controlnet-seg)<br/>*Trained with semantic segmentation* |An [ADE20K](https://groups.csail.mit.edu/vision/datasets/ADE20K/) segmentation protocol image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_room_seg.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_room_seg.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_seg_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_seg_1.png"/></a> |
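
All of these checkpoints load through the same `diffusers` API, so switching the conditioning type only means changing the checkpoint name. A minimal sketch (any model name from the table above can be substituted):

```py
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
import torch

# Any checkpoint from the table works here, e.g. "lllyasviel/sd-controlnet-canny"
# or "lllyasviel/sd-controlnet-depth"; this card's checkpoint is shown.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-normal", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
```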
## Example

It is recommended to use the checkpoint with [Stable Diffusion v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5), as the checkpoint
has been trained on it.
Experimentally, the checkpoint can also be used with other diffusion models, such as DreamBooth-finetuned versions of Stable Diffusion; a sketch follows below.
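
In that case, the base model name passed to the pipeline is simply replaced with the personalized checkpoint. A minimal sketch, where `some-user/my-dreambooth-model` is a placeholder, not a real repository:

```py
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
import torch

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-normal", torch_dtype=torch.float16
)

# "some-user/my-dreambooth-model" is a hypothetical DreamBooth-finetuned
# Stable Diffusion v1-5 checkpoint; substitute your own model here.
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "some-user/my-dreambooth-model", controlnet=controlnet, torch_dtype=torch.float16
)
```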
1. Let's install `diffusers` and related packages:

```
$ pip install diffusers transformers git+https://github.com/huggingface/accelerate.git
```

2. Run code:
```py
from PIL import Image
from transformers import pipeline
import numpy as np
import cv2
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
import torch
from diffusers.utils import load_image

image = load_image("https://huggingface.co/lllyasviel/sd-controlnet-normal/resolve/main/images/toy.png").convert("RGB")

# Estimate a depth map with the Midas-based DPT model.
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-hybrid-midas")

image = depth_estimator(image)['predicted_depth'][0]

image = image.numpy()

# Normalize the depth map to [0, 1] for background thresholding.
image_depth = image.copy()
image_depth -= np.min(image_depth)
image_depth /= np.max(image_depth)

bg_threshold = 0.4

# Approximate surface normals from depth gradients (Sobel), zeroing out the background.
x = cv2.Sobel(image, cv2.CV_32F, 1, 0, ksize=3)
x[image_depth < bg_threshold] = 0

y = cv2.Sobel(image, cv2.CV_32F, 0, 1, ksize=3)
y[image_depth < bg_threshold] = 0

z = np.ones_like(x) * np.pi * 2.0

# Stack into a normal map, normalize to unit length, and map to RGB.
image = np.stack([x, y, z], axis=2)
image /= np.sum(image ** 2.0, axis=2, keepdims=True) ** 0.5
image = (image * 127.5 + 127.5).clip(0, 255).astype(np.uint8)
image = Image.fromarray(image)

controlnet = ControlNetModel.from_pretrained(
    "fusing/stable-diffusion-v1-5-controlnet-normal", torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, safety_checker=None, torch_dtype=torch.float16
)

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

# Remove if you do not have xformers installed
# see https://huggingface.co/docs/diffusers/v0.13.0/en/optimization/xformers#installing-xformers
# for installation instructions
pipe.enable_xformers_memory_efficient_attention()

pipe.enable_model_cpu_offload()

image = pipe("cute toy", image, num_inference_steps=20).images[0]

image.save('images/toy_normal_out.png')
```
![toy](./images/toy.png)

![toy_normal](./images/toy_normal.png)

![toy_normal_out](./images/toy_normal_out.png)
### Training

The normal model was trained in two stages: an initial model, followed by a further extended model.

The initial normal model was trained on 25,452 normal-map/caption pairs from DIODE. The image captions were generated by BLIP. The model was trained for 100 GPU-hours on Nvidia A100 80GB using Stable Diffusion 1.5 as a base model.

The extended normal model further trained the initial normal model on "coarse" normal maps. The coarse normal maps were generated using Midas to compute a depth map and then performing normal-from-distance.
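
The normal-from-distance step is not specified in detail here; the sketch below mirrors the depth-to-normal preprocessing from the example above, condensed into a helper. The background threshold and kernel size are illustrative, not the actual training parameters:

```py
import cv2
import numpy as np

def normal_from_depth(depth: np.ndarray, bg_threshold: float = 0.4) -> np.ndarray:
    """Convert a depth map (H, W) into an RGB normal map (H, W, 3), uint8."""
    # Normalize depth to [0, 1] so the background threshold is scale-independent.
    d = (depth - depth.min()) / (depth.max() - depth.min())

    # Horizontal/vertical depth gradients approximate the x/y normal components.
    x = cv2.Sobel(depth, cv2.CV_32F, 1, 0, ksize=3)
    y = cv2.Sobel(depth, cv2.CV_32F, 0, 1, ksize=3)
    x[d < bg_threshold] = 0
    y[d < bg_threshold] = 0

    # Constant z component; normalizing gives unit-length normals.
    z = np.ones_like(x) * np.pi * 2.0
    normal = np.stack([x, y, z], axis=2)
    normal /= np.linalg.norm(normal, axis=2, keepdims=True)

    # Map [-1, 1] to the usual [0, 255] RGB encoding of normal maps.
    return (normal * 127.5 + 127.5).clip(0, 255).astype(np.uint8)
```

The model was trained for 200 GPU-hours on Nvidia A100 80GB using the initial normal model as a base model.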
### Blog post

For more information, please also have a look at the [official ControlNet Blog Post](https://huggingface.co/blog/controlnet).