| license: openrail | |
| base_model: runwayml/stable-diffusion-v1-5 | |
| tags: | |
| - art | |
| - controlnet | |
| - stable-diffusion | |
| - image-to-image | |
| duplicated_from: lllyasviel/sd-controlnet-seg | |
| # ControlNet - *Image Segmentation Version* | |
| ControlNet is a neural network structure to control diffusion models by adding extra conditions. | |
| This checkpoint corresponds to the ControlNet conditioned on **Image Segmentation**. | |
| It can be used in combination with [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img). | |
|  | |
| ## Model Details | |
| - **Developed by:** Lvmin Zhang, Maneesh Agrawala | |
| - **Model type:** Diffusion-based text-to-image generation model | |
| - **Language(s):** English | |
| - **License:** [The CreativeML OpenRAIL M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) is an [Open RAIL M license](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses), adapted from the work that [BigScience](https://bigscience.huggingface.co/) and [the RAIL Initiative](https://www.licenses.ai/) are jointly carrying out in the area of responsible AI licensing. See also [the article about the BLOOM Open RAIL license](https://bigscience.huggingface.co/blog/the-bigscience-rail-license) on which our license is based. | |
| - **Resources for more information:** [GitHub Repository](https://github.com/lllyasviel/ControlNet), [Paper](https://arxiv.org/abs/2302.05543). | |
| - **Cite as:** | |
| @misc{zhang2023adding, | |
| title={Adding Conditional Control to Text-to-Image Diffusion Models}, | |
| author={Lvmin Zhang and Maneesh Agrawala}, | |
| year={2023}, | |
| eprint={2302.05543}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.CV} | |
| } | |
| ## Introduction | |
| ControlNet was proposed in [*Adding Conditional Control to Text-to-Image Diffusion Models*](https://arxiv.org/abs/2302.05543) by | |
| Lvmin Zhang and Maneesh Agrawala. | |
| The abstract reads as follows: | |
| *We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. | |
| The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). | |
| Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. | |
| Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. | |
| We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. | |
| This may enrich the methods to control large diffusion models and further facilitate related applications.* | |
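The wiring described in the paper can be illustrated with a toy sketch: a locked block is combined with a trainable copy whose input and output pass through zero-initialized layers, so the extra branch contributes nothing before training. This is a minimal illustration under stated assumptions, not the released implementation: the "layers" are hypothetical scalar maps over plain lists, and the names `frozen_block`, `trainable_copy`, `zero_conv`, and `controlnet_block` are invented for this example.

```python
# Toy sketch of the ControlNet wiring (hypothetical stand-ins, not the real model).

def frozen_block(x):
    """Stand-in for a locked diffusion-model block: a fixed elementwise map."""
    return [2.0 * v + 1.0 for v in x]


def trainable_copy(x):
    """Stand-in for the trainable copy, which starts from the same weights."""
    return [2.0 * v + 1.0 for v in x]


def zero_conv(x, weight=0.0, bias=0.0):
    """A 1x1 convolution reduced to a scalar scale and shift, zero-initialized."""
    return [weight * v + bias for v in x]


def controlnet_block(x, condition, w_in=0.0, w_out=0.0):
    # y = F(x) + Z_out(F_copy(x + Z_in(c)))
    injected = [a + b for a, b in zip(x, zero_conv(condition, weight=w_in))]
    residual = zero_conv(trainable_copy(injected), weight=w_out)
    return [a + b for a, b in zip(frozen_block(x), residual)]


x = [0.5, -1.0, 2.0]   # hypothetical feature vector
c = [1.0, 1.0, 1.0]    # hypothetical conditioning vector

# With the zero layers still at zero, the output equals the frozen block's output.
at_init = controlnet_block(x, c)

# Hypothetical learned weights: the condition now changes the output.
after_training = controlnet_block(x, c, w_in=0.1, w_out=0.5)
```

Because both zero layers start at zero, adding the control branch leaves the pretrained model's behavior unchanged at the start of training; the condition only takes effect as those weights move away from zero.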
| ## Released Checkpoints | |
| The authors released 8 different checkpoints, each trained with [Stable Diffusion v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) | |
| on a different type of conditioning: | |
| | Model Name | Control Image Overview| Control Image Example | Generated Image Example | | |
| |---|---|---|---| | |
| |[lllyasviel/sd-controlnet-canny](https://huggingface.co/lllyasviel/sd-controlnet-canny)<br/> *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_bird_canny.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_bird_canny.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_canny_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_canny_1.png"/></a>| | |
| |[lllyasviel/sd-controlnet-depth](https://huggingface.co/lllyasviel/sd-controlnet-depth)<br/> *Trained with Midas depth estimation* |A grayscale image with black representing deep areas and white representing shallow areas.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_vermeer_depth.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_vermeer_depth.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_depth_2.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_depth_2.png"/></a>| | |
| |[lllyasviel/sd-controlnet-hed](https://huggingface.co/lllyasviel/sd-controlnet-hed)<br/> *Trained with HED edge detection (soft edge)* |A monochrome image with white soft edges on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_bird_hed.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_bird_hed.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_hed_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_hed_1.png"/></a> | | |
| |[lllyasviel/sd-controlnet-mlsd](https://huggingface.co/lllyasviel/sd-controlnet-mlsd)<br/> *Trained with M-LSD line detection* |A monochrome image composed only of white straight lines on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_room_mlsd.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_room_mlsd.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_mlsd_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_mlsd_0.png"/></a>| | |
| |[lllyasviel/sd-controlnet-normal](https://huggingface.co/lllyasviel/sd-controlnet-normal)<br/> *Trained with normal map* |A [normal mapped](https://en.wikipedia.org/wiki/Normal_mapping) image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_human_normal.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_human_normal.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_normal_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_normal_1.png"/></a>| | |
| |[lllyasviel/sd-controlnet-openpose](https://huggingface.co/lllyasviel/sd-controlnet-openpose)<br/> *Trained with OpenPose bone image* |An [OpenPose bone](https://github.com/CMU-Perceptual-Computing-Lab/openpose) image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_human_openpose.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_human_openpose.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_openpose_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_openpose_0.png"/></a>| | |
| |[lllyasviel/sd-controlnet-scribble](https://huggingface.co/lllyasviel/sd-controlnet-scribble)<br/> *Trained with human scribbles* |A hand-drawn monochrome image with white outlines on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_vermeer_scribble.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_vermeer_scribble.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_scribble_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_scribble_0.png"/></a> | | |
| |[lllyasviel/sd-controlnet-seg](https://huggingface.co/lllyasviel/sd-controlnet-seg)<br/>*Trained with semantic segmentation* |An image following the [ADE20K](https://groups.csail.mit.edu/vision/datasets/ADE20K/) segmentation protocol.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_room_seg.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_room_seg.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_seg_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_seg_1.png"/></a> | | |
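When switching between conditioning types in a script, the repository ids linked in the table can be kept in a small lookup. The sketch below is only a convenience: the dictionary keys and the helper name `checkpoint_for` are assumptions made for this example, while the repository ids are taken from the table's links.

```python
# Repository ids copied from the links in the table above.
CONTROLNET_CHECKPOINTS = {
    "canny": "lllyasviel/sd-controlnet-canny",
    "depth": "lllyasviel/sd-controlnet-depth",
    "hed": "lllyasviel/sd-controlnet-hed",
    "mlsd": "lllyasviel/sd-controlnet-mlsd",
    "normal": "lllyasviel/sd-controlnet-normal",
    "openpose": "lllyasviel/sd-controlnet-openpose",
    "scribble": "lllyasviel/sd-controlnet-scribble",
    "seg": "lllyasviel/sd-controlnet-seg",
}


def checkpoint_for(conditioning: str) -> str:
    """Return the repository id for a conditioning type (hypothetical helper)."""
    key = conditioning.strip().lower()
    if key not in CONTROLNET_CHECKPOINTS:
        known = ", ".join(sorted(CONTROLNET_CHECKPOINTS))
        raise ValueError(
            f"Unknown conditioning {conditioning!r}; expected one of: {known}"
        )
    return CONTROLNET_CHECKPOINTS[key]


repo_id = checkpoint_for("seg")
```

The returned id can then be passed to `ControlNetModel.from_pretrained(repo_id)`, as in the example further down this card.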
| ## Example | |
| It is recommended to use this checkpoint with [Stable Diffusion v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5), the base model | |
| it was trained on. | |
| Experimentally, the checkpoint also works with other diffusion models, such as DreamBooth fine-tuned Stable Diffusion. | |
| 1. Let's install `diffusers` and related packages: | |
| ``` | |
| $ pip install diffusers transformers accelerate | |
| ``` | |
| 2. Define the color palette that maps segmentation label ids to RGB colors, as described in the [semantic segmentation task guide](https://huggingface.co/docs/transformers/tasks/semantic_segmentation): | |
| ```py | |
| import numpy as np | |
| palette = np.asarray([ | |
| [0, 0, 0], | |
| [120, 120, 120], | |
| [180, 120, 120], | |
| [6, 230, 230], | |
| [80, 50, 50], | |
| [4, 200, 3], | |
| [120, 120, 80], | |
| [140, 140, 140], | |
| [204, 5, 255], | |
| [230, 230, 230], | |
| [4, 250, 7], | |
| [224, 5, 255], | |
| [235, 255, 7], | |
| [150, 5, 61], | |
| [120, 120, 70], | |
| [8, 255, 51], | |
| [255, 6, 82], | |
| [143, 255, 140], | |
| [204, 255, 4], | |
| [255, 51, 7], | |
| [204, 70, 3], | |
| [0, 102, 200], | |
| [61, 230, 250], | |
| [255, 6, 51], | |
| [11, 102, 255], | |
| [255, 7, 71], | |
| [255, 9, 224], | |
| [9, 7, 230], | |
| [220, 220, 220], | |
| [255, 9, 92], | |
| [112, 9, 255], | |
| [8, 255, 214], | |
| [7, 255, 224], | |
| [255, 184, 6], | |
| [10, 255, 71], | |
| [255, 41, 10], | |
| [7, 255, 255], | |
| [224, 255, 8], | |
| [102, 8, 255], | |
| [255, 61, 6], | |
| [255, 194, 7], | |
| [255, 122, 8], | |
| [0, 255, 20], | |
| [255, 8, 41], | |
| [255, 5, 153], | |
| [6, 51, 255], | |
| [235, 12, 255], | |
| [160, 150, 20], | |
| [0, 163, 255], | |
| [140, 140, 140], | |
| [250, 10, 15], | |
| [20, 255, 0], | |
| [31, 255, 0], | |
| [255, 31, 0], | |
| [255, 224, 0], | |
| [153, 255, 0], | |
| [0, 0, 255], | |
| [255, 71, 0], | |
| [0, 235, 255], | |
| [0, 173, 255], | |
| [31, 0, 255], | |
| [11, 200, 200], | |
| [255, 82, 0], | |
| [0, 255, 245], | |
| [0, 61, 255], | |
| [0, 255, 112], | |
| [0, 255, 133], | |
| [255, 0, 0], | |
| [255, 163, 0], | |
| [255, 102, 0], | |
| [194, 255, 0], | |
| [0, 143, 255], | |
| [51, 255, 0], | |
| [0, 82, 255], | |
| [0, 255, 41], | |
| [0, 255, 173], | |
| [10, 0, 255], | |
| [173, 255, 0], | |
| [0, 255, 153], | |
| [255, 92, 0], | |
| [255, 0, 255], | |
| [255, 0, 245], | |
| [255, 0, 102], | |
| [255, 173, 0], | |
| [255, 0, 20], | |
| [255, 184, 184], | |
| [0, 31, 255], | |
| [0, 255, 61], | |
| [0, 71, 255], | |
| [255, 0, 204], | |
| [0, 255, 194], | |
| [0, 255, 82], | |
| [0, 10, 255], | |
| [0, 112, 255], | |
| [51, 0, 255], | |
| [0, 194, 255], | |
| [0, 122, 255], | |
| [0, 255, 163], | |
| [255, 153, 0], | |
| [0, 255, 10], | |
| [255, 112, 0], | |
| [143, 255, 0], | |
| [82, 0, 255], | |
| [163, 255, 0], | |
| [255, 235, 0], | |
| [8, 184, 170], | |
| [133, 0, 255], | |
| [0, 255, 92], | |
| [184, 0, 255], | |
| [255, 0, 31], | |
| [0, 184, 255], | |
| [0, 214, 255], | |
| [255, 0, 112], | |
| [92, 255, 0], | |
| [0, 224, 255], | |
| [112, 224, 255], | |
| [70, 184, 160], | |
| [163, 0, 255], | |
| [153, 0, 255], | |
| [71, 255, 0], | |
| [255, 0, 163], | |
| [255, 204, 0], | |
| [255, 0, 143], | |
| [0, 255, 235], | |
| [133, 255, 0], | |
| [255, 0, 235], | |
| [245, 0, 255], | |
| [255, 0, 122], | |
| [255, 245, 0], | |
| [10, 190, 212], | |
| [214, 255, 0], | |
| [0, 204, 255], | |
| [20, 0, 255], | |
| [255, 255, 0], | |
| [0, 153, 255], | |
| [0, 41, 255], | |
| [0, 255, 204], | |
| [41, 0, 255], | |
| [41, 255, 0], | |
| [173, 0, 255], | |
| [0, 245, 255], | |
| [71, 0, 255], | |
| [122, 0, 255], | |
| [0, 255, 184], | |
| [0, 92, 255], | |
| [184, 255, 0], | |
| [0, 133, 255], | |
| [255, 214, 0], | |
| [25, 194, 194], | |
| [102, 255, 0], | |
| [92, 0, 255], | |
| ]) | |
| ``` | |
| 3. With the color palette defined, we can run the full segmentation and ControlNet generation code: | |
| ```py | |
| import os | |
| from transformers import AutoImageProcessor, UperNetForSemanticSegmentation | |
| from PIL import Image | |
| import numpy as np | |
| import torch | |
| from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler | |
| from diffusers.utils import load_image | |
| # Load the segmentation model and its image processor | |
| image_processor = AutoImageProcessor.from_pretrained("openmmlab/upernet-convnext-small") | |
| image_segmentor = UperNetForSemanticSegmentation.from_pretrained("openmmlab/upernet-convnext-small") | |
| image = load_image("https://huggingface.co/lllyasviel/sd-controlnet-seg/resolve/main/images/house.png").convert('RGB') | |
| # Predict one label id per pixel at the input image's resolution | |
| pixel_values = image_processor(image, return_tensors="pt").pixel_values | |
| with torch.no_grad(): | |
|     outputs = image_segmentor(pixel_values) | |
| seg = image_processor.post_process_semantic_segmentation(outputs, target_sizes=[image.size[::-1]])[0] | |
| # Paint each label with its color from `palette` (defined in step 2) | |
| color_seg = np.zeros((seg.shape[0], seg.shape[1], 3), dtype=np.uint8) # height, width, 3 | |
| for label, color in enumerate(palette): | |
|     color_seg[seg == label, :] = color | |
| color_seg = color_seg.astype(np.uint8) | |
| image = Image.fromarray(color_seg) | |
| controlnet = ControlNetModel.from_pretrained( | |
|     "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16 | |
| ) | |
| pipe = StableDiffusionControlNetPipeline.from_pretrained( | |
|     "runwayml/stable-diffusion-v1-5", controlnet=controlnet, safety_checker=None, torch_dtype=torch.float16 | |
| ) | |
| pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config) | |
| # Remove the next line if you do not have xformers installed; | |
| # see https://huggingface.co/docs/diffusers/v0.13.0/en/optimization/xformers#installing-xformers | |
| # for installation instructions | |
| pipe.enable_xformers_memory_efficient_attention() | |
| # Keep models on the CPU and move each to the GPU only while it is needed | |
| pipe.enable_model_cpu_offload() | |
| # Generate from the prompt "house", conditioned on the colored segmentation map | |
| image = pipe("house", image, num_inference_steps=20).images[0] | |
| # Create the output directory first so that save() does not fail | |
| os.makedirs('./images', exist_ok=True) | |
| image.save('./images/house_seg_out.png') | |
| ``` | |
|  | |
|  | |
|  | |
| ### Training | |
| The semantic segmentation ControlNet was trained on 164K (segmentation image, caption) pairs from ADE20K. Training took 200 GPU-hours on Nvidia A100 80GB GPUs, with Stable Diffusion 1.5 as the base model. | |
| ### Blog post | |
| For more information, please also have a look at the [official ControlNet Blog Post](https://huggingface.co/blog/controlnet). |