---
license: mit
datasets:
- onethousand/FaceNormalSeg-ControlNet-dataset
pipeline_tag: text-to-image
base_model:
- SG161222/Realistic_Vision_V5.1_noVAE
library_name: diffusers
tags:
- face
---

# AnimPortrait3D ControlNet

This ControlNet is used in the pipeline of [AnimPortrait3D](https://onethousandwu.com/AnimPortrait3D.github.io/) to align the diffusion model's guidance with the underlying mesh.

## Conditional Input
The input to this ControlNet is a concatenation of:
- **Normal map** (*num_channels=3*): rendered from the underlying mesh.
- **Segmentation map** (*num_channels=1*): includes segmented regions for teeth, eyes, and irises.
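
As a minimal sketch of how these two maps combine (random arrays stand in for real renders; the exact preprocessing is an assumption), the 4-channel conditioning input can be assembled like this:

```python
import numpy as np

# Random arrays stand in for a rendered normal map (H, W, 3) and a
# single-channel segmentation map (H, W, 1), both scaled to [0, 1].
normal_map = np.random.rand(512, 512, 3).astype(np.float32)
seg_map = (np.random.randint(0, 4, size=(512, 512, 1)) / 3.0).astype(np.float32)

# Concatenate along the channel axis and add a batch dimension:
# the conditioning input is a (batch, H, W, 4) array.
cond = np.concatenate([normal_map, seg_map], axis=2)[None, ...]
print(cond.shape)  # (1, 512, 512, 4)
```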

## Functionality
This ControlNet generates high-quality RGB images for the face, mouth, and eye regions, leveraging their respective conditional inputs.

![img](./assets/controlnet.png)

## Training Details

The ControlNet is trained on top of the [Realistic Vision V5.1](https://huggingface.co/SG161222/Realistic_Vision_V5.1_noVAE) diffusion model. Training takes approximately two days on an NVIDIA TITAN RTX GPU, with a batch size of 4 and a learning rate of 1e-4. During training, conditioning inputs are randomly dropped with a probability of 0.1.
The conditional input is a concatenated normal map and segmentation map, resulting in a 4-channel input (3 channels for the normal map and 1 for the segmentation map). The resolution of training images is fixed at 512x512. To keep the quantities of face, mouth, and eye data approximately balanced, we duplicate the relevant samples. For data augmentation, we employ random resized cropping during training.
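
The exact augmentation parameters are not published here, but a paired random resized crop (applied identically to the target image and its conditioning maps so they stay aligned) could be sketched in NumPy as follows; the scale range and nearest-neighbour resampling are assumptions:

```python
import numpy as np

def random_resized_crop_pair(image, cond, out_size=512, scale=(0.8, 1.0), rng=None):
    """Crop the same random region from an RGB target and its conditioning
    map, then resize both back to out_size x out_size with nearest-neighbour
    sampling (which keeps segmentation labels intact)."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    s = rng.uniform(*scale)                        # crop scale factor
    ch, cw = max(1, int(h * s)), max(1, int(w * s))
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))

    def crop_resize(x):
        patch = x[top:top + ch, left:left + cw]
        ys = np.linspace(0, ch - 1, out_size).astype(int)
        xs = np.linspace(0, cw - 1, out_size).astype(int)
        return patch[ys][:, xs]

    return crop_resize(image), crop_resize(cond)

# Demo on random stand-ins for a training image and its 4-channel condition.
img = np.random.rand(512, 512, 3).astype(np.float32)
cond = np.random.rand(512, 512, 4).astype(np.float32)
img_aug, cond_aug = random_resized_crop_pair(img, cond, rng=np.random.default_rng(0))
print(img_aug.shape, cond_aug.shape)  # (512, 512, 3) (512, 512, 4)
```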

For ControlNet guidance on the **face** region, we recommend using the complete text prompt describing the full avatar (e.g., "a teen boy, pensive look, dark hair, preppy sweater, collared shirt, moody room, 80s memorabilia"). However, for the **mouth** and **eye** regions, which typically lack person-specific features, we observe that detailed prompts degrade image quality. Consequently, we use more abstract text prompts paired with region-specific prefixes for these areas (e.g., "right eye region, a boy"), broadly categorizing the avatar.

![img](./assets/prompt.png)
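
A hypothetical sketch of this per-region prompt construction (the coarse category string and the region names are assumed labels for illustration, not a published API):

```python
# Full prompt for the face region; abstract prompt + region prefix elsewhere.
full_prompt = ("a teen boy, pensive look, dark hair, preppy sweater, "
               "collared shirt, moody room, 80s memorabilia")
category = "a boy"  # assumed coarse categorization of the avatar

region_prompts = {
    "face": full_prompt,                      # complete description
    "mouth": f"mouth region, {category}",     # region prefix + category
    "left_eye": f"left eye region, {category}",
    "right_eye": f"right eye region, {category}",
}
print(region_prompts["right_eye"])  # right eye region, a boy
```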

# Usage

```python
import numpy as np
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

base_model_path = "SG161222/Realistic_Vision_V5.1_noVAE"
controlnet_path = "onethousand/AnimPortrait3D_controlnet"

controlnet = ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16).to("cuda")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    base_model_path, controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Load the normal map (RGB) and segmentation map, then build the
# 4-channel conditioning input expected by the ControlNet.
control_image1 = load_image("./assets/face_normal.png").resize((512, 512))
control_image2 = load_image("./assets/face_seg.png").resize((512, 512))

control_image = np.concatenate(
    (np.array(control_image1), np.array(control_image2)[:, :, :1]), axis=2
)[None, ...] / 255.0
print(control_image.shape)  # (1, 512, 512, 4)

prompt = "a teen boy, pensive look, dark hair, preppy sweater, collared shirt, moody room, 80s memorabilia"

# generate image
generator = torch.manual_seed(0)
image = pipe(
    prompt, num_inference_steps=20, generator=generator, image=control_image, guidance_scale=7.5
).images[0]

# The pipeline already returns a PIL image, so it can be saved directly.
image.save("test.png")
```

# Original README of ControlNet

# ControlNet - v1.1 - *openpose Version*

**ControlNet v1.1** is the successor of [ControlNet v1.0](https://huggingface.co/lllyasviel/ControlNet)
and was released in [lllyasviel/ControlNet-v1-1](https://huggingface.co/lllyasviel/ControlNet-v1-1) by [Lvmin Zhang](https://huggingface.co/lllyasviel).

This checkpoint is a conversion of [the original checkpoint](https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_openpose.pth) into `diffusers` format.
It can be used in combination with **Stable Diffusion**, such as [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5).

For more details, please also have a look at the [🧨 Diffusers docs](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/controlnet).

ControlNet is a neural network structure to control diffusion models by adding extra conditions.

![img](./sd.png)

This checkpoint corresponds to the ControlNet conditioned on **openpose images**.

## Model Details
- **Developed by:** Lvmin Zhang, Maneesh Agrawala
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s):** English
- **License:** [The CreativeML OpenRAIL M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) is an [Open RAIL M license](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses), adapted from the work that [BigScience](https://bigscience.huggingface.co/) and [the RAIL Initiative](https://www.licenses.ai/) are jointly carrying out in the area of responsible AI licensing. See also [the article about the BLOOM Open RAIL license](https://bigscience.huggingface.co/blog/the-bigscience-rail-license) on which our license is based.
- **Resources for more information:** [GitHub Repository](https://github.com/lllyasviel/ControlNet), [Paper](https://arxiv.org/abs/2302.05543).
- **Cite as:**

    @misc{zhang2023adding,
      title={Adding Conditional Control to Text-to-Image Diffusion Models},
      author={Lvmin Zhang and Maneesh Agrawala},
      year={2023},
      eprint={2302.05543},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }

## Introduction

ControlNet was proposed in [*Adding Conditional Control to Text-to-Image Diffusion Models*](https://arxiv.org/abs/2302.05543) by
Lvmin Zhang and Maneesh Agrawala.

The abstract reads as follows:

*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions.
The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k).
Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on personal devices.
Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data.
We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc.
This may enrich the methods to control large diffusion models and further facilitate related applications.*

## Example

It is recommended to use the checkpoint with [Stable Diffusion v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5), as the checkpoint
has been trained on it.
Experimentally, the checkpoint can also be used with other diffusion models, such as dreamboothed Stable Diffusion.

**Note**: If you want to process an image to create the auxiliary conditioning, external dependencies are required, as shown below:

1. Install https://github.com/patrickvonplaten/controlnet_aux

```sh
$ pip install controlnet_aux==0.3.0
```

2. Let's install `diffusers` and related packages:

```sh
$ pip install diffusers transformers accelerate
```

3. Run code:

```python
import os

import torch
from controlnet_aux import OpenposeDetector
from diffusers import (
    ControlNetModel,
    StableDiffusionControlNetPipeline,
    UniPCMultistepScheduler,
)
from diffusers.utils import load_image

checkpoint = "lllyasviel/control_v11p_sd15_openpose"

image = load_image(
    "https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/input.png"
)

prompt = "chef in the kitchen"

# Extract the pose conditioning image from the input photo.
processor = OpenposeDetector.from_pretrained('lllyasviel/ControlNet')
control_image = processor(image, hand_and_face=True)

os.makedirs("images", exist_ok=True)
control_image.save("./images/control.png")

controlnet = ControlNetModel.from_pretrained(checkpoint, torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

generator = torch.manual_seed(0)
image = pipe(prompt, num_inference_steps=30, generator=generator, image=control_image).images[0]

image.save('images/image_out.png')
```

![input](./images/input.png)

![pose](./images/control.png)

![output](./images/image_out.png)

## Other released checkpoints v1-1

The authors released 14 different checkpoints, each trained with [Stable Diffusion v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5)
on a different type of conditioning:

| Model Name | Control Image Overview | Control Image Example | Generated Image Example |
|---|---|---|---|
|[lllyasviel/control_v11p_sd15_canny](https://huggingface.co/lllyasviel/control_v11p_sd15_canny)<br/> *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11e_sd15_ip2p](https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p)<br/> *Trained with pixel to pixel instruction* | No condition.|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_inpaint](https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint)<br/> *Trained with image inpainting* | No condition.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/output.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/output.png"/></a>|
|[lllyasviel/control_v11p_sd15_mlsd](https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd)<br/> *Trained with multi-level line segment detection* | An image with annotated line segments.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11f1p_sd15_depth](https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth)<br/> *Trained with depth estimation* | An image with depth information, usually represented as a grayscale image.|<a href="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_normalbae](https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae)<br/> *Trained with surface normal estimation* | An image with surface normal information, usually represented as a color-coded image.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_seg](https://huggingface.co/lllyasviel/control_v11p_sd15_seg)<br/> *Trained with image segmentation* | An image with segmented regions, usually represented as a color-coded image.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_lineart](https://huggingface.co/lllyasviel/control_v11p_sd15_lineart)<br/> *Trained with line art generation* | An image with line art, usually black lines on a white background.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15s2_lineart_anime](https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime)<br/> *Trained with anime line art generation* | An image with anime-style line art.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_openpose](https://huggingface.co/lllyasviel/control_v11p_sd15_openpose)<br/> *Trained with human pose estimation* | An image with human poses, usually represented as a set of keypoints or skeletons.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_scribble](https://huggingface.co/lllyasviel/control_v11p_sd15_scribble)<br/> *Trained with scribble-based image generation* | An image with scribbles, usually random or user-drawn strokes.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_softedge](https://huggingface.co/lllyasviel/control_v11p_sd15_softedge)<br/> *Trained with soft edge image generation* | An image with soft edges, usually to create a more painterly or artistic effect.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11e_sd15_shuffle](https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle)<br/> *Trained with image shuffling* | An image with shuffled patches or regions.|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/image_out.png"/></a>|

## Improvements in Openpose 1.1:

- The improvement of this model is mainly based on our improved implementation of OpenPose. We carefully reviewed the differences between the PyTorch OpenPose and CMU's C++ OpenPose. The processor is now more accurate, especially for hands, and this improved processor leads to the improvement of Openpose 1.1.
- More inputs are supported (hand and face).
- The training dataset of the previous cnet 1.0 had several problems: (1) a small group of greyscale human images was duplicated thousands of times (!!), making the previous model somewhat likely to generate grayscale human images; (2) some images had low quality, were very blurry, or had significant JPEG artifacts; (3) a small group of images had wrong paired prompts caused by a mistake in our data processing scripts. The new model fixes all of these problems in the training dataset and should behave more reasonably in many cases.

## More information

For more information, please have a look at the [Diffusers ControlNet Blog Post](https://huggingface.co/blog/controlnet) and the [official docs](https://github.com/lllyasviel/ControlNet-v1-1-nightly).