PrometheusProject, xinsir committed
Commit b78a816 · 0 parent(s)

Duplicate from xinsir/controlnet-canny-sdxl-1.0

Co-authored-by: qi <xinsir@users.noreply.huggingface.co>

.gitattributes ADDED

```
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
masonry.webp filter=lfs diff=lfs merge=lfs -text
```
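
As a rough illustration of what these rules do, the sketch below checks which files in this commit would be stored through Git LFS. It is an approximation, not Git's own matcher: `is_lfs_tracked` is a hypothetical helper, `LFS_PATTERNS` is only a subset of the rules above, patterns without a `/` are assumed to match the file's basename, and Python's `fnmatch` only roughly emulates the `**` semantics of gitattributes. Note that the `*_scribble_concat.webp` example images are not covered by any rule, while `masonry.webp` is listed explicitly.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A subset of the patterns from the .gitattributes file above.
LFS_PATTERNS = [
    "*.bin",
    "*.ckpt",
    "*.pt",
    "*.safetensors",
    "saved_model/**/*",
    "*.tar.*",
    "*tfevents*",
    "masonry.webp",
]


def is_lfs_tracked(path, patterns=LFS_PATTERNS):
    """Approximate whether `path` (relative to the repo root) matches an LFS rule.

    Hypothetical helper: patterns without "/" are matched against the basename,
    patterns with "/" against the full path. fnmatch's "*" also matches "/",
    so "**" is only roughly emulated.
    """
    name = PurePosixPath(path).name
    for pattern in patterns:
        target = path if "/" in pattern else name
        if fnmatchcase(target, pattern):
            return True
    return False


for f in ["diffusion_pytorch_model.safetensors", "config.json", "README.md",
          "masonry.webp", "000004_scribble_concat.webp"]:
    print(f, is_lfs_tracked(f))
```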
000004_scribble_concat.webp ADDED
000006_scribble_concat.webp ADDED
000010_scribble_concat.webp ADDED
000013_scribble_concat.webp ADDED
000016_scribble_concat.webp ADDED
000028_scribble_concat.webp ADDED
000031_scribble_concat.webp ADDED
000034_scribble_concat.webp ADDED
000059_scribble_concat.webp ADDED
000078_scribble_concat.webp ADDED
000097_scribble_concat.webp ADDED
README.md ADDED

```yaml
---
license: apache-2.0
tags:
- text_to_image
- diffusers
- controlnet
- controlnet-canny-sdxl-1.0
pipeline_tag: text-to-image
---
```

# ***Drawing like Midjourney! Come on!***
![images](./masonry.webp)

# Controlnet-Canny-Sdxl-1.0

Hello, I am very happy to announce the controlnet-canny-sdxl-1.0 model, **a very powerful ControlNet that can generate high-resolution images visually comparable with Midjourney**.
The model was trained on a large amount of high-quality data (over 10,000,000 images), carefully filtered and captioned with a powerful VLM (vision-language model). Several useful tricks were applied during training, including data augmentation, multiple losses, and multi-resolution training. With only one stage of training, it outperforms the other open-source canny models ([diffusers/controlnet-canny-sdxl-1.0], [TheMistoAI/MistoLine]). I am releasing it in the hope of advancing the application of Stable Diffusion models. Canny is one of the most important models in the ControlNet series and can be applied to many drawing and design tasks.

## Model Details

### Model Description

- **Developed by:** xinsir
- **Model type:** ControlNet_SDXL
- **License:** apache-2.0
- **Finetuned from model:** stabilityai/stable-diffusion-xl-base-1.0

### Model Sources

- **Paper (ControlNet):** https://arxiv.org/abs/2302.05543

## Uses

### Examples

prompt: A closeup of two day of the dead models, looking to the side, large flowered headdress, full dia de Los muertoe make up, lush red lips, butterflies, flowers, pastel colors, looking to the side, jungle, birds, color harmony , extremely detailed, intricate, ornate, motion, stunning, beautiful, unique, soft lighting

![images_0](./000031_scribble_concat.webp)

prompt: ghost with a plague doctor mask in a venice carnaval hyper realistic
![images_1](./000028_scribble_concat.webp)

prompt: A picture surrounded by blue stars and gold stars, glowing, dark navy blue and gray tones, distributed in light silver and gold, playful, festive atmosphere, pure fabric, chalk, FHD 8K
![images_2](./000016_scribble_concat.webp)

prompt: Delicious vegetarian pizza with champignon mushrooms, tomatoes, mozzarella, peppers and black olives, isolated on white background , transparent isolated white background , top down view, studio photo, transparent png, Clean sharp focus. High end retouching. Food magazine photography. Award winning photography. Advertising photography. Commercial photography
![images_3](./000010_scribble_concat.webp)

prompt: a blonde woman in a wedding dress in a maple forest in summer with a flower crown laurel. Watercolor painting in the style of John William Waterhouse. Romanticism. Ethereal light.
![images_4](./000006_scribble_concat.webp)

### Anime Examples

Note: for these anime examples, the base model is changed to CounterfeitXL; all other settings remain the same.

![images_5](./000013_scribble_concat.webp)

![images_6](./000034_scribble_concat.webp)

![images_7](./000059_scribble_concat.webp)

![images_8](./000078_scribble_concat.webp)

![images_9](./000097_scribble_concat.webp)

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline, AutoencoderKL
from diffusers import EulerAncestralDiscreteScheduler
from PIL import Image
import torch
import numpy as np
import cv2


def HWC3(x):
    """Convert a uint8 image of shape HxW, HxWx1, HxWx3 or HxWx4 into HxWx3."""
    assert x.dtype == np.uint8
    if x.ndim == 2:
        x = x[:, :, None]
    assert x.ndim == 3
    H, W, C = x.shape
    assert C == 1 or C == 3 or C == 4
    if C == 3:
        return x
    if C == 1:
        # Replicate a single channel (e.g. a canny edge map) into 3 channels.
        return np.concatenate([x, x, x], axis=2)
    if C == 4:
        # Composite an RGBA image onto a white background.
        color = x[:, :, 0:3].astype(np.float32)
        alpha = x[:, :, 3:4].astype(np.float32) / 255.0
        y = color * alpha + 255.0 * (1.0 - alpha)
        y = y.clip(0, 255).astype(np.uint8)
        return y


controlnet_conditioning_scale = 1.0
prompt = "your prompt, the longer the better, you can describe it in as much detail as possible"
negative_prompt = 'longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality'

eulera_scheduler = EulerAncestralDiscreteScheduler.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler")

controlnet = ControlNetModel.from_pretrained(
    "xinsir/controlnet-canny-sdxl-1.0",
    torch_dtype=torch.float16
)

# When testing with a different base model, you also need to change the VAE.
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)

pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    vae=vae,
    torch_dtype=torch.float16,
    scheduler=eulera_scheduler,
)
pipe.to("cuda")  # the fp16 weights are meant to run on a GPU

# Resize the image to about 1024 * 1024 pixels (or a resolution in the same bucket) to get the best performance.
controlnet_img = cv2.imread("your image path")
if controlnet_img is None:
    raise FileNotFoundError("could not read the control image")
height, width, _ = controlnet_img.shape
ratio = np.sqrt(1024. * 1024. / (width * height))
# Round down to multiples of 8: the pipeline works on latents 8x smaller than the image,
# and sizes not divisible by 8 can cause shape mismatches.
new_width = int(width * ratio) // 8 * 8
new_height = int(height * ratio) // 8 * 8
controlnet_img = cv2.resize(controlnet_img, (new_width, new_height))

controlnet_img = cv2.Canny(controlnet_img, 100, 200)  # single-channel edge map
controlnet_img = HWC3(controlnet_img)                  # -> 3-channel image
controlnet_img = Image.fromarray(controlnet_img)

images = pipe(
    prompt,
    negative_prompt=negative_prompt,
    image=controlnet_img,
    controlnet_conditioning_scale=controlnet_conditioning_scale,
    width=new_width,
    height=new_height,
    num_inference_steps=30,
).images

# PNG usually gives better image quality than JPG or WebP, but the files are much bigger.
images[0].save("your image save path.png")
```
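
The resize step above scales the control image to about 1024 * 1024 pixels while keeping its aspect ratio. The sketch below shows the same computation with each side additionally snapped to a bucket-friendly size. Rounding to multiples of 64 is an assumption based on SDXL's training buckets (e.g. 1344 x 768), not something this card specifies; any multiple of 8 keeps the image aligned with the latent grid. `bucket_size` is a hypothetical helper, and its result would replace `new_width`/`new_height` in both `cv2.resize` and the `pipe(...)` call.

```python
import math


def bucket_size(width, height, target_area=1024 * 1024, multiple=64):
    """Return (new_width, new_height) with about `target_area` pixels and the same aspect ratio.

    Each side is rounded to the nearest multiple of `multiple` (assumed 64, like SDXL's
    training buckets); any multiple of 8 keeps the image aligned with the VAE's latent grid.
    """
    ratio = math.sqrt(target_area / (width * height))
    new_width = max(multiple, round(width * ratio / multiple) * multiple)
    new_height = max(multiple, round(height * ratio / multiple) * multiple)
    return new_width, new_height


print(bucket_size(1920, 1080))  # (1344, 768)
print(bucket_size(1024, 1024))  # (1024, 1024)
```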

## Evaluation Metric

1. LAION Aesthetic Score (https://laion.ai/blog/laion-aesthetics/)
2. PerceptualSimilarity (https://github.com/richzhang/PerceptualSimilarity)

## Evaluation Data

The test data is randomly sampled from Midjourney upscaled images with their prompts, since the purpose of the project is to let people draw images like Midjourney. Midjourney's users include a large number of professional designers, and upscaled images tend to have higher aesthetic scores and better prompt consistency, which makes them a suitable test set for judging the ability of a ControlNet. We randomly selected 300 prompt-image pairs and generated 4 images per prompt, 1,200 images in total. We calculate the LAION Aesthetic Score to measure aesthetics and PerceptualSimilarity to measure control ability, and we find that the visual quality of the images is consistent with the metric values.
We compare our model with other SOTA Hugging Face models and list the results below. Our model has the highest aesthetic score and can generate visually appealing images if you prompt it properly.
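
The evaluation protocol described above (300 prompt-image pairs, 4 generations each, both metrics averaged) can be sketched as a small harness. Everything here is hypothetical: `evaluate` is an illustrative helper, and the scoring callables are placeholders that in practice would wrap the LAION aesthetic predictor and an LPIPS model from the PerceptualSimilarity repository, neither of which is implemented. The card does not state which image pair the perceptual distance compares; this sketch assumes each generated image is compared with its reference Midjourney image.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple


def evaluate(
    pairs: Sequence[Tuple[str, object]],
    generate: Callable[[str, object], object],
    aesthetic_score: Callable[[object], float],
    perceptual_distance: Callable[[object, object], float],
    images_per_prompt: int = 4,
):
    """Average both metrics over every generated image (hypothetical harness).

    `pairs` holds (prompt, reference_image) tuples; `generate` is assumed to build the
    canny condition from the reference image and run the ControlNet pipeline once.
    """
    aesthetics, distances = [], []
    for prompt, reference in pairs:
        for _ in range(images_per_prompt):
            image = generate(prompt, reference)
            aesthetics.append(aesthetic_score(image))                 # higher is better
            distances.append(perceptual_distance(image, reference))  # lower is better
    return {
        "n_images": len(aesthetics),
        "laion_aesthetic": mean(aesthetics),
        "perceptual_similarity": mean(distances),
    }


# Dummy stand-ins, only to show the bookkeeping: 300 pairs x 4 images = 1200 images.
pairs = [(f"prompt {i}", i) for i in range(300)]
result = evaluate(
    pairs,
    generate=lambda prompt, ref: ref,
    aesthetic_score=lambda img: 6.0,
    perceptual_distance=lambda img, ref: 0.0,
)
print(result["n_images"])  # 1200
```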

## Quantitative Result

| metric | xinsir/controlnet-canny-sdxl-1.0 | diffusers/controlnet-canny-sdxl-1.0 | TheMistoAI/MistoLine |
|-------|-------|-------|-------|
| laion_aesthetic | **6.03** | 5.93 | 5.82 |
| perceptual similarity | **0.4200** | 0.5053 | 0.5387 |

- laion_aesthetic: higher is better
- perceptual similarity: lower is better

Note: The values are calculated on images saved in WebP format. If the images are saved as PNG, the aesthetic values increase by 0.1-0.3, but the relative ranking remains unchanged.

## Training Details

The model is trained on high-quality data in a single stage, at the same resolution setting as sdxl-base (1024*1024). Following Lvmin Zhang, we use random thresholds to generate the canny images. It is essential to find proper hyperparameters for this data augmentation: making it too easy or too hard hurts model performance. In addition, we randomly mask out a random percentage of each canny image to force the model to learn more of the semantic relationship between the prompt and the lines.
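
A minimal sketch of the two augmentations described above (random Canny thresholds and random masking), under loudly stated assumptions: the threshold ranges, the grid-based block masking scheme, and `max_fraction` are illustrative guesses, since the card does not publish the actual hyperparameters. `sample_canny_thresholds` and `random_block_mask` are hypothetical helpers; in a real pipeline the sampled thresholds would be passed to `cv2.Canny`, which is not called here so that the sketch needs only NumPy.

```python
import numpy as np


def sample_canny_thresholds(rng, low_range=(50, 150), high_ratio=(1.5, 3.0)):
    """Draw a random (low, high) threshold pair for cv2.Canny (ranges are hypothetical)."""
    low = int(rng.integers(low_range[0], low_range[1] + 1))
    high = int(low * rng.uniform(*high_ratio))  # high > low because the ratio is >= 1.5
    return low, high


def random_block_mask(edges, rng, max_fraction=0.5, grid=8):
    """Zero out a random fraction of grid cells in an edge map (hypothetical scheme)."""
    h, w = edges.shape[:2]
    fraction = rng.uniform(0.0, max_fraction)
    cells = [(r, c) for r in range(grid) for c in range(grid)]
    n_masked = int(round(fraction * len(cells)))
    cell_h, cell_w = -(-h // grid), -(-w // grid)  # ceil division so cells cover the image
    out = edges.copy()
    for idx in rng.choice(len(cells), size=n_masked, replace=False):
        r, c = cells[idx]
        out[r * cell_h:(r + 1) * cell_h, c * cell_w:(c + 1) * cell_w] = 0
    return out


rng = np.random.default_rng(0)
low, high = sample_canny_thresholds(rng)
edges = np.full((64, 64), 255, dtype=np.uint8)  # stand-in for cv2.Canny(img, low, high)
masked = random_block_mask(edges, rng)
print(low, high, (masked == 0).mean())
```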

We use over 10,000,000 carefully annotated images; CogVLM (https://github.com/THUDM/CogVLM?tab=readme-ov-file) has proven to be a powerful image-captioning model. For comic images, it is recommended to use the waifu tagger (https://huggingface.co/spaces/SmilingWolf/wd-tagger) to generate special tags. More than 64 A100 GPUs were used to train the model, and the effective batch size is 2560 with gradient accumulation (accumulate_grad_batches).
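
The effective batch size of 2560 mentioned above is the number of samples contributing to one optimizer step under data parallelism plus gradient accumulation. The card gives only the total, not the per-GPU batch size or the accumulation factor, so the splits below are purely hypothetical.

```python
def effective_batch_size(num_gpus, per_gpu_batch, accumulate_grad_batches):
    """Samples per optimizer step with data parallelism and gradient accumulation."""
    return num_gpus * per_gpu_batch * accumulate_grad_batches


# Hypothetical splits that both reach the reported 2560; the real values are not published.
print(effective_batch_size(64, 8, 5))   # 2560
print(effective_batch_size(80, 32, 1))  # 2560
```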

### Training Data

The data comes from many sources, including Midjourney, LAION-5B, Danbooru, and so on. The data is carefully filtered and annotated.

### Conclusion

In our evaluation, the model achieved a better aesthetic score on real images compared with stabilityai/stable-diffusion-xl-base-1.0, and comparable performance on cartoon-style images.
The model has better control ability as measured by perceptual similarity, owing to stronger data augmentation and more training steps.
In addition, the model is less likely to generate abnormal images, such as images containing malformed human anatomy.

config.json ADDED

```json
{
  "_class_name": "ControlNetModel",
  "_diffusers_version": "0.20.0.dev0",
  "act_fn": "silu",
  "addition_embed_type": "text_time",
  "addition_embed_type_num_heads": 64,
  "addition_time_embed_dim": 256,
  "attention_head_dim": [
    5,
    10,
    20
  ],
  "block_out_channels": [
    320,
    640,
    1280
  ],
  "class_embed_type": null,
  "conditioning_channels": 3,
  "conditioning_embedding_out_channels": [
    16,
    32,
    96,
    256
  ],
  "controlnet_conditioning_channel_order": "rgb",
  "cross_attention_dim": 2048,
  "down_block_types": [
    "DownBlock2D",
    "CrossAttnDownBlock2D",
    "CrossAttnDownBlock2D"
  ],
  "downsample_padding": 1,
  "encoder_hid_dim": null,
  "encoder_hid_dim_type": null,
  "flip_sin_to_cos": true,
  "freq_shift": 0,
  "global_pool_conditions": false,
  "in_channels": 4,
  "layers_per_block": 2,
  "mid_block_scale_factor": 1,
  "norm_eps": 1e-05,
  "norm_num_groups": 32,
  "num_attention_heads": null,
  "num_class_embeds": null,
  "only_cross_attention": false,
  "projection_class_embeddings_input_dim": 2816,
  "resnet_time_scale_shift": "default",
  "transformer_layers_per_block": [
    1,
    2,
    10
  ],
  "upcast_attention": null,
  "use_linear_projection": true
}
```
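
Several dimensions in this config follow from the SDXL base UNet the ControlNet is paired with. The sketch below checks that arithmetic on an excerpt of the config. The encoder widths (768 for CLIP ViT-L, 1280 for OpenCLIP ViT-bigG) and the six SDXL micro-conditioning values are facts about SDXL that this commit does not state, and the head-count reading of `attention_head_dim` reflects a diffusers naming quirk that applies when `num_attention_heads` is null.

```python
import json

# Excerpt of the config.json above: only the fields used below.
CONFIG = json.loads("""
{
  "addition_time_embed_dim": 256,
  "attention_head_dim": [5, 10, 20],
  "block_out_channels": [320, 640, 1280],
  "cross_attention_dim": 2048,
  "num_attention_heads": null,
  "projection_class_embeddings_input_dim": 2816
}
""")

# SDXL conditions on the concatenated hidden states of two text encoders:
# CLIP ViT-L (768-d) and OpenCLIP ViT-bigG (1280-d).
CLIP_L_DIM, OPENCLIP_BIGG_DIM = 768, 1280
text_dim = CLIP_L_DIM + OPENCLIP_BIGG_DIM

# "text_time" added embedding: the pooled ViT-bigG text embedding (1280-d) concatenated with
# 6 micro-conditioning values (original size, crop top-left, target size), each embedded
# into addition_time_embed_dim dimensions.
NUM_TIME_IDS = 6
added_dim = OPENCLIP_BIGG_DIM + NUM_TIME_IDS * CONFIG["addition_time_embed_dim"]

# diffusers naming quirk: with num_attention_heads null, attention_head_dim is used as the
# number of heads per block, so the per-head width is channels // heads.
head_widths = [c // h for c, h in zip(CONFIG["block_out_channels"], CONFIG["attention_head_dim"])]

print(text_dim, CONFIG["cross_attention_dim"])                      # 2048 2048
print(added_dim, CONFIG["projection_class_embeddings_input_dim"])  # 2816 2816
print(head_widths)                                                  # [64, 64, 64]
```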

diffusion_pytorch_model.safetensors ADDED

```
version https://git-lfs.github.com/spec/v1
oid sha256:bf47cd757ceaf2572c53321329ef819ea38c09a6e3783588387913cd94dff47c
size 2502139104
```

diffusion_pytorch_model_V2.safetensors ADDED

```
version https://git-lfs.github.com/spec/v1
oid sha256:b3e4ac47bc814019d50dc842f579301440deb6d8f09ee1b91a30f527ace1b852
size 2502139104
```
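
Both weight files appear in this commit as Git LFS pointers, not as the weights themselves: each pointer records the spec version, a SHA-256 object ID and the byte size of the real file (about 2.5 GB here). A minimal parser for this `key value` per-line format is sketched below; `parse_lfs_pointer` is a hypothetical helper, not part of any Git LFS tool.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a {key: value} dict (one "key value" pair per line)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:bf47cd757ceaf2572c53321329ef819ea38c09a6e3783588387913cd94dff47c
size 2502139104
"""
info = parse_lfs_pointer(pointer)
algo, _, digest = info["oid"].partition(":")
size = int(info["size"])
print(algo, len(digest), f"{size / 1e9:.2f} GB", f"{size / 2**30:.2f} GiB")  # sha256 64 2.50 GB 2.33 GiB
```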
masonry.webp ADDED

Git LFS Details

  • SHA256: b3de2a1f1c6d10d0d183f288b372a4253e7bf0dfc8332626fef01b57f2b20f36
  • Pointer size: 132 Bytes
  • Size of remote file: 5.66 MB