Merge branch 'main' of https://huggingface.co/Crosstyan/BPModel

README.md CHANGED

@@ -27,10 +27,28 @@

Why does this model even exist? There are loads of Stable Diffusion models out there, especially anime-style models. Well, are there any models trained with a base resolution (`base_res`) of 768 or even 1024 before? I don't think so. Here it is, BPModel, a Stable Diffusion model you may love or hate. It was trained with 5k high-quality images that suit my taste (not necessarily yours, unfortunately) from [Sankaku Complex](https://chan.sankakucomplex.com), with annotations. The dataset is public in [Crosstyan/BPDataset](https://huggingface.co/datasets/Crosstyan/BPDataset) for the sake of full disclosure. A pure combination of tags may not be the optimal way to describe an image, but I don't need to do extra work. And no, I won't feed any AI-generated images to the model, even if that might outlaw the model from being used in some countries.
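
If you want to inspect the data yourself, the whole dataset repo can be mirrored locally. Here is a minimal sketch using `huggingface_hub`; how the images and tag annotations are laid out inside the folder is an assumption, so check the dataset page for the actual structure.

```python
# Minimal sketch: mirror the BPDataset repo locally via huggingface_hub.
# How images and tag annotations are laid out inside the folder is an
# assumption; consult the dataset page for the actual structure.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Crosstyan/BPDataset", repo_type="dataset")
print(local_dir)  # local folder containing the dataset files
```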

The training of a high-resolution model requires a significant amount of GPU hours and can be costly. In this particular case, 10 V100 GPU hours were spent on training 30 epochs at a resolution of 512, while 60 V100 GPU hours were spent on training 30 epochs at a resolution of 768. An additional 100 V100 GPU hours were spent training a model at a resolution of 1024, although **ONLY** 10 epochs were run. The 1024 resolution model did not show a significant improvement over the 768 resolution model, and the resource demands were high (only a batch size of 1 was achievable on a V100 with 32 GB of VRAM). However, training at 768 did yield better results than training at 512, and it is worth considering as an option. It is worth noting that Stable Diffusion 2.x also chose to train a 768 resolution model. Still, it may be more efficient to start by training at 512, since training at 768 is slower and needs the prior knowledge of a lower-resolution model to speed up convergence.
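
To make the `base_res` idea concrete, here is a small, hypothetical preprocessing sketch (plain torchvision, not the actual training pipeline): each image's shorter side is resized to the base resolution and then cropped to a square training sample.

```python
# Hypothetical base_res preprocessing sketch (not the actual pipeline):
# resize the shorter side to base_res, then take a square center crop.
from PIL import Image
from torchvision import transforms

base_res = 768  # the base resolution discussed above

preprocess = transforms.Compose([
    transforms.Resize(base_res),      # shorter side -> base_res
    transforms.CenterCrop(base_res),  # base_res x base_res sample
    transforms.ToTensor(),
])

sample = preprocess(Image.open("example.png").convert("RGB"))  # placeholder file
print(sample.shape)  # torch.Size([3, 768, 768])
```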

[Mikubill/naifu-diffusion](https://github.com/Mikubill/naifu-diffusion) is used as the training script, and I also recommend checking out [CCRcmcpe/scal-sdt](https://github.com/CCRcmcpe/scal-sdt).

@@ -85,7 +103,7 @@

...better than some artist-style DreamBooth models that are trained with only a few hundred images or even fewer. I also oppose changing style by merging models, since you could apply a different style by training with proper captions and prompting.

Besides, some of the images in my dataset have the artist name in the caption; however, some artist names will be misinterpreted by CLIP when tokenizing. For example, *as109* will be tokenized as `[as, 1, 0, 9]` and *fuzichoco* will become `[fu, z, ic, hoco]`. Romanized Japanese suffers from this problem a lot, and I don't have a good solution to fix it other than changing the artist name in the caption, which is something I don't think anyone would like to do.
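
You can reproduce these splits with the tokenizer of the CLIP text encoder; a minimal sketch with `transformers`, where the checkpoint name assumes the stock text encoder used by Stable Diffusion 1.x models:

```python
# Minimal sketch: inspect how CLIP's BPE tokenizer splits artist tags.
# "openai/clip-vit-large-patch14" is the text encoder SD 1.x models use.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

for tag in ["as109", "fuzichoco"]:
    print(tag, "->", tokenizer.tokenize(tag))
# A multi-fragment split means CLIP never sees the artist name as a
# single concept, which is why such captions can be misinterpreted.
```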

@@ -101,6 +119,10 @@

Here are some **cherry-picked** samples.

I was using [xformers](https://github.com/facebookresearch/xformers) when generating these samples, and it might yield slightly different results even with the same seed (welcome to the non-deterministic field). "`Upscale latent space image when doing hires. fix`" was enabled as well.



@@ -154,7 +176,9 @@

The EMA weight is not included, and it's fp16. If you want to continue training, use [`bp_1024_e10_ema.ckpt`](bp_1024_e10_ema.ckpt), which is the EMA UNet weight in fp32 precision.
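
If you want to sanity-check the file before wiring it into a trainer, a quick inspection might look like this; the `state_dict` key is a guess based on how SD-style `.ckpt` files are commonly laid out:

```python
# Quick sanity check of the EMA checkpoint; the "state_dict" key is an
# assumption based on common SD checkpoint layouts.
import torch

ckpt = torch.load("bp_1024_e10_ema.ckpt", map_location="cpu", weights_only=False)
state = ckpt.get("state_dict", ckpt)
first = next(iter(state.values()))
print(len(state), "tensors; first dtype:", first.dtype)  # expect torch.float32
```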

For better performance, it is strongly recommended to use Clip skip (CLIP stop at last layers) 2. It's also recommended to turn on "`Upscale latent space image when doing hires. fix`" in the settings of [AUTOMATIC1111/stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui), which adds intricate details when using `Highres. fix`.
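
Outside the webui, recent `diffusers` releases expose the same idea through a `clip_skip` call argument. A minimal sketch follows; the checkpoint path and prompt are placeholders, and note that the numbering conventions differ (per the diffusers docstring, `clip_skip=1` uses the penultimate CLIP layer, which is what the webui calls Clip skip 2), so double-check against the docs of your version:

```python
# Sketch of applying the webui's "Clip skip 2" (penultimate CLIP layer)
# via diffusers. Path and prompt are placeholders; use a full model
# checkpoint here, not the EMA-UNet-only file mentioned above.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_single_file(
    "path/to/bpmodel.ckpt", torch_dtype=torch.float16
).to("cuda")

image = pipe("masterpiece, best quality, 1girl", clip_skip=1).images[0]
```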

## About the Model Name