update readme for epoch 19 ckpt

README.md

A latent diffusion model (LDM) geared toward illustration, style composability, and sample variety. It addresses a few deficiencies of the SDXL base model and feels more like an SD 1.x with better resolution and much better prompt adherence.

* Architecture: SD XL (base model is v1.0)
* Training procedure: U-Net fully unfrozen, all-parameter continued pretraining at LR between 3e-8 and 3e-7 for 20,110,000 steps (as of epoch 19, batch size 4). See below for more details.
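
For concreteness, here is a minimal sketch of what one such continued-pretrain step could look like with `diffusers`, assuming an SDXL-style U-Net. `encode_batch` is a hypothetical stand-in for the VAE and text-encoder plumbing, and the learning rate shown is just one value inside the quoted range, not the actual schedule.

```python
import torch
from diffusers import DDPMScheduler, UNet2DConditionModel

repo = "stabilityai/stable-diffusion-xl-base-1.0"
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
unet.requires_grad_(True)  # fully unfrozen: every parameter trains
unet.train()

noise_scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-7)  # within 3e-8..3e-7

def training_step(batch):
    # encode_batch (hypothetical) returns VAE latents plus the SDXL text
    # conditioning: prompt embeddings and added_cond_kwargs
    # ({"text_embeds": ..., "time_ids": ...}).
    latents, prompt_embeds, added_cond = encode_batch(batch)
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    pred = unet(noisy_latents, timesteps,
                encoder_hidden_states=prompt_embeds,
                added_cond_kwargs=added_cond).sample
    loss = torch.nn.functional.mse_loss(pred, noise)  # epsilon-prediction loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```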

Trained on the Puzzle Box dataset, a large collection of permissively licensed images from the public Internet (or generated by previous Puzzle Box models). Each image has from 3 to 24 different captions, which are used interchangeably during training. There are approximately 12 million images and 78 million captions in the dataset.
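
One plausible way to wire the interchangeable captions into a data loader is to draw one of an image's captions uniformly at random each time it is sampled; the record layout below is an assumption, not the actual dataset format.

```python
import random
from torch.utils.data import Dataset

class PuzzleBoxDataset(Dataset):
    """Each record (assumed layout): {"image_path": str, "captions": [3..24 strings]}."""

    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        # A different caption can be served on every pass, so no single
        # caption becomes the canonical description of the image.
        caption = random.choice(rec["captions"])
        return rec["image_path"], caption
```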

**Captioning:** About 1.4 million of the captions in the dataset are human-written. The remainder come from a variety of ML models, either vision transformers or classifiers. Models used in captioning the Puzzle Box dataset include: Qwen 2 VL 72b, BLIP 2 OPT-6.5B COCO, Llava 1.5, MiniCPM 2.6, bakllava, Moondream, DeepSeek Janus 7b, Mistral Pixtral 12b, CapPa, Gemma 3 27b, JoyCaption, NVIDIA Nanotron v2 VL 12b, Qwen 3 VL 32b, CLIP Interrogator 0.6.0, and wd-eva02-large-tagger-v3. DeepSeek v3 is used to create detailed consensus captions from the others. Only open-weights models were used.

In addition to the human/machine-generated main captions, there are a large number of additional human-provided tags referring to style ("pointillism", "caricature", "Winsor McKay"), genre ("pop art", "advertising", "pixel art"), source ("wikiart", "library of congress", "pexels"), or image content ("fluid expression", "pin-up", "squash and stretch").

**Aesthetic labelling:** All images in the Puzzle Box dataset have been scored by multiple IQA models. There are also over 700,000 human paired image preferences. This data is combined to label especially high- or low-aesthetic images. Aesthetic breakpoints are chosen on a per-style/genre tag basis (the threshold for "pixel art" is different from the one for "classical oil painting").
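
A sketch of per-tag breakpoints, assuming each image already carries a single combined aesthetic score (the IQA/preference fusion itself is not shown) and that breakpoints are plain percentiles, which is an assumption.

```python
import numpy as np

def aesthetic_breakpoints(scores_by_tag, lo_pct=10, hi_pct=90):
    """scores_by_tag: {"pixel art": [0.31, 0.58, ...], ...} -> per-tag (lo, hi)."""
    cuts = {}
    for tag, scores in scores_by_tag.items():
        s = np.asarray(scores, dtype=np.float64)
        # Each style/genre tag gets its own thresholds, so "pixel art"
        # is judged against other pixel art, not against oil paintings.
        cuts[tag] = (np.percentile(s, lo_pct), np.percentile(s, hi_pct))
    return cuts

def aesthetic_label(score, tag, cuts):
    lo, hi = cuts[tag]
    if score <= lo:
        return "low aesthetic"
    if score >= hi:
        return "high aesthetic"
    return None  # mid-range images get no aesthetic label
```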

Epoch length was determined by the original size of the training set, and the best checkpoint that emerges after model soup experimentation is released.
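
A minimal model-soup sketch: uniform averaging of a few late checkpoints into one candidate, which is then compared against the individual checkpoints. The file names are illustrative, and the actual experimentation (which checkpoints to soup, greedy vs. uniform) is not shown.

```python
import torch

def uniform_soup(paths):
    """Average the weights of several checkpoints into one state dict."""
    soup = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if soup is None:
            soup = {k: v.float().clone() for k, v in state.items()}
        else:
            for k in soup:
                soup[k] += state[k].float()
    return {k: v / len(paths) for k, v in soup.items()}

# Hypothetical late-epoch checkpoints entering the soup:
candidate = uniform_soup(["step_20050k.pt", "step_20080k.pt", "step_20110k.pt"])
```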

**Safety notes:** As with the base SD XL model, deployment in a production environment may require filters for undesired/inappropriate content. Classifiers on both the input prompt and the output image are suggested.
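
A schematic of that two-stage gate; both classifiers are placeholders to be swapped for whatever prompt and image filters a deployment uses.

```python
def safe_generate(prompt, pipe, prompt_is_unsafe, image_is_unsafe):
    # Stage 1: screen the prompt before spending any compute.
    if prompt_is_unsafe(prompt):
        raise ValueError("prompt rejected by input filter")
    image = pipe(prompt).images[0]  # diffusers-style pipeline call
    # Stage 2: screen the generated image before returning it.
    if image_is_unsafe(image):
        raise ValueError("image rejected by output filter")
    return image
```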

**Other nifty tricks used:** Some less common techniques used in training Puzzle Box XL include:

- *Data augmentation/conditional dropout*: taking inspiration from GAN-space, transformations are applied (with some probability) to both the images and their labels during training. For example, an image might be converted to grayscale, rotated, or blurred. A booru-style caption will have its order of tags randomized; an English caption might have its sentences re-ordered. Labels may also be dropped out, wholly or partially. This helps the model generalize and avoid overfitting (see the sketch below).
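
A rough sketch of that paired image/label augmentation with conditional dropout; every probability, and the way the caption is kept consistent with an image edit, is an illustrative assumption.

```python
import random
from PIL import Image, ImageFilter, ImageOps

def augment(image: Image.Image, caption: str, tags: list[str]):
    if random.random() < 0.05:
        image = ImageOps.grayscale(image).convert("RGB")
        caption += ", grayscale"          # keep the label consistent
    if random.random() < 0.05:
        image = image.rotate(random.choice([90, 180, 270]), expand=True)
    if random.random() < 0.05:
        image = image.filter(ImageFilter.GaussianBlur(radius=2))
        caption += ", blurry"
    random.shuffle(tags)                  # booru tag order randomized
    if random.random() < 0.25:            # re-order English sentences
        sentences = caption.split(". ")
        random.shuffle(sentences)
        caption = ". ".join(sentences)
    r = random.random()
    if r < 0.05:                          # conditional dropout: whole label
        caption, tags = "", []
    elif r < 0.15:                        # ...or only part of it
        tags = tags[: len(tags) // 2]
    return image, caption, tags
```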

Model checkpoints currently available:

- from epoch 19, **20110k** training steps, 28 November 2025
- from epoch 18, **19300k** training steps, 03 October 2025
- from epoch 17, **18000k** training steps, 06 July 2025
- from epoch 16, **16950k** training steps, 05 May 2025

The U-Net attention layers are the layers most modified by the continued pretrain; comparing those layers to SD XL 1.0, the correlation is:

| Epoch | Date       | R-squared |
| ----- | ---------- | --------- |
| 19    | 2025-11-28 | 97.257%   |
| 18    | 2025-10-03 | 97.426%   |
| 17    | 2025-07-06 | 97.705%   |
| 16    | 2025-05-05 | 97.917%   |
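
One plausible way such a figure could be computed: flatten the attention-layer weights of both U-Nets into vectors and square their Pearson correlation. The `"attn"` key filter follows diffusers U-Net parameter naming; whether the card's numbers are produced exactly this way is an assumption.

```python
import torch

def attn_r_squared(state_a, state_b):
    """R^2 between the flattened attention weights of two U-Net state dicts."""
    keys = sorted(k for k in state_a if "attn" in k and k in state_b)
    a = torch.cat([state_a[k].flatten().float() for k in keys])
    b = torch.cat([state_b[k].flatten().float() for k in keys])
    r = torch.corrcoef(torch.stack([a, b]))[0, 1]  # Pearson correlation
    return (r ** 2).item()
```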