512ep20
Browse files
- README.md +14 -23
- result_grid.jpg +2 -2
- samples/unet_320x576_0.jpg +2 -2
- samples/unet_384x576_0.jpg +2 -2
- samples/unet_448x576_0.jpg +2 -2
- samples/unet_512x576_0.jpg +2 -2
- samples/unet_576x320_0.jpg +2 -2
- samples/unet_576x384_0.jpg +2 -2
- samples/unet_576x448_0.jpg +2 -2
- samples/unet_576x512_0.jpg +2 -2
- samples/unet_576x576_0.jpg +2 -2
- src/cherrypick.ipynb +2 -2
- test.ipynb +2 -2
- train.py +4 -4
- unet/diffusion_pytorch_model.fp16.safetensors +2 -2
- unet/diffusion_pytorch_model.safetensors +1 -1
README.md
CHANGED

@@ -7,6 +7,19 @@ pipeline_tag: text-to-image

 *XS Size, Excess Quality*

+At AiArtLab, we aim to develop a compact (1.7B parameters) and fast (about 3 s per image) model that can be trained on consumer-grade graphics cards, all while operating on a limited budget.
+
+
+We have chosen the multilingual encoder Mexma-SigLIP, which supports 80 languages and processes entire sentences rather than individual tokens. Our chosen VAE architecture, AuraDiffusion, preserves details and anatomy without the blurring effects seen in other models.
+
+
+For training, we use AdamW-8bit, which allows for larger batch sizes and accelerates training on cost-effective GPUs. Our model has been trained on approximately one million images of various resolutions and styles, including anime and realistic photos. We employed a variety of annotation methods, combining manual and automated approaches.
+
+
+However, our model does have some limitations:
+- Limited concept coverage due to the small dataset size.
+- The Image2Image functionality requires further training.
+

 Train status, in progress: [wandb](https://wandb.ai/recoilme/unet)

@@ -107,7 +120,7 @@ if __name__ == "__main__":
     scheduler = DDPMScheduler.from_pretrained(pipeid, subfolder="scheduler")


-    height, width =
+    height, width = 576, 384
     num_inference_steps = 40
     output_folder, project_name = "samples", "sdxs"
     latents = generate_latents(
@@ -126,28 +139,6 @@ if __name__ == "__main__":
     print("Images generated and saved to:", output_folder)
 ```

-## Introduction
-*Fast, Lightweight & Multilingual Diffusion for Everyone*
-
-We are **AiArtLab**, a small team of enthusiasts with a limited budget. Our goal is to create a compact, fast model that can be trained on consumer graphics cards (a full training cycle, not LoRA). We chose U-Net for its ability to handle small datasets efficiently and to train quickly even on a 16GB GPU (e.g., RTX 4080). Our budget was limited to a few thousand dollars, far less than that of competitors like SDXL (tens of millions), so we set out to build a small but efficient model, similar to SD1.5 but made for 2025.
-
-## Encoder Architecture (Text and Images)
-We experimented with various encoders and concluded that large models like LLaMA or T5 XXL are unnecessary for high-quality generation. However, we needed an encoder that understands the context of the query, favoring "prompt understanding" over "prompt following." We chose the multilingual encoder Mexma-SigLIP, which supports 80 languages and processes whole sentences rather than individual tokens. Mexma accepts up to 512 tokens, producing a large matrix that slows down training, so we added a pooling layer that collapses the 512x1152 token matrix into a single 1x1152 vector and passed it through a linear text projector for compatibility with SigLIP embeddings. This lets us synchronize text embeddings with images, potentially leading to a unified multimodal model, and makes it possible to mix image embeddings with textual descriptions in queries. Moreover, the model can be trained without text descriptions, using only images. This should simplify training on video, where annotation is challenging, and enable more consistent, seamless video generation by feeding in embeddings of previous frames with decay. In the future, we aim to extend the model to 3D/video generation.
-
-## VAE Architecture
-We chose an unconventional 16-channel AuraDiffusion VAE with 8x downsampling, which preserves details, text, and anatomy without the 'haze' characteristic of SD3/Flux.
-
-## Training Process
-### Optimizer
-We tested several optimizers (AdamW, Lion, Optimi-AdamW, Adafactor, and AdamW-8bit) and chose AdamW-8bit. Optimi-AdamW showed the smoothest gradient-decay curve, while AdamW-8bit behaves more chaotically; however, its smaller memory footprint allows larger batch sizes, maximizing training speed on low-cost GPUs (we trained on 4xA6000 and 5xL40s).
-
-### Dataset
-We trained the model on approximately 1 million images: 60 epochs on ImageNet at 256 resolution (largely wasted because of low-quality annotations) and 8 epochs on CaptionEmporium/midjourney-niji-1m-llavanext plus realistic photos and anime/art at 576 resolution. For annotation we used human-written prompts, prompts provided by CaptionEmporium, SmilingWolf's WD-Tagger, and Moondream2, varying prompt length and composition so the model learns different prompting styles. The dataset is extremely small, so the model misses many entities and struggles with unseen concepts like 'a goose on a bicycle.' It also included many waifu-style images, since we cared more about how well the model learns human anatomy than about its 'astronaut on horseback' skills. While most descriptions were in English, our tests indicate the model is multilingual.
-
-## Limitations
-- Limited concept coverage due to the extremely small dataset.
-- The Image2Image functionality needs further training (we reduced the SigLIP portion to 5% to focus on text-to-image training).
-
 ## Acknowledgments
 - **[Stan](https://t.me/Stangle)** — Key investor. Primary financial support - thank you for believing in us when others called it madness.
 - **Captainsaturnus** — Material support.
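The encoder section removed above compresses the whole text path into one sentence, so here is a minimal sketch of the pooling-plus-projection step it describes. Only the 512x1152 and 1x1152 shapes come from the README; the mean pooling and the single-layer projector are illustrative assumptions, not the repo's actual modules.

```python
# Hedged sketch: pool Mexma's 512x1152 token matrix into one 1152-d
# vector, then project it toward the SigLIP embedding space. Mean
# pooling and a single nn.Linear are assumptions; the repo's real
# pooling/projector may differ.
import torch
import torch.nn as nn

class TextProjector(nn.Module):
    def __init__(self, dim: int = 1152):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # hypothetical text projector

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, 512, 1152) from the Mexma encoder
        pooled = token_states.mean(dim=1)  # (batch, 1152)
        return self.proj(pooled)           # (batch, 1152)

embedding = TextProjector()(torch.randn(1, 512, 1152))  # -> (1, 1152)
```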
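Similarly, the removed VAE section names the design but not how to verify it. A quick check with diffusers, assuming the 16-channel AuraDiffusion VAE is published in diffusers format under the repo id used here (the id is an assumption, not taken from this commit):

```python
# Load a 16-channel VAE and confirm its channel count and 8x spatial
# compression. The repo id below is an assumption.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("AuraDiffusion/16ch-vae")
print(vae.config.latent_channels)  # expected: 16

with torch.no_grad():
    image = torch.randn(1, 3, 576, 384)
    latents = vae.encode(image).latent_dist.sample()
print(latents.shape)  # expected: (1, 16, 72, 48), i.e. 8x downsampling
```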
result_grid.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_320x576_0.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_384x576_0.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_448x576_0.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_512x576_0.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_576x320_0.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_576x384_0.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_576x448_0.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_576x512_0.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_576x576_0.jpg
CHANGED (binary image tracked with Git LFS)
src/cherrypick.ipynb
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:c3bc29d6c8ede5a64c8ae9b2f8f824d4edfe91209bc6a6363a43ab66ec01d68f
+size 44788
test.ipynb
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:d35e0b0f890df0408db1efd1fe1daed6b8b27b3021e90fa10ded72d702f26677
+size 4885422
train.py
CHANGED
@@ -28,9 +28,9 @@ import torch.nn.functional as F
 ds_path = "datasets/576"
 project = "unet"
 batch_size = 50
-base_learning_rate =
+base_learning_rate = 2e-5
 min_learning_rate = 1e-5
-num_epochs =
+num_epochs = 5
 # samples/save per epoch
 sample_interval_share = 5
 use_wandb = True
@@ -50,7 +50,7 @@ torch.backends.cuda.enable_mem_efficient_sdp(True)
 dtype = torch.float32
 save_barrier = 1.03
 dispersive_temperature=0.5
-dispersive_weight=0.
+dispersive_weight=0.05
 percentile_clipping = 97 # Lion
 steps_offset = 1 # Scheduler
 limit = 0
@@ -628,7 +628,7 @@ def create_optimizer(name, params):
     if name == "adam8bit":
         import bitsandbytes as bnb
         return bnb.optim.AdamW8bit(
-            params, lr=base_learning_rate, betas=(0.9, 0.
+            params, lr=base_learning_rate, betas=(0.9, 0.995), eps=1e-7, weight_decay=0.001
         )
     elif name == "adam":
         return torch.optim.AdamW(
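As a standalone usage note for the optimizer hunk above: the committed hyperparameters drop into bitsandbytes as sketched below. The Linear layer is a stand-in used only to make the snippet self-contained, not the repo's U-Net.

```python
# Minimal sketch: this commit's AdamW-8bit settings on a toy model.
# bitsandbytes' 8-bit optimizers expect CUDA parameters.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(16, 16).cuda()  # stand-in for the U-Net
optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=2e-5,             # base_learning_rate set in this commit
    betas=(0.9, 0.995),  # beta2 below the usual 0.999 default
    eps=1e-7,
    weight_decay=0.001,
)
```

Storing optimizer states in 8 bits needs roughly a quarter of the memory of 32-bit AdamW states, which is what makes room for the batch_size of 50 set above.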
unet/diffusion_pytorch_model.fp16.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:5bb9c77df1219662fc14575e71304875d9a98861cffe930e83483e4274a34238
+size 3507231768
unet/diffusion_pytorch_model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:d16cb10d5304bc2214ed9559993089a98cd6696006e9ce64e77f3ac165ce2ecf
 size 7014306128