recoilme committed
Commit 6dab279 · 1 Parent(s): 8588fb2
README.md CHANGED
@@ -7,6 +7,19 @@ pipeline_tag: text-to-image

*XS Size, Excess Quality*


Train status, in progress: [wandb](https://wandb.ai/recoilme/unet)

@@ -107,7 +120,7 @@ if __name__ == "__main__":
scheduler = DDPMScheduler.from_pretrained(pipeid, subfolder="scheduler")

- height, width = 384, 384
num_inference_steps = 40
output_folder, project_name = "samples", "sdxs"
latents = generate_latents(
@@ -126,28 +139,6 @@ if __name__ == "__main__":
print("Images generated and saved to:", output_folder)
```
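As a quick sanity check for the snippet above: with the 16-channel, 8x-downsampling AuraDiffusion VAE this model uses, latents are generated at 1/8 of the pixel resolution. A minimal sketch — the channel count and downsampling factor come from the model card's VAE description, not from the actual `generate_latents` implementation:

```python
def latent_shape(height, width, latent_channels=16, vae_scale=8):
    # Shape of the initial noise latents for one image, assuming the
    # 16-channel VAE with 8x spatial downsampling described in the card.
    assert height % vae_scale == 0 and width % vae_scale == 0
    return (latent_channels, height // vae_scale, width // vae_scale)

print(latent_shape(576, 384))  # (16, 72, 48)
```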

- ## Introduction
- *Fast, Lightweight & Multilingual Diffusion for Everyone*
-
- We are **AiArtLab**, a small team of enthusiasts with a limited budget. Our goal is to create a compact, fast model that can be fully trained on consumer graphics cards (a full training cycle, not LoRA). We chose U-Net for its ability to handle small datasets efficiently and train quickly even on a 16GB GPU (e.g., RTX 4080). Our budget was limited to a few thousand dollars, far less than competitors like SDXL (tens of millions), so we set out to build a small but efficient model, similar to SD1.5 but built for 2025.
-
- ## Encoder Architecture (Text and Images)
- We experimented with various encoders and concluded that large models like LLaMA or T5 XXL are unnecessary for high-quality generation. We did, however, need an encoder that understands the context of a query, favoring "prompt understanding" over "prompt following." We chose the multilingual encoder Mexma-SigLIP, which supports 80 languages and processes sentences rather than individual tokens. Mexma accepts up to 512 tokens, creating a large matrix that slows down training, so we used a pooling layer to reduce the 512x1152 matrix to a plain 1x1152 vector. Specifically, we passed it through a linear text projector to achieve compatibility with SigLIP embeddings. This allowed us to align text embeddings with image embeddings, potentially leading to a unified multimodal model: image embeddings can be mixed with textual descriptions in queries, and the model can even be trained without text descriptions, using only images. This should simplify training on videos, where annotation is challenging, and enable more consistent, seamless video generation by feeding in embeddings of previous frames with decay. In the future, we aim to expand the model to 3D/video generation.
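The pooling-and-projection step can be sketched as below. Only the shapes (512 tokens of 1152 dims in, a single 1152-d vector out) come from the text; mean pooling and the random projector weights are illustrative assumptions:

```python
import numpy as np

TOKENS, DIM = 512, 1152  # Mexma: up to 512 tokens, 1152 dims each

def pool_and_project(token_matrix, w, b):
    # Collapse the 512x1152 token matrix to a single 1152-d vector, then
    # apply a linear projector to land in SigLIP embedding space.
    # Mean pooling and the weights here are illustrative assumptions.
    pooled = token_matrix.mean(axis=0)  # (1152,)
    return pooled @ w + b               # (1152,)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((TOKENS, DIM))
w, b = 0.01 * rng.standard_normal((DIM, DIM)), np.zeros(DIM)
text_embedding = pool_and_project(tokens, w, b)
print(text_embedding.shape)  # (1152,)
```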
-
- ## VAE Architecture
- We chose an unconventional 8x-downsampling, 16-channel AuraDiffusion VAE, which preserves details, text, and anatomy without the 'haze' characteristic of SD3/Flux.
-
- ## Training Process
- ### Optimizer
- We tested several optimizers (AdamW, Lion, Optimi-AdamW, Adafactor, and AdamW-8bit) and chose AdamW-8bit. Optimi-AdamW showed the smoothest gradient-decay curve, while AdamW-8bit behaves more chaotically; however, its smaller optimizer-state footprint allows larger batch sizes, maximizing training speed on low-cost GPUs (we trained on 4xA6000 and 5xL40S).
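The trade-off behind AdamW-8bit is optimizer-state memory: standard AdamW keeps two fp32 moment buffers per parameter (8 bytes), while the 8-bit variant stores them quantized at roughly 2 bytes per parameter. A back-of-the-envelope sketch for a 1.7B-parameter U-Net (the byte counts are the commonly cited bitsandbytes figures, not measurements):

```python
def optimizer_state_gib(n_params, bytes_per_param):
    # Memory held by the optimizer's moment buffers alone (weights and
    # gradients are extra and identical for both optimizers).
    return n_params * bytes_per_param / 1024**3

n = 1_700_000_000                       # ~1.7B parameters
adamw_fp32 = optimizer_state_gib(n, 8)  # two fp32 moments: 4 + 4 bytes
adamw_8bit = optimizer_state_gib(n, 2)  # two 8-bit moments: 1 + 1 bytes
print(f"fp32 AdamW: {adamw_fp32:.1f} GiB, AdamW-8bit: {adamw_8bit:.1f} GiB")
```

The ~10 GiB saved is what frees room for larger batches on a single consumer GPU.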
-
- ### Dataset
- We trained the model on approximately 1 million images: 60 epochs on ImageNet at 256 resolution (wasted time, due to low-quality annotations) and 8 epochs on CaptionEmporium/midjourney-niji-1m-llavanext, plus realistic photos and anime/art at 576 resolution. For annotation we used human-written prompts, the prompts provided by Caption Emporium, SmilingWolf's WD-Tagger, and Moondream2, varying prompt length and composition so the model learns different prompting styles. The dataset is extremely small, so the model misses many entities and struggles with unseen concepts like 'a goose on a bicycle.' The dataset also included many waifu-style images, as we were more interested in how well the model learns human anatomy than in 'astronaut on horseback' skills. While most descriptions were in English, our tests indicate the model is multilingual.
-
- ## Limitations
- - Limited concept coverage due to the extremely small dataset.
- - The Image2Image functionality needs further training (we reduced the SigLIP portion to 5% to focus on text-to-image training).
-
## Acknowledgments
- **[Stan](https://t.me/Stangle)** — Key investor. Primary financial support - thank you for believing in us when others called it madness.
- **Captainsaturnus** — Material support.
 

*XS Size, Excess Quality*

+ At AiArtLab, we aim to develop a compact (1.7B parameters) and fast (~3 s/image) model that can be trained on consumer-grade graphics cards, all while operating on a limited budget.
+
+ We have chosen the multilingual encoder Mexma-SigLIP, which supports 80 languages and processes entire sentences rather than individual tokens. Our chosen VAE architecture, AuraDiffusion, preserves details and anatomy without the blurring seen in other models.
+
+ For training, we use AdamW-8bit, which allows for larger batch sizes and accelerates training on cost-effective GPUs. The model has been trained on approximately one million images of various resolutions and styles, including anime and realistic photos, annotated with a combination of manual and automated methods.
+
+ However, the model does have some limitations:
+ - Limited concept coverage due to the small dataset size.
+ - The Image2Image functionality requires further training.
+

Train status, in progress: [wandb](https://wandb.ai/recoilme/unet)

 
scheduler = DDPMScheduler.from_pretrained(pipeid, subfolder="scheduler")

+ height, width = 576, 384
num_inference_steps = 40
output_folder, project_name = "samples", "sdxs"
latents = generate_latents(
 
print("Images generated and saved to:", output_folder)
```

## Acknowledgments
- **[Stan](https://t.me/Stangle)** — Key investor. Primary financial support - thank you for believing in us when others called it madness.
- **Captainsaturnus** — Material support.
result_grid.jpg CHANGED

Git LFS Details

  • SHA256: 4cbf1bbbe931782fa4044a599248be5620b1e1f1f87678020b17f592a5d2ac68
  • Pointer size: 132 Bytes
  • Size of remote file: 6.77 MB

Git LFS Details

  • SHA256: dfa02a96dd70dfac26f99425f5e8ce90cd5bacf317e7d5c5eb0d721e73a6bb39
  • Pointer size: 132 Bytes
  • Size of remote file: 5.79 MB
samples/unet_320x576_0.jpg CHANGED

Git LFS Details

  • SHA256: 8519c1f9ee382b980d100f15568bc55c3814584838000fab77482e3cac7827fd
  • Pointer size: 130 Bytes
  • Size of remote file: 69.7 kB

Git LFS Details

  • SHA256: bf1b44cef2ed92ef96c662cd3cb1fdffa41e642a03a93fce325c4cd30e3ba18b
  • Pointer size: 130 Bytes
  • Size of remote file: 70.6 kB
samples/unet_384x576_0.jpg CHANGED

Git LFS Details

  • SHA256: ae03322c13f29d13fab00ecebdd6d8af1675547c0c5fda7def4bc905eea85690
  • Pointer size: 130 Bytes
  • Size of remote file: 60.3 kB

Git LFS Details

  • SHA256: d7891a8ffa8dc4a63948b8669f50d4814bb7bbd1d604dc3fb1f6b0e1f179c8fc
  • Pointer size: 130 Bytes
  • Size of remote file: 53.7 kB
samples/unet_448x576_0.jpg CHANGED

Git LFS Details

  • SHA256: e631ea48ddc1662442f1eb3d566c794fdfc201200a93ec75427c5c0a5c120100
  • Pointer size: 130 Bytes
  • Size of remote file: 76.4 kB

Git LFS Details

  • SHA256: 88ecc17c1280ba71f7e85d2cb51a4708f70161a994398c9cbdaa2e0ce0a3818b
  • Pointer size: 130 Bytes
  • Size of remote file: 76.3 kB
samples/unet_512x576_0.jpg CHANGED

Git LFS Details

  • SHA256: dd614dcc9ad0f95ea34183935843429c47e6e2220e404adcf20d5b2dc74ef656
  • Pointer size: 131 Bytes
  • Size of remote file: 126 kB

Git LFS Details

  • SHA256: e8b1c406f5bfa040a1516ab0e22fe30443dd252b86f96eb79fe9ae82c9c7b178
  • Pointer size: 131 Bytes
  • Size of remote file: 125 kB
samples/unet_576x320_0.jpg CHANGED

Git LFS Details

  • SHA256: 6a6733b598db20abe4bc83eb8511ccb46798b13b1bcccd16fc428bac9ce1fdee
  • Pointer size: 131 Bytes
  • Size of remote file: 124 kB

Git LFS Details

  • SHA256: 14655874456335a2cfd9ab4a9d385e3bd437cb5d1a70425d0ee00f810fd56b7a
  • Pointer size: 130 Bytes
  • Size of remote file: 98.8 kB
samples/unet_576x384_0.jpg CHANGED

Git LFS Details

  • SHA256: ef1e3adef8f995d535fb11683bbcd197d6cd4d1b07008278c8a07d5bb90970f3
  • Pointer size: 130 Bytes
  • Size of remote file: 78.5 kB

Git LFS Details

  • SHA256: 90294300da91d4fa91d9ea8dbb01015dacc54e33f14cf8d52f5ef26aa01b137f
  • Pointer size: 130 Bytes
  • Size of remote file: 65.1 kB
samples/unet_576x448_0.jpg CHANGED

Git LFS Details

  • SHA256: 31d5fb1cce43c0dcc3a6b54be300918d425fadc5ea55fd2f4d8785470e3cf0ba
  • Pointer size: 130 Bytes
  • Size of remote file: 82.9 kB

Git LFS Details

  • SHA256: bb2f5877d2ad8b24d2a4afb7a873885f86e6475b39d1fba728df6a98f0842db8
  • Pointer size: 131 Bytes
  • Size of remote file: 110 kB
samples/unet_576x512_0.jpg CHANGED

Git LFS Details

  • SHA256: 60a982caa2de7e280aa147ce105a14a5c47b14beb4900686d66dce3d9119e980
  • Pointer size: 131 Bytes
  • Size of remote file: 124 kB

Git LFS Details

  • SHA256: 3c8658248dca6e4f9c9c30539150b777b9da85c3d109f3c5b5f5c53bcda14e69
  • Pointer size: 131 Bytes
  • Size of remote file: 111 kB
samples/unet_576x576_0.jpg CHANGED

Git LFS Details

  • SHA256: 79ce66502a483f44a2660ebc8a9ca05e26d42079091927c76e2ccd8e2a70f387
  • Pointer size: 131 Bytes
  • Size of remote file: 142 kB

Git LFS Details

  • SHA256: 5c80ac843afd21d6ec5cf33cc840c0e36a1af1685c96333fa8790d48f89d00f1
  • Pointer size: 131 Bytes
  • Size of remote file: 137 kB
src/cherrypick.ipynb CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:0ad16f012308a461f220b8b9330c5606f5d1d25a8fac8cd64f79d29358079d98
- size 15687
+ oid sha256:c3bc29d6c8ede5a64c8ae9b2f8f824d4edfe91209bc6a6363a43ab66ec01d68f
+ size 44788
test.ipynb CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:92d6a1b7034bd2de712560bf42723551e6884a9f3c379855329e50dee422c889
- size 1401708
+ oid sha256:d35e0b0f890df0408db1efd1fe1daed6b8b27b3021e90fa10ded72d702f26677
+ size 4885422
train.py CHANGED
@@ -28,9 +28,9 @@ import torch.nn.functional as F
ds_path = "datasets/576"
project = "unet"
batch_size = 50
- base_learning_rate = 5e-5
min_learning_rate = 1e-5
- num_epochs = 19
# samples/save per epoch
sample_interval_share = 5
use_wandb = True
@@ -50,7 +50,7 @@ torch.backends.cuda.enable_mem_efficient_sdp(True)
dtype = torch.float32
save_barrier = 1.03
dispersive_temperature=0.5
- dispersive_weight=0.1
percentile_clipping = 97 # Lion
steps_offset = 1 # Scheduler
limit = 0
@@ -628,7 +628,7 @@ def create_optimizer(name, params):
    if name == "adam8bit":
        import bitsandbytes as bnb
        return bnb.optim.AdamW8bit(
-            params, lr=base_learning_rate, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01
        )
    elif name == "adam":
        return torch.optim.AdamW(
 
ds_path = "datasets/576"
project = "unet"
batch_size = 50
+ base_learning_rate = 2e-5
min_learning_rate = 1e-5
+ num_epochs = 5
# samples/save per epoch
sample_interval_share = 5
use_wandb = True
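This commit drops `base_learning_rate` from 5e-5 to 2e-5 while keeping `min_learning_rate` at 1e-5. Assuming a cosine decay between the two (the actual scheduler in train.py is not visible in this diff), the learning rate over training would look like:

```python
import math

base_learning_rate = 2e-5  # values from this commit
min_learning_rate = 1e-5

def lr_at(step, total_steps):
    # Cosine decay from base_learning_rate down to min_learning_rate.
    # Assumed schedule shape; train.py's actual scheduler is not in the diff.
    progress = step / total_steps
    return min_learning_rate + 0.5 * (base_learning_rate - min_learning_rate) * (
        1.0 + math.cos(math.pi * progress)
    )

print(lr_at(0, 1000))     # starts near 2e-05
print(lr_at(500, 1000))   # midpoint, near 1.5e-05
print(lr_at(1000, 1000))  # ends near 1e-05
```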
 
dtype = torch.float32
save_barrier = 1.03
dispersive_temperature=0.5
+ dispersive_weight=0.05
percentile_clipping = 97 # Lion
steps_offset = 1 # Scheduler
limit = 0
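`dispersive_weight` drops from 0.1 to 0.05, halving the contribution of the dispersive regularizer to the total loss. A minimal sketch of an InfoNCE-style dispersive loss using the `dispersive_temperature` above; the exact form in train.py is not shown in this diff, so the loss below is an assumption for illustration:

```python
import numpy as np

dispersive_temperature = 0.5
dispersive_weight = 0.05  # multiplies this term before adding it to the main loss

def dispersive_loss(z, tau=dispersive_temperature):
    # Push intermediate representations apart: the loss decreases as
    # pairwise squared distances between samples in the batch grow.
    # z: (batch, dim) array of flattened features. Assumed form, not
    # the actual train.py implementation.
    sq_dists = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
    off_diag = sq_dists[~np.eye(z.shape[0], dtype=bool)]
    return float(np.log(np.mean(np.exp(-off_diag / tau))))

z = np.random.default_rng(0).standard_normal((8, 32))
print(dispersive_loss(z))  # negative scalar; more negative = more spread out
```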
 
    if name == "adam8bit":
        import bitsandbytes as bnb
        return bnb.optim.AdamW8bit(
+            params, lr=base_learning_rate, betas=(0.9, 0.995), eps=1e-7, weight_decay=0.001
        )
    elif name == "adam":
        return torch.optim.AdamW(
unet/diffusion_pytorch_model.fp16.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:ba82fba0a530fa8d0a310e5946b98c08bb5e482aa26c65a74fed55e59b27a99c
- size 7014306128
+ oid sha256:5bb9c77df1219662fc14575e71304875d9a98861cffe930e83483e4274a34238
+ size 3507231768
unet/diffusion_pytorch_model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:faa8982e546e7c861e550b58db204dfbf3a67688ed0b7b06b10fd0c58a596955
+ oid sha256:d16cb10d5304bc2214ed9559993089a98cd6696006e9ce64e77f3ac165ce2ecf
 size 7014306128