512ep20
Browse files
- README.md +14 -23
- result_grid.jpg +2 -2
- samples/unet_320x576_0.jpg +2 -2
- samples/unet_384x576_0.jpg +2 -2
- samples/unet_448x576_0.jpg +2 -2
- samples/unet_512x576_0.jpg +2 -2
- samples/unet_576x320_0.jpg +2 -2
- samples/unet_576x384_0.jpg +2 -2
- samples/unet_576x448_0.jpg +2 -2
- samples/unet_576x512_0.jpg +2 -2
- samples/unet_576x576_0.jpg +2 -2
- src/cherrypick.ipynb +2 -2
- test.ipynb +2 -2
- train.py +4 -4
- unet/diffusion_pytorch_model.fp16.safetensors +2 -2
- unet/diffusion_pytorch_model.safetensors +1 -1
README.md
CHANGED

@@ -7,6 +7,19 @@ pipeline_tag: text-to-image

 *XS Size, Excess Quality*

+At AiArtLab, we aim to develop a compact (1.7B parameters) and fast (about 3 s per image) model that can be trained on consumer-grade graphics cards, all while operating on a limited budget.
+
+
+We have chosen the multilingual encoder Mexma-SigLIP, which supports 80 languages and processes entire sentences rather than individual tokens. Our chosen VAE architecture, AuraDiffusion, preserves details and anatomy without the blurring effects seen in other models.
+
+
+For training, we use AdamW-8bit, which allows for larger batch sizes and accelerates training on cost-effective GPUs. Our model has been trained on approximately one million images of various resolutions and styles, including anime and realistic photos. We employed a variety of annotation methods, combining manual and automated approaches.
+
+
+However, our model does have some limitations:
+- Limited concept coverage due to the small dataset size.
+- The Image2Image functionality requires further training.
+

 Train status, in progress: [wandb](https://wandb.ai/recoilme/unet)

@@ -107,7 +120,7 @@ if __name__ == "__main__":
     scheduler = DDPMScheduler.from_pretrained(pipeid, subfolder="scheduler")


-    height, width =
+    height, width = 576, 384
     num_inference_steps = 40
     output_folder, project_name = "samples", "sdxs"
     latents = generate_latents(
@@ -126,28 +139,6 @@ if __name__ == "__main__":
     print("Images generated and saved to:", output_folder)
 ```

-## Introduction
-*Fast, Lightweight & Multilingual Diffusion for Everyone*
-
-We are **AiArtLab**, a small team of enthusiasts with a limited budget. Our goal is to create a compact, fast model that can be trained on consumer graphics cards (a full training cycle, not LoRA). We chose U-Net for its ability to handle small datasets efficiently and to train quickly even on a 16GB GPU (e.g., RTX 4080). Our budget was limited to a few thousand dollars, far less than that of competitors like SDXL (tens of millions), so we set out to build a small but efficient model, similar to SD1.5 but made for 2025.
-
-## Encoder Architecture (Text and Images)
-We experimented with various encoders and concluded that large models like LLaMA or T5 XXL are unnecessary for high-quality generation. However, we needed an encoder that understands the context of the query, favoring "prompt understanding" over "prompt following." We chose the multilingual encoder Mexma-SigLIP, which supports 80 languages and processes whole sentences rather than individual tokens. Mexma accepts up to 512 tokens, producing a large matrix that slows down training, so we added a pooling layer that collapses the 512x1152 token matrix into a single 1x1152 vector and passed it through a linear text projector for compatibility with SigLIP embeddings. This lets us synchronize text embeddings with images, potentially leading to a unified multimodal model, and makes it possible to mix image embeddings with textual descriptions in queries. Moreover, the model can be trained without text descriptions, using only images. This should simplify training on video, where annotation is challenging, and enable more consistent, seamless video generation by feeding in embeddings of previous frames with decay. In the future, we aim to extend the model to 3D/video generation.
-
-## VAE Architecture
-We chose an unconventional 16-channel AuraDiffusion VAE with 8x downsampling, which preserves details, text, and anatomy without the 'haze' characteristic of SD3/Flux.
-
-## Training Process
-### Optimizer
-We tested several optimizers (AdamW, Lion, Optimi-AdamW, Adafactor, and AdamW-8bit) and chose AdamW-8bit. Optimi-AdamW showed the smoothest gradient-decay curve, while AdamW-8bit behaves more chaotically; however, its smaller memory footprint allows larger batch sizes, maximizing training speed on low-cost GPUs (we trained on 4xA6000 and 5xL40s).
-
-### Dataset
-We trained the model on approximately 1 million images: 60 epochs on ImageNet at 256 resolution (largely wasted because of low-quality annotations) and 8 epochs on CaptionEmporium/midjourney-niji-1m-llavanext plus realistic photos and anime/art at 576 resolution. For annotation we used human-written prompts, prompts provided by CaptionEmporium, SmilingWolf's WD-Tagger, and Moondream2, varying prompt length and composition so the model learns different prompting styles. The dataset is extremely small, so the model misses many entities and struggles with unseen concepts like 'a goose on a bicycle.' It also included many waifu-style images, since we cared more about how well the model learns human anatomy than about its 'astronaut on horseback' skills. While most descriptions were in English, our tests indicate the model is multilingual.
-
-## Limitations
-- Limited concept coverage due to the extremely small dataset.
-- The Image2Image functionality needs further training (we reduced the SigLIP portion to 5% to focus on text-to-image training).
-
 ## Acknowledgments
 - **[Stan](https://t.me/Stangle)** — Key investor. Primary financial support - thank you for believing in us when others called it madness.
 - **Captainsaturnus** — Material support.
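The encoder section removed above compresses the whole text path into one sentence, so here is a minimal sketch of the pooling-plus-projection step it describes. Only the 512x1152 and 1x1152 shapes come from the README; the mean pooling and the single-layer projector are illustrative assumptions, not the repo's actual modules.

```python
# Hedged sketch: pool Mexma's 512x1152 token matrix into one 1152-d
# vector, then project it toward the SigLIP embedding space. Mean
# pooling and a single nn.Linear are assumptions; the repo's real
# pooling/projector may differ.
import torch
import torch.nn as nn

class TextProjector(nn.Module):
    def __init__(self, dim: int = 1152):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # hypothetical text projector

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, 512, 1152) from the Mexma encoder
        pooled = token_states.mean(dim=1)  # (batch, 1152)
        return self.proj(pooled)           # (batch, 1152)

embedding = TextProjector()(torch.randn(1, 512, 1152))  # -> (1, 1152)
```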
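Similarly, the removed VAE section names the design but not how to verify it. A quick check with diffusers, assuming the 16-channel AuraDiffusion VAE is published in diffusers format under the repo id used here (the id is an assumption, not taken from this commit):

```python
# Load a 16-channel VAE and confirm its channel count and 8x spatial
# compression. The repo id below is an assumption.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("AuraDiffusion/16ch-vae")
print(vae.config.latent_channels)  # expected: 16

with torch.no_grad():
    image = torch.randn(1, 3, 576, 384)
    latents = vae.encode(image).latent_dist.sample()
print(latents.shape)  # expected: (1, 16, 72, 48), i.e. 8x downsampling
```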
result_grid.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_320x576_0.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_384x576_0.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_448x576_0.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_512x576_0.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_576x320_0.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_576x384_0.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_576x448_0.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_576x512_0.jpg
CHANGED (binary image tracked with Git LFS)
samples/unet_576x576_0.jpg
CHANGED (binary image tracked with Git LFS)
src/cherrypick.ipynb
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:c3bc29d6c8ede5a64c8ae9b2f8f824d4edfe91209bc6a6363a43ab66ec01d68f
+size 44788
test.ipynb
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:d35e0b0f890df0408db1efd1fe1daed6b8b27b3021e90fa10ded72d702f26677
+size 4885422
train.py
CHANGED
@@ -28,9 +28,9 @@ import torch.nn.functional as F
 ds_path = "datasets/576"
 project = "unet"
 batch_size = 50
-base_learning_rate =
+base_learning_rate = 2e-5
 min_learning_rate = 1e-5
-num_epochs =
+num_epochs = 5
 # samples/save per epoch
 sample_interval_share = 5
 use_wandb = True
@@ -50,7 +50,7 @@ torch.backends.cuda.enable_mem_efficient_sdp(True)
 dtype = torch.float32
 save_barrier = 1.03
 dispersive_temperature=0.5
-dispersive_weight=0.
+dispersive_weight=0.05
 percentile_clipping = 97 # Lion
 steps_offset = 1 # Scheduler
 limit = 0
@@ -628,7 +628,7 @@ def create_optimizer(name, params):
     if name == "adam8bit":
         import bitsandbytes as bnb
         return bnb.optim.AdamW8bit(
-            params, lr=base_learning_rate, betas=(0.9, 0.
+            params, lr=base_learning_rate, betas=(0.9, 0.995), eps=1e-7, weight_decay=0.001
         )
     elif name == "adam":
         return torch.optim.AdamW(
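As a standalone usage note for the optimizer hunk above: the committed hyperparameters drop into bitsandbytes as sketched below. The Linear layer is a stand-in used only to make the snippet self-contained, not the repo's U-Net.

```python
# Minimal sketch: this commit's AdamW-8bit settings on a toy model.
# bitsandbytes' 8-bit optimizers expect CUDA parameters.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(16, 16).cuda()  # stand-in for the U-Net
optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=2e-5,             # base_learning_rate set in this commit
    betas=(0.9, 0.995),  # beta2 below the usual 0.999 default
    eps=1e-7,
    weight_decay=0.001,
)
```

Storing optimizer states in 8 bits needs roughly a quarter of the memory of 32-bit AdamW states, which is what makes room for the batch_size of 50 set above.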
unet/diffusion_pytorch_model.fp16.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:5bb9c77df1219662fc14575e71304875d9a98861cffe930e83483e4274a34238
+size 3507231768
unet/diffusion_pytorch_model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:d16cb10d5304bc2214ed9559993089a98cd6696006e9ce64e77f3ac165ce2ecf
 size 7014306128