---
datasets:
- Muinez/sankaku-webp-256shortest-edge
---
# StupidAE — d8c16 Tiny Patch Autoencoder
StupidAE is a very small, very fast, and intentionally simple model that still works surprisingly well.
It has **13.24M parameters**, compresses by **8× per spatial dimension**, and uses **16 latent channels**.
The main goal is an AE cheap enough to run directly inside text-to-image training without slowing it down.
---
## Code
The code is available on GitHub:
👉 [https://github.com/Muinez/StupidAE](https://github.com/Muinez/StupidAE)
---
## Key Numbers
- Total params: **13,243,539**
- Compression: **d8 (8×8 patching)**
- Latent channels: **16 (c16)**
- Training: **30k steps**, batch size **256**, **~3** RTX 5090-hours
- Optimizer: **Muon + SnooC**, LR = `1e-3`
- Trained **without KL loss** (MSE only)
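To make the d8c16 naming concrete, here is a small sketch of the tensor shapes it implies. The helper names `latent_shape` and `compression_ratio` are made up for illustration and are not part of the StupidAE code; the sketch assumes RGB input with both sides divisible by 8.

```python
def latent_shape(height, width, patch=8, latent_channels=16):
    """Return (channels, height, width) of the latent for one image.

    Hypothetical helper: assumes both sides are divisible by `patch`.
    """
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    return latent_channels, height // patch, width // patch


def compression_ratio(height, width, patch=8, latent_channels=16, image_channels=3):
    """Ratio of input values to latent values."""
    c, h, w = latent_shape(height, width, patch, latent_channels)
    return (image_channels * height * width) / (c * h * w)


print(latent_shape(1024, 1024))       # (16, 128, 128)
print(compression_ratio(1024, 1024))  # 12.0
```

So a 1024×1024 RGB image becomes a 16×128×128 latent, which is 12× fewer values overall.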
---
## Performance (compared to SDXL VAE)
Stats for 1024×1024:
| Component | SDXL VAE | StupidAE |
|----------|----------|-----------|
| Encoder FLOPs | 4.34 TFLOPs | **124.18 GFLOPs** |
| Decoder FLOPs | 9.93 TFLOPs | **318.52 GFLOPs** |
| Encoder Params | 34.16M | **~3.8M** |
| Decoder Params | 49.49M | **~9.7M** |
At this resolution the model needs **roughly 31–35× fewer FLOPs** and **5–9× fewer parameters** than the SDXL VAE, which makes it usable directly inside training loops.
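The ratios behind the table can be checked with a few lines of arithmetic. The numbers below are copied from the table; the only assumption is the unit conversion 1 TFLOP = 1000 GFLOPs. The StupidAE parameter counts are the approximate values given above, so the parameter ratios are approximate too.

```python
# Figures copied from the table above (1024x1024 input), FLOPs in GFLOPs.
flops_gflops = {
    "encoder": {"sdxl": 4340.0, "stupidae": 124.18},  # 4.34 TFLOPs
    "decoder": {"sdxl": 9930.0, "stupidae": 318.52},  # 9.93 TFLOPs
}
# Parameter counts in millions; the StupidAE values are rounded in the table.
params_millions = {
    "encoder": {"sdxl": 34.16, "stupidae": 3.8},
    "decoder": {"sdxl": 49.49, "stupidae": 9.7},
}


def ratios(table):
    """SDXL VAE cost divided by StupidAE cost, per component."""
    return {part: v["sdxl"] / v["stupidae"] for part, v in table.items()}


flops_ratio = ratios(flops_gflops)
params_ratio = ratios(params_millions)
for part in ("encoder", "decoder"):
    print(f"{part}: {flops_ratio[part]:.1f}x fewer FLOPs, "
          f"{params_ratio[part]:.1f}x fewer params")
```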
---
## Architecture Overview
### ❌ No Attention
It is simply unnecessary for this design and only slows things down.
### 🟦 Encoder
- Splits the image into **8×8 patches**
- Each patch is encoded **independently**
- Uses **only 1×1 convolutions**
- Extremely fast
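A minimal sketch of what "patchify, then only 1×1 convolutions" can look like in PyTorch. This is not the released StupidAE encoder: the class name `TinyPatchEncoder`, the hidden width, the depth, and the activation are assumptions chosen for illustration. It only demonstrates why each 8×8 patch ends up encoded independently of its neighbours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyPatchEncoder(nn.Module):
    """Illustrative patch encoder; layer sizes are assumptions, not StupidAE's."""

    def __init__(self, patch=8, latent_channels=16, hidden=64):
        super().__init__()
        self.patch = patch
        in_features = 3 * patch * patch  # 192 values per 8x8 RGB patch
        # 1x1 convolutions never mix neighbouring spatial positions.
        self.net = nn.Sequential(
            nn.Conv2d(in_features, hidden, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(hidden, latent_channels, kernel_size=1),
        )

    def forward(self, x):
        # (B, 3, H, W) -> (B, 192, H/8, W/8): each position now holds one patch.
        x = F.pixel_unshuffle(x, self.patch)
        return self.net(x)


torch.manual_seed(0)
encoder = TinyPatchEncoder().eval()
image = torch.randn(1, 3, 64, 96)  # non-square input, sides divisible by 8

with torch.no_grad():
    latents = encoder(image)
    # Change only the top-left 8x8 patch and re-encode.
    edited = image.clone()
    edited[:, :, :8, :8] += 1.0
    latents_edited = encoder(edited)

print(latents.shape)  # torch.Size([1, 16, 8, 12])
# Mark latent positions where any channel moved; only (0, 0) should be set.
changed = (~torch.isclose(latents, latents_edited, atol=1e-6)).any(dim=1)[0]
print(changed.nonzero().tolist())
```

Editing one patch moves exactly one latent position, which is the "encoded independently" property from the list above.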
The encoder can handle any aspect ratio, but if you want to mix different ARs inside the same batch, the 1×1 conv version becomes inconvenient.
The Linear encoder version avoids this: mixed-AR batches work out of the box. I haven’t released it yet, but I can upload it if needed.
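Since the Linear variant is not published, the following is only a guess at the general idea, not the author's code. It assumes the encoder reduces to a per-patch MLP, so patches from images of different shapes can be concatenated into one flat batch. `patch_tokens` and the layer sizes are hypothetical.

```python
import torch
import torch.nn as nn


def patch_tokens(image, patch=8):
    """(3, H, W) -> (H/patch * W/patch, 3 * patch * patch), one row per patch."""
    c, h, w = image.shape
    x = image.reshape(c, h // patch, patch, w // patch, patch)
    x = x.permute(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape(-1, c * patch * patch)


torch.manual_seed(0)
# Hypothetical stand-in for a Linear encoder: 192 patch values -> 16 channels.
linear_encoder = nn.Sequential(nn.Linear(192, 64), nn.SiLU(), nn.Linear(64, 16))

# Two images with different aspect ratios.
images = [torch.randn(3, 64, 96), torch.randn(3, 128, 32)]
tokens = [patch_tokens(img) for img in images]
counts = [t.shape[0] for t in tokens]  # 96 and 64 patches

with torch.no_grad():
    # One flat batch of patches; no padding or bucketing is needed.
    packed = linear_encoder(torch.cat(tokens, dim=0))

per_image = packed.split(counts, dim=0)
print([tuple(p.shape) for p in per_image])  # [(96, 16), (64, 16)]
```

With a convolutional encoder the same two images could not share a batch tensor without padding or resizing, because a 4D batch needs a single height and width.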
### 🟥 Decoder
- Uses standard 3×3 convolutions (but 1×1 also works with surprisingly few artifacts)
- Uses a **PixNeRF-style head** instead of stacked upsampling blocks
---
## Limitations
- Reconstruction is not perfect — small details may appear slightly blurred.
- Current MSE loss: 0.0020.
- This can likely be improved by increasing model size.
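For readers who think in PSNR, an MSE converts via PSNR = 10·log10(range² / MSE). The card does not state which value range the 0.0020 was measured on, so this sketch shows both common assumptions: [-1, 1] (which matches the `Normalize([0.5], [0.5])` in the usage example) and [0, 1]. Treat the results as a rough orientation, not a reported metric.

```python
import math


def psnr_from_mse(mse, value_range):
    """PSNR in dB for a given MSE and peak-to-peak value range."""
    return 10.0 * math.log10(value_range ** 2 / mse)


mse = 0.0020
print(f"{psnr_from_mse(mse, 2.0):.2f} dB")  # 33.01 dB if pixels are in [-1, 1]
print(f"{psnr_from_mse(mse, 1.0):.2f} dB")  # 26.99 dB if pixels are in [0, 1]
```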
---
## Notes on 32× Compression
If you want **32× spatial compression**, do **not** use naive 32× patching — quality drops heavily.
A better approach:
1. First stage: patch-8 → 16/32 channels
2. Second stage: patch-4 → 256 channels
This trains much better and works well for text-to-image training too.
I’ve tested it, and the results are significantly more stable than naive approaches.
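The shape arithmetic of the two-stage scheme, compared with a single patch-32 step, can be sketched as follows. `patchify_shape` is a hypothetical helper, and the sketch picks 16 channels for the first stage (the text allows 16 or 32). It only counts values; it says nothing about reconstruction quality.

```python
def patchify_shape(shape, patch, out_channels):
    """Shape after folding patch x patch positions into channels, then projecting.

    Returns the new (channels, height, width) and the number of values
    per patch that the projection has to squeeze into `out_channels`.
    """
    channels, height, width = shape
    if height % patch or width % patch:
        raise ValueError("sides must be divisible by the patch size")
    folded = channels * patch * patch
    return (out_channels, height // patch, width // patch), folded


image = (3, 1024, 1024)
stage1, folded1 = patchify_shape(image, patch=8, out_channels=16)
stage2, folded2 = patchify_shape(stage1, patch=4, out_channels=256)
naive, folded_naive = patchify_shape(image, patch=32, out_channels=256)

print(stage1, folded1)      # (16, 128, 128) 192
print(stage2, folded2)      # (256, 32, 32) 256
print(naive, folded_naive)  # (256, 32, 32) 3072
```

Both routes end at a 256×32×32 latent. The two-stage route maps 192 → 16 and then 256 → 256 values per patch, while the naive route maps 3072 → 256 in a single step.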
If you want to keep FLOPs low, you could try using patch-16 from the start, but I’m not sure yet how stable the training would be.
I’m currently working on a **d32c64** model with reconstruction quality better than Hunyuan VAE, but I’m limited by compute resources.
---
## Support the Project
I’m renting an **RTX 5090** and running all experiments on it.
I’m currently looking for work and would love to join a team doing text-to-image or video model research.
If you want to support development:
- TRC20: 👉 TPssa5ung2MgqbaVr1aeBQEpHC3xfmm1CL
- BTC: bc1qfv6pyq5dvs0tths682nhfdnmdwnjvm2av80ej4
- Boosty: https://boosty.to/muinez
---
## How to use
Here's a minimal example:
```python
import requests
import torch
from huggingface_hub import hf_hub_download
from IPython.display import display
from PIL import Image
from torchvision.transforms import v2

from stae import StupidAE  # from the GitHub repo linked above

# Load the model in half precision on the GPU.
vae = StupidAE().cuda().half()
vae.load_state_dict(
    torch.load(hf_hub_download(repo_id="Muinez/StupidAE", filename="smol_f8c16.pt"))
)

# Resize to 1024x1024 and scale pixel values to [-1, 1].
t = v2.Compose([
    v2.Resize((1024, 1024)),
    v2.ToTensor(),
    v2.Normalize([0.5], [0.5])
])

image = Image.open(requests.get("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG", stream=True).raw).convert("RGB")

# Encode to latents and decode back without tracking gradients.
with torch.inference_mode():
    image = t(image).unsqueeze(0).cuda().half()
    latents = vae.encode(image)
    image_decoded = vae.decode(latents)

# Undo the normalization, clamp to [0, 1], and show the reconstruction.
image = v2.ToPILImage()(torch.clamp(image_decoded * 0.5 + 0.5, 0, 1).squeeze(0))
display(image)
```
---
## Coming Soon
- Linear-encoder variant
- d32c64 model
- Tutorial: training text-to-image **without bucketing** (supports mixed aspect ratios)