---
datasets:
- Muinez/sankaku-webp-256shortest-edge
---

# StupidAE — d8c16 Tiny Patch Autoencoder

StupidAE is a very small, very fast, and intentionally simple model that still works surprisingly well.  
It has **13.24M parameters**, compresses by **8× per spatial dimension**, and uses **16 latent channels**.

The main goal: make an AE that doesn’t slow everything down and is fast enough to run directly during text-to-image training.
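
To make the numbers concrete, d8c16 means every 8×8×3 patch (192 values) is squeezed into 16 latent channels. A quick shape sketch (the BCHW latent layout is an assumption based on the usage example below):

```python
# d8c16 shape arithmetic: 8x downsampling per spatial dim, 16 latent channels.
B, C, H, W = 1, 3, 1024, 1024
latent_shape = (B, 16, H // 8, W // 8)  # -> (1, 16, 128, 128)

# Per patch: 8*8*3 = 192 input values -> 16 latent values (12x fewer elements).
```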

---

## Code

The code is available on GitHub:

👉 [https://github.com/Muinez/StupidAE](https://github.com/Muinez/StupidAE)

---

## Key Numbers

- Total params: **13,243,539**
- Compression: **d8 (8×8 patching)**
- Latent channels: **16 (c16)**
- Training: **30k steps**, batch size **256**, **~3 RTX 5090 GPU-hours**
- Optimizer: **Muon + SnooC**, LR = `1e-3`
- Trained **without KL loss** (plain MSE only)

---

## Performance (compared to SDXL VAE)

Measured at 1024×1024 input resolution:

| Component | SDXL VAE | StupidAE |
|----------|----------|-----------|
| Encoder FLOPs | 4.34 TFLOPs | **124.18 GFLOPs** |
| Decoder FLOPs | 9.93 TFLOPs | **318.52 GFLOPs** |
| Encoder Params | 34.16M | **~3.8M** |
| Decoder Params | 49.49M | **~9.7M** |

The model is **tens of times faster and lighter**, making it usable directly inside training loops.
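
If you want to sanity-check the parameter counts yourself, something like the following should work (the `encoder`/`decoder` attribute names are an assumption about the repo's API, so adjust them to whatever the module actually exposes):

```python
import torch
from stae import StupidAE  # from the GitHub repo

def n_params(m: torch.nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

vae = StupidAE()
print(f"total:   {n_params(vae):,}")
# `encoder`/`decoder` attribute names are assumed, not verified against the repo.
print(f"encoder: {n_params(vae.encoder):,}")
print(f"decoder: {n_params(vae.decoder):,}")
```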

---

## Architecture Overview

### ❌ No Attention  
Attention is simply unnecessary for this design and would only slow things down.

### 🟦 Encoder  
- Splits the image into **8×8 patches**  
- Each patch is encoded **independently**  
- Uses **only 1×1 convolutions**  
- Extremely fast

The encoder can handle any aspect ratio, but mixing different aspect ratios inside the same batch is inconvenient with the 1×1-conv version.
A Linear-based encoder variant solves this completely, so mixed batches work out of the box; it isn't released yet, but I can upload it if needed.
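
As an illustration of the idea (a sketch with invented widths and depth, not the actual repo code), an encoder like this can be written as a `PixelUnshuffle` followed by 1×1 convolutions, which is exactly what makes every 8×8 patch independent:

```python
import torch
import torch.nn as nn

# Illustrative sketch only: widths and depth are invented,
# not the real StupidAE encoder.
class PatchEncoder(nn.Module):
    def __init__(self, latent_channels: int = 16, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.PixelUnshuffle(8),             # (B, 3, H, W) -> (B, 192, H/8, W/8)
            nn.Conv2d(3 * 8 * 8, hidden, 1),  # 1x1 convs never mix patches
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, 1),
            nn.SiLU(),
            nn.Conv2d(hidden, latent_channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

z = PatchEncoder()(torch.randn(1, 3, 256, 256))  # -> (1, 16, 32, 32)
```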

### 🟥 Decoder  
- Uses standard 3×3 convolutions (but 1×1 also works with surprisingly few artifacts)  
- Uses a **PixNeRF-style head** instead of stacked upsampling blocks
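
My reading of "head instead of stacked upsampling" is a single projection that predicts all 8×8 pixels of a patch at once and then rearranges them to full resolution; the sketch below shows that general shape only, not the repo's actual PixNeRF-style head:

```python
import torch
import torch.nn as nn

# Rough sketch of a single-shot upsampling head; widths and depth are invented.
class PatchDecoder(nn.Module):
    def __init__(self, latent_channels: int = 16, hidden: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(latent_channels, hidden, 3, padding=1),  # 3x3: mixes neighboring patches
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.SiLU(),
        )
        # Predict all 8*8 RGB values per latent position in one shot,
        # then rearrange, instead of stacking 2x upsampling blocks.
        self.head = nn.Sequential(
            nn.Conv2d(hidden, 3 * 8 * 8, 1),
            nn.PixelShuffle(8),  # (B, 192, h, w) -> (B, 3, 8h, 8w)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.head(self.body(z))

img = PatchDecoder()(torch.randn(1, 16, 32, 32))  # -> (1, 3, 256, 256)
```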

---

## Limitations

- Reconstruction is not perfect — small details may appear slightly blurred.  
- Current MSE loss: 0.0020.  
- This can likely be improved by increasing model size.

---

## Notes on 32× Compression

If you want **32× spatial compression**, do **not** use naive 32× patching — quality drops heavily.

A better approach:

1. First stage: patch-8 → 16 or 32 channels  
2. Second stage: patch-4 → 256 channels  

This trains much better and works well for text-to-image training too.  
I’ve tested it, and the results are significantly more stable than naive approaches.
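
In shapes, the cascade looks like this (channel widths from the recipe above; resolution and batch size are just for illustration):

```python
# Two-stage 32x compression, shapes only.
B, H, W = 1, 1024, 1024

# Stage 1: patch-8 autoencoder, e.g. 32 latent channels.
stage1 = (B, 32, H // 8, W // 8)               # (1, 32, 128, 128)

# Stage 2: patch-4 autoencoder trained on stage-1 latents, 256 channels.
stage2 = (B, 256, H // (8 * 4), W // (8 * 4))  # (1, 256, 32, 32)

# Combined: 8 * 4 = 32x downsampling per spatial dimension.
```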

If you want to keep FLOPs low, you could try using patch-16 from the start, but I’m not sure yet how stable the training would be.

I’m currently working on a **d32c64** model with reconstruction quality better than Hunyuan VAE, but I’m limited by compute resources.

---

## Support the Project

I’m renting an **RTX 5090** and running all experiments on it.  
I’m currently looking for work and would love to join a team doing text-to-image or video model research.

If you want to support development:

- TRC20: 👉 TPssa5ung2MgqbaVr1aeBQEpHC3xfmm1CL  
- BTC: bc1qfv6pyq5dvs0tths682nhfdnmdwnjvm2av80ej4
- Boosty: https://boosty.to/muinez  

---

## How to use

Here's a minimal example:

```python
import torch
import requests
from huggingface_hub import hf_hub_download
from PIL import Image
from torchvision.transforms import v2
from IPython.display import display
from stae import StupidAE  # from the GitHub repo

# Load the model and fetch the pretrained weights from the Hub.
vae = StupidAE().cuda().half()
vae.load_state_dict(
    torch.load(hf_hub_download(repo_id="Muinez/StupidAE", filename="smol_f8c16.pt"))
)

# Resize and normalize to [-1, 1].
t = v2.Compose([
    v2.Resize((1024, 1024)),
    v2.ToTensor(),
    v2.Normalize([0.5], [0.5]),
])

image = Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
    stream=True,
).raw).convert("RGB")

with torch.inference_mode():
    x = t(image).unsqueeze(0).cuda().half()

    # Encode to a (1, 16, 128, 128) latent, then decode back.
    latents = vae.encode(x)
    decoded = vae.decode(latents)

    # Undo the [-1, 1] normalization and display the reconstruction.
    result = v2.ToPILImage()(torch.clamp(decoded * 0.5 + 0.5, 0, 1).squeeze(0))
    display(result)
```
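
Because the encoder works on 8×8 patches, non-square inputs work as well. Continuing from the example above (the only constraint I would expect is that height and width are multiples of 8; that's an assumption from the patching, not something verified against the repo):

```python
# Non-square input, reusing `vae` from the example above.
t_wide = v2.Compose([
    v2.Resize((768, 1280)),
    v2.ToTensor(),
    v2.Normalize([0.5], [0.5]),
])

with torch.inference_mode():
    x = t_wide(Image.open("input.jpg").convert("RGB")).unsqueeze(0).cuda().half()
    z = vae.encode(x)  # expected latent shape: (1, 16, 96, 160)
```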
---
## Coming Soon

- Linear-encoder variant  
- d32c64 model  
- Tutorial: training text-to-image **without bucketing** (supports mixed aspect ratios)