---
license: apache-2.0
tags:
  - diffusion
  - autoencoder
  - image-reconstruction
  - pytorch
  - masked-autoencoder
library_name: mdiffae
---

# mdiffae-v2

**mDiffAE v2** stands for **M**asked **Diff**usion **A**uto**E**ncoder v2:
a fast, single-GPU-trainable diffusion autoencoder with a **96-channel**
spatial bottleneck and optional PDG sharpening.

**This is the recommended version.** It offers substantially better
reconstruction than [v1](https://huggingface.co/data-archetype/mdiffae-v1)
(+1.7 dB mean PSNR) while matching or improving the convergence of
downstream latent diffusion models.

This variant (mdiffae-v2) has 120.9M parameters (461.2 MB on disk).
Bottleneck: **96 channels** at patch size 16 (8x compression ratio).
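
As a sanity check on the stated ratio: each 16x16 RGB patch holds 768 input values, and each patch maps to one 96-channel latent token.

```python
pixels_per_patch = 3 * 16 * 16       # 768 values per 16x16 RGB patch
bottleneck_channels = 96             # one 96-dim latent token per patch
print(pixels_per_patch / bottleneck_channels)  # 8.0
```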

## Documentation

- [Technical Report](technical_report_mdiffae_v2.md) – architecture, training changes from v1, and results
- [Results – interactive viewer](https://huggingface.co/spaces/data-archetype/mdiffae-v2-results) – full-resolution side-by-side comparison
- [mDiffAE v1](https://huggingface.co/data-archetype/mdiffae-v1) – previous version
- [iRDiffAE Technical Report](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) – full background on VP diffusion, DiCo blocks, patchify encoder, AdaLN

## Quick Start

```python
import torch
from m_diffae_v2 import MDiffAEV2

# Load from HuggingFace Hub (or a local path)
model = MDiffAEV2.from_pretrained("data-archetype/mdiffae-v2", device="cuda")

# Encode
images = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 16
latents = model.encode(images)

# Decode (2 steps by default, the PSNR-optimal setting)
_, _, H, W = images.shape
recon = model.decode(latents, height=H, width=W)

# Reconstruct (encode + 2-step decode)
recon = model.reconstruct(images)
```

> **Note:** Requires `pip install huggingface_hub safetensors` for Hub downloads.
> You can also pass a local directory path to `from_pretrained()`.
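
To check reconstruction quality on your own data, a small PSNR helper can be used; this function is our own, not part of the package, and assumes images in [-1, 1] (dynamic range 2).

```python
import torch

def psnr(x: torch.Tensor, y: torch.Tensor, peak: float = 2.0) -> torch.Tensor:
    """PSNR in dB for tensors in [-1, 1] (dynamic range `peak` = 2)."""
    mse = torch.mean((x - y) ** 2)
    return 10 * torch.log10(peak ** 2 / mse)

# Hypothetical usage with the model from the Quick Start:
# recon = model.reconstruct(images)
# print(psnr(images, recon).item())
```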

## Architecture

| Property | Value |
|---|---|
| Parameters | 120,893,792 |
| File size | 461.2 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 8 (2+4+2 skip-concat) |
| Bottleneck dim | 96 |
| Compression ratio | 8x |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |
| PDG | Conditioning degradation for CFG-style sharpening at inference |
| Training regularizer | Token masking (25-75% ratio, 90% apply prob) + Path drop (10% drop prob) |
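
The token-masking regularizer in the last row can be sketched as follows. This is an illustration of the stated schedule (mask ratio drawn from Uniform(25%, 75%), applied with 90% probability), and `mask_latent_tokens` is a hypothetical name, not the package's API.

```python
import torch

def mask_latent_tokens(tokens: torch.Tensor,
                       apply_prob: float = 0.9,
                       lo: float = 0.25, hi: float = 0.75) -> torch.Tensor:
    """Zero out a random subset of latent tokens during training.

    tokens: [B, N, C]. With probability `apply_prob`, drop a fraction of
    tokens drawn uniformly from [lo, hi]; otherwise return tokens unchanged.
    """
    if torch.rand(()) >= apply_prob:
        return tokens
    ratio = torch.empty(()).uniform_(lo, hi).item()
    keep = torch.rand(tokens.shape[:2], device=tokens.device) >= ratio
    return tokens * keep.unsqueeze(-1).to(tokens.dtype)
```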

**Encoder**: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by
DiCo blocks with learned residual gates. No input RMSNorm. Post-bottleneck
RMSNorm (affine=False) normalizes the latent tokens.
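
The patchify stem described above can be sketched with the numbers from the table (patch size 16, model dim 896). This is an illustrative stand-in, omitting the DiCo blocks and residual gates, not the package's exact module.

```python
import torch
import torch.nn as nn

# PixelUnshuffle folds each 16x16 patch into channels; a 1x1 conv then
# projects the 3 * 16 * 16 = 768 channels to the model dim (896).
patchify = nn.Sequential(
    nn.PixelUnshuffle(16),
    nn.Conv2d(3 * 16 * 16, 896, kernel_size=1),
)

x = torch.randn(1, 3, 64, 64)   # H, W divisible by 16
tokens = patchify(x)            # [1, 896, 4, 4]: one token per patch
```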

**Decoder**: VP diffusion conditioned on encoder latents and timestep via
shared-base + per-layer low-rank AdaLN-Zero. Skip-concat topology
(2 start + 4 middle + 2 end blocks)
with skip connections from start to end blocks. No outer RMSNorms
(input, latent conditioning, and output norms all removed).
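
One plausible wiring of the 2+4+2 skip-concat topology is sketched below. The block internals here (Linear + GELU) stand in for the real AdaLN-conditioned DiCo blocks, and the exact skip pairing and projection are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SkipConcatDecoder(nn.Module):
    """Illustrative 2+4+2 skip-concat wiring (block internals simplified)."""

    def __init__(self, dim: int = 896):
        super().__init__()
        def block():
            return nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.start = nn.ModuleList([block() for _ in range(2)])
        self.middle = nn.ModuleList([block() for _ in range(4)])
        self.end = nn.ModuleList([block() for _ in range(2)])
        # Each end block consumes its input concatenated with a start-block
        # output, projected back down to `dim`.
        self.skip_proj = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(2)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for blk in self.start:
            x = blk(x)
            skips.append(x)
        for blk in self.middle:
            x = blk(x)
        for blk, proj in zip(self.end, self.skip_proj):
            x = proj(torch.cat([x, skips.pop()], dim=-1))  # U-Net-style pairing
            x = blk(x)
        return x
```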

### Changes from v1

| Aspect | mDiffAE v1 | mDiffAE v2 |
|---|---|---|
| Bottleneck dim | 64 (12x compression) | **96** (8x compression) |
| Decoder topology | 4 flat sequential blocks | **8 blocks (2+4+2 skip-concat)** |
| Token mask apply prob | 50% | **90%** |
| Token mask ratio | Fixed 75% | **Uniform(25%, 75%)** |
| PDG training regularizer | Token masking (50%) | **Token masking (90%) + path drop (10%)** |
| Latent noise prob | 10% | **50%** |
| Encoder input norm | RMSNorm (affine) | **Removed** |
| Decoder input norm | RMSNorm (affine) | **Removed** |
| Decoder latent norm | RMSNorm (affine) | **Removed** |
| Decoder output norm | RMSNorm (affine) | **Removed** |

## Recommended Settings

| Mode | Steps | PDG | Strength |
|---|---|---|---|
| **Default** (best PSNR) | 2 | off | – |
| **Sharp** (perceptual) | 10 | on | 2.0 |

```python
from m_diffae_v2 import MDiffAEV2InferenceConfig

# Default: best PSNR, fast (2 steps, no PDG)
recon = model.decode(latents, height=H, width=W)

# Sharp mode: perceptual sharpening (10 steps + PDG)
cfg = MDiffAEV2InferenceConfig(num_steps=10, pdg=True, pdg_strength=2.0)
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
```

## Citation

```bibtex
@misc{mdiffae_v2,
  title   = {mDiffAE v2: A Fast Masked Diffusion Autoencoder},
  author  = {data-archetype},
  year    = {2026},
  month   = mar,
  url     = {https://huggingface.co/data-archetype/mdiffae-v2},
}
```

## Dependencies

- PyTorch >= 2.0
- safetensors (for loading weights)

## License

Apache 2.0