---
license: apache-2.0
tags:
  - diffusion
  - autoencoder
  - image-reconstruction
  - pytorch
  - masked-autoencoder
library_name: mdiffae
---

# mdiffae_v1

**mDiffAE** — **M**asked **Diff**usion **A**uto**E**ncoder.
A fast, single-GPU-trainable diffusion autoencoder with a **64-channel**
spatial bottleneck. Uses decoder token masking as an implicit regularizer
instead of REPA alignment.

This variant (**mdiffae_v1**) has 81.4M parameters (310.6 MB on disk).
Bottleneck: **64 channels** at patch size 16 (compression ratio 12x).
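
The 12x compression figure follows directly from the patch and bottleneck sizes stated above; a quick sanity check:

```python
# Each 16x16 RGB patch holds 16 * 16 * 3 = 768 input values.
patch_values = 16 * 16 * 3

# The bottleneck keeps 64 channels per patch position.
bottleneck_channels = 64

# Compression ratio = input values per patch / latent values per patch.
ratio = patch_values / bottleneck_channels
print(ratio)  # 12.0
```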

## Documentation

- [Technical Report](technical_report_mdiffae.md) — architecture, masking strategy, and results
- [iRDiffAE Technical Report](https://huggingface.co/data-archetype/irdiffae-v1/blob/main/technical_report.md) — full background on VP diffusion, DiCo blocks, patchify encoder, AdaLN
- [Results — interactive viewer](https://huggingface.co/spaces/data-archetype/mdiffae-results) — full-resolution side-by-side comparison

## Quick Start

```python
import torch
from m_diffae import MDiffAE

# Load from HuggingFace Hub (or a local path)
model = MDiffAE.from_pretrained("data-archetype/mdiffae_v1", device="cuda")

# Encode — images are [B, 3, H, W] in [-1, 1], H and W divisible by 16
B, H, W = 4, 256, 256
images = torch.rand(B, 3, H, W) * 2 - 1  # dummy input; swap in real images
latents = model.encode(images)

# Decode (1 step by default — PSNR-optimal)
recon = model.decode(latents, height=H, width=W)

# Reconstruct (encode + 1-step decode)
recon = model.reconstruct(images)
```

> **Note:** Requires `pip install huggingface_hub safetensors` for Hub downloads.
> You can also pass a local directory path to `from_pretrained()`.
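
The latent grid size is determined by the patch size. Assuming the latents are laid out spatially, one 64-channel vector per 16x16 patch (an assumption about the layout, not confirmed above), the arithmetic for a 256x256 input is:

```python
# Hypothetical latent-grid arithmetic for a 256x256 input at patch size 16.
H, W = 256, 256
patch = 16
grid_h, grid_w = H // patch, W // patch
print(grid_h, grid_w)  # 16 16
# Under a spatial layout the latent shape would then be [B, 64, 16, 16].
```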

## Architecture

| Property | Value |
|---|---|
| Parameters | 81,410,624 |
| File size | 310.6 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 4 |
| Decoder topology | Flat sequential (no skip connections) |
| Bottleneck dim | 64 |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |
| PDG mechanism | Token-level masking (ratio 0.75) |
| Training regularizer | Decoder token masking (75% ratio, 50% apply prob) |
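
The training regularizer in the last row can be sketched as follows. This is a hypothetical illustration of the described behavior, not the repository's code: with probability 0.5 per sample, a random 75% of decoder tokens is dropped.

```python
import random

def mask_tokens(num_tokens, mask_ratio=0.75, apply_prob=0.5, rng=random):
    """Return indices of decoder tokens to KEEP for one training sample.

    With probability `apply_prob`, a random `mask_ratio` fraction of tokens
    is dropped; otherwise all tokens are kept. (Sketch of the masking
    regularizer described in the table above.)
    """
    if rng.random() >= apply_prob:
        return list(range(num_tokens))  # no masking this step
    num_keep = num_tokens - int(num_tokens * mask_ratio)
    return sorted(rng.sample(range(num_tokens), num_keep))

kept = mask_tokens(256)  # e.g. a 16x16 token grid
print(len(kept))         # 64 if masked, 256 otherwise
```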

**Encoder**: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by
DiCo blocks (depthwise conv + compact channel attention + GELU MLP) with
learned residual gates.

**Decoder**: VP diffusion conditioned on encoder latents and timestep via
shared-base + per-layer low-rank AdaLN-Zero. 4 flat
sequential blocks (no skip connections).
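
To see why the low-rank factorization is cheap, compare parameter counts for a single AdaLN head at the model dim (896) and rank (128) from the table above. The number of modulation vectors (6, i.e. scale/shift/gate for two sub-blocks) is an assumption for illustration:

```python
dim, rank, n_mod = 896, 128, 6  # model dim, AdaLN rank, modulation vectors (assumed)

# Dense conditioning -> modulation projection vs. a rank-128 factorization.
full_adaln = dim * (n_mod * dim)
low_rank = dim * rank + rank * (n_mod * dim)

print(full_adaln, low_rank)  # 4816896 802816
```

Under these assumptions the factorized head uses roughly 6x fewer parameters per layer, which is why a per-layer low-rank delta on top of a shared base stays affordable.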

**Compared to iRDiffAE**: iRDiffAE uses an 8-block decoder (2 start + 4 middle
+ 2 end) with skip connections and 128 bottleneck channels (needed partly because
REPA occupies half the channels). mDiffAE uses 4 flat blocks with no skip
connections and 64 bottleneck channels (12x compression vs. iRDiffAE's 6x),
which gives better channel utilization.

### Key Differences from iRDiffAE

| Aspect | iRDiffAE v1 | mDiffAE v1 |
|---|---|---|
| Bottleneck dim | 128 | **64** |
| Decoder depth | 8 (2+4+2 skip-concat) | **4 (flat sequential)** |
| PDG mechanism | Block dropping | **Token masking** |
| Training regularizer | REPA + covariance reg | **Decoder token masking** |

## Recommended Settings

Best quality is achieved with **1 DDIM step** and PDG disabled.
PDG can sharpen images, but its strength should be kept very low (1.01–1.05).

| Setting | Default |
|---|---|
| Sampler | DDIM |
| Steps | 1 |
| PDG | Disabled |
| PDG strength (if enabled) | 1.05 |

```python
from m_diffae import MDiffAEInferenceConfig

# PSNR-optimal (fast, 1 step)
cfg = MDiffAEInferenceConfig(num_steps=1, sampler="ddim")
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
```

## Citation

```bibtex
@misc{m_diffae,
  title   = {mDiffAE: A Fast Masked Diffusion Autoencoder},
  author  = {data-archetype},
  year    = {2026},
  month   = mar,
  url     = {https://huggingface.co/data-archetype/mdiffae_v1},
}
```

## Dependencies

- PyTorch >= 2.0
- safetensors (for loading weights)
- huggingface_hub (for downloading from the Hub)

## License

Apache 2.0