---
license: apache-2.0
tags:
  - diffusion
  - autoencoder
  - image-reconstruction
  - latent-space
  - pytorch
  - fcdm
library_name: fcdm_diffae
---

# data-archetype/semdisdiffae_p32_v2

**semdisdiffae_p32_v2** is a native patch-32 SemDisDiffAE diffusion autoencoder. It
keeps the same FCDM decoder family as
[SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae), with an
8-block encoder, an 8-block decoder, and a 384-channel spatial latent at
`H/32 x W/32`.

Relative to the original SemDisDiffAE, this model is optimized for a
lower-resolution latent grid and for downstream latent diffusion: patch size
`32` instead of `16`, `384` latent channels instead of `128`, an 8-block
encoder instead of a 4-block encoder, and DINOv3 ConvNeXt-B semantic alignment
instead of the original DINO-based setup.

For details, see the
[semdisdiffae_p32_v2 technical report](https://huggingface.co/data-archetype/semdisdiffae_p32_v2/blob/main/technical_report_fcdm_diffae.md).
For additional shared FCDM / VP decoder background, see the original
[SemDisDiffAE technical report](https://huggingface.co/data-archetype/semdisdiffae/blob/main/technical_report_semantic.md).

The p32 checkpoint was trained at `384` resolution rather than with the
original `256`-scale recipe. With patch size `32`, this gives a `12x12` latent
grid instead of `8x8`, reducing the impact of 7x7-convolution border effects
during training.
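
As a quick shape check, here is a minimal sketch; it assumes `encode()`
returns the spatial latent directly in `[B, C, H, W]` layout, as in the Usage
snippet below:

```python
import torch

from fcdm_diffae import FCDMDiffAE

# Minimal shape check: with patch size 32, a 384x384 input should produce
# a 384-channel latent on a 12x12 grid (384 / 32 = 12).
model = FCDMDiffAE.from_pretrained(
    "data-archetype/semdisdiffae_p32_v2",
    device="cuda",
    dtype=torch.bfloat16,
)

x = torch.zeros(1, 3, 384, 384, device="cuda", dtype=torch.bfloat16)
with torch.inference_mode():
    z = model.encode(x)

print(z.shape)  # expected: torch.Size([1, 384, 12, 12])
```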

## 2k PSNR Benchmark

Evaluated on `2000` images, split into `1333` Pexels images and `667` Amazon
book covers. Reconstruction uses the default 1-step VP/DDIM path in `bfloat16`.

| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
|---|---:|---:|---:|---:|---:|
| semdisdiffae_p32_v2 | `36.06` | `5.47` | `35.80` | `27.63` | `45.02` |
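
For reference, a minimal per-image PSNR sketch; the exact evaluation harness
is not published here, and mapping from `[-1, 1]` to `[0, 1]` before measuring
error is an assumption:

```python
import torch


def psnr_db(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Per-image PSNR in dB for [B, 3, H, W] tensors in [-1, 1]."""
    # Assumed convention: map to [0, 1] so the peak signal value is 1.0.
    r = (recon.float().clamp(-1, 1) + 1) / 2
    t = (target.float().clamp(-1, 1) + 1) / 2
    mse = (r - t).pow(2).mean(dim=(-3, -2, -1))  # mean over C, H, W
    return 10 * torch.log10(1.0 / mse)
```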

## Reconstruction Viewer

The 39-image reconstruction viewer shows originals, semdisdiffae_p32_v2
reconstructions, RGB error deltas, and latent PCA side by side, with FLUX.2 VAE
included for comparison:
[semdisdiffae_p32_v2 reconstruction viewer](https://huggingface.co/spaces/data-archetype/semdisdiffae_p32_v2-results).

## Encode Throughput

Measured on an `NVIDIA GeForce RTX 5090` in `bfloat16`; each figure averages
`20` batched `encode()` calls after `5` warmup batches.

| Resolution | Batch Size | Mean (ms/batch) | ms/image | Images/s | Peak Allocated VRAM |
|---:|---:|---:|---:|---:|---:|
| `256x256` | `128` | `12.38` | `0.097` | `10336.1` | `567.8 MiB` |
| `512x512` | `128` | `53.49` | `0.418` | `2393.0` | `1353.8 MiB` |
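
A minimal sketch of this timing pattern (batch contents and the exact harness
are assumptions; `model` is loaded as in the Usage section below):

```python
import time

import torch

batch = torch.randn(128, 3, 512, 512, device="cuda", dtype=torch.bfloat16)

with torch.inference_mode():
    for _ in range(5):  # warmup batches
        model.encode(batch)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(20):  # timed batches
        model.encode(batch)
    torch.cuda.synchronize()

elapsed = time.perf_counter() - start
ms_per_batch = elapsed / 20 * 1e3
print(f"{ms_per_batch:.2f} ms/batch, {ms_per_batch / 128:.3f} ms/image")
```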

## Decode Latency

Measured on the same `NVIDIA GeForce RTX 5090` in `bfloat16`. This is
decode-only latency: images are encoded once, latents are cached, and timing
covers sequential batch-1 `decode()` calls over the cached latent set with the
default 1-step sampler and PDG disabled.

| Resolution | Batch Size | Images | Mean (ms/image) | Images/s | Peak Allocated VRAM |
|---:|---:|---:|---:|---:|---:|
| `512x512` | `1` | `20` | `3.89` | `256.8` | `340.8 MiB` |
| `1024x1024` | `1` | `20` | `9.79` | `102.2` | `409.6 MiB` |
| `2048x2048` | `1` | `20` | `51.90` | `19.3` | `720.9 MiB` |
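
A sketch of the decode-only protocol, reusing the timing pattern above (the
latent caching and loop are assumptions; the PDG-disabled setting is not
reproduced here since its flag is not shown in this card):

```python
import time

import torch

from fcdm_diffae import FCDMDiffAEInferenceConfig

# images: an assumed list of [3, 512, 512] tensors already on the GPU.
with torch.inference_mode():
    # Encode once and cache the latents; timing covers decode() only.
    cached = [model.encode(img[None]) for img in images]
    torch.cuda.synchronize()

    start = time.perf_counter()
    for z in cached:  # sequential batch-1 decodes
        model.decode(
            z,
            height=512,
            width=512,
            inference_config=FCDMDiffAEInferenceConfig(num_steps=1),
        )
    torch.cuda.synchronize()

ms_per_image = (time.perf_counter() - start) / len(cached) * 1e3
print(f"{ms_per_image:.2f} ms/image")
```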

## Latent Interface

- `encode()` returns whitened latents using the model's saved running statistics.
- `decode()` expects those whitened latents and dewhitens internally.
- `whiten()` and `dewhiten()` expose the transform explicitly.
- `encode_posterior()` returns the raw exported posterior before whitening.

Weights are stored in `float32`. The recommended runtime path is `bfloat16` for
the encoder and decoder, while whitening, dewhitening, posterior moment math,
VP schedule math, and sampler state updates are kept in `float32`.
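
A minimal round-trip sketch, assuming `whiten()` and `dewhiten()` take and
return latent tensors (their exact signatures are not documented in this
card):

```python
import torch

# image: a [B, 3, H, W] tensor in [-1, 1], as in the Usage section below.
with torch.inference_mode():
    z = model.encode(image)      # whitened latents
    raw = model.dewhiten(z)      # undo the saved running statistics
    z_again = model.whiten(raw)  # re-apply whitening (kept in float32)

# The round trip should be numerically close up to bfloat16/float32 casts.
assert torch.allclose(z.float(), z_again.float(), atol=1e-2)
```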

## Usage

```python
import torch

from fcdm_diffae import FCDMDiffAE, FCDMDiffAEInferenceConfig


device = "cuda"
model = FCDMDiffAE.from_pretrained(
    "data-archetype/semdisdiffae_p32_v2",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 32

with torch.inference_mode():
    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
    recon = model.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=FCDMDiffAEInferenceConfig(num_steps=1),
    )
```

## Details

- Architecture: patch-32 FCDM DiffAE, `156.6M` parameters, `384` latent channels.
- Encoder / decoder depth: `8` blocks each.
- Training resolution: `384` AR buckets and `384x384` square crops.
- Semantic alignment: DINOv3 ConvNeXt-B/LVD1689M, 50/50 MSE plus negative cosine.
- Posterior: diagonal Gaussian with VP log-SNR parameterization.
- Export variant: EMA weights.
- [Technical report](https://huggingface.co/data-archetype/semdisdiffae_p32_v2/blob/main/technical_report_fcdm_diffae.md)

## Citation

```bibtex
@misc{semdisdiffae_p32_v2,
  title   = {SemDisDiffAE p32 v2: a patch-32 FCDM diffusion autoencoder},
  author  = {data-archetype},
  email   = {data-archetype@proton.me},
  year    = {2026},
  month   = apr,
  url     = {https://huggingface.co/data-archetype/semdisdiffae_p32_v2},
}
```