---
license: apache-2.0
tags:
  - diffusion
  - autoencoder
  - image-reconstruction
  - latent-space
  - pytorch
  - fcdm
library_name: fcdm_diffae
---

# data-archetype/semdisdiffae_p32_v2

**semdisdiffae_p32_v2** is a native patch-32 SemDisDiffAE diffusion autoencoder. It
keeps the same FCDM decoder family as
[SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae), with an
8-block encoder, an 8-block decoder, and a 384-channel spatial latent at
`H/32 x W/32`.

Relative to the original SemDisDiffAE, this model is optimized for a
lower-resolution latent grid and for downstream latent diffusion: patch size
`32` instead of `16`, `384` latent channels instead of `128`, an 8-block
encoder instead of a 4-block encoder, and DINOv3 ConvNeXt-B semantic alignment
instead of the original DINO-based setup.

For details, see the
[semdisdiffae_p32_v2 technical report](https://huggingface.co/data-archetype/semdisdiffae_p32_v2/blob/main/technical_report_fcdm_diffae.md).
For additional shared FCDM / VP decoder background, see the original
[SemDisDiffAE technical report](https://huggingface.co/data-archetype/semdisdiffae/blob/main/technical_report_semantic.md).

The p32 checkpoint was trained at `384` resolution rather than with the
original `256`-scale recipe. With patch size `32`, this gives a `12x12` latent
grid instead of `8x8`, reducing the impact of 7x7-convolution border effects
during training.
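
As a quick shape check, here is a minimal sketch; it assumes `encode()`
returns the spatial latent directly in `[B, C, H, W]` layout, as in the Usage
snippet below:

```python
import torch

from fcdm_diffae import FCDMDiffAE

# Minimal shape check: with patch size 32, a 384x384 input should produce
# a 384-channel latent on a 12x12 grid (384 / 32 = 12).
model = FCDMDiffAE.from_pretrained(
    "data-archetype/semdisdiffae_p32_v2",
    device="cuda",
    dtype=torch.bfloat16,
)

x = torch.zeros(1, 3, 384, 384, device="cuda", dtype=torch.bfloat16)
with torch.inference_mode():
    z = model.encode(x)

print(z.shape)  # expected: torch.Size([1, 384, 12, 12])
```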

## 2k PSNR Benchmark

Evaluated on `2000` images, split into `1333` Pexels images and `667` Amazon
book covers. Reconstruction uses the default 1-step VP/DDIM path in `bfloat16`.

| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
|---|---:|---:|---:|---:|---:|
| semdisdiffae_p32_v2 | `36.06` | `5.47` | `35.80` | `27.63` | `45.02` |
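
For reference, a minimal per-image PSNR sketch; the exact evaluation harness
is not published here, and mapping from `[-1, 1]` to `[0, 1]` before measuring
error is an assumption:

```python
import torch


def psnr_db(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Per-image PSNR in dB for [B, 3, H, W] tensors in [-1, 1]."""
    # Assumed convention: map to [0, 1] so the peak signal value is 1.0.
    r = (recon.float().clamp(-1, 1) + 1) / 2
    t = (target.float().clamp(-1, 1) + 1) / 2
    mse = (r - t).pow(2).mean(dim=(-3, -2, -1))  # mean over C, H, W
    return 10 * torch.log10(1.0 / mse)
```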

## Reconstruction Viewer

The 39-image reconstruction viewer shows originals, semdisdiffae_p32_v2
reconstructions, RGB error deltas, and latent PCA side by side, with FLUX.2 VAE
included for comparison:
[semdisdiffae_p32_v2 reconstruction viewer](https://huggingface.co/spaces/data-archetype/semdisdiffae_p32_v2-results).

## Encode Throughput

Measured on an `NVIDIA GeForce RTX 5090` in `bfloat16`; each figure averages
`20` batched `encode()` calls after `5` warmup batches.

| Resolution | Batch Size | Mean (ms/batch) | ms/image | Images/s | Peak Allocated VRAM |
|---:|---:|---:|---:|---:|---:|
| `256x256` | `128` | `12.38` | `0.097` | `10336.1` | `567.8 MiB` |
| `512x512` | `128` | `53.49` | `0.418` | `2393.0` | `1353.8 MiB` |
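
A minimal sketch of this timing pattern (batch contents and the exact harness
are assumptions; `model` is loaded as in the Usage section below):

```python
import time

import torch

batch = torch.randn(128, 3, 512, 512, device="cuda", dtype=torch.bfloat16)

with torch.inference_mode():
    for _ in range(5):  # warmup batches
        model.encode(batch)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(20):  # timed batches
        model.encode(batch)
    torch.cuda.synchronize()

elapsed = time.perf_counter() - start
ms_per_batch = elapsed / 20 * 1e3
print(f"{ms_per_batch:.2f} ms/batch, {ms_per_batch / 128:.3f} ms/image")
```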

## Decode Latency

Measured on the same `NVIDIA GeForce RTX 5090` in `bfloat16`. This is
decode-only latency: images are encoded once, latents are cached, and timing
covers sequential batch-1 `decode()` calls over the cached latent set with the
default 1-step sampler and PDG disabled.

| Resolution | Batch Size | Images | Mean (ms/image) | Images/s | Peak Allocated VRAM |
|---:|---:|---:|---:|---:|---:|
| `512x512` | `1` | `20` | `3.89` | `256.8` | `340.8 MiB` |
| `1024x1024` | `1` | `20` | `9.79` | `102.2` | `409.6 MiB` |
| `2048x2048` | `1` | `20` | `51.90` | `19.3` | `720.9 MiB` |
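
A sketch of the decode-only protocol, reusing the timing pattern above (the
latent caching and loop are assumptions; the PDG-disabled setting is not
reproduced here since its flag is not shown in this card):

```python
import time

import torch

from fcdm_diffae import FCDMDiffAEInferenceConfig

# images: an assumed list of [3, 512, 512] tensors already on the GPU.
with torch.inference_mode():
    # Encode once and cache the latents; timing covers decode() only.
    cached = [model.encode(img[None]) for img in images]
    torch.cuda.synchronize()

    start = time.perf_counter()
    for z in cached:  # sequential batch-1 decodes
        model.decode(
            z,
            height=512,
            width=512,
            inference_config=FCDMDiffAEInferenceConfig(num_steps=1),
        )
    torch.cuda.synchronize()

ms_per_image = (time.perf_counter() - start) / len(cached) * 1e3
print(f"{ms_per_image:.2f} ms/image")
```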

## Latent Interface

- `encode()` returns whitened latents using the model's saved running statistics.
- `decode()` expects those whitened latents and dewhitens internally.
- `whiten()` and `dewhiten()` expose the transform explicitly.
- `encode_posterior()` returns the raw exported posterior before whitening.

Weights are stored in `float32`. The recommended runtime path is `bfloat16` for
the encoder and decoder, while whitening, dewhitening, posterior moment math,
VP schedule math, and sampler state updates are kept in `float32`.
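
A minimal round-trip sketch, assuming `whiten()` and `dewhiten()` take and
return latent tensors (their exact signatures are not documented in this
card):

```python
import torch

# image: a [B, 3, H, W] tensor in [-1, 1], as in the Usage section below.
with torch.inference_mode():
    z = model.encode(image)      # whitened latents
    raw = model.dewhiten(z)      # undo the saved running statistics
    z_again = model.whiten(raw)  # re-apply whitening (kept in float32)

# The round trip should be numerically close up to bfloat16/float32 casts.
assert torch.allclose(z.float(), z_again.float(), atol=1e-2)
```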

## Usage

```python
import torch

from fcdm_diffae import FCDMDiffAE, FCDMDiffAEInferenceConfig


device = "cuda"
model = FCDMDiffAE.from_pretrained(
    "data-archetype/semdisdiffae_p32_v2",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 32

with torch.inference_mode():
    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
    recon = model.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=FCDMDiffAEInferenceConfig(num_steps=1),
    )
```

## Details

- Architecture: patch-32 FCDM DiffAE, `156.6M` parameters, `384` latent channels.
- Encoder / decoder depth: `8` blocks each.
- Training resolution: `384` AR buckets and `384x384` square crops.
- Semantic alignment: DINOv3 ConvNeXt-B/LVD1689M, 50/50 MSE plus negative cosine.
- Posterior: diagonal Gaussian with VP log-SNR parameterization.
- Export variant: EMA weights.
- [Technical report](https://huggingface.co/data-archetype/semdisdiffae_p32_v2/blob/main/technical_report_fcdm_diffae.md)

## Citation

```bibtex
@misc{semdisdiffae_p32_v2,
  title   = {SemDisDiffAE p32 v2: a patch-32 FCDM diffusion autoencoder},
  author  = {data-archetype},
  email   = {data-archetype@proton.me},
  year    = {2026},
  month   = apr,
  url     = {https://huggingface.co/data-archetype/semdisdiffae_p32_v2},
}
```