---
license: apache-2.0
tags:
- diffusion
- autoencoder
- image-reconstruction
- latent-space
- pytorch
- fcdm
library_name: fcdm_diffae
---

# data-archetype/semdisdiffae_p32_v2

**semdisdiffae_p32_v2** is a native patch-32 SemDisDiffAE diffusion autoencoder. It keeps the same FCDM decoder family as [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae), with an 8-block encoder, an 8-block decoder, and a 384-channel spatial latent at `H/32 x W/32`.

Relative to the original SemDisDiffAE, this model is optimized for a lower-resolution latent grid and downstream latent diffusion: patch size `32` instead of `16`, `384` latent channels instead of `128`, an 8-block encoder instead of a 4-block encoder, and DINOv3 ConvNeXt-B semantic alignment instead of the original DINO semantic alignment setup.

For details, see the [semdisdiffae_p32_v2 technical report](https://huggingface.co/data-archetype/semdisdiffae_p32_v2/blob/main/technical_report_fcdm_diffae.md). For additional shared FCDM / VP decoder background, see the original [SemDisDiffAE technical report](https://huggingface.co/data-archetype/semdisdiffae/blob/main/technical_report_semantic.md).

The p32 checkpoint was trained at `384` resolution rather than the original `256`-scale recipe. With patch size `32`, this gives a `12x12` latent grid instead of `8x8`, reducing the impact of 7x7-convolution border effects during training.

## 2k PSNR Benchmark

Evaluated on `2000` images, split as `1333` Pexels images and `667` Amazon book covers. Reconstruction uses the default 1-step VP/DDIM path in `bfloat16`.
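The statistics below follow the standard PSNR definition over `[-1, 1]` images (peak-to-peak range of `2.0`). As a minimal sketch of how such numbers can be reproduced, the hypothetical helpers `psnr_db` and `summarize` below are illustrative only and are not the repo's evaluation script:

```python
import numpy as np

def psnr_db(original: np.ndarray, recon: np.ndarray, data_range: float = 2.0) -> float:
    """Per-image PSNR in dB; data_range=2.0 matches images scaled to [-1, 1]."""
    mse = np.mean((original.astype(np.float64) - recon.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(data_range**2 / mse))

def summarize(psnrs: np.ndarray) -> dict:
    """Aggregate per-image PSNRs into the statistics reported in the table."""
    return {
        "mean": float(psnrs.mean()),
        "std": float(psnrs.std()),
        "median": float(np.median(psnrs)),
        "p5": float(np.percentile(psnrs, 5)),
        "p95": float(np.percentile(psnrs, 95)),
    }
```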
| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
|---|---:|---:|---:|---:|---:|
| semdisdiffae_p32_v2 | `36.06` | `5.47` | `35.80` | `27.63` | `45.02` |

## Reconstruction Viewer

The 39-image reconstruction viewer shows originals, semdisdiffae_p32_v2 reconstructions, RGB error deltas, and latent PCA side by side, with FLUX.2 VAE included for comparison: [semdisdiffae_p32_v2 reconstruction viewer](https://huggingface.co/spaces/data-archetype/semdisdiffae_p32_v2-results).

## Encode Throughput

Measured on an `NVIDIA GeForce RTX 5090` in `bfloat16`, averaging `20` repeated batched `encode()` calls after `5` warmup batches.

| Resolution | Batch Size | Mean (ms/batch) | ms/image | Images/s | Peak Allocated VRAM |
|---:|---:|---:|---:|---:|---:|
| `256x256` | `128` | `12.38` | `0.097` | `10336.1` | `567.8 MiB` |
| `512x512` | `128` | `53.49` | `0.418` | `2393.0` | `1353.8 MiB` |

## Decode Latency

Measured on the same `NVIDIA GeForce RTX 5090` in `bfloat16`. This is decode-only latency: images are encoded once, latents are cached, and timing is sequential batch-1 `decode()` over the cached latent set with the default 1-step sampler and PDG disabled.

| Resolution | Batch Size | Images | Mean (ms/image) | Images/s | Peak Allocated VRAM |
|---:|---:|---:|---:|---:|---:|
| `512x512` | `1` | `20` | `3.89` | `256.8` | `340.8 MiB` |
| `1024x1024` | `1` | `20` | `9.79` | `102.2` | `409.6 MiB` |
| `2048x2048` | `1` | `20` | `51.90` | `19.3` | `720.9 MiB` |

## Latent Interface

- `encode()` returns whitened latents using the model's saved running statistics.
- `decode()` expects those whitened latents and dewhitens internally.
- `whiten()` and `dewhiten()` expose the transform explicitly.
- `encode_posterior()` returns the raw exported posterior before whitening.

Weights are stored in `float32`.
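The bullets above describe a whitening contract: whitened latents have per-channel zero mean and unit variance under the saved running statistics, and dewhitening inverts the transform. A minimal NumPy sketch of that contract is below; the standalone `whiten` and `dewhiten` functions here are hypothetical stand-ins, and the actual `FCDMDiffAE` implementation may differ in detail:

```python
import numpy as np

def whiten(latents: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """latents: [B, C, h, w]; mean/std: saved per-channel running stats, shape [C]."""
    return (latents - mean[None, :, None, None]) / std[None, :, None, None]

def dewhiten(whitened: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Exact inverse of whiten(); decode() is documented to apply this internally."""
    return whitened * std[None, :, None, None] + mean[None, :, None, None]
```

Because `decode()` dewhitens internally, latents passed between `encode()` and `decode()` stay in the whitened space; the explicit functions matter mainly when a downstream latent diffusion model needs the raw (dewhitened) latent scale.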
The recommended runtime path is `bfloat16` for the encoder and decoder, while whitening, dewhitening, posterior moment math, VP schedule math, and sampler state updates are kept in `float32`.

## Usage

```python
import torch

from fcdm_diffae import FCDMDiffAE, FCDMDiffAEInferenceConfig

device = "cuda"
model = FCDMDiffAE.from_pretrained(
    "data-archetype/semdisdiffae_p32_v2",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 32
with torch.inference_mode():
    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
    recon = model.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=FCDMDiffAEInferenceConfig(num_steps=1),
    )
```

## Details

- Architecture: patch-32 FCDM DiffAE, `156.6M` parameters, `384` latent channels.
- Encoder / decoder depth: `8` blocks each.
- Training resolution: `384` AR buckets and `384x384` square crops.
- Semantic alignment: DINOv3 ConvNeXt-B/LVD1689M, 50/50 MSE plus negative cosine.
- Posterior: diagonal Gaussian with VP log-SNR parameterization.
- Export variant: EMA weights.
- [Technical report](https://huggingface.co/data-archetype/semdisdiffae_p32_v2/blob/main/technical_report_fcdm_diffae.md)

## Citation

```bibtex
@misc{semdisdiffae_p32_v2,
  title  = {SemDisDiffAE p32 v2: a patch-32 FCDM diffusion autoencoder},
  author = {data-archetype},
  email  = {data-archetype@proton.me},
  year   = {2026},
  month  = apr,
  url    = {https://huggingface.co/data-archetype/semdisdiffae_p32_v2},
}
```