---
license: apache-2.0
tags:
- diffusion
- autoencoder
- image-reconstruction
- latent-space
- dino
- pytorch
---
# data-archetype/dinac_ae

**DINAC-AE** is a **DIN**O-**A**ligned **C**lass-token **A**uto**E**ncoder.
It follows the [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae)
family: patch-16 spatial latents, a variance-preserving (VP) diffusion decoder,
and DINO-aligned representations.

Relative to SemDisDiffAE, DINAC-AE changes the encoder from FCDM blocks to a
6-block ViT/DiT-style transformer encoder and uses DINOv3 ViT-B/16 alignment.
The latent-to-DINO alignment head is extended to predict the DINO class token
as well as patch tokens. `predict_class(latents)` exposes that class-token
feature directly from latents.

## 2k PSNR Benchmark
| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
|---|---:|---:|---:|---:|---:|
| dinac_ae | `35.19` | `4.53` | `35.06` | `28.02` | `42.43` |
| FLUX.2 VAE | `36.28` | `4.53` | `36.07` | `28.89` | `43.63` |

Evaluated on `2000` validation images.
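The columns of the table can be reproduced from per-image PSNR values along the following lines. This is a minimal sketch, not the benchmark's actual evaluation code: `psnr_db`, `percentile`, and `summarize` are hypothetical helper names, the data range of `2.0` assumes images kept in `[-1, 1]`, and the population standard deviation and linear-interpolation percentiles are assumptions, since the card does not state which conventions were used.

```python
import math
import statistics


def psnr_db(mse: float, data_range: float = 2.0) -> float:
    """PSNR in dB from a per-image mean squared error.

    data_range is the width of the pixel value range: 2.0 for images in
    [-1, 1] (assumed here), 1.0 for [0, 1], 255.0 for uint8.
    """
    if mse <= 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


def percentile(sorted_values: list[float], q: float) -> float:
    """Linear-interpolation percentile of pre-sorted values, q in [0, 100]."""
    rank = (len(sorted_values) - 1) * q / 100.0
    lo = math.floor(rank)
    hi = math.ceil(rank)
    frac = rank - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def summarize(psnrs: list[float]) -> dict[str, float]:
    """Mean, std, median, P5 and P95 of per-image PSNR values (all in dB)."""
    ordered = sorted(psnrs)
    return {
        "mean": statistics.fmean(ordered),
        "std": statistics.pstdev(ordered),  # population std (assumption)
        "median": statistics.median(ordered),
        "p5": percentile(ordered, 5.0),
        "p95": percentile(ordered, 95.0),
    }
```

Note that the mean here is taken over per-image dB values, which is not the same as the PSNR of the mean squared error pooled over all images.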
DINAC-AE targets a compromise between high reconstruction quality, a learnable
latent space with KL-like variance expansion, DINOv3 alignment, and robustness
to local token errors.

The [results viewer](https://huggingface.co/spaces/data-archetype/dinac_ae-results)
shows the 39-image reconstruction set with DINAC-AE and FLUX.2 VAE
reconstructions, RGB differences, and latent PCA.

Rechecking the released export on that 39-image set gives `35.15 dB` mean PSNR
(`25.73 dB` min, `45.99 dB` max).

[Full technical report](https://huggingface.co/data-archetype/dinac_ae/blob/main/technical_report_dinac_ae.md)

## Encode Throughput

Measured on an `NVIDIA GeForce RTX 5090` in `bfloat16`, averaging repeated
batches per resolution.

| Resolution | Batch Size | dinac_ae encode (ms/batch) | FLUX.2 encode (ms/batch) | dinac_ae peak VRAM (MiB) | FLUX.2 peak VRAM (MiB) | Speedup vs FLUX.2 | Peak VRAM Reduction vs FLUX.2 |
|---:|---:|---:|---:|---:|---:|---:|---:|
| `256x256` | `128` | `50` | `383` | `1,637` | `12,511` | `7.62x` | `86.9%` |
| `512x512` | `32` | `53` | `354` | `1,639` | `12,511` | `6.72x` | `86.9%` |

The transformer encoder is slightly slower and larger than the full_capacitor
FCDM encoder, but remains much faster and much smaller than the FLUX.2 VAE
encoder.
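The derived columns follow from the measured ones by simple arithmetic. The sketch below recomputes them from the rounded table values; the tuple layout and function names are illustrative only. Speedups recomputed this way (about `7.66x` and `6.68x`) differ slightly from the table, presumably because the table was derived from unrounded timings.

```python
# Rounded values copied from the throughput table:
# (resolution, batch size, dinac_ae ms, FLUX.2 ms, dinac_ae MiB, FLUX.2 MiB)
ROWS = [
    ("256x256", 128, 50.0, 383.0, 1637.0, 12511.0),
    ("512x512", 32, 53.0, 354.0, 1639.0, 12511.0),
]


def speedup(ours_ms: float, baseline_ms: float) -> float:
    """How many times faster our encoder is than the baseline."""
    return baseline_ms / ours_ms


def vram_reduction_pct(ours_mib: float, baseline_mib: float) -> float:
    """Peak VRAM saved relative to the baseline, in percent."""
    return 100.0 * (1.0 - ours_mib / baseline_mib)


def images_per_second(batch_size: int, ms_per_batch: float) -> float:
    """Encode throughput implied by a per-batch latency."""
    return batch_size / (ms_per_batch / 1000.0)


for res, batch, ours_ms, flux_ms, ours_mib, flux_mib in ROWS:
    print(
        f"{res}: {speedup(ours_ms, flux_ms):.2f}x faster, "
        f"{vram_reduction_pct(ours_mib, flux_mib):.1f}% less peak VRAM, "
        f"{images_per_second(batch, ours_ms):.0f} images/s"
    )
```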
## Latent Interface
- `encode()` returns latents in DINAC-AE's own whitened latent space.
- `decode()` expects that same whitened latent space and dewhitens internally.
- `predict_class()` expects the same whitened latent space, dewhitens
  internally, and predicts a DINOv3 ViT-B/16 class-token feature.
- `whiten()` and `dewhiten()` are exposed for explicit control.
- `encode_posterior()` returns the raw exported posterior before whitening.
- `DinacAEInferenceConfig.num_steps` counts decoder evaluations directly:
  `num_steps=1` means a single function evaluation (NFE = 1).

The export ships weights in `float32`. The recommended and default runtime path
is `bfloat16` AMP for the main encoder, decoder, and class-token path, with
`float32` retained for sensitive operations such as whitening/dewhitening,
normalization math, RoPE frequency construction, and VP diffusion schedule
helpers.
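The mixed-precision split described above can be illustrated with plain PyTorch. This is a sketch under assumptions, not DINAC-AE's implementation: the `Linear` layer stands in for the main network path, and the per-channel `mean`/`std` affine whitening is a hypothetical placeholder for whatever statistics the export actually stores. It runs on CPU so it can be tried without a GPU; on CUDA the same pattern applies with `device_type="cuda"`.

```python
import torch

torch.manual_seed(0)

# Stand-in for the main network path; weights stay in float32, as shipped.
encoder = torch.nn.Linear(16, 8)

# Hypothetical per-channel whitening statistics, kept in float32.
mean = torch.zeros(8, dtype=torch.float32)
std = torch.full((8,), 2.0, dtype=torch.float32)

x = torch.randn(4, 16)

with torch.inference_mode():
    # bfloat16 autocast for the heavy matmul work.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        raw = encoder(x)

    # Sensitive math runs outside autocast, explicitly upcast to float32.
    whitened = (raw.float() - mean) / std
    dewhitened = whitened * std + mean

print(raw.dtype, whitened.dtype)
```

Keeping the whitening statistics and the arithmetic on them in `float32` avoids compounding `bfloat16` rounding in operations whose output feeds every later stage.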
## Usage
```python
import torch
from dinac_ae import DinacAE, DinacAEInferenceConfig

device = "cuda"
model = DinacAE.from_pretrained(
    "data-archetype/dinac_ae",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [1, 3, H, W] in [-1, 1], H and W divisible by 16

with torch.inference_mode():
    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
    class_token = model.predict_class(latents)
    recon = model.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=DinacAEInferenceConfig(num_steps=1),
    )
```
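The `image = ...` placeholder above needs a float tensor in `[-1, 1]` whose height and width are multiples of 16. One possible way to get there from a `uint8` image is sketched below. `to_model_input` and `to_uint8` are hypothetical helpers, not part of the `dinac_ae` package, and center-cropping (rather than resizing or padding) is just one assumed choice.

```python
import torch

PATCH = 16  # patch size stated on this card


def to_model_input(image_u8: torch.Tensor) -> torch.Tensor:
    """Convert an [H, W, 3] uint8 image to a [1, 3, H', W'] float tensor in [-1, 1].

    H' and W' are H and W rounded down to a multiple of PATCH via center crop.
    """
    if image_u8.dtype != torch.uint8 or image_u8.ndim != 3 or image_u8.shape[-1] != 3:
        raise ValueError("expected an [H, W, 3] uint8 tensor")
    h, w = image_u8.shape[:2]
    new_h, new_w = (h // PATCH) * PATCH, (w // PATCH) * PATCH
    if new_h == 0 or new_w == 0:
        raise ValueError("image is smaller than one patch")
    top, left = (h - new_h) // 2, (w - new_w) // 2
    cropped = image_u8[top : top + new_h, left : left + new_w]
    # HWC uint8 in [0, 255] -> CHW float32 in [-1, 1], plus a batch dimension.
    chw = cropped.permute(2, 0, 1).to(torch.float32)
    return (chw / 127.5 - 1.0).unsqueeze(0)


def to_uint8(recon: torch.Tensor) -> torch.Tensor:
    """Map a [B, 3, H, W] tensor in [-1, 1] back to uint8 [B, H, W, 3]."""
    scaled = (recon.float().clamp(-1.0, 1.0) + 1.0) * 127.5
    return scaled.round().to(torch.uint8).permute(0, 2, 3, 1)
```

With a patch size of 16, an input prepared this way corresponds to a latent grid of `H' / 16` by `W' / 16` spatial positions.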
## Details

- DINAC-AE uses a `6`-block ViT/DiT-style transformer encoder and an `8`-block
  FCDM decoder.
- Patch size is `16`, model width is `896`, and latent width is `128`.
- The DINO alignment head predicts spatial patch tokens and is extended with a
  class-token output in DINOv3 ViT-B/16 feature space.
- The class-token output is used to improve semantic organization of the latent
  space and to support FD-loss / Representation Fréchet Distance objectives
  directly in latent space.
- `predict_class(latents)` reaches a mean cosine similarity of `0.757458`
  against the frozen DINOv3 ViT-B/16 teacher class token on the same `2000`
  validation images.
- DINO alignment is applied directly to clean latent tokens. Robustness to
  local token errors is handled by random-token logSNR offset regularization.
- Results viewer: https://huggingface.co/spaces/data-archetype/dinac_ae-results
- Related: [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae),
  [full_capacitor](https://huggingface.co/data-archetype/full_capacitor),
  [capacitor_decoder](https://huggingface.co/data-archetype/capacitor_decoder)
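The class-token agreement figure in the list above is a mean cosine similarity between predicted and teacher features. A minimal sketch of that metric follows. `mean_cosine_similarity` is a hypothetical helper, and the inputs are assumed to be `[N, D]` matrices holding `predict_class(latents)` outputs and frozen DINOv3 class tokens for the same `N` images; the card does not document the exact evaluation code.

```python
import torch
import torch.nn.functional as F


def mean_cosine_similarity(pred: torch.Tensor, teacher: torch.Tensor) -> float:
    """Average per-image cosine similarity between two [N, D] feature matrices."""
    if pred.ndim != 2 or pred.shape != teacher.shape:
        raise ValueError("expected two [N, D] tensors of identical shape")
    # Compare in float32 so bfloat16 rounding does not leak into the metric.
    sims = F.cosine_similarity(pred.float(), teacher.float(), dim=-1)
    return sims.mean().item()


# Toy usage with stand-in features: one aligned pair, one orthogonal pair.
pred = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
teacher = torch.tensor([[1.0, 0.0], [1.0, 0.0]])
print(mean_cosine_similarity(pred, teacher))
```

Cosine similarity ignores feature magnitude, so this metric measures directional agreement with the teacher only.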
## Citation
```bibtex
@misc{dinac_ae,
  title  = {DINAC-AE: a DINO-aligned class-token diffusion autoencoder},
  author = {data-archetype},
  email  = {data-archetype@proton.me},
  year   = {2026},
  month  = may,
  url    = {https://huggingface.co/data-archetype/dinac_ae},
}
```