---
license: apache-2.0
tags:
- diffusion
- autoencoder
- image-reconstruction
- latent-space
- dino
- pytorch
---
# data-archetype/dinac_ae
**DINAC-AE** is a **DIN**O-**A**ligned **C**lass-token **A**uto**E**ncoder.
It follows the [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae)
family: patch-16 spatial latents, a VP diffusion decoder, and DINO-aligned
representations.
Relative to SemDisDiffAE, DINAC-AE changes the encoder from FCDM blocks to a
6-block ViT/DiT-style transformer encoder and uses DINOv3 ViT-B/16 alignment.
The latent-to-DINO alignment head is extended to predict the DINO class token
as well as patch tokens. `predict_class(latents)` exposes that class-token
feature directly from latents.
## 2k PSNR Benchmark
| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
|---|---:|---:|---:|---:|---:|
| dinac_ae | `35.19` | `4.53` | `35.06` | `28.02` | `42.43` |
| FLUX.2 VAE | `36.28` | `4.53` | `36.07` | `28.89` | `43.63` |

Evaluated on `2000` validation images.
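
For readers who want to reproduce summary columns like the ones in the table, the following is a minimal sketch of per-image PSNR and the summary statistics. It is not the benchmark script: the `[-1, 1]` pixel range (hence `data_range=2.0`), the population standard deviation, and the helper names `psnr_db` and `summarize` are all assumptions made for illustration.

```python
import numpy as np


def psnr_db(reference, reconstruction, data_range=2.0):
    """PSNR in dB for one image. data_range=2.0 assumes pixels in [-1, 1]."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))


def summarize(psnrs):
    """Mean, std, median, and 5th/95th percentiles of per-image PSNR values."""
    values = np.asarray(psnrs, dtype=np.float64)
    return {
        "mean": float(values.mean()),
        "std": float(values.std()),  # population std; an assumption
        "median": float(np.median(values)),
        "p5": float(np.percentile(values, 5)),
        "p95": float(np.percentile(values, 95)),
    }


# Toy check: a constant error of 0.02 on a [-1, 1] image has MSE 4e-4.
reference = np.zeros((3, 16, 16))
reconstruction = np.full((3, 16, 16), 0.02)
toy_psnr = psnr_db(reference, reconstruction)
toy_summary = summarize([30.0, 35.0, 40.0])
```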
DINAC-AE targets a compromise between high reconstruction quality, a learnable
latent space with KL-like variance expansion, DINOv3 alignment, and robustness
to local token errors.

The [results viewer](https://huggingface.co/spaces/data-archetype/dinac_ae-results)
shows the 39-image reconstruction set with DINAC-AE and FLUX.2 VAE
reconstructions, RGB differences, and latent PCA.
A recheck of the released export on that 39-image set gives `35.15 dB` mean PSNR
(`25.73 dB` min, `45.99 dB` max).

See the [full technical report](https://huggingface.co/data-archetype/dinac_ae/blob/main/technical_report_dinac_ae.md)
for details.
## Encode Throughput
Measured on an `NVIDIA GeForce RTX 5090` in `bfloat16`, averaging repeated
batches per resolution.
| Resolution | Batch Size | dinac_ae encode (ms/batch) | FLUX.2 encode (ms/batch) | dinac_ae peak VRAM (MiB) | FLUX.2 peak VRAM (MiB) | Speedup vs FLUX.2 | Peak VRAM Reduction vs FLUX.2 |
|---:|---:|---:|---:|---:|---:|---:|---:|
| `256x256` | `128` | `50` | `383` | `1,637` | `12,511` | `7.62x` | `86.9%` |
| `512x512` | `32` | `53` | `354` | `1,639` | `12,511` | `6.72x` | `86.9%` |

The transformer encoder is slightly slower and larger than the full_capacitor
FCDM encoder, but remains much faster and much smaller than the FLUX.2 VAE
encoder.
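
The derived columns can be recomputed from the table with simple arithmetic, as in the sketch below. The helper names are hypothetical. Because the table reports timings rounded to whole milliseconds, the recomputed speedups (about `7.66x` and `6.68x`) differ slightly from the listed values, presumably because those were derived from unrounded timings; that explanation is an assumption.

```python
def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def vram_reduction_pct(baseline_mib, candidate_mib):
    """Peak-memory reduction of the candidate relative to the baseline, in percent."""
    return 100.0 * (1.0 - candidate_mib / baseline_mib)


def images_per_second(batch_size, ms_per_batch):
    """Encode throughput implied by a batch size and a per-batch latency."""
    return batch_size / (ms_per_batch / 1000.0)


# (resolution, batch, dinac_ae ms, FLUX.2 ms, dinac_ae MiB, FLUX.2 MiB),
# copied from the rounded values in the table above.
rows = [
    ("256x256", 128, 50, 383, 1637, 12511),
    ("512x512", 32, 53, 354, 1639, 12511),
]

for resolution, batch, dinac_ms, flux_ms, dinac_mib, flux_mib in rows:
    print(
        f"{resolution}: "
        f"{speedup(flux_ms, dinac_ms):.2f}x faster, "
        f"{vram_reduction_pct(flux_mib, dinac_mib):.1f}% less peak VRAM, "
        f"{images_per_second(batch, dinac_ms):.0f} images/s"
    )
```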
## Latent Interface
- `encode()` returns DINAC-AE's own whitened latent space.
- `decode()` expects that same whitened latent space and dewhitens internally.
- `predict_class()` expects the same whitened latent space, dewhitens
internally, and predicts a DINOv3 ViT-B/16 class-token feature.
- `whiten()` and `dewhiten()` are exposed for explicit control.
- `encode_posterior()` returns the raw exported posterior before whitening.
- `DinacAEInferenceConfig.num_steps` counts decoder evaluations directly:
`num_steps=1` means one NFE.
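
To make the whitened versus raw distinction concrete, here is a toy sketch of a whitening round trip. It assumes a simple per-channel affine transform with stored statistics and a `[B, 128, H/16, W/16]` latent layout; the export's actual transform, statistics, and tensor layout are not specified here and may differ. The functions below are standalone NumPy stand-ins, not the model's `whiten()` and `dewhiten()` methods.

```python
import numpy as np


def whiten(raw, channel_mean, channel_std):
    """Map raw latents [B, C, H, W] to zero mean, unit variance per channel."""
    mean = channel_mean[None, :, None, None]
    std = channel_std[None, :, None, None]
    return (raw - mean) / std


def dewhiten(white, channel_mean, channel_std):
    """Inverse of whiten(): restore the raw latent scale and offset."""
    mean = channel_mean[None, :, None, None]
    std = channel_std[None, :, None, None]
    return white * std + mean


rng = np.random.default_rng(0)
channel_mean = rng.normal(size=128)            # hypothetical stored statistics
channel_std = rng.uniform(0.5, 2.0, size=128)  # hypothetical stored statistics

# Synthetic "raw posterior" latents for four 256x256 images (16x16 token grid).
noise = rng.normal(size=(4, 128, 16, 16))
raw = channel_mean[None, :, None, None] + channel_std[None, :, None, None] * noise

white = whiten(raw, channel_mean, channel_std)
restored = dewhiten(white, channel_mean, channel_std)
```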

The export ships weights in `float32`. The recommended and default runtime path
is `bfloat16` AMP for the main encoder, decoder, and class-token path, with
`float32` retained for sensitive operations such as whitening/dewhitening,
normalization math, RoPE frequency construction, and VP diffusion schedule
helpers.
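
The general pattern of running bulk compute under `bfloat16` autocast while keeping selected math in `float32` can be sketched as follows. This illustrates the PyTorch mechanism only; it is not the export's internal code, and `mixed_precision_step` with its arguments is a hypothetical helper.

```python
import torch


def mixed_precision_step(x, weight, mean, std):
    """Matmul under bfloat16 autocast, then affine normalization in float32."""
    device_type = x.device.type
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        # Autocast runs the matmul in bfloat16 even though inputs are float32.
        hidden = x @ weight
        # Sensitive math: disable autocast locally and upcast explicitly.
        with torch.autocast(device_type=device_type, enabled=False):
            normalized = (hidden.float() - mean.float()) / std.float()
    return hidden, normalized


x = torch.randn(2, 8)
weight = torch.randn(8, 4)
hidden, normalized = mixed_precision_step(x, weight, torch.zeros(4), torch.ones(4))
print(hidden.dtype, normalized.dtype)
```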
## Usage
```python
import torch

from dinac_ae import DinacAE, DinacAEInferenceConfig

device = "cuda"
model = DinacAE.from_pretrained(
    "data-archetype/dinac_ae",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [1, 3, H, W] in [-1, 1], H and W divisible by 16

with torch.inference_mode():
    # Whitened latents; the same tensor feeds both calls below.
    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
    # DINOv3 ViT-B/16 class-token feature predicted directly from the latents.
    class_token = model.predict_class(latents)
    # num_steps=1 means a single decoder evaluation (one NFE).
    recon = model.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=DinacAEInferenceConfig(num_steps=1),
    )
```
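
The divisibility requirement follows from the patch-16 design: each 16x16 pixel patch maps to one latent token. The sketch below works out the resulting latent grid and the per-patch compression, assuming "latent width `128`" means 128 channels per token and assuming a channel-first `[B, C, H/16, W/16]` layout. Both helpers are hypothetical and not part of the package.

```python
def latent_shape(batch, height, width, patch_size=16, latent_channels=128):
    """Expected latent shape for an image batch (channel-first layout assumed)."""
    if height % patch_size or width % patch_size:
        raise ValueError(
            f"height and width must be divisible by {patch_size}, "
            f"got {height}x{width}"
        )
    return (batch, latent_channels, height // patch_size, width // patch_size)


def compression_ratio(patch_size=16, image_channels=3, latent_channels=128):
    """Pixel values per patch divided by latent values per token."""
    return (patch_size * patch_size * image_channels) / latent_channels


print(latent_shape(1, 512, 512))  # (1, 128, 32, 32)
print(compression_ratio())        # 6.0
```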
## Details
- DINAC-AE uses a `6`-block ViT/DiT-style transformer encoder and an `8`-block
FCDM decoder.
- Patch size is `16`, model width is `896`, and latent width is `128`.
- The DINO alignment head predicts spatial patch tokens and is extended with a
class-token output in DINOv3 ViT-B/16 feature space.
- The class-token output is used to improve semantic organization of the latent
space and to support FD-loss / Representation Frechet Distance objectives
directly in latent space.
- `predict_class(latents)` reaches mean cosine similarity `0.757458` against
the frozen DINOv3 ViT-B/16 teacher class token on the same `2000` images.
- DINO alignment is applied directly to clean latent tokens. Robustness to
local token errors is handled by random-token logSNR offset regularization.
- Results viewer: https://huggingface.co/spaces/data-archetype/dinac_ae-results
- Related: [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae),
[full_capacitor](https://huggingface.co/data-archetype/full_capacitor),
[capacitor_decoder](https://huggingface.co/data-archetype/capacitor_decoder)
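
The class-token agreement figure above is a mean cosine similarity. A minimal sketch of that metric is shown below, assuming predicted and teacher features are compared row by row as `[N, D]` arrays; how the reported number was actually computed (precision, any feature normalization) is not specified here, and `mean_cosine_similarity` is a hypothetical helper.

```python
import numpy as np


def mean_cosine_similarity(predicted, target, eps=1e-8):
    """Average per-row cosine similarity between two [N, D] feature arrays."""
    pred = np.asarray(predicted, dtype=np.float64)
    tgt = np.asarray(target, dtype=np.float64)
    dot = np.sum(pred * tgt, axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(tgt, axis=-1)
    # eps guards against division by zero for all-zero rows.
    return float(np.mean(dot / np.maximum(norms, eps)))


# Toy features: one identical pair, one orthogonal pair, one opposite pair.
predicted = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
teacher = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
toy_similarity = mean_cosine_similarity(predicted, teacher)
```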
## Citation
```bibtex
@misc{dinac_ae,
title = {DINAC-AE: a DINO-aligned class-token diffusion autoencoder},
author = {data-archetype},
email = {data-archetype@proton.me},
year = {2026},
month = may,
url = {https://huggingface.co/data-archetype/dinac_ae},
}
```