---
license: apache-2.0
tags:
  - diffusion
  - autoencoder
  - image-reconstruction
  - latent-space
  - dino
  - pytorch
---

# data-archetype/dinac_ae

**DINAC-AE** is a **DIN**O-**A**ligned **C**lass-token **A**uto**E**ncoder.
It follows the [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae)
family: patch-16 spatial latents, a VP diffusion decoder, and DINO-aligned
representations.

Relative to SemDisDiffAE, DINAC-AE changes the encoder from FCDM blocks to a
6-block ViT/DiT-style transformer encoder and uses DINOv3 ViT-B/16 alignment.
The latent-to-DINO alignment head is extended to predict the DINO class token
as well as patch tokens. `predict_class(latents)` exposes that class-token
feature directly from latents.

## 2k PSNR Benchmark

| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
|---|---:|---:|---:|---:|---:|
| dinac_ae | `35.19` | `4.53` | `35.06` | `28.02` | `42.43` |
| FLUX.2 VAE | `36.28` | `4.53` | `36.07` | `28.89` | `43.63` |

Evaluated on `2000` validation images.
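
The columns in the table can be reproduced from a list of per-image PSNR values. The sketch below is a minimal, standard-library illustration, not the evaluation script used for this card. It assumes PSNR is computed per image on pixels in `[-1, 1]` (peak-to-peak range `2.0`), that the standard deviation is the population form, and that P5/P95 are linearly interpolated percentiles. `psnr_db` and `summarize` are illustrative names.

```python
import math
import statistics


def psnr_db(reference, reconstruction, data_range=2.0):
    """PSNR in dB for two equal-length flat pixel sequences.

    data_range=2.0 assumes pixels in [-1, 1] (an assumption of this sketch).
    """
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


def summarize(psnr_values):
    """Mean, std, median, 5th and 95th percentile of per-image PSNR values."""
    # quantiles(n=20) returns 19 cut points: the first is P5, the last is P95.
    cuts = statistics.quantiles(psnr_values, n=20, method="inclusive")
    return {
        "mean": statistics.fmean(psnr_values),
        "std": statistics.pstdev(psnr_values),  # population std (assumed)
        "median": statistics.median(psnr_values),
        "p5": cuts[0],
        "p95": cuts[-1],
    }
```

Summarizing per-image values rather than pooling all pixels keeps the P5 column meaningful as a view of the hardest images.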

DINAC-AE trails FLUX.2 VAE by about `1.1 dB` mean PSNR on this benchmark. It
targets a compromise among high reconstruction quality, a learnable latent
space with KL-like variance expansion, DINOv3 alignment, and robustness to
local token errors.

The [results viewer](https://huggingface.co/spaces/data-archetype/dinac_ae-results)
shows the 39-image reconstruction set with DINAC-AE and FLUX.2 VAE
reconstructions, RGB differences, and latent PCA.
Re-evaluating the released export on that 39-image set gives `35.15 dB` mean
PSNR (`25.73 dB` min, `45.99 dB` max).

[Full technical report](https://huggingface.co/data-archetype/dinac_ae/blob/main/technical_report_dinac_ae.md)

## Encode Throughput

Measured on an `NVIDIA GeForce RTX 5090` in `bfloat16`, averaging repeated
batches per resolution.

| Resolution | Batch Size | dinac_ae encode (ms/batch) | FLUX.2 encode (ms/batch) | dinac_ae peak VRAM (MiB) | FLUX.2 peak VRAM (MiB) | Speedup vs FLUX.2 | Peak VRAM Reduction vs FLUX.2 |
|---:|---:|---:|---:|---:|---:|---:|---:|
| `256x256` | `128` | `50` | `383` | `1,637` | `12,511` | `7.62x` | `86.9%` |
| `512x512` | `32` | `53` | `354` | `1,639` | `12,511` | `6.72x` | `86.9%` |

The transformer encoder is slightly slower and larger than the full_capacitor
FCDM encoder, but remains much faster and much smaller than the FLUX.2 VAE
encoder.
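
The derived columns follow from simple ratios of the measured ones. The sketch below recomputes them from the `256x256` row. The helper names are illustrative, and because the table's timings are rounded to whole milliseconds, the recomputed speedup comes out near `7.66x` rather than the listed `7.62x`, presumably because the table was derived from unrounded timings.

```python
def encode_stats(batch_size, ms_per_batch, peak_vram_mib):
    """Per-image latency (ms) and throughput (images/s) from one table row."""
    return {
        "ms_per_image": ms_per_batch / batch_size,
        "images_per_s": batch_size / (ms_per_batch / 1000.0),
        "peak_vram_mib": peak_vram_mib,
    }


def compare(ours, baseline):
    """Speedup and fractional peak-VRAM reduction of `ours` vs `baseline`."""
    return {
        "speedup": ours["images_per_s"] / baseline["images_per_s"],
        "vram_reduction": 1.0 - ours["peak_vram_mib"] / baseline["peak_vram_mib"],
    }


# Values copied from the 256x256 row of the table (timings rounded to 1 ms).
dinac = encode_stats(batch_size=128, ms_per_batch=50, peak_vram_mib=1637)
flux2 = encode_stats(batch_size=128, ms_per_batch=383, peak_vram_mib=12511)
result = compare(dinac, flux2)

print(f"{dinac['images_per_s']:.0f} images/s")
print(f"{result['speedup']:.2f}x speedup")
print(f"{result['vram_reduction']:.1%} less peak VRAM")
```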

## Latent Interface

- `encode()` returns latents in DINAC-AE's own whitened latent space.
- `decode()` expects latents in that same whitened space and dewhitens
  internally.
- `predict_class()` expects the same whitened latents, dewhitens internally,
  and predicts a DINOv3 ViT-B/16 class-token feature.
- `whiten()` and `dewhiten()` are exposed for explicit control.
- `encode_posterior()` returns the raw exported posterior before whitening.
- `DinacAEInferenceConfig.num_steps` counts decoder evaluations directly:
  `num_steps=1` means one NFE.
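
For sizing buffers around this interface, the latent grid can be estimated from the image size. The sketch below uses the patch size (`16`) and latent width (`128`) listed under Details. It assumes a channels-first `[B, 128, H/16, W/16]` layout, which this card does not state explicitly, so the result should be checked against the shape that `encode()` actually returns. `latent_shape` and `compression_factor` are illustrative helpers, not part of the `dinac_ae` package.

```python
PATCH_SIZE = 16      # from the Details section
LATENT_WIDTH = 128   # latent channels per token, from the Details section


def latent_shape(batch, height, width):
    """Expected latent shape, ASSUMING a channels-first [B, C, H/16, W/16] layout."""
    if height % PATCH_SIZE or width % PATCH_SIZE:
        raise ValueError("height and width must be divisible by 16")
    return (batch, LATENT_WIDTH, height // PATCH_SIZE, width // PATCH_SIZE)


def compression_factor(rgb_channels=3):
    """Ratio of RGB values per 16x16 patch to latent values per token."""
    return (PATCH_SIZE * PATCH_SIZE * rgb_channels) / LATENT_WIDTH
```

Under these numbers each 16x16 RGB patch (768 values) maps to 128 latent values, a 6x reduction in element count.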

The export ships weights in `float32`. The recommended and default runtime path
is `bfloat16` AMP for the main encoder, decoder, and class-token path, with
`float32` retained for sensitive operations such as whitening/dewhitening,
normalization math, RoPE frequency construction, and VP diffusion schedule
helpers.

## Usage

```python
import torch

from dinac_ae import DinacAE, DinacAEInferenceConfig


device = "cuda"
model = DinacAE.from_pretrained(
    "data-archetype/dinac_ae",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [1, 3, H, W] in [-1, 1], H and W divisible by 16

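with torch.inference_mode():
    # encode() returns whitened latents, ready for decode() and predict_class().
    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
    # Predicted DINOv3 ViT-B/16 class-token feature.
    class_token = model.predict_class(latents)
    # num_steps=1 runs a single decoder evaluation (one NFE).
    recon = model.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=DinacAEInferenceConfig(num_steps=1),
    )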
```
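
The snippet above leaves image loading open. Its input contract is values in `[-1, 1]` with height and width divisible by 16. The sketch below shows one way to meet it with plain arithmetic, assuming 8-bit source images and a center crop. These helpers are hypothetical and not part of the `dinac_ae` package, and resizing or padding may be preferable to cropping depending on the application.

```python
def crop_box(height, width, multiple=16):
    """Center-crop box (top, left, new_height, new_width) with sides divisible by `multiple`."""
    new_h = (height // multiple) * multiple
    new_w = (width // multiple) * multiple
    if new_h == 0 or new_w == 0:
        raise ValueError("image is smaller than one patch")
    return ((height - new_h) // 2, (width - new_w) // 2, new_h, new_w)


def to_model_range(pixel_u8):
    """Map an 8-bit pixel value in [0, 255] to [-1, 1]."""
    return pixel_u8 / 127.5 - 1.0


def to_u8(value):
    """Inverse of to_model_range, clamped and rounded, for saving reconstructions."""
    clamped = max(-1.0, min(1.0, value))
    return int(round((clamped + 1.0) * 127.5))
```

The same scaling applies elementwise to a tensor; the crop box gives the slice bounds to apply to the last two dimensions before calling `encode()`.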

## Details

- DINAC-AE uses a `6`-block ViT/DiT-style transformer encoder and an `8`-block
  FCDM decoder.
- Patch size is `16`, model width is `896`, and latent width is `128`.
- The DINO alignment head predicts spatial patch tokens and is extended with a
  class-token output in DINOv3 ViT-B/16 feature space.
- The class-token output is used to improve semantic organization of the latent
  space and to support FD-loss / Representation Fréchet Distance objectives
  directly in latent space.
- `predict_class(latents)` reaches mean cosine similarity `0.757458` against
  the frozen DINOv3 ViT-B/16 teacher class token on the same `2000` images.
- DINO alignment is applied directly to clean latent tokens. Robustness to
  local token errors is handled by random-token logSNR offset regularization.
- [Results viewer](https://huggingface.co/spaces/data-archetype/dinac_ae-results)
- Related: [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae),
  [full_capacitor](https://huggingface.co/data-archetype/full_capacitor),
  [capacitor_decoder](https://huggingface.co/data-archetype/capacitor_decoder)
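
The class-token agreement figure above is a mean cosine similarity. The sketch below shows that metric on plain Python sequences, assuming the reported value is the per-image cosine similarity averaged over the evaluation set. The function names are illustrative. With real model outputs the same computation would typically run on tensors, for example via `torch.nn.functional.cosine_similarity`.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_cosine_similarity(predicted, teacher):
    """Average per-image cosine similarity between predicted and teacher features."""
    if len(predicted) != len(teacher) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    sims = [cosine_similarity(p, t) for p, t in zip(predicted, teacher)]
    return sum(sims) / len(sims)
```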

## Citation

```bibtex
@misc{dinac_ae,
  title   = {DINAC-AE: a DINO-aligned class-token diffusion autoencoder},
  author  = {data-archetype},
  email   = {data-archetype@proton.me},
  year    = {2026},
  month   = may,
  url     = {https://huggingface.co/data-archetype/dinac_ae},
}
```