---
license: apache-2.0
tags:
  - diffusion
  - autoencoder
  - image-reconstruction
  - latent-space
  - dino
  - pytorch
---

# data-archetype/dinac_ae_d2

**DINAC-AE-D2** is a close variant of
[DINAC-AE](https://huggingface.co/data-archetype/dinac_ae). It keeps the same
patch-16 spatial latent interface, VP diffusion decoder, class-token prediction
API, and one-step default reconstruction path, but changes the teacher alignment
and encoder capacity:

- DINO alignment target: **DINOv2 ViT-B/14** feature space.
- Encoder: **8** ViT/DiT-style transformer blocks instead of DINAC-AE's 6.
- Decoder: unchanged 8-block FCDM decoder.

DINOv2-B is empirically less spatially smooth than DINOv3-B and preserves more
high-frequency information. In downstream diffusion experiments, this variant
has shown faster early convergence than the original DINAC-AE latent space.

## 2k PSNR Benchmark

| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
|---|---:|---:|---:|---:|---:|
| dinac_ae_d2 | `35.59` | `4.87` | `35.40` | `27.89` | `43.51` |
| dinac_ae | `35.19` | `4.53` | `35.06` | `28.02` | `42.43` |
| FLUX.2 VAE | `36.28` | `4.53` | `36.07` | `28.89` | `43.63` |
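
The card does not spell out the exact PSNR protocol, so the following is only a sketch of the standard definition, `10 * log10(data_range**2 / MSE)`, computed per image and then summarized into the columns of the table above. The function names `psnr` and `summarize` are illustrative and not part of the `dinac_ae` package; `data_range` must match the pixel scale in use (for example `2.0` for images in `[-1, 1]`), and the percentile interpolation used for the published numbers may differ.

```python
import math
import statistics


def psnr(reference, reconstruction, data_range=1.0):
    """PSNR in dB between two equal-length flat sequences of pixel values."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must contain the same number of values")
    # Mean squared error over all pixel values of one image.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


def summarize(per_image_psnr):
    """Mean, sample std, median, and 5th/95th percentiles of per-image PSNRs."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(per_image_psnr, n=20, method="inclusive")
    return {
        "mean": statistics.fmean(per_image_psnr),
        "std": statistics.stdev(per_image_psnr),
        "median": statistics.median(per_image_psnr),
        "p5": cuts[0],
        "p95": cuts[-1],
    }
```

Averaging per-image PSNR values, as assumed here, is not the same as computing PSNR from a pooled MSE, so the two conventions should not be mixed when comparing against the table.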


[Results viewer](https://huggingface.co/spaces/data-archetype/dinac_ae_d2-results)
shows the 39-image reconstruction set with DINAC-AE-D2 and FLUX.2 VAE
reconstructions, RGB differences, and latent PCA.
The 39-image set gives `35.46 dB` mean PSNR (`25.61` min, `46.69` max).

[DINAC-AE technical report](https://huggingface.co/data-archetype/dinac_ae/blob/main/technical_report_dinac_ae.md)
describes the training recipe used for this model. DINAC-AE-D2 follows the same
autoencoder training setup, with the teacher alignment changed to DINOv2 ViT-B/14
and the encoder depth increased from 6 to 8 blocks.

## Encode Throughput

Measured on an `NVIDIA GeForce RTX 5090` in `bfloat16`, averaging repeated
batches per resolution.

| Resolution | Batch Size | Model | Encode (ms/batch) | ms/image | Images/s | Peak VRAM (MiB) | Speedup vs FLUX.2 | Peak VRAM Reduction vs FLUX.2 |
|---:|---:|---|---:|---:|---:|---:|---:|---:|
| `256x256` | `128` | dinac_ae_d2 | `69.56` | `0.543` | `1840.0` | `1606.5` | `4.92x` | `87.2%` |
| `256x256` | `128` | dinac_ae | `50.25` | `0.393` | `2547.4` | `1569.7` | `6.80x` | `87.5%` |
| `256x256` | `128` | FLUX.2 VAE | `341.94` | `2.671` | `374.3` | `12533.8` | `1.00x` | `0.0%` |
| `512x512` | `32` | dinac_ae_d2 | `75.09` | `2.347` | `426.2` | `1606.7` | `4.74x` | `87.2%` |
| `512x512` | `32` | dinac_ae | `53.09` | `1.659` | `602.7` | `1570.0` | `6.70x` | `87.5%` |
| `512x512` | `32` | FLUX.2 VAE | `355.64` | `11.114` | `90.0` | `12533.8` | `1.00x` | `0.0%` |

The encoder is about 1.4x slower than DINAC-AE's encoder because it uses 8
transformer blocks instead of 6, but it remains roughly 4.7x to 4.9x faster than
the FLUX.2 VAE encoder and uses about 87% less peak VRAM.
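
The derived columns in the table follow from the measured batch time and peak VRAM. As a minimal sketch (the helper name `derived_metrics` is hypothetical and not part of any benchmark script shipped with the model), the arithmetic is:

```python
def derived_metrics(batch_size, ms_per_batch, peak_vram_mib,
                    baseline_ms_per_batch, baseline_peak_vram_mib):
    """Recompute the derived throughput columns from raw measurements."""
    return {
        # Average latency attributed to a single image in the batch.
        "ms_per_image": ms_per_batch / batch_size,
        # Images processed per second at this batch size.
        "images_per_s": batch_size * 1000.0 / ms_per_batch,
        # How many times faster than the baseline (FLUX.2 VAE in the table).
        "speedup": baseline_ms_per_batch / ms_per_batch,
        # Relative peak-memory saving versus the baseline, in percent.
        "vram_reduction_pct": 100.0 * (1.0 - peak_vram_mib / baseline_peak_vram_mib),
    }


# dinac_ae_d2 at 256x256, batch 128, against the FLUX.2 VAE row.
row = derived_metrics(128, 69.56, 1606.5, 341.94, 12533.8)
```

Because the table reports rounded batch times, recomputed values can differ from the published ones in the last digit.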

## Latent Interface

- `encode()` returns DINAC-AE-D2's own whitened latent space.
- `decode()` expects that same whitened latent space and dewhitens internally.
- `predict_class()` expects the same whitened latent space, dewhitens
  internally, and predicts a DINOv2-B class-token feature.
- `whiten()` and `dewhiten()` are exposed for explicit control.
- `encode_posterior()` returns the raw exported posterior before whitening.
- `DinacAEInferenceConfig.num_steps` counts decoder evaluations directly:
  `num_steps=1` means one NFE.

The export ships weights in `float32`. The recommended and default runtime path is
`bfloat16` for the main encoder, decoder, and class-token path, with `float32`
retained for whitening/dewhitening, normalization math, RoPE frequency
construction, and VP diffusion schedule helpers.

## Usage

```python
import torch

from dinac_ae import DinacAE, DinacAEInferenceConfig


device = "cuda"
model = DinacAE.from_pretrained(
    "data-archetype/dinac_ae_d2",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [1, 3, H, W] in [-1, 1], H and W divisible by 16

with torch.inference_mode():
    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
    class_token = model.predict_class(latents)
    recon = model.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=DinacAEInferenceConfig(num_steps=1),
    )
```
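
The snippet above leaves `image` unspecified. One possible way to prepare it, assuming the source is a `uint8` RGB tensor of shape `[H, W, 3]`, is sketched below. The helpers `to_model_input` and `to_uint8` are hypothetical and not part of the `dinac_ae` package, and center-cropping to a multiple of 16 is just one option (resizing or padding would also satisfy the divisibility requirement).

```python
import torch


def to_model_input(image_u8: torch.Tensor, multiple: int = 16) -> torch.Tensor:
    """Convert a uint8 [H, W, 3] image to a float [1, 3, H', W'] tensor in [-1, 1]."""
    if image_u8.dtype != torch.uint8 or image_u8.ndim != 3 or image_u8.shape[-1] != 3:
        raise ValueError("expected a uint8 tensor of shape [H, W, 3]")
    height, width = image_u8.shape[:2]
    # Largest height/width not exceeding the input that is divisible by `multiple`.
    new_h = height - height % multiple
    new_w = width - width % multiple
    if new_h == 0 or new_w == 0:
        raise ValueError(f"image must be at least {multiple} pixels on each side")
    top = (height - new_h) // 2
    left = (width - new_w) // 2
    cropped = image_u8[top:top + new_h, left:left + new_w]
    # HWC -> CHW, scale [0, 255] to [-1, 1], then add the batch dimension.
    chw = cropped.permute(2, 0, 1).float()
    return (chw / 127.5 - 1.0).unsqueeze(0)


def to_uint8(recon: torch.Tensor) -> torch.Tensor:
    """Map a [1, 3, H, W] tensor in [-1, 1] back to a uint8 [H, W, 3] image."""
    scaled = (recon.float().clamp(-1.0, 1.0) + 1.0) * 127.5
    return scaled.round().to(torch.uint8)[0].permute(1, 2, 0)
```

With these helpers, `image = to_model_input(raw)` can feed the usage snippet, and `to_uint8(recon)` converts the reconstruction back for saving or display.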

## Details

- DINAC-AE-D2 uses an `8`-block ViT/DiT-style transformer encoder and an
  `8`-block FCDM decoder.
- Patch size is `16`, model width is `896`, and latent width is `128`.
- Total parameter count is `154.22M`: `78.02M` encoder, `61.93M` decoder, and
  `14.26M` DINO token/class alignment head.
- The DINO alignment head predicts spatial patch tokens and a class-token output
  in DINOv2 ViT-B/14 feature space.
- `predict_class(latents)` exposes the DINOv2 ViT-B/14 class-token feature
  directly from latents.
- [Results viewer](https://huggingface.co/spaces/data-archetype/dinac_ae_d2-results)
- Related: [DINAC-AE](https://huggingface.co/data-archetype/dinac_ae),
  [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae),
  [full_capacitor](https://huggingface.co/data-archetype/full_capacitor),
  [capacitor_decoder](https://huggingface.co/data-archetype/capacitor_decoder)
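
The patch size and latent width listed above determine the size of the latent grid for a given image. The sketch below assumes one 128-dimensional latent vector per 16x16 patch; the card does not state the exact tensor layout returned by `encode()`, so the helper names and the per-patch interpretation are assumptions.

```python
PATCH_SIZE = 16      # from the model card
LATENT_WIDTH = 128   # from the model card


def latent_grid(height: int, width: int, patch_size: int = PATCH_SIZE) -> tuple[int, int]:
    """Number of latent positions along height and width for an input image."""
    if height % patch_size or width % patch_size:
        raise ValueError(f"height and width must be divisible by {patch_size}")
    return height // patch_size, width // patch_size


def compression_ratio(patch_size: int = PATCH_SIZE,
                      latent_width: int = LATENT_WIDTH,
                      image_channels: int = 3) -> float:
    """RGB values per patch divided by latent values per patch (assumed layout)."""
    return image_channels * patch_size ** 2 / latent_width


grid_256 = latent_grid(256, 256)
grid_512 = latent_grid(512, 512)
ratio = compression_ratio()
```

Under that assumption, each patch's 768 RGB values map to 128 latent values.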

## Citation

```bibtex
@misc{dinac_ae_d2,
  title   = {DINAC-AE-D2: a DINOv2-aligned class-token diffusion autoencoder},
  author  = {data-archetype},
  email   = {data-archetype@proton.me},
  year    = {2026},
  month   = jun,
  url     = {https://huggingface.co/data-archetype/dinac_ae_d2},
}
```