---
license: apache-2.0
tags:
- diffusion
- autoencoder
- image-reconstruction
- latent-space
- dino
- pytorch
---

# data-archetype/dinac_ae

**DINAC-AE** is a **DIN**O-**A**ligned **C**lass-token **A**uto**E**ncoder.
It follows the [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae)
family: patch-16 spatial latents, a VP diffusion decoder, and DINO-aligned
representations.

Relative to SemDisDiffAE, DINAC-AE changes the encoder from FCDM blocks to a
6-block ViT/DiT-style transformer encoder and uses DINOv3 ViT-B/16 alignment.
The latent-to-DINO alignment head is extended to predict the DINO class token
as well as patch tokens. `predict_class(latents)` exposes that class-token
feature directly from latents.
|
|
## 2k PSNR Benchmark
|
|
| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
|---|---:|---:|---:|---:|---:|
| dinac_ae | `35.19` | `4.53` | `35.06` | `28.02` | `42.43` |
| FLUX.2 VAE | `36.28` | `4.53` | `36.07` | `28.89` | `43.63` |

Evaluated on `2000` validation images.
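
The summary columns above can be reproduced from per-image PSNR values along the following lines. This is a minimal sketch, not the evaluation script: it assumes images are float arrays in `[-1, 1]` (so the peak-to-peak range is `2.0`), a population standard deviation, and NumPy's default linear-interpolation percentiles. `psnr_db` and `summarize` are hypothetical helper names.

```python
import numpy as np


def psnr_db(reference: np.ndarray, reconstruction: np.ndarray, data_range: float = 2.0) -> float:
    """PSNR in dB for one image; data_range=2.0 assumes pixel values in [-1, 1]."""
    diff = reference.astype(np.float64) - reconstruction.astype(np.float64)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))


def summarize(psnr_values) -> dict:
    """Mean, std, median, 5th and 95th percentile, mirroring the table columns."""
    values = np.asarray(psnr_values, dtype=np.float64)
    return {
        "mean": float(values.mean()),
        "std": float(values.std()),  # population std (ddof=0), an assumption
        "median": float(np.median(values)),
        "p5": float(np.percentile(values, 5)),
        "p95": float(np.percentile(values, 95)),
    }
```

Under the `[-1, 1]` assumption, a uniform per-pixel error of `0.02` corresponds to 40 dB.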

DINAC-AE targets a compromise between high reconstruction quality, a learnable
latent space with KL-like variance expansion, DINOv3 alignment, and robustness
to local token errors.

The [results viewer](https://huggingface.co/spaces/data-archetype/dinac_ae-results)
shows the 39-image reconstruction set with DINAC-AE and FLUX.2 VAE
reconstructions, RGB differences, and latent PCA.
A recheck of the released export on that 39-image set gives `35.15 dB` mean
PSNR (min `25.73 dB`, max `45.99 dB`).

[Full technical report](https://huggingface.co/data-archetype/dinac_ae/blob/main/technical_report_dinac_ae.md)

## Encode Throughput

Measured on an `NVIDIA GeForce RTX 5090` in `bfloat16`, averaging repeated
batches per resolution.

| Resolution | Batch Size | dinac_ae encode (ms/batch) | FLUX.2 encode (ms/batch) | dinac_ae peak VRAM (MiB) | FLUX.2 peak VRAM (MiB) | Speedup vs FLUX.2 | Peak VRAM Reduction vs FLUX.2 |
|---:|---:|---:|---:|---:|---:|---:|---:|
| `256x256` | `128` | `50` | `383` | `1,637` | `12,511` | `7.62x` | `86.9%` |
| `512x512` | `32` | `53` | `354` | `1,639` | `12,511` | `6.72x` | `86.9%` |

The transformer encoder is slightly slower and larger than the full_capacitor
FCDM encoder, but remains much faster and much smaller than the FLUX.2 VAE
encoder.
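
A measurement of this kind could be set up roughly as follows. This is a sketch under assumptions rather than the benchmark harness used for the table: `time_per_batch_ms` and `vram_reduction_pct` are hypothetical helpers, and the warmup and repeat counts are arbitrary. On a GPU, `torch.cuda.synchronize` would be passed as `sync` so that queued kernels finish before the clock is read.

```python
import time
from typing import Callable, Optional


def time_per_batch_ms(
    run_batch: Callable[[], object],
    warmup: int = 3,
    repeats: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Average wall-clock milliseconds per call of run_batch."""
    for _ in range(warmup):
        run_batch()  # untimed warmup calls (allocator, kernel selection, caches)
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(repeats):
        run_batch()
    if sync is not None:
        sync()  # wait for asynchronous work before stopping the clock
    return (time.perf_counter() - start) * 1000.0 / repeats


def vram_reduction_pct(ours_mib: float, baseline_mib: float) -> float:
    """Relative peak-memory saving versus a baseline, in percent."""
    return (1.0 - ours_mib / baseline_mib) * 100.0


print(round(vram_reduction_pct(1637, 12511), 1))  # 86.9
```

Ratios recomputed from the rounded millisecond entries can differ slightly from the listed speedups, presumably because the table was derived from unrounded timings.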
|
|
## Latent Interface
|
|
- `encode()` returns DINAC-AE's own whitened latent space.
- `decode()` expects that same whitened latent space and dewhitens internally.
- `predict_class()` expects the same whitened latent space, dewhitens
  internally, and predicts a DINOv3 ViT-B/16 class-token feature.
- `whiten()` and `dewhiten()` are exposed for explicit control.
- `encode_posterior()` returns the raw exported posterior before whitening.
- `DinacAEInferenceConfig.num_steps` counts decoder evaluations directly:
  `num_steps=1` means one NFE.
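
The relationship between raw and whitened latents in the list above can be illustrated with a toy round trip. This sketch assumes whitening is a per-channel affine map with stored mean and standard deviation, which this card does not specify; `ToyWhitener` is a hypothetical stand-in and does not use DINAC-AE's actual statistics or code.

```python
import numpy as np


class ToyWhitener:
    """Hypothetical per-channel affine whitening, for illustration only."""

    def __init__(self, mean: np.ndarray, std: np.ndarray) -> None:
        # Reshape to [1, C, 1, 1] so the stats broadcast over [B, C, H, W] latents.
        self.mean = mean.astype(np.float32).reshape(1, -1, 1, 1)
        self.std = std.astype(np.float32).reshape(1, -1, 1, 1)

    def whiten(self, raw: np.ndarray) -> np.ndarray:
        return (raw.astype(np.float32) - self.mean) / self.std

    def dewhiten(self, whitened: np.ndarray) -> np.ndarray:
        return whitened.astype(np.float32) * self.std + self.mean


rng = np.random.default_rng(0)
raw = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 8, 8)).astype(np.float32)
whitener = ToyWhitener(raw.mean(axis=(0, 2, 3)), raw.std(axis=(0, 2, 3)))

whitened = whitener.whiten(raw)         # analogous to what encode() returns
restored = whitener.dewhiten(whitened)  # analogous to the first step inside decode()
```

The practical consequence is that latents from `encode()` can be passed straight to `decode()` or `predict_class()`, while latents from `encode_posterior()` would need `whiten()` first.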
|
|
The export ships weights in `float32`. The recommended and default runtime path
is `bfloat16` AMP for the main encoder, decoder, and class-token path, with
`float32` retained for sensitive operations such as whitening/dewhitening,
normalization math, RoPE frequency construction, and VP diffusion schedule
helpers.
|
|
## Usage
|
|
```python
import torch

from dinac_ae import DinacAE, DinacAEInferenceConfig


device = "cuda"
model = DinacAE.from_pretrained(
    "data-archetype/dinac_ae",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [1, 3, H, W] in [-1, 1], H and W divisible by 16

with torch.inference_mode():
    # Encode to whitened latents.
    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
    # Predict the DINOv3 class-token feature directly from the latents.
    class_token = model.predict_class(latents)
    # Decode with a single decoder evaluation (one NFE).
    recon = model.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=DinacAEInferenceConfig(num_steps=1),
    )
```
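
The divisibility requirement in the comment above follows from the patch size of `16`. As a rough sanity check, the latent grid size can be computed ahead of time. The helper below is hypothetical, and it assumes a channels-first `[B, 128, H/16, W/16]` latent layout, which this card does not state explicitly.

```python
def expected_latent_shape(
    batch: int,
    height: int,
    width: int,
    patch_size: int = 16,
    latent_channels: int = 128,
) -> tuple:
    """Latent shape implied by patch-16 tokens, assuming a [B, C, H/16, W/16] layout."""
    if height % patch_size or width % patch_size:
        raise ValueError(
            f"height and width must be divisible by {patch_size}, got {height}x{width}"
        )
    return (batch, latent_channels, height // patch_size, width // patch_size)


print(expected_latent_shape(1, 512, 512))  # (1, 128, 32, 32)
print(expected_latent_shape(1, 256, 384))  # (1, 128, 16, 24)
```

Images whose sides are not multiples of 16 would need to be cropped or padded before calling `encode()`.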
|
|
## Details
|
|
- DINAC-AE uses a `6`-block ViT/DiT-style transformer encoder and an `8`-block
  FCDM decoder.
- Patch size is `16`, model width is `896`, and latent width is `128`.
- The DINO alignment head predicts spatial patch tokens and is extended with a
  class-token output in DINOv3 ViT-B/16 feature space.
- The class-token output is used to improve semantic organization of the latent
  space and to support FD-loss / Representation Frechet Distance objectives
  directly in latent space.
- `predict_class(latents)` reaches mean cosine similarity `0.757458` against
  the frozen DINOv3 ViT-B/16 teacher class token on the same `2000` images.
- DINO alignment is applied directly to clean latent tokens. Robustness to
  local token errors is handled by random-token logSNR offset regularization.
- Results viewer: https://huggingface.co/spaces/data-archetype/dinac_ae-results
- Related: [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae),
  [full_capacitor](https://huggingface.co/data-archetype/full_capacitor),
  [capacitor_decoder](https://huggingface.co/data-archetype/capacitor_decoder)
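
The class-token agreement quoted in the list above is a mean cosine similarity. A metric of that kind can be computed as sketched below. The sketch assumes predicted and teacher features are stacked as `[N, D]` arrays; `mean_cosine_similarity` is a hypothetical helper, not the evaluation code behind the reported number.

```python
import numpy as np


def mean_cosine_similarity(predicted: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Average per-row cosine similarity between two [N, D] feature matrices."""
    predicted = predicted.astype(np.float64)
    target = target.astype(np.float64)
    dots = np.sum(predicted * target, axis=1)
    norms = np.linalg.norm(predicted, axis=1) * np.linalg.norm(target, axis=1)
    # eps guards against division by zero for all-zero rows.
    return float(np.mean(dots / np.maximum(norms, eps)))
```

In practice `predicted` would hold the outputs of `predict_class(latents)` and `target` the frozen teacher's class tokens for the same images, both moved to CPU and converted to `float32` or higher before comparison.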

## Citation

```bibtex
@misc{dinac_ae,
  title = {DINAC-AE: a DINO-aligned class-token diffusion autoencoder},
  author = {data-archetype},
  email = {data-archetype@proton.me},
  year = {2026},
  month = may,
  url = {https://huggingface.co/data-archetype/dinac_ae},
}
```
