---
license: apache-2.0
tags:
  - diffusion
  - autoencoder
  - image-reconstruction
  - pytorch
library_name: irdiffae
---

# data-archetype/irdiffae-v1

**iRDiffAE** — **iR**epa **Diff**usion **A**uto**E**ncoder.
A fast, single-GPU-trainable diffusion autoencoder with spatially structured
latents for rapid downstream model convergence. Encoding runs ~5× faster than
Flux VAE; single-step decoding runs ~3× faster.

## Model Variants

| Variant | Patch | Channels | Compression | Notes |
|---------|-------|----------|-------------|-------|
| [irdiffae_v1](https://huggingface.co/data-archetype/irdiffae_v1) | 16×16 | 128 | 6× | recommended |

This variant (data-archetype/irdiffae-v1): 121.0M parameters, 461.4 MB.
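
The 6× compression factor follows directly from the patch size and bottleneck width; a quick arithmetic check (plain Python, not library code):

```python
# How the 6x factor in the table arises: each 16x16 RGB patch
# (3 * 16 * 16 = 768 input values) becomes one 128-channel latent vector.
H, W = 256, 256  # any size divisible by 16
pixel_values = 3 * H * W                      # 196,608 input values
latent_values = 128 * (H // 16) * (W // 16)   # 32,768 latent values
print(pixel_values / latent_values)           # 6.0
```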

## Documentation

- [Technical Report](technical_report.md) — diffusion math, architecture, training, and results
- [Results — interactive viewer](https://huggingface.co/spaces/data-archetype/irdiffae-results) — full-resolution side-by-side comparison
- [Results — summary stats](technical_report.md#7-results) — metrics and per-image PSNR

## Quick Start

```python
import torch
from ir_diffae import IRDiffAE

# Load from HuggingFace Hub (or a local path)
model = IRDiffAE.from_pretrained("data-archetype/irdiffae-v1", device="cuda")

# Encode
images = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 16
latents = model.encode(images)

# Decode (1 step by default — PSNR-optimal)
H, W = images.shape[-2:]
recon = model.decode(latents, height=H, width=W)

# Reconstruct (encode + 1-step decode)
recon = model.reconstruct(images)
```

> **Note:** Requires `pip install huggingface_hub safetensors` for Hub downloads.
> You can also pass a local directory path to `from_pretrained()`.

## Architecture

| Property | Value |
|---|---|
| Parameters | 120,957,440 |
| File size | 461.4 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 8 |
| Bottleneck dim | 128 |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |

**Encoder**: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by
DiCo blocks (depthwise conv + compact channel attention + GELU MLP) with
learned residual gates.
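
The patchify stem described above can be sketched in a few lines of PyTorch (dims taken from the Architecture table; the module structure is illustrative, not the library's actual code):

```python
import torch
import torch.nn as nn

# Illustrative sketch of the patchify stem: PixelUnshuffle folds each
# 16x16 patch into the channel dim, then a 1x1 conv projects the
# 3 * 16 * 16 = 768 per-patch values to the model dim (896).
patch, model_dim = 16, 896
stem = nn.Sequential(
    nn.PixelUnshuffle(patch),  # [B, 3, H, W] -> [B, 768, H/16, W/16]
    nn.Conv2d(3 * patch * patch, model_dim, kernel_size=1),
)

x = torch.randn(2, 3, 256, 256)
tokens = stem(x)
print(tokens.shape)  # torch.Size([2, 896, 16, 16])
```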

**Decoder**: VP diffusion conditioned on encoder latents and timestep via
shared-base + per-layer low-rank AdaLN-Zero. Start blocks (2) -> middle
blocks (4) -> skip fusion -> end blocks (2). Supports
Path-Drop Guidance (PDG) at inference for quality/speed tradeoff.
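
The shared-base + per-layer low-rank AdaLN-Zero conditioning can be sketched as follows (shapes from the Architecture table: model dim 896, AdaLN rank 128; all names here are hypothetical, not the library's API):

```python
import torch
import torch.nn as nn

# Illustrative sketch (hypothetical names, not the library's API):
# one shared linear produces the base shift/scale/gate modulation from
# the conditioning vector, and each decoder layer adds its own rank-128
# correction instead of a full per-layer 896 -> 6*896 projection.
dim, rank = 896, 128
shared_base = nn.Linear(dim, 6 * dim)          # shared across all layers
layer_down = nn.Linear(dim, rank, bias=False)  # per-layer low-rank factor
layer_up = nn.Linear(rank, 6 * dim, bias=False)
nn.init.zeros_(layer_up.weight)                # "Zero": layer starts at the shared base

cond = torch.randn(2, dim)                     # latent + timestep embedding
mod = shared_base(cond) + layer_up(layer_down(cond))
shift1, scale1, gate1, shift2, scale2, gate2 = mod.chunk(6, dim=-1)
print(mod.shape)  # torch.Size([2, 5376]) -> six [2, 896] modulation tensors
```

With rank 128, each layer's low-rank pair costs 896·128 + 128·5376 ≈ 0.80M parameters versus ≈4.82M for a full per-layer 896 → 5376 projection.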

## Recommended Settings

Best quality is achieved with just **1 DDIM step** and PDG disabled,
which also makes inference extremely fast. PDG (strength 2-4) can add
perceptual sharpness but is easy to overdo.

| Setting | Default |
|---|---|
| Sampler | DDIM |
| Steps | 1 |
| PDG | Disabled |

```python
from ir_diffae import IRDiffAEInferenceConfig

# PSNR-optimal (fast, 1 step)
cfg = IRDiffAEInferenceConfig(num_steps=1, sampler="ddim")
H, W = images.shape[-2:]  # original image size from the Quick Start example
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
```

## Citation

```bibtex
@misc{irdiffae_v1,
  title   = {iRDiffAE: A Fast, Representation Aligned Diffusion Autoencoder with DiCo Blocks},
  author  = {data-archetype},
  year    = {2026},
  month   = feb,
  url     = {https://huggingface.co/data-archetype/irdiffae-v1},
}
```

## Dependencies

- PyTorch >= 2.0
- safetensors (for loading weights)

## License

Apache 2.0