---
license: mit
pipeline_tag: image-to-image
library_name: diffusers
---

<h1 align="center"> REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers </h1>

<p align="center">
  <a href="https://scholar.google.com.au/citations?user=GQzvqS4AAAAJ" target="_blank">Xingjian&nbsp;Leng</a><sup>1*</sup> &ensp; <b>&middot;</b> &ensp;
  <a href="https://1jsingh.github.io/" target="_blank">Jaskirat&nbsp;Singh</a><sup>1*</sup> &ensp; <b>&middot;</b> &ensp;
  <a href="https://hou-yz.github.io/" target="_blank">Yunzhong&nbsp;Hou</a><sup>1</sup> &ensp; <b>&middot;</b> &ensp;
  <a href="https://people.csiro.au/X/Z/Zhenchang-Xing/" target="_blank">Zhenchang&nbsp;Xing</a><sup>2</sup>&ensp; <b>&middot;</b> &ensp;
  <a href="https://www.sainingxie.com/" target="_blank">Saining&nbsp;Xie</a><sup>3</sup>&ensp; <b>&middot;</b> &ensp;
  <a href="https://zheng-lab-anu.github.io/" target="_blank">Liang&nbsp;Zheng</a><sup>1</sup>&ensp;
</p>

<p align="center">
  <sup>1</sup> Australian National University &emsp; <sup>2</sup>Data61-CSIRO &emsp; <sup>3</sup>New York University &emsp; <br>
  <sub><sup>*</sup>Project Leads &emsp;</sub>
</p>

<p align="center">
  <a href="https://End2End-Diffusion.github.io">🌐 Project Page</a> &ensp;
  <a href="https://huggingface.co/REPA-E">πŸ€— Models</a> &ensp;
  <a href="https://arxiv.org/abs/2504.10483">πŸ“ƒ Paper</a> &ensp;
  <a href="https://github.com/REPA-E/REPA-E">πŸ’» Code</a> &ensp;
  <br><br>
  <a href="https://paperswithcode.com/sota/image-generation-on-imagenet-256x256?p=repa-e-unlocking-vae-for-end-to-end-tuning-of"><img src="https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/repa-e-unlocking-vae-for-end-to-end-tuning-of/image-generation-on-imagenet-256x256" alt="PWC"></a>
</p>

![](assets/vis-examples.jpg)

## Overview
We address a fundamental question: ***Can latent diffusion models and their VAE tokenizer be trained end-to-end?*** While training both components jointly with the standard diffusion loss is observed to be ineffective (often degrading final performance), we show that this limitation can be overcome using a simple representation-alignment (REPA) loss. Our proposed method, **REPA-E**, enables stable and effective joint training of both the VAE and the diffusion model.
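
For intuition, below is a minimal PyTorch sketch of a representation-alignment loss of this kind: intermediate SiT tokens are projected and aligned with patch features from a frozen pretrained encoder such as DINOv2. The names are illustrative only, not the exact functions used in this codebase:

```python
# Illustrative sketch of a representation-alignment (REPA-style) loss.
# Variable names (sit_tokens, encoder_tokens, proj_head) are hypothetical;
# refer to the training code in this repository for the exact implementation.
import torch.nn.functional as F


def repa_alignment_loss(sit_tokens, encoder_tokens, proj_head):
    """Pull intermediate SiT tokens toward patch features of a frozen
    pretrained encoder (e.g., DINOv2) via negative cosine similarity.

    sit_tokens:     [B, N, D_sit]  tokens from an intermediate SiT block
    encoder_tokens: [B, N, D_enc]  patch tokens from the frozen encoder
    proj_head:      small MLP mapping D_sit -> D_enc
    """
    projected = F.normalize(proj_head(sit_tokens), dim=-1)
    target = F.normalize(encoder_tokens, dim=-1)
    # Negative cosine similarity, averaged over patches and batch.
    return -(projected * target).sum(dim=-1).mean()
```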

![](assets/overview.jpg)

**REPA-E** significantly accelerates training, achieving over a **17×** speedup compared to REPA and **45×** over the vanilla training recipe. Interestingly, end-to-end tuning also improves the VAE itself: the resulting **E2E-VAE** provides better latent structure and serves as a **drop-in replacement** for existing VAEs (e.g., SD-VAE), improving convergence and generation quality across diverse LDM architectures. Our method achieves state-of-the-art FID scores on ImageNet 256×256: **1.26** with CFG and **1.83** without CFG.

## News and Updates
**[2025-04-15]** Initial Release with pre-trained models and codebase.

## Getting Started
### 1. Environment Setup
To set up our environment, please run:

```bash
git clone https://github.com/REPA-E/REPA-E.git
cd REPA-E
conda env create -f environment.yml -y
conda activate repa-e
```

### 2. Prepare the training data
Download and extract the training split of the [ImageNet-1K](https://www.image-net.org/challenges/LSVRC/2012/index) dataset. Once it's ready, run the following command to preprocess the dataset:

```bash
python preprocessing.py --imagenet-path /PATH/TO/IMAGENET_TRAIN
```

Replace `/PATH/TO/IMAGENET_TRAIN` with the actual path to the extracted training images.

### 3. Train the REPA-E model

To train the REPA-E model, you first need to download the following pre-trained VAE checkpoints:
- [🤗 **SD-VAE (f8d4)**](https://huggingface.co/REPA-E/sdvae): Derived from the [Stability AI SD-VAE](https://huggingface.co/stabilityai/sd-vae-ft-mse), originally trained on [Open Images](https://storage.googleapis.com/openimages/web/index.html) and fine-tuned on a subset of [LAION-2B](https://laion.ai/blog/laion-5b/).
- [🤗 **IN-VAE (f16d32)**](https://huggingface.co/REPA-E/invae): Trained from scratch on [ImageNet-1K](https://www.image-net.org/) using the [latent-diffusion](https://github.com/CompVis/latent-diffusion) codebase with our custom architecture.
- [🤗 **VA-VAE (f16d32)**](https://huggingface.co/REPA-E/vavae): Taken from [LightningDiT](https://github.com/hustvl/LightningDiT), this VAE is a visual tokenizer aligned with vision foundation models during reconstruction training. It is also trained on [ImageNet-1K](https://www.image-net.org/) for high-quality tokenization in high-dimensional latent spaces.

Recommended directory structure:
```
pretrained/
├── invae/
├── sdvae/
└── vavae/
```

Once you've downloaded a VAE checkpoint, you can launch REPA-E training with:
```bash
accelerate launch train_repae.py \
    --max-train-steps=400000 \
    --report-to="wandb" \
    --allow-tf32 \
    --mixed-precision="fp16" \
    --seed=0 \
    --data-dir="data" \
    --output-dir="exps" \
    --batch-size=256 \
    --path-type="linear" \
    --prediction="v" \
    --weighting="uniform" \
    --model="SiT-XL/2" \
    --checkpointing-steps=50000 \
    --loss-cfg-path="configs/l1_lpips_kl_gan.yaml" \
    --vae="f8d4" \
    --vae-ckpt="pretrained/sdvae/sdvae-f8d4.pt" \
    --disc-pretrained-ckpt="pretrained/sdvae/sdvae-f8d4-discriminator-ckpt.pt" \
    --enc-type="dinov2-vit-b" \
    --proj-coeff=0.5 \
    --encoder-depth=8 \
    --vae-align-proj-coeff=1.5 \
    --bn-momentum=0.1 \
    --exp-name="sit-xl-dinov2-b-enc8-repae-sdvae-0.5-1.5-400k"
```
<details>
  <summary>Click to expand for configuration options</summary>

The script automatically creates a subfolder under `exps` to save logs and checkpoints. You can adjust the following options:

- `--output-dir`: Directory to save checkpoints and logs
- `--exp-name`: Experiment name (a subfolder will be created under `output-dir`)
- `--vae`: Choose between `[f8d4, f16d32]`
- `--vae-ckpt`: Path to a provided or custom VAE checkpoint
- `--disc-pretrained-ckpt`: Path to a provided or custom VAE discriminator checkpoint
- `--model`: Choose from `[SiT-B/2, SiT-L/2, SiT-XL/2, SiT-B/1, SiT-L/1, SiT-XL/1]`. The number indicates the patch size. Select a model compatible with your VAE architecture.
- `--enc-type`: `[dinov2-vit-b, dinov2-vit-l, dinov2-vit-g, dinov1-vit-b, mocov3-vit-b, mocov3-vit-l, clip-vit-L, jepa-vit-h, mae-vit-l]`
- `--encoder-depth`: Any integer from 1 up to the full depth of the selected encoder
- `--proj-coeff`: REPA-E projection coefficient for SiT alignment (float > 0)
- `--vae-align-proj-coeff`: REPA-E projection coefficient for VAE alignment (float > 0)
- `--bn-momentum`: Batchnorm layer momentum (float)

</details>

### 4. Use REPA-E Tuned VAE (E2E-VAE) for Accelerated Training and Better Generation
This section shows how to use the REPA-E fine-tuned VAE (E2E-VAE) in latent diffusion training. E2E-VAE acts as a drop-in replacement for the original VAE, enabling faster convergence and better generation quality. You can either download a pre-trained VAE or extract it from a REPA-E checkpoint.

**Step 1**: Obtain the fine-tuned VAE from REPA-E checkpoints:
- **Option 1**: Download pre-trained REPA-E VAEs directly from Hugging Face:
    - [🤗 **E2E-SDVAE**](https://huggingface.co/REPA-E/e2e-sdvae)
    - [🤗 **E2E-INVAE**](https://huggingface.co/REPA-E/e2e-invae)
    - [🤗 **E2E-VAVAE**](https://huggingface.co/REPA-E/e2e-vavae)
  
    Recommended directory structure:
    ```
    pretrained/
    ├── e2e-sdvae/
    ├── e2e-invae/
    └── e2e-vavae/
    ```
- **Option 2**: Extract the VAE from a full REPA-E checkpoint manually:
    ```bash
    python save_vae_weights.py \
        --repae-ckpt pretrained/sit-repae-vavae/checkpoints/0400000.pt \
        --vae-name e2e-vavae \
        --save-dir exps
    ```

**Step 2**: Cache latents to enable fast training:
```bash
accelerate launch --num_machines=1 --num_processes=8 cache_latents.py \
    --vae-arch="f16d32" \
    --vae-ckpt-path="pretrained/e2e-vavae/e2e-vavae-400k.pt" \
    --vae-latents-name="e2e-vavae" \
    --pproc-batch-size=128
```
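
Conceptually, latent caching runs the frozen VAE encoder once over the whole training set and stores the resulting latents on disk, so the diffusion model never has to re-encode pixels during training. A minimal sketch of the idea follows; the names and storage format are hypothetical, and `cache_latents.py` handles the actual distributed procedure and file layout:

```python
# Minimal sketch of latent caching; names and storage format are illustrative.
# See cache_latents.py in this repository for the actual procedure.
import torch


@torch.no_grad()
def cache_latents(vae, dataloader, out_path):
    vae.eval()
    latents, labels = [], []
    for images, targets in dataloader:            # images normalized as the VAE expects
        posterior = vae.encode(images)            # assumes encode() returns a latent distribution
        latents.append(posterior.sample().cpu())  # or use the posterior mean
        labels.append(targets)
    torch.save({"latents": torch.cat(latents), "labels": torch.cat(labels)}, out_path)
```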

**Step 3**: Train the SiT generation model using the cached latents:

```bash
accelerate launch train_ldm_only.py \
    --max-train-steps=4000000 \
    --report-to="wandb" \
    --allow-tf32 \
    --mixed-precision="fp16" \
    --seed=0 \
    --data-dir="data" \
    --batch-size=256 \
    --path-type="linear" \
    --prediction="v" \
    --weighting="uniform" \
    --model="SiT-XL/1" \
    --checkpointing-steps=50000 \
    --vae="f16d32" \
    --vae-ckpt="pretrained/e2e-vavae/e2e-vavae-400k.pt" \
    --vae-latents-name="e2e-vavae" \
    --learning-rate=1e-4 \
    --enc-type="dinov2-vit-b" \
    --proj-coeff=0.5 \
    --encoder-depth=8 \
    --output-dir="exps" \
    --exp-name="sit-xl-1-dinov2-b-enc8-ldm-only-e2e-vavae-0.5-4m"
```

For details on the available training options and argument descriptions, refer to [Section 3](#3-train-the-repa-e-model).

### 5. Generate samples and run evaluation
You can generate samples and save them as `.npz` files using the following script. Simply set the `--exp-path` and `--train-steps` corresponding to your trained model (REPA-E or Traditional LDM Training).

```bash
torchrun --nnodes=1 --nproc_per_node=8 generate.py \
    --num-fid-samples 50000 \
    --path-type linear \
    --mode sde \
    --num-steps 250 \
    --cfg-scale 1.0 \
    --guidance-high 1.0 \
    --guidance-low 0.0 \
    --exp-path pretrained/sit-repae-sdvae \
    --train-steps 400000
```

```bash
torchrun --nnodes=1 --nproc_per_node=8 generate.py \
    --num-fid-samples 50000 \
    --path-type linear \
    --mode sde \
    --num-steps 250 \
    --cfg-scale 1.0 \
    --guidance-high 1.0 \
    --guidance-low 0.0 \
    --exp-path pretrained/sit-ldm-e2e-vavae \
    --train-steps 4000000
```

<details>
  <summary>Click to expand for sampling options</summary>

You can adjust the following options for sampling:
- `--path-type linear`: Noise schedule type, choose from `[linear, cosine]`
- `--mode`: Sampling mode, `[ode, sde]`
- `--num-steps`: Number of denoising steps
- `--cfg-scale`: Guidance scale (float ≥ 1); setting it to 1 disables classifier-free guidance (CFG)
- `--guidance-high`: Upper guidance interval (float in [0, 1])
- `--guidance-low`: Lower guidance interval (float in [0, 1], must be < `--guidance-high`)
- `--exp-path`: Path to the experiment directory
- `--train-steps`: Training step of the checkpoint to evaluate

</details>
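
The `--guidance-low` and `--guidance-high` options restrict classifier-free guidance to a sub-interval of the sampling trajectory. The sketch below shows the usual interval-guidance rule; it is illustrative only, and the exact time convention used by `generate.py` may differ:

```python
# Illustrative guidance-interval rule; not the exact implementation in generate.py.
def guided_prediction(model, x, t, y, y_null, cfg_scale, guidance_low, guidance_high):
    """Apply classifier-free guidance only when the normalized time t lies in
    [guidance_low, guidance_high]; otherwise use the conditional prediction alone."""
    pred_cond = model(x, t, y)
    if cfg_scale > 1.0 and guidance_low <= float(t) <= guidance_high:
        pred_uncond = model(x, t, y_null)
        return pred_uncond + cfg_scale * (pred_cond - pred_uncond)
    return pred_cond
```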

You can then use the [ADM evaluation suite](https://github.com/openai/guided-diffusion/tree/main/evaluations) to compute image generation quality metrics, including gFID, sFID, Inception Score (IS), Precision, and Recall.
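
Before running the evaluation suite, it can help to sanity-check the generated `.npz` file; the path and key layout below are assumptions, as `generate.py` controls the actual format:

```python
# Quick sanity check of a generated .npz sample file before evaluation.
import numpy as np

samples = np.load("samples/my-run.npz")   # placeholder path
arr = samples[samples.files[0]]           # samples are typically stored under a single key
print(arr.shape, arr.dtype)               # expect roughly (50000, 256, 256, 3), uint8
```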

### Quantitative Results
The tables below report generation performance using gFID on 50k samples, with and without classifier-free guidance (CFG). We compare models trained end-to-end with **REPA-E** and models using a frozen REPA-E fine-tuned VAE (**E2E-VAE**). Lower is better. All linked checkpoints below are hosted on our [🤗 Hugging Face Hub](https://huggingface.co/REPA-E). To reproduce these results, download the respective checkpoints to the `pretrained` folder and run the evaluation script as detailed in [Section 5](#5-generate-samples-and-run-evaluation).

#### A. End-to-End Training (REPA-E)
| Tokenizer | Generation Model | Epochs | gFID-50k ↓ | gFID-50k (CFG) ↓ |
|:---------|:----------------|:-----:|:----:|:---:|
| [**SD-VAE<sup>*</sup>**](https://huggingface.co/REPA-E/sdvae) | [**SiT-XL/2**](https://huggingface.co/REPA-E/sit-repae-sdvae) | 80 | 4.07 | 1.67<sup>a</sup> |
| [**IN-VAE<sup>*</sup>**](https://huggingface.co/REPA-E/invae) | [**SiT-XL/1**](https://huggingface.co/REPA-E/sit-repae-invae) | 80 | 4.09 | 1.61<sup>b</sup> |
| [**VA-VAE<sup>*</sup>**](https://huggingface.co/REPA-E/vavae) | [**SiT-XL/1**](https://huggingface.co/REPA-E/sit-repae-vavae) | 80 | 4.05 | 1.73<sup>c</sup> |

\* The "Tokenizer" column refers to the initial VAE used for joint REPA-E training. The final (jointly optimized) VAE is bundled within the generation model checkpoint. 

<details>
  <summary>Click to expand for CFG parameters</summary>
  <ul>
    <li><strong>a</strong>: <code>--cfg-scale=2.2</code>, <code>--guidance-low=0.0</code>, <code>--guidance-high=0.65</code></li>
    <li><strong>b</strong>: <code>--cfg-scale=1.8</code>, <code>--guidance-low=0.0</code>, <code>--guidance-high=0.825</code></li>
    <li><strong>c</strong>: <code>--cfg-scale=1.9</code>, <code>--guidance-low=0.0</code>, <code>--guidance-high=0.825</code></li>
  </ul>
</details>

---

#### B. Traditional Latent Diffusion Model Training (Frozen VAE)
| Tokenizer | Generation Model | Method | Epochs | gFID-50k ↓ | gFID-50k (CFG) ↓ |
|:------|:---------|:----------------|:-----:|:----:|:---:|
| SD-VAE | SiT-XL/2 | SiT | 1400 | 8.30 | 2.06 |
| SD-VAE | SiT-XL/2 | REPA | 800 | 5.90 | 1.42 |
| VA-VAE | LightningDiT-XL/1 | LightningDiT | 800 | 2.17 | 1.36 |
| [**E2E-VAVAE (Ours)**](https://huggingface.co/REPA-E/e2e-vavae) | [**SiT-XL/1**](https://huggingface.co/REPA-E/sit-ldm-e2e-vavae) | REPA | 800 | **1.83** | **1.26**<sup>†</sup> |

In this setup, the VAE is kept frozen, and only the generator is trained. Models using our E2E-VAE (fine-tuned via REPA-E) consistently outperform baselines like SD-VAE and VA-VAE, achieving state-of-the-art performance when incorporating the REPA alignment objective.

<details>
    <summary>Click to expand for CFG parameters</summary>
<ul>
    <li><strong>†</strong>: <code>--cfg-scale=2.5</code>, <code>--guidance-low=0.0</code>, <code>--guidance-high=0.75</code></li>
</ul>
</details>

## Acknowledgement
This codebase builds upon several excellent open-source projects, including:
- [1d-tokenizer](https://github.com/bytedance/1d-tokenizer)
- [edm2](https://github.com/NVlabs/edm2)
- [LightningDiT](https://github.com/hustvl/LightningDiT)
- [REPA](https://github.com/sihyun-yu/REPA)
- [Taming-Transformers](https://github.com/CompVis/taming-transformers)

We sincerely thank the authors for making their work publicly available.

## BibTeX
If you find our work useful, please consider citing:

```bibtex
@article{leng2025repae,
  title={REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers},
  author={Xingjian Leng and Jaskirat Singh and Yunzhong Hou and Zhenchang Xing and Saining Xie and Liang Zheng},
  year={2025},
  journal={arXiv preprint arXiv:2504.10483},
}
```