File size: 3,499 Bytes
8f67742 0ed1fa9 8f67742 0ed1fa9 8b21c67 8f67742 0ed1fa9 8f67742 0ed1fa9 8f67742 0ed1fa9 8f67742 8b21c67 8f67742 0ed1fa9 8f67742 0ed1fa9 8f67742 0ed1fa9 8f67742 0ed1fa9 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 |
---
license: mit
pipeline_tag: image-to-image
library_name: diffusers
---
<h1 align="center"> REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers </h1>
<p align="center">
<a href="https://scholar.google.com.au/citations?user=GQzvqS4AAAAJ" target="_blank">Xingjian Leng</a><sup>1*</sup>   <b>·</b>  
<a href="https://1jsingh.github.io/" target="_blank">Jaskirat Singh</a><sup>1*</sup>   <b>·</b>  
<a href="https://hou-yz.github.io/" target="_blank">Yunzhong Hou</a><sup>1</sup>   <b>·</b>  
<a href="https://people.csiro.au/X/Z/Zhenchang-Xing/" target="_blank">Zhenchang Xing</a><sup>2</sup>  <b>·</b>  
<a href="https://www.sainingxie.com/" target="_blank">Saining Xie</a><sup>3</sup>  <b>·</b>  
<a href="https://zheng-lab-anu.github.io/" target="_blank">Liang Zheng</a><sup>1</sup> 
</p>
<p align="center">
<sup>1</sup> Australian National University   <sup>2</sup>Data61-CSIRO   <sup>3</sup>New York University   <br>
<sub><sup>*</sup>Project Leads </sub>
</p>
<p align="center">
<a href="https://End2End-Diffusion.github.io">π Project Page</a>  
<a href="https://huggingface.co/REPA-E">π€ Models</a>  
<a href="https://arxiv.org/abs/2504.10483">π Paper</a>  
<br>
<!-- <a href="https://paperswithcode.com/sota/image-generation-on-imagenet-256x256?p=repa-e-unlocking-vae-for-end-to-end-tuning-of"><img src="https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/repa-e-unlocking-vae-for-end-to-end-tuning-of/image-generation-on-imagenet-256x256" alt="PWC"></a> -->
</p>
<p align="center">
<img src="https://github.com/End2End-Diffusion/REPA-E/raw/main/assets/vis-examples.jpg" width="100%" alt="teaser">
</p>
---
We address a fundamental question: ***Can latent diffusion models and their VAE tokenizer be trained end-to-end?*** While training both components jointly with standard diffusion loss is observed to be ineffective β often degrading final performance β we show that this limitation can be overcome using a simple representation-alignment (REPA) loss. Our proposed method, **REPA-E**, enables stable and effective joint training of both the VAE and the diffusion model.
<p align="center">
<img src="https://github.com/End2End-Diffusion/REPA-E/raw/main/assets/overview.jpg" width="100%" alt="teaser">
</p>
**REPA-E** significantly accelerates training β achieving over **17Γ** speedup compared to REPA and **45Γ** over the vanilla training recipe. Interestingly, end-to-end tuning also improves the VAE itself: the resulting **E2E-VAE** provides better latent structure and serves as a **drop-in replacement** for existing VAEs (e.g., SD-VAE), improving convergence and generation quality across diverse LDM architectures. Our method achieves state-of-the-art FID scores on ImageNet 256Γ256: **1.12** with CFG and **1.69** without CFG.
## Usage and Training
Please refer our [Github Repo](https://github.com/End2End-Diffusion/REPA-E) for detailed notes on end-to-end training and inference using REPA-E.
## π Citation
```bibtex
@article{leng2025repae,
title={REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers},
author={Xingjian Leng and Jaskirat Singh and Yunzhong Hou and Zhenchang Xing and Saining Xie and Liang Zheng},
year={2025},
journal={arXiv preprint arXiv:2504.10483},
}
```
|