---
library_name: diffusers
pipeline_tag: image-to-image
---
<h1 align="center"> REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers </h1>

<p align="center">
  <a href="https://scholar.google.com.au/citations?user=GQzvqS4AAAAJ" target="_blank">Xingjian Leng</a><sup>1*</sup>&nbsp;&nbsp;<b>·</b>&nbsp;&nbsp;
  <a href="https://1jsingh.github.io/" target="_blank">Jaskirat Singh</a><sup>1*</sup>&nbsp;&nbsp;<b>·</b>&nbsp;&nbsp;
  <a href="https://hou-yz.github.io/" target="_blank">Yunzhong Hou</a><sup>1</sup>&nbsp;&nbsp;<b>·</b>&nbsp;&nbsp;
  <a href="https://people.csiro.au/X/Z/Zhenchang-Xing/" target="_blank">Zhenchang Xing</a><sup>2</sup>&nbsp;&nbsp;<b>·</b>&nbsp;&nbsp;
  <a href="https://www.sainingxie.com/" target="_blank">Saining Xie</a><sup>3</sup>&nbsp;&nbsp;<b>·</b>&nbsp;&nbsp;
  <a href="https://zheng-lab-anu.github.io/" target="_blank">Liang Zheng</a><sup>1</sup>
</p>

<p align="center">
  <sup>1</sup>Australian National University&nbsp;&nbsp; <sup>2</sup>Data61-CSIRO&nbsp;&nbsp; <sup>3</sup>New York University<br>
  <sub><sup>*</sup>Project Leads</sub>
</p>

<p align="center">
  <a href="https://End2End-Diffusion.github.io">Project Page</a>&nbsp;&nbsp;
  <a href="https://huggingface.co/REPA-E">🤗 Models</a>&nbsp;&nbsp;
  <a href="https://arxiv.org/abs/2504.10483">Paper</a>
  <br>
  <a href="https://paperswithcode.com/sota/image-generation-on-imagenet-256x256?p=repa-e-unlocking-vae-for-end-to-end-tuning-of"><img src="https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/repa-e-unlocking-vae-for-end-to-end-tuning-of/image-generation-on-imagenet-256x256" alt="PWC"></a>
</p>

<p align="center">
  <img src="https://github.com/End2End-Diffusion/REPA-E/raw/main/assets/vis-examples.jpg" width="100%" alt="teaser">
</p>
---

We address a fundamental question: ***Can latent diffusion models and their VAE tokenizer be trained end-to-end?*** While training both components jointly with the standard diffusion loss is observed to be ineffective, often degrading final performance, we show that this limitation can be overcome using a simple representation-alignment (REPA) loss. Our proposed method, **REPA-E**, enables stable and effective joint training of both the VAE and the diffusion model.
<p align="center">
  <img src="https://github.com/End2End-Diffusion/REPA-E/raw/main/assets/overview.jpg" width="100%" alt="overview">
</p>

**REPA-E** significantly accelerates training, achieving over **17×** speedup compared to REPA and **45×** over the vanilla training recipe. Interestingly, end-to-end tuning also improves the VAE itself: the resulting **E2E-VAE** provides better latent structure and serves as a **drop-in replacement** for existing VAEs (e.g., SD-VAE), improving convergence and generation quality across diverse LDM architectures. Our method achieves state-of-the-art FID scores on ImageNet 256×256: **1.26** with CFG and **1.83** without CFG.
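For intuition, the representation-alignment idea can be illustrated with a small numeric sketch: align per-token features from the diffusion transformer with features from a frozen pretrained visual encoder. This is *not* the REPA-E implementation — the function name, feature shapes, and the plain cosine-similarity form below are illustrative assumptions only (see the paper for the actual objective):

```python
import numpy as np

def repa_alignment_loss(diffusion_feats: np.ndarray, encoder_feats: np.ndarray) -> float:
    """Toy REPA-style alignment term: negative mean cosine similarity between
    (projected) diffusion-model features and frozen-encoder features,
    computed per spatial token and averaged over the batch."""
    # Normalize each token's feature vector to unit length.
    d = diffusion_feats / (np.linalg.norm(diffusion_feats, axis=-1, keepdims=True) + 1e-8)
    e = encoder_feats / (np.linalg.norm(encoder_feats, axis=-1, keepdims=True) + 1e-8)
    # Cosine similarity per token, averaged; negated so lower = better aligned.
    return float(-np.mean(np.sum(d * e, axis=-1)))

# Toy shapes: batch of 2 images, 16 tokens, 8-dim features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 16, 8))
print(repa_alignment_loss(feats, feats))  # -> approximately -1.0 (perfect alignment)
```

Minimizing this term alongside the diffusion loss is what, per the paper, makes joint VAE + diffusion training stable rather than degenerate.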
## Usage and Training

Please refer to our [GitHub repo](https://github.com/End2End-Diffusion/REPA-E) for detailed notes on end-to-end training and inference with REPA-E.
## Citation

```bibtex
@article{leng2025repae,
  title={REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers},
  author={Xingjian Leng and Jaskirat Singh and Yunzhong Hou and Zhenchang Xing and Saining Xie and Liang Zheng},
  year={2025},
  journal={arXiv preprint arXiv:2504.10483},
}
```