---
license: mit
pipeline_tag: image-to-image
library_name: diffusers
---
|
|
|
|
|
<h1 align="center"> REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers </h1> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://scholar.google.com.au/citations?user=GQzvqS4AAAAJ" target="_blank">Xingjian Leng</a><sup>1*</sup>   <b>·</b>   |
|
|
<a href="https://1jsingh.github.io/" target="_blank">Jaskirat Singh</a><sup>1*</sup>   <b>·</b>   |
|
|
<a href="https://hou-yz.github.io/" target="_blank">Yunzhong Hou</a><sup>1</sup>   <b>·</b>   |
|
|
<a href="https://people.csiro.au/X/Z/Zhenchang-Xing/" target="_blank">Zhenchang Xing</a><sup>2</sup>  <b>·</b>   |
|
|
<a href="https://www.sainingxie.com/" target="_blank">Saining Xie</a><sup>3</sup>  <b>·</b>   |
|
|
<a href="https://zheng-lab-anu.github.io/" target="_blank">Liang Zheng</a><sup>1</sup>  |
|
|
</p> |
|
|
|
|
|
<p align="center"> |
|
|
<sup>1</sup>Australian National University &nbsp; <sup>2</sup>Data61-CSIRO &nbsp; <sup>3</sup>New York University <br>
|
|
<sub><sup>*</sup>Project Leads </sub> |
|
|
</p> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://End2End-Diffusion.github.io">🌐 Project Page</a>   |
|
|
<a href="https://huggingface.co/REPA-E">🤗 Models</a>   |
|
|
<a href="https://arxiv.org/abs/2504.10483">📃 Paper</a>   |
|
|
<br> |
|
|
<!-- <a href="https://paperswithcode.com/sota/image-generation-on-imagenet-256x256?p=repa-e-unlocking-vae-for-end-to-end-tuning-of"><img src="https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/repa-e-unlocking-vae-for-end-to-end-tuning-of/image-generation-on-imagenet-256x256" alt="PWC"></a> --> |
|
|
</p> |
|
|
|
|
|
|
|
|
<!-- <p align="center"> |
|
|
<img src="https://github.com/End2End-Diffusion/REPA-E/raw/main/assets/vis-examples.jpg" width="100%" alt="teaser"> |
|
|
</p> --> |
|
|
|
|
|
--- |
|
|
|
|
|
We address a fundamental question: ***Can latent diffusion models and their VAE tokenizer be trained end-to-end?*** Naively training both components jointly with the standard diffusion loss is ineffective, often degrading final performance; we show that this limitation can be overcome with a simple representation-alignment (REPA) loss. Our proposed method, **REPA-E**, enables stable and effective joint training of both the VAE and the diffusion model.
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://github.com/End2End-Diffusion/REPA-E/raw/main/assets/overview.jpg" width="100%" alt="teaser"> |
|
|
</p> |
|
|
|
|
|
**REPA-E** significantly accelerates training, achieving over a **17×** speedup compared to REPA and a **45×** speedup over the vanilla training recipe. Interestingly, end-to-end tuning also improves the VAE itself: the resulting **E2E-VAE** provides better latent structure and serves as a **drop-in replacement** for existing VAEs (e.g., SD-VAE), improving convergence and generation quality across diverse LDM architectures. Our method achieves state-of-the-art FID scores on ImageNet 256×256: **1.12** with CFG and **1.69** without CFG.
|
|
|
|
|
|
|
|
<h1 align="left" style="color:#ff000d">🆕 AutoencoderKL-Compatible Release</h1> |
|
|
|
|
|
> **New in this release:** The **REPA-E E2E-VAE** is now available as a fully **Hugging Face–compatible AutoencoderKL** checkpoint, ready to use with `diffusers` out of the box.
|
|
|
|
|
We previously released the REPA-E VAE checkpoint, which required loading through the model class in our REPA-E repository. |
|
|
This new version provides a **Hugging Face–compatible AutoencoderKL** checkpoint that can be loaded directly via the `diffusers` API — no extra code or custom wrapper needed. |
|
|
|
|
|
It offers **plug-and-play compatibility** with diffusion pipelines and can be seamlessly used to build or train new diffusion models. |
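Because the checkpoint is a standard `AutoencoderKL`, it can be passed anywhere a `diffusers` pipeline or training script accepts a `vae` component. Below is a minimal sketch of the idea; the pipeline class and base model ID are illustrative placeholders, not part of this release, and a denoiser pretrained on the original SD-VAE latents would still need retraining or fine-tuning on E2E-VAE latents:

```python
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

# Load the E2E-VAE and hand it to a pipeline as its `vae` component.
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sdvae-hf", torch_dtype=torch.float16)

# Illustrative only: the base model here is a placeholder; in practice the
# denoiser should be trained (or fine-tuned) on E2E-VAE latents.
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", vae=vae, torch_dtype=torch.float16
).to("cuda")
```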
|
|
|
|
|
## ⚡️ Quickstart |
|
|
```python
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sdvae-hf").to("cuda")
```
|
|
> Use `vae.encode(...)` / `vae.decode(...)` in your pipeline. (A full example is provided below.) |
|
|
|
|
|
## 📦 Requirements |
|
|
The following packages are required to load and run the REPA-E VAEs with the `diffusers` library: |
|
|
|
|
|
```bash
pip install "diffusers>=0.33.0"
pip install "torch>=2.3.1"
```
|
|
|
|
|
## 🚀 Example Usage |
|
|
Below is a minimal example showing how to load and use the REPA-E end-to-end trained SD-VAE with `diffusers`: |
|
|
|
|
|
```python
from io import BytesIO

import numpy as np
import requests
import torch
from diffusers import AutoencoderKL
from PIL import Image

device = "cuda"

# Download a test image. Note: this pre-signed URL expires after one hour;
# substitute any image URL (or open a local file) as needed.
response = requests.get("https://s3.amazonaws.com/masters.galleries.prod.dpreview.com/2935392.jpg?X-Amz-Expires=3600&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUIXIAMA3N436PSEA/20251019/us-east-1/s3/aws4_request&X-Amz-Date=20251019T103721Z&X-Amz-SignedHeaders=host&X-Amz-Signature=219dc5f98e5c2e5f3b72587716f75889b8f45b0a01f1bd08dbbc44106e484144")

# Convert to a float tensor of shape (1, 3, 512, 512), normalized to [-1, 1].
image = Image.open(BytesIO(response.content)).convert("RGB").resize((512, 512))
image = torch.from_numpy(np.array(image)).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

# Load the end-to-end tuned VAE.
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sdvae-hf").to(device)

with torch.no_grad():
    # Encode to latents (a sample from the posterior), then decode back to pixels.
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample
```
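If you use the encoder to produce latents for training a new diffusion model, note that `diffusers` pipelines conventionally multiply latents by the VAE's configured scaling factor before passing them to the denoiser, and divide by it before decoding. A minimal sketch of that convention, assuming the value stored in this checkpoint's `config` is the one you want and reusing `vae` and `image` from the example above:

```python
with torch.no_grad():
    # Scale latents by the configured factor, as diffusers pipelines expect.
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

    # Undo the scaling before decoding back to pixel space.
    reconstructed = vae.decode(latents / vae.config.scaling_factor).sample
```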
|
|
|
|
|
|
|
|
## 📚 Citation |
|
|
|
|
|
```bibtex
@article{leng2025repae,
  title={REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers},
  author={Xingjian Leng and Jaskirat Singh and Yunzhong Hou and Zhenchang Xing and Saining Xie and Liang Zheng},
  journal={arXiv preprint arXiv:2504.10483},
  year={2025}
}
```