Update README.md
README.md
CHANGED
|
@@ -17,271 +17,38 @@ library_name: diffusers
|
|
| 17 |
|
| 18 |
<p align="center">
|
| 19 |
<sup>1</sup> Australian National University   <sup>2</sup>Data61-CSIRO   <sup>3</sup>New York University   <br>
|
| 20 |
-
<sub><sup>*</sup>Project Leads
|
| 21 |
</p>
|
| 22 |
|
| 23 |
<p align="center">
|
| 24 |
<a href="https://End2End-Diffusion.github.io">Project Page</a>  
|
| 25 |
<a href="https://huggingface.co/REPA-E">🤗 Models</a>  
|
| 26 |
<a href="https://arxiv.org/abs/2504.10483">Paper</a>  
|
| 27 |
-
<
|
| 28 |
-
<br><br>
|
| 29 |
<a href="https://paperswithcode.com/sota/image-generation-on-imagenet-256x256?p=repa-e-unlocking-vae-for-end-to-end-tuning-of"><img src="https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/repa-e-unlocking-vae-for-end-to-end-tuning-of/image-generation-on-imagenet-256x256" alt="PWC"></a>
|
| 30 |
</p>
|
| 31 |
|
| 32 |
-

|
| 33 |
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-

|
| 38 |
-
|
| 39 |
-
**REPA-E** significantly accelerates training, achieving over **17×** speedup compared to REPA and **45×** over the vanilla training recipe. Interestingly, end-to-end tuning also improves the VAE itself: the resulting **E2E-VAE** provides better latent structure and serves as a **drop-in replacement** for existing VAEs (e.g., SD-VAE), improving convergence and generation quality across diverse LDM architectures. Our method achieves state-of-the-art FID scores on ImageNet 256×256: **1.26** with CFG and **1.83** without CFG.
|
| 40 |
-
|
| 41 |
-
## News and Updates
|
| 42 |
-
**[2025-04-15]** Initial Release with pre-trained models and codebase.
|
| 43 |
-
|
| 44 |
-
## Getting Started
|
| 45 |
-
### 1. Environment Setup
|
| 46 |
-
To set up our environment, please run:
|
| 47 |
-
|
| 48 |
-
```bash
|
| 49 |
-
git clone https://github.com/REPA-E/REPA-E.git
|
| 50 |
-
cd REPA-E
|
| 51 |
-
conda env create -f environment.yml -y
|
| 52 |
-
conda activate repa-e
|
| 53 |
-
```
|
| 54 |
-
|
| 55 |
-
### 2. Prepare the training data
|
| 56 |
-
Download and extract the training split of the [ImageNet-1K](https://www.image-net.org/challenges/LSVRC/2012/index) dataset. Once it's ready, run the following command to preprocess the dataset:
|
| 57 |
-
|
| 58 |
-
```bash
|
| 59 |
-
python preprocessing.py --imagenet-path /PATH/TO/IMAGENET_TRAIN
|
| 60 |
-
```
|
| 61 |
-
|
| 62 |
-
Replace `/PATH/TO/IMAGENET_TRAIN` with the actual path to the extracted training images.
|
| 63 |
-
|
| 64 |
-
### 3. Train the REPA-E model
|
| 65 |
-
|
| 66 |
-
To train the REPA-E model, you first need to download the following pre-trained VAE checkpoints:
|
| 67 |
-
- [🤗 **SD-VAE (f8d4)**](https://huggingface.co/REPA-E/sdvae): Derived from the [Stability AI SD-VAE](https://huggingface.co/stabilityai/sd-vae-ft-mse), originally trained on [Open Images](https://storage.googleapis.com/openimages/web/index.html) and fine-tuned on a subset of [LAION-2B](https://laion.ai/blog/laion-5b/).
|
| 68 |
-
- [🤗 **IN-VAE (f16d32)**](https://huggingface.co/REPA-E/invae): Trained from scratch on [ImageNet-1K](https://www.image-net.org/) using the [latent-diffusion](https://github.com/CompVis/latent-diffusion) codebase with our custom architecture.
|
| 69 |
-
- [🤗 **VA-VAE (f16d32)**](https://huggingface.co/REPA-E/vavae): Taken from [LightningDiT](https://github.com/hustvl/LightningDiT), this VAE is a visual tokenizer aligned with vision foundation models during reconstruction training. It is also trained on [ImageNet-1K](https://www.image-net.org/) for high-quality tokenization in high-dimensional latent spaces.
|
| 70 |
-
|
| 71 |
-
Recommended directory structure:
|
| 72 |
-
```
|
| 73 |
-
pretrained/
|
| 74 |
-
├── invae/
|
| 75 |
-
├── sdvae/
|
| 76 |
-
└── vavae/
|
| 77 |
-
```
|
| 78 |
-
|
| 79 |
-
Once you've downloaded the VAE checkpoint, you can launch REPA-E training with:
|
| 80 |
-
```bash
|
| 81 |
-
accelerate launch train_repae.py \
|
| 82 |
-
--max-train-steps=400000 \
|
| 83 |
-
--report-to="wandb" \
|
| 84 |
-
--allow-tf32 \
|
| 85 |
-
--mixed-precision="fp16" \
|
| 86 |
-
--seed=0 \
|
| 87 |
-
--data-dir="data" \
|
| 88 |
-
--output-dir="exps" \
|
| 89 |
-
--batch-size=256 \
|
| 90 |
-
--path-type="linear" \
|
| 91 |
-
--prediction="v" \
|
| 92 |
-
--weighting="uniform" \
|
| 93 |
-
--model="SiT-XL/2" \
|
| 94 |
-
--checkpointing-steps=50000 \
|
| 95 |
-
--loss-cfg-path="configs/l1_lpips_kl_gan.yaml" \
|
| 96 |
-
--vae="f8d4" \
|
| 97 |
-
--vae-ckpt="pretrained/sdvae/sdvae-f8d4.pt" \
|
| 98 |
-
--disc-pretrained-ckpt="pretrained/sdvae/sdvae-f8d4-discriminator-ckpt.pt" \
|
| 99 |
-
--enc-type="dinov2-vit-b" \
|
| 100 |
-
--proj-coeff=0.5 \
|
| 101 |
-
--encoder-depth=8 \
|
| 102 |
-
--vae-align-proj-coeff=1.5 \
|
| 103 |
-
--bn-momentum=0.1 \
|
| 104 |
-
--exp-name="sit-xl-dinov2-b-enc8-repae-sdvae-0.5-1.5-400k"
|
| 105 |
-
```
|
| 106 |
-
<details>
|
| 107 |
-
<summary>Click to expand for configuration options</summary>
|
| 108 |
-
|
| 109 |
-
The script automatically creates a folder under `exps` to save logs and checkpoints. You can adjust the following options:
|
| 110 |
-
|
| 111 |
-
- `--output-dir`: Directory to save checkpoints and logs
|
| 112 |
-
- `--exp-name`: Experiment name (a subfolder will be created under `output-dir`)
|
| 113 |
-
- `--vae`: Choose between `[f8d4, f16d32]`
|
| 114 |
-
- `--vae-ckpt`: Path to a provided or custom VAE checkpoint
|
| 115 |
-
- `--disc-pretrained-ckpt`: Path to a provided or custom VAE discriminator checkpoint
|
| 116 |
-
- `--model`: Choose from `[SiT-B/2, SiT-L/2, SiT-XL/2, SiT-B/1, SiT-L/1, SiT-XL/1]`. The number indicates the patch size. Select a model compatible with your VAE architecture.
|
| 117 |
-
- `--enc-type`: `[dinov2-vit-b, dinov2-vit-l, dinov2-vit-g, dinov1-vit-b, mocov3-vit-b, mocov3-vit-l, clip-vit-L, jepa-vit-h, mae-vit-l]`
|
| 118 |
-
- `--encoder-depth`: Any integer from 1 up to the full depth of the selected encoder
|
| 119 |
-
- `--proj-coeff`: REPA-E projection coefficient for SiT alignment (float > 0)
|
| 120 |
-
- `--vae-align-proj-coeff`: REPA-E projection coefficient for VAE alignment (float > 0)
|
| 121 |
-
- `--bn-momentum`: Batchnorm layer momentum (float)
|
| 122 |
-
|
| 123 |
-
</details>
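
The pairing of `--vae` and `--model` in the commands here (`f8d4` with `SiT-XL/2`, `f16d32` with `SiT-XL/1`) can be sanity-checked with a little arithmetic. The sketch below assumes that the `f8`/`f16` prefix is the VAE's spatial downsampling factor and that images are 256×256. The helper `latent_tokens` is hypothetical and is not part of the REPA-E codebase.

```python
def latent_tokens(image_size: int, vae_downsample: int, patch_size: int) -> int:
    """Number of tokens the diffusion transformer sees per image.

    Assumes a square image, a VAE that shrinks each side by `vae_downsample`,
    and non-overlapping square patches of side `patch_size` on the latent grid.
    """
    if image_size % vae_downsample != 0:
        raise ValueError("image size must be divisible by the VAE downsampling factor")
    latent_side = image_size // vae_downsample
    if latent_side % patch_size != 0:
        raise ValueError("latent side must be divisible by the patch size")
    return (latent_side // patch_size) ** 2


# f8 VAE with patch size 2: 32x32 latent grid, 16x16 patches
tokens_f8 = latent_tokens(256, 8, 2)
# f16 VAE with patch size 1: 16x16 latent grid, 16x16 patches
tokens_f16 = latent_tokens(256, 16, 1)
print(tokens_f8, tokens_f16)  # 256 256
```

Under these assumptions both pairings give the transformer the same sequence length, which is one plausible reading of "compatible with your VAE architecture".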
|
| 124 |
-
|
| 125 |
-
### 4. Use REPA-E Tuned VAE (E2E-VAE) for Accelerated Training and Better Generation
|
| 126 |
-
This section shows how to use the REPA-E fine-tuned VAE (E2E-VAE) in latent diffusion training. E2E-VAE acts as a drop-in replacement for the original VAE, significantly accelerating training and improving generation quality. You can either download a pre-trained VAE or extract it from a REPA-E checkpoint.
|
| 127 |
-
|
| 128 |
-
**Step 1**: Obtain the fine-tuned VAE from REPA-E checkpoints:
|
| 129 |
-
- **Option 1**: Download pre-trained REPA-E VAEs directly from Hugging Face:
|
| 130 |
-
- [🤗 **E2E-SDVAE**](https://huggingface.co/REPA-E/e2e-sdvae)
|
| 131 |
-
- [🤗 **E2E-INVAE**](https://huggingface.co/REPA-E/e2e-invae)
|
| 132 |
-
- [🤗 **E2E-VAVAE**](https://huggingface.co/REPA-E/e2e-vavae)
|
| 133 |
-
|
| 134 |
-
Recommended directory structure:
|
| 135 |
-
```
|
| 136 |
-
pretrained/
|
| 137 |
-
├── e2e-sdvae/
|
| 138 |
-
├── e2e-invae/
|
| 139 |
-
└── e2e-vavae/
|
| 140 |
-
```
|
| 141 |
-
- **Option 2**: Extract the VAE from a full REPA-E checkpoint manually:
|
| 142 |
-
```bash
|
| 143 |
-
python save_vae_weights.py \
|
| 144 |
-
--repae-ckpt pretrained/sit-repae-vavae/checkpoints/0400000.pt \
|
| 145 |
-
--vae-name e2e-vavae \
|
| 146 |
-
--save-dir exps
|
| 147 |
-
```
|
| 148 |
-
|
| 149 |
-
**Step 2**: Cache latents to enable fast training:
|
| 150 |
-
```bash
|
| 151 |
-
accelerate launch --num_machines=1 --num_processes=8 cache_latents.py \
|
| 152 |
-
--vae-arch="f16d32" \
|
| 153 |
-
--vae-ckpt-path="pretrained/e2e-vavae/e2e-vavae-400k.pt" \
|
| 154 |
-
--vae-latents-name="e2e-vavae" \
|
| 155 |
-
--pproc-batch-size=128
|
| 156 |
-
```
|
| 157 |
-
|
| 158 |
-
**Step 3**: Train the SiT generation model using the cached latents:
|
| 159 |
-
|
| 160 |
-
```bash
|
| 161 |
-
accelerate launch train_ldm_only.py \
|
| 162 |
-
--max-train-steps=4000000 \
|
| 163 |
-
--report-to="wandb" \
|
| 164 |
-
--allow-tf32 \
|
| 165 |
-
--mixed-precision="fp16" \
|
| 166 |
-
--seed=0 \
|
| 167 |
-
--data-dir="data" \
|
| 168 |
-
--batch-size=256 \
|
| 169 |
-
--path-type="linear" \
|
| 170 |
-
--prediction="v" \
|
| 171 |
-
--weighting="uniform" \
|
| 172 |
-
--model="SiT-XL/1" \
|
| 173 |
-
--checkpointing-steps=50000 \
|
| 174 |
-
--vae="f16d32" \
|
| 175 |
-
--vae-ckpt="pretrained/e2e-vavae/e2e-vavae-400k.pt" \
|
| 176 |
-
--vae-latents-name="e2e-vavae" \
|
| 177 |
-
--learning-rate=1e-4 \
|
| 178 |
-
--enc-type="dinov2-vit-b" \
|
| 179 |
-
--proj-coeff=0.5 \
|
| 180 |
-
--encoder-depth=8 \
|
| 181 |
-
--output-dir="exps" \
|
| 182 |
-
--exp-name="sit-xl-1-dinov2-b-enc8-ldm-only-e2e-vavae-0.5-4m"
|
| 183 |
-
```
|
| 184 |
-
|
| 185 |
-
For details on the available training options and argument descriptions, refer to [Section 3](#3-train-the-repa-e-model).
|
| 186 |
-
|
| 187 |
-
### 5. Generate samples and run evaluation
|
| 188 |
-
You can generate samples and save them as `.npz` files using the following script. Simply set the `--exp-path` and `--train-steps` corresponding to your trained model (REPA-E or Traditional LDM Training).
|
| 189 |
-
|
| 190 |
-
```bash
|
| 191 |
-
torchrun --nnodes=1 --nproc_per_node=8 generate.py \
|
| 192 |
-
--num-fid-samples 50000 \
|
| 193 |
-
--path-type linear \
|
| 194 |
-
--mode sde \
|
| 195 |
-
--num-steps 250 \
|
| 196 |
-
--cfg-scale 1.0 \
|
| 197 |
-
--guidance-high 1.0 \
|
| 198 |
-
--guidance-low 0.0 \
|
| 199 |
-
--exp-path pretrained/sit-repae-sdvae \
|
| 200 |
-
--train-steps 400000
|
| 201 |
-
```
|
| 202 |
-
|
| 203 |
-
```bash
|
| 204 |
-
torchrun --nnodes=1 --nproc_per_node=8 generate.py \
|
| 205 |
-
--num-fid-samples 50000 \
|
| 206 |
-
--path-type linear \
|
| 207 |
-
--mode sde \
|
| 208 |
-
--num-steps 250 \
|
| 209 |
-
--cfg-scale 1.0 \
|
| 210 |
-
--guidance-high 1.0 \
|
| 211 |
-
--guidance-low 0.0 \
|
| 212 |
-
--exp-path pretrained/sit-ldm-e2e-vavae \
|
| 213 |
-
--train-steps 4000000
|
| 214 |
-
```
|
| 215 |
-
|
| 216 |
-
<details>
|
| 217 |
-
<summary>Click to expand for sampling options</summary>
|
| 218 |
-
|
| 219 |
-
You can adjust the following options for sampling:
|
| 220 |
-
- `--path-type`: Noise schedule type, choose from `[linear, cosine]`
|
| 221 |
-
- `--mode`: Sampling mode, `[ode, sde]`
|
| 222 |
-
- `--num-steps`: Number of denoising steps
|
| 223 |
-
- `--cfg-scale`: Guidance scale (float ≥ 1), setting it to 1 disables classifier-free guidance (CFG)
|
| 224 |
-
- `--guidance-high`: Upper guidance interval (float in [0, 1])
|
| 225 |
-
- `--guidance-low`: Lower guidance interval (float in [0, 1], must be < `--guidance-high`)
|
| 226 |
-
- `--exp-path`: Path to the experiment directory
|
| 227 |
-
- `--train-steps`: Training step of the checkpoint to evaluate
|
| 228 |
-
|
| 229 |
-
</details>
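
To make the interaction between `--cfg-scale`, `--guidance-low`, and `--guidance-high` concrete, here is a minimal sketch of classifier-free guidance restricted to a time interval. It assumes guidance is applied only when the current time `t` lies in `[guidance_low, guidance_high]` and the scale exceeds 1. The exact rule used by `generate.py` is not shown in this README, and the function `guided_prediction` is hypothetical.

```python
def guided_prediction(cond, uncond, t, cfg_scale=1.0, guidance_low=0.0, guidance_high=1.0):
    """Combine conditional and unconditional model outputs (lists of floats).

    Assumption: CFG is active only for cfg_scale > 1 and t inside the
    guidance interval; otherwise the conditional output is used unchanged.
    """
    use_cfg = cfg_scale > 1.0 and guidance_low <= t <= guidance_high
    if not use_cfg:
        return list(cond)
    # Standard CFG extrapolation: uncond + scale * (cond - uncond)
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


cond, uncond = [1.0, 2.0], [0.0, 1.0]
# Inside the interval: guidance is applied
print(guided_prediction(cond, uncond, t=0.5, cfg_scale=2.0, guidance_high=0.65))  # [2.0, 3.0]
# Outside the interval: falls back to the conditional output
print(guided_prediction(cond, uncond, t=0.9, cfg_scale=2.0, guidance_high=0.65))  # [1.0, 2.0]
```

With `cfg_scale=1.0` the function always returns the conditional output, matching the note that a scale of 1 disables CFG.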
|
| 230 |
-
|
| 231 |
-
You can then use the [ADM evaluation suite](https://github.com/openai/guided-diffusion/tree/main/evaluations) to compute image generation quality metrics, including gFID, sFID, Inception Score (IS), Precision, and Recall.
|
| 232 |
-
|
| 233 |
-
### Quantitative Results
|
| 234 |
-
The tables below report generation performance using gFID on 50k samples, with and without classifier-free guidance (CFG). We compare models trained end-to-end with **REPA-E** and models using a frozen REPA-E fine-tuned VAE (**E2E-VAE**). Lower is better. All linked checkpoints below are hosted on our [🤗 Hugging Face Hub](https://huggingface.co/REPA-E). To reproduce these results, download the respective checkpoints to the `pretrained` folder and run the evaluation script as detailed in [Section 5](#5-generate-samples-and-run-evaluation).
|
| 235 |
-
|
| 236 |
-
#### A. End-to-End Training (REPA-E)
|
| 237 |
-
| Tokenizer | Generation Model | Epochs | gFID-50k ↓ | gFID-50k (CFG) ↓ |
|
| 238 |
-
|:---------|:----------------|:-----:|:----:|:---:|
|
| 239 |
-
| [**SD-VAE<sup>*</sup>**](https://huggingface.co/REPA-E/sdvae) | [**SiT-XL/2**](https://huggingface.co/REPA-E/sit-repae-sdvae) | 80 | 4.07 | 1.67<sup>a</sup> |
|
| 240 |
-
| [**IN-VAE<sup>*</sup>**](https://huggingface.co/REPA-E/invae) | [**SiT-XL/1**](https://huggingface.co/REPA-E/sit-repae-invae) | 80 | 4.09 | 1.61<sup>b</sup> |
|
| 241 |
-
| [**VA-VAE<sup>*</sup>**](https://huggingface.co/REPA-E/vavae) | [**SiT-XL/1**](https://huggingface.co/REPA-E/sit-repae-vavae) | 80 | 4.05 | 1.73<sup>c</sup> |
|
| 242 |
-
|
| 243 |
-
\* The "Tokenizer" column refers to the initial VAE used for joint REPA-E training. The final (jointly optimized) VAE is bundled within the generation model checkpoint.
|
| 244 |
-
|
| 245 |
-
<details>
|
| 246 |
-
<summary>Click to expand for CFG parameters</summary>
|
| 247 |
-
<ul>
|
| 248 |
-
<li><strong>a</strong>: <code>--cfg-scale=2.2</code>, <code>--guidance-low=0.0</code>, <code>--guidance-high=0.65</code></li>
|
| 249 |
-
<li><strong>b</strong>: <code>--cfg-scale=1.8</code>, <code>--guidance-low=0.0</code>, <code>--guidance-high=0.825</code></li>
|
| 250 |
-
<li><strong>c</strong>: <code>--cfg-scale=1.9</code>, <code>--guidance-low=0.0</code>, <code>--guidance-high=0.825</code></li>
|
| 251 |
-
</ul>
|
| 252 |
-
</details>
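
The Epochs column can be related to the `--max-train-steps` values used in the training commands. The sketch below assumes that `--batch-size=256` is the global batch size and that one epoch is one pass over the 1,281,167 images of the ImageNet-1K training split. The helper `steps_to_epochs` is hypothetical.

```python
IMAGENET_1K_TRAIN_IMAGES = 1_281_167  # size of the ImageNet-1K training split


def steps_to_epochs(train_steps: int, batch_size: int,
                    dataset_size: int = IMAGENET_1K_TRAIN_IMAGES) -> float:
    """Convert optimizer steps to passes over the dataset."""
    return train_steps * batch_size / dataset_size


# 400k steps at batch size 256, as in the REPA-E training command
print(f"{steps_to_epochs(400_000, 256):.1f}")    # 79.9
# 4M steps at batch size 256, as in the train_ldm_only.py command
print(f"{steps_to_epochs(4_000_000, 256):.1f}")  # 799.3
```

Under these assumptions the two commands correspond to roughly 80 and 800 epochs.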
|
| 253 |
|
| 254 |
---
|
| 255 |
|
| 256 |
-
|
| 257 |
-
| Tokenizer | Generation Model | Method | Epochs | gFID-50k ↓ | gFID-50k (CFG) ↓ |
|
| 258 |
-
|:------|:---------|:----------------|:-----:|:----:|:---:|
|
| 259 |
-
| SD-VAE | SiT-XL/2 | SiT | 1400 | 8.30 | 2.06 |
|
| 260 |
-
| SD-VAE | SiT-XL/2 | REPA | 800 | 5.90 | 1.42 |
|
| 261 |
-
| VA-VAE | LightningDiT-XL/1 | LightningDiT | 800 | 2.17 | 1.36 |
|
| 262 |
-
| [**E2E-VAVAE (Ours)**](https://huggingface.co/REPA-E/e2e-vavae) | [**SiT-XL/1**](https://huggingface.co/REPA-E/sit-ldm-e2e-vavae) | REPA | 800 | **1.83** | **1.26**<sup>†</sup> |
|
| 263 |
|
| 264 |
-
|
| 265 |
|
| 266 |
-
|
| 267 |
-
<summary>Click to expand for CFG parameters</summary>
|
| 268 |
-
<ul>
|
| 269 |
-
<li><strong>†</strong>: <code>--cfg-scale=2.5</code>, <code>--guidance-low=0.0</code>, <code>--guidance-high=0.75</code></li>
|
| 270 |
-
</ul>
|
| 271 |
-
</details>
|
| 272 |
|
| 273 |
-
|
| 274 |
-
|
| 275 |
-
- [1d-tokenizer](https://github.com/bytedance/1d-tokenizer)
|
| 276 |
-
- [edm2](https://github.com/NVlabs/edm2)
|
| 277 |
-
- [LightningDiT](https://github.com/hustvl/LightningDiT)
|
| 278 |
-
- [REPA](https://github.com/sihyun-yu/REPA)
|
| 279 |
-
- [Taming-Transformers](https://github.com/CompVis/taming-transformers)
|
| 280 |
|
| 281 |
-
|
| 282 |
|
| 283 |
-
##
|
| 284 |
-
If you find our work useful, please consider citing:
|
| 285 |
|
| 286 |
```bibtex
|
| 287 |
@article{leng2025repae,
|
|
@@ -290,4 +57,4 @@ If you find our work useful, please consider citing:
|
|
| 290 |
year={2025},
|
| 291 |
journal={arXiv preprint arXiv:2504.10483},
|
| 292 |
}
|
| 293 |
-
```
|
|
|
|
| 17 |
|
| 18 |
<p align="center">
|
| 19 |
<sup>1</sup> Australian National University   <sup>2</sup>Data61-CSIRO   <sup>3</sup>New York University   <br>
|
| 20 |
+
<sub><sup>*</sup>Project Leads </sub>
|
| 21 |
</p>
|
| 22 |
|
| 23 |
<p align="center">
|
| 24 |
<a href="https://End2End-Diffusion.github.io">Project Page</a>  
|
| 25 |
<a href="https://huggingface.co/REPA-E">🤗 Models</a>  
|
| 26 |
<a href="https://arxiv.org/abs/2504.10483">Paper</a>  
|
| 27 |
+
<br>
|
| 28 |
<a href="https://paperswithcode.com/sota/image-generation-on-imagenet-256x256?p=repa-e-unlocking-vae-for-end-to-end-tuning-of"><img src="https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/repa-e-unlocking-vae-for-end-to-end-tuning-of/image-generation-on-imagenet-256x256" alt="PWC"></a>
|
| 29 |
</p>
|
| 30 |
|
| 31 |
|
| 32 |
+
<p align="center">
|
| 33 |
+
<img src="https://github.com/End2End-Diffusion/REPA-E/raw/main/assets/vis-examples.jpg" width="100%" alt="teaser">
|
| 34 |
+
</p>
|
| 35 |
|
| 36 |
---
|
| 37 |
|
| 38 |
+
We address a fundamental question: ***Can latent diffusion models and their VAE tokenizer be trained end-to-end?*** Training both components jointly with the standard diffusion loss is observed to be ineffective, often degrading final performance, but we show that this limitation can be overcome using a simple representation-alignment (REPA) loss. Our proposed method, **REPA-E**, enables stable and effective joint training of both the VAE and the diffusion model.
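
As a rough illustration of what a representation-alignment loss can look like, the sketch below computes the negative mean cosine similarity between projected diffusion features and target features from a pretrained vision encoder, one vector per patch. This form is an assumption based on the usual REPA formulation; the README does not spell out the exact loss, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero feature vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(projected_patches, target_patches):
    """Negative mean patch-wise cosine similarity (lower means better aligned)."""
    sims = [cosine_similarity(p, t) for p, t in zip(projected_patches, target_patches)]
    return -sum(sims) / len(sims)


# Two patches, 2-D features: perfectly aligned features reach the minimum of -1
aligned = alignment_loss([[1.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 2.0]])
# Orthogonal features give a loss of 0
orthogonal = alignment_loss([[1.0, 0.0]], [[0.0, 1.0]])
print(aligned, orthogonal)
```

In an end-to-end setup, a term of this kind would be added to the training objective with a weighting coefficient.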
|
| 39 |
|
| 40 |
+
<p align="center">
|
| 41 |
+
<img src="https://github.com/End2End-Diffusion/REPA-E/raw/main/assets/overview.jpg" width="100%" alt="teaser">
|
| 42 |
+
</p>
|
| 43 |
|
| 44 |
+
**REPA-E** significantly accelerates training, achieving over **17×** speedup compared to REPA and **45×** over the vanilla training recipe. Interestingly, end-to-end tuning also improves the VAE itself: the resulting **E2E-VAE** provides better latent structure and serves as a **drop-in replacement** for existing VAEs (e.g., SD-VAE), improving convergence and generation quality across diverse LDM architectures. Our method achieves state-of-the-art FID scores on ImageNet 256×256: **1.26** with CFG and **1.83** without CFG.
|
| 45 |
|
| 46 |
+
|
| 47 |
+
## Usage and Training
|
| 48 |
|
| 49 |
+
Please refer to our [GitHub repo](https://github.com/End2End-Diffusion/REPA-E) for detailed notes on end-to-end training and inference using REPA-E.
|
| 50 |
|
| 51 |
+
## Citation
|
| 52 |
|
| 53 |
```bibtex
|
| 54 |
@article{leng2025repae,
|
|
|
|
| 57 |
year={2025},
|
| 58 |
journal={arXiv preprint arXiv:2504.10483},
|
| 59 |
}
|
| 60 |
+
```
|