Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders (Scale-RAE)
This repository contains artifacts related to the paper Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders.
- Project Page: https://rae-dit.github.io/scale-rae/
- GitHub Repository: https://github.com/ZitengWangNYU/Scale-RAE
Introduction
Representation Autoencoders (RAEs) provide a simplified and powerful alternative to VAEs for large-scale text-to-image generation. Scale-RAE demonstrates that training diffusion models in high-dimensional semantic latent spaces (using encoders like SigLIP-2) leads to faster convergence, better generation quality, and improved stability compared to state-of-the-art VAE-based foundations.
Usage
For detailed instructions on installation, training, and inference, please visit the official GitHub repository.
This decoder is also directly compatible with the original RAE codebase. Try it out by swapping the encoder for google/siglip2-so400m-patch14-224!
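As a quick sanity check before swapping encoders, the sketch below derives the token grid implied by the checkpoint name. It is illustrative only: `LatentGrid` and `grid_from_checkpoint_name` are hypothetical helpers, not part of the RAE or Scale-RAE codebases, and the calculation assumes one token per non-overlapping square patch with no extra tokens. Check the encoder's actual config for the authoritative values.

```python
import re
from dataclasses import dataclass

# Checkpoint named in this README; everything else below is illustrative.
ENCODER_ID = "google/siglip2-so400m-patch14-224"


@dataclass(frozen=True)
class LatentGrid:
    """Patch grid implied by a ViT-style encoder (hypothetical helper type)."""
    patch_size: int
    image_size: int

    @property
    def side(self) -> int:
        # Number of patches along one image edge.
        return self.image_size // self.patch_size

    @property
    def num_tokens(self) -> int:
        # Assumes a square image and one token per patch.
        return self.side * self.side


def grid_from_checkpoint_name(name: str) -> LatentGrid:
    """Parse names ending in 'patch<P>-<R>' (assumed naming convention)."""
    match = re.search(r"patch(\d+)-(\d+)$", name)
    if match is None:
        raise ValueError(f"cannot parse patch size and resolution from {name!r}")
    patch_size, image_size = int(match.group(1)), int(match.group(2))
    if image_size % patch_size != 0:
        raise ValueError("image size is not a multiple of the patch size")
    return LatentGrid(patch_size=patch_size, image_size=image_size)


grid = grid_from_checkpoint_name(ENCODER_ID)
print(grid.side, grid.num_tokens)  # prints "16 256"
```

Under these assumptions, the decoder would receive a 16 x 16 grid of encoder tokens per 224 x 224 image.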
Citation
@article{scale-rae-2026,
title={Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders},
author={Shengbang Tong and Boyang Zheng and Ziteng Wang and Bingda Tang and Nanye Ma and Ellis Brown and Jihan Yang and Rob Fergus and Yann LeCun and Saining Xie},
journal={arXiv preprint arXiv:2601.16208},
year={2026}
}