Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders (Scale-RAE)

This repository contains artifacts related to the paper Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders.

Introduction

Representation Autoencoders (RAEs) provide a simplified and powerful alternative to VAEs for large-scale text-to-image generation. Scale-RAE demonstrates that training diffusion models in high-dimensional semantic latent spaces (using encoders such as SigLIP-2) leads to faster convergence, better generation quality, and improved training stability compared to state-of-the-art VAE-based baselines.
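To make the pipeline concrete, here is a minimal, shape-only sketch of the RAE idea: a frozen semantic encoder maps an image to a grid of latent tokens, a diffusion transformer operates in that latent space, and a trained decoder maps latents back to pixels. The token count and latent width below are illustrative assumptions (a 224x224 input with 14x14 patches gives a 16x16 grid; 1152 is a typical so400m hidden width), and all three functions are placeholders, not the paper's actual models.

```python
import numpy as np

# Assumed dimensions for a SigLIP-2-style "so400m" encoder at 224x224 input:
# 14x14 patches -> a 16x16 grid = 256 tokens; 1152 is an assumed hidden width.
IMAGE_SIZE, PATCH = 224, 14
N_TOKENS = (IMAGE_SIZE // PATCH) ** 2   # 256 latent tokens
DIM = 1152                              # semantic latent dimension

def encode(image):
    """Stand-in for a frozen semantic encoder: image -> (N_TOKENS, DIM) latents."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((N_TOKENS, DIM))

def denoise_step(z_t, t):
    """Stand-in for one diffusion-transformer update in the latent space."""
    return z_t * (1.0 - t)  # placeholder update, not a real sampler

def decode(z):
    """Stand-in for the trained decoder: latents -> image array."""
    return np.zeros((IMAGE_SIZE, IMAGE_SIZE, 3))

z = encode(None)              # (256, 1152) semantic latents
z = denoise_step(z, t=0.5)    # diffusion happens here, in latent space
image = decode(z)             # back to pixel space
print(z.shape, image.shape)   # (256, 1152) (224, 224, 3)
```

The key point the sketch conveys is that the diffusion model never touches pixels: it only sees the high-dimensional token grid produced by the encoder.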

Usage

For detailed instructions on installation, training, and inference, please visit the official GitHub repository.

This decoder is also directly compatible with the original RAE codebase. Try it out by simply swapping the encoder to google/siglip2-so400m-patch14-224!
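The encoder swap could look roughly like the following, using the Hugging Face transformers AutoModel/AutoProcessor interface. Only the checkpoint id comes from this card; the helper function name and the exact wiring into the RAE codebase are assumptions, so treat this as a sketch rather than the codebase's actual API.

```python
# Hypothetical sketch: load the SigLIP-2 checkpoint named in this card so its
# vision tower can replace the RAE encoder. `load_siglip2_encoder` is an
# illustrative helper, not a function from the RAE codebase.
ENCODER_ID = "google/siglip2-so400m-patch14-224"

def load_siglip2_encoder(encoder_id: str = ENCODER_ID):
    """Return (processor, vision tower) for use as the RAE encoder."""
    # Imported lazily so the sketch can be read without transformers installed.
    from transformers import AutoModel, AutoProcessor

    processor = AutoProcessor.from_pretrained(encoder_id)
    model = AutoModel.from_pretrained(encoder_id)
    return processor, model.vision_model  # vision tower only; text tower unused

print(ENCODER_ID)
```

In the RAE codebase itself, the swap should amount to pointing the encoder config at this checkpoint id; see the GitHub repository linked above for the actual configuration entry.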

Citation

@article{scale-rae-2026,
  title={Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders},
  author={Shengbang Tong and Boyang Zheng and Ziteng Wang and Bingda Tang and Nanye Ma and Ellis Brown and Jihan Yang and Rob Fergus and Yann LeCun and Saining Xie},
  journal={arXiv preprint arXiv:2601.16208},
  year={2026}
}