Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders (Scale-RAE)

This repository contains artifacts related to the paper Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders.

Introduction

Representation Autoencoders (RAEs) provide a simplified and powerful alternative to VAEs for large-scale text-to-image generation. Scale-RAE demonstrates that training diffusion models in high-dimensional semantic latent spaces (using encoders such as SigLIP-2) leads to faster convergence, better generation quality, and improved training stability compared to state-of-the-art VAE-based baselines.
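To make the pipeline concrete, here is a minimal, shape-only sketch of the RAE idea: a frozen semantic encoder maps an image to a grid of latent tokens, a diffusion transformer operates in that latent space, and a trained decoder maps latents back to pixels. The token count and latent width below are illustrative assumptions (a 224x224 input with 14x14 patches gives a 16x16 grid; 1152 is a typical so400m hidden width), and all three functions are placeholders, not the paper's actual models.

```python
import numpy as np

# Assumed dimensions for a SigLIP-2-style "so400m" encoder at 224x224 input:
# 14x14 patches -> a 16x16 grid = 256 tokens; 1152 is an assumed hidden width.
IMAGE_SIZE, PATCH = 224, 14
N_TOKENS = (IMAGE_SIZE // PATCH) ** 2   # 256 latent tokens
DIM = 1152                              # semantic latent dimension

def encode(image):
    """Stand-in for a frozen semantic encoder: image -> (N_TOKENS, DIM) latents."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((N_TOKENS, DIM))

def denoise_step(z_t, t):
    """Stand-in for one diffusion-transformer update in the latent space."""
    return z_t * (1.0 - t)  # placeholder update, not a real sampler

def decode(z):
    """Stand-in for the trained decoder: latents -> image array."""
    return np.zeros((IMAGE_SIZE, IMAGE_SIZE, 3))

z = encode(None)              # (256, 1152) semantic latents
z = denoise_step(z, t=0.5)    # diffusion happens here, in latent space
image = decode(z)             # back to pixel space
print(z.shape, image.shape)   # (256, 1152) (224, 224, 3)
```

The key point the sketch conveys is that the diffusion model never touches pixels: it only sees the high-dimensional token grid produced by the encoder.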

Usage

For detailed instructions on installation, training, and inference, please visit the official GitHub repository.

This decoder is also directly compatible with the original RAE codebase. Try it out by simply swapping the encoder to google/siglip2-so400m-patch14-224!
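The encoder swap could look roughly like the following, using the Hugging Face transformers AutoModel/AutoProcessor interface. Only the checkpoint id comes from this card; the helper function name and the exact wiring into the RAE codebase are assumptions, so treat this as a sketch rather than the codebase's actual API.

```python
# Hypothetical sketch: load the SigLIP-2 checkpoint named in this card so its
# vision tower can replace the RAE encoder. `load_siglip2_encoder` is an
# illustrative helper, not a function from the RAE codebase.
ENCODER_ID = "google/siglip2-so400m-patch14-224"

def load_siglip2_encoder(encoder_id: str = ENCODER_ID):
    """Return (processor, vision tower) for use as the RAE encoder."""
    # Imported lazily so the sketch can be read without transformers installed.
    from transformers import AutoModel, AutoProcessor

    processor = AutoProcessor.from_pretrained(encoder_id)
    model = AutoModel.from_pretrained(encoder_id)
    return processor, model.vision_model  # vision tower only; text tower unused

print(ENCODER_ID)
```

In the RAE codebase itself, the swap should amount to pointing the encoder config at this checkpoint id; see the GitHub repository linked above for the actual configuration entry.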

Citation

@article{scale-rae-2026,
  title={Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders},
  author={Shengbang Tong and Boyang Zheng and Ziteng Wang and Bingda Tang and Nanye Ma and Ellis Brown and Jihan Yang and Rob Fergus and Yann LeCun and Saining Xie},
  journal={arXiv preprint arXiv:2601.16208},
  year={2026}
}