Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders (Scale-RAE)
This repository contains the implementation of Scale-RAE, a framework that scales text-to-image (T2I) generation by training diffusion models in high-dimensional semantic latent spaces using Representation Autoencoders (RAEs).
Paper | Project Page | GitHub
Introduction
Representation Autoencoders (RAEs) train diffusion models in high-dimensional semantic latent spaces, which offers distinct advantages over conventional VAE latents. Scale-RAE scales this framework to large-scale, freeform text-to-image generation. By using frozen representation encoders (such as SigLIP-2) and targeted data composition, RAE-based diffusion models converge faster and achieve better generation quality than traditional VAE-based models at scales from 0.5B to 9.8B parameters.
Quick Start
Installation
git clone https://github.com/ZitengWangNYU/Scale-RAE.git
cd Scale-RAE
conda create -n scale_rae python=3.10 -y
conda activate scale_rae
pip install -e .
Inference
You can generate images with the provided command-line interface; the models and decoders are downloaded automatically from Hugging Face:
cd inference
python cli.py t2i --prompt "Can you generate a photo of a cat on a windowsill?"
Citation
If you find this work useful, please cite:
@article{scale-rae-2026,
  title={Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders},
  author={Shengbang Tong and Boyang Zheng and Ziteng Wang and Bingda Tang and Nanye Ma and Ellis Brown and Jihan Yang and Rob Fergus and Yann LeCun and Saining Xie},
  journal={arXiv preprint arXiv:2601.16208},
  year={2026}
}
Acknowledgments
This work builds upon RAE, Cambrian-1, WebSSL, and SigLIP-2.