Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders (Scale-RAE)
This repository contains the implementation of Scale-RAE, a framework that scales text-to-image (T2I) generation by training diffusion models in high-dimensional semantic latent spaces using Representation Autoencoders (RAEs).
Paper | Project Page | GitHub
Introduction
Representation Autoencoders (RAEs) train diffusion models in high-dimensional semantic latent spaces, which offers distinct advantages over conventional VAE latents. Scale-RAE scales this framework to large-scale, freeform text-to-image generation. By using frozen representation encoders (such as SigLIP-2) and targeted data composition, RAE-based diffusion models converge faster and achieve better generation quality than traditional VAE-based models at scales from 0.5B to 9.8B parameters.
Quick Start
Installation
git clone https://github.com/ZitengWangNYU/Scale-RAE.git
cd Scale-RAE
conda create -n scale_rae python=3.10 -y
conda activate scale_rae
pip install -e .
Inference
You can generate images with the provided command-line interface; the models and decoders are downloaded automatically from Hugging Face:
cd inference
python cli.py t2i --prompt "Can you generate a photo of a cat on a windowsill?"
Citation
If you find this work useful, please cite:
@article{scale-rae-2026,
  title={Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders},
  author={Shengbang Tong and Boyang Zheng and Ziteng Wang and Bingda Tang and Nanye Ma and Ellis Brown and Jihan Yang and Rob Fergus and Yann LeCun and Saining Xie},
  journal={arXiv preprint arXiv:2601.16208},
  year={2026}
}
Acknowledgments
This work builds upon RAE, Cambrian-1, WebSSL, and SigLIP-2.