RiT: Vanilla Diffusion Transformers Suffice in Representation Space
Abstract
Flow matching in representation spaces with improved statistical properties enables efficient diffusion model training with reduced parameters and fast sampling.
Flow matching with x-prediction (regressing the clean data point rather than the ambient velocity) is known to exploit low-dimensional manifold structure effectively in pixel space [li2025back]. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both d ≈ 33) yet DINOv2 exhibits 7.3× higher effective rank, 35× better covariance conditioning, 11.5× lower excess kurtosis, and 1.7× lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning objectives rather than mere compression. These statistical properties render the flow-matching regression well-conditioned and remove the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. We propose the Representation Image Transformer (RiT): a vanilla Diffusion Transformer trained by x-prediction on frozen DINOv2 features, augmented only by a dimension-aware noise schedule and joint [CLS]-patch modeling. On ImageNet 256×256, RiT attains FID 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT^DH-XL with 19% fewer parameters (676M vs. 839M). The resulting ODE is efficiently solvable at coarse discretizations: with classifier-free guidance, 5 Heun steps already reach FID 2.0 and 10 steps reach 1.25, without distillation or consistency training. Code at https://github.com/lezhang7/RiT.
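To make three of the geometric axes named in the abstract concrete, here is a minimal sketch of how effective rank, covariance conditioning, and excess kurtosis could be computed. The entropy-based definition of effective rank is an assumption (the abstract does not say which definition is used), and all function names are illustrative rather than taken from the RiT code.

```python
import math

def effective_rank(eigenvalues):
    """Entropy-based effective rank: exp(H(p)), p_i = lambda_i / sum(lambda).
    Assumed definition; the paper may use a different one."""
    total = sum(eigenvalues)
    probs = [lam / total for lam in eigenvalues if lam > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

def condition_number(eigenvalues):
    """Ratio of the largest to the smallest covariance eigenvalue."""
    return max(eigenvalues) / min(eigenvalues)

def excess_kurtosis(samples):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((v - mean) ** 2 for v in samples) / n
    m4 = sum((v - mean) ** 4 for v in samples) / n
    return m4 / (m2 ** 2) - 3.0

# Two toy covariance spectra: one isotropic, one dominated by a single direction.
flat_spectrum = [1.0, 1.0, 1.0, 1.0]
peaked_spectrum = [100.0, 1.0, 1.0, 1.0]
```

On these toy spectra the isotropic one uses all four directions equally, while the peaked one has an effective rank close to one and a condition number of 100. This is the kind of gap the abstract reports between DINOv2 and pixel features.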
RiT-XL: Vanilla Diffusion Transformers Are Enough in Representation Space
This repository hosts the released RiT-XL checkpoint, saved at epoch 740 of an
800-epoch training run on ImageNet 256×256 with frozen DINOv2-Small features.
Results on ImageNet 256×256
| Method | Encoder | Params | FID ↓ (CFG=1) | FID ↓ (CFG≈3.7) |
|---|---|---|---|---|
| DiT-XL | SD-VAE | 675M | 9.62 | 2.27 |
| SiT-XL | SD-VAE | 675M | 8.61 | 2.06 |
| REPA-XL | SD-VAE | 675M | 5.78 | 1.29 |
| DDT-XL | SD-VAE | 675M | 6.27 | 1.26 |
| REG-XL | SD-VAE | 675M | 1.80 | 1.36 |
| RAE-XL | DINOv2-S | 676M | 1.87 | 1.41 |
| RAE-XL^DH | DINOv2-B | 839M | 1.51 | 1.16 |
| FAE-XL | FAE-DINOv2-G | 675M | 1.48 | 1.29 |
| RiT-XL (ours) | DINOv2-S | 676M | 1.45 | 1.14 |
All FIDs use 25 Heun steps with the time-shift schedule.
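The card does not spell out the time-shift schedule. As a hedged sketch, the block below assumes the common rational shift u' = s·u / (1 + (s − 1)·u) applied to a uniform noise level u (u = 1 is pure noise). The names `shift_noise_level` and `shifted_grid` are hypothetical, and the token count (256) and base dimension (4096) used to guess s are illustrative assumptions, not values confirmed by this release.

```python
import math

def shift_noise_level(u, s):
    """Map a uniform noise level u in [0, 1] to s*u / (1 + (s - 1)*u).
    For s > 1 this spends more of the grid at high noise."""
    return s * u / (1.0 + (s - 1.0) * u)

def shifted_grid(num_steps, s):
    """Noise levels from 1 (pure noise) down to 0 (clean data), after shifting."""
    return [shift_noise_level(1.0 - i / num_steps, s) for i in range(num_steps + 1)]

# Assumption for illustration only: s = sqrt(tokens * channels / base_dim)
# with 256 tokens, 384 channels (DINOv2-Small width), and a 4096-dim base.
s_guess = math.sqrt(256 * 384 / 4096)  # sqrt(24)

grid = shifted_grid(25, s_guess)  # 26 grid points for 25 solver steps
```

Under these assumptions the guessed shift lands close to the s ≈ 4.9 quoted under Model details, but the card does not state how s was derived, so treat the match as suggestive only.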
Few-step generation (no distillation, no consistency training):
| Heun steps | 5 | 10 | 25 | 50 |
|---|---|---|---|---|
| FID (CFG=1.0) | 2.44 | 1.59 | 1.47 | 1.46 |
| FID (CFG=3.7) | 1.99 | 1.27 | 1.15 | 1.15 |
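For readers unfamiliar with the solver behind these step counts, here is a minimal sketch of Heun's method (second-order, two velocity evaluations per step) on a toy ODE. The toy velocity dz/dt = −z stands in for the learned model and is purely an assumption for illustration; it is not the repository's sampler.

```python
import math

def heun_integrate(velocity, z0, times):
    """Integrate dz/dt = velocity(z, t) along `times` with Heun's method."""
    z = z0
    for t0, t1 in zip(times[:-1], times[1:]):
        h = t1 - t0
        k1 = velocity(z, t0)            # slope at the start of the step
        k2 = velocity(z + h * k1, t1)   # slope at the Euler-predicted endpoint
        z = z + 0.5 * h * (k1 + k2)     # advance with the averaged slope
    return z

def toy_velocity(z, t):
    """Stand-in for the learned velocity field; exact solution is exp(-t)."""
    return -z

def error_after(num_steps):
    """Absolute error at t = 1 on a uniform grid with `num_steps` steps."""
    times = [i / num_steps for i in range(num_steps + 1)]
    return abs(heun_integrate(toy_velocity, 1.0, times) - math.exp(-1.0))

# Same step counts as the table above.
errors = {n: error_after(n) for n in (5, 10, 25, 50)}
```

The error shrinks quickly as the step count grows and then levels off, which mirrors the shape of the FID curve in the table, although the toy problem says nothing about the actual FID values.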
Quick start
The full training/inference code lives at
lezhang7/RiT. The eval script auto-pulls
this checkpoint plus the matching RAE decoder on first run:

    git clone https://github.com/lezhang7/RiT.git
    cd RiT
    pip install -r requirements.txt
    bash scripts/eval.sh   # CFG=3.7, FID ~1.14 on ImageNet 256x256

To download just the weights manually:

    import torch
    from huggingface_hub import hf_hub_download

    ckpt = hf_hub_download(repo_id="le723z/RiT", filename="checkpoint-last.pth")
    # weights_only=False unpickles arbitrary Python objects (the file also
    # stores an argparse namespace), so load only checkpoints you trust.
    state = torch.load(ckpt, map_location="cpu", weights_only=False)
    # state['model'] / state['model_ema1'] / state['model_ema2'] hold the
    # trainable parameters and the two EMA copies.

Checkpoint contents
checkpoint-last.pth is a PyTorch checkpoint produced after 740 training
epochs (the released model used for the paper's headline numbers). Top-level
keys:
- model: main parameters of the Denoiser (RiT-XL backbone).
- model_ema1: EMA decay 0.9999 (used for sampling by default).
- model_ema2: EMA decay 0.9996 (tracked but unused at inference).
- optimizer: AdamW state for resuming training.
- epoch: 740.
- args: argparse namespace from the original training run (legacy JiT-RAE-XL/16 model name; the architecture matches the released RiT-XL/16).
Loading uses only model / model_ema*, so the legacy args field does not
matter — eval.sh constructs the model from the CLI flags.
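As a small illustration of the key layout described above, the helper below picks the parameter dictionary to sample from, preferring model_ema1 and falling back to the other entries. The function name `select_weights` and the fallback order are assumptions for this sketch; the repository's eval script may select weights differently.

```python
def select_weights(checkpoint, preferred=("model_ema1", "model_ema2", "model")):
    """Return (key, state_dict) for the first preferred key in the checkpoint.

    `checkpoint` is the dictionary returned by torch.load on
    checkpoint-last.pth; the fallback order is an illustrative choice.
    """
    for key in preferred:
        if key in checkpoint:
            return key, checkpoint[key]
    raise KeyError(f"none of {preferred} found; available keys: {sorted(checkpoint)}")

# Stand-in dictionary with the same top-level keys as the released file.
fake_checkpoint = {
    "model": {"w": 0.0},
    "model_ema1": {"w": 1.0},
    "model_ema2": {"w": 2.0},
    "optimizer": {},
    "epoch": 740,
    "args": None,
}
key, weights = select_weights(fake_checkpoint)
```

With the real file, the returned dictionary would then be passed to `load_state_dict` on a model built by the repository code; constructing that model is not shown here.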
Model details
- Architecture: vanilla Diffusion Transformer with 28 layers, hidden size 1152,
  16 heads, SwiGLU FFN, RMSNorm, QK-norm, 2D VisionRoPE, 32 in-context class
  tokens, joint [CLS]-patch modeling.
- Encoder (frozen): facebook/dinov2-with-registers-small (d=384).
- Decoder (frozen): ViT-MAE-style decoder from nyu-visionx/RAE-collections,
  variant decoders/dinov2/wReg_small/ViTXL_n08/model.pt.
- Parameters (denoiser only): 676M.
- Training: 8×H200, effective batch size 1536, AdamW lr=5e-5, 800 epochs (this
  ckpt: epoch 740), x-prediction loss, dimension-aware time shift
  (s ≈ 4.9), CLS auxiliary loss weight λ=0.2.
- Sampling defaults: Heun, 25 steps, time-shift schedule, CFG=3.7 in
  interval [0.1, 0.98], coupled-noise initialization for [CLS].
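The training objective and the guidance interval listed above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the repository's implementation: the interpolation z_t = t·x + (1 − t)·ε with t = 1 meaning clean data, the way the CLS term is added with weight λ, the use of independent noise for [CLS], and the reading of the interval as a range of t are all guesses. `predict_x` is a hypothetical stand-in for the denoiser.

```python
import random

def interpolate(x, eps, t):
    """z_t = t*x + (1 - t)*eps, with t = 1 meaning clean data (assumed convention)."""
    return [t * xi + (1.0 - t) * ei for xi, ei in zip(x, eps)]

def mse(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def x_prediction_loss(predict_x, patches, cls, t, lam=0.2, rng=None):
    """Regress clean features from noisy ones; the CLS term is weighted by lam."""
    rng = rng or random.Random()
    z_patches = interpolate(patches, [rng.gauss(0.0, 1.0) for _ in patches], t)
    z_cls = interpolate(cls, [rng.gauss(0.0, 1.0) for _ in cls], t)
    pred_patches, pred_cls = predict_x(z_patches, z_cls, t)
    return mse(pred_patches, patches) + lam * mse(pred_cls, cls)

def guided_prediction(cond, uncond, t, scale=3.7, interval=(0.1, 0.98)):
    """Classifier-free guidance applied only while t lies inside the interval."""
    low, high = interval
    if not (low <= t <= high):
        return list(cond)  # outside the interval: plain conditional output
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]
```

A predictor that returns the clean features exactly gives zero loss, and guidance falls back to the conditional output outside the interval. The sketch does not model the coupled-noise initialization for [CLS] mentioned in the sampling defaults.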
Citation
@article{zhang2025rit,
title = {RiT: Vanilla Diffusion Transformers Are Enough in Representation Space},
author = {Zhang, Le and Mang, Ning and Agrawal, Aishwarya},
year = {2025}
}
Acknowledgments
This release reuses the frozen DINOv2 encoder + ViT decoder pairing from
RAE and adopts the modernized DiT
block design + in-context class tokens from JiT.