pipeline_tag: unconditional-image-generation
library_name: diffusers
license: unknown
tags:
- diffusion-model
- self-supervised-learning
- dit
- sit
USP: Unified Self-Supervised Pretraining for Image Generation and Understanding
This repository contains the weights for USP: Unified Self-Supervised Pretraining for Image Generation and Understanding, as described in our paper: https://huggingface.co/papers/2503.06132.
Find our official code and more details on GitHub: https://github.com/GD-ML/USP.
Abstract
Recent studies have highlighted the interplay between diffusion models and representation learning. Intermediate representations from diffusion models can be leveraged for downstream visual tasks, while self-supervised vision models can enhance the convergence and generation quality of diffusion models. However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models.
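The masked latent modeling idea in the abstract can be illustrated with a minimal sketch of the masking step alone. The helper name `random_patch_mask`, the 32x32 latent grid, the patch size of 2, and the 0.75 mask ratio are illustrative assumptions, not values taken from the official code; see the GitHub repository for the actual pretraining setup.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into (visible, masked) by uniform random sampling.

    Hypothetical helper for illustration only; not part of the USP codebase.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Assumed sizes: a 32x32 VAE latent grid split into 2x2 patches.
latent_size, patch_size = 32, 2
num_patches = (latent_size // patch_size) ** 2

visible, masked = random_patch_mask(num_patches, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 64 192
```

In a setup like this, the encoder would only see the visible latent patches, and the training objective would be to reconstruct the masked ones.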
Model Architecture and Convergence
USP significantly improves convergence speed through weight initialization from pretraining alone.

Finetuning Weights and Evaluation Results
Finetuning weights for image generation tasks are available. All weights were pretrained for 1600 epochs and then finetuned for 400K steps.
Using the above weights and following the inference and evaluation procedures outlined in GENERATION.md, we obtained the following evaluation results:
| Model Name | Pretrain | Finetuning | FID (↓) | IS (↑) | sFID (↓) |
|---|---|---|---|---|---|
| DiT_B-2 | 1600 epochs | 400K steps | 27.22 | 50.47 | 7.60 |
| DiT_L-2 | 1600 epochs | 400K steps | 15.05 | 80.11 | 6.41 |
| DiT_XL-2 | 1600 epochs | 400K steps | 9.64 | 112.93 | 6.30 |
| SiT_B-2 | 1600 epochs | 400K steps | 22.10 | 61.59 | 5.88 |
| SiT_XL-2 | 1600 epochs | 400K steps | 7.35 | 128.50 | 5.00 |
Our method is complementary to DINO-based acceleration methods, so the two can be combined. The table below reports results when USP is combined with external-model-based methods:
| Model | Params | Steps | FID (↓) | IS (↑) |
|---|---|---|---|---|
| SiT-XL/2 | 130M | 400K | 16.97 | 77.50 |
| USP | 130M | 400K | 7.38 | 127.96 |
| REPA | 130M | 400K | 7.9 | 122.6 |
| USP + REPA | 130M | 400K | 6.26 | 139.84 |
| VAVAE | 130M | 64 Epochs | 5.18/2.15† | 132.4/245.1† |
| USP + VAVAE | 130M | 64 Epochs | 4.2/1.81† | 144/261.0† |
Table: Results Combined with External-Model-Based Methods. †: w/ CFG=10.0.
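To make the comparison in the table easier to read, the short sketch below computes the relative FID reduction over the SiT-XL/2 baseline. The FID values are copied from the 400K-step rows of the table; the helper name `relative_fid_reduction` is hypothetical and only serves this illustration.

```python
def relative_fid_reduction(baseline_fid, new_fid):
    """Fractional FID reduction relative to a baseline (lower FID is better)."""
    if baseline_fid <= 0:
        raise ValueError("baseline_fid must be positive")
    return (baseline_fid - new_fid) / baseline_fid


# FID values copied from the 400K-step rows of the table.
fid = {"SiT-XL/2": 16.97, "USP": 7.38, "REPA": 7.9, "USP + REPA": 6.26}

for name in ("USP", "REPA", "USP + REPA"):
    reduction = relative_fid_reduction(fid["SiT-XL/2"], fid[name])
    print(f"{name}: {reduction:.1%} lower FID than SiT-XL/2")
```

The VAVAE rows are left out because they use a different training budget (64 epochs), so they are not directly comparable to the 400K-step baseline.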
Usage
You can use this model with the diffusers library for unconditional image generation.
from diffusers import DiffusionPipeline
import torch
# Load the USP Image Generation pipeline
# Replace "GD-ML/USP-Image_Generation" with the actual repo ID if different
pipeline = DiffusionPipeline.from_pretrained("GD-ML/USP-Image_Generation", torch_dtype=torch.float16)
pipeline.to("cuda")
# Generate an image
image = pipeline(num_inference_steps=50).images[0]
# Save or display the image
image.save("usp_generated_image.png")
print("Generated image saved as usp_generated_image.png")
For detailed instructions on pre-training and image generation tasks, please refer to the guides in the official GitHub repository, such as GENERATION.md for inference and evaluation.
Acknowledgement
Our code is based on MAE, DiT, SiT and VisionLLaMA. Thanks for their great work.
Citation
If you find USP useful in your research or applications, please consider citing our paper:
@misc{chu2025uspunifiedselfsupervisedpretraining,
title={USP: Unified Self-Supervised Pretraining for Image Generation and Understanding},
author={Xiangxiang Chu and Renda Li and Yong Wang},
year={2025},
eprint={2503.06132},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.06132},
}
