metadata
pipeline_tag: unconditional-image-generation
library_name: diffusers
license: unknown
tags:
  - diffusion-model
  - self-supervised-learning
  - dit
  - sit

USP: Unified Self-Supervised Pretraining for Image Generation and Understanding

This repository contains the weights for USP: Unified Self-Supervised Pretraining for Image Generation and Understanding, as described in our paper: https://huggingface.co/papers/2503.06132.

Find our official code and more details on GitHub: https://github.com/GD-ML/USP.

Abstract

Recent studies have highlighted the interplay between diffusion models and representation learning. Intermediate representations from diffusion models can be leveraged for downstream visual tasks, while self-supervised vision models can enhance the convergence and generation quality of diffusion models. However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models.

Model Architecture and Convergence

Figure: Model architecture.

USP significantly improves convergence speed through weight initialization from pretraining alone.

Figure: Convergence speed.
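The masked latent modeling idea from the abstract can be illustrated with a minimal sketch of the masking step alone. The concrete numbers here (a 32x32 latent grid, 2x2 patches, a 75% mask ratio) and the helper names `patch_token_positions` and `random_mask` are assumptions chosen for illustration and are not taken from the official code; see the GitHub repository for the actual implementation.

```python
import random
from typing import List, Optional, Tuple


def patch_token_positions(latent_size: int, patch_size: int) -> List[Tuple[int, int]]:
    """Return the (row, col) grid position of every patch token in a square latent."""
    if latent_size % patch_size != 0:
        raise ValueError("latent_size must be divisible by patch_size")
    grid = latent_size // patch_size
    return [(r, c) for r in range(grid) for c in range(grid)]


def random_mask(
    num_tokens: int, mask_ratio: float, seed: Optional[int] = None
) -> Tuple[List[int], List[int]]:
    """Randomly split token indices into (visible, masked) lists."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)
    num_masked = int(num_tokens * mask_ratio)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


# Assumed illustration values: 32x32 latent, 2x2 patches, 75% of tokens masked.
tokens = patch_token_positions(latent_size=32, patch_size=2)
visible, masked = random_mask(len(tokens), mask_ratio=0.75, seed=0)
print(len(tokens), len(visible), len(masked))  # 256 64 192
```

In a setup like this, the encoder would only see the visible tokens and the pretraining objective would reconstruct the masked ones in the VAE latent space.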

Finetuning Weights and Evaluation Results

Finetuning weights for image generation tasks are available. All weights were pretrained for 1600 epochs and then finetuned for 400K steps.

Using the above weights and following the inference and evaluation procedures outlined in GENERATION.md, we obtained the following evaluation results:

| Model Name | Pretrain | Finetuning | FID (↓) | IS (↑) | sFID (↓) |
|------------|-------------|------------|---------|--------|----------|
| DiT_B-2 | 1600 epochs | 400K steps | 27.22 | 50.47 | 7.60 |
| DiT_L-2 | 1600 epochs | 400K steps | 15.05 | 80.11 | 6.41 |
| DiT_XL-2 | 1600 epochs | 400K steps | 9.64 | 112.93 | 6.30 |
| SiT_B-2 | 1600 epochs | 400K steps | 22.10 | 61.59 | 5.88 |
| SiT_XL-2 | 1600 epochs | 400K steps | 7.35 | 128.50 | 5.00 |
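For readers less familiar with the metric columns, the standard definitions of FID and IS are reproduced here. The symbols are the usual textbook ones, not notation taken from the USP paper: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of Inception features for real and generated images, and $p(y \mid x)$ is the Inception classifier's label distribution for a generated image $x$.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)

\mathrm{IS} = \exp\!\left( \mathbb{E}_{x}\left[ D_{\mathrm{KL}}\!\left( p(y \mid x) \,\Vert\, p(y) \right) \right] \right)
```

Lower FID means the generated feature statistics are closer to the real ones, while higher IS indicates more confident and more diverse class predictions. sFID applies the same Fréchet distance to spatial Inception features.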

Our method is somewhat orthogonal to DINO-based acceleration methods and can be combined with them. Results combined with external-model-based methods:

| Model | Params | Steps | FID (↓) | IS (↑) |
|-------|--------|-------|---------|--------|
| SiT-XL/2 | 130M | 400K | 16.97 | 77.50 |
| USP | 130M | 400K | 7.38 | 127.96 |
| REPA | 130M | 400K | 7.9 | 122.6 |
| USP + REPA | 130M | 400K | 6.26 | 139.84 |
| VAVAE | 130M | 64 Epochs | 5.18/2.15† | 132.4/245.1† |
| USP + VAVAE | 130M | 64 Epochs | 4.2/1.81† | 144/261.0† |

Table: Results Combined with External-Model-Based Methods. †: w/ CFG=10.0.
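To make the 400K-step rows easier to compare, the short sketch below computes the relative FID reduction of each method against the SiT-XL/2 baseline. The FID values are copied from the table above; the helper `relative_reduction` is a hypothetical name introduced only for this illustration.

```python
# FID values copied from the 400K-step rows of the table above.
fid = {
    "SiT-XL/2": 16.97,
    "USP": 7.38,
    "REPA": 7.9,
    "USP + REPA": 6.26,
}


def relative_reduction(baseline: float, value: float) -> float:
    """Fractional decrease of `value` relative to `baseline` (lower FID is better)."""
    return (baseline - value) / baseline


baseline = fid["SiT-XL/2"]
reductions = {
    name: relative_reduction(baseline, value)
    for name, value in fid.items()
    if name != "SiT-XL/2"
}

for name, reduction in reductions.items():
    print(f"{name}: {reduction:.1%} lower FID than SiT-XL/2")
```

By this measure, USP alone lowers FID by roughly 57% relative to the baseline, and combining it with REPA lowers it by roughly 63%.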

Usage

You can use this model with the diffusers library for unconditional image generation.

from diffusers import DiffusionPipeline
import torch

# Load the USP Image Generation pipeline
# Replace "GD-ML/USP-Image_Generation" with the actual repo ID if different
pipeline = DiffusionPipeline.from_pretrained("GD-ML/USP-Image_Generation", torch_dtype=torch.float16)
pipeline.to("cuda")

# Generate an image
image = pipeline(num_inference_steps=50).images[0]

# Save or display the image
image.save("usp_generated_image.png")
print("Generated image saved as usp_generated_image.png")

For detailed instructions on pre-training and image generation tasks, including the inference and evaluation procedures in GENERATION.md, please refer to the guides in the official GitHub repository: https://github.com/GD-ML/USP.

Acknowledgement

Our code is based on MAE, DiT, SiT, and VisionLLaMA. We thank the authors for their great work.

Citation

If you find USP useful in your research or applications, please consider citing our paper:

@misc{chu2025uspunifiedselfsupervisedpretraining,
      title={USP: Unified Self-Supervised Pretraining for Image Generation and Understanding}, 
      author={Xiangxiang Chu and Renda Li and Yong Wang},
      year={2025},
      eprint={2503.06132},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.06132}, 
}