metadata

pipeline_tag: unconditional-image-generation
library_name: diffusers
license: unknown
tags:
  - diffusion-model
  - self-supervised-learning
  - dit
  - sit

USP: Unified Self-Supervised Pretraining for Image Generation and Understanding

This repository contains the weights for USP: Unified Self-Supervised Pretraining for Image Generation and Understanding, as described in our paper: https://huggingface.co/papers/2503.06132.

Find our official code and more details on GitHub: https://github.com/GD-ML/USP.

Abstract

Recent studies have highlighted the interplay between diffusion models and representation learning. Intermediate representations from diffusion models can be leveraged for downstream visual tasks, while self-supervised vision models can enhance the convergence and generation quality of diffusion models. However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models.
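To make the idea of masked latent modeling more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the toy sizes, the mask ratio of 0.75, and the helper names `random_mask` and `masked_mse` are illustrative assumptions, and nested lists stand in for patchified VAE latents. The sketch hides a random subset of latent patches and scores a reconstruction only on the hidden ones, in the spirit of MAE-style pretraining.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Return a sorted list of patch indices to hide (hypothetical helper)."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def masked_mse(target, prediction, masked_idx):
    """Mean squared error computed over the masked patches only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for t, p in zip(target[i], prediction[i]):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


rng = random.Random(0)
num_patches, patch_dim = 16, 4  # assumed toy sizes
mask_ratio = 0.75               # assumed value, not taken from this model card

# Toy "latent patches" standing in for patchified VAE latents
latents = [[rng.gauss(0.0, 1.0) for _ in range(patch_dim)]
           for _ in range(num_patches)]

# Split patches into hidden (masked) and visible sets
masked_idx = random_mask(num_patches, mask_ratio, rng)
visible_idx = [i for i in range(num_patches) if i not in masked_idx]

# Stand-in "model" that predicts zeros for every patch
prediction = [[0.0] * patch_dim for _ in range(num_patches)]

loss = masked_mse(latents, prediction, masked_idx)
print(len(masked_idx), len(visible_idx), loss)
```

In an actual pretraining setup, a network would replace the zero predictor, and the latents would come from a VAE encoder rather than a random number generator.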

Model Architecture and Convergence

USP significantly improves convergence speed using only weight initialization from pretraining.

Finetuning Weights and Evaluation Results

Finetuning weights for image generation tasks are available. All weights were pretrained for 1600 epochs and then finetuned for 400K steps.

Using the above weights and following the inference and evaluation procedures outlined in GENERATION.md, we obtained the following evaluation results:

| Model Name | Pretrain | Finetuning | FID (↓) | IS (↑) | sFID (↓) |
|---|---|---|---|---|---|
| DiT_B-2 | 1600 epochs | 400K steps | 27.22 | 50.47 | 7.60 |
| DiT_L-2 | 1600 epochs | 400K steps | 15.05 | 80.11 | 6.41 |
| DiT_XL-2 | 1600 epochs | 400K steps | 9.64 | 112.93 | 6.30 |
| SiT_B-2 | 1600 epochs | 400K steps | 22.10 | 61.59 | 5.88 |
| SiT_XL-2 | 1600 epochs | 400K steps | 7.35 | 128.50 | 5.00 |

Our method is somewhat orthogonal to DINO-based acceleration methods and can be combined with them. Results combined with external-model-based methods:

| Model | Params | Steps | FID (↓) | IS (↑) |
|---|---|---|---|---|
| SiT-XL/2 | 130M | 400K | 16.97 | 77.50 |
| USP | 130M | 400K | 7.38 | 127.96 |
| REPA | 130M | 400K | 7.9 | 122.6 |
| USP + REPA | 130M | 400K | 6.26 | 139.84 |
| VAVAE | 130M | 64 Epochs | 5.18/2.15† | 132.4/245.1† |
| USP + VAVAE | 130M | 64 Epochs | 4.2/1.81† | 144/261.0† |

Table: Results Combined with External-Model-Based Methods. †: w/ CFG=10.0.
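As a quick way to read the table above, the short script below computes the relative FID reduction obtained by adding USP to each baseline. It uses only the FID values reported in the table (the non-CFG numbers for the VAVAE rows); the function name `relative_reduction` is an illustrative choice, not part of the USP codebase.

```python
# FID values copied from the table above (lower is better)
fid = {
    "SiT-XL/2": 16.97,
    "USP": 7.38,
    "REPA": 7.9,
    "USP + REPA": 6.26,
    "VAVAE": 5.18,       # without CFG
    "USP + VAVAE": 4.2,  # without CFG
}


def relative_reduction(baseline, improved):
    """Fractional FID reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# (baseline, baseline combined with USP) pairs taken from the table rows
pairs = [
    ("SiT-XL/2", "USP"),
    ("REPA", "USP + REPA"),
    ("VAVAE", "USP + VAVAE"),
]

reductions = {
    f"{base} -> {combined}": relative_reduction(fid[base], fid[combined])
    for base, combined in pairs
}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower FID")
```

With these numbers, adding USP lowers FID by roughly 57% relative to the SiT-XL/2 baseline, about 21% relative to REPA, and about 19% relative to VAVAE without CFG. Note that the VAVAE rows are trained for 64 epochs, not 400K steps, so the three comparisons are not on an identical training budget.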

Usage

You can use this model with the diffusers library for unconditional image generation.

```python
import torch
from diffusers import DiffusionPipeline

# Use the GPU when one is available; float16 is only used on CUDA
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Load the USP Image Generation pipeline
# Replace "GD-ML/USP-Image_Generation" with the actual repo ID if different
pipeline = DiffusionPipeline.from_pretrained("GD-ML/USP-Image_Generation", torch_dtype=dtype)
pipeline = pipeline.to(device)

# Generate an image
image = pipeline(num_inference_steps=50).images[0]

# Save the image to disk
image.save("usp_generated_image.png")
print("Generated image saved as usp_generated_image.png")
```

For detailed instructions on pre-training and image generation tasks, please refer to the guides in the official GitHub repository: https://github.com/GD-ML/USP.

Acknowledgement

Our code is based on MAE, DiT, SiT and VisionLLaMA. Thanks for their great work.

Citation

If you find USP useful in your research or applications, please consider citing our paper:

@misc{chu2025uspunifiedselfsupervisedpretraining,
      title={USP: Unified Self-Supervised Pretraining for Image Generation and Understanding}, 
      author={Xiangxiang Chu and Renda Li and Yong Wang},
      year={2025},
      eprint={2503.06132},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.06132}, 
}