---
license: other
license_name: nsclv1
license_link: LICENSE
tags:
- image-generation
- diffusion
- vision-transformer
- class-conditional
datasets:
- imagenet-1k
---

# DiffiT: Diffusion Vision Transformers for Image Generation

[arXiv](https://arxiv.org/abs/2312.02139) · [GitHub](https://github.com/NVlabs/DiffiT)

This repository hosts the pretrained model weights for [**DiffiT**](https://arxiv.org/abs/2312.02139) (ECCV 2024), a diffusion model built on Vision Transformers that achieves state-of-the-art image generation quality with improved parameter efficiency.

## Overview

**DiffiT** (Diffusion Vision Transformers) combines the expressive power of diffusion models with Vision Transformers (ViTs), introducing **Time-dependent Multihead Self-Attention (TMSA)** for fine-grained control over the denoising process at each diffusion timestep. DiffiT achieves state-of-the-art class-conditional generation on ImageNet at multiple resolutions, notably attaining an **FID score of 1.73** on ImageNet-256 while using **19.85% and 16.88% fewer parameters** than comparable Transformer-based diffusion models, MDT and DiT respectively.
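
To make the TMSA idea concrete, here is a minimal PyTorch sketch (not the official implementation; layer names and shapes are illustrative). The key point is that the query/key/value projections receive a contribution from the time embedding, so attention weights adapt to the diffusion timestep:

```python
import torch
import torch.nn as nn


class TMSA(nn.Module):
    """Illustrative sketch of Time-dependent Multihead Self-Attention.

    Spatial tokens and a time embedding jointly parameterize the q/k/v
    projections, so the attention pattern can change per diffusion step.
    This is a simplified sketch, not the official DiffiT code.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Separate linear maps for spatial features and the time embedding;
        # their sum forms the final q, k, v (the core TMSA idea).
        self.qkv_x = nn.Linear(dim, 3 * dim, bias=False)
        self.qkv_t = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) spatial tokens; t_emb: (B, dim) time embedding
        B, N, C = x.shape
        qkv = self.qkv_x(x) + self.qkv_t(t_emb).unsqueeze(1)  # broadcast over tokens
        qkv = qkv.reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv.unbind(0)                               # each (B, H, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

See the paper and repository for the full formulation, including the windowed variant used at higher resolutions.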
|
| |  |
| |
|
| |  |
| |
|

## Models

### ImageNet-256

| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|-------|---------|------------|---------|-----------------|----------|
| **DiffiT** | ImageNet | 256×256 | **1.73** | **276.49** | [model](https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_256.safetensors) |

### ImageNet-512

| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|-------|---------|------------|---------|-----------------|----------|
| **DiffiT** | ImageNet | 512×512 | **2.67** | **252.12** | [model](https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_512.safetensors) |

## Usage

Please refer to the official [GitHub repository](https://github.com/NVlabs/DiffiT) for full setup instructions, training code, and evaluation scripts.

### Sampling Images

Image sampling is performed using `sample.py` from the [DiffiT repository](https://github.com/NVlabs/DiffiT). To reproduce the reported numbers, use the commands below.

**ImageNet-256:**

```bash
python sample.py \
  --log_dir $LOG_DIR \
  --cfg_scale 4.4 \
  --model_path $MODEL \
  --image_size 256 \
  --model Diffit \
  --num_sampling_steps 250 \
  --num_samples 50000 \
  --cfg_cond True
```

**ImageNet-512:**

```bash
python sample.py \
  --log_dir $LOG_DIR \
  --cfg_scale 1.49 \
  --model_path $MODEL \
  --image_size 512 \
  --model Diffit \
  --num_sampling_steps 250 \
  --num_samples 50000 \
  --cfg_cond True
```

### Evaluation

Once images have been sampled, you can compute FID and other metrics using the provided `eval_run.sh` script in the repository. The evaluation pipeline follows the protocol from [openai/guided-diffusion/evaluations](https://github.com/openai/guided-diffusion/tree/main/evaluations).

```bash
bash eval_run.sh
```
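
For intuition, FID fits Gaussians to Inception-v3 features of real and generated images and reports the Fréchet distance between them. The NumPy sketch below shows just that final distance computation (not the full pipeline, which also extracts the Inception features and accumulates the statistics):

```python
import numpy as np
from scipy import linalg


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2) --
    the quantity FID reports over Inception-v3 feature statistics."""
    diff = mu1 - mu2
    # Matrix square root of the covariance product; numerical noise can
    # introduce a tiny imaginary component, so keep the real part.
    covmean = linalg.sqrtm(sigma1 @ sigma2).real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))


# Identical statistics give distance ~0; shifting the mean by d adds ||d||^2.
mu, sigma = np.zeros(4), np.eye(4)
print(frechet_distance(mu, sigma, mu, sigma))        # ~0.0
print(frechet_distance(mu, sigma, mu + 1.0, sigma))  # ~4.0
```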

## Citation

```bibtex
@inproceedings{hatamizadeh2025diffit,
  title={DiffiT: Diffusion Vision Transformers for Image Generation},
  author={Hatamizadeh, Ali and Song, Jiaming and Liu, Guilin and Kautz, Jan and Vahdat, Arash},
  booktitle={European Conference on Computer Vision},
  pages={37--55},
  year={2025},
  organization={Springer}
}
```

## License

Copyright © 2026, NVIDIA Corporation. All rights reserved.

The code is released under the [NVIDIA Source Code License-NC](https://github.com/NVlabs/DiffiT/blob/main/LICENSE). The pretrained models are shared under [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/): if you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.