---
license: other
license_name: nsclv1
license_link: LICENSE
tags:
- image-generation
- diffusion
- vision-transformer
- class-conditional
datasets:
- imagenet-1k
---
# DiffiT: Diffusion Vision Transformers for Image Generation
[Paper (arXiv)](https://arxiv.org/abs/2312.02139) | [Code (GitHub)](https://github.com/NVlabs/DiffiT)
This repository hosts the pretrained model weights for [**DiffiT**](https://arxiv.org/abs/2312.02139) (ECCV 2024), a diffusion model built on Vision Transformers that achieves state-of-the-art image generation quality with improved parameter efficiency.
## Overview
**DiffiT** (Diffusion Vision Transformers) is a generative model that combines the expressive power of diffusion models with Vision Transformers (ViTs). It introduces **Time-dependent Multihead Self-Attention (TMSA)**, which gives the denoising network fine-grained, timestep-dependent control at every stage of the diffusion process. DiffiT achieves state-of-the-art performance on class-conditional ImageNet generation at multiple resolutions, attaining an **FID score of 1.73** on ImageNet-256 while using **19.85% and 16.88% fewer parameters** than the comparable Transformer-based diffusion models MDT and DiT, respectively.
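To make the TMSA idea concrete, here is a schematic NumPy sketch, not the official implementation: all dimensions and weight names below are invented for illustration. The key point is that the timestep embedding gets its own learned projections, which are added to the spatial token projections before ordinary self-attention, so the attention pattern itself changes with the diffusion timestep:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 16, 32                        # toy sizes, not the paper's

x_s = rng.standard_normal((n_tokens, d))    # spatial (image patch) tokens
x_t = rng.standard_normal((d,))             # timestep embedding for this block

# Separate projections for the spatial and temporal components
# (random here; in DiffiT these are trained parameters).
W_qs, W_qt = rng.standard_normal((2, d, d))
W_ks, W_kt = rng.standard_normal((2, d, d))
W_vs, W_vt = rng.standard_normal((2, d, d))

# Time-dependent queries/keys/values: the time embedding shifts each projection.
q = x_s @ W_qs + x_t @ W_qt
k = x_s @ W_ks + x_t @ W_kt
v = x_s @ W_vs + x_t @ W_vt

# Standard scaled dot-product attention on the time-shifted projections.
logits = q @ k.T / np.sqrt(d)
logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
attn = np.exp(logits)
attn /= attn.sum(axis=-1, keepdims=True)       # softmax over keys
out = attn @ v                                 # (n_tokens, d)
```

See Section 3 of the paper for the exact formulation, multi-head treatment, and how the time embedding is produced.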


## Models
### ImageNet-256
| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|-------|---------|------------|---------|-----------------|----------|
| **DiffiT** | ImageNet | 256×256 | **1.73** | **276.49** | [model](https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_256.safetensors) |
### ImageNet-512
| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|-------|---------|------------|---------|-----------------|----------|
| **DiffiT** | ImageNet | 512×512 | **2.67** | **252.12** | [model](https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_512.safetensors) |
## Usage
Please refer to the official [GitHub repository](https://github.com/NVlabs/DiffiT) for full setup instructions, training code, and evaluation scripts.
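The download links in the tables above follow the Hub's standard `resolve/main` URL pattern. The helper below is a hypothetical convenience, not part of the DiffiT codebase; with `huggingface_hub` installed, `hf_hub_download("nvidia/DiffiT", "diffit_256.safetensors")` fetches the same file with local caching.

```python
def diffit_checkpoint_url(filename: str, repo_id: str = "nvidia/DiffiT") -> str:
    """Build the direct-download URL for a checkpoint on the Hugging Face Hub."""
    return f"https://huggingface.co/{repo_id}/resolve/main/{filename}"

url_256 = diffit_checkpoint_url("diffit_256.safetensors")
url_512 = diffit_checkpoint_url("diffit_512.safetensors")
```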
### Sampling Images
Image sampling is performed using `sample.py` from the [DiffiT repository](https://github.com/NVlabs/DiffiT). To reproduce the reported numbers, use the commands below.
**ImageNet-256:**
```bash
python sample.py \
    --log_dir $LOG_DIR \
    --cfg_scale 4.4 \
    --model_path $MODEL \
    --image_size 256 \
    --model Diffit \
    --num_sampling_steps 250 \
    --num_samples 50000 \
    --cfg_cond True
```
**ImageNet-512:**
```bash
python sample.py \
    --log_dir $LOG_DIR \
    --cfg_scale 1.49 \
    --model_path $MODEL \
    --image_size 512 \
    --model Diffit \
    --num_sampling_steps 250 \
    --num_samples 50000 \
    --cfg_cond True
```
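The `--cfg_scale` flag sets the classifier-free guidance strength (4.4 at 256×256, 1.49 at 512×512). The sketch below shows the standard CFG combination in its common form; it is a general illustration, not code taken from the DiffiT repository, and the exact convention used by `sample.py` may differ:

```python
import numpy as np

def cfg_combine(eps_uncond: np.ndarray, eps_cond: np.ndarray, scale: float) -> np.ndarray:
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the class-conditional one. In this convention,
    scale=1 recovers the purely conditional prediction, and larger scales
    trade diversity for sample fidelity."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

This combined prediction replaces the raw conditional output at every one of the 250 sampling steps.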
### Evaluation
Once images have been sampled, you can compute FID and other metrics using the provided `eval_run.sh` script in the repository. The evaluation pipeline follows the protocol from [openai/guided-diffusion/evaluations](https://github.com/openai/guided-diffusion/tree/main/evaluations).
```bash
bash eval_run.sh
```
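FID, as computed by that evaluation suite, compares Gaussian statistics (mean and covariance) of Inception features extracted from generated versus reference images. Below is a compact NumPy/SciPy rendering of the formula for illustration only; use the `eval_run.sh` pipeline above to reproduce the reported numbers:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1: np.ndarray, sigma1: np.ndarray,
                     mu2: np.ndarray, sigma2: np.ndarray) -> float:
    """FID between two Gaussians:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * (S1 @ S2)^(1/2))."""
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # discard tiny imaginary parts from sqrtm
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical feature distributions yield a distance of zero; the reported FID-50K uses 50,000 generated samples against the ImageNet reference statistics.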
## Citation
```bibtex
@inproceedings{hatamizadeh2025diffit,
  title={DiffiT: Diffusion vision transformers for image generation},
  author={Hatamizadeh, Ali and Song, Jiaming and Liu, Guilin and Kautz, Jan and Vahdat, Arash},
  booktitle={European Conference on Computer Vision},
  pages={37--55},
  year={2025},
  organization={Springer}
}
```
## License
Copyright © 2024, NVIDIA Corporation. All rights reserved.
The code is released under the [NVIDIA Source Code License-NC](https://github.com/NVlabs/DiffiT/blob/main/LICENSE). The pretrained models are shared under [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.