---
license: other
license_name: nsclv1
license_link: LICENSE
tags:
  - image-generation
  - diffusion
  - vision-transformer
  - class-conditional
datasets:
  - imagenet-1k
---

# DiffiT: Diffusion Vision Transformers for Image Generation

[![Paper](https://img.shields.io/badge/arXiv-2312.02139-b31b1b.svg)](https://arxiv.org/abs/2312.02139)
[![GitHub](https://img.shields.io/github/stars/NVlabs/DiffiT.svg?style=social)](https://github.com/NVlabs/DiffiT)

This repository hosts the pretrained model weights for [**DiffiT**](https://arxiv.org/abs/2312.02139) (ECCV 2024), a diffusion model built on Vision Transformers that achieves state-of-the-art image generation quality with improved parameter efficiency.

## Overview

**DiffiT** (Diffusion Vision Transformers) is a generative model that combines diffusion models with Vision Transformers (ViTs). It introduces **Time-dependent Multihead Self Attention (TMSA)**, which gives fine-grained control over the denoising process at each diffusion timestep. On class-conditional ImageNet generation, DiffiT reports state-of-the-art results at multiple resolutions, notably an **FID score of 1.73** on ImageNet-256, while using **19.85% and 16.88% fewer parameters** than the comparable Transformer-based diffusion models MDT and DiT, respectively.

![imagenet](https://cdn-uploads.huggingface.co/production/uploads/64414b62603214724ebd2636/5Pbe6fTZAV5eAwH6eokdh.png)

![latent_diffit](https://cdn-uploads.huggingface.co/production/uploads/64414b62603214724ebd2636/2hPFK3g2uHfDR1bzhYJyJ.png)

## Models

### ImageNet-256

| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|-------|---------|------------|---------|-----------------|----------|
| **DiffiT** | ImageNet | 256×256 | **1.73** | **276.49** | [model](https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_256.safetensors) |

### ImageNet-512

| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|-------|---------|------------|---------|-----------------|----------|
| **DiffiT** | ImageNet | 512×512 | **2.67** | **252.12** | [model](https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_512.safetensors) |
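
The checkpoints can also be fetched from the command line. The snippet below is a minimal sketch, not part of the official repository: the function names `diffit_weights_url` and `download_diffit_weights` are hypothetical helpers, and the URLs are simply the two download links from the tables above. It assumes `curl` is installed.

```bash
#!/usr/bin/env bash

# Print the download URL for a given resolution (256 or 512).
# The URL pattern is taken from the download links in the tables above.
diffit_weights_url() {
    local res="$1"
    case "$res" in
        256|512)
            printf 'https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_%s.safetensors\n' "$res"
            ;;
        *)
            echo "unsupported resolution: $res" >&2
            return 1
            ;;
    esac
}

# Download a checkpoint into a directory (defaults to the current one).
# Nothing is downloaded until this function is called explicitly.
download_diffit_weights() {
    local res="$1" dest_dir="${2:-.}"
    local url
    url="$(diffit_weights_url "$res")" || return 1
    mkdir -p "$dest_dir"
    # -L follows redirects, --fail turns HTTP errors into a non-zero exit code.
    curl -L --fail -o "$dest_dir/diffit_${res}.safetensors" "$url"
}
```

For example, `download_diffit_weights 256 ./checkpoints` would save the ImageNet-256 weights as `./checkpoints/diffit_256.safetensors`.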

## Usage

Please refer to the official [GitHub repository](https://github.com/NVlabs/DiffiT) for full setup instructions, training code, and evaluation scripts.

### Sampling Images

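Image sampling is performed using `sample.py` from the [DiffiT repository](https://github.com/NVlabs/DiffiT). To reproduce the reported numbers, use the commands below, with `$LOG_DIR` set to the output directory and `$MODEL` set to the path of the downloaded checkpoint.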

**ImageNet-256:**

```bash
python sample.py \
    --log_dir $LOG_DIR \
    --cfg_scale 4.4 \
    --model_path $MODEL \
    --image_size 256 \
    --model Diffit \
    --num_sampling_steps 250 \
    --num_samples 50000 \
    --cfg_cond True
```

**ImageNet-512:**

```bash
python sample.py \
    --log_dir $LOG_DIR \
    --cfg_scale 1.49 \
    --model_path $MODEL \
    --image_size 512 \
    --model Diffit \
    --num_sampling_steps 250 \
    --num_samples 50000 \
    --cfg_cond True
```
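
The two commands above differ only in `--image_size` and `--cfg_scale`, so they can be folded into one helper. This is a sketch under assumptions: `diffit_cfg_scale`, `run_diffit_sampling`, and the `DRY_RUN` variable are hypothetical names introduced here, not part of the DiffiT repository. The flags and values are copied from the commands above, and the function is meant to be called from the repository root where `sample.py` lives.

```bash
#!/usr/bin/env bash

# Guidance scale used in the commands above for each resolution.
diffit_cfg_scale() {
    case "$1" in
        256) echo "4.4" ;;
        512) echo "1.49" ;;
        *)
            echo "no reported cfg_scale for resolution: $1" >&2
            return 1
            ;;
    esac
}

# Usage: run_diffit_sampling <256|512> <model_path> <log_dir> [num_samples]
# Set DRY_RUN=1 to print the command instead of executing it.
run_diffit_sampling() {
    local res="$1" model_path="$2" log_dir="$3" num_samples="${4:-50000}"
    local scale
    scale="$(diffit_cfg_scale "$res")" || return 1
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set and non-empty.
    ${DRY_RUN:+echo} python sample.py \
        --log_dir "$log_dir" \
        --cfg_scale "$scale" \
        --model_path "$model_path" \
        --image_size "$res" \
        --model Diffit \
        --num_sampling_steps 250 \
        --num_samples "$num_samples" \
        --cfg_cond True
}
```

Running `DRY_RUN=1 run_diffit_sampling 512 "$MODEL" "$LOG_DIR"` prints the full command for inspection. Passing a smaller fourth argument gives a quick smoke test, though the reported metrics are computed on 50,000 samples.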

### Evaluation

Once images have been sampled, you can compute FID and other metrics using the provided `eval_run.sh` script in the repository. The evaluation pipeline follows the protocol from [openai/guided-diffusion/evaluations](https://github.com/openai/guided-diffusion/tree/main/evaluations).

```bash
bash eval_run.sh
```
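
For reference, the FID reported in the tables above is the standard Fréchet Inception Distance. The formula below is the usual textbook definition, not taken from `eval_run.sh`. The symbols are introduced here for illustration: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of Inception features over the reference and generated images, respectively.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
$$

Lower is better. "FID-50K" means the generated-image statistics are estimated from 50,000 samples, which matches `--num_samples 50000` in the sampling commands.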

## Citation

```bibtex
@inproceedings{hatamizadeh2025diffit,
  title={Diffit: Diffusion vision transformers for image generation},
  author={Hatamizadeh, Ali and Song, Jiaming and Liu, Guilin and Kautz, Jan and Vahdat, Arash},
  booktitle={European Conference on Computer Vision},
  pages={37--55},
  year={2025},
  organization={Springer}
}
```

## License

Copyright © 2026, NVIDIA Corporation. All rights reserved.

The code is released under the [NVIDIA Source Code License-NC](https://github.com/NVlabs/DiffiT/blob/main/LICENSE). The pretrained models are shared under [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.