---
license: other
license_name: nsclv1
license_link: LICENSE
tags:
- image-generation
- diffusion
- vision-transformer
- class-conditional
datasets:
- imagenet-1k
---

# DiffiT: Diffusion Vision Transformers for Image Generation

[arXiv](https://arxiv.org/abs/2312.02139) · [GitHub](https://github.com/NVlabs/DiffiT)

This repository hosts the pretrained model weights for [**DiffiT**](https://arxiv.org/abs/2312.02139) (ECCV 2024), a diffusion model built on Vision Transformers that achieves state-of-the-art image generation quality with improved parameter efficiency.

## Overview

**DiffiT** (Diffusion Vision Transformers) combines the expressive power of diffusion models with Vision Transformers (ViTs), introducing **Time-dependent Multihead Self-Attention (TMSA)** for fine-grained control over the denoising process at each diffusion timestep. DiffiT achieves state-of-the-art class-conditional generation on ImageNet at multiple resolutions, notably attaining an **FID score of 1.73** on ImageNet-256 while using **19.85% and 16.88% fewer parameters** than comparable Transformer-based diffusion models, MDT and DiT respectively.
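
To make the TMSA idea concrete, here is a minimal PyTorch sketch (not the official implementation; layer names and shapes are illustrative). The key point is that the query/key/value projections receive a contribution from the time embedding, so attention weights adapt to the diffusion timestep:

```python
import torch
import torch.nn as nn


class TMSA(nn.Module):
    """Illustrative sketch of Time-dependent Multihead Self-Attention.

    Spatial tokens and a time embedding jointly parameterize the q/k/v
    projections, so the attention pattern can change per diffusion step.
    This is a simplified sketch, not the official DiffiT code.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Separate linear maps for spatial features and the time embedding;
        # their sum forms the final q, k, v (the core TMSA idea).
        self.qkv_x = nn.Linear(dim, 3 * dim, bias=False)
        self.qkv_t = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) spatial tokens; t_emb: (B, dim) time embedding
        B, N, C = x.shape
        qkv = self.qkv_x(x) + self.qkv_t(t_emb).unsqueeze(1)  # broadcast over tokens
        qkv = qkv.reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv.unbind(0)                               # each (B, H, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

See the paper and repository for the full formulation, including the windowed variant used at higher resolutions.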
|
| |  |
| |
|
| |  |
| |
|

## Models

### ImageNet-256

| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|-------|---------|------------|---------|-----------------|----------|
| **DiffiT** | ImageNet | 256×256 | **1.73** | **276.49** | [model](https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_256.safetensors) |

### ImageNet-512

| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|-------|---------|------------|---------|-----------------|----------|
| **DiffiT** | ImageNet | 512×512 | **2.67** | **252.12** | [model](https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_512.safetensors) |

## Usage

Please refer to the official [GitHub repository](https://github.com/NVlabs/DiffiT) for full setup instructions, training code, and evaluation scripts.

### Sampling Images

Image sampling is performed using `sample.py` from the [DiffiT repository](https://github.com/NVlabs/DiffiT). To reproduce the reported numbers, use the commands below.

**ImageNet-256:**

```bash
python sample.py \
  --log_dir $LOG_DIR \
  --cfg_scale 4.4 \
  --model_path $MODEL \
  --image_size 256 \
  --model Diffit \
  --num_sampling_steps 250 \
  --num_samples 50000 \
  --cfg_cond True
```

**ImageNet-512:**

```bash
python sample.py \
  --log_dir $LOG_DIR \
  --cfg_scale 1.49 \
  --model_path $MODEL \
  --image_size 512 \
  --model Diffit \
  --num_sampling_steps 250 \
  --num_samples 50000 \
  --cfg_cond True
```

### Evaluation

Once images have been sampled, you can compute FID and other metrics using the provided `eval_run.sh` script in the repository. The evaluation pipeline follows the protocol from [openai/guided-diffusion/evaluations](https://github.com/openai/guided-diffusion/tree/main/evaluations).

```bash
bash eval_run.sh
```
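
For intuition, FID fits Gaussians to Inception-v3 features of real and generated images and reports the Fréchet distance between them. The NumPy sketch below shows just that final distance computation (not the full pipeline, which also extracts the Inception features and accumulates the statistics):

```python
import numpy as np
from scipy import linalg


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2) --
    the quantity FID reports over Inception-v3 feature statistics."""
    diff = mu1 - mu2
    # Matrix square root of the covariance product; numerical noise can
    # introduce a tiny imaginary component, so keep the real part.
    covmean = linalg.sqrtm(sigma1 @ sigma2).real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))


# Identical statistics give distance ~0; shifting the mean by d adds ||d||^2.
mu, sigma = np.zeros(4), np.eye(4)
print(frechet_distance(mu, sigma, mu, sigma))        # ~0.0
print(frechet_distance(mu, sigma, mu + 1.0, sigma))  # ~4.0
```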

## Citation

```bibtex
@inproceedings{hatamizadeh2025diffit,
  title={DiffiT: Diffusion Vision Transformers for Image Generation},
  author={Hatamizadeh, Ali and Song, Jiaming and Liu, Guilin and Kautz, Jan and Vahdat, Arash},
  booktitle={European Conference on Computer Vision},
  pages={37--55},
  year={2025},
  organization={Springer}
}
```

## License

Copyright © 2026, NVIDIA Corporation. All rights reserved.

The code is released under the [NVIDIA Source Code License-NC](https://github.com/NVlabs/DiffiT/blob/main/LICENSE). The pretrained models are shared under [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/): if you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.