---
license: other
license_name: nsclv1
license_link: LICENSE
tags:
- image-generation
- diffusion
- vision-transformer
- class-conditional
datasets:
- imagenet-1k
---

# DiffiT: Diffusion Vision Transformers for Image Generation

[![Paper](https://img.shields.io/badge/arXiv-2312.02139-b31b1b.svg)](https://arxiv.org/abs/2312.02139) [![GitHub](https://img.shields.io/github/stars/NVlabs/DiffiT.svg?style=social)](https://github.com/NVlabs/DiffiT)

This repository hosts the pretrained model weights for [**DiffiT**](https://arxiv.org/abs/2312.02139) (ECCV 2024), a diffusion model built on Vision Transformers that achieves state-of-the-art image generation quality with improved parameter efficiency.

## Overview

**DiffiT** (Diffusion Vision Transformers) is a generative model that combines the expressive power of diffusion models with Vision Transformers (ViTs). It introduces **Time-dependent Multihead Self Attention (TMSA)**, which conditions the attention mechanism on the diffusion timestep to give fine-grained control over each stage of the denoising process. DiffiT achieves state-of-the-art performance on class-conditional ImageNet generation at multiple resolutions, notably attaining an **FID score of 1.73** on ImageNet-256 while using **19.85% and 16.88% fewer parameters** than MDT and DiT, respectively, two comparable Transformer-based diffusion models.
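To make the TMSA idea concrete, here is a minimal single-head NumPy sketch of time-dependent self-attention: queries, keys, and values are linear in *both* the spatial tokens and a shared time-embedding token, so the attention pattern itself changes with the diffusion timestep. This is an illustrative simplification (function and weight names are ours; the paper's multi-head form and relative position bias are omitted), not the repository's implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def tmsa(x_s, x_t, W_qs, W_qt, W_ks, W_kt, W_vs, W_vt):
    """Single-head sketch of Time-dependent Multihead Self Attention.

    x_s: (n, d) spatial tokens; x_t: (d,) time-embedding token shared by
    all spatial tokens. Because q/k/v each depend on x_t, the attention
    weights vary with the diffusion timestep.
    """
    q = x_s @ W_qs + x_t @ W_qt   # time-dependent queries
    k = x_s @ W_ks + x_t @ W_kt   # time-dependent keys
    v = x_s @ W_vs + x_t @ W_vt   # time-dependent values
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))  # (n, n) attention weights
    return attn @ v                       # (n, d) attended tokens
```

With fixed spatial tokens and weights, feeding in a different time token changes the output, which is exactly the timestep-adaptive behavior TMSA is designed to provide.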
![imagenet](https://cdn-uploads.huggingface.co/production/uploads/64414b62603214724ebd2636/5Pbe6fTZAV5eAwH6eokdh.png)

![latent_diffit](https://cdn-uploads.huggingface.co/production/uploads/64414b62603214724ebd2636/2hPFK3g2uHfDR1bzhYJyJ.png)

## Models

### ImageNet-256

| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|-------|---------|------------|---------|-----------------|----------|
| **DiffiT** | ImageNet | 256×256 | **1.73** | **276.49** | [model](https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_256.safetensors) |

### ImageNet-512

| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|-------|---------|------------|---------|-----------------|----------|
| **DiffiT** | ImageNet | 512×512 | **2.67** | **252.12** | [model](https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_512.safetensors) |

## Usage

Please refer to the official [GitHub repository](https://github.com/NVlabs/DiffiT) for full setup instructions, training code, and evaluation scripts.

### Sampling Images

Image sampling is performed with `sample.py` from the [DiffiT repository](https://github.com/NVlabs/DiffiT). To reproduce the reported numbers, use the commands below.

**ImageNet-256:**

```bash
python sample.py \
  --log_dir $LOG_DIR \
  --cfg_scale 4.4 \
  --model_path $MODEL \
  --image_size 256 \
  --model Diffit \
  --num_sampling_steps 250 \
  --num_samples 50000 \
  --cfg_cond True
```

**ImageNet-512:**

```bash
python sample.py \
  --log_dir $LOG_DIR \
  --cfg_scale 1.49 \
  --model_path $MODEL \
  --image_size 512 \
  --model Diffit \
  --num_sampling_steps 250 \
  --num_samples 50000 \
  --cfg_cond True
```

### Evaluation

Once images have been sampled, you can compute FID and other metrics with the provided `eval_run.sh` script in the repository. The evaluation pipeline follows the protocol from [openai/guided-diffusion/evaluations](https://github.com/openai/guided-diffusion/tree/main/evaluations).
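The `--cfg_scale` flags in the sampling commands above (4.4 at 256×256, 1.49 at 512×512) set the classifier-free guidance strength. The standard formulation extrapolates from the unconditional noise prediction toward the class-conditional one; the sketch below shows that combination in NumPy (the function name is ours, and the repository's exact implementation may differ):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, cfg_scale):
    """Standard classifier-free guidance combination.

    cfg_scale = 1.0 recovers the purely conditional prediction;
    larger values push samples further toward the class condition,
    typically trading diversity for fidelity.
    """
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)

# Toy example: with cfg_scale=1.0 the guided output equals eps_cond.
e_u = np.zeros(4)
e_c = np.ones(4)
print(cfg_combine(e_u, e_c, 1.0))  # -> [1. 1. 1. 1.]
print(cfg_combine(e_u, e_c, 4.4))  # -> [4.4 4.4 4.4 4.4]
```

This is why the two resolutions ship with different `--cfg_scale` values: the guidance strength that minimizes FID is tuned per model and resolution.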
```bash
bash eval_run.sh
```

## Citation

```bibtex
@inproceedings{hatamizadeh2025diffit,
  title={Diffit: Diffusion vision transformers for image generation},
  author={Hatamizadeh, Ali and Song, Jiaming and Liu, Guilin and Kautz, Jan and Vahdat, Arash},
  booktitle={European Conference on Computer Vision},
  pages={37--55},
  year={2025},
  organization={Springer}
}
```

## License

Copyright © 2026, NVIDIA Corporation. All rights reserved.

The code is released under the [NVIDIA Source Code License-NC](https://github.com/NVlabs/DiffiT/blob/main/LICENSE). The pretrained models are shared under [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.