---
library_name: diffusers
pipeline_tag: unconditional-image-generation
tags:
- diffusers
- sit
- image-generation
- class-conditional
- imagenet
license: mit
inference: true
---
# SiT-diffusers
Diffusers-ready checkpoints for **Scalable Interpolant Transformers (SiT)**, converted for local/offline use.
This root folder is a collection containing the following model variants:
- `SiT-S-2-256-diffusers`
- `SiT-B-2-256-diffusers`
- `SiT-L-2-256-diffusers`
- `SiT-XL-2-256-diffusers`
- `SiT-XL-2-512-diffusers`
Each subfolder is a self-contained Diffusers model repo with:
- `pipeline.py`
- `transformer/transformer_sit.py`
- `scheduler/scheduling_flow_match_sit.py`
- `transformer/diffusion_pytorch_model.safetensors`
- `vae/diffusion_pytorch_model.safetensors`
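Before loading a checkpoint offline, you can sanity-check that a subfolder actually contains the files listed above. The helper below is our own convenience sketch (the name `check_sit_layout` is not part of the repo); it only mirrors the file list from this README:

```python
from pathlib import Path

# Files every SiT subfolder is expected to contain (mirrors the list above)
EXPECTED_FILES = [
    "pipeline.py",
    "transformer/transformer_sit.py",
    "scheduler/scheduling_flow_match_sit.py",
    "transformer/diffusion_pytorch_model.safetensors",
    "vae/diffusion_pytorch_model.safetensors",
]

def check_sit_layout(model_dir):
    """Return the list of expected files missing from model_dir."""
    base = Path(model_dir)
    return [f for f in EXPECTED_FILES if not (base / f).is_file()]

missing = check_sit_layout("./SiT-XL-2-512-diffusers")
if missing:
    print("Incomplete checkpoint, missing:", missing)
```

If the returned list is non-empty, `from_pretrained` will most likely fail for that subfolder.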
## Model Paths
Use paths relative to this root README:
| Model | Resolution | Local path |
|---|---:|---|
| SiT-S/2 | 256x256 | `./SiT-S-2-256-diffusers` |
| SiT-B/2 | 256x256 | `./SiT-B-2-256-diffusers` |
| SiT-L/2 | 256x256 | `./SiT-L-2-256-diffusers` |
| SiT-XL/2 | 256x256 | `./SiT-XL-2-256-diffusers` |
| SiT-XL/2 | 512x512 | `./SiT-XL-2-512-diffusers` |
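For scripting, the table above can be mirrored as a small lookup so each path stays paired with its training resolution. The `SIT_VARIANTS` dict and its key names are our own convention, not part of the checkpoints:

```python
# Variant name -> (local path, image resolution), mirroring the table above
SIT_VARIANTS = {
    "SiT-S/2-256":  ("./SiT-S-2-256-diffusers", 256),
    "SiT-B/2-256":  ("./SiT-B-2-256-diffusers", 256),
    "SiT-L/2-256":  ("./SiT-L-2-256-diffusers", 256),
    "SiT-XL/2-256": ("./SiT-XL-2-256-diffusers", 256),
    "SiT-XL/2-512": ("./SiT-XL-2-512-diffusers", 512),
}

model_path, resolution = SIT_VARIANTS["SiT-XL/2-512"]
```

Passing `height=resolution, width=resolution` to the pipeline then keeps the call consistent with the checkpoint's resolution.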
## Inference Demo (Diffusers)
### 1) Load a local subfolder checkpoint
```python
import torch
from diffusers import DiffusionPipeline

model_path = "./SiT-XL-2-512-diffusers"  # change to any path in the table above
device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = DiffusionPipeline.from_pretrained(
    model_path,
    trust_remote_code=True,
).to(device)

generator = torch.Generator(device=device).manual_seed(0)

# ImageNet class example: 207 = golden retriever
result = pipe(
    class_labels=207,
    height=512,
    width=512,
    num_inference_steps=250,  # official SiT comparisons commonly use 250 steps
    guidance_scale=4.0,
    generator=generator,
)
image = result.images[0]
image.save("sit_xl_512_demo.png")
```
### 2) Quick variant switch (256 models)
```python
# Reuses `device` and `generator` from the previous snippet
model_path = "./SiT-S-2-256-diffusers"
# model_path = "./SiT-B-2-256-diffusers"
# model_path = "./SiT-L-2-256-diffusers"
# model_path = "./SiT-XL-2-256-diffusers"

pipe = DiffusionPipeline.from_pretrained(model_path, trust_remote_code=True).to(device)
image = pipe(
    class_labels=207,
    height=256,
    width=256,
    num_inference_steps=250,
    guidance_scale=4.0,
    generator=generator,
).images[0]
image.save("sit_256_demo.png")
```
## FID Reference (from Official SiT Results)
The table below summarizes widely cited SiT numbers from the official project materials for class-conditional ImageNet generation.
| Model / setting | Resolution | FID-50K (lower is better) |
|---|---:|---:|
| SiT-S (400K steps) | 256x256 | 57.6 |
| SiT-B (400K steps) | 256x256 | 33.5 |
| SiT-L (400K steps) | 256x256 | 17.2 |
| SiT-XL (400K steps) | 256x256 | 8.6 |
| SiT-XL (cfg=1.5, ODE) | 256x256 | 2.15 |
| SiT-XL (cfg=1.5, SDE, `w(t)=sigma_t`) | 256x256 | 2.06 |
| SiT-XL (sample showcase) | 512x512 | Not reported in the same benchmark table |
> Note: FID depends on training recipe, sampler choice (ODE/SDE), guidance scale, and evaluation protocol. Treat this table as a reference to official SiT reports, not as guaranteed reproducibility for every conversion/export.
## Source and Paper
- Official SiT code: [willisma/SiT](https://github.com/willisma/SiT)
- Project page: [scalable-interpolant.github.io](https://scalable-interpolant.github.io/)
- Paper (arXiv): [SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers](https://arxiv.org/abs/2401.08740)
## Citation
If you use SiT in your work, please cite:
```bibtex
@inproceedings{ma2024sit,
title={SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers},
author={Ma, Nanye and Goldstein, Mark and Albergo, Michael S. and Boffi, Nicholas M. and Vanden-Eijnden, Eric and Xie, Saining},
booktitle={European Conference on Computer Vision (ECCV)},
year={2024},
note={Accepted to ECCV 2024}
}
```