---
library_name: diffusers
pipeline_tag: unconditional-image-generation
tags:
  - diffusers
  - sit
  - image-generation
  - class-conditional
  - imagenet
license: mit
inference: true
---

# SiT-diffusers

Diffusers-ready checkpoints for **Scalable Interpolant Transformers (SiT)**, converted for local/offline use.

This root folder is a model collection that contains:

- `SiT-S-2-256-diffusers`
- `SiT-B-2-256-diffusers`
- `SiT-L-2-256-diffusers`
- `SiT-XL-2-256-diffusers`
- `SiT-XL-2-512-diffusers`

Each subfolder is a self-contained Diffusers model repo with the following files (a quick loading check is sketched after the list):

- `pipeline.py`
- `transformer/transformer_sit.py`
- `scheduler/scheduling_flow_match_sit.py`
- `transformer/diffusion_pytorch_model.safetensors`
- `vae/diffusion_pytorch_model.safetensors`
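
As a quick sanity check, the listed components can be inspected after loading. A minimal sketch, assuming the custom pipeline exposes its subfolder components under the standard Diffusers attribute names (`transformer`, `scheduler`, `vae`):

```python
from diffusers import DiffusionPipeline

# trust_remote_code is required: pipeline.py, transformer_sit.py, and
# scheduling_flow_match_sit.py are custom code shipped with each subfolder.
pipe = DiffusionPipeline.from_pretrained(
    "./SiT-B-2-256-diffusers",  # any subfolder from the table below works
    trust_remote_code=True,
)

# Assumption: components are registered under the usual attribute names.
print(type(pipe.transformer).__name__)  # custom SiT transformer
print(type(pipe.scheduler).__name__)    # custom flow-matching scheduler
print(type(pipe.vae).__name__)          # VAE from vae/diffusion_pytorch_model.safetensors
```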

## Model Paths

Use paths relative to this root README:

| Model | Resolution | Local path |
|---|---:|---|
| SiT-S/2 | 256x256 | `./SiT-S-2-256-diffusers` |
| SiT-B/2 | 256x256 | `./SiT-B-2-256-diffusers` |
| SiT-L/2 | 256x256 | `./SiT-L-2-256-diffusers` |
| SiT-XL/2 | 256x256 | `./SiT-XL-2-256-diffusers` |
| SiT-XL/2 | 512x512 | `./SiT-XL-2-512-diffusers` |
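
If you switch between variants programmatically, a small mapping keeps these paths in one place. A minimal sketch (the `SIT_PATHS` name is ours, not part of the checkpoints):

```python
from pathlib import Path

# Local subfolders, exactly as listed in the table above.
SIT_PATHS = {
    "SiT-S/2 (256)":  "./SiT-S-2-256-diffusers",
    "SiT-B/2 (256)":  "./SiT-B-2-256-diffusers",
    "SiT-L/2 (256)":  "./SiT-L-2-256-diffusers",
    "SiT-XL/2 (256)": "./SiT-XL-2-256-diffusers",
    "SiT-XL/2 (512)": "./SiT-XL-2-512-diffusers",
}

# Verify each checkpoint folder exists before trying to load it.
for name, path in SIT_PATHS.items():
    print(f"{name}: {path} [{'ok' if Path(path).is_dir() else 'missing'}]")
```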

## Inference Demo (Diffusers)

### 1) Load a local subfolder checkpoint

```python
import torch
from diffusers import DiffusionPipeline

model_path = "./SiT-XL-2-512-diffusers"  # change to any path in the table above
device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = DiffusionPipeline.from_pretrained(
    model_path,
    trust_remote_code=True,
).to(device)

generator = torch.Generator(device=device).manual_seed(0)

# ImageNet class example: 207 = golden retriever
result = pipe(
    class_labels=207,
    height=512,
    width=512,
    num_inference_steps=250,  # official SiT comparisons commonly use 250 steps
    guidance_scale=4.0,  # demo value; the official FID numbers below use cfg=1.5
    generator=generator,
)

image = result.images[0]
image.save("sit_xl_512_demo.png")
```
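
On memory-constrained GPUs, the same checkpoint can be loaded in half precision. `torch_dtype` is standard Diffusers API; whether fp16 exactly matches fp32 sample quality for these converted weights is not verified here:

```python
import torch
from diffusers import DiffusionPipeline

# fp16 roughly halves GPU memory for the transformer and VAE weights.
pipe = DiffusionPipeline.from_pretrained(
    "./SiT-XL-2-512-diffusers",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")
```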

### 2) Quick variant switch (256 models)

```python
model_path = "./SiT-S-2-256-diffusers"
# model_path = "./SiT-B-2-256-diffusers"
# model_path = "./SiT-L-2-256-diffusers"
# model_path = "./SiT-XL-2-256-diffusers"

pipe = DiffusionPipeline.from_pretrained(model_path, trust_remote_code=True).to(device)
image = pipe(
    class_labels=207,
    height=256,
    width=256,
    num_inference_steps=250,
    guidance_scale=4.0,
    generator=generator,
).images[0]
image.save("sit_256_demo.png")
```
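
To eyeball sample diversity, sweep a few seeds with the same call signature as above (assumes `pipe` and `device` from the previous snippets; file names are illustrative):

```python
# Same class, three seeds, one image each.
for seed in (0, 1, 2):
    g = torch.Generator(device=device).manual_seed(seed)
    image = pipe(
        class_labels=207,
        height=256,
        width=256,
        num_inference_steps=250,
        guidance_scale=4.0,
        generator=g,
    ).images[0]
    image.save(f"sit_256_class207_seed{seed}.png")
```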

## FID Reference (from Official SiT Results)

The table below summarizes widely cited SiT numbers from the official project materials for class-conditional ImageNet generation.

| Model / setting | Resolution | FID-50K (lower is better) |
|---|---:|---:|
| SiT-S (400K steps) | 256x256 | 57.6 |
| SiT-B (400K steps) | 256x256 | 33.5 |
| SiT-L (400K steps) | 256x256 | 17.2 |
| SiT-XL (400K steps) | 256x256 | 8.6 |
| SiT-XL (cfg=1.5, ODE) | 256x256 | 2.15 |
| SiT-XL (cfg=1.5, SDE, `w(t)=sigma_t`) | 256x256 | 2.06 |
| SiT-XL (sample showcase) | 512x512 | n/a (not in the official benchmark table) |

> Note: FID depends on training recipe, sampler choice (ODE/SDE), guidance scale, and evaluation protocol. Treat this table as a reference to official SiT reports, not as guaranteed reproducibility for every conversion/export.
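
For context, FID is the Fréchet distance between Gaussian fits to Inception features of real (r) and generated (g) images; FID-50K estimates it from 50,000 generated samples:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
```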

## Source and Paper

- Official SiT code: [willisma/SiT](https://github.com/willisma/SiT)
- Project page: [scalable-interpolant.github.io](https://scalable-interpolant.github.io/)
- Paper (arXiv): [SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers](https://arxiv.org/abs/2401.08740)

## Citation

If you use SiT in your work, please cite:

```bibtex
@inproceedings{ma2024sit,
  title={SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers},
  author={Ma, Nanye and Goldstein, Mark and Albergo, Michael S. and Boffi, Nicholas M. and Vanden-Eijnden, Eric and Xie, Saining},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
```