---
library_name: diffusers
pipeline_tag: unconditional-image-generation
tags:
- diffusers
- sit
- image-generation
- class-conditional
- imagenet
license: mit
inference: true
---

# SiT-diffusers

Diffusers-ready checkpoints for **Scalable Interpolant Transformers (SiT)**, converted for local/offline use.

This repository is a collection containing the following models:

- `SiT-S-2-256-diffusers`
- `SiT-B-2-256-diffusers`
- `SiT-L-2-256-diffusers`
- `SiT-XL-2-256-diffusers`
- `SiT-XL-2-512-diffusers`

Each subfolder is a self-contained Diffusers model repo with:

- `pipeline.py`
- `transformer/transformer_sit.py`
- `scheduler/scheduling_flow_match_sit.py`
- `transformer/diffusion_pytorch_model.safetensors`
- `vae/diffusion_pytorch_model.safetensors`

## Model Paths

Use paths relative to this root README:

| Model | Resolution | Local path |
|---|---:|---|
| SiT-S/2 | 256x256 | `./SiT-S-2-256-diffusers` |
| SiT-B/2 | 256x256 | `./SiT-B-2-256-diffusers` |
| SiT-L/2 | 256x256 | `./SiT-L-2-256-diffusers` |
| SiT-XL/2 | 256x256 | `./SiT-XL-2-256-diffusers` |
| SiT-XL/2 | 512x512 | `./SiT-XL-2-512-diffusers` |
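After downloading, a short script can sanity-check that each local path actually contains the expected files. This is an illustrative sketch only; the `check_repo` helper is not part of the release:

```python
from pathlib import Path

# Files every converted subfolder is expected to contain (per the layout above)
EXPECTED_FILES = [
    "pipeline.py",
    "transformer/transformer_sit.py",
    "scheduler/scheduling_flow_match_sit.py",
    "transformer/diffusion_pytorch_model.safetensors",
    "vae/diffusion_pytorch_model.safetensors",
]

MODEL_PATHS = [
    "./SiT-S-2-256-diffusers",
    "./SiT-B-2-256-diffusers",
    "./SiT-L-2-256-diffusers",
    "./SiT-XL-2-256-diffusers",
    "./SiT-XL-2-512-diffusers",
]

def check_repo(repo_dir):
    """Return the expected files that are missing from one checkpoint folder."""
    root = Path(repo_dir)
    return [f for f in EXPECTED_FILES if not (root / f).is_file()]

if __name__ == "__main__":
    for path in MODEL_PATHS:
        missing = check_repo(path)
        print(f"{path}: {'ok' if not missing else 'missing: ' + ', '.join(missing)}")
```

Run it from the root of this repo; any incomplete download is reported per subfolder.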

## Inference Demo (Diffusers)

### 1) Load a local subfolder checkpoint

```python
import torch
from diffusers import DiffusionPipeline

model_path = "./SiT-XL-2-512-diffusers"  # change to any path in the table above
device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = DiffusionPipeline.from_pretrained(
    model_path,
    trust_remote_code=True,
).to(device)

generator = torch.Generator(device=device).manual_seed(0)

# ImageNet class example: 207 = golden retriever
result = pipe(
    class_labels=207,
    height=512,
    width=512,
    num_inference_steps=250,  # official SiT comparisons commonly use 250 steps
    guidance_scale=4.0,
    generator=generator,
)

image = result.images[0]
image.save("sit_xl_512_demo.png")
```
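The seeded `generator` above is what makes the demo reproducible: identically seeded generators draw identical initial noise. This can be checked without loading any weights (the latent shape below, 4 channels at 64x64 for a 512px image with an 8x VAE downsampling factor, is only illustrative):

```python
import torch

def seeded_noise(seed: int, shape=(1, 4, 64, 64), device: str = "cpu"):
    """Draw initial latent noise from an explicitly seeded generator."""
    g = torch.Generator(device=device).manual_seed(seed)
    return torch.randn(shape, generator=g, device=device)

same_a = seeded_noise(0)
same_b = seeded_noise(0)
other = seeded_noise(1)

print(torch.equal(same_a, same_b))  # True: same seed, same noise
print(torch.equal(same_a, other))   # False: different seed
```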

### 2) Quick variant switch (256 models)

Reusing `device` and `generator` from the snippet above:

```python
model_path = "./SiT-S-2-256-diffusers"
# model_path = "./SiT-B-2-256-diffusers"
# model_path = "./SiT-L-2-256-diffusers"
# model_path = "./SiT-XL-2-256-diffusers"

pipe = DiffusionPipeline.from_pretrained(model_path, trust_remote_code=True).to(device)
image = pipe(
    class_labels=207,
    height=256,
    width=256,
    num_inference_steps=250,
    guidance_scale=4.0,
    generator=generator,
).images[0]
image.save("sit_256_demo.png")
```
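To compare variants side by side, the manual switch above can become a loop. The sketch below only builds a run plan (model path, native resolution, output filename); `build_run_plan` and the extra class ids are illustrative, and the commented lines mark where the actual pipeline calls from the demo would go:

```python
from itertools import product

VARIANTS = {  # local path -> native resolution
    "./SiT-S-2-256-diffusers": 256,
    "./SiT-B-2-256-diffusers": 256,
    "./SiT-L-2-256-diffusers": 256,
    "./SiT-XL-2-256-diffusers": 256,
    "./SiT-XL-2-512-diffusers": 512,
}
CLASS_LABELS = [207, 360, 88]  # example ImageNet-1k ids (207 = golden retriever)

def build_run_plan(variants, class_labels):
    """Cross every model variant with every class label."""
    plan = []
    for (path, res), label in product(variants.items(), class_labels):
        name = path.lstrip("./").lower()
        plan.append({
            "model_path": path,
            "height": res,
            "width": res,
            "class_label": label,
            "out_file": f"{name}_class{label}.png",
        })
    return plan

plan = build_run_plan(VARIANTS, CLASS_LABELS)
print(len(plan))  # 5 variants x 3 classes = 15 runs
for run in plan:
    # pipe = DiffusionPipeline.from_pretrained(run["model_path"], trust_remote_code=True).to(device)
    # pipe(class_labels=run["class_label"], height=run["height"], width=run["width"],
    #      num_inference_steps=250, guidance_scale=4.0).images[0].save(run["out_file"])
    pass
```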

## FID Reference (from Official SiT Results)

The table below summarizes widely cited FID numbers from the official SiT materials for class-conditional ImageNet generation.

| Model / setting | Resolution | FID-50K (lower is better) |
|---|---:|---:|
| SiT-S (400K steps) | 256x256 | 57.6 |
| SiT-B (400K steps) | 256x256 | 33.5 |
| SiT-L (400K steps) | 256x256 | 17.2 |
| SiT-XL (400K steps) | 256x256 | 8.6 |
| SiT-XL (cfg=1.5, ODE) | 256x256 | 2.15 |
| SiT-XL (cfg=1.5, SDE, `w(t)=sigma_t`) | 256x256 | 2.06 |
| SiT-XL (sample showcase) | 512x512 | not reported in the same benchmark table |

> Note: FID depends on the training recipe, sampler choice (ODE/SDE), guidance scale, and evaluation protocol. Treat this table as a pointer to the official SiT reports, not as a reproducibility guarantee for every conversion/export.

## Source and Paper

- Official SiT code: [willisma/SiT](https://github.com/willisma/SiT)
- Project page: [scalable-interpolant.github.io](https://scalable-interpolant.github.io/)
- Paper (arXiv): [SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers](https://arxiv.org/abs/2401.08740)

## Citation

If you use SiT in your work, please cite:

```bibtex
@inproceedings{ma2024sit,
  title={SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers},
  author={Ma, Nanye and Goldstein, Mark and Albergo, Michael S. and Boffi, Nicholas M. and Vanden-Eijnden, Eric and Xie, Saining},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
```