---
license: other
license_name: nsclv1
license_link: https://huggingface.co/nvidia/PixelDiT-ImageNet/blob/main/LICENSE
library_name: diffusers
pipeline_tag: text-to-image
tags:
- diffusers
- image-generation
- class-conditional
- text-to-image
- imagenet
- pixeldit
- flow-matching
- pixel-space
- dit
widget:
- text: A golden retriever playing in a sunny garden
  output:
    url: PixelDiT-T2I-1024/demo.png
- text: golden retriever
  output:
    url: PixelDiT-XL-16-256/demo.png
language:
- en
---

# BiliSakura/PixelDiT-diffusers

Self-contained PixelDiT checkpoints for Hugging Face diffusers. Each variant folder ships its own `pipeline.py`, component modules, and weights.

Converted from [nvidia/PixelDiT-ImageNet](https://huggingface.co/nvidia/PixelDiT-ImageNet) and [nvidia/PixelDiT-1300M-1024px](https://huggingface.co/nvidia/PixelDiT-1300M-1024px) using [PixelDiT-diffusers](https://github.com/BiliSakura/Visual-Generative-Foundation-Model-Collection/tree/main/libs/PixelDiT-diffusers).

## Available checkpoints

| Subfolder | Pipeline | Task | Resolution | Source checkpoint | gFID | Params |
| --- | --- | --- | ---: | --- | ---: | ---: |
| [`PixelDiT-T2I-1024/`](PixelDiT-T2I-1024/) | `PixelDiTT2IPipeline` | text-to-image | 1024×1024 | `pixeldit_t2i_v1.pth` | — | ~1.3B |
| [`PixelDiT-XL-16-256/`](PixelDiT-XL-16-256/) | `PixelDiTPipeline` | class-to-image | 256×256 | `imagenet256_pixeldit_xl_epoch320.ckpt` | 1.61 | ~700M |
| [`PixelDiT-XL-16-512/`](PixelDiT-XL-16-512/) | `PixelDiTPipeline` | class-to-image | 512×512 | `imagenet512_pixeldit_xl.ckpt` | 1.81 | ~700M |
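
The parameter counts in the table give a rough lower bound on memory when loading in `bfloat16`. The sketch below is plain arithmetic on the approximate counts listed above; it covers the transformer weights only and ignores the Gemma text encoder, activations, and framework overhead, so real usage will be higher. The helper name `weight_gib` is made up for this example.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}


def weight_gib(num_params, dtype="bfloat16"):
    """Return the size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Approximate parameter counts taken from the checkpoint table.
for name, params in (("PixelDiT-T2I-1024", 1.3e9), ("PixelDiT-XL-16-256", 700e6)):
    print(f"{name}: ~{weight_gib(params):.1f} GiB in bfloat16")
```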

## Repo layout

```text
BiliSakura/PixelDiT-diffusers/
├── README.md
├── demo_inference.py
├── PixelDiT-T2I-1024/
│   ├── pipeline.py
│   ├── model_index.json
│   ├── demo.png
│   ├── scheduler/scheduler_config.json
│   └── transformer/
├── PixelDiT-XL-16-256/
│   ├── pipeline.py
│   ├── model_index.json
│   ├── demo.png
│   ├── scheduler/scheduler_config.json
│   └── transformer/
└── PixelDiT-XL-16-512/
    ├── pipeline.py
    ├── model_index.json
    ├── scheduler/scheduler_config.json
    └── transformer/
```

Each variant is self-contained. The `scheduler/` folder configures the built-in `FlowMatchEulerDiscreteScheduler` from the PyPI release of diffusers, and inference needs no helper modules outside the variant's own directory.
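
Because loading from a clone uses `local_files_only=True`, it can help to confirm that a variant folder is complete before loading it. The following is a minimal sketch based only on the layout shown above; `EXPECTED` and `missing_entries` are illustrative names, not part of the repository.

```python
from pathlib import Path

# Entries every variant folder contains according to the repo layout.
EXPECTED = (
    "pipeline.py",
    "model_index.json",
    "scheduler/scheduler_config.json",
    "transformer",
)


def missing_entries(variant_dir):
    """Return the expected files or folders that are absent from variant_dir."""
    root = Path(variant_dir)
    return [name for name in EXPECTED if not (root / name).exists()]


# An empty list means the folder matches the layout.
print(missing_entries("./PixelDiT-XL-16-256"))
```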

## ImageNet class labels

`id2label` is embedded in each variant's `model_index.json` (DiT-style).

- `pipe.id2label`: id → English label mapping
- `pipe.labels`: reverse map (English synonym → id)
- `pipe.get_label_ids("golden retriever")`: look up the class id(s) for a label string
- `pipe(class_labels="golden retriever", ...)`: string labels are resolved automatically
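
For intuition, a reverse map like `pipe.labels` can be derived from `id2label` by splitting each comma-separated label into synonyms. The sketch below follows the approach used by the `DiTPipeline` in diffusers; whether the PixelDiT pipeline builds its map in exactly this way is an assumption, and `build_reverse_map` and the two-entry `sample` are illustrative only.

```python
def build_reverse_map(id2label):
    """Map every comma-separated synonym in id2label to its integer class id."""
    labels = {}
    for class_id, names in id2label.items():
        for name in names.split(","):
            labels[name.strip()] = int(class_id)
    return labels


# Two-entry excerpt for illustration; JSON files store the ids as strings.
sample = {"1": "goldfish, Carassius auratus", "207": "golden retriever"}
labels = build_reverse_map(sample)
print(labels["golden retriever"])  # 207
```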

## Demo



Text-to-image: "A golden retriever playing in a sunny garden", 1024×1024, 50 steps, `guidance_scale=2.75`.

```bash
python demo_inference_t2i.py
```



Class 207 (golden retriever), 256×256, 100 steps, `guidance_scale=2.75`, CFG interval `[0.1, 0.9]`.

```bash
python demo_inference.py
```

## Load from a local clone

### Text-to-image 1024×1024 (`PixelDiT-T2I-1024`)

```python
from pathlib import Path

import torch
from diffusers import DiffusionPipeline

model_dir = Path("./PixelDiT-T2I-1024").resolve()

# Load the variant's own pipeline.py rather than a pipeline class bundled with diffusers.
pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    local_files_only=True,
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Seed the generator so the sample is reproducible.
generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    prompt="A golden retriever playing in a sunny garden",
    negative_prompt="low quality, worst quality, over-saturated, blurry, deformed, watermark",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=2.75,
    generator=generator,
).images[0]
image.save("demo.png")
```

The Gemma text encoder (`google/gemma-2-2b-it`) is downloaded on first run unless it is bundled under `text_encoder/`.

### ImageNet 256×256 (`PixelDiT-XL-16-256`)

```python
from pathlib import Path

import torch
from diffusers import DiffusionPipeline

model_dir = Path("./PixelDiT-XL-16-256").resolve()

# Load the variant's own pipeline.py rather than a pipeline class bundled with diffusers.
pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    local_files_only=True,
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

print(pipe.id2label[207])  # label stored for class id 207
print(pipe.get_label_ids("golden retriever"))  # class id(s) for the label string

# Seed the generator so the sample is reproducible.
generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    class_labels="golden retriever",  # string labels are resolved to class ids
    height=256,
    width=256,
    num_inference_steps=100,
    guidance_scale=2.75,
    guidance_interval_min=0.1,
    guidance_interval_max=0.9,
    generator=generator,
).images[0]
image.save("demo.png")
```
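
The `guidance_interval_min` and `guidance_interval_max` arguments restrict classifier-free guidance to part of the sampling trajectory. The sketch below illustrates the general idea on scalar stand-ins for the model predictions. It assumes time is normalized to `[0, 1]`, that the bounds are inclusive, and that guidance is simply switched off outside the interval; the actual pipeline may normalize time or handle the boundaries differently. The function `guided` is a made-up name.

```python
def guided(cond, uncond, scale, t, interval_min=0.0, interval_max=1.0):
    """Apply classifier-free guidance only while t lies inside the interval."""
    if not (interval_min <= t <= interval_max):
        # Assumption: outside the interval the conditional prediction is used as-is.
        return cond
    # Standard CFG: push the prediction away from the unconditional one.
    return uncond + scale * (cond - uncond)


# Inside [0.1, 0.9] guidance is applied; outside it is skipped.
print(guided(1.0, 0.0, scale=2.75, t=0.5, interval_min=0.1, interval_max=0.9))
print(guided(1.0, 0.0, scale=2.75, t=0.95, interval_min=0.1, interval_max=0.9))
```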

### ImageNet 512×512 (`PixelDiT-XL-16-512`)

```python
from pathlib import Path

import torch
from diffusers import DiffusionPipeline

model_dir = Path("./PixelDiT-XL-16-512").resolve()

# Load the variant's own pipeline.py rather than a pipeline class bundled with diffusers.
pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    local_files_only=True,
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Seed the generator so the sample is reproducible.
generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    class_labels=207,  # integer class id (207 is golden retriever)
    height=512,
    width=512,
    num_inference_steps=100,
    guidance_scale=3.5,
    guidance_interval_min=0.1,
    guidance_interval_max=1.0,
    generator=generator,
).images[0]
image.save("demo.png")
```

## Recommended inference settings

| Variant | Steps | CFG scale | Scheduler shift | CFG interval |
| --- | ---: | ---: | ---: | --- |
| `PixelDiT-T2I-1024` | 50 | 2.75 | 4.0 | [0.0, 1.0] |
| `PixelDiT-XL-16-256` | 100 | 2.75 | 1.0 | [0.1, 0.9] |
| `PixelDiT-XL-16-512` | 100 | 3.5 | 2.0 | [0.1, 1.0] |
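
To see what the scheduler shift does, the sketch below applies the static shift formula used by `FlowMatchEulerDiscreteScheduler`, `shift * sigma / (1 + (shift - 1) * sigma)`, to a few noise levels. It assumes the "Scheduler shift" column maps directly to that scheduler's `shift` setting with static shifting; `shift_sigma` is an illustrative helper, not a diffusers function.

```python
def shift_sigma(sigma, shift):
    """Remap a noise level in [0, 1]; shift > 1 moves it toward the noisy end."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


# shift=1.0 leaves the schedule unchanged; larger shifts spend more steps at high noise.
for shift in (1.0, 2.0, 4.0):
    print(shift, [round(shift_sigma(s, shift), 3) for s in (0.25, 0.5, 0.75)])
```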

PixelDiT denoises directly in pixel space (no VAE). `height` and `width` must be divisible by the patch size (16).
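
The divisibility rule can be checked before calling the pipeline. This is a minimal sketch that validates a resolution and reports the resulting grid of 16×16 patches; `patch_grid` is a hypothetical helper, and the pipeline itself may report invalid sizes differently.

```python
PATCH_SIZE = 16  # patch size stated in the note above


def patch_grid(height, width, patch_size=PATCH_SIZE):
    """Return (rows, cols) of patches, or raise if the size is not divisible."""
    if height % patch_size or width % patch_size:
        raise ValueError(
            f"height and width must be divisible by {patch_size}, got {height}x{width}"
        )
    return height // patch_size, width // patch_size


for size in (256, 512, 1024):
    rows, cols = patch_grid(size, size)
    print(f"{size}x{size}: {rows}x{cols} grid, {rows * cols} patches")
```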

## Conversion

```bash
cd libs/PixelDiT-diffusers

# Text-to-image checkpoint
python scripts/convert_pixeldit_t2i_to_diffusers.py \
  --checkpoint /path/to/pixeldit_t2i_v1.pth \
  --config /path/to/config.json \
  --output /path/to/PixelDiT-T2I-1024 \
  --sample-size 1024 \
  --scheduler-shift 4.0 \
  --check-load

# ImageNet 256×256 class-conditional checkpoint
python scripts/convert_pixeldit_to_diffusers.py \
  --checkpoint /path/to/imagenet256_pixeldit_xl_epoch320.ckpt \
  --output /path/to/PixelDiT-XL-16-256 \
  --model-size pixeldit-xl \
  --sample-size 256 \
  --scheduler-shift 1.0 \
  --check-load \
  --id2label /path/to/id2label_en.json
```
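
The 512×512 variant is not shown above. The sketch below assembles a command for it by analogy, assuming the same script and flags apply and taking the sample size and scheduler shift from the tables in this README. It only builds and prints the command; `conversion_command` is an illustrative helper, and the exact invocation used for the published 512 checkpoint is not confirmed here.

```python
import shlex


def conversion_command(checkpoint, output, sample_size, scheduler_shift, id2label=None):
    """Build the argument list for scripts/convert_pixeldit_to_diffusers.py."""
    cmd = [
        "python", "scripts/convert_pixeldit_to_diffusers.py",
        "--checkpoint", checkpoint,
        "--output", output,
        "--model-size", "pixeldit-xl",
        "--sample-size", str(sample_size),
        "--scheduler-shift", str(scheduler_shift),
        "--check-load",
    ]
    if id2label:
        cmd += ["--id2label", id2label]
    return cmd


# Assumed settings for the 512 variant: sample size 512, scheduler shift 2.0.
cmd = conversion_command(
    "/path/to/imagenet512_pixeldit_xl.ckpt",
    "/path/to/PixelDiT-XL-16-512",
    sample_size=512,
    scheduler_shift=2.0,
    id2label="/path/to/id2label_en.json",
)
print(shlex.join(cmd))
```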

## Citation

```bibtex
@inproceedings{yu2025pixeldit,
  title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
  author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026},
}
```

## License

Weights are converted from NVIDIA checkpoints released under the [NSCLv1 License](https://huggingface.co/nvidia/PixelDiT-ImageNet/blob/main/LICENSE). Use for non-commercial research and evaluation only.