---
license: other
license_name: nsclv1
license_link: https://huggingface.co/nvidia/PixelDiT-ImageNet/blob/main/LICENSE
library_name: diffusers
pipeline_tag: text-to-image
tags:
  - diffusers
  - image-generation
  - class-conditional
  - text-to-image
  - imagenet
  - pixeldit
  - flow-matching
  - pixel-space
  - dit
widget:
  - text: A golden retriever playing in a sunny garden
    output:
      url: PixelDiT-T2I-1024/demo.png
  - text: golden retriever
    output:
      url: PixelDiT-XL-16-256/demo.png
language:
  - en
---

# BiliSakura/PixelDiT-diffusers

Self-contained PixelDiT checkpoints for Hugging Face diffusers. Each variant folder ships its own `pipeline.py`, component modules, and weights.

Converted from [nvidia/PixelDiT-ImageNet](https://huggingface.co/nvidia/PixelDiT-ImageNet) and [nvidia/PixelDiT-1300M-1024px](https://huggingface.co/nvidia/PixelDiT-1300M-1024px) using [PixelDiT-diffusers](https://github.com/BiliSakura/Visual-Generative-Foundation-Model-Collection/tree/main/libs/PixelDiT-diffusers).

## Available checkpoints

| Subfolder | Pipeline | Task | Resolution | Source checkpoint | gFID | Params |
| --- | --- | --- | ---: | --- | ---: | ---: |
| [`PixelDiT-T2I-1024/`](PixelDiT-T2I-1024/) | `PixelDiTT2IPipeline` | text-to-image | 1024×1024 | `pixeldit_t2i_v1.pth` | n/a | ~1.3B |
| [`PixelDiT-XL-16-256/`](PixelDiT-XL-16-256/) | `PixelDiTPipeline` | class-to-image | 256×256 | `imagenet256_pixeldit_xl_epoch320.ckpt` | 1.61 | ~700M |
| [`PixelDiT-XL-16-512/`](PixelDiT-XL-16-512/) | `PixelDiTPipeline` | class-to-image | 512×512 | `imagenet512_pixeldit_xl.ckpt` | 1.81 | ~700M |

## Repo layout

```text
BiliSakura/PixelDiT-diffusers/
├── README.md
├── demo_inference.py
├── PixelDiT-T2I-1024/
│   ├── pipeline.py
│   ├── model_index.json
│   ├── demo.png
│   ├── scheduler/scheduler_config.json
│   └── transformer/
├── PixelDiT-XL-16-256/
│   ├── pipeline.py
│   ├── model_index.json
│   ├── demo.png
│   ├── scheduler/scheduler_config.json
│   └── transformer/
└── PixelDiT-XL-16-512/
    ├── pipeline.py
    ├── model_index.json
    ├── scheduler/scheduler_config.json
    └── transformer/
```

Each variant is self-contained. The `scheduler/` folder uses the built-in `FlowMatchEulerDiscreteScheduler` from the PyPI release of diffusers, and inference needs no helper modules outside the variant's own directory.

## ImageNet class labels

`id2label` is embedded in each variant's `model_index.json` (DiT-style).

- `pipe.id2label`: maps each class id to its English label
- `pipe.labels`: the reverse map (English synonym → id)
- `pipe.get_label_ids("golden retriever")`: looks up the ids for a label string
- `pipe(class_labels="golden retriever", ...)`: string labels are resolved to ids automatically

## Demo

![PixelDiT-T2I-1024 demo](PixelDiT-T2I-1024/demo.png)

Text-to-image: "A golden retriever playing in a sunny garden", 1024×1024, 50 steps, `guidance_scale=2.75`.

```bash
python demo_inference_t2i.py
```

![PixelDiT-XL-16-256 demo](PixelDiT-XL-16-256/demo.png)

Class 207 (golden retriever), 256×256, 100 steps, `guidance_scale=2.75`, CFG interval `[0.1, 0.9]`.

```bash
python demo_inference.py
```

## Load from a local clone

### Text-to-image 1024×1024 (`PixelDiT-T2I-1024`)

```python
from pathlib import Path

import torch
from diffusers import DiffusionPipeline

model_dir = Path("./PixelDiT-T2I-1024").resolve()

pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    local_files_only=True,
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    prompt="A golden retriever playing in a sunny garden",
    negative_prompt="low quality, worst quality, over-saturated, blurry, deformed, watermark",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=2.75,
    generator=generator,
).images[0]
image.save("demo.png")
```

The Gemma text encoder (`google/gemma-2-2b-it`) is downloaded on first run unless it is bundled under `text_encoder/`.
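To see ahead of time whether the first run is likely to trigger that download, you can look for a bundled encoder on disk. This is a minimal sketch, not part of the pipeline: the helper name `has_bundled_text_encoder` is hypothetical, and it assumes a bundled encoder is a `text_encoder/` directory inside the variant folder containing a `config.json`. The pipeline's own detection logic may differ.

```python
from pathlib import Path


def has_bundled_text_encoder(model_dir) -> bool:
    """Return True if `model_dir/text_encoder/config.json` exists.

    Assumption: a bundled encoder is stored in the usual Hugging Face
    layout, i.e. a directory holding a config.json next to the weights.
    """
    encoder_dir = Path(model_dir) / "text_encoder"
    return (encoder_dir / "config.json").is_file()


model_dir = Path("./PixelDiT-T2I-1024")
if has_bundled_text_encoder(model_dir):
    print("text_encoder/ found locally")
else:
    print("no bundled text_encoder/; expect a download on first run")
```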

### ImageNet 256×256 (`PixelDiT-XL-16-256`)

```python
from pathlib import Path

import torch
from diffusers import DiffusionPipeline

model_dir = Path("./PixelDiT-XL-16-256").resolve()

pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    local_files_only=True,
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

print(pipe.id2label[207])
print(pipe.get_label_ids("golden retriever"))

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    class_labels="golden retriever",
    height=256,
    width=256,
    num_inference_steps=100,
    guidance_scale=2.75,
    guidance_interval_min=0.1,
    guidance_interval_max=0.9,
    generator=generator,
).images[0]
image.save("demo.png")
```

### ImageNet 512×512 (`PixelDiT-XL-16-512`)

```python
from pathlib import Path

import torch
from diffusers import DiffusionPipeline

model_dir = Path("./PixelDiT-XL-16-512").resolve()

pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    local_files_only=True,
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    class_labels=207,
    height=512,
    width=512,
    num_inference_steps=100,
    guidance_scale=3.5,
    guidance_interval_min=0.1,
    guidance_interval_max=1.0,
    generator=generator,
).images[0]
image.save("demo.png")
```

## Recommended inference settings

| Variant | Steps | CFG scale | Scheduler shift | CFG interval |
| --- | ---: | ---: | ---: | --- |
| `PixelDiT-T2I-1024` | 50 | 2.75 | 4.0 | [0.0, 1.0] |
| `PixelDiT-XL-16-256` | 100 | 2.75 | 1.0 | [0.1, 0.9] |
| `PixelDiT-XL-16-512` | 100 | 3.5 | 2.0 | [0.1, 1.0] |

PixelDiT denoises directly in pixel space (no VAE). `height` and `width` must be divisible by the patch size (16).
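
The divisibility rule can be checked before calling the pipeline. The sketch below is illustrative only: `patch_grid` and `round_up_to_patch` are hypothetical helpers, not functions shipped in `pipeline.py`, and the only fact taken from this README is the patch size of 16.

```python
from typing import Tuple

PATCH_SIZE = 16  # patch size stated in this README


def patch_grid(height: int, width: int, patch_size: int = PATCH_SIZE) -> Tuple[int, int]:
    """Return (rows, cols) of the patch grid, or raise if the size is invalid."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    if height % patch_size or width % patch_size:
        raise ValueError(
            f"height={height} and width={width} must both be divisible by {patch_size}"
        )
    return height // patch_size, width // patch_size


def round_up_to_patch(value: int, patch_size: int = PATCH_SIZE) -> int:
    """Round a side length up to the next multiple of the patch size."""
    return -(-value // patch_size) * patch_size


for side in (256, 512, 1024):
    rows, cols = patch_grid(side, side)
    print(f"{side}x{side}: {rows}x{cols} grid, {rows * cols} patches")
```

For the three shipped resolutions this gives 256, 1024, and 4096 patches respectively. A side length such as 1000 fails the check, and `round_up_to_patch(1000)` returns 1008 as the nearest valid size above it.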

## Conversion

```bash
cd libs/PixelDiT-diffusers

python scripts/convert_pixeldit_t2i_to_diffusers.py \
  --checkpoint /path/to/pixeldit_t2i_v1.pth \
  --config /path/to/config.json \
  --output /path/to/PixelDiT-T2I-1024 \
  --sample-size 1024 \
  --scheduler-shift 4.0 \
  --check-load

python scripts/convert_pixeldit_to_diffusers.py \
  --checkpoint /path/to/imagenet256_pixeldit_xl_epoch320.ckpt \
  --output /path/to/PixelDiT-XL-16-256 \
  --model-size pixeldit-xl \
  --sample-size 256 \
  --scheduler-shift 1.0 \
  --check-load \
  --id2label /path/to/id2label_en.json
```

## Citation

```bibtex
@inproceedings{yu2025pixeldit,
  title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
  author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026},
}
```

## License

Weights are converted from NVIDIA checkpoints released under the [NSCLv1 License](https://huggingface.co/nvidia/PixelDiT-ImageNet/blob/main/LICENSE). Use for non-commercial research and evaluation only.