---
license: other
license_name: nsclv1
license_link: https://huggingface.co/nvidia/PixelDiT-ImageNet/blob/main/LICENSE
library_name: diffusers
pipeline_tag: text-to-image
tags:
- diffusers
- image-generation
- class-conditional
- text-to-image
- imagenet
- pixeldit
- flow-matching
- pixel-space
- dit
widget:
- text: A golden retriever playing in a sunny garden
  output:
    url: PixelDiT-T2I-1024/demo.png
- text: golden retriever
  output:
    url: PixelDiT-XL-16-256/demo.png
language:
- en
---
# BiliSakura/PixelDiT-diffusers
Self-contained PixelDiT checkpoints for Hugging Face diffusers. Each variant folder ships its own `pipeline.py`, component modules, and weights.
Converted from [nvidia/PixelDiT-ImageNet](https://huggingface.co/nvidia/PixelDiT-ImageNet) and [nvidia/PixelDiT-1300M-1024px](https://huggingface.co/nvidia/PixelDiT-1300M-1024px) using [PixelDiT-diffusers](https://github.com/BiliSakura/Visual-Generative-Foundation-Model-Collection/tree/main/libs/PixelDiT-diffusers).
## Available checkpoints
| Subfolder | Pipeline | Task | Resolution | Source checkpoint | gFID | Params |
| --- | --- | --- | ---: | --- | ---: | ---: |
| [`PixelDiT-T2I-1024/`](PixelDiT-T2I-1024/) | `PixelDiTT2IPipeline` | text-to-image | 1024×1024 | `pixeldit_t2i_v1.pth` | — | ~1.3B |
| [`PixelDiT-XL-16-256/`](PixelDiT-XL-16-256/) | `PixelDiTPipeline` | class-to-image | 256×256 | `imagenet256_pixeldit_xl_epoch320.ckpt` | 1.61 | ~700M |
| [`PixelDiT-XL-16-512/`](PixelDiT-XL-16-512/) | `PixelDiTPipeline` | class-to-image | 512×512 | `imagenet512_pixeldit_xl.ckpt` | 1.81 | ~700M |
## Repo layout
```text
BiliSakura/PixelDiT-diffusers/
├── README.md
├── demo_inference.py
├── PixelDiT-T2I-1024/
│   ├── pipeline.py
│   ├── model_index.json
│   ├── demo.png
│   ├── scheduler/scheduler_config.json
│   └── transformer/
├── PixelDiT-XL-16-256/
│   ├── pipeline.py
│   ├── model_index.json
│   ├── demo.png
│   ├── scheduler/scheduler_config.json
│   └── transformer/
└── PixelDiT-XL-16-512/
    ├── pipeline.py
    ├── model_index.json
    ├── scheduler/scheduler_config.json
    └── transformer/
```
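Because the loading examples pass `local_files_only=True`, a variant folder has to be complete on disk before it can be loaded. The following is a minimal sketch, assuming the file list shown in the tree above; `missing_variant_files` is a hypothetical helper, not part of this repo, and it does not inspect the weight files inside `transformer/`.

```python
from __future__ import annotations

from pathlib import Path

# Files every variant folder is expected to contain, per the repo layout tree.
# demo.png is left out because PixelDiT-XL-16-512 does not ship one.
REQUIRED_FILES = (
    "pipeline.py",
    "model_index.json",
    "scheduler/scheduler_config.json",
)
REQUIRED_DIRS = ("transformer",)


def missing_variant_files(model_dir: str | Path) -> list[str]:
    """Return the expected paths that are absent from a local variant folder."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing


print(missing_variant_files("./PixelDiT-XL-16-256"))
```

An empty list only means the expected files and folders exist; it says nothing about whether the weights themselves are valid.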
Each variant is self-contained. The `scheduler/` folder configures the built-in `FlowMatchEulerDiscreteScheduler` from the PyPI `diffusers` package, and inference needs no shared helper modules beyond the local variant directory.
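`FlowMatchEulerDiscreteScheduler` advances the sample with a first-order Euler step, `x_next = x + (sigma_next - sigma) * v`, where `v` is the transformer's velocity prediction. The toy sketch below uses plain floats instead of image tensors and assumes the rectified-flow convention `x_sigma = (1 - sigma) * x0 + sigma * noise`, whose exact velocity is `noise - x0`. It illustrates the update rule only and is not the pipeline's sampling loop; the constant "exact velocity" stands in for the model.

```python
from __future__ import annotations


def euler_step(x: float, v: float, sigma: float, sigma_next: float) -> float:
    """One flow-matching Euler step from noise level sigma to sigma_next."""
    return x + (sigma_next - sigma) * v


def sample(x0: float, noise: float, sigmas: list[float]) -> float:
    """Integrate from sigmas[0] == 1.0 (pure noise) down to sigmas[-1]."""
    x = noise  # under the assumed convention, sigma = 1 means pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = noise - x0  # exact velocity of the straight noise-to-data path
        x = euler_step(x, v, sigma, sigma_next)
    return x


print(sample(x0=0.5, noise=2.0, sigmas=[1.0, 0.75, 0.5, 0.25, 0.0]))  # 0.5
```

Because the assumed path is a straight line, Euler steps with the exact velocity land on `x0` at `sigma = 0`; a learned velocity is only approximate, which is why step count matters in practice.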
## ImageNet class labels
`id2label` is embedded in each ImageNet variant's `model_index.json`, following the diffusers DiT convention.
- `pipe.id2label`: id → English label mapping
- `pipe.labels`: reverse map (English synonym → id)
- `pipe.get_label_ids("golden retriever")`: resolve English label(s) to class ids
- `pipe(class_labels="golden retriever", ...)`: string labels are resolved to ids automatically
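In DiT-style `id2label` maps, each value can hold several comma-separated synonyms (ImageNet class 1 is "goldfish, Carassius auratus"), so the reverse map gets one entry per synonym. A minimal sketch modeled on how diffusers' `DiTPipeline` builds its `labels` map; `build_label_map` and `get_label_ids` here are hypothetical standalone functions, not the custom pipeline's code, and the two-entry `id2label` dict is an illustrative excerpt rather than the shipped file.

```python
from __future__ import annotations


def build_label_map(id2label: dict[str, str]) -> dict[str, int]:
    """Reverse an id -> "syn1, syn2, ..." mapping into synonym -> id."""
    labels = {}
    for key, value in id2label.items():
        for synonym in value.split(","):
            labels[synonym.strip()] = int(key)
    return dict(sorted(labels.items()))


def get_label_ids(labels: dict[str, int], label: str | list[str]) -> list[int]:
    """Resolve one label or a list of labels to class ids; raise on unknown names."""
    names = [label] if isinstance(label, str) else list(label)
    unknown = [name for name in names if name not in labels]
    if unknown:
        raise ValueError(f"Unknown label(s): {unknown}")
    return [labels[name] for name in names]


id2label = {"1": "goldfish, Carassius auratus", "207": "golden retriever"}
labels = build_label_map(id2label)
print(get_label_ids(labels, "golden retriever"))  # [207]
print(get_label_ids(labels, ["Carassius auratus", "goldfish"]))  # [1, 1]
```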
## Demo
![PixelDiT-T2I-1024 demo](PixelDiT-T2I-1024/demo.png)
Text-to-image — "A golden retriever playing in a sunny garden", 1024×1024, 50 steps, `guidance_scale=2.75`.
```bash
python demo_inference_t2i.py
```
![PixelDiT-XL-16-256 demo](PixelDiT-XL-16-256/demo.png)
Class 207 — golden retriever, 256×256, 100 steps, `guidance_scale=2.75`, CFG interval `[0.1, 0.9]`.
```bash
python demo_inference.py
```
## Load from a local clone
### Text-to-image 1024×1024 (`PixelDiT-T2I-1024`)
```python
from pathlib import Path
import torch
from diffusers import DiffusionPipeline

model_dir = Path("./PixelDiT-T2I-1024").resolve()
pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    local_files_only=True,
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    prompt="A golden retriever playing in a sunny garden",
    negative_prompt="low quality, worst quality, over-saturated, blurry, deformed, watermark",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=2.75,
    generator=generator,
).images[0]
image.save("demo.png")
```
The Gemma text encoder (`google/gemma-2-2b-it`) is downloaded on first run unless it is bundled under `text_encoder/`.
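To check ahead of time whether a run will need to fetch the text encoder from the Hub, the fallback just described can be mirrored in a few lines. A minimal sketch: `text_encoder_source` is a hypothetical helper, and treating `text_encoder/config.json` as the marker of a bundled encoder is an assumption; the variant's own `pipeline.py` may decide differently.

```python
from __future__ import annotations

from pathlib import Path

# Hub id quoted in this README for the default text encoder.
DEFAULT_TEXT_ENCODER = "google/gemma-2-2b-it"


def text_encoder_source(model_dir: str | Path) -> str:
    """Return the bundled text_encoder/ path if present, else the Hub id."""
    local = Path(model_dir) / "text_encoder"
    # Assumed heuristic: a folder with a config.json counts as bundled.
    if (local / "config.json").is_file():
        return str(local)
    return DEFAULT_TEXT_ENCODER


print(text_encoder_source("./PixelDiT-T2I-1024"))
```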
### ImageNet 256×256 (`PixelDiT-XL-16-256`)
```python
from pathlib import Path
import torch
from diffusers import DiffusionPipeline

model_dir = Path("./PixelDiT-XL-16-256").resolve()
pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    local_files_only=True,
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

print(pipe.id2label[207])
print(pipe.get_label_ids("golden retriever"))

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    class_labels="golden retriever",
    height=256,
    width=256,
    num_inference_steps=100,
    guidance_scale=2.75,
    guidance_interval_min=0.1,
    guidance_interval_max=0.9,
    generator=generator,
).images[0]
image.save("demo.png")
```
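`guidance_scale` sets the strength of classifier-free guidance (CFG), and `guidance_interval_min` / `guidance_interval_max` restrict guidance to part of the sampling trajectory. A minimal sketch, assuming the interval is compared against a normalized time `t` in [0, 1] and that outside the interval the conditional prediction is used unguided; which quantity the custom `pipeline.py` actually compares (timestep, sigma, or step fraction) is not stated in this README, so `guided_prediction` is hypothetical.

```python
def guided_prediction(
    cond: float,
    uncond: float,
    guidance_scale: float,
    t: float,
    interval_min: float = 0.1,
    interval_max: float = 0.9,
) -> float:
    """Apply CFG only when t lies inside [interval_min, interval_max]."""
    if interval_min <= t <= interval_max:
        # Standard CFG: extrapolate from the unconditional toward the conditional branch.
        return uncond + guidance_scale * (cond - uncond)
    # Outside the interval, use the conditional prediction without guidance.
    return cond


for t in (0.05, 0.5, 0.95):
    print(t, guided_prediction(cond=1.0, uncond=0.0, guidance_scale=2.75, t=t))
```

With an interval of `[0.0, 1.0]`, as listed for `PixelDiT-T2I-1024`, this sketch guides every step.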
### ImageNet 512×512 (`PixelDiT-XL-16-512`)
```python
from pathlib import Path
import torch
from diffusers import DiffusionPipeline

model_dir = Path("./PixelDiT-XL-16-512").resolve()
pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    local_files_only=True,
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    class_labels=207,
    height=512,
    width=512,
    num_inference_steps=100,
    guidance_scale=3.5,
    guidance_interval_min=0.1,
    guidance_interval_max=1.0,
    generator=generator,
).images[0]
image.save("demo.png")
```
## Recommended inference settings
| Variant | Steps | CFG scale | Scheduler shift | CFG interval |
| --- | ---: | ---: | ---: | --- |
| `PixelDiT-T2I-1024` | 50 | 2.75 | 4.0 | [0.0, 1.0] |
| `PixelDiT-XL-16-256` | 100 | 2.75 | 1.0 | [0.1, 0.9] |
| `PixelDiT-XL-16-512` | 100 | 3.5 | 2.0 | [0.1, 1.0] |
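The "Scheduler shift" column corresponds to the static `shift` of `FlowMatchEulerDiscreteScheduler`, which remaps each noise level as `sigma' = shift * sigma / (1 + (shift - 1) * sigma)`. This keeps the endpoints 0 and 1 fixed and, for `shift > 1`, pushes intermediate levels toward 1 so more steps are spent at high noise. The sketch reproduces that formula on a simplified linear grid; the grid construction is an assumption and differs in detail from the scheduler's own `set_timesteps`.

```python
from __future__ import annotations


def shift_sigma(sigma: float, shift: float) -> float:
    """Static timestep shift; maps [0, 1] onto itself and pushes values toward 1 when shift > 1."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_sigmas(num_steps: int, shift: float) -> list[float]:
    """Linear grid from 1.0 down to 1/num_steps, shifted, plus a terminal 0.0."""
    grid = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift_sigma(s, shift) for s in grid] + [0.0]


for shift in (1.0, 2.0, 4.0):  # shift values from the table
    print(shift, [round(s, 3) for s in shifted_sigmas(4, shift)])
```

With `shift=1.0` (the 256×256 setting) the grid is unchanged; the larger shifts used at 512×512 and 1024×1024 spend more of the step budget at high noise levels.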
PixelDiT denoises directly in pixel space (no VAE). `height` and `width` must be divisible by the patch size (16).
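A pre-flight check can catch sizes that break the divisibility rule before the pipeline is called. A minimal sketch; `patch_grid` and `round_to_patch` are hypothetical helpers, and the pipeline's own validation may behave differently.

```python
from __future__ import annotations

PATCH_SIZE = 16  # patch size stated in this README


def patch_grid(height: int, width: int, patch_size: int = PATCH_SIZE) -> tuple[int, int]:
    """Validate a resolution and return the (rows, cols) patch grid."""
    if height % patch_size or width % patch_size:
        raise ValueError(
            f"height={height} and width={width} must be divisible by {patch_size}"
        )
    return height // patch_size, width // patch_size


def round_to_patch(size: int, patch_size: int = PATCH_SIZE) -> int:
    """Round a side length to the nearest positive multiple of patch_size.

    Python's round() sends exact ties to the even multiple (e.g. 62.5 -> 62).
    """
    return max(patch_size, round(size / patch_size) * patch_size)


print(patch_grid(256, 256))  # (16, 16)
print(round_to_patch(1030))  # 1024
```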
## Conversion
```bash
cd libs/PixelDiT-diffusers

# Text-to-image, 1024×1024
python scripts/convert_pixeldit_t2i_to_diffusers.py \
  --checkpoint /path/to/pixeldit_t2i_v1.pth \
  --config /path/to/config.json \
  --output /path/to/PixelDiT-T2I-1024 \
  --sample-size 1024 \
  --scheduler-shift 4.0 \
  --check-load

# ImageNet class-to-image, 256×256
python scripts/convert_pixeldit_to_diffusers.py \
  --checkpoint /path/to/imagenet256_pixeldit_xl_epoch320.ckpt \
  --output /path/to/PixelDiT-XL-16-256 \
  --model-size pixeldit-xl \
  --sample-size 256 \
  --scheduler-shift 1.0 \
  --check-load \
  --id2label /path/to/id2label_en.json
```
## Citation
```bibtex
@inproceedings{yu2025pixeldit,
  title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
  author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026},
}
```
## License
Weights are converted from NVIDIA checkpoints released under the [NSCLv1 License](https://huggingface.co/nvidia/PixelDiT-ImageNet/blob/main/LICENSE). Use for non-commercial research and evaluation only.