---
license: mit
library_name: diffusers
pipeline_tag: text-to-image
tags:
- diffusers
- image-generation
- class-conditional
- text-to-image
- imagenet
- pixelgen
- flow-matching
- pixel-space
- jit
widget:
- text: golden retriever
  output:
    url: PixelGen-XL-16-256/demo.png
language:
- en
---

# BiliSakura/PixelGen-diffusers

Self-contained PixelGen checkpoints for Hugging Face diffusers. Each variant folder ships its own pipeline code, component modules, and weights.

Converted from upstream PixelGen checkpoints using [PixelGen-diffusers](https://github.com/Bili-Sakura/Visual-Generative-Foundation-Model-Collection/tree/main/libs/PixelGen-diffusers) in [Visual-Generative-Foundation-Model-Collection](https://github.com/Bili-Sakura/Visual-Generative-Foundation-Model-Collection).

## Available checkpoints

| Subfolder | Pipeline | Task | Resolution | Model type |
| --- | --- | --- | ---: | --- |
| [`PixelGen-XL-16-256/`](PixelGen-XL-16-256/) | `PixelGenC2IPipeline` | class-to-image | 256×256 | PixelGen-XL/16 |
| [`PixelGen-XXL-16-512-t2i/`](PixelGen-XXL-16-512-t2i/) | `PixelGenT2IPipeline` | text-to-image | 512×512 | PixelGen-XXL/16-T2I |

## Repo layout

```text
BiliSakura/PixelGen-diffusers/
├── README.md
├── PixelGen-XL-16-256/
│   ├── pipeline.py
│   ├── model_index.json
│   ├── demo.png
│   ├── scheduler/
│   │   ├── scheduler_config.json
│   │   └── scheduling_pixelgen.py
│   └── transformer/
│       ├── config.json
│       └── transformer_jit.py
└── PixelGen-XXL-16-512-t2i/
    ├── pipeline.py
    ├── model_index.json
    ├── conversion_metadata.json
    ├── scheduler/
    │   ├── scheduler_config.json
    │   └── scheduling_pixelgen.py
    ├── text_encoder/
    ├── tokenizer/
    └── transformer/
        ├── config.json
        ├── diffusion_pytorch_model.safetensors
        └── transformer_jit_t2i.py
```

Each class-conditional variant is self-contained: load with `custom_pipeline=.../pipeline.py` and `trust_remote_code=True`. PixelGen denoises directly in pixel space (no VAE).

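Because each variant folder is self-contained, it may be enough to fetch a single subfolder rather than the whole repo. The sketch below assumes `huggingface_hub` is installed and relies on its `snapshot_download` function with `allow_patterns`. The helper names `variant_patterns` and `download_variant` and the `./PixelGen-diffusers` target directory are illustrative and not part of this repo.

```python
from pathlib import Path
from typing import List

REPO_ID = "BiliSakura/PixelGen-diffusers"


def variant_patterns(variant: str) -> List[str]:
    """Glob patterns covering every file under one variant folder (hypothetical helper)."""
    return [f"{variant.strip('/')}/*"]


def download_variant(variant: str, local_dir: str = "./PixelGen-diffusers") -> Path:
    """Download only one variant folder and return its local path (hypothetical helper)."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=variant_patterns(variant),
        local_dir=local_dir,
    )
    return Path(root) / variant.strip("/")


# Example (requires network access):
# model_dir = download_variant("PixelGen-XL-16-256")
```

The returned directory could then serve as `model_dir` in the local-clone loading example.
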
## ImageNet class labels

For [`PixelGen-XL-16-256/`](PixelGen-XL-16-256/), `id2label` is embedded in `model_index.json` (DiT-style).

- `pipe.id2label`: inspect the id → English label correspondence
- `pipe.labels`: reverse map (English synonym → id)
- `pipe.get_label_ids("golden retriever")`
- `pipe(class_labels="golden retriever", ...)`: string labels are resolved automatically

## Demo

![PixelGen-XL-16-256 demo](PixelGen-XL-16-256/demo.png)

Class 207 (golden retriever), 256×256, 50 steps, `guidance_scale=2.25`, Heun solver, `timeshift=2.0`.

## Load from Hugging Face

### Class-to-image (`PixelGen-XL-16-256`)

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "BiliSakura/PixelGen-diffusers/PixelGen-XL-16-256",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")

# Inspect the label maps embedded in model_index.json.
print(pipe.id2label[207])
print(pipe.get_label_ids("golden retriever"))

# Seed the generator for reproducible sampling.
generator = torch.Generator(device="cuda").manual_seed(0)
images = pipe(
    class_labels="golden retriever",
    num_inference_steps=50,
    guidance_scale=2.25,
    generator=generator,
).images
```

### Text-to-image (`PixelGen-XXL-16-512-t2i`)

The pipeline uses the bundled Qwen3 text encoder when `text_encoder/` is present; otherwise it downloads the encoder from the path recorded in `conversion_metadata.json`.

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "BiliSakura/PixelGen-diffusers/PixelGen-XXL-16-512-t2i",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")  # move to the GPU so the pipeline matches the CUDA generator below

# Seed the generator for reproducible sampling.
generator = torch.Generator(device="cuda").manual_seed(42)
images = pipe(
    prompt="A golden retriever playing in a sunny garden",
    num_inference_steps=50,
    guidance_scale=4.0,
    generator=generator,
).images
```

## Load from a local clone

### Class-to-image (`PixelGen-XL-16-256`)

```python
from pathlib import Path

import torch
from diffusers import DiffusionPipeline

# Absolute path to the variant folder inside the local clone.
model_dir = Path("./PixelGen-XL-16-256").resolve()

pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    local_files_only=True,
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")

# Seed the generator for reproducible sampling.
generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(
    class_labels="golden retriever",
    num_inference_steps=50,
    guidance_scale=2.25,
    generator=generator,
).images[0]
image.save("demo.png")
```

## Recommended inference settings

| Variant | Steps | CFG scale | Solver | Timeshift | CFG interval |
| --- | ---: | ---: | --- | ---: | --- |
| `PixelGen-XL-16-256` | 50 | 2.25 | heun | 2.0 | [0.1, 0.9] |
| `PixelGen-XXL-16-512-t2i` | 25 | 4.0 | adam_lm | 3.0 | [0.0, 1.0] |

`height` and `width` are fixed by each checkpoint's `sample_size`. Custom sizes are not supported for these exports.

## Interface notes

- Class-conditional generation uses `class_labels` (integer ImageNet id or English synonym).
- `guidance_scale > 1.0` enables classifier-free guidance over a null class token.
- `sampling_method` accepts `heun` or `euler` for C2I; T2I defaults to `adam_lm`.
- `noise_scale` defaults to `1.0` at 256×256 and `2.0` at 512×512 when not specified.

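To make the guidance settings above more concrete, here is a minimal sketch of classifier-free guidance gated by a CFG interval. It is not the pipeline's code: the function name `guided_prediction`, the plain-list inputs, and the assumption that guidance applies only while a normalized time `t` lies inside the interval are all illustrative. The actual sampler may define its time axis and interval differently.

```python
from typing import List, Sequence, Tuple


def guided_prediction(
    cond: Sequence[float],
    uncond: Sequence[float],
    t: float,
    guidance_scale: float,
    cfg_interval: Tuple[float, float] = (0.0, 1.0),
) -> List[float]:
    """Combine conditional and null-class predictions (illustrative, not the repo's code)."""
    low, high = cfg_interval
    # Assumption: with scale <= 1.0 or outside the interval, guidance is skipped.
    if guidance_scale <= 1.0 or not (low <= t <= high):
        return list(cond)
    # Standard classifier-free guidance: extrapolate away from the null prediction.
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


# Scale and interval taken from the recommended settings for PixelGen-XL-16-256.
inside = guided_prediction([1.0, 0.5], [0.0, 0.5], t=0.5,
                           guidance_scale=2.25, cfg_interval=(0.1, 0.9))
outside = guided_prediction([1.0, 0.5], [0.0, 0.5], t=0.95,
                            guidance_scale=2.25, cfg_interval=(0.1, 0.9))
print(inside, outside)
```

Under these assumptions, a step inside the interval is pushed away from the null-class prediction, while a step outside it falls back to the conditional prediction alone.
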
## Citation

Source paper:

- [PixelGen: Improving Pixel Diffusion with Perceptual Loss](https://arxiv.org/abs/2602.02493)
- [Hugging Face Papers page](https://huggingface.co/papers/2602.02493)

```bibtex
@article{ma2026pixelgen,
  title={PixelGen: Improving Pixel Diffusion with Perceptual Loss},
  author={Zehong Ma and Ruihan Xu and Shiliang Zhang},
  year={2026},
  eprint={2602.02493},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.02493},
}
```