---
license: mit
library_name: diffusers
pipeline_tag: text-to-image
tags:
  - diffusers
  - image-generation
  - class-conditional
  - text-to-image
  - imagenet
  - pixelgen
  - flow-matching
  - pixel-space
  - jit
widget:
  - text: golden retriever
    output:
      url: PixelGen-XL-16-256/demo.png
language:
  - en
---

# BiliSakura/PixelGen-diffusers

Self-contained PixelGen checkpoints for Hugging Face diffusers. Each variant folder ships its own pipeline code, component modules, and weights.

Converted from upstream PixelGen checkpoints using [PixelGen-diffusers](https://github.com/Bili-Sakura/Visual-Generative-Foundation-Model-Collection/tree/main/libs/PixelGen-diffusers) in [Visual-Generative-Foundation-Model-Collection](https://github.com/Bili-Sakura/Visual-Generative-Foundation-Model-Collection).

## Available checkpoints

| Subfolder | Pipeline | Task | Resolution | Model type |
| --- | --- | --- | ---: | --- |
| [`PixelGen-XL-16-256/`](PixelGen-XL-16-256/) | `PixelGenC2IPipeline` | class-to-image | 256×256 | PixelGen-XL/16 |
| [`PixelGen-XXL-16-512-t2i/`](PixelGen-XXL-16-512-t2i/) | `PixelGenT2IPipeline` | text-to-image | 512×512 | PixelGen-XXL/16-T2I |

## Repo layout

```text
BiliSakura/PixelGen-diffusers/
├── README.md
├── PixelGen-XL-16-256/
│   ├── pipeline.py
│   ├── model_index.json
│   ├── demo.png
│   ├── scheduler/
│   │   ├── scheduler_config.json
│   │   └── scheduling_pixelgen.py
│   └── transformer/
│       ├── config.json
│       └── transformer_jit.py
└── PixelGen-XXL-16-512-t2i/
    ├── pipeline.py
    ├── model_index.json
    ├── conversion_metadata.json
    ├── scheduler/
    │   ├── scheduler_config.json
    │   └── scheduling_pixelgen.py
    ├── text_encoder/
    ├── tokenizer/
    └── transformer/
        ├── config.json
        ├── diffusion_pytorch_model.safetensors
        └── transformer_jit_t2i.py
```

Each variant folder is self-contained: load it with `custom_pipeline=.../pipeline.py` and `trust_remote_code=True`. PixelGen denoises directly in pixel space, so no VAE is involved.

## ImageNet class labels

For [`PixelGen-XL-16-256/`](PixelGen-XL-16-256/), `id2label` is embedded in `model_index.json` (DiT-style).

- `pipe.id2label` — inspect id → English label correspondence
- `pipe.labels` — reverse map (English synonym → id)
- `pipe.get_label_ids("golden retriever")`
- `pipe(class_labels="golden retriever", ...)` — string labels resolved automatically
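
The lookup helpers listed above can be pictured with a small stand-alone sketch. It assumes the DiT convention of comma-separated synonyms per id. `build_label_map` and the free function `get_label_ids` are hypothetical stand-ins rather than the pipeline's actual code, and the three-entry `id2label` is a toy excerpt of the ImageNet table.

```python
def build_label_map(id2label):
    """Invert an id -> "syn1, syn2" mapping into synonym -> id (lower-cased)."""
    labels = {}
    for class_id, names in id2label.items():
        for name in names.split(","):
            key = name.strip().lower()
            if key:
                # Keep the first id seen if two classes share a synonym.
                labels.setdefault(key, int(class_id))
    return labels


def get_label_ids(labels, query):
    """Resolve one label or a list of labels (str or int) to a list of ids."""
    items = [query] if isinstance(query, (str, int)) else list(query)
    ids = []
    for item in items:
        if isinstance(item, int):
            ids.append(item)  # already an id, pass it through
            continue
        key = item.strip().lower()
        if key not in labels:
            raise KeyError(f"unknown label: {item!r}")
        ids.append(labels[key])
    return ids


# Toy excerpt (assumption): comma-separated synonyms per ImageNet id.
id2label = {207: "golden retriever", 281: "tabby, tabby cat", 948: "Granny Smith"}
labels = build_label_map(id2label)
print(get_label_ids(labels, "golden retriever"))
print(get_label_ids(labels, ["tabby cat", 948]))
```

Lower-casing both the keys and the query keeps the lookup case-insensitive. Whether the real pipeline does the same is not stated here.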

## Demo

![PixelGen-XL-16-256 demo](PixelGen-XL-16-256/demo.png)

Class 207 — golden retriever, 256×256, 50 steps, `guidance_scale=2.25`, Heun solver, `timeshift=2.0`.

## Load from Hugging Face

### Class-to-image (`PixelGen-XL-16-256`)

```python
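from pathlib import Path

import torch
from diffusers import DiffusionPipeline
from huggingface_hub import snapshot_download

# from_pretrained expects a `namespace/name` repo id or a local directory,
# so fetch only this variant's subfolder and load it from the local snapshot.
repo_dir = snapshot_download(
    "BiliSakura/PixelGen-diffusers",
    allow_patterns=["PixelGen-XL-16-256/*"],
)
model_dir = Path(repo_dir) / "PixelGen-XL-16-256"

pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")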

print(pipe.id2label[207])
print(pipe.get_label_ids("golden retriever"))

generator = torch.Generator(device="cuda").manual_seed(0)
images = pipe(
    class_labels="golden retriever",
    num_inference_steps=50,
    guidance_scale=2.25,
    generator=generator,
).images
```

### Text-to-image (`PixelGen-XXL-16-512-t2i`)

The pipeline uses the bundled Qwen3 text encoder when `text_encoder/` is present; otherwise it downloads the encoder from the path recorded in `conversion_metadata.json`.

```python
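from pathlib import Path

import torch
from diffusers import DiffusionPipeline
from huggingface_hub import snapshot_download

# from_pretrained expects a `namespace/name` repo id or a local directory,
# so fetch only this variant's subfolder and load it from the local snapshot.
repo_dir = snapshot_download(
    "BiliSakura/PixelGen-diffusers",
    allow_patterns=["PixelGen-XXL-16-512-t2i/*"],
)
model_dir = Path(repo_dir) / "PixelGen-XXL-16-512-t2i"

pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")  # keep the pipeline on the same device as the generator below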

generator = torch.Generator(device="cuda").manual_seed(42)
images = pipe(
    prompt="A golden retriever playing in a sunny garden",
    num_inference_steps=50,
    guidance_scale=4.0,
    generator=generator,
).images
```

## Load from a local clone

### Class-to-image (`PixelGen-XL-16-256`)

```python
from pathlib import Path
import torch
from diffusers import DiffusionPipeline

model_dir = Path("./PixelGen-XL-16-256").resolve()
pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    local_files_only=True,
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(
    class_labels="golden retriever",
    num_inference_steps=50,
    guidance_scale=2.25,
    generator=generator,
).images[0]
image.save("demo.png")
```

## Recommended inference settings

| Variant | Steps | CFG scale | Solver | Timeshift | CFG interval |
| --- | ---: | ---: | --- | ---: | --- |
| `PixelGen-XL-16-256` | 50 | 2.25 | heun | 2.0 | [0.1, 0.9] |
| `PixelGen-XXL-16-512-t2i` | 25 | 4.0 | adam_lm | 3.0 | [0.0, 1.0] |

`height` and `width` are fixed by each checkpoint's `sample_size`. Custom sizes are not supported for these exports.
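
For scripting, the table can be mirrored as plain data. This is a minimal sketch: `RECOMMENDED` and `call_kwargs` are hypothetical names, and only `num_inference_steps` and `guidance_scale` are forwarded because those are the call arguments that appear in this README's examples. How solver, timeshift, and CFG interval are passed to the pipeline is not assumed here.

```python
# Values copied from the table above; the dictionary keys are illustrative names.
RECOMMENDED = {
    "PixelGen-XL-16-256": {
        "steps": 50, "cfg_scale": 2.25, "solver": "heun",
        "timeshift": 2.0, "cfg_interval": (0.1, 0.9),
    },
    "PixelGen-XXL-16-512-t2i": {
        "steps": 25, "cfg_scale": 4.0, "solver": "adam_lm",
        "timeshift": 3.0, "cfg_interval": (0.0, 1.0),
    },
}


def call_kwargs(variant, **overrides):
    """Build the two call arguments used in this README's examples."""
    rec = RECOMMENDED[variant]
    kwargs = {
        "num_inference_steps": rec["steps"],
        "guidance_scale": rec["cfg_scale"],
    }
    kwargs.update(overrides)  # explicit arguments win over the defaults
    return kwargs


print(call_kwargs("PixelGen-XL-16-256"))
print(call_kwargs("PixelGen-XXL-16-512-t2i", num_inference_steps=50))
```

The result can then be splatted into a call such as `pipe(class_labels="golden retriever", **call_kwargs("PixelGen-XL-16-256"))`.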

## Interface notes

- Class-conditional generation uses `class_labels` (integer ImageNet id or English synonym).
- `guidance_scale > 1.0` enables classifier-free guidance over a null class token.
- `sampling_method` accepts `heun` or `euler` for C2I; T2I defaults to `adam_lm`.
- `noise_scale` defaults to `1.0` at 256×256 and `2.0` at 512×512 when not specified.
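
The guidance and solver notes above can be illustrated with scalar toy code. This is a textbook sketch, not the pipeline's implementation: it assumes the standard classifier-free guidance formula, a guidance interval expressed in the same time variable `t` as the solver, and a uniform time grid from 0 to 1. All function names are hypothetical.

```python
import math


def guided_velocity(v_cond, v_uncond, t, guidance_scale, cfg_interval=(0.0, 1.0)):
    """Classifier-free guidance, applied only while t lies inside cfg_interval."""
    lo, hi = cfg_interval
    if guidance_scale <= 1.0 or not (lo <= t <= hi):
        return v_cond  # guidance off: use the conditional prediction as-is
    return v_uncond + guidance_scale * (v_cond - v_uncond)


def euler_step(velocity, x, t, dt):
    """First-order update: one velocity evaluation per step."""
    return x + dt * velocity(x, t)


def heun_step(velocity, x, t, dt):
    """Second-order update: average the slopes at both ends of the step."""
    k1 = velocity(x, t)
    k2 = velocity(x + dt * k1, t + dt)
    return x + 0.5 * dt * (k1 + k2)


def integrate(step, velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 on a uniform grid."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        x = step(velocity, x, i * dt, dt)
    return x


def growth(x, t):
    """Toy ODE dx/dt = x, whose exact solution at t=1 is e for x(0)=1."""
    return x


# Absolute error of each solver after 50 steps on the toy ODE.
print(abs(integrate(euler_step, growth, 1.0, 50) - math.e))
print(abs(integrate(heun_step, growth, 1.0, 50) - math.e))

# Guidance inside and outside the interval recommended for the 256x256 variant.
print(guided_velocity(1.0, 0.0, t=0.5, guidance_scale=2.25, cfg_interval=(0.1, 0.9)))
print(guided_velocity(1.0, 0.0, t=0.95, guidance_scale=2.25, cfg_interval=(0.1, 0.9)))
```

Heun evaluates the velocity twice per step, so at an equal step count it costs roughly twice as many model calls as Euler in exchange for the smaller integration error.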

## Citation

Source paper:

- [PixelGen: Improving Pixel Diffusion with Perceptual Loss](https://arxiv.org/abs/2602.02493)
- [Hugging Face Papers page](https://huggingface.co/papers/2602.02493)

```bibtex
@article{ma2026pixelgen,
  title={PixelGen: Improving Pixel Diffusion with Perceptual Loss},
  author={Zehong Ma and Ruihan Xu and Shiliang Zhang},
  year={2026},
  eprint={2602.02493},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.02493},
}
```