---
license: apache-2.0
pipeline_tag: unconditional-image-generation
tags:
- image-generation
- pixel-diffusion
---
# PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss
PixelGen is a simple pixel diffusion framework that generates images directly in pixel space. Unlike latent diffusion models, it avoids the artifacts and bottlenecks of VAEs by introducing two complementary perceptual losses: an LPIPS loss for local patterns and a DINO-based perceptual loss for global semantics.
Project Page | Paper | GitHub
## Introduction
PixelGen achieves results competitive with latent diffusion models by modeling a more meaningful perceptual manifold rather than the full, high-dimensional pixel manifold. Key highlights include:
- FID 5.11 on ImageNet-256 without classifier-free guidance (CFG) in only 80 epochs.
- FID 1.83 on ImageNet-256 with CFG.
- GenEval score of 0.79 on large-scale text-to-image generation tasks.
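The role of the two complementary losses can be illustrated with a toy sketch. In actual PixelGen training the local term comes from a pretrained LPIPS network and the global term from a frozen DINO encoder; the fixed random projection, image size, patch size, and loss weights below are placeholder assumptions chosen only to keep the example self-contained, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 64   # toy image size; PixelGen itself operates at 256x256
C = 3
P = 16       # patch size for the local ("LPIPS-like") term
D = 32       # embedding dim for the global ("DINO-like") term

# Fixed random projection standing in for a pretrained DINO encoder;
# real training would use frozen, pretrained feature networks here.
W_global = rng.normal(size=(H * W * C, D)) / np.sqrt(H * W * C)

def to_patches(img, p=P):
    """Split an (H, W, C) image into flattened non-overlapping p x p patches."""
    h, w, c = img.shape
    img = img[: h - h % p, : w - w % p]
    return (img.reshape(img.shape[0] // p, p, img.shape[1] // p, p, c)
               .swapaxes(1, 2)
               .reshape(-1, p * p * c))

def local_loss(x, y):
    """Patch-wise MSE: penalizes mismatched local patterns (LPIPS's role)."""
    return float(np.mean((to_patches(x) - to_patches(y)) ** 2))

def global_loss(x, y):
    """Cosine distance between global embeddings (DINO's role)."""
    ex, ey = x.reshape(-1) @ W_global, y.reshape(-1) @ W_global
    cos = ex @ ey / (np.linalg.norm(ex) * np.linalg.norm(ey) + 1e-8)
    return float(1.0 - cos)

def perceptual_loss(x, y, w_local=1.0, w_global=1.0):
    """Weighted sum of the two complementary terms (weights illustrative)."""
    return w_local * local_loss(x, y) + w_global * global_loss(x, y)

target = rng.random((H, W, C))
pred = target + 0.1 * rng.normal(size=(H, W, C))

print(perceptual_loss(target, target))  # near zero for identical images
print(perceptual_loss(pred, target))    # positive for a noisy prediction
```

In the full framework this perceptual objective supplements the diffusion loss, steering the model toward perceptually meaningful structure instead of raw per-pixel fidelity.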
## Checkpoints
| Dataset | Model | Params | Performance |
|---|---|---|---|
| ImageNet-256 | PixelGen-XL/16 | 676M | 5.11 FID (w/o CFG) / 1.83 FID (w/ CFG) |
| Text-to-Image | PixelGen-XXL/16 | 1.1B | 0.79 GenEval Score |
## Usage
For detailed environment setup and training, please refer to the official GitHub repository.
### Inference
You can run inference using the provided configuration files and checkpoints:
```bash
# for inference without CFG using the 80-epoch checkpoint
python main.py predict -c ./configs_c2i/PixelGen_XL_without_CFG.yaml --ckpt_path=./ckpts/PixelGen_XL_80ep.ckpt

# for inference with CFG using the 160-epoch checkpoint
python main.py predict -c ./configs_c2i/PixelGen_XL.yaml --ckpt_path=./ckpts/PixelGen_XL_160ep.ckpt
```
## Citation
```bibtex
@article{ma2026pixelgen,
  title={PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss},
  author={Zehong Ma and Ruihan Xu and Shiliang Zhang},
  year={2026},
  eprint={2602.02493},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.02493},
}
```