---
license: apache-2.0
---

## DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

Arxiv: https://arxiv.org/abs/2511.19365
Project Page: https://zehong-ma.github.io/DeCo
Code Repository: https://github.com/Zehong-Ma/DeCo
Huggingface Space: https://14467288703cf06a3c.gradio.live/

## 🖼️ Background

+ Pixel diffusion generates images directly in pixel space in an end-to-end fashion. This avoids the two-stage training and the unavoidable low-level artifacts of a VAE.
+ Current pixel diffusion models train slowly because a single Diffusion Transformer (DiT) must jointly model complex high-frequency signals and low-frequency semantics. Modeling complex high-frequency signals, especially high-frequency noise, can distract the DiT from learning low-frequency semantics.
+ JiT argues that high-dimensional noise may distract the model from learning low-dimensional data, which is also a form of high-frequency interference. Additionally, the intrinsic noise in a clean image (e.g., camera noise) is high-frequency noise that must be modeled as well. DeCo jointly models these high-frequency signals (the Gaussian noise highlighted by JiT, intrinsic camera noise, and high-frequency details) in an end-to-end manner.
+ **Motivation**: **The paper proposes the frequency-DeCoupled (DeCo) framework to separate the modeling of high- and low-frequency components.** A lightweight Pixel Decoder is introduced to model the high-frequency components, thereby freeing the DiT to specialize in modeling low-frequency semantics.

### 💡Method

+ The DiT operates on a downsampled, low-resolution input to generate low-frequency semantic conditions. The Pixel Decoder then takes the full-resolution input and uses the DiT's semantic condition as guidance to predict the velocity. An AdaLN-Zero interaction mechanism modulates the dense features in the Pixel Decoder with the DiT output.
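The AdaLN-Zero interaction can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, shapes, and the zero-initialized projection `proj_w`/`proj_b` are assumptions for exposition. The key property shown is that with zero initialization the modulation starts as an identity mapping.

```python
import numpy as np

def adaln_zero_modulate(pixel_feats, dit_cond, proj_w, proj_b):
    """Modulate Pixel-Decoder features with a DiT semantic condition.

    pixel_feats: (tokens, dim) dense features from the Pixel Decoder.
    dit_cond:    (dim,) low-frequency semantic condition from the DiT.
    proj_w, proj_b: zero-initialized projection producing shift, scale, gate.
    """
    # Project the condition to per-channel shift, scale, and gate vectors
    shift, scale, gate = np.split(dit_cond @ proj_w + proj_b, 3)
    # LayerNorm without a learned affine transform, as in AdaLN
    mu = pixel_feats.mean(axis=-1, keepdims=True)
    sigma = pixel_feats.std(axis=-1, keepdims=True)
    normed = (pixel_feats - mu) / (sigma + 1e-6)
    # Zero-initialized proj_w/proj_b => gate == 0 => identity at start of training
    return pixel_feats + gate * (normed * (1.0 + scale) + shift)

dim = 8
feats = np.random.randn(4, dim)          # dense pixel features (4 tokens)
cond = np.random.randn(dim)              # DiT semantic condition
w0 = np.zeros((dim, 3 * dim))            # "zero" init of the interaction
b0 = np.zeros(3 * dim)
out = adaln_zero_modulate(feats, cond, w0, b0)
```

At zero initialization `out` equals `feats`, so the Pixel Decoder behaves as if unconditioned until training moves the projection away from zero.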
+ The paper also proposes a frequency-aware flow-matching loss. It applies adaptive weights to different frequency components. These weights are derived from the normalized reciprocal of JPEG quantization tables, which assigns higher weights to perceptually more important low-frequency components and suppresses insignificant high-frequency noise.

### 📈Experiments

+ The authors trained the DeCo-XL model with a DiT patch size of 16 on ImageNet 256x256 and 512x512. DeCo-XL achieves a leading FID of **1.62** on ImageNet 256x256 and **2.22** on ImageNet 512x512. With the same 50 Heun steps at 600 epochs, DeCo's FID of 1.69 surpasses JiT's 1.86.
+ To test scaling in text-to-image generation, a DeCo-XXL model was trained on the BLIP3o dataset (36M pretraining images + 60k instruction-tuning samples). It achieves an overall score of **0.86** on GenEval and a competitive average score of 81.4 on DPG-Bench.

![](https://zehong-ma.github.io/DeCo/static/images/imagenet_results.jpg)
![](https://zehong-ma.github.io/DeCo/static/images/appendix_t2i_figures.jpg)

### 📖Citation

```
@misc{ma2025decofrequencydecoupledpixeldiffusion,
  title={DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation},
  author={Zehong Ma and Longhui Wei and Shuai Wang and Shiliang Zhang and Qi Tian},
  year={2025},
  eprint={2511.19365},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.19365},
}
```
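As a closing note, the frequency-aware weighting idea can be sketched in NumPy. This is an illustrative sketch, not the authors' loss: the 8×8 matrix below is the standard JPEG luminance quantization table (ITU-T T.81, Annex K), the DCT basis is the orthonormal DCT-II, and the function names are hypothetical.

```python
import numpy as np

# Standard JPEG luminance quantization table (ITU-T T.81, Annex K)
JPEG_Q = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
], dtype=np.float64)

def dct2_matrix(n=8):
    """Orthonormal DCT-II basis: C @ block @ C.T is the 2-D DCT of a block."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def freq_weighted_mse(pred_block, target_block):
    """MSE on an 8x8 residual, weighted per DCT frequency.

    Weights are the normalized reciprocal of the JPEG table, so frequencies
    with small quantization steps (perceptually important, mostly low
    frequencies) receive larger weights than high-frequency ones.
    """
    w = 1.0 / JPEG_Q
    w /= w.sum()                                  # normalize to sum to 1
    c = dct2_matrix()
    err = c @ (pred_block - target_block) @ c.T   # residual in the DCT domain
    return float(np.sum(w * err ** 2))
```

The DC weight `1/16` is far larger than the highest-frequency weight `1/99`, matching the paper's intent of emphasizing low-frequency components while down-weighting high-frequency noise.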