---
license: apache-2.0
---
## DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

Arxiv: https://arxiv.org/abs/2511.19365

Project Page: https://zehong-ma.github.io/DeCo

Code Repository: https://github.com/Zehong-Ma/DeCo

Huggingface Space: https://14467288703cf06a3c.gradio.live/


## 🖼️ Background

+ Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This avoids the two-stage training and the inevitable low-level artifacts of a VAE.
+ Current pixel diffusion models suffer from slow training, since a single Diffusion Transformer (DiT) is required to jointly model complex high-frequency signals and low-frequency semantics. Modeling complex high-frequency signals, especially high-frequency noise, can distract the DiT from learning low-frequency semantics.
+ JiT proposes that high-dimensional noise may distract the model from learning low-dimensional data, which is also a form of high-frequency interference. Additionally, the intrinsic noise (e.g., camera noise) in the clean image is also high-frequency noise that requires modeling. DeCo can jointly model these high-frequency signals (Gaussian noise in JiT, intrinsic camera noise, high-frequency details) in an end-to-end manner.
+ **Motivation**: **The paper proposes the frequency-DeCoupled (DeCo) framework to separate the modeling of high- and low-frequency components.** A lightweight Pixel Decoder is introduced to model the high-frequency components, thereby freeing the DiT to specialize in modeling low-frequency semantics.

### 💡Method

+ The DiT operates on a downsampled, low-resolution input to generate low-frequency semantic conditions. The Pixel Decoder then takes the full-resolution input and uses the DiT's semantic condition as guidance to predict the velocity. The AdaLN-Zero interaction mechanism is used to modulate the dense features in the Pixel Decoder with the DiT output.
+ The paper also proposes a frequency-aware flow-matching loss. It applies adaptive weights to different frequency components. These weights are derived from the normalized reciprocals of JPEG quantization tables, which assign higher weights to perceptually more important low-frequency components and suppress insignificant high-frequency noise.
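
The AdaLN-Zero interaction described above can be illustrated with a minimal sketch. This is not the DeCo implementation: it is plain Python for a single feature vector, and the names `adaln_zero_block`, `w_mod`, `b_mod`, and `inner` are hypothetical. It assumes the usual AdaLN-Zero pattern, in which a condition vector is projected to a shift, a scale, and a gate, and the projection starts at zero so that the block is initially an identity mapping.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weight, bias, c):
    """Dense layer: one output per weight row."""
    return [sum(w * ci for w, ci in zip(row, c)) + b
            for row, b in zip(weight, bias)]

def adaln_zero_block(x, cond, w_mod, b_mod, inner):
    """Modulate features x with condition cond (hypothetical helper).

    w_mod has 3 * len(x) rows and yields shift, scale and gate.
    inner stands in for the block's own layers (e.g. an MLP).
    """
    d = len(x)
    params = linear(w_mod, b_mod, cond)
    shift, scale, gate = params[:d], params[d:2 * d], params[2 * d:]
    h = [n * (1.0 + s) + sh for n, s, sh in zip(layer_norm(x), scale, shift)]
    h = inner(h)
    # Gated residual: a zero gate leaves x untouched.
    return [xi + g * hi for xi, g, hi in zip(x, gate, h)]

# Assumed toy sizes: 4 feature channels, 2 condition channels.
x = [0.5, -1.0, 2.0, 0.0]     # stands in for a Pixel Decoder feature
cond = [1.0, -0.5]            # stands in for the DiT's semantic condition
zero_w = [[0.0, 0.0] for _ in range(12)]
zero_b = [0.0] * 12
double = lambda h: [2.0 * v for v in h]

out_at_init = adaln_zero_block(x, cond, zero_w, zero_b, double)
```

With the zero-initialized projection, `out_at_init` equals `x`; the condition only starts to influence the features once training moves the gate away from zero.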

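As a rough illustration of the frequency-aware weighting, the sketch below derives per-frequency weights from the standard JPEG luminance quantization table (Annex K of the JPEG specification). Several details are assumptions rather than the paper's settings: the choice of this particular table, the normalization to a mean weight of 1, and the helper names `frequency_weights` and `weighted_sq_error`. The transform of the predicted and target velocities into 8x8 frequency coefficients is omitted.

```python
# Standard JPEG luminance quantization table (JPEG spec, Annex K).
JPEG_LUMA_Q = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

def frequency_weights(q_table):
    """Reciprocals of the table entries, rescaled so the mean weight is 1.

    The mean-1 normalization is an assumption made for this sketch.
    """
    recip = [[1.0 / q for q in row] for row in q_table]
    count = sum(len(row) for row in recip)
    mean = sum(sum(row) for row in recip) / count
    return [[r / mean for r in row] for row in recip]

def weighted_sq_error(pred, target, weights):
    """Weighted mean squared error over one block of frequency coefficients."""
    total = 0.0
    count = 0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            total += w * (p - t) ** 2
            count += 1
    return total / count

WEIGHTS = frequency_weights(JPEG_LUMA_Q)

# The same unit error costs more at a low frequency than at a high one.
zeros = [[0.0] * 8 for _ in range(8)]
low_err = [row[:] for row in zeros]
low_err[0][0] = 1.0
high_err = [row[:] for row in zeros]
high_err[7][7] = 1.0
low_cost = weighted_sq_error(low_err, zeros, WEIGHTS)
high_cost = weighted_sq_error(high_err, zeros, WEIGHTS)
```

Because small quantization steps mark coefficients that JPEG preserves most carefully, their reciprocals give the largest weights, so `low_cost` exceeds `high_cost` here.
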
### 📈Experiments

+ The authors trained the DeCo-XL model with a DiT patch size of 16 on ImageNet 256x256 and 512x512. DeCo-XL achieves a leading FID of **1.62** on ImageNet 256x256 and **2.22** on ImageNet 512x512. With the same 50 Heun steps at 600 epochs, DeCo's FID of 1.69 is superior to JiT's FID of 1.86.
+ For scaling ability in text-to-image generation, a DeCo-XXL model was trained on the BLIP3o dataset (36M pretraining images + 60k instruction-tuning data). It achieves an overall score of **0.86** on GenEval and a competitive average score of 81.4 on DPG-Bench.

![](https://zehong-ma.github.io/DeCo/static/images/imagenet_results.jpg)

![](https://zehong-ma.github.io/DeCo/static/images/appendix_t2i_figures.jpg)

### 📖Citation
```
@misc{ma2025decofrequencydecoupledpixeldiffusion,
      title={DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation}, 
      author={Zehong Ma and Longhui Wei and Shuai Wang and Shiliang Zhang and Qi Tian},
      year={2025},
      eprint={2511.19365},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.19365}, 
}
```