zehongma/DeCo


Improve model card: Add pipeline tag, abstract summary, and usage info

by nielsr HF Staff - opened Nov 25, 2025

base: refs/heads/main ← from: refs/pr/1


README.md +92 -6


@@ -1,8 +1,11 @@
 ---
 license: apache-2.0
 ---
-## DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
 Arxiv: https://arxiv.org/abs/2511.19365
 Project Page: https://zehong-ma.github.io/DeCo
@@ -11,29 +14,112 @@ Code Repository: https://github.com/Zehong-Ma/DeCo
 Huggingface Space: https://14467288703cf06a3c.gradio.live/
 ## 🖼️ Background
 + Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This avoids the two-stage training and inevitable low-level artifacts of VAE.
-+ Current pixel diffusion models suffer from slow training  since a single Diffusion Transformer (DiT) is required to jointly model complex high-frequency signals and low-frequency semantics. Modeling complex high-frequency signals, especially high-frequency noise, can distract the DiT from learning low-frequency semantics.
-+ JiT proposes that high-dimensional noise may distract the model from learning low-dimensional data, which is also a form of high-frequency interference. Additionaly, the intrinsic noise (e.g., camera noise) in the clean image is also high-frequency noise that requires modeling. Our DeCO can jointly models these high-frequency signals (gaussian noise in JiT,  intrinsic camera noise, high-frequency details) in an end-to-end manner.
 + **Motivation**: **The paper proposes the frequency-DeCoupled (DeCo) framework to separate the modeling of high- and low-frequency components.** A lightweight Pixel Decoder is introduced to model the high-frequency components, thereby freeing the DiT to specialize in modeling low-frequency semantics.
 ### 💡Method
 + The DiT operates on a downsampled, low-resolution input to generate low-frequency semantic conditions. The Pixel Decoder then takes the full-resolution input and uses the DiT's semantic condition as guidance to predict the velocity. An AdaLN-Zero interaction mechanism modulates the dense features in the Pixel Decoder with the DiT output.
-+ The paper also propose a frequency-aware flow-matching loss。It applies adaptive weights for different frequency components. These weights are derived from normalized reciprocal of JPEG quantization tables , which assign higher weights to perceptually more important low-frequency components and suppress insignificant high-frequency noise.
 ### 📈Experiments
 + The authors trained the DeCo-XL model with a DiT patch size of 16 on the ImageNet 256x256 and 512x512. DeCo-XL achieves a leading FID of **1.62** on ImageNet 256x256 and **2.22** on ImageNet 512x512. With the same 50 Heun steps at 600 epochs, DeCo's FID of 1.69 is superior to JiT's FID of 1.86.
-+ For scaling ability in text-to-image generation, a DeCo-XXL model was trained on the BLIP3o dataset (36M pretraining images + 60k instruction-tuning data). It achieves an overall score of **0.86** on GenEval and a competitive average score of 81.4 on DPG-Bench.
 ![](https://zehong-ma.github.io/DeCo/static/images/imagenet_results.jpg)
 ![](https://zehong-ma.github.io/DeCo/static/images/appendix_t2i_figures.jpg)
-### 📖Citation
 ```
 @misc{ma2025decofrequencydecoupledpixeldiffusion,
       title={DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation},

 ---
 license: apache-2.0
+pipeline_tag: text-to-image
 ---
+# DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
+Paper: https://huggingface.co/papers/2511.19365
 Arxiv: https://arxiv.org/abs/2511.19365
 Project Page: https://zehong-ma.github.io/DeCo
 Huggingface Space: https://14467288703cf06a3c.gradio.live/
+## Abstract Summary
+DeCo introduces a frequency-decoupled pixel diffusion framework for end-to-end image generation. It addresses slow training and inference in existing pixel diffusion models by using a lightweight pixel decoder for high-frequency details, allowing the Diffusion Transformer (DiT) to focus on low-frequency semantics. Combined with a frequency-aware flow-matching loss, DeCo achieves state-of-the-art performance, with FID scores of 1.62 (256x256) and 2.22 (512x512) on ImageNet, and a leading overall score of 0.86 on GenEval for text-to-image generation.
 ## 🖼️ Background
 + Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This avoids the two-stage training and inevitable low-level artifacts of VAE.
++ Current pixel diffusion models suffer from slow training since a single Diffusion Transformer (DiT) is required to jointly model complex high-frequency signals and low-frequency semantics. Modeling complex high-frequency signals, especially high-frequency noise, can distract the DiT from learning low-frequency semantics.
++ JiT proposes that high-dimensional noise may distract the model from learning low-dimensional data, which is also a form of high-frequency interference. Additionally, the intrinsic noise (e.g., camera noise) in a clean image is high-frequency noise that also requires modeling. DeCo jointly models these high-frequency signals (Gaussian noise in JiT, intrinsic camera noise, high-frequency details) in an end-to-end manner.
 + **Motivation**: **The paper proposes the frequency-DeCoupled (DeCo) framework to separate the modeling of high- and low-frequency components.** A lightweight Pixel Decoder is introduced to model the high-frequency components, thereby freeing the DiT to specialize in modeling low-frequency semantics.
 ### 💡Method
 + The DiT operates on a downsampled, low-resolution input to generate low-frequency semantic conditions. The Pixel Decoder then takes the full-resolution input and uses the DiT's semantic condition as guidance to predict the velocity. An AdaLN-Zero interaction mechanism modulates the dense features in the Pixel Decoder with the DiT output.
++ The paper also proposes a frequency-aware flow-matching loss that applies adaptive weights to different frequency components. These weights are derived from the normalized reciprocals of JPEG quantization tables, which assign higher weights to perceptually more important low-frequency components and suppress insignificant high-frequency noise.
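
To make the frequency weighting concrete, here is a minimal sketch of how per-frequency weights could be derived from a JPEG quantization table. It is an illustration under stated assumptions, not the repository's implementation: it assumes the standard JPEG luminance table (ITU-T T.81, Annex K), assumes the reciprocals are rescaled to have mean 1, and the names `frequency_weights` and `weighted_sq_error` are hypothetical. The README does not state which table, quality factor, or normalization DeCo actually uses.

```python
# Standard JPEG luminance quantization table (ITU-T T.81, Annex K).
# Assumption: the README does not say which table or quality factor DeCo uses.
JPEG_LUMA_Q = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def frequency_weights(q_table):
    """Return reciprocal quantization steps, rescaled to have mean 1.

    Small quantization steps (low frequencies) yield large weights.
    The mean-1 rescaling is an assumed normalization.
    """
    recip = [[1.0 / q for q in row] for row in q_table]
    count = sum(len(row) for row in recip)
    mean = sum(sum(row) for row in recip) / count
    return [[r / mean for r in row] for row in recip]


def weighted_sq_error(pred_coeffs, target_coeffs, weights):
    """Mean weighted squared error between two 8x8 blocks of DCT coefficients."""
    total = 0.0
    for u in range(8):
        for v in range(8):
            diff = pred_coeffs[u][v] - target_coeffs[u][v]
            total += weights[u][v] * diff * diff
    return total / 64.0


WEIGHTS = frequency_weights(JPEG_LUMA_Q)
```

With these weights, the same coefficient error costs more at a low-frequency position such as `(0, 0)` than at the highest-frequency position `(7, 7)`, which matches the behavior the loss description calls for.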
 ### 📈Experiments
 + The authors trained the DeCo-XL model with a DiT patch size of 16 on the ImageNet 256x256 and 512x512. DeCo-XL achieves a leading FID of **1.62** on ImageNet 256x256 and **2.22** on ImageNet 512x512. With the same 50 Heun steps at 600 epochs, DeCo's FID of 1.69 is superior to JiT's FID of 1.86.
++ For scaling ability in text-to-image generation, a DeCo-XXL model was trained on the BLIP3o dataset (36M pretraining images + 60k instruction-tuning data). It achieves an overall score of **0.86** on GenEval in system-level comparison.
 ![](https://zehong-ma.github.io/DeCo/static/images/imagenet_results.jpg)
 ![](https://zehong-ma.github.io/DeCo/static/images/appendix_t2i_figures.jpg)
+## DCT Spectral Analysis
+<div class="content">
+            <img src="./docs/static/images/dct_and_FID_comparison.jpg"  style="width: 95%;"><br>
+            <span style="font-size: 0.8em; width: 100%; display: inline-block;">(a) DCT energy distribution of DiT outputs and predicted pixel velocities. Compared with the baseline, DeCo suppresses high-frequency signals in DiT outputs while preserving strong high-frequency energy in the pixel velocity, confirming effective frequency decoupling. The distribution is computed on 10K images across all diffusion steps using a DCT with 8x8 block size. (b) FID comparison between DeCo and the baseline. DeCo reaches 2.57 FID in 400k iterations, 10× faster than the baseline.
+            </span>
+          </div>
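
The caption describes a per-frequency energy distribution computed with an 8x8 block DCT. The sketch below shows one way such a statistic could be computed for a single 2D array, using only the Python standard library. It is a simplified illustration: the function names are hypothetical, an orthonormal DCT-II is assumed, and averaging over 10K images and all diffusion steps, as done for the figure, is left out.

```python
import math

N = 8  # block size, matching the caption


def dct_matrix(n=N):
    """Orthonormal DCT-II basis: row k is the k-th cosine basis vector."""
    mat = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        mat.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i in range(n)])
    return mat


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


D = dct_matrix()
D_T = [list(col) for col in zip(*D)]


def dct2(block):
    """2D DCT-II of an 8x8 block, computed as D @ block @ D^T."""
    return matmul(matmul(D, block), D_T)


def block_dct_energy(image):
    """Mean squared DCT coefficient per frequency over all full 8x8 blocks.

    `image` is a 2D list of floats (one channel). Entry [u][v] of the
    result is the average energy at vertical frequency u, horizontal v.
    """
    h, w = len(image), len(image[0])
    energy = [[0.0] * N for _ in range(N)]
    count = 0
    for top in range(0, h - N + 1, N):
        for left in range(0, w - N + 1, N):
            block = [row[left:left + N] for row in image[top:top + N]]
            coeffs = dct2(block)
            for u in range(N):
                for v in range(N):
                    energy[u][v] += coeffs[u][v] ** 2
            count += 1
    if count == 0:
        raise ValueError("image must contain at least one full 8x8 block")
    return [[e / count for e in row] for row in energy]
```

A constant image puts all of its energy in the `[0][0]` (DC) entry, while a pixel-level checkerboard puts most of it at `[7][7]`. Comparing such maps for DiT outputs and for predicted pixel velocities is the kind of analysis the figure summarizes.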
+## 🧩 Visualizations
++ Visualization of more images generated by our text-to-image DeCo.
+<div class="content">
+            <img src="./docs/static/images/appendix_t2i_figures.jpg" style="width: 100%;"><br>
+</div>
++ Visualization of 256x256 images generated by our class-to-image DeCo.
+<div class="content">
+            <img src="./docs/static/images/c2i_imagenet256.jpg" style="width: 100%;"><br>
+</div>
+## 🎉 Checkpoints
+| Dataset       | Epoch | Model         | Params | FID   | HuggingFace                           |
+|---------------|-------|---------------|--------|-------|---------------------------------------|
+| ImageNet256    | 320   |   DeCo-XL/16 | 682M   | 1.90  | [🤗](https://huggingface.co/zehongma/DeCo/blob/main/imagenet256_epoch320.ckpt) |
+| ImageNet256    | 600   |   DeCo-XL/16 | 682M   | 1.78  | [🤗](https://huggingface.co/zehongma/DeCo/blob/main/imagenet256_epoch600.ckpt) |
+| ImageNet256    | 800   |   DeCo-XL/16 | 682M   | 1.62  | [🤗](https://huggingface.co/zehongma/DeCo/blob/main/imagenet256_epoch800.ckpt) |
+| ImageNet512 |  340  | DeCo-XL/16 | 682M   | 2.22  | [🤗](https://huggingface.co/zehongma/DeCo/blob/main/imagenet512_epoch340.ckpt) |
+| Dataset       | Model         | Params | GenEval | DPG  | HuggingFace                                              |
+|---------------|---------------|--------|------|------|----------------------------------------------------------|
+| Text-to-Image | DeCo-XXL/16| 1.1B | 0.86 | 81.4| [🤗](https://huggingface.co/zehongma/DeCo/blob/main/t2i_DeCo.ckpt) |
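
The links in the tables point to `blob` pages, which open the file viewer rather than the file itself. As a convenience, the sketch below builds the corresponding direct-download (`resolve`) URLs with the Python standard library. The helper names `download_url` and `fetch` and the `./ckpts` destination are illustrative assumptions, not part of the repository; the filenames are copied from the table links.

```python
import os
import urllib.request

REPO_ID = "zehongma/DeCo"

# Filenames copied from the links in the checkpoint tables.
CHECKPOINT_FILES = [
    "imagenet256_epoch320.ckpt",
    "imagenet256_epoch600.ckpt",
    "imagenet256_epoch800.ckpt",
    "imagenet512_epoch340.ckpt",
    "t2i_DeCo.ckpt",
]


def download_url(filename, repo_id=REPO_ID, revision="main"):
    """Build the direct-download URL (the `resolve` counterpart of a `blob` link)."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def fetch(filename, dest_dir="./ckpts"):
    """Download one checkpoint into dest_dir and return the local path.

    Hypothetical helper. The files are large, so nothing is downloaded
    until this function is called explicitly.
    """
    if filename not in CHECKPOINT_FILES:
        raise ValueError(f"unknown checkpoint: {filename}")
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, filename)
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(download_url(filename), local_path)
    return local_path
```

For example, `fetch("t2i_DeCo.ckpt")` would place the text-to-image checkpoint at `./ckpts/t2i_DeCo.ckpt`, the path used by the demo command in this README.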
+## 🔥 Online Demos
+![](./docs/static/images/demo.jpg)
+We provide an online demo for DeCo-XXL/16 (text-to-image) on HuggingFace Spaces.
+HF Spaces: [https://14467288703cf06a3c.gradio.live](https://14467288703cf06a3c.gradio.live)
+To host the Gradio demo locally, run the following command:
+```bash
+# for text-to-image applications
+python app.py --config configs_t2i/inference_heavydecoder.yaml --ckpt_path=./ckpts/t2i_DeCo.ckpt
+```
+## 🤖 Usage
+In class-to-image (ImageNet) experiments, we use the [ADM evaluation suite](https://github.com/openai/guided-diffusion/tree/main/evaluations) to report FID.
+In text-to-image experiments, we train on the BLIP3o dataset and report GenEval and DPG-Bench metrics.
++ Environments
+```bash
+# for installation (Python 3.10 recommended)
+pip install -r requirements.txt
+```
++ Inference
+```bash
+# for inference
+python main.py predict -c ./configs_c2i/DeCo_XL.yaml --ckpt_path=XXX.ckpt
+```
++ Train
+```bash
+# for c2i training
+# Please modify the ImageNet1k path in the config file before training.
+python main.py fit -c ./configs_c2i/DeCo_XL.yaml
+# for continued pretraining at 512x512
+python main.py fit -c ./configs_c2i/DeCo_XL_512.yaml --ckpt_path=/path/to/256/checkpoint/at/320/epochs
+```
+```bash
+# for t2i training
+python main.py fit -c ./configs_t2i/pretraining_res256.yaml
+python main.py fit -c ./configs_t2i/pretraining_res512.yaml --ckpt_path=./ckpts/pretrain256.ckpt
+python main.py fit -c ./configs_t2i/sft_res512.yaml --ckpt_path=./ckpts/pretrain512.ckpt
+```
+## 💐 Acknowledgement
+This repository builds on [PixNerd](https://github.com/MCG-NJU/PixNerd) and [DDT](https://github.com/MCG-NJU/DDT). Thanks for their contributions, and thanks to [Shuai Wang](https://github.com/WANGSSSSSSS) for the support!
+### 📖 Citation
 ```
 @misc{ma2025decofrequencydecoupledpixeldiffusion,
       title={DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation},