---
license: apache-2.0
datasets:
- ILSVRC/imagenet-1k
---
# Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe

## News

- __[2025.12.15]__: Released the code for [DSMoE](./DSMoE) and [JiTMoE](./JiTMoE).

## 📖 Introduction

We release a MoE Transformer that can be applied to both latent and pixel-space diffusion frameworks, employing DeepSeek-style expert modules, alternative intermediate widths, varying expert counts, and enhanced attention positional encodings. The models have already been released on Hugging Face.
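
For readers unfamiliar with this style of expert layer, the sketch below illustrates the general idea behind a DeepSeek-style MoE feed-forward block: a small shared expert that processes every token, plus a pool of narrow routed experts selected per token by a top-k router. This is a minimal illustration only; the module names, widths (`dim`, `expert_hidden`), expert count, and `top_k` value are assumptions for the example and do not reflect the exact released implementation.

```python
# Minimal sketch of a DeepSeek-style MoE feed-forward block.
# Illustrative only: names, sizes, and top-k are assumptions, not this repo's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A narrow two-layer MLP; many of these replace one wide dense FFN."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.gelu(self.fc1(x)))


class MoEFeedForward(nn.Module):
    """Shared expert (always on) + top-k routed experts, chosen per token."""

    def __init__(self, dim: int = 768, expert_hidden: int = 384,
                 num_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.shared = Expert(dim, expert_hidden)
        self.experts = nn.ModuleList(Expert(dim, expert_hidden) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, tokens, dim)
        b, t, d = x.shape
        tokens = x.reshape(-1, d)                          # route every token independently
        probs = self.router(tokens).softmax(dim=-1)
        weights, expert_ids = probs.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = self.shared(tokens)                          # shared expert sees all tokens
        routed = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):          # simple loop dispatch (readable, not fast)
            slot_mask = expert_ids == e                    # (N, top_k) slots assigned to expert e
            token_idx = slot_mask.any(dim=-1).nonzero(as_tuple=True)[0]
            if token_idx.numel() == 0:
                continue
            w = (weights * slot_mask)[token_idx].sum(dim=-1, keepdim=True)
            routed.index_add_(0, token_idx, w * expert(tokens[token_idx]))
        return (out + routed).reshape(b, t, d)
```

Because only the top-k experts run per token, the expert count can vary (e.g., the E16 vs. E48 variants in the tables below) without a proportional increase in activated parameters.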

## Getting started

### Dataset

Download the [ImageNet](http://image-net.org/download) dataset and place it in your `IMAGENET_PATH`.
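
Preprocessing is handled by the respective codebases. As a quick sanity check that `IMAGENET_PATH` points at the expected layout, you can load it with `torchvision`; the `train/<class>/<image>` convention assumed below is the common one and may differ from what DSMoE or JiTMoE expect.

```python
# Quick sanity check of the ImageNet layout (assumes the common
# IMAGENET_PATH/train/<class>/<image>.JPEG convention; adjust to the codebase you use).
import os

from torchvision.datasets import ImageFolder

imagenet_path = os.environ.get("IMAGENET_PATH", "/path/to/imagenet")
train_set = ImageFolder(os.path.join(imagenet_path, "train"))
print(f"{len(train_set.classes)} classes, {len(train_set)} training images")  # expect 1000 classes
```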

### Installation

Please follow the installation instructions of [DiffMoE](https://github.com/KlingTeam/DiffMoE) and [JiT](https://github.com/LTH14/JiT), respectively.

### Training

See the [DSMoE](./DSMoE) and [JiTMoE](./JiTMoE) subdirectories for the latent and pixel-space training details, respectively.

### Evaluation

We follow the evaluation protocols provided by [DiffMoE](https://github.com/KlingTeam/DiffMoE) and [JiT](https://github.com/LTH14/JiT).
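
The FID-50K and Inception Score numbers below come from those official pipelines. Purely as an illustration of what the metric computation looks like, here is a rough sketch using the third-party `pytorch-fid` package on a folder of generated samples; it is not the evaluation protocol of DiffMoE or JiT, and the paths and batch size are placeholders.

```python
# Rough FID illustration with the third-party pytorch-fid package
# (pip install pytorch-fid). NOT the official DiffMoE/JiT pipeline; paths are placeholders.
import torch
from pytorch_fid.fid_score import calculate_fid_given_paths

fid = calculate_fid_given_paths(
    ["/path/to/reference_images", "/path/to/50k_generated_samples"],
    batch_size=50,
    device="cuda" if torch.cuda.is_available() else "cpu",
    dims=2048,  # InceptionV3 pool3 features, the standard setting
)
print(f"FID: {fid:.2f}")
```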

## Main results

### Latent diffusion framework

- Our DSMoE vs. [DiffMoE](https://arxiv.org/pdf/2503.14487) at 700K training steps with CFG = 1.0 (* denotes results reported in the official paper):

| Model Name | # Act. Params | FID-50K↓ | Inception Score↑ |
|----------------------------|-------------------------|---------|----------------|
|DiffMoE-B-E16|130M|4.87|183.43|
|DSMoE-B-E16|132M|4.50|186.79|
|DSMoE-B-E48|118M|4.27|191.03|
|DiffMoE-L-E16|458M|2.84|256.57|
|DSMoE-L-E16|465M|2.59|272.55|
|DSMoE-L-E48|436M|2.55|278.35|
|DSMoE-3B-E16|965M|2.38|304.93|

### Pixel-space diffusion framework

- Our JiTMoE vs. [JiT](https://arxiv.org/pdf/2511.13720) at 200 training epochs with CFG interval (* denotes results reported in the official paper):

| Model Name | # Act. Params | FID-50K↓ | Inception Score↑ |
|----------------------------|-------------------------|---------|----------------|
|JiT-B/16|131M|4.81 (4.37*)| 222.32 (-)|
|JiTMoE-B/16-E16|133M|4.23| 245.53|
|JiT-L/16|459M| 3.19 (2.79*)| 309.72 (-)|
|JiTMoE-L/16-E16|465M|3.10| 311.34|
## 🌟 Citation
```
@article{liu2025efficient,
  title={Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe},
  author={Liu, Yahui and Yue, Yang and Zhang, Jingyuan and Sun, Chenxi and Zhou, Yang and Zeng, Wencong and Tang, Ruiming and Zhou, Guorui},
  journal={arXiv preprint arXiv:2512.01252},
  year={2025}
}
```