---
license: apache-2.0
datasets:
- ILSVRC/imagenet-1k
---
# Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe

## News

- __[2025.12.15]__: Released the code for [DSMoE](./DSMoE) and [JiTMoE](./JiTMoE).

## 📖 Introduction

We release a MoE Transformer that can be applied to both latent and pixel-space diffusion frameworks, employing DeepSeek-style expert modules, alternative intermediate widths, varying expert counts, and enhanced attention positional encodings. The models have already been released on Hugging Face.
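
For readers unfamiliar with this style of expert layer, the sketch below illustrates the general idea behind a DeepSeek-style MoE feed-forward block: a small shared expert that processes every token, plus a pool of narrow routed experts selected per token by a top-k router. This is a minimal illustration only; the module names, widths (`dim`, `expert_hidden`), expert count, and `top_k` value are assumptions for the example and do not reflect the exact released implementation.

```python
# Minimal sketch of a DeepSeek-style MoE feed-forward block.
# Illustrative only: names, sizes, and top-k are assumptions, not this repo's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A narrow two-layer MLP; many of these replace one wide dense FFN."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.gelu(self.fc1(x)))


class MoEFeedForward(nn.Module):
    """Shared expert (always on) + top-k routed experts, chosen per token."""

    def __init__(self, dim: int = 768, expert_hidden: int = 384,
                 num_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.shared = Expert(dim, expert_hidden)
        self.experts = nn.ModuleList(Expert(dim, expert_hidden) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, tokens, dim)
        b, t, d = x.shape
        tokens = x.reshape(-1, d)                          # route every token independently
        probs = self.router(tokens).softmax(dim=-1)
        weights, expert_ids = probs.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = self.shared(tokens)                          # shared expert sees all tokens
        routed = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):          # simple loop dispatch (readable, not fast)
            slot_mask = expert_ids == e                    # (N, top_k) slots assigned to expert e
            token_idx = slot_mask.any(dim=-1).nonzero(as_tuple=True)[0]
            if token_idx.numel() == 0:
                continue
            w = (weights * slot_mask)[token_idx].sum(dim=-1, keepdim=True)
            routed.index_add_(0, token_idx, w * expert(tokens[token_idx]))
        return (out + routed).reshape(b, t, d)
```

Because only the top-k experts run per token, the expert count can vary (e.g., the E16 vs. E48 variants in the tables below) without a proportional increase in activated parameters.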

## Getting started

### Dataset

Download the [ImageNet](http://image-net.org/download) dataset and place it in your `IMAGENET_PATH`.
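
Preprocessing is handled by the respective codebases. As a quick sanity check that `IMAGENET_PATH` points at the expected layout, you can load it with `torchvision`; the `train/<class>/<image>` convention assumed below is the common one and may differ from what DSMoE or JiTMoE expect.

```python
# Quick sanity check of the ImageNet layout (assumes the common
# IMAGENET_PATH/train/<class>/<image>.JPEG convention; adjust to the codebase you use).
import os

from torchvision.datasets import ImageFolder

imagenet_path = os.environ.get("IMAGENET_PATH", "/path/to/imagenet")
train_set = ImageFolder(os.path.join(imagenet_path, "train"))
print(f"{len(train_set.classes)} classes, {len(train_set)} training images")  # expect 1000 classes
```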

### Installation

Please follow the installation instructions of [DiffMoE](https://github.com/KlingTeam/DiffMoE) and [JiT](https://github.com/LTH14/JiT), respectively.

### Training

See the [DSMoE](./DSMoE) and [JiTMoE](./JiTMoE) subdirectories for the latent and pixel-space training details, respectively.

### Evaluation

We follow the evaluation protocols provided by [DiffMoE](https://github.com/KlingTeam/DiffMoE) and [JiT](https://github.com/LTH14/JiT).
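
The FID-50K and Inception Score numbers below come from those official pipelines. Purely as an illustration of what the metric computation looks like, here is a rough sketch using the third-party `pytorch-fid` package on a folder of generated samples; it is not the evaluation protocol of DiffMoE or JiT, and the paths and batch size are placeholders.

```python
# Rough FID illustration with the third-party pytorch-fid package
# (pip install pytorch-fid). NOT the official DiffMoE/JiT pipeline; paths are placeholders.
import torch
from pytorch_fid.fid_score import calculate_fid_given_paths

fid = calculate_fid_given_paths(
    ["/path/to/reference_images", "/path/to/50k_generated_samples"],
    batch_size=50,
    device="cuda" if torch.cuda.is_available() else "cpu",
    dims=2048,  # InceptionV3 pool3 features, the standard setting
)
print(f"FID: {fid:.2f}")
```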

## Main results

### Latent diffusion framework

- Our DSMoE vs. [DiffMoE](https://arxiv.org/pdf/2503.14487) at 700K training steps with CFG = 1.0 (* denotes results reported in the official paper):

| Model Name | # Act. Params | FID-50K↓ | Inception Score↑ |
|----------------------------|-------------------------|---------|----------------|
|DiffMoE-B-E16|130M|4.87|183.43|
|DSMoE-B-E16|132M|4.50|186.79|
|DSMoE-B-E48|118M|4.27|191.03|
|DiffMoE-L-E16|458M|2.84|256.57|
|DSMoE-L-E16|465M|2.59|272.55|
|DSMoE-L-E48|436M|2.55|278.35|
|DSMoE-3B-E16|965M|2.38|304.93|

### Pixel-space diffusion framework

- Our JiTMoE vs. [JiT](https://arxiv.org/pdf/2511.13720) at 200 training epochs with CFG interval (* denotes results reported in the official paper):

| Model Name | # Act. Params | FID-50K↓ | Inception Score↑ |
|----------------------------|-------------------------|---------|----------------|
|JiT-B/16|131M|4.81 (4.37*)| 222.32 (-)|
|JiTMoE-B/16-E16|133M|4.23| 245.53|
|JiT-L/16|459M| 3.19 (2.79*)| 309.72 (-)|
|JiTMoE-L/16-E16|465M|3.10| 311.34|
## 🌟 Citation
```
@article{liu2025efficient,
  title={Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe},
  author={Liu, Yahui and Yue, Yang and Zhang, Jingyuan and Sun, Chenxi and Zhou, Yang and Zeng, Wencong and Tang, Ruiming and Zhou, Guorui},
  journal={arXiv preprint arXiv:2512.01252},
  year={2025}
}
```