---
license: apache-2.0
---

<h1 align="center">
  <img alt="Drop-Upcycling" src="images/drop-upcycling.png"><br>
  <b>Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization</b><br>
</h1>

<p align="center">
  📄 <a href="https://openreview.net/forum?id=gx1wHnf5Vp">[Paper]</a> |
  🤗 <a href="https://huggingface.co/collections/llm-jp/drop-upcycling-674dc5be7bbb45e12a476b80">[Hugging Face]</a> |
  📚 <a href="https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3">[Dataset]</a> |
  💻 <a href="https://github.com/Taishi-N324/Drop-Upcycling">[Code]</a> |
  📊 <a href="https://wandb.ai/taishi-nakamura/Drop-Upcycling">[Log]</a>
</p>
|
# Model Index
|
We provide model checkpoints for all experiments to ensure reproducibility of the results presented in Tables 1 and 2.
|
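The checkpoints can be loaded like any other Hugging Face model. The snippet below is a minimal sketch, not an official usage guide: the repo id `llm-jp/Dense-152M` is taken from Table 1 (any other id from the tables should work the same way), and the dtype and generation settings are illustrative assumptions; check each model page for the recommended usage.

```python
# Minimal sketch: load one of the checkpoints listed in the tables below with
# Hugging Face transformers. The repo id comes from Table 1; dtype and
# generation settings here are assumptions, not prescribed by the authors.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "llm-jp/Dense-152M"  # swap in any repo id from the tables

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Drop-Upcycling is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
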
## Table 1
|
| Model | Link |
|---|---|
| 1 Dense 152M | [Link](https://huggingface.co/llm-jp/Dense-152M) |
| 2 MoE FS 8x152M | [Link](https://huggingface.co/llm-jp/FS-8x152M) |
| 3 MoE BTX 8x152M | [Link](https://huggingface.co/llm-jp/BTX-8x152M) |
| 4 MoE NU 8x152M | [Link](https://huggingface.co/llm-jp/NU-8x152M) |
| 5 MoE RNU (r=0.5) 8x152M | [Link](https://huggingface.co/llm-jp/RNU-0.5-8x152M) |
| 6 MoE DU (r=0.5) 8x152M | [Link](https://huggingface.co/llm-jp/DU-0.5-8x152M) |
| 7 MoE DU (r=1.0) 8x152M | [Link](https://huggingface.co/llm-jp/DU-1.0-8x152M) |
| 8 Dense 1.5B | [Link](https://huggingface.co/llm-jp/Dense-1.5B) |
| 9 MoE FS 8x1.5B | [Link](https://huggingface.co/llm-jp/FS-8x1.5B) |
| 10 MoE BTX 8x1.5B | [Link](https://huggingface.co/llm-jp/BTX-8x1.5B) |
| 11 MoE NU 8x1.5B | [Link](https://huggingface.co/llm-jp/NU-8x1.5B) |
| 12 MoE RNU (r=0.5) 8x1.5B | [Link](https://huggingface.co/llm-jp/RNU-0.5-8x1.5B) |
| 13 MoE DU (r=0.5) 8x1.5B | [Link](https://huggingface.co/llm-jp/DU-0.5-8x1.5B) |
| 14 MoE DU (r=1.0) 8x1.5B | [Link](https://huggingface.co/llm-jp/DU-1.0-8x1.5B) |
|
## Table 2
|
| Model | Link |
|---|---|
| 1 Dense 3.7B | [Link](https://huggingface.co/llm-jp/Dense-3.7B) |
| 2 MoE FS 8x3.7B | [Link](https://huggingface.co/llm-jp/FS-8x3.7B) |
| 3 MoE DU (r=0.5) 8x3.7B | [Link](https://huggingface.co/llm-jp/DU-0.5-8x3.7B) |
| 4 Dense 13B | [Link](https://huggingface.co/llm-jp/Dense-13B) |
| 5 Dense 3.7B (llm-jp-3-3.7b) | [Link](https://huggingface.co/llm-jp/llm-jp-3-3.7b) |
|
## BTX Experts
|
| Model | Link |
|---|---|
| Japanese expert 152M | [Link](https://huggingface.co/llm-jp/Dense-btx-japanese-expert-152M) |
| English expert 152M | [Link](https://huggingface.co/llm-jp/Dense-btx-english-expert-152M) |
| Code expert 152M | [Link](https://huggingface.co/llm-jp/Dense-btx-code-expert-152M) |
| Japanese expert 1.5B | [Link](https://huggingface.co/llm-jp/Dense-btx-japanese-expert-1.5B) |
| English expert 1.5B | [Link](https://huggingface.co/llm-jp/Dense-btx-english-expert-1.5B) |
| Code expert 1.5B | [Link](https://huggingface.co/llm-jp/Dense-btx-code-expert-1.5B) |
|
## How to cite
|
If you find our work helpful, please cite our paper:
|
| | ``` |
| | @inproceedings{ |
| | nakamura2025dropupcycling, |
| | title={Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization}, |
| | author={Taishi Nakamura and Takuya Akiba and Kazuki Fujii and Yusuke Oda and Rio Yokota and Jun Suzuki}, |
| | booktitle={The Thirteenth International Conference on Learning Representations}, |
| | year={2025}, |
| | url={https://openreview.net/forum?id=gx1wHnf5Vp} |
| | } |
| | ``` |