Update README.md
README.md (changed)
Front matter: the commit adds `library_name: transformers` after the existing tags.

```yaml
tags:
- diffusion
- llm
- text_generation
library_name: transformers
---
```

# LLaDA-MoE

---
## ⚡ Infra

### 1. We highly recommend generating with [dInfer](https://github.com/inclusionAI/dInfer) (1,000+ tokens/s)
<p align="center">
  <img src="https://raw.githubusercontent.com/inclusionAI/dInfer/refs/heads/master/assets/dinfer_tps.png" alt="dInfer v0.1 speedup" width="600">
  <br>
  <b>Figure</b>: Generation speed of dInfer v0.1
</p>
On HumanEval, dInfer achieves over 1,100 TPS at batch size 1, and averages more than 800 TPS across six benchmarks on a single node with 8 H800 GPUs.
#### Install dInfer

```bash
git clone https://github.com/inclusionAI/dInfer.git
cd dInfer
pip install .
```
#### Convert to FusedMoE

Use the conversion tool to fuse the experts:
```bash
# From repo root
python tools/transfer.py \
  --input /path/to/LLaDA-MoE-7B-A1B-Instruct \
  --output /path/to/LLaDA-MoE-7B-A1B-Instruct-fused
```
#### Use the model in dInfer
```python
import torch
from transformers import AutoTokenizer

from dinfer.model import AutoModelForCausalLM
from dinfer.model import FusedOlmoeForCausalLM
from dinfer import BlockIteratorFactory, KVCacheFactory
from dinfer import ThresholdParallelDecoder, BlockWiseDiffusionLLM

device = 'cuda'
gen_len, block_len = 512, 32  # example generation length and diffusion block length

m = "/path/to/LLaDA-MoE-7B-A1B-Instruct-fused"
tok = AutoTokenizer.from_pretrained(m, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(m, trust_remote_code=True, torch_dtype="bfloat16").to(device)

decoder = ThresholdParallelDecoder(0, threshold=0.9)
dllm = BlockWiseDiffusionLLM(model, decoder, BlockIteratorFactory(True), cache_factory=KVCacheFactory('dual'))

prompt = "Lily can run 12 kilometers per hour for 4 hours. After that, she can run 6 kilometers per hour. How many kilometers can she run in 8 hours?"
input_ids = tok(prompt)['input_ids']
input_ids = torch.tensor(input_ids).to(device).unsqueeze(0)
res = dllm.generate(input_ids, gen_length=gen_len, block_length=block_len)
# Decode the continuation (assumes generate returns token ids, as in the transformers example below)
print(tok.batch_decode(res[:, input_ids.shape[1]:], skip_special_tokens=True))
```
### 2. No speedup: transformers

Make sure you have `transformers` and its dependencies installed:
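Before running the snippet below, a quick import check can confirm the environment is ready (a sketch only; it assumes `torch` and `transformers` are already installed, e.g. via `pip install torch transformers`):

```python
# Environment sanity check: verifies that torch and transformers import,
# without downloading any model weights.
import torch
import transformers

print("transformers", transformers.__version__)
print("cuda available:", torch.cuda.is_available())  # False on CPU-only machines
```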
```python
import torch
from transformers import AutoModel, AutoTokenizer

device = 'cuda'
# ... (setup unchanged in this commit and elided in the diff) ...
model = AutoModel.from_pretrained('inclusionAI/LLaDA-MoE-7B-A1B-Instruct', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('inclusionAI/LLaDA-MoE-7B-A1B-Instruct', trust_remote_code=True)

prompt = "Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 8 hours?"
m = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": prompt}
]
prompt = tokenizer.apply_chat_template(m, add_generation_prompt=True, tokenize=False)

input_ids = tokenizer(prompt)['input_ids']
input_ids = torch.tensor(input_ids).to(device).unsqueeze(0)

# ... (generation call unchanged in this commit and elided in the diff; it produces `text`) ...
print(tokenizer.batch_decode(text[:, input_ids.shape[1]:], skip_special_tokens=False))
```
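As a quick sanity check on whatever the model prints, the sample prompt has a closed-form answer: 4 hours at 12 km/h, then the remaining 8 - 4 = 4 hours at 6 km/h.

```python
# Expected answer for the sample prompt (independent of the model)
fast = 12 * 4       # first 4 hours at 12 km/h -> 48 km
slow = 6 * (8 - 4)  # remaining 4 hours at 6 km/h -> 24 km
print(fast + slow)  # 72
```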
## 📚 Citation

If you find [LLaDA-MoE](https://arxiv.org/abs/2509.24389) useful in your research or applications, please cite our paper:

```bibtex
@article{zhu2025llada,
  title={LLaDA-MoE: A Sparse MoE Diffusion Language Model},
  author={Fengqi Zhu and Zebin You and Yipeng Xing and Zenan Huang and Lin Liu and Yihong Zhuang and Guoshan Lu and Kangyu Wang and Xudong Wang and Lanning Wei and Hongrui Guo and Jiaqi Hu and Wentao Ye and Tieyuan Chen and Chenchen Li and Chengfu Tang and Haibo Feng and Jun Hu and Jun Zhou and Xiaolu Zhang and Zhenzhong Lan and Junbo Zhao and Da Zheng and Chongxuan Li and Jianguo Li and Ji-Rong Wen},
  journal={arXiv preprint arXiv:2509.24389},
  year={2025}
}
```
---