Ulov888 committed 88acb1f (verified) · Parent(s): b9f9353

Update README.md

Files changed (1): README.md (+68 −7)

README.md CHANGED
@@ -5,6 +5,7 @@ tags:
 - diffusion
 - llm
 - text_generation
 ---
 # LLaDA-MoE
 
@@ -64,15 +65,63 @@ tags:
 
 ---
 
- ## ⚡ Quickstart
 
- Make sure you have `transformers` and its dependencies installed:
 
 ```bash
- pip install transformers torch
 ```
 
- You can then load the model using the AutoModelForCausalLM and AutoTokenizer classes:
 
 ```python
 import torch
@@ -163,6 +212,11 @@ model = AutoModel.from_pretrained('inclusionAI/LLaDA-MoE-7B-A1B-Instruct', trust
 tokenizer = AutoTokenizer.from_pretrained('inclusionAI/LLaDA-MoE-7B-A1B-Instruct', trust_remote_code=True)
 
 prompt = "Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 8 hours?"
 
 input_ids = tokenizer(prompt)['input_ids']
 input_ids = torch.tensor(input_ids).to(device).unsqueeze(0)
@@ -176,10 +230,17 @@ print(tokenizer.batch_decode(text[:, input_ids.shape[1]:], skip_special_tokens=F
 ```
 
 
- ## 📚 Citation (Coming Soon)
 
- We are preparing the technical report and citation information.
- Stay tuned — citation details will be available soon.
 
 
 ---
 
 
 
@@ -5,6 +5,7 @@ tags:
 - diffusion
 - llm
 - text_generation
+ library_name: transformers
 ---
 # LLaDA-MoE
 
 
@@ -64,15 +65,63 @@ tags:
 
 ---
 
+ ## ⚡ Infra
+ ### 1. We highly recommend generating with [dInfer](https://github.com/inclusionAI/dInfer) (1,000+ tokens/s)
 
+ <p align="center">
+ <img src="https://raw.githubusercontent.com/inclusionAI/dInfer/refs/heads/master/assets/dinfer_tps.png" alt="dInfer v0.1 speedup" width="600">
+ <br>
+ <b>Figure</b>: Generation speed of dInfer v0.1
+ </p>
+
+ On HumanEval, dInfer achieves over 1,100 TPS (tokens per second) at batch size 1, and averages more than 800 TPS across six benchmarks on a single node with 8 H800 GPUs.
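A TPS figure like the one above can be reproduced with a small timing harness. The sketch below is illustrative only: `measure_tps` and `fake_generate` are hypothetical names standing in for a real `dllm.generate(...)` call.

```python
import time

def measure_tps(generate_fn, n_new_tokens):
    """Time one call to generate_fn and return tokens per second."""
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed

# Stand-in for a real dllm.generate(...) call: sleeps ~10 ms per "token".
def fake_generate(n_tokens=64, per_token_s=0.01):
    time.sleep(n_tokens * per_token_s)

tps = measure_tps(fake_generate, 64)
print(f"{tps:.0f} TPS")  # roughly 100 TPS for this stand-in
```

In practice you would warm up the model first and average over several runs, since the first call pays one-time compilation and cache-allocation costs.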
+ #### Install dInfer
+
+ ```bash
+ git clone https://github.com/inclusionAI/dInfer.git
+ cd dInfer
+ pip install .
+ ```
+
+ #### Convert to FusedMoE
+
+ Use the conversion tool to fuse the experts.
 
 ```bash
+ # From repo root
+ python tools/transfer.py \
+   --input /path/to/LLaDA-MoE-7B-A1B-Instruct \
+   --output /path/to/LLaDA-MoE-7B-A1B-Instruct-fused
+ ```
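"Fusing" MoE experts generally means stacking each expert's weight matrix into one batched tensor, so a single grouped matmul replaces a Python loop over experts. A toy NumPy sketch of that idea (illustrative only, not dInfer's actual checkpoint layout):

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_in, d_out = 4, 8, 16
experts = [rng.standard_normal((d_in, d_out)) for _ in range(n_experts)]
x = rng.standard_normal((n_experts, d_in))  # one token routed to each expert

# Unfused: a Python loop, one matmul per expert.
y_loop = np.stack([x[e] @ experts[e] for e in range(n_experts)])

# Fused: stack weights into one (n_experts, d_in, d_out) tensor,
# then do a single batched contraction.
fused = np.stack(experts)
y_fused = np.einsum('ei,eio->eo', x, fused)

assert np.allclose(y_loop, y_fused)
```

The fused layout is what lets a GPU kernel process all experts in one launch, which is where much of the speedup comes from.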
+
+ #### Use the model in dInfer
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer
+
+ from dinfer.model import AutoModelForCausalLM
+ from dinfer.model import FusedOlmoeForCausalLM
+ from dinfer import BlockIteratorFactory, KVCacheFactory
+ from dinfer import ThresholdParallelDecoder, BlockWiseDiffusionLLM
+
+ device = 'cuda'
+ m = "/path/to/LLaDA-MoE-7B-A1B-Instruct-fused"
+ tok = AutoTokenizer.from_pretrained(m, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(m, trust_remote_code=True, torch_dtype="bfloat16").to(device)
+
+ decoder = ThresholdParallelDecoder(0, threshold=0.9)
+ dllm = BlockWiseDiffusionLLM(model, decoder, BlockIteratorFactory(True), cache_factory=KVCacheFactory('dual'))
+
+ prompt = "Lily can run 12 kilometers per hour for 4 hours. After that, she can run 6 kilometers per hour. How many kilometers can she run in 8 hours?"
+ input_ids = tok(prompt)['input_ids']
+ input_ids = torch.tensor(input_ids).to(device).unsqueeze(0)
+ # Choose generation and block lengths for your use case, e.g.:
+ gen_len, block_len = 512, 32
+ res = dllm.generate(input_ids, gen_length=gen_len, block_length=block_len)
+ ```
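The name `ThresholdParallelDecoder(0, threshold=0.9)` suggests an acceptance rule in which, at each diffusion step, every still-masked position whose top token probability clears the threshold is committed in parallel. A toy sketch of that rule (our reading of the decoder, not dInfer internals):

```python
import numpy as np

def threshold_accept(probs, threshold=0.9):
    """probs: (block_len, vocab) per-position distributions over a masked block.
    Return (accepted positions, their argmax token ids); the rest stay masked."""
    conf = probs.max(axis=-1)
    accept = conf >= threshold
    return np.flatnonzero(accept), probs.argmax(axis=-1)[accept]

probs = np.array([
    [0.95, 0.03, 0.02],   # confident -> committed this step
    [0.40, 0.35, 0.25],   # uncertain -> stays masked for a later step
    [0.05, 0.92, 0.03],   # confident -> committed this step
])
pos, toks = threshold_accept(probs)
print(pos.tolist(), toks.tolist())  # [0, 2] [0, 1]
```

Committing several confident positions per step, rather than one token per forward pass, is what makes diffusion-style decoding fast relative to autoregressive generation.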
 
+ ### 2. Baseline (no speedup): transformers
+
+ Make sure you have `transformers` and its dependencies installed (`pip install transformers torch`), then load the model:
 
 ```python
 import torch
 
@@ -163,6 +212,11 @@
 tokenizer = AutoTokenizer.from_pretrained('inclusionAI/LLaDA-MoE-7B-A1B-Instruct', trust_remote_code=True)
 
 prompt = "Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 8 hours?"
+ m = [
+     {"role": "system", "content": "You are a helpful AI assistant."},
+     {"role": "user", "content": prompt}
+ ]
+ prompt = tokenizer.apply_chat_template(m, add_generation_prompt=True, tokenize=False)
 
 input_ids = tokenizer(prompt)['input_ids']
 input_ids = torch.tensor(input_ids).to(device).unsqueeze(0)
 
@@ -176,10 +230,17 @@
 ```
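As a sanity check on generations, the word problem used in the prompts has a closed-form answer: 12 km/h for the first 4 hours, then 6 km/h for the remaining 4.

```python
# 12 km/h for the first 4 hours, then 6 km/h for the remaining 4 hours.
first_leg = 12 * 4
second_leg = 6 * (8 - 4)
total_km = first_leg + second_leg
print(total_km)  # 72
```

A correct completion should therefore arrive at 72 kilometers.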
 
 
+ ## 📚 Citation
 
+ If you find [LLaDA-MoE](https://arxiv.org/abs/2509.24389) useful in your research or applications, please cite our paper:
+ ```bibtex
+ @article{zhu2025llada,
+   title={LLaDA-MoE: A Sparse MoE Diffusion Language Model},
+   author={Fengqi Zhu and Zebin You and Yipeng Xing and Zenan Huang and Lin Liu and Yihong Zhuang and Guoshan Lu and Kangyu Wang and Xudong Wang and Lanning Wei and Hongrui Guo and Jiaqi Hu and Wentao Ye and Tieyuan Chen and Chenchen Li and Chengfu Tang and Haibo Feng and Jun Hu and Jun Zhou and Xiaolu Zhang and Zhenzhong Lan and Junbo Zhao and Da Zheng and Chongxuan Li and Jianguo Li and Ji-Rong Wen},
+   journal={arXiv preprint arXiv:2509.24389},
+   year={2025}
+ }
+ ```
 
 ---