### Inference with [SGLang](https://github.com/sgl-project/sglang)

#### Speculative Decoding

For accelerated inference with speculative decoding, follow these steps:

##### 1. Download MiniCPM4.1 Draft Model

First, download the MiniCPM4.1 draft model:

```bash
cd /your_path
git clone https://huggingface.co/openbmb/MiniCPM4.1-8B-Eagle3
```

##### 2. Install EAGLE3-Compatible SGLang

The EAGLE3 adaptation PR has been submitted upstream; until it is merged, install from our repository:

```bash
git clone https://github.com/LDLINGLINGLING/sglang.git
cd sglang
pip install -e .
```

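If the editable install succeeded, Python should import the forked package from your checkout rather than a PyPI release. A quick sanity check (a sketch; it assumes the package exposes `__version__`, as current SGLang releases do):

```python
import sglang

# Both lines should point at the forked checkout, not a PyPI install.
print(sglang.__version__)
print(sglang.__file__)  # expected: a path inside your cloned sglang directory
```
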
##### 3. Launch SGLang Server with Speculative Decoding

Start the SGLang server with speculative decoding enabled:

```bash
python -m sglang.launch_server \
    --model-path "openbmb/MiniCPM4.1-8B" \
    --host "127.0.0.1" \
    --port 30002 \
    --mem-fraction-static 0.9 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path "your/path/MiniCPM4_1-8B-Eagle3-bf16" \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 32 \
    --temperature 0.7
```

##### 4. Client Usage

The client usage remains the same for both standard and speculative decoding:

```python
import openai

client = openai.Client(base_url="http://localhost:30002/v1", api_key="None")

response = client.chat.completions.create(
    model="openbmb/MiniCPM4.1-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.6,
    max_tokens=32768,
)

print(response.choices[0].message.content)
```

Note: Make sure the port number in the client code matches the server port (30002 in the speculative decoding example).

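The same endpoint can also be consumed token by token. A minimal streaming sketch, assuming SGLang's OpenAI-compatible server honors `stream=True` (port and model name match the example above):

```python
import openai

client = openai.Client(base_url="http://localhost:30002/v1", api_key="None")

# stream=True yields incremental chunks instead of one final message.
stream = client.chat.completions.create(
    model="openbmb/MiniCPM4.1-8B",
    messages=[{"role": "user", "content": "Write an article about Artificial Intelligence."}],
    temperature=0.6,
    max_tokens=32768,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
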
##### Configuration Parameters

- `--speculative-algorithm EAGLE3`: Enables EAGLE3 speculative decoding
- `--speculative-draft-model-path`: Path to the draft model for speculation
- `--speculative-num-steps`: Number of speculative steps (default: 3)
- `--speculative-eagle-topk`: Top-k parameter for EAGLE (default: 1)
- `--speculative-num-draft-tokens`: Number of draft tokens (default: 32)
- `--mem-fraction-static`: Memory fraction for static allocation (default: 0.9)

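To check what these settings buy you in practice, you can time the same request against a server started with and without speculative decoding. A rough sketch (it assumes the standard server from the next subsection on port 30000, the speculative server on port 30002, and that the server reports token usage in the response):

```python
import time

import openai

def tokens_per_second(port: int, prompt: str) -> float:
    """Completion tokens per wall-clock second for a single request."""
    client = openai.Client(base_url=f"http://localhost:{port}/v1", api_key="None")
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="openbmb/MiniCPM4.1-8B",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,
        max_tokens=1024,
    )
    elapsed = time.perf_counter() - start
    return response.usage.completion_tokens / elapsed

prompt = "Write an article about Artificial Intelligence."
print(f"standard (30000):    {tokens_per_second(30000, prompt):.1f} tok/s")
print(f"speculative (30002): {tokens_per_second(30002, prompt):.1f} tok/s")
```
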
#### Standard Inference (Without Speculative Decoding)

For now, you need to install our forked version of SGLang.

```bash
git clone -b openbmb https://github.com/OpenBMB/sglang.git
cd sglang
pip install -e "python[all]"
```

You can start the inference server by running the following command:

```bash
python -m sglang.launch_server --model openbmb/MiniCPM4.1-8B --trust-remote-code --port 30000 --chat-template chatml
```

Then you can use the chat interface by running the following command:

```python
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="openbmb/MiniCPM4.1-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.6,
    max_tokens=32768,
)

print(response.choices[0].message.content)
```
 
### Inference with [vLLM](https://github.com/vllm-project/vllm)

#### Speculative Decoding

For accelerated inference with speculative decoding using vLLM, follow these steps:

##### 1. Download MiniCPM4.1 Draft Model

First, download the MiniCPM4.1 draft model:

```bash
cd /your_path
git clone https://huggingface.co/openbmb/MiniCPM4.1-8B-Eagle3
```

##### 2. Install EAGLE3-Compatible vLLM

The EAGLE3 adaptation PR for vLLM has been submitted upstream; until it is merged, install from our repository:

```bash
git clone https://github.com/LDLINGLINGLING/vllm.git
cd vllm
pip install -e .
```

##### 3. Launch vLLM Server with Speculative Decoding

Start the vLLM inference server with speculative decoding enabled. Make sure the model path in `--speculative-config` points to your downloaded `MiniCPM4_1-8B-Eagle3-bf16` folder:

```bash
VLLM_USE_V1=1 \
vllm serve openbmb/MiniCPM4.1-8B \
    --seed 42 \
    --trust-remote-code \
    --speculative-config '{
        "model": "your/path/MiniCPM4_1-8B-Eagle3-bf16",
        "num_speculative_tokens": 3,
        "method": "eagle3",
        "draft_tensor_parallel_size": 1
    }'
```

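Before sending chat requests, you can confirm the server is up by listing the models it serves. A small sketch using the OpenAI-compatible `/v1/models` endpoint (vLLM serves on port 8000 by default):

```python
import openai

client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Should print the served model id, e.g. "openbmb/MiniCPM4.1-8B".
print([m.id for m in client.models.list().data])
```
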
##### 4. Client Usage Example

The client usage remains the same for both standard and speculative decoding:

```python
import openai

client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM4.1-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.6,
    max_tokens=32768,
    extra_body=dict(add_special_tokens=True),  # ensures special tokens are added for the chat template
)

print(response.choices[0].message.content)
```

##### vLLM Configuration Parameters

- `VLLM_USE_V1=1`: Enables the vLLM v1 engine
- `--speculative-config`: JSON configuration for speculative decoding
  - `model`: Path to the draft model for speculation
  - `num_speculative_tokens`: Number of speculative tokens (default: 3)
  - `method`: Speculative decoding method (`eagle3`)
  - `draft_tensor_parallel_size`: Tensor parallel size for the draft model (default: 1)
- `--seed`: Random seed for reproducibility
- `--trust-remote-code`: Allow execution of remote code for custom models

#### Standard Inference (Without Speculative Decoding)

For now, you need to install the latest version of vLLM.

```bash
pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly
```

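Once installed, you can smoke-test the model without standing up a server by using vLLM's offline `LLM` API. A minimal sketch (it assumes a recent vLLM with the `LLM.chat` interface; `trust_remote_code=True` mirrors the `--trust-remote-code` server flag):

```python
from vllm import LLM, SamplingParams

# Load MiniCPM4.1-8B for offline generation.
llm = LLM(model="openbmb/MiniCPM4.1-8B", trust_remote_code=True)
sampling = SamplingParams(temperature=0.6, max_tokens=1024)

# chat() applies the model's chat template before generating.
outputs = llm.chat(
    [{"role": "user", "content": "Write an article about Artificial Intelligence."}],
    sampling,
)
print(outputs[0].outputs[0].text)
```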