### Inference with [SGLang](https://github.com/sgl-project/sglang)

#### Speculative Decoding

For accelerated inference with speculative decoding, follow these steps:

##### 1. Download MiniCPM4.1 Draft Model

First, download the MiniCPM4.1 draft model:

```bash
cd /your_path
git clone https://huggingface.co/openbmb/MiniCPM4.1-8B-Eagle3
```
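If you prefer not to pull large model files over `git`, the sketch below is an alternative download using the `huggingface_hub` Python package (an assumption on our part; this README does not otherwise require it). The `local_dir` value is a placeholder:

```python
# Alternative download sketch, assuming `pip install huggingface_hub`.
# Equivalent in effect to the git clone above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openbmb/MiniCPM4.1-8B-Eagle3",
    local_dir="/your_path/MiniCPM4.1-8B-Eagle3",  # placeholder, match your clone target
)
```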
##### 2. Install EAGLE3-Compatible SGLang

The EAGLE3 adaptation PR has been submitted. For now, use our repository for installation:

```bash
git clone https://github.com/LDLINGLINGLING/sglang.git
cd sglang
pip install -e .
```
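A quick sanity check that the editable install is the one Python resolves (assuming the package exposes `__version__`, which recent SGLang releases do):

```python
# The printed file path should point inside your sglang clone.
import sglang

print(sglang.__version__)
print(sglang.__file__)
```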
##### 3. Launch SGLang Server with Speculative Decoding

Start the SGLang server with speculative decoding enabled:

```bash
python -m sglang.launch_server \
    --model-path "openbmb/MiniCPM4.1-8B" \
    --host "127.0.0.1" \
    --port 30002 \
    --mem-fraction-static 0.9 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path "your/path/MiniCPM4_1-8B-Eagle3-bf16" \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 32 \
    --temperature 0.7
```
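Before sending real traffic, you can poll the OpenAI-compatible endpoint to confirm the server has finished loading. A minimal probe using the same `openai` client as the examples below (the port matches the launch command above):

```python
# Readiness probe: list served models, retrying while the server warms up.
import time

import openai

client = openai.Client(base_url="http://localhost:30002/v1", api_key="None")

for attempt in range(10):
    try:
        models = client.models.list()
        print("Server ready, serving:", [m.id for m in models.data])
        break
    except Exception as exc:  # connection refused while the server is starting
        print(f"Attempt {attempt + 1}: not ready yet ({exc})")
        time.sleep(5)
```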
##### 4. Client Usage

The client usage remains the same for both standard and speculative decoding:

```python
import openai

client = openai.Client(base_url="http://localhost:30002/v1", api_key="None")

response = client.chat.completions.create(
    model="openbmb/MiniCPM4.1-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.6,
    max_tokens=32768,
)

print(response.choices[0].message.content)
```

Note: Make sure the port number in the client code matches the server port (30002 in the speculative decoding example).
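For interactive use, the standard OpenAI streaming interface works unchanged against this server. A minimal streaming variant of the example above:

```python
# Streaming variant: print tokens as they arrive.
import openai

client = openai.Client(base_url="http://localhost:30002/v1", api_key="None")

stream = client.chat.completions.create(
    model="openbmb/MiniCPM4.1-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.6,
    max_tokens=32768,
    stream=True,
)

for chunk in stream:
    # Some chunks (e.g. the final usage chunk) may carry no choices.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```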
##### Configuration Parameters

- `--speculative-algorithm EAGLE3`: Enables EAGLE3 speculative decoding
- `--speculative-draft-model-path`: Path to the draft model for speculation
- `--speculative-num-steps`: Number of speculative steps (default: 3)
- `--speculative-eagle-topk`: Top-k parameter for EAGLE (default: 1)
- `--speculative-num-draft-tokens`: Number of draft tokens (default: 32)
- `--mem-fraction-static`: Memory fraction for static allocation (default: 0.9)
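Whether these defaults pay off depends on your workload, since draft-token acceptance rates vary with prompt style. A rough way to compare is to time the same request against a server launched with and without the speculative flags; a simple sketch (our measurement code, not part of SGLang):

```python
# Rough decode-throughput check: completion tokens per wall-clock second.
# Run once against the speculative server, once against a standard one.
import time

import openai

client = openai.Client(base_url="http://localhost:30002/v1", api_key="None")

start = time.perf_counter()
response = client.chat.completions.create(
    model="openbmb/MiniCPM4.1-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.6,
    max_tokens=1024,
)
elapsed = time.perf_counter() - start

tokens = response.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s ({tokens / elapsed:.1f} tok/s)")
```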
#### Standard Inference (Without Speculative Decoding)

For now, you need to install our forked version of SGLang.

```bash
git clone -b openbmb https://github.com/OpenBMB/sglang.git
cd sglang
pip install -e "python[all]"
```

You can start the inference server by running the following command:

```bash
python -m sglang.launch_server --model openbmb/MiniCPM4.1-8B --trust-remote-code --port 30000 --chat-template chatml
```

Then you can use the chat interface by running the following command:

```python
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="openbmb/MiniCPM4.1-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.6,
    max_tokens=32768,
)

print(response.choices[0].message.content)
```
### Inference with [vLLM](https://github.com/vllm-project/vllm)

#### Speculative Decoding

For accelerated inference with speculative decoding using vLLM, follow these steps:

##### 1. Download MiniCPM4.1 Draft Model

First, download the MiniCPM4.1 draft model (you can skip this step if you already downloaded it for SGLang above):

```bash
cd /your_path
git clone https://huggingface.co/openbmb/MiniCPM4.1-8B-Eagle3
```
##### 2. Install EAGLE3-Compatible vLLM

The EAGLE3 vLLM PR has been submitted. For now, use our repository for installation:

```bash
git clone https://github.com/LDLINGLINGLING/vllm.git
cd vllm
pip install -e .
```
##### 3. Launch vLLM Server with Speculative Decoding

Start the vLLM inference server with speculative decoding enabled. Make sure to update the model path in the `--speculative-config` to point to your downloaded `MiniCPM4_1-8B-Eagle3-bf16` folder:

```bash
VLLM_USE_V1=1 \
vllm serve openbmb/MiniCPM4.1-8B \
    --seed 42 \
    --trust-remote-code \
    --speculative-config '{
        "model": "your/path/MiniCPM4_1-8B-Eagle3-bf16",
        "num_speculative_tokens": 3,
        "method": "eagle3",
        "draft_tensor_parallel_size": 1
    }'
```
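Inline JSON inside shell quotes is easy to get wrong. One option is to generate the argument from Python; the sketch below (a convenience of ours, not a vLLM interface) prints a shell-ready string to paste into the `vllm serve` command above:

```python
# Build the --speculative-config value programmatically to avoid quoting errors.
import json
import shlex

spec_config = {
    "model": "your/path/MiniCPM4_1-8B-Eagle3-bf16",  # placeholder draft model path
    "num_speculative_tokens": 3,
    "method": "eagle3",
    "draft_tensor_parallel_size": 1,
}

print("--speculative-config", shlex.quote(json.dumps(spec_config)))
```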
##### 4. Client Usage Example

The client usage remains the same for both standard and speculative decoding:

```python
import openai

client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM4.1-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.6,
    max_tokens=32768,
    extra_body=dict(add_special_tokens=True),  # Ensures special tokens are added for chat template
)

print(response.choices[0].message.content)
```
##### vLLM Configuration Parameters

- `VLLM_USE_V1=1`: Enables the vLLM V1 engine
- `--speculative-config`: JSON configuration for speculative decoding
  - `model`: Path to the draft model for speculation
  - `num_speculative_tokens`: Number of speculative tokens (default: 3)
  - `method`: Speculative decoding method (`eagle3`)
  - `draft_tensor_parallel_size`: Tensor parallel size for the draft model (default: 1)
- `--seed`: Random seed for reproducibility
- `--trust-remote-code`: Allow execution of remote code for custom models
#### Standard Inference (Without Speculative Decoding)

For now, you need to install the latest version of vLLM.

```bash
pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly
```
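To confirm the nightly wheel was picked up rather than an older release, a quick check (nightly builds typically report a `dev` version suffix):

```python
# Nightly wheels report a development version string, e.g. "0.x.y.devNNN".
import vllm

print(vllm.__version__)
```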