### Inference with [SGLang](https://github.com/sgl-project/sglang)

#### Speculative Decoding

For accelerated inference with speculative decoding, follow these steps:

##### 1. Download MiniCPM4.1 Draft Model

First, download the MiniCPM4.1 draft model:

```bash
cd /your_path
git clone https://huggingface.co/openbmb/MiniCPM4.1-8B-Eagle3
```
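If you prefer not to pull large model files over `git`, the sketch below is an alternative download using the `huggingface_hub` Python package (an assumption on our part; this README does not otherwise require it). The `local_dir` value is a placeholder:

```python
# Alternative download sketch, assuming `pip install huggingface_hub`.
# Equivalent in effect to the git clone above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openbmb/MiniCPM4.1-8B-Eagle3",
    local_dir="/your_path/MiniCPM4.1-8B-Eagle3",  # placeholder, match your clone target
)
```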
##### 2. Install EAGLE3-Compatible SGLang

The EAGLE3 adaptation PR has been submitted. For now, use our repository for installation:

```bash
git clone https://github.com/LDLINGLINGLING/sglang.git
cd sglang
pip install -e .
```
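A quick sanity check that the editable install is the one Python resolves (assuming the package exposes `__version__`, which recent SGLang releases do):

```python
# The printed file path should point inside your sglang clone.
import sglang

print(sglang.__version__)
print(sglang.__file__)
```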
##### 3. Launch SGLang Server with Speculative Decoding

Start the SGLang server with speculative decoding enabled:

```bash
python -m sglang.launch_server \
    --model-path "openbmb/MiniCPM4.1-8B" \
    --host "127.0.0.1" \
    --port 30002 \
    --mem-fraction-static 0.9 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path "your/path/MiniCPM4_1-8B-Eagle3-bf16" \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 32 \
    --temperature 0.7
```
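Before sending real traffic, you can poll the OpenAI-compatible endpoint to confirm the server has finished loading. A minimal probe using the same `openai` client as the examples below (the port matches the launch command above):

```python
# Readiness probe: list served models, retrying while the server warms up.
import time

import openai

client = openai.Client(base_url="http://localhost:30002/v1", api_key="None")

for attempt in range(10):
    try:
        models = client.models.list()
        print("Server ready, serving:", [m.id for m in models.data])
        break
    except Exception as exc:  # connection refused while the server is starting
        print(f"Attempt {attempt + 1}: not ready yet ({exc})")
        time.sleep(5)
```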
##### 4. Client Usage

The client usage remains the same for both standard and speculative decoding:

```python
import openai

client = openai.Client(base_url="http://localhost:30002/v1", api_key="None")

response = client.chat.completions.create(
    model="openbmb/MiniCPM4.1-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.6,
    max_tokens=32768,
)

print(response.choices[0].message.content)
```

Note: Make sure the port number in the client code matches the server port (30002 in the speculative decoding example).
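For interactive use, the standard OpenAI streaming interface works unchanged against this server. A minimal streaming variant of the example above:

```python
# Streaming variant: print tokens as they arrive.
import openai

client = openai.Client(base_url="http://localhost:30002/v1", api_key="None")

stream = client.chat.completions.create(
    model="openbmb/MiniCPM4.1-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.6,
    max_tokens=32768,
    stream=True,
)

for chunk in stream:
    # Some chunks (e.g. the final usage chunk) may carry no choices.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```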
##### Configuration Parameters

- `--speculative-algorithm EAGLE3`: Enables EAGLE3 speculative decoding
- `--speculative-draft-model-path`: Path to the draft model for speculation
- `--speculative-num-steps`: Number of speculative steps (default: 3)
- `--speculative-eagle-topk`: Top-k parameter for EAGLE (default: 1)
- `--speculative-num-draft-tokens`: Number of draft tokens (default: 32)
- `--mem-fraction-static`: Memory fraction for static allocation (default: 0.9)
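Whether these defaults pay off depends on your workload, since draft-token acceptance rates vary with prompt style. A rough way to compare is to time the same request against a server launched with and without the speculative flags; a simple sketch (our measurement code, not part of SGLang):

```python
# Rough decode-throughput check: completion tokens per wall-clock second.
# Run once against the speculative server, once against a standard one.
import time

import openai

client = openai.Client(base_url="http://localhost:30002/v1", api_key="None")

start = time.perf_counter()
response = client.chat.completions.create(
    model="openbmb/MiniCPM4.1-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.6,
    max_tokens=1024,
)
elapsed = time.perf_counter() - start

tokens = response.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s ({tokens / elapsed:.1f} tok/s)")
```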
#### Standard Inference (Without Speculative Decoding)

For now, you need to install our forked version of SGLang.

```bash
git clone -b openbmb https://github.com/OpenBMB/sglang.git
cd sglang
pip install -e "python[all]"
```

You can start the inference server by running the following command:

```bash
python -m sglang.launch_server --model openbmb/MiniCPM4.1-8B --trust-remote-code --port 30000 --chat-template chatml
```

Then you can use the chat interface by running the following command:

```python
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="openbmb/MiniCPM4.1-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.6,
    max_tokens=32768,
)

print(response.choices[0].message.content)
```
### Inference with [vLLM](https://github.com/vllm-project/vllm)

#### Speculative Decoding

For accelerated inference with speculative decoding using vLLM, follow these steps:

##### 1. Download MiniCPM4.1 Draft Model

First, download the MiniCPM4.1 draft model (you can skip this step if you already downloaded it for SGLang above):

```bash
cd /your_path
git clone https://huggingface.co/openbmb/MiniCPM4.1-8B-Eagle3
```
##### 2. Install EAGLE3-Compatible vLLM

The EAGLE3 vLLM PR has been submitted. For now, use our repository for installation:

```bash
git clone https://github.com/LDLINGLINGLING/vllm.git
cd vllm
pip install -e .
```
##### 3. Launch vLLM Server with Speculative Decoding

Start the vLLM inference server with speculative decoding enabled. Make sure to update the model path in the `--speculative-config` to point to your downloaded `MiniCPM4_1-8B-Eagle3-bf16` folder:

```bash
VLLM_USE_V1=1 \
vllm serve openbmb/MiniCPM4.1-8B \
    --seed 42 \
    --trust-remote-code \
    --speculative-config '{
        "model": "your/path/MiniCPM4_1-8B-Eagle3-bf16",
        "num_speculative_tokens": 3,
        "method": "eagle3",
        "draft_tensor_parallel_size": 1
    }'
```
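Inline JSON inside shell quotes is easy to get wrong. One option is to generate the argument from Python; the sketch below (a convenience of ours, not a vLLM interface) prints a shell-ready string to paste into the `vllm serve` command above:

```python
# Build the --speculative-config value programmatically to avoid quoting errors.
import json
import shlex

spec_config = {
    "model": "your/path/MiniCPM4_1-8B-Eagle3-bf16",  # placeholder draft model path
    "num_speculative_tokens": 3,
    "method": "eagle3",
    "draft_tensor_parallel_size": 1,
}

print("--speculative-config", shlex.quote(json.dumps(spec_config)))
```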
##### 4. Client Usage Example

The client usage remains the same for both standard and speculative decoding:

```python
import openai

client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM4.1-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.6,
    max_tokens=32768,
    extra_body=dict(add_special_tokens=True),  # Ensures special tokens are added for chat template
)

print(response.choices[0].message.content)
```
##### vLLM Configuration Parameters

- `VLLM_USE_V1=1`: Enables the vLLM V1 engine
- `--speculative-config`: JSON configuration for speculative decoding
  - `model`: Path to the draft model for speculation
  - `num_speculative_tokens`: Number of speculative tokens (default: 3)
  - `method`: Speculative decoding method (`eagle3`)
  - `draft_tensor_parallel_size`: Tensor parallel size for the draft model (default: 1)
- `--seed`: Random seed for reproducibility
- `--trust-remote-code`: Allow execution of remote code for custom models
#### Standard Inference (Without Speculative Decoding)

For now, you need to install the latest version of vLLM.

```bash
pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly
```
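To confirm the nightly wheel was picked up rather than an older release, a quick check (nightly builds typically report a `dev` version suffix):

```python
# Nightly wheels report a development version string, e.g. "0.x.y.devNNN".
import vllm

print(vllm.__version__)
```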