update readme for better presentation
README.md CHANGED

@@ -20,11 +20,15 @@ library_name: transformers

</p>

## What's New
-- [2025.09.05] **MiniCPM4.1** series are released! This series is a hybrid reasoning model, which can be used in both deep reasoning mode and non-reasoning mode. 🔥🔥🔥
- [2025.06.06] **MiniCPM4** series are released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://arxiv.org/abs/2506.07900). 🔥🔥🔥

-## MiniCPM4
- [MiniCPM4.1-8B](https://huggingface.co/openbmb/MiniCPM4.1-8B): The latest version of MiniCPM4, with 8B parameters, supports fusion thinking. (**<-- you are here**)
- [MiniCPM4.1-8B-GPTQ](https://huggingface.co/openbmb/MiniCPM4.1-8B-GPTQ): MiniCPM4.1-8B in GPTQ format.
- [MiniCPM4.1-8B-AutoAWQ](https://huggingface.co/openbmb/MiniCPM4.1-8B-AutoAWQ): MiniCPM4.1-8B in AutoAWQ format.

@@ -48,89 +52,29 @@ MiniCPM4 and MiniCPM4.1 series are highly efficient large language models (LLMs)

- [**MiniCPM4-MCP**](https://huggingface.co/openbmb/MiniCPM4-MCP): Integrates MCP tools to autonomously satisfy user requirements
</details>

-## Introduction
-MiniCPM4 and MiniCPM4.1 are extremely efficient edge-side large models that have undergone efficient optimization across four dimensions: model architecture, learning algorithms, training data, and inference systems, achieving ultimate efficiency improvements.
-
-- 🏗️ **Efficient Model Architecture:**
-  - InfLLM v2 -- Trainable Sparse Attention Mechanism: Adopts a trainable sparse attention architecture in which each token only needs to compute relevance with less than 5% of tokens when processing 128K long texts, significantly reducing computational overhead for long texts
-
-- 🧠 **Efficient Learning Algorithms:**
-  - Model Wind Tunnel 2.0 -- Efficient Predictable Scaling: Introduces scaling prediction methods for downstream task performance, enabling more precise searches over model training configurations
-  - BitCPM -- Ultimate Ternary Quantization: Compresses model parameter bit-width to 3 values, achieving a 90% extreme reduction in model bit-width
-  - Efficient Training Engineering Optimization: Adopts FP8 low-precision computing combined with a multi-token prediction training strategy
-
-- 📚 **High-Quality Training Data:**
-  - UltraClean -- High-quality Pre-training Data Filtering and Generation: Builds iterative data cleaning strategies based on efficient data verification, open-sourcing the high-quality Chinese and English pre-training dataset [UltraFineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb)
-  - UltraChat v2 -- High-quality Supervised Fine-tuning Data Generation: Constructs large-scale, high-quality supervised fine-tuning datasets covering knowledge-intensive, reasoning-intensive, instruction-following, long-text-understanding, and tool-calling data
-
-- ⚡ **Efficient Inference System:**
-  - CPM.cu -- Lightweight and Efficient CUDA Inference Framework: Integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding
-  - ArkInfer -- Cross-platform Deployment System: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities
-
-## Usage
-
-### Inference with [CPM.cu](https://github.com/OpenBMB/cpm.cu)
-
-We recommend using [CPM.cu](https://github.com/OpenBMB/cpm.cu) for inference with MiniCPM4 and MiniCPM4.1. CPM.cu is a CUDA inference framework developed by OpenBMB that integrates efficient sparse attention, speculative sampling, and quantization techniques, fully leveraging the efficiency advantages of MiniCPM4 and MiniCPM4.1.
-
-You can install CPM.cu by running the following commands:

```bash
git clone https://github.com/OpenBMB/cpm.cu.git --recursive
cd cpm.cu
python3 setup.py install
```

-MiniCPM4.1 natively supports context lengths of up to 65,536 (64K) tokens. To reproduce the long-text acceleration effect from the paper, we recommend using the LongRoPE factors that have been validated. Change the `rope_scaling` field in `config.json` as follows to enable LongRoPE:

```json
{
...,
"rope_scaling": {
"rope_type": "longrope",
"long_factor": [0.9982316082870437, 1.033048153422584, 1.0749920956484724, 1.1255096879436193, 1.1863348602111476, 1.259543828902579, 1.3476188888731149, 1.4535223827776373, 1.5807816745852985, 1.7335856049489526, 1.9168922912975785, 2.1365471404135326, 2.3994084200118646, 2.713475511863602, 3.0880118452194134, 3.533650295140154, 4.062463396503134, 4.687974098908333, 5.425075306704039, 6.289818967956352, 7.29902962722721, 8.6357018163639, 10.210822723989212, 12.053807765671676, 14.193944598909404, 16.65780676784363, 19.463620727694074, 22.628311203524586, 26.150106147261315, 30.02526691405111, 34.23183327975347, 38.73811934094828, 43.502489489729555, 48.47627117965394, 53.61139491762471, 58.857366522037935, 64.16798299215064, 69.51359464319125, 74.86555458220285, 80.21497790341579, 85.55322183307433, 90.89611806932027, 96.26245306514224, 101.68269304046481, 107.18619510219668, 112.82253283014026, 118.63764063163615, 119.88866203644656, 120.9462882391725, 121.837565139014, 122.58663780572562, 123.2147719894291, 123.74049454862576, 124.17980424685767, 124.54641761955492, 124.85202548028222, 125.10654406389756, 125.31835105170659, 125.49450117164764, 125.64091910903052, 125.76256945356558, 125.86360463815589, 125.94749252260765, 126.01712561287873],
"short_factor": [0.9982316082870437, 1.033048153422584, 1.0749920956484724, 1.1255096879436193, 1.1863348602111476, 1.259543828902579, 1.3476188888731149, 1.4535223827776373, 1.5807816745852985, 1.7335856049489526, 1.9168922912975785, 2.1365471404135326, 2.3994084200118646, 2.713475511863602, 3.0880118452194134, 3.533650295140154, 4.062463396503134, 4.687974098908333, 5.425075306704039, 6.289818967956352, 7.29902962722721, 8.6357018163639, 10.210822723989212, 12.053807765671676, 14.193944598909404, 16.65780676784363, 19.463620727694074, 22.628311203524586, 26.150106147261315, 30.02526691405111, 34.23183327975347, 38.73811934094828, 43.502489489729555, 48.47627117965394, 53.61139491762471, 58.857366522037935, 64.16798299215064, 69.51359464319125, 74.86555458220285, 80.21497790341579, 85.55322183307433, 90.89611806932027, 96.26245306514224, 101.68269304046481, 107.18619510219668, 112.82253283014026, 118.63764063163615, 119.88866203644656, 120.9462882391725, 121.837565139014, 122.58663780572562, 123.2147719894291, 123.74049454862576, 124.17980424685767, 124.54641761955492, 124.85202548028222, 125.10654406389756, 125.31835105170659, 125.49450117164764, 125.64091910903052, 125.76256945356558, 125.86360463815589, 125.94749252260765, 126.01712561287873],
"original_max_position_embeddings": 65536
}
}
```

-After modification, you can run the following command to reproduce the long-context acceleration effect (the script will automatically download the model weights from Hugging Face):

```bash
python3 tests/test_generate.py
```

-Run the following command to perform inference with the EAGLE3 speculative decoding algorithm:

```bash
python3 -m cpmcu.cli \
    --model-path $BASE_MODEL_PATH \
    --draft-model-path $EAGLE3_DRAFT_MODEL_PATH \
    --prompt-text "Write an article about Artificial Intelligence." \
    --use-eagle3 true
```

-For more details about CPM.cu, please refer to [the CPM.cu repo](https://github.com/OpenBMB/cpm.cu).

-### Hybrid Reasoning Mode
-
-MiniCPM4.1 supports a hybrid reasoning mode and can run in both deep reasoning mode and non-reasoning mode. Set `enable_thinking=True` in `tokenizer.apply_chat_template` to enable reasoning mode, or `enable_thinking=False` to enable non-reasoning mode. Alternatively, append `/no_think` to the end of the query to enable non-reasoning mode; if no special token is appended, or `/think` is appended, the model runs in reasoning mode.

```python
# Enable reasoning mode
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
# Enable non-reasoning mode
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
```

### Inference with Transformers
```python

@@ -226,6 +170,8 @@ You can apply the LongRoPE factor modification by modifying the model files. Spe

### Inference with [SGLang](https://github.com/sgl-project/sglang)

#### Speculative Decoding

For accelerated inference with speculative decoding, follow these steps:

@@ -246,7 +192,7 @@ The EAGLE3 adaptation PR has been submitted. For now, use our repository for ins

```bash
git clone https://github.com/LDLINGLINGLING/sglang.git
cd sglang
-pip install -e
```

##### 3. Launch SGLang Server with Speculative Decoding

@@ -337,6 +283,7 @@ print(response.choices[0].message.content)

```

### Inference with [vLLM](https://github.com/vllm-project/vllm)

#### Speculative Decoding


@@ -344,7 +291,7 @@ For accelerated inference with speculative decoding using vLLM, follow these ste

##### 1. Download MiniCPM4.1 Draft Model

-First, download the MiniCPM4.1 draft model

```bash
cd /your_path

@@ -450,7 +397,7 @@ Also, you can start the inference server by running the following command:

> **Note**: In vLLM's chat API, `add_special_tokens` is `False` by default. This means important special tokens—such as the beginning-of-sequence (BOS) token—will not be added automatically. To ensure the input prompt is correctly formatted for the model, you should explicitly set `extra_body={"add_special_tokens": True}`.

```bash
-vllm serve openbmb/MiniCPM4.1-8B
```

Then you can use the chat interface by running the following code:

@@ -474,24 +421,69 @@ response = client.chat.completions.create(

print(response.choices[0].message.content)
```

-## Evaluation Results
-On two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 demonstrates significantly faster processing speed than similar-size models in long-text processing tasks. As text length increases, MiniCPM4's efficiency advantage becomes more pronounced. On the Jetson AGX Orin platform, MiniCPM4 achieves an approximately 7x decoding speed improvement over Qwen3-8B.
-MiniCPM4.1

## Statement
- As a language model, MiniCPM generates content by learning from a vast amount of text.

@@ -20,11 +20,15 @@ library_name: transformers

</p>

## What's New
+- [2025.09.05] **MiniCPM4.1** series are released! This series is a hybrid reasoning model with trainable sparse attention, which can be used in both deep reasoning mode and non-reasoning mode. 🔥🔥🔥
- [2025.06.06] **MiniCPM4** series are released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://arxiv.org/abs/2506.07900). 🔥🔥🔥

+## Highlights
+MiniCPM4.1 is highlighted by the following features:
+✅ Strong Reasoning Capability: Surpasses similar-sized models on 15 tasks!
+✅ Fast Generation: 3x decoding speedup for reasoning
+✅ Efficient Architecture: Trainable sparse attention and frequency-ranked speculative decoding
+
- [MiniCPM4.1-8B](https://huggingface.co/openbmb/MiniCPM4.1-8B): The latest version of MiniCPM4, with 8B parameters, supports fusion thinking. (**<-- you are here**)
- [MiniCPM4.1-8B-GPTQ](https://huggingface.co/openbmb/MiniCPM4.1-8B-GPTQ): MiniCPM4.1-8B in GPTQ format.
- [MiniCPM4.1-8B-AutoAWQ](https://huggingface.co/openbmb/MiniCPM4.1-8B-AutoAWQ): MiniCPM4.1-8B in AutoAWQ format.

@@ -48,89 +52,29 @@ MiniCPM4 and MiniCPM4.1 series are highly efficient large language models (LLMs)

- [**MiniCPM4-MCP**](https://huggingface.co/openbmb/MiniCPM4-MCP): Integrates MCP tools to autonomously satisfy user requirements
</details>

+## Evaluation Results
+
+### Performance Evaluation
+MiniCPM4.1 launches an end-side version at the 8B parameter scale, achieving best-in-class performance in its category.
+
+
+
+### Efficiency Evaluation
+MiniCPM4.1 adopts sparse attention and speculative decoding to improve inference efficiency. On an RTX 4090, MiniCPM4.1 achieves a 3x decoding speed improvement in reasoning.
+
+
+
+#### Examples
+<div align="center">
+<a href="https://www.youtube.com/watch?v=VouXjLHKDUY"><img src="https://img.youtube.com/vi/VouXjLHKDUY/0.jpg" width="70%"></a>
+</div>
+
+## Usage
+MiniCPM4.1 can be used with the following frameworks: Hugging Face Transformers, SGLang, vLLM, and CPM.cu. For the best inference speed, we highly recommend CPM.cu.

### Inference with Transformers
```python

@@ -226,6 +170,8 @@ You can apply the LongRoPE factor modification by modifying the model files. Spe

### Inference with [SGLang](https://github.com/sgl-project/sglang)

+You can run inference with SGLang in either standard mode or speculative decoding mode.
+
#### Speculative Decoding

For accelerated inference with speculative decoding, follow these steps:

@@ -246,7 +192,7 @@ The EAGLE3 adaptation PR has been submitted. For now, use our repository for ins

```bash
git clone https://github.com/LDLINGLINGLING/sglang.git
cd sglang
+pip install -e "python[all]"
```

##### 3. Launch SGLang Server with Speculative Decoding

@@ -337,6 +283,7 @@ print(response.choices[0].message.content)

```

### Inference with [vLLM](https://github.com/vllm-project/vllm)
+You can run inference with vLLM in either standard mode or speculative decoding mode.

#### Speculative Decoding


@@ -344,7 +291,7 @@ For accelerated inference with speculative decoding using vLLM, follow these ste

##### 1. Download MiniCPM4.1 Draft Model

+First, download the MiniCPM4.1 draft model and change the `architectures` field in its `config.json` to `LlamaForCausalLM` (a scripted version of this edit is sketched below).
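Not part of the original card, but the `config.json` edit described above can be scripted; a minimal sketch, assuming the draft checkpoint has already been downloaded to a local directory (the path is a placeholder):

```python
# Sketch: point cfg_path at the downloaded draft checkpoint (placeholder path).
import json
from pathlib import Path

cfg_path = Path("/your_path/minicpm4.1-draft/config.json")  # hypothetical location
cfg = json.loads(cfg_path.read_text())
cfg["architectures"] = ["LlamaForCausalLM"]  # the edit described above
cfg_path.write_text(json.dumps(cfg, indent=2))
```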

```bash
cd /your_path

@@ -450,7 +397,7 @@ Also, you can start the inference server by running the following command:

> **Note**: In vLLM's chat API, `add_special_tokens` is `False` by default. This means important special tokens—such as the beginning-of-sequence (BOS) token—will not be added automatically. To ensure the input prompt is correctly formatted for the model, you should explicitly set `extra_body={"add_special_tokens": True}`.

```bash
+vllm serve openbmb/MiniCPM4.1-8B --trust-remote-code
```

Then you can use the chat interface by running the following code:
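The chat example referred to here sits in an unchanged region that the diff does not reproduce (only its final line surfaces as context in the next hunk). As a rough sketch of such a call with the `add_special_tokens` workaround from the note above, where the server address, API key, and prompt are assumptions rather than the card's actual example:

```python
# Sketch: OpenAI-compatible chat call against the vLLM server launched above.
# extra_body asks vLLM to prepend the BOS token, per the note in this hunk.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server
response = client.chat.completions.create(
    model="openbmb/MiniCPM4.1-8B",
    messages=[{"role": "user", "content": "Write an article about Artificial Intelligence."}],
    extra_body={"add_special_tokens": True},
)
print(response.choices[0].message.content)
```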

@@ -474,24 +421,69 @@ response = client.chat.completions.create(

print(response.choices[0].message.content)
```

+### Inference with [CPM.cu](https://github.com/OpenBMB/cpm.cu)
+
+We recommend using [CPM.cu](https://github.com/OpenBMB/cpm.cu) for inference with MiniCPM4 and MiniCPM4.1. CPM.cu is a CUDA inference framework developed by OpenBMB that integrates efficient sparse attention, speculative sampling, and quantization techniques, fully leveraging the efficiency advantages of MiniCPM4 and MiniCPM4.1.
+
+You can install CPM.cu by running the following commands:
+
```bash
git clone https://github.com/OpenBMB/cpm.cu.git --recursive
cd cpm.cu
python3 setup.py install
```
+
+MiniCPM4.1 natively supports context lengths of up to 65,536 (64K) tokens. To reproduce the long-text acceleration effect from the paper, we recommend using the LongRoPE factors that have been validated. Change the `rope_scaling` field in `config.json` as follows to enable LongRoPE:
+
```json
{
...,
"rope_scaling": {
"rope_type": "longrope",
"long_factor": [0.9982316082870437, 1.033048153422584, 1.0749920956484724, 1.1255096879436193, 1.1863348602111476, 1.259543828902579, 1.3476188888731149, 1.4535223827776373, 1.5807816745852985, 1.7335856049489526, 1.9168922912975785, 2.1365471404135326, 2.3994084200118646, 2.713475511863602, 3.0880118452194134, 3.533650295140154, 4.062463396503134, 4.687974098908333, 5.425075306704039, 6.289818967956352, 7.29902962722721, 8.6357018163639, 10.210822723989212, 12.053807765671676, 14.193944598909404, 16.65780676784363, 19.463620727694074, 22.628311203524586, 26.150106147261315, 30.02526691405111, 34.23183327975347, 38.73811934094828, 43.502489489729555, 48.47627117965394, 53.61139491762471, 58.857366522037935, 64.16798299215064, 69.51359464319125, 74.86555458220285, 80.21497790341579, 85.55322183307433, 90.89611806932027, 96.26245306514224, 101.68269304046481, 107.18619510219668, 112.82253283014026, 118.63764063163615, 119.88866203644656, 120.9462882391725, 121.837565139014, 122.58663780572562, 123.2147719894291, 123.74049454862576, 124.17980424685767, 124.54641761955492, 124.85202548028222, 125.10654406389756, 125.31835105170659, 125.49450117164764, 125.64091910903052, 125.76256945356558, 125.86360463815589, 125.94749252260765, 126.01712561287873],
"short_factor": [0.9982316082870437, 1.033048153422584, 1.0749920956484724, 1.1255096879436193, 1.1863348602111476, 1.259543828902579, 1.3476188888731149, 1.4535223827776373, 1.5807816745852985, 1.7335856049489526, 1.9168922912975785, 2.1365471404135326, 2.3994084200118646, 2.713475511863602, 3.0880118452194134, 3.533650295140154, 4.062463396503134, 4.687974098908333, 5.425075306704039, 6.289818967956352, 7.29902962722721, 8.6357018163639, 10.210822723989212, 12.053807765671676, 14.193944598909404, 16.65780676784363, 19.463620727694074, 22.628311203524586, 26.150106147261315, 30.02526691405111, 34.23183327975347, 38.73811934094828, 43.502489489729555, 48.47627117965394, 53.61139491762471, 58.857366522037935, 64.16798299215064, 69.51359464319125, 74.86555458220285, 80.21497790341579, 85.55322183307433, 90.89611806932027, 96.26245306514224, 101.68269304046481, 107.18619510219668, 112.82253283014026, 118.63764063163615, 119.88866203644656, 120.9462882391725, 121.837565139014, 122.58663780572562, 123.2147719894291, 123.74049454862576, 124.17980424685767, 124.54641761955492, 124.85202548028222, 125.10654406389756, 125.31835105170659, 125.49450117164764, 125.64091910903052, 125.76256945356558, 125.86360463815589, 125.94749252260765, 126.01712561287873],
"original_max_position_embeddings": 65536
}
}
```
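A quick sanity check, not in the original card: after editing `config.json`, you can confirm the patched fields load as expected (the checkpoint path is a placeholder):

```python
# Sketch: confirm the patched config now declares LongRoPE scaling.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("/your_path/MiniCPM4.1-8B", trust_remote_code=True)
print(config.rope_scaling["rope_type"])                         # expected: "longrope"
print(config.rope_scaling["original_max_position_embeddings"])  # expected: 65536
```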

+After modification, you can run the following command to reproduce the long-context acceleration effect (the script will automatically download the model weights from Hugging Face):
+
```bash
python3 tests/test_generate.py
```
+
+Run the following command to perform inference with the EAGLE3 speculative decoding algorithm, where `$BASE_MODEL_PATH` and `$EAGLE3_DRAFT_MODEL_PATH` point to the base and draft model checkpoints:
+
```bash
python3 -m cpmcu.cli \
    --model-path $BASE_MODEL_PATH \
    --draft-model-path $EAGLE3_DRAFT_MODEL_PATH \
    --prompt-text "Write an article about Artificial Intelligence." \
    --use-eagle3 true
```
+
+For more details about CPM.cu, please refer to [the CPM.cu repo](https://github.com/OpenBMB/cpm.cu).
+
+### Hybrid Reasoning Mode
+
+MiniCPM4.1 supports a hybrid reasoning mode and can run in both deep reasoning mode and non-reasoning mode. Set `enable_thinking=True` in `tokenizer.apply_chat_template` to enable reasoning mode, or `enable_thinking=False` to enable non-reasoning mode. Alternatively, append `/no_think` to the end of the query to enable non-reasoning mode; if no special token is appended, or `/think` is appended, the model runs in reasoning mode.
+
```python
# Enable reasoning mode
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
# Enable non-reasoning mode
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
```
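As a small illustration of the `/no_think` query suffix described above (not in the original card; it reuses the same `tokenizer` and `messages` conventions as the block just shown):

```python
# Sketch: switch modes from the query text instead of the template flag.
# Appending /no_think disables reasoning; /think (or no suffix) enables it.
messages = [
    {"role": "user", "content": "Write an article about Artificial Intelligence. /no_think"}
]
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
```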

## Statement
- As a language model, MiniCPM generates content by learning from a vast amount of text.